JP4143487B2

JP4143487B2 - Time-series information control system and method, and time-series information control program

Info

Publication number: JP4143487B2
Application number: JP2003188313A
Authority: JP
Inventors: 政秀蟻生
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-06-30
Filing date: 2003-06-30
Publication date: 2008-09-03
Anticipated expiration: 2023-06-30
Also published as: JP2005024736A

Description

【０００１】
【発明の属する技術分野】
本発明は、制御命令及び制御対象の双方が時系列情報である制御系に好適な時系列情報制御システム及びその方法並びに時系列情報制御プログラムに関する。
【０００２】
【従来の技術】
近年、制御システムにおいては、制御対象に対する制御命令を音声によって行うものが開発されている。
【０００３】
例えば、特許文献１においては、音声付き動画データ中の音声のレベルや動画データの解析結果に基づいて、動画情報において必要なフレーム区間を割当てる例が開示されている。例えば、撮影対象人物が正面を向いているかを判定することで、撮影対象者がカメラの方を向いて発声しているフレーム区間を検出するのである。
【０００４】
また、特許文献２においては、動画の編集点を動画情報に含まれる音声で制御する発明が開示されている。即ち、この発明では、指定した動画編集点に対して、動画内の音声又は画像の変化点（シーンの変化点、又は無音部との境界）で編集を行うように、編集点を自動補正するものである。
【０００５】
【特許文献１】
特開２００２−１７６６１９号公報
【０００６】
【特許文献２】
特開平１１−２３９３２０号公報
【０００７】
【特許文献３】
特開２００２−２７８７４７号公報
【０００８】
【特許文献４】
特開平１１−１０９４９８号公報
【０００９】
【特許文献５】
特開２００２−３４２０６５号公報
【００１０】
【発明が解決しようとする課題】
このように、従来、時系列情報を扱う種々の制御システムが開示されている。上記特許文献１，２においては、音声という時系列情報で、映像や音声等の時系列情報を制御している。しかし、文献１，２は、音声等の時系列情報に含まれる内容が制御命令として用いられるのではなく、制御対象の時系列的な特徴を利用して制御を行うものである。制御する側の時系列情報に含まれる内容に応じた高度な制御を行うことは不可能である。
【００１１】
これに対し、音声等の時系列情報に含まれる内容に応じた制御を行うものとして、特許文献３〜５に開示されたものがある。即ち、特許文献３においては、ゲームの操作を音声で行う技術が開示されており、特許文献４には、カメラの制御を音声で行う技術が開示されている。また、特許文献５においては、対話形式で制御を行う発明が開示されている。特許文献５の発明では、予め決められた各制御段階が発声によって順番に実行される。
【００１２】
これらの特許文献３〜５は、制御側と制御対象とはいずれも時系列情報である。つまり、時系列情報である音声による命令の内容を音声認識技術によって認識し、認識結果に基づいて、時系列情報である音声及び映像等の制御を行うものである。
【００１３】
ところで、制御対象が時系列情報、即ち、時間と共に変化する情報（以下、被制御時系列情報ともいう）である場合には、制御時間を適切に制御できないときには制御の有効性が低下することもある。しかしながら、制御側も音声等の時系列情報である場合には、制御対象に対する処理期間を適切に制御することができない。
【００１４】
例えば、声によって制御対象に命令を伝達する場合には、先ず、音声が自然言語処理され、コンピュータによって処理可能な情報となって音声の意味が制御命令として解釈され、更に、制御対象において処理可能な信号に変換された後、制御対象に供給される。従って、人間が実際に命令を実行しようとした時点と、時系列情報に基づく制御命令が制御対象に伝達されて実行される時点とは異なり、従来のままでは適切なタイミングでは制御することはできない。
【００１５】
本発明は、制御命令とその制御対象とがいずれも時系列情報である場合において、両者の時間的な関連付けを可能にすることによって、制御対象に対する処理を施す系列上の位置を制御可能にして制御の有効性を向上させることができる時系列情報制御システム及びその方法並びに時系列情報制御プログラムを提供することを目的とする。
【００１６】
【課題を解決するための手段】
本発明に係る時系列情報制御システムは、制御命令を含む時系列情報である制御命令情報を取り込む入力手段と、前記入力手段が取込んだ前記制御命令情報から前記制御命令を抽出する制御命令抽出手段と、前記入力手段が取込んだ前記制御命令情報から時系列特徴量を抽出する特徴量抽出手段と、前記入力手段が取込んだ前記制御命令情報に含まれる前記制御命令の発生時間と前記制御命令の対象となる被制御時系列情報に対する前記制御命令に応じた処理を施す系列上の位置との間の時間的な関連を、前記制御命令情報の時系列特徴量の各時点と前記被制御時系列情報の時系列特徴量の各時点との比較によって取得する関連付け手段と、前記被制御時系列情報に対する処理が可能な処理手段を制御するものであって、前記制御命令情報から抽出された制御命令に応じて、前記被制御時系列情報に対する処理が可能な前記処理手段による前記被制御時系列情報に対する処理の処理内容を決定すると共に、前記制御命令情報から抽出された制御命令の時系列特徴量と前記関連付け手段が取得した時間的な関連とに基づいて、前記制御命令の発生時間と前記処理を施す系列上の位置との差を無くしながら、前記被制御時系列情報に対する処理の処理内容に応じて処理を施す系列上の位置を決定する演算手段とを具備したことを特徴とするものである。
【００１７】
本発明において、入力手段によって取込まれた制御命令情報から、制御命令抽出手段によって制御命令が抽出される。また、特徴量抽出手段は、制御命令情報から時系列特徴量を抽出する。関連付け手段は、取込まれた制御命令情報と制御命令の対象となる被制御時系列情報との間の時間的な関連を取得する。演算手段は、制御命令情報から抽出された制御命令及び時系列特徴量と関連付け手段が取得した時間的な関連とに基づいて、処理手段を制御して、処理手段による被制御時系列情報に対する処理の処理内容及び処理を施す系列上の位置を決定する。これにより、制御命令が発生した時間又はタイミングに対して任意の時間関係を有する時間又はタイミングで、制御命令の内容に応じた処理内容で、被制御時系列情報に対する処理が行われる。
【００１８】
なお、装置に係る本発明は方法に係る発明としても成立する。
【００１９】
また、装置に係る本発明は、コンピュータに当該発明に相当する処理を実行させるためのプログラムとしても成立する。
【００２０】
【発明の実施の形態】
以下、図面を参照して本発明の実施の形態について詳細に説明する。図１は本発明の第１の実施の形態に係る時系列情報制御システムを示すブロック図である。
【００２１】
図１において、制御命令情報１０１は、ユーザが制御対象の機器に与える制御命令の情報で時系列的な特徴を有する。例えば、制御命令情報１０１としては、音声による命令や、身振りによる命令等の他に、ボタン操作のような単独の発現自体は瞬時でも、押下時間長や複数のボタンがある場合に押される順番に意味があるような命令も含まれる。
【００２２】
被制御時系列情報１０２は、制御対象の時系列情報である。ユーザは図示しない所定の機器に対する制御命令情報１０１を発することで、被制御時系列情報１０２に対する制御を行うことができるようになっている。被制御時系列情報１０２には、大別して、その情報の起源が機器外にある情報、例えば機器がビデオ編集装置である場合の編集対象の動画情報や、機器がラジオの場合の電波情報等の情報と、機器から発生される情報、例えば機器がコンピュータの場合の時系列をともなったシミュレーション結果情報や、対話機器である場合のシステムで生成された応答情報等の情報が考えられる。いずれにしても、被制御時系列情報１０２は、ユーザが制御対象とする情報であって、機器で制御可能な時系列情報を指す。
【００２３】
時系列情報受信部１０３は、制御命令情報１０１や、必要な場合には被制御時系列情報１０２を受信し、後段の処理に用いるための適当な信号処理を行う。例えば、時系列情報受信部１０３は、制御命令情報１０１がユーザの発声した音声の場合には、図示しないマイクロフォンで当該音声を受信して周波数分析等の信号処理を行って音声認識で用いられる形式に変換する。また例えば被制御時系列情報１０２が動画情報である場合は、ビデオテープやＤＶＤビデオ等のメディアから当該情報を取り出して後段の処理で用いられる形式に変換する。
【００２４】
記憶部１０４は、時系列情報受信部１０３で得られた時系列情報や、後述する演算部１０５や被制御時系列情報処理部１０６の一時作業領域や演算手続き内容を保存しておくための記憶領域である。
【００２５】
演算部１０５は、図１の各部の制御や必要となる処理計算を行う。また演算部１０５は、時系列情報受信部１０３で得られた制御命令情報１０１から制御命令を抽出することができる。例えば制御命令がユーザからの命令音声である場合には、音声認識を行って制御命令を抽出する。また演算部１０５は制御命令情報１０１から時系列の特徴量を抽出することもできる。
【００２６】
本実施の形態においては、演算部１０５は、制御命令情報１０１の時間軸と被制御時系列情報１０２との時間的な関連を求めるようになっている。例えば、演算部１０５は、制御命令情報１０１の時系列の特徴量を時間的な関連を得るための時間情報として用いる。なお時間情報の抽出については時系列情報受信部１０３で行ってもよい。また例えば、制御命令情報１０１及び被制御時系列情報１０２にタイムスタンプを付けておいたり、記憶部１０４には一定量の時系列情報がリング・バッファに保存することができて、その記憶位置から時間的な対応付けしたりすることもできる。
【００２７】
また、演算部１０５は、被制御時系列情報１０２の起源が機器から発生される時系列情報である場合にはその生成を行う。
【００２８】
また演算部１０５は被制御時系列情報１０２から時系列の特徴量を得るための制御を行うことができる。例えば、被制御時系列情報１０２は動画像情報である場合には当該動画のシーンの切れ目を判断したり、被制御時系列情報１０２が音声の場合には音量の情報を得ることよって、被制御時系列情報１０２の時間情報を得る。
【００２９】
被制御時系列情報処理部１０６は、制御命令とその制御命令の時間的な特徴量、場合によっては被制御時系列情報１０２の時系列特徴を上記の演算部１０５が考慮及び被制御時系列情報１０２の処理内容を判断した結果を踏まえ、被制御時系列情報１０２に対する処理を行う。
【００３０】
次に、このように構成された実施の形態の動作について説明する。
【００３１】
いま、例えば、所定の時間範囲における制御命令情報１０１が人間の音声による編集開始命令で、被制御時系列情報１０２がビデオ編集における編集対象の動画情報であるものとする。これらの情報１０１，１０２は時系列情報受信部１０３によって受信される。演算部１０５は、例えば、制御命令情報１０１を取込む図示しないマイクロフォンに設定されているタイマの値、及び被制御時系列情報１０２を処理する図示しないビデオ編集装置に設定されているタイマの値によって、制御処理の基準時間を得る。
【００３２】
ここで、オペレータが図示しないビデオ編集装置のモニタ画面によって再生映出中の動画の編集ポイントに到達したと判断したものとする。オペレータはマイクロフォンを介して音声により「編集点」と発声する。この制御命令情報１０１は時系列情報受信部１０３によって受信される。
【００３３】
演算部１０５は、制御命令情報１０１の時系列の特徴量を抽出しており、「編集点」を示す制御命令情報１０１の発声の開始タイミングを時系列の特徴量の変化から検出し、「編集点」を示す制御命令情報１０１の発声の終了タイミングを時系列の特徴量の変化から検出する。演算部１０５は、「編集点」の受信期間におけるマイクロフォン及びビデオ編集装置のタイマ値、制御命令情報１０１及び被制御時系列情報１０２の時間情報を取得して、「編集点」の受信期間の各時間と、被制御時系列情報１０２の時間情報との対応、即ち、制御命令情報１０１と被制御時系列情報１０２との時間的な関連を求める。
【００３４】
演算部１０５は「編集点」の発声から音声認識によってその制御命令の内容を判断し、この判断結果に基づいて被制御時系列情報処理部１０６を制御して、被制御時系列情報１０２に対する処理を行わせる。演算部１０５が制御命令の内容を判断する処理の時間等によって、被制御時系列情報処理部１０６がビデオ編集装置によって被制御時系列情報１０２の制御を開始する時点は、「編集点」が発せられた期間の終了後である。
【００３５】
しかし、演算部１０５が被制御時系列情報１０２に対して制御命令に基づく作用を及ぼす時点は、制御命令の内容によっては過去に及んでも構わない。そこで、本実施の形態においては、演算部１０５は、例えば、「編集点」の受信開始タイミングにおける被制御時系列情報１０２の時間情報のタイミング（系列上の位置）を、被制御時系列情報１０２である動画情報の編集点に設定し、このタイミングの前後で動画情報を編集させるようにする。
【００３６】
なお、演算部１０５は、検出した「編集点」中のいずれの時間タイミングに相当する被制御時系列情報１０２の時間情報のタイミングで編集を実施させるように制御してもよく、更に、「編集点」の受信期間と所定の関係を有するいずれの時間タイミングに相当する被制御時系列情報１０２の時間情報のタイミングで編集を実施させるように制御してもよい。
【００３７】
このように本実施の形態においては、制御命令の時系列情報と被制御時系列情報との時間的な関連を求め、求めた時間的な関連を用いて、制御命令の時系列情報の発生期間又はタイミングと所定の関係を有する被制御時系列情報の期間又はタイミングにおいて、被制御時系列情報に対して、制御命令の時系列情報の内容に基づく処理を実行させている。これにより、制御命令の内容に基づく効果的な処理を可能にして制御の有効性を向上させている。
【００３８】
なお、上記実施の形態は、コンピュータ上のプログラムモジュールによって各部の機能を実現することも可能である。
【００３９】
図２は本発明の第２の実施の形態を示すブロック図である。また、図３は第２の実施の形態の全体構成を示す説明図である。
【００４０】
本実施の形態は第１の実施の形態を具体的な制御系である映像記録再生装置に適用した例を示すものであり、図１中の制御命令情報１０１は図２のマイク入力３０２に相当し、被制御時系列情報１０２は、ビデオテープ３０８に記憶された情報に相当し、時系列情報受信部１０３は信号処理部３０３及びビデオテープ制御装置３０７に相当し、記憶部１０４はメモリ３０５に相当し、演算部１０５は中央演算部３０４に相当し、被制御時系列情報処理部１０６は映像信号処理部３０９に相当する。
【００４１】
先ず、図３を参照して制御系の全体構成について説明する。
【００４２】
ユーザ２０１はヘッドセットマイク２０２を装着する。ヘッドセットマイク２０２は、ユーザ２０１が発声した音声を音声信号に変換してビデオデッキ２０３に供給する。ビデオデッキ２０３は、音声認識機能を有しており、音声信号入力によって各部の操作が可能である。また、ビデオデッキ２０３は、所定の記録メディアに記録された映像情報を再生してディスプレイ２０４に映出することができるようになっている。
【００４３】
なお、図３では音声入力のためにヘッドセットマイク２０２を用いたが、ハンズフリーフォンのようにユーザから距離が離れる形態でもよいし、音声入力のための装置としては一般にユーザの発声を集音する技術を採用することができる。また、ビデオデッキ２０３の音声認識機能については、隠れマルコフモデル(HMM)を用いた単語音声認識のような既存の技術を用いてもよいし、ユーザの発声から決められた語彙に判別できるパターン認識技術で採用してもよい。
【００４４】
このような構成において、ユーザ２０１は、ビデオデッキ２０３を音声入力によって操作する。例えば、「記録」、「再生」、「特殊再生」、「停止」等の各種処理を音声によって発音することで、ビデオデッキ２０３を制御するのである。
【００４５】
図３はこのようなビデオデッキの一例を示すものである。
【００４６】
ビデオデッキ２０３は、ユーザ２０１が発声した音声をヘッドセットマイク２０２を介してマイク入力３０２として受け取る。マイク入力３０２はビデオデッキ２０３の信号処理部３０３に入力される。信号処理部３０３は、入力されたマイク入力３０２に基づく時系列信号に対して、中央演算部３０４の命令を受けて後段の処理に必要な形式への変換処理を施したり、特徴量を抽出したりするようになっている。信号処理部３０３は、例えば、ＤＳＰ(Digital Signal Processor)のような信号処理チップや、プログラム上の信号処理モジュールによって実現することができる。なお、信号処理部３０３は、時系列特徴量等に依存して構成することもできる。
【００４７】
中央演算部３０４はビデオデッキ２０３全体の制御を司る。信号処理部３０３からの信号を記憶領域たるメモリ３０５に保存したり、ビデオデッキのキー入力３０６を受け付けて該当する操作を行ったり、後述するビデオ関係の操作の制御を行ったりする。中央演算部３０４は、既存の制御チップと制御プログラムによる構成等が考えられる。メモリ３０５は、中央演算部３０４の処理やプログラムを保存し、マイク入力３０２から信号処理部３０３を介して得られた時系列信号を保存し、ビデオテープ３０８から得た情報を記憶する記憶装置の総称を指すものとする。メモリ３０５は、メモリチップ等によって構成することができる。
【００４８】
ビデオテープ制御装置３０７は、中央演算部３０４の制御を受けて、ビデオテープ３０８の読み出し位置を変えたり、ビデオテープ３０８を再生して動画情報を読み込むことができる。ビデオテープ制御装置３０７は、再生した動画情報を必要によって信号処理部３０３に出力すると共に、被制御時系列情報としての動画情報を映像信号処理部３０９に出力する。映像信号処理部３０９はビデオテープ制御装置３０７からの動画情報を受けて、ディスプレイ２０４において映出可能な形式の情報に変換し、映像出力３１０として出力する。これらのビデオテープ制御装置３０７、映像信号処理部３０９は、一般に商品化されているビデオデッキの該当部分によって実現可能である。
【００４９】
次に、このように構成された実施の形態の動作について図４及び図５を参照して説明する。図４は音声認識結果と制御内容との対応関係を示す概念図であり、図５は処理の流れを説明するためのフローチャートである。
【００５０】
ビデオデッキ２０３は制御命令として音声を取り込み、音声認識によってその制御内容を抽出することができる。抽出された命令語である音声認識結果と、それに対応する制御内容は、この実施例では図４の概念図に示す対応関係を有する。図４に示すように、例えばユーザの「再生」という命令発声に対して、ビデオデッキ２０３は、ビデオテープ３０８の停止位置から再生を開始する処理を実行する。
【００５１】
いま、ビデオデッキ２０３は既に再生状態であるものとする。ここで、図５のステップ５０１において、ユーザ２０１が制御命令を発声するものとする。そして、この発声内容が「スロー再生」であったものとする。時系列信号であるユーザ２０１の発声は、マイク入力３０２としてビデオデッキ２０３の信号処理部３０３に入力される。信号処理部３０３は、マイク入力３０２の時系列信号から発声が始まった時間を検出すると共に、音声認識に必要な特徴量への変換を行う（ステップ５０２）。この特徴量は例えばメル周波数ケプストラム係数（以下ＭＦＣＣ：Ｍｅｌ−ＦｒｅｑｕｅｎｃｙＣｅｐｓｔｒｕｍＣｏｅｆｆｉｃｉｅｎｔ）や知覚線形予測分析（以下ＰＬＰ：ＰｅｒｃｅｐｔｕａｌＬｉｎｅａｒＰｒｅｄｉｃｔｉｖｅＡｎａｌｙｓｉｓ）で得られるような音声認識特徴量でも実現可能なものとする。
【００５２】
またユーザの発声を検出する手段としては、入力音声の音量、入力音声波形の零交差、線形予測分析の係数やその残差など特徴量から判断するような、世の中で用いられている手法で実現可能なものとする。これらのような制御命令情報の時系列的特徴量はこれまで述べてきたように、被制御時系列情報との時間的な対応づけに用いることができる。ここでの実施例では発声の有無という２値を検出するという例で説明するが、用途によっては発声の判断に対する確からしさという形で連続値を音声検出の情報にしてもかまわない。
【００５３】
信号処理部３０３で得られた音声検出の情報や、認識に要する特徴量等の時系列的特徴量は、中央演算部３０４の制御のもと、必要な場合はメモリ３０５に記憶される。そして、中央演算部３０４によって、音声検出の時間に関する情報は被制御時系列情報であるビデオテープ３０８に記憶された動画再生情報の時間に対応付けられる一方、音声認識処理によって発声情報から制御命令が抽出される（ステップ５０３）。これにより、ユーザの制御命令が「スロー再生」であったことが分かる。
【００５４】
中央演算部３０４は、図４のような制御命令とそれに対する命令処理との対応関係に基づいて、制御命令の時系列情報から得られた音声検出の時間情報を用いて、"再生中のビデオ映像を制御命令が発声された時刻まで戻し、その時点からスロー再生を行う"という処理を行うことを決定する（ステップ５０４）。中央演算部３０４は決定した処理に必要な手続きを生成して各処理部を制御する。その結果、ビデオテープ制御装置３０７はユーザ２０１の発声が開始された時点の再生位置までビデオテープ３０８を巻戻し、スロー再生の読み取りを開始する。ビデオテープ制御装置３０７によって再生された映像信号は映像信号処理部３０９を介してディスプレイ２０４に供給されて、画面上に映出される（ステップ５０５）。
【００５５】
こうして、本実施の形態においては、ユーザが特殊な操作をすることなく、スロー再生を希望した再生位置から再生画像の再生を実行させることができる。
【００５６】
なお、誤認識が生じた場合に備えて、「キャンセル」と発声されたら前の命令の発声時点を参照して処理を決定する方法も考えられる。このような処理はアプリケーションに依存したものであり、この場合でも、制御命令の時系列的特徴量によって被制御時系列情報に対する処理期間（タイミング）を決定すると共に、制御命令の内容から被制御時系列情報に対する処理内容を決定する点は同様である。
【００５７】
図６は本発明の第３の実施の形態を示す説明図である。本実施の形態の構成は図２及び図３と同様であり、制御方法のみが異なる。本実施の形態は音声によってビデオデッキを操作する例に適用したものである。
【００５８】
本実施の形態は、ビデオテープ制御装置３０７（図３参照）から得た動画情報に対して動画のシーン検出を行う例についてのものである。図６はこの場合の制御命令と命令処理との対応関係を示している。図６に示すように、各制御命令に対してカテゴリが割り当てられている。図６の例では、各制御命令は「則実行」、「映像調整」及び「時点実行」の３カテゴリのいずれかに属する。
【００５９】
ユーザ２０１（図２参照）から「一時停止」という発声があるものとする。なお、ユーザ２０１が例えば「一時停止して下さい」のように、制御命令語として登録されている発声の前後に、不要な語が付加されている場合でも、既存のキーワードスポッティング技術によって、必要な部分の音声認識が可能である。
【００６０】
本実施の形態においても、基本的には図５のフローチャートに従って処理を実行する。本実施の形態においては、中央演算部３０４は、音声認識によって制御命令を取り出すと、図６に示す概念図から、制御命令のカテゴリを抽出する。なお、中央演算部３０４がカテゴリを判断する方法としては、例えばプログラム上で配列のテーブルを用意すること等によって実現することができ。
【００６１】
本実施の形態においては、制御命令の実行に際して、制御命令自体と抽出したカテゴリとが考慮される。ユーザ２０１による制御命令語「一時停止」に対するカテゴリは図６より「映像調整」である。制御命令を実行する場合にはこのカテゴリ用の処理を加味する。具体的には、先ず「一時停止」を発声したときの制御命令と被制御時系列情報との対応付けを参考にして、被制御時系列情報の中で「一時停止」と発声が開始された時点を注目する。「映像調整」というカテゴリの場合には、制御命令が発せられた時点の再生位置近傍に記録されている被制御時系列情報について、探索及びシーン検出を行う。そして、ユーザが「一時停止」と発声した時点に対応した再生位置周辺において、シーンの切り替わり位置を指定して、被制御時系列情報の再生を一時停止させる。これにより、ユーザ２０１は、注目したい場面を声によって直感的に指定することができ、また、映像が切り替わった後のような、ユーザが注目したかったであろう時点に自動的に制御を移行させることができる。またこのとき、制御命令の時点に注目したときに、そこに最も近いシーン変化位置という扱いや、その時点の周辺で最も大きくシーンが変っている位置という扱いや、その時点からの時間的な距離とシーン変化量からなる関数から最も適切な位置を決めるという扱いでも構わない。同様のことは、ここでの実施例に限らず、本発明の実施形態において制御命令情報と被制御時系列情報との時系列的な特徴と、制御命令内容から処理を判断する際にあてはまることである。
【００６２】
なお、以上説明したような制御命令のカテゴリについては最初から固定でもいいし、上位のアプリケーションによってそのカテゴリと処理との関連付けや内容の変更を行ってもよい。
【００６３】
また、シーン検出については、既存の変化点検出で動画像の中から取り出された特徴量情報から判断する仕組みがあれば実現可能である。また制御命令を発した時間近傍で被制御時系列情報の特徴量を探索する場合（図６の例では「一時停止」の発声タイミングで動画のシーン検出をする場合）には、制御命令の音声を検出した時点を基にある時間枠内において探索してもよいし、制御命令によってその時間範囲を変更してもよい。また、シーン検出に関しては画像認識の技術も利用して動画上の画像の変化を利用して、カテゴリごとの操作を行ってもよい。またシーン検出については動画の二次元情報だけでなく、動画に付随した情報、例えば音声情報を利用してもよい。
【００６４】
また、被制御時系列情報の任意の量または任意の時間、もしくは全体をメモリ等の記憶領域に保存しておき、その記憶領域上の情報を元にしてシーン検出を行うことも可能である。また、制御命令情報から抽出された制御命令と時系列特徴量に基づいて決められた処理期間及び処理内容で、前記記憶領域上の被制御時系列情報に対して処理を施して、その結果を出力することもできる。例えば、本実施の形態の例では、被制御時系列情報を逐次記憶領域に保存し、「スロー再生」の制御命令に対して、記憶領域上の被制御時系列情報で制御命令の開始時点に対応する時点（アドレス）から情報を読み出すことで、所望の位置からのスロー再生を行うことができる。
【００６５】
また、制御命令情報や被制御時系列情報から特徴量を抽出する場合にパラメータが介在してもよい。例えば、本実施の形態における制御命令情報である音声や、被制御時系列情報に付随する音声情報に対して雑音推定を行い、その推定された雑音情報をパラメータとして記憶しておき、音声情報から特徴量を得る際に雑音の影響を除去した後所定の特徴量を得るようにしてもよい。こうすることで、より精度の高い時系列情報の特徴量を得ることができ、これにより、制御命令に基づく処理期間を高精度に制御することができ、一層効果的な制御が可能となる。このようなパラメータの具体例としては、マイクロホンアレーで目的音と妨害音方向を推定するためのフィルタパラメータである、例えば文献（永田仁史ら,「話者追尾２チャネルマイクロホンアレーに関する検討」, 電子情報通信学会論文誌A, Vol. J８２-A No. ６ pp. ８６０-８６６, １９９９年６月）等に記載されたものや、スペクトル・サブトラクション法における推定雑音等（例えば、文献（サイード・ブイ・ヴァセッジ(Saeed V. Vaseghi) 著, 「アドバンスドディジタルシグナルプロセッシングアンドノイズリダクション(Advanced Digital Signal Processing and Noise Reduction)」, (英国), 第二版, ワイリー(WILEY), ２０００年９月)）に記載されたものが挙げられる。また、この場合のパラメータである推定雑音の強さから、被制御時系列情報の処理に影響を与えてもよい。例えば本実施の形態の例において、推定雑音の量が小さい場合には動画に付随した音声情報に基づくシーン検出結果を重視するようにする一方、推定雑音の量が大きい場合は動画の画像情報に基づくシーン検出結果を重視して、制御命令の処理期間（又はタイミング）を決定するようにしてもよい。
【００６６】
図７は本発明の第４の実施の形態を示す説明図である。図７において図２と同一の構成要素には同一符号を付して説明を省略する。
【００６７】
本実施の形態はビデオデッキ２０３に代えてビデオデッキ２０８を採用すると共に、ビデオカメラ２０５を付加した点が第２の実施の形態と異なる。ビデオカメラ２０５は、ユーザ２０１を撮像することができるようになっている。なお、ビデオカメラ２０５はユーザを自動で追尾する機能を有していてもよい。
【００６８】
ビデオデッキ２０８は、図３のビデオデッキ２０３と同一の構成を備えると共に、ビデオカメラ２０５からの映像信号を信号処理部３０３（図３参照）を介して取込んで、中央演算部３０４等によって所定の信号処理を実施することができようになっている。例えば、中央演算部３０４は、ビデオカメラ２０５が撮像したユーザ２０１の画像情報から、ユーザ２０１がビデオデッキ２０８の方向を向いているか否か等を判断することができるようになっている。例えば、中央演算部３０４は、入力画像中のユーザの眼球の白目の部分の割合等によって、ユーザの顔の向き等を推定する手法を採用する。
【００６９】
このように構成された実施の形態においては、ビデオデッキ２０８は、マイク入力３０２による制御命令の受信の他に、ビデオカメラ２０５からの画像情報による制御命令も受信することができる。上述した第２の実施の形態においては、ユーザ２０１からの制御命令の発声に対し、命令を音声認識してその制御命令の内容によって音響的な特徴量の１つである発声の開始時点まで被制御時系列情報の再生位置を戻した。これに対し、本実施の形態の場合には、ユーザ２０１の顔の向き情報も処理内容決定に反映させる。
【００７０】
即ち、中央演算部３０４は、ヘッドセットマイク２０２からの音声入力については、第２の実施の形態と同様に制御命令の抽出と時系列特徴量に基づく制御命令の発声区間の検出に用いる。一方、中央演算部３０４は、ビデオカメラ２０５からの情報については、ユーザ２０１がビデオカメラ２０５（又はビデオデッキ２０８）の方を向いていれば処理の実行、向いていなければ処理を行わないという２値の制御命令として用いる。
【００７１】
この場合には、ビデオカメラ２０５からの情報の時系列特徴量は、ユーザ２０１の顔の向きに対応する。中央演算部３０４は、ビデオカメラ２０５の出力に基づく時系列特徴量に対して、ユーザ２０１の顔の向きがある範囲内の角度であるか否かによってビデオカメラ２０５に向いているか否かを判断する２値判断を行ってもよく、向きを角度としたときの連続量を求めてもよい。
【００７２】
こうすることで、マイク入力３０２からの音声情報と、ビデオカメラ２０５からの画像情報との夫々に対して制御命令と時系列特徴量とを定めることができる。これにより、例えば、ヘッドセットマイク２０２からの入力による制御命令があって、且つビデオカメラ２０５からの入力による制御命令が処理の実行を意味する場合に、初めて実際の処理を行うように制御することができる。
【００７３】
このように本実施の形態においては、制御命令情報が複数の情報源からの時系列情報であった場合には、それぞれの制御命令情報の内容を総合して被制御時系列情報に対する処理を決定することによって、一層高精度で効果的な制御が可能となる。例えば、ユーザが特に意識せずともビデオの方を向いて制御命令情報を与えたときだけ、ビデオデッキでの処理を可能にすることもできる。
【００７４】
また、複数の制御命令の時系列情報が相反する内容で時間的に重なりがある場合には、所定のルールを決めておけばよい。例えば、図７の例において、ユーザが発声の途中からビデオカメラの方を向いた場合等については、時間的な重なりが長い方の処理内容を重視する方法を採用してもよく、例えばまた、時間的に一番最後の処理内容に基づいて制御を実施するという方法を採用してもよい。
【００７５】
図８は本発明の第５の実施の形態を示す説明図である。本実施の形態は音声記録装置に適用したものである。図１中の制御命令情報１０１は図８のユーザ７０１の音声に相当し、被制御時系列情報１０２はユーザ７０１の音声及び外界の音声７０２に相当し、時系列情報受信部１０３は信号処理部７０５に相当し、記憶部１０４はメモリ７０７に相当し、演算部１０５は中央演算部７０６に相当し、被制御時系列情報処理部１０６は外部記憶装置７０８に相当する。
【００７６】
本実施の形態の音声記録装置は、ユーザ７０１や装置の外界の音声７０２を記録するものであり、商品化されているカセットレコーダやデジタルメモリレコーダ等を利用可能である。例えば、デジタルメモリレコーダを採用して、ユーザの音声で操作を行う場合について説明する。
【００７７】
音声記録装置７０３は、ユーザ７０１や外界の音声７０２をマイク７０４で収音する。信号処理部７０５は収音された音声を取込んで、後段の処理に必要な情報、例えば、音量や雑音の強さ等のパラメータを含む情報を得る。中央演算部７０６は、入力された音声及び後段の処理に必要な情報を、時間の対応が取れるようにメモリ７０７に記録する。
【００７８】
本実施の形態は、制御命令情報と被制御時系列情報とがマイク７０４を介して入力された時点では区別なく扱われる点である。ユーザ７０１の発声のうち、所定の制御命令を内容とするものについては制御命令情報であり、マイク７０４からのその他の音が被制御時系列情報である。
【００７９】
メモリ７０７は、書き込まれた音声信号をある任意の時間、例えば２０秒間等の所定の一定時間分だけ一時記憶することができるようになっている。即ち、中央演算部７０６は、現在から２０秒前までの音声をメモリ７０７からいつでも呼び出せるようになっている。例えば、リング・バッファ等の技術を採用することで、このような呼び出しが実現可能である。また、この２０秒間のような任意の時間は電池の量や周囲の雑音等の各種状況に応じて変更することができる。
【００８０】
中央処理部７０６は、メモリ７０７上の音声情報を外部記憶装置７０８に記録させることができる。外部記憶装置７０８としては、例えば半導体を利用した記憶装置等を利用することができる。
【００８１】
次に、このように構成された実施の形態の作用について説明する。
【００８２】
本実施の形態における基本的な処理の流れは図５のフローチャートと同様である。
【００８３】
ユーザが音声記録装置７０３に対する命令として「録音」と発声することによって、この命令内容に応じて音声記録装置７０３がマイク７０４からの入力情報を適切に処理する場合の例について説明する。
【００８４】
マイク７０４からの入力情報は、制御命令情報であるか又は被制御時系列情報であり、音声信号として信号処理部７０５に供給される。信号処理部７０５は、入力された信号を音声情報として符号化し時間的な呼び出しができるように、メモリ７０７に記録する。即ち、メモリ７０７には制御命令情報と被制御時系列情報の両方が記憶される。
【００８５】
信号処理部７０５は、中央演算部７０６の制御の下、音声検出や音声認識に必要な特徴量を得る。中央演算部７０６は、抽出された特徴量から、マイク７０４を介して入力された音声情報のうち、ユーザ７０１の発声に基づく音声区間を検出する。そして、中央演算部７０６は、検出された音声区間に対して音声認識を実施する。この音声検出については、ユーザの個人認証も含めて登録されたユーザの音声を検出するようなものでも構わない。検出方法としては前述のように公知の種々の方法を採用することができる。
【００８６】
また、マイク７０４はマイクアレイのように複数のマイクから構成され、信号処理部７０５において遅延和による処理や独立成分分析の処理のような実現可能な音源分離を用いて、ユーザ７０１の音声とそれ以外の音声７０２を分離してから、これまで実施例を述べてきたような手段を用いても構わない。
【００８７】
また、中央演算部７０６は、音声検出と音声認識とについて、例えば音声検出が終ってからその区間内の情報で音声認識を行ってもよいし、音声区間を検出しだした時点から音声認識の処理も逐次的に行いつつ、音声区間の検出が終了した時点の分までの情報を下に音声認識を行ってもよい。また、中央演算部７０６は、音声認識において、制御命令語に不要語が付加された制御命令語を認識できるように、キーワードスポッティングによる音声認識処理を実施するようにしてもよい。この場合には「え〜と、録音」のように「え〜と」のような不要語が付加されていても、「録音」という制御命令語を認識することができる。
【００８８】
このようにして、マイク７０４の入力音声の中から制御命令情報を抽出することができ、更にメモリ７０７に記録された被制御時系列情報との時間的な対応、即ち制御命令情報の音声区間位置を得ることができる。そして制御命令情報の内容に従って、メモリ７０７に記憶された音声情報の処理を決定して処理を行うことができる。例えば「録音」というユーザの発声が開始された時点から、その発声も含めて音声を記録することができる。
【００８９】
このように本実施の形態においては、制御命令情報と被制御時系列情報とを同一の信号処理系によって処理する場合でも、制御命令に基づく制御期間（タイミング）及び制御内容で、被制御時系列情報に対する処理を行うことができる。
【００９０】
図９は本発明の第６の実施の形態を示す説明図である。図９は音声認識結果である制御命令語と制御内容との対応関係を示す概念図である。
【００９１】
本実施の形態は図８の装置を用いた音声記録の方法に適用したものである。本実施の形態においては、図９に示す４つの命令に注目して説明する。
【００９２】
中央演算部７０６（図８参照）は、「３秒前から録音」のように「○秒前から録音」という制御命令が音声認識結果から得られた場合には、メモリ７０７に記録された音声情報のうち、音声区間検出時点から○秒前のタイミングに対応するアドレスから音声情報を読み出して外部記憶装置７０８に出力して記録させる。制御命令情報と被制御時系列情報との時間的な関連付けが行われていることから、メモリ７０７に記憶されている音声情報であれば、このように任意の時間帯の音声情報を記録させることができる。
【００９３】
これにより、ユーザ７０１は周りの音声で記録しておきたいと思える音声があった場合には、そのことを思ったときに発声すればいいという自然なインタフェースで、少し前の音声から収音対象の音声として記録させることができる。
【００９４】
また、単に「録音」と発声した場合には、中央演算部７０６は、音声検出の開始時点からの音声録音を開始させる。また、「録音スタート」と発声した場合には、中央演算部７０６は、発声を終了したと検出された時点からのメモリ７０７に記録された音声情報を、外部記憶装置７０８に転送させて記録させる。これにより、ユーザ７０１からの制御命令語を含まずに録音したい場合と、制御命令語を含んでも直ちに録音したい場合の両方に対応することができる。
【００９５】
また、「録音スタンバイ」と発声した場合には、中央演算部７０６は、次に音声を検出した時点からメモリ７０７に記録された音声を外部記憶装置７０８で記録することにする。具体的には、「録音スタンバイ」と認識したその音声区間以降について、音声の特徴量に注目して新たに音声が入力されたか否かを検出する。そして、新たな音声入力が検出されると、その検出時点からメモリ７０７上の音声情報を外部記憶装置７０８に与えて記録させる。これにより、記録したい音声の前に無音がなく、記録したかった時点からの音声を記録しておくことができる。
【００９６】
このように本実施の形態においては、記録するための制御命令語を発する前の時間も含めて、任意の時間帯の被制御時系列情報である音声を記録することができる。
【００９７】
なお、本実施の形態においては、音声を録音する例について説明したが、録音終了時やその他の機能についても同様に、任意の時間帯又はタイミングで制御可能であることは明らかである。
【００９８】
なお、本発明は上記各実施の形態に限定されるものではなく、種々の変形が考えられる。
【００９９】
【発明の効果】
以上説明したように本発明によれば、制御命令とその制御対象とがいずれも時系列情報である場合において、両者の時間的な関連付けを可能にすることによって、制御対象に対する処理を施す系列上の位置を制御可能にして制御の有効性を向上させることができるという効果を有する。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態に係る時系列情報制御システムを示すブロック図。
【図２】本発明の第２の実施の形態を示すブロック図。
【図３】第２の実施の形態の全体構成を示す説明図。
【図４】音声認識結果と制御内容との対応関係を示す概念図。
【図５】処理の流れを説明するためのフローチャート。
【図６】本発明の第３の実施の形態を示す説明図。
【図７】本発明の第４の実施の形態を示す説明図。
【図８】本発明の第５の実施の形態を示す説明図。
【図９】本発明の第６の実施の形態を示す説明図。
【符号の説明】
１０１…制御命令情報、１０２…被制御時系列情報、１０３…時系列情報受信部、１０４…記憶部、１０５…演算部、１０６…被制御時系列情報処理部。

[0001]
[Technical Field of the Invention]
The present invention relates to a time-series information control system suitable for a control system in which both a control command and a control target are time-series information, a method thereof, and a time-series information control program.
[0002]
[Prior art]
In recent years, control systems have been developed in which control commands are issued to a control target by voice.
[0003]
For example, Patent Document 1 discloses an example in which the necessary frame sections of moving image information are assigned based on the audio level in moving image data with audio and on the analysis result of the moving image data. For instance, by determining whether the person being filmed is facing forward, the frame section in which that person is speaking toward the camera is detected.
[0004]
Patent Document 2 discloses an invention in which the edit points of a moving image are controlled by the audio contained in the moving image information. In that invention, a designated edit point is automatically corrected so that editing takes place at a change point of the audio or image in the moving image (a scene change point or a boundary with a silent portion).
[0005]
[Patent Document 1]
JP 2002-176619 A
[0006]
[Patent Document 2]
JP-A-11-239320
[0007]
[Patent Document 3]
JP 2002-278747 A
[0008]
[Patent Document 4]
Japanese Patent Laid-Open No. 11-109498
[0009]
[Patent Document 5]
JP 2002-342065 A
[0010]
[Problems to be solved by the invention]
As described above, various control systems that handle time-series information have been disclosed. In Patent Documents 1 and 2, time-series information such as video and audio is controlled by audio, which is itself time-series information. However, Patent Documents 1 and 2 do not use the content carried by the audio or other time-series information as a control command; they perform control using the time-series characteristics of the control target. Sophisticated control that depends on the content of the controlling time-series information is therefore impossible.
[0011]
In contrast, Patent Documents 3 to 5 disclose control that does depend on the content of time-series information such as voice. Patent Document 3 discloses a technique for operating a game by voice, and Patent Document 4 discloses a technique for controlling a camera by voice. Patent Document 5 discloses an invention that performs control through dialogue: predetermined control stages are executed one after another in response to utterances.
[0012]
In Patent Documents 3 to 5, both the controlling side and the control target are time-series information. That is, the content of a voice command, which is time-series information, is recognized by speech recognition, and audio, video, and other time-series information is controlled based on the recognition result.
[0013]
When the control target is time-series information, that is, information that changes with time (hereinafter also called controlled time-series information), the effectiveness of control may suffer if the timing of control cannot be set appropriately. Yet when the controlling side is also time-series information such as voice, the processing period for the control target cannot be controlled appropriately.
[0014]
For example, when a command is conveyed to a control target by voice, the speech first undergoes natural language processing and becomes information that a computer can process; the meaning of the speech is interpreted as a control command, which is then converted into a signal that the control target can process and supplied to the control target. Consequently, the point in time at which the person actually intended the command to be executed differs from the point at which the control command derived from the time-series information reaches the control target and is executed, so conventional approaches cannot perform control at the appropriate timing.
[0015]
An object of the present invention is to provide a time-series information control system, a method thereof, and a time-series information control program that, when a control command and its control target are both time-series information, enable the two to be associated in time, thereby making the position on the sequence at which processing is applied to the control target controllable and improving the effectiveness of control.
[0016]
[Means for Solving the Problems]
A time-series information control system according to the present invention comprises: input means for capturing control command information, which is time-series information containing a control command; control command extraction means for extracting the control command from the control command information captured by the input means; feature quantity extraction means for extracting time-series feature quantities from the control command information captured by the input means; association means for obtaining the temporal relationship between the time at which the control command contained in the captured control command information occurred and the position on the sequence at which processing corresponding to the control command is applied to the controlled time-series information targeted by the control command, by comparing each time point of the time-series feature quantities of the control command information with each time point of the time-series feature quantities of the controlled time-series information; and arithmetic means for controlling processing means capable of processing the controlled time-series information, the arithmetic means determining, according to the control command extracted from the control command information, the content of the processing that the processing means applies to the controlled time-series information, and determining, based on the time-series feature quantities of the extracted control command and the temporal relationship obtained by the association means, the position on the sequence at which the processing is applied according to that processing content, while eliminating the difference between the occurrence time of the control command and the position on the sequence at which the processing is applied.
[0017]
In the present invention, the control command extraction means extracts the control command from the control command information captured by the input means. The feature quantity extraction means extracts time-series feature quantities from the control command information. The association means obtains the temporal relationship between the captured control command information and the controlled time-series information targeted by the control command. The arithmetic means controls the processing means based on the control command and time-series feature quantities extracted from the control command information and on the temporal relationship obtained by the association means, and determines both the content of the processing applied to the controlled time-series information and the position on the sequence at which it is applied. As a result, the controlled time-series information is processed with content that matches the control command, at a time or timing that has an arbitrary temporal relationship to the time or timing at which the control command occurred.
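The decision described in this paragraph can be illustrated with a minimal Python sketch. It is not the patented implementation: the `ACTIONS` table, the `Decision` type, and the use of a single clock offset as the "temporal relationship" are all assumptions made for illustration, and speech recognition is assumed to have already produced `recognized_word`.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str      # processing content chosen from the control command
    position: float  # position on the controlled sequence, in seconds


# Hypothetical command table: recognized word -> processing content.
ACTIONS = {"edit point": "cut", "slow": "slow_playback"}


def decide(recognized_word, command_start, clock_offset):
    """Combine the extracted command with the temporal relationship.

    command_start is the time the command began on the command-side clock;
    clock_offset (assumed constant) maps that clock to the controlled sequence.
    """
    action = ACTIONS.get(recognized_word)
    if action is None:
        return None  # not a registered control command
    return Decision(action, command_start + clock_offset)
```

The point of the sketch is that the action comes from the command content while the position comes from when the command occurred, not from when recognition finished.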
[0018]
Note that the invention described here as an apparatus can equally be embodied as a method.
[0019]
The invention described as an apparatus can also be embodied as a program that causes a computer to execute the corresponding processing.
[0020]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. FIG. 1 is a block diagram showing a time-series information control system according to the first embodiment of the present invention.
[0021]
In FIG. 1, control command information 101 is the information of a control command that the user gives to the device to be controlled, and it has time-series characteristics. For example, the control command information 101 includes not only voice commands and gesture commands but also commands such as button operations, where each individual event is instantaneous yet the duration of a press, or the order in which several buttons are pressed, carries meaning.
[0022]
The controlled time-series information 102 is the time-series information to be controlled. The user can control the controlled time-series information 102 by issuing control command information 101 to a given device (not shown). The controlled time-series information 102 falls broadly into two kinds: information that originates outside the device, such as the moving image information to be edited when the device is a video editing apparatus or the radio-wave information when the device is a radio; and information generated by the device itself, such as time-series simulation results when the device is a computer or the responses generated by the system when the device is a dialogue device. In either case, the controlled time-series information 102 is information that the user wants to control and refers to time-series information that the device is able to control.
[0023]
The time-series information receiving unit 103 receives the control command information 101 and, where necessary, the controlled time-series information 102, and applies whatever signal processing is needed for the subsequent stages. For example, when the control command information 101 is the user's speech, the time-series information receiving unit 103 receives the speech through a microphone (not shown), performs signal processing such as frequency analysis, and converts it into the form used for speech recognition. Likewise, when the controlled time-series information 102 is moving image information, the unit retrieves it from a medium such as a video tape or DVD-Video and converts it into the form used in subsequent processing.
[0024]
The storage unit 104 is a storage area that holds the time-series information obtained by the time-series information receiving unit 103 and that serves as temporary work space and as storage for the computation procedures of the calculation unit 105 and the controlled time-series information processing unit 106, both described later.
[0025]
The calculation unit 105 controls each unit in FIG. 1 and performs the necessary computation. It can extract a control command from the control command information 101 obtained by the time-series information receiving unit 103; for example, when the control command is spoken by the user, it extracts the command by speech recognition. The calculation unit 105 can also extract time-series feature quantities from the control command information 101.
[0026]
In the present embodiment, the calculation unit 105 obtains the temporal relationship between the time axis of the control command information 101 and the controlled time-series information 102. For example, it uses the time-series feature quantities of the control command information 101 as the time information from which that relationship is derived. The time information may instead be extracted by the time-series information receiving unit 103. Alternatively, the control command information 101 and the controlled time-series information 102 may be timestamped, or a fixed amount of time-series information may be held in a ring buffer in the storage unit 104 so that the temporal correspondence can be derived from the storage position.
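As a rough illustration of the timestamp and ring-buffer options mentioned above, the following Python sketch keeps a fixed number of timestamped samples and looks up the storage position that corresponds to a given time. The class name `TimedRingBuffer` and its methods are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class TimedRingBuffer:
    """Fixed-capacity buffer of (timestamp, sample) pairs; the oldest entries drop off."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def append(self, timestamp, sample):
        # Assumes timestamps are appended in non-decreasing order.
        self._items.append((timestamp, sample))

    def index_at(self, timestamp):
        """Return the buffer index of the first entry at or after `timestamp`, or None."""
        for i, (t, _) in enumerate(self._items):
            if t >= timestamp:
                return i
        return None

    def read_from(self, timestamp):
        """Return every stored sample whose timestamp is at or after `timestamp`."""
        return [s for t, s in self._items if t >= timestamp]


buf = TimedRingBuffer(capacity=5)
for t in range(8):  # timestamps 0..7 are written; only 3..7 remain in the buffer
    buf.append(float(t), f"frame{t}")
```

Here `buf.read_from(5.0)` returns the samples stored from time 5.0 onward, which is the kind of lookup needed to act on a position that is already in the past.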
[0027]
In addition, when the origin of the controlled time series information 102 is time series information generated from the device, the calculation unit 105 generates the time series information 102.
[0028]
The calculation unit 105 can also perform control for obtaining time-series feature quantities from the controlled time-series information 102. For example, it obtains the time information of the controlled time-series information 102 by detecting scene breaks when that information is moving image information, or by obtaining volume information when it is audio.
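The two feature quantities named here, volume for audio and scene breaks for video, could be computed in many ways; the source does not prescribe one. The Python sketch below assumes per-frame RMS as the volume measure and the mean absolute pixel difference between consecutive frames as a scene-change score. Both function names are hypothetical.

```python
import math


def frame_rms(samples, frame_len):
    """Split `samples` into non-overlapping frames and return the RMS of each frame."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms


def scene_change_scores(frames):
    """Mean absolute difference between consecutive frames.

    Each frame is a flat list of pixel values; a large score suggests a scene break.
    """
    scores = []
    for prev, cur in zip(frames, frames[1:]):
        scores.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return scores
```

Either list is indexed by frame number, so multiplying an index by the frame period gives the time information used for the association.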
[0029]
The controlled time-series information processing unit 106 processes the controlled time-series information 102 in accordance with the result reached by the calculation unit 105, which considers the control command, the temporal feature quantities of that command, and in some cases the time-series features of the controlled time-series information 102, and decides what processing to apply to the controlled time-series information 102.
[0030]
Next, the operation of the embodiment configured as described above will be described.
[0031]
For example, suppose that the control command information 101 in a given time range is a spoken command to start editing, and that the controlled time-series information 102 is the moving image information being edited in a video editing session. Both pieces of information 101 and 102 are received by the time-series information receiving unit 103. The calculation unit 105 obtains the reference time for control processing from, for example, the value of a timer provided in the microphone (not shown) that captures the control command information 101 and the value of a timer provided in the video editing apparatus (not shown) that processes the controlled time-series information 102.
[0032]
Here, it is assumed that the operator determines that the editing point of the moving image being reproduced is reached on the monitor screen of the video editing apparatus (not shown). The operator utters “edit point” by voice through the microphone. The control command information 101 is received by the time series information receiving unit 103.
[0033]
The arithmetic unit 105 extracts the time-series feature amount of the control command information 101 and, from changes in this feature amount, detects both the start timing and the end timing of the utterance of “edit point”. The arithmetic unit 105 acquires the timer values of the microphone and the video editing apparatus during the reception period of “edit point” as the time information of the control command information 101 and the controlled time-series information 102. It then obtains the correspondence between each time in the reception period of “edit point” and the time information of the controlled time-series information 102, that is, the temporal relationship between the control command information 101 and the controlled time-series information 102.
[0034]
The arithmetic unit 105 determines the content of the control command by speech recognition of the utterance “edit point” and, based on the determination result, controls the controlled time-series information processing unit 106 to process the controlled time-series information 102. Because the arithmetic unit 105 needs processing time to determine the content of the control command, the controlled time-series information processing unit 106 starts controlling the controlled time-series information 102 through the video editing device only after the period in which “edit point” was uttered has ended.
[0035]
However, depending on the content of the control command, the time point at which the arithmetic unit 105 acts on the controlled time-series information 102 may lie in the past. Therefore, in the present embodiment, the arithmetic unit 105 sets, for example, the timing (position on the sequence) of the time information of the controlled time-series information 102 that corresponds to the reception start timing of “edit point” as the editing point of the moving image information, and the moving image information is edited before and after this timing.
[0036]
Note that the arithmetic unit 105 may perform control so that editing is performed at the timing of the time information of the controlled time-series information 102 corresponding to any time point within the detected utterance of “edit point”, or at the timing corresponding to any time point having a predetermined relationship with the reception period of “edit point”.
[0037]
As described above, in the present embodiment, the temporal relationship between the time-series information of the control command and the controlled time-series information is obtained. Using this relationship, processing based on the content of the control command is executed on the controlled time-series information at a period or timing that has a predetermined relationship with the period or timing at which the time-series information of the control command was generated. As a result, the processing intended by the control command takes effect at the appropriate position, which improves the effectiveness of the control.
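The time association described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the two timers differ only by a constant offset and that a pair of readings taken at the same instant is available as the reference time; the names `Utterance`, `to_controlled_time` and `edit_point` are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Utterance interval measured on the command-side (microphone) timer."""
    start: float  # seconds
    end: float    # seconds


def to_controlled_time(t_command: float, command_ref: float,
                       controlled_ref: float) -> float:
    """Map a command-side time onto the controlled stream's time axis.

    command_ref and controlled_ref are timer readings taken at the same
    instant on the two devices (assumed constant offset, no drift).
    """
    return controlled_ref + (t_command - command_ref)


def edit_point(utt: Utterance, command_ref: float, controlled_ref: float,
               anchor: str = "start", offset: float = 0.0) -> float:
    """Return the position on the controlled stream that the command acts on.

    anchor selects the utterance start or end; offset expresses any other
    predetermined relationship to the reception period.
    """
    if anchor not in ("start", "end"):
        raise ValueError("anchor must be 'start' or 'end'")
    base = utt.start if anchor == "start" else utt.end
    return to_controlled_time(base + offset, command_ref, controlled_ref)
```

With `anchor="start"` the sketch reproduces the case in which the editing point is placed at the start of the utterance, even though recognition of the command completes only afterwards.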
[0038]
In the above embodiment, the functions of the respective units can be realized by program modules on a computer.
[0039]
FIG. 2 is a block diagram showing a second embodiment of the present invention. FIG. 3 is an explanatory diagram showing the overall configuration of the second embodiment.
[0040]
This embodiment shows an example in which the first embodiment is applied to a video recording / reproducing apparatus as a specific control system. The control command information 101 in FIG. 1 corresponds to the microphone input 302 in FIG. The controlled time-series information 102 corresponds to the information stored on the video tape 308, the time-series information receiving unit 103 corresponds to the signal processing unit 303 and the video tape control device 307, and the storage unit 104 corresponds to the memory 305. The arithmetic unit 105 corresponds to the central processing unit 304, and the controlled time-series information processing unit 106 corresponds to the video signal processing unit 309.
[0041]
First, the overall configuration of the control system will be described with reference to FIG.
[0042]
A user 201 wears a headset microphone 202. The headset microphone 202 converts the voice uttered by the user 201 into an audio signal and supplies it to the video deck 203. The video deck 203 has a voice recognition function, and can operate each unit by inputting a voice signal. The video deck 203 can play back video information recorded on a predetermined recording medium and display it on the display 204.
[0043]
In FIG. 3, the headset microphone 202 is used for voice input. However, the microphone may be located away from the user, as with a hands-free phone, and any technique generally used in voice input devices to pick up the user's voice can be adopted. As for the voice recognition function of the video deck 203, an existing technique such as word speech recognition using a Hidden Markov Model (HMM) may be used, or any pattern recognition technique capable of classifying the user's utterance into a predetermined vocabulary may be employed.
[0044]
In such a configuration, the user 201 operates the video deck 203 by voice input. For example, the video deck 203 is controlled by uttering the names of various processes such as “recording”, “playback”, “special playback”, and “stop”.
[0045]
FIG. 3 shows an example of such a video deck.
[0046]
The video deck 203 receives the voice uttered by the user 201 as the microphone input 302 via the headset microphone 202. The microphone input 302 is input to the signal processing unit 303 of the video deck 203. In response to instructions from the central processing unit 304, the signal processing unit 303 converts the time-series signal of the microphone input 302 into a format necessary for subsequent processing, or extracts feature amounts from it. The signal processing unit 303 can be realized by, for example, a signal processing chip such as a DSP (Digital Signal Processor) or a signal processing module in a program. The signal processing unit 303 may also be configured according to the time-series feature amounts and the like to be handled.
[0047]
The central processing unit 304 controls the entire video deck 203. It stores signals from the signal processing unit 303 in the memory 305 serving as a storage area, receives the video deck key input 306 and performs the corresponding operation, and controls the video-related operations described later. The central processing unit 304 may be configured by an existing control chip and a control program. The memory 305 is used here as a generic term for the storage that holds the processing data and programs of the central processing unit 304, the time-series signals obtained from the microphone input 302 via the signal processing unit 303, and the information obtained from the video tape 308. The memory 305 can be configured by a memory chip or the like.
[0048]
Under the control of the central processing unit 304, the video tape control device 307 can change the reading position of the video tape 308 or reproduce the video tape 308 to read moving image information. The video tape control device 307 outputs the reproduced moving image information to the signal processing unit 303 as necessary, and outputs moving image information as controlled time series information to the video signal processing unit 309. The video signal processing unit 309 receives the moving image information from the video tape control device 307, converts it into information in a format that can be displayed on the display 204, and outputs it as a video output 310. These video tape control device 307 and video signal processing unit 309 can be realized by a corresponding part of a video deck that is generally commercialized.
[0049]
Next, the operation of the embodiment configured as described above will be described with reference to FIGS. FIG. 4 is a conceptual diagram showing the correspondence between speech recognition results and control contents, and FIG. 5 is a flowchart for explaining the flow of processing.
[0050]
The video deck 203 can capture voice as a control command and extract the control content by voice recognition. In this embodiment, the voice recognition result that is the extracted instruction word and the corresponding control content have the correspondence shown in the conceptual diagram of FIG. As shown in FIG. 4, for example, in response to the user's “play” command utterance, the video deck 203 executes a process of starting playback from the stop position of the video tape 308.
[0051]
It is assumed that the video deck 203 is already in a playback state. Here, it is assumed that the user 201 utters a control command in Step 501 of FIG. It is assumed that the utterance content is “slow playback”. The utterance of the user 201 that is a time-series signal is input to the signal processing unit 303 of the video deck 203 as a microphone input 302. The signal processing unit 303 detects the time when the utterance has started from the time-series signal of the microphone input 302 and performs conversion into a feature amount necessary for speech recognition (step 502). This feature amount can be realized by a speech recognition feature amount obtained by, for example, a mel frequency cepstrum coefficient (hereinafter referred to as MFCC) or a perceptual linear predictive analysis (hereinafter referred to as PLP: Perceptual Linear Predictive Analysis).
[0052]
The user's utterance can be detected by commonly used methods, such as judging from feature amounts like the volume of the input speech, the zero crossings of the input speech waveform, or the coefficients of linear prediction analysis and their residual. Such time-series feature amounts of the control command information can be used for temporal correspondence with the controlled time-series information, as described above. This embodiment describes an example in which a binary value indicating the presence or absence of an utterance is detected; depending on the application, however, a continuous value expressing the certainty of the utterance determination may be used as the voice detection information.
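As an illustration of binary utterance detection from simple features, the following Python sketch computes the frame volume (RMS) and zero-crossing rate and reports the first run of frames whose volume exceeds a threshold. The frame length, the threshold and the function names are assumptions chosen for the example; the embodiment does not prescribe them, and only the volume is used for the decision here.

```python
import math
from typing import List, Optional, Tuple


def frame_features(samples: List[float], frame_len: int) -> List[Tuple[float, float]]:
    """Return (rms_volume, zero_crossing_rate) for each non-overlapping frame."""
    feats = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        feats.append((rms, crossings / (frame_len - 1)))
    return feats


def detect_utterance(samples: List[float], sample_rate: int,
                     frame_len: int = 160,
                     rms_threshold: float = 0.1) -> Optional[Tuple[float, float]]:
    """Return (start_sec, end_sec) of the first run of loud frames, or None.

    The zero-crossing rate is computed as an additional feature but this
    sketch thresholds on volume only.
    """
    feats = frame_features(samples, frame_len)
    start = None
    for idx, (rms, _zcr) in enumerate(feats):
        active = rms >= rms_threshold
        if active and start is None:
            start = idx
        elif not active and start is not None:
            return (start * frame_len / sample_rate, idx * frame_len / sample_rate)
    if start is not None:
        return (start * frame_len / sample_rate, len(feats) * frame_len / sample_rate)
    return None
```

The returned start time is the quantity that is later associated with the playback position of the controlled time-series information.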
[0053]
The voice detection information and the time-series feature amounts, such as the feature amounts necessary for recognition, obtained by the signal processing unit 303 are stored in the memory 305 as necessary under the control of the central processing unit 304. The central processing unit 304 then associates the voice detection time information with the time of the moving image reproduction information stored on the video tape 308 as controlled time-series information, and extracts the control command from the utterance by speech recognition processing (step 503). As a result, the user's control command is determined to be “slow playback”.
[0054]
Based on the correspondence between control commands and command processing, as shown in FIG. 4, the central processing unit 304 uses the voice detection time information obtained from the time-series information of the control command and decides to perform the process of “returning the video to the time when the control command was uttered and starting slow playback from that time” (step 504). The central processing unit 304 generates a procedure necessary for the determined processing and controls each processing unit. As a result, the video tape control device 307 rewinds the video tape 308 to the playback position at the time when the user 201 started speaking and starts slow playback. The video signal reproduced by the video tape control device 307 is supplied to the display 204 via the video signal processing unit 309 and displayed on the screen (step 505).
[0055]
Thus, in the present embodiment, slow playback can be started from the playback position the user intended, without any special operation by the user.
[0056]
In addition, to guard against erroneous recognition, when “cancel” is uttered, processing may be determined with reference to the time point at which the previous command was issued. Such processing depends on the application. Even in this case, the principle is the same: the processing period (timing) for the controlled time-series information is determined by the time-series feature amount of the control command, and the processing content for the controlled time-series information is determined from the content of the control command.
[0057]
FIG. 6 is an explanatory view showing a third embodiment of the present invention. The configuration of the present embodiment is the same as that shown in FIGS. 2 and 3, and only the control method is different. This embodiment is applied to an example in which a video deck is operated by sound.
[0058]
The present embodiment is an example in which moving image scene detection is performed on moving image information obtained from the video tape control device 307 (see FIG. 3). FIG. 6 shows the correspondence between the control command and command processing in this case. As shown in FIG. 6, a category is assigned to each control instruction. In the example of FIG. 6, each control command belongs to one of three categories of “law execution”, “video adjustment”, and “time point execution”.
[0059]
Assume that the user 201 (see FIG. 2) utters “pause”. Even if the user 201 adds unnecessary words before or after the utterance registered as the control command word, for example “Please pause”, the necessary portion can still be recognized by using an existing keyword spotting technique.
[0060]
Also in this embodiment, the processing is basically executed according to the flowchart of FIG. In the present embodiment, the central processing unit 304 extracts the control command category from the conceptual diagram shown in FIG. Note that a method for the central processing unit 304 to determine the category can be realized by, for example, preparing an array table on a program.
[0061]
In the present embodiment, both the control command itself and the extracted category are considered when executing the control command. According to FIG. 6, the category for the control command “pause” uttered by the user 201 is “video adjustment”, and the processing defined for this category is taken into account when the command is executed. Specifically, using the correspondence between the control command and the controlled time-series information, attention is paid to the position in the controlled time-series information at which the utterance of “pause” started. For the category “video adjustment”, the controlled time-series information recorded in the vicinity of the playback position at the time the control command was issued is searched and scene detection is performed. A scene switching position near the playback position corresponding to the time when the user said “pause” is then designated, and playback of the controlled time-series information is paused there. As a result, the user 201 can intuitively specify by voice the scene he or she wants to focus on, and control is automatically moved to a point the user is likely to want, such as just after the video switches. The position selected relative to the time of the control command may be the nearest scene change position, the position with the largest scene change around that time, or the position judged most appropriate by a function of the time distance from that point and the scene change amount. The same applies, beyond the examples described here, to the time-series features of the control command information and the controlled time-series information and to the processing contents determined in the embodiments of the present invention.
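The three ways of choosing a scene change position mentioned above can be sketched as follows. This is only an illustration: the representation of a scene change as a `(position, change_amount)` pair, the search window, and the linear score `change_amount - distance_weight * distance` are assumptions, not details given in the embodiment.

```python
from typing import List, Optional, Tuple

# (position in seconds on the controlled stream, scene change amount)
SceneChange = Tuple[float, float]


def _within(changes: List[SceneChange], t: float, window: float) -> List[SceneChange]:
    """Keep only the scene changes no further than window seconds from t."""
    return [c for c in changes if abs(c[0] - t) <= window]


def nearest_change(changes: List[SceneChange], t: float, window: float) -> Optional[float]:
    """Scene change closest in time to the command time t."""
    cands = _within(changes, t, window)
    return min(cands, key=lambda c: abs(c[0] - t))[0] if cands else None


def largest_change(changes: List[SceneChange], t: float, window: float) -> Optional[float]:
    """Scene change with the largest change amount around t."""
    cands = _within(changes, t, window)
    return max(cands, key=lambda c: c[1])[0] if cands else None


def best_change(changes: List[SceneChange], t: float, window: float,
                distance_weight: float = 1.0) -> Optional[float]:
    """Scene change maximizing a score of change amount and time distance."""
    cands = _within(changes, t, window)
    if not cands:
        return None
    return max(cands, key=lambda c: c[1] - distance_weight * abs(c[0] - t))[0]
```

Setting `distance_weight` to zero reduces `best_change` to `largest_change`, while a large weight makes it behave like `nearest_change`.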
[0062]
It should be noted that the category of the control command as described above may be fixed from the beginning, or the category and processing may be associated with each other or the content may be changed by a higher-level application.
[0063]
Further, scene detection can be realized if there is a mechanism for determining from feature amount information extracted from a moving image by existing change point detection. In addition, when searching for the feature amount of the controlled time-series information near the time when the control command is issued (in the example of FIG. 6, when detecting a moving image scene at the timing of “pause”), the voice of the control command A search may be made within a certain time frame based on the time point at which is detected, or the time range may be changed by a control command. In addition, regarding scene detection, an operation for each category may be performed using a change in an image on a moving image using an image recognition technique. For scene detection, not only the two-dimensional information of the moving image but also information accompanying the moving image, for example, audio information may be used.
[0064]
It is also possible to store an arbitrary amount, an arbitrary duration, or the entirety of the controlled time-series information in a storage area such as a memory and to perform scene detection on the information in the storage area. Processing may also be performed on the controlled time-series information in the storage area, with the processing period and processing content determined from the control command extracted from the control command information and its time-series feature amount, and the result may be output. For example, in the present embodiment, the controlled time-series information is stored sequentially in the storage area and, in response to the “slow playback” control command, information is read from the time point (address) in the storage area that corresponds to the start time of the control command, so that slow playback from the desired position can be performed.
[0065]
In addition, parameters may be used when extracting feature amounts from the control command information and the controlled time-series information. For example, noise estimation is performed on the voice that is the control command information in the present embodiment and on the audio information that accompanies the controlled time-series information, and the estimated noise information is stored as a parameter. The predetermined feature amount may then be obtained after removing the influence of the noise. This yields feature amounts of the time-series information with higher accuracy, so that the processing period based on the control command can be controlled more accurately and more effective control can be performed. Specific examples of such parameters include filter parameters for estimating the directions of the target sound and the interfering sound with a microphone array (for example, Hitoshi Nagata et al., “Study on speaker tracking 2-channel microphone array”, Transactions of the Institute of Electronics, Information and Communication Engineers A, Vol. J82-A, No. 6, pp. 860-866, June 1999) and the estimated noise in the spectral subtraction method (for example, Saeed V. Vaseghi, “Advanced Digital Signal Processing and Noise Reduction” (UK), 2nd edition, Wiley, September 2000). Further, the strength of the estimated noise used as a parameter may influence the processing of the controlled time-series information. For example, in the present embodiment, when the amount of estimated noise is small, the scene detection result based on the audio information attached to the moving image may be emphasized, whereas when the amount of estimated noise is large, the processing period (or timing) of the control command may be determined with emphasis on the scene detection result based on the image information of the moving image.
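One possible way to let the estimated noise strength shift the emphasis between audio-based and image-based scene detection is a simple weighted average, sketched below. The linear weighting, the normalisation range and the name `fuse_scene_scores` are assumptions made for this example only; the embodiment states merely that the emphasis may depend on the amount of estimated noise.

```python
def fuse_scene_scores(audio_score: float, image_score: float,
                      noise_level: float,
                      noise_floor: float = 0.0,
                      noise_ceiling: float = 1.0) -> float:
    """Blend audio-based and image-based scene detection scores.

    Low estimated noise -> trust the audio-based score.
    High estimated noise -> trust the image-based score.
    The weight varies linearly between noise_floor and noise_ceiling.
    """
    span = noise_ceiling - noise_floor
    if span <= 0:
        raise ValueError("noise_ceiling must exceed noise_floor")
    w_image = min(1.0, max(0.0, (noise_level - noise_floor) / span))
    return (1.0 - w_image) * audio_score + w_image * image_score
```

The fused score could then be supplied as the scene change amount when selecting the position at which the control command takes effect.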
[0066]
FIG. 7 is an explanatory view showing a fourth embodiment of the present invention. In FIG. 7, the same components as those in FIG.
[0067]
This embodiment is different from the second embodiment in that a video deck 208 is used instead of the video deck 203 and a video camera 205 is added. The video camera 205 can take an image of the user 201. Note that the video camera 205 may have a function of automatically tracking a user.
[0068]
The video deck 208 has the same configuration as the video deck 203 in FIG. 3. It receives a video signal from the video camera 205 via the signal processing unit 303 (see FIG. 3), and the central processing unit 304 or the like can perform predetermined signal processing on it. For example, the central processing unit 304 can determine from the image information of the user 201 captured by the video camera 205 whether or not the user 201 is facing the video deck 208. For this purpose, the central processing unit 304 may employ, for example, a method of estimating the orientation of the user's face from the proportion of the white part of the user's eyeballs in the input image.
[0069]
In the embodiment configured as described above, the video deck 208 can receive a control command based on image information from the video camera 205 in addition to the control command received from the microphone input 302. In the second embodiment described above, in response to the utterance of a control command by the user 201, the command is recognized and, according to its content, the playback position of the controlled time-series information is returned to the start of the utterance, which is one of the acoustic features. In the present embodiment, the face direction information of the user 201 is additionally reflected in the determination of the processing content.
[0070]
That is, as in the second embodiment, the central processing unit 304 uses the voice input from the headset microphone 202 both for extracting the control command and for detecting the utterance section of the control command based on the time-series feature amount. The information from the video camera 205, on the other hand, is used by the central processing unit 304 as a binary control command: processing is executed when the user 201 is facing the video camera 205 (or the video deck 208) and is not executed when the user 201 is not facing it.
[0071]
In this case, the time-series feature amount of the information from the video camera 205 corresponds to the face direction of the user 201. As this feature amount, the central processing unit 304 may make a binary judgment of whether the user 201 is facing the video camera 205, depending on whether the face direction of the user 201 is within a certain range, or may obtain a continuous quantity by expressing the direction as an angle.
[0072]
In this way, a control command and a time-series feature amount can be determined for each of the audio information from the microphone input 302 and the image information from the video camera 205. Thus, for example, control can be performed so that the actual processing is carried out only when there is a control command input from the headset microphone 202 and the control command input from the video camera 205 indicates execution of the process.
[0073]
As described above, in the present embodiment, when the control command information is time-series information from a plurality of information sources, the processing for the controlled time-series information is determined by integrating the contents of each piece of control command information, which enables more accurate and effective control. For example, without any special awareness on the user's part, processing in the video deck can be enabled only when the user is actually directing the control command at the video deck.
[0074]
In addition, a predetermined rule may be established for cases where the time-series information of a plurality of control commands conflicts in time. For example, in the example of FIG. 7, when the user turns to face the video camera partway through the utterance, a method that gives priority to the processing content with the longer temporal overlap may be adopted, or a method that performs control based on the temporally last processing content may be adopted.
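The two conflict rules mentioned above can be sketched in Python as follows. The sketch assumes that the utterance section and the periods during which the user faces the camera are available as `(start, end)` intervals on a common time axis and that the facing intervals do not overlap one another; the function and rule names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec) on a common time axis


def overlap(a: Interval, b: Interval) -> float:
    """Length of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def should_execute(utterance: Interval, facing: List[Interval],
                   rule: str = "longest_overlap") -> bool:
    """Decide whether a voice command is executed, given face direction.

    longest_overlap: execute if the user faced the camera for more than
                     half of the utterance.
    last_state:      execute if the user faced the camera at the moment
                     the utterance ended.
    """
    duration = utterance[1] - utterance[0]
    facing_time = sum(overlap(utterance, f) for f in facing)
    if rule == "longest_overlap":
        return facing_time > duration - facing_time
    if rule == "last_state":
        return any(f[0] <= utterance[1] <= f[1] for f in facing)
    raise ValueError(f"unknown rule: {rule}")
```

For an utterance from 10.0 s to 12.0 s during which the user turns toward the camera at 11.5 s, the first rule rejects the command and the second accepts it, which shows why the rule needs to be fixed in advance.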
[0075]
FIG. 8 is an explanatory view showing a fifth embodiment of the present invention. This embodiment is applied to an audio recording apparatus. The control command information 101 in FIG. 1 corresponds to the voice of the user 701 in FIG. 8, the controlled time series information 102 corresponds to the voice of the user 701 and the external voice 702, and the time series information receiving unit 103 is a signal processing unit. 705, the storage unit 104 corresponds to the memory 707, the calculation unit 105 corresponds to the central calculation unit 706, and the controlled time-series information processing unit 106 corresponds to the external storage device 708.
[0076]
The audio recording apparatus of the present embodiment records the voice of the user 701 and the sound 702 of the world outside the apparatus; a commercially available cassette recorder, digital memory recorder, or the like can be used. The following describes, as an example, a case where a digital memory recorder is employed and operated by the user's voice.
[0077]
The audio recording device 703 picks up the voice of the user 701 and the external sound 702 with the microphone 704. The signal processing unit 705 captures the collected sound and obtains information necessary for subsequent processing, for example, information including parameters such as volume and noise intensity. The central processing unit 706 records the input voice and the information necessary for subsequent processing in the memory 707 in a form that allows them to be associated with time.
[0078]
A feature of the present embodiment is that the control command information and the controlled time-series information are handled without distinction when they are input via the microphone 704. Among the utterances of the user 701, those whose content is a predetermined control command are control command information, and all other sounds from the microphone 704 are controlled time-series information.
[0079]
The memory 707 temporarily stores the written audio signal for a predetermined period, such as 20 seconds. That is, the central processing unit 706 can retrieve from the memory 707 the audio from any time point between 20 seconds ago and the present. Such retrieval can be realized, for example, with a ring buffer. The retention period, 20 seconds in this example, can be changed according to conditions such as the remaining battery level and ambient noise.
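A ring buffer that keeps the most recent audio and allows it to be read back from a given stream time can be sketched as follows. The class name `AudioRingBuffer` and its methods are hypothetical, samples are held as plain floats for simplicity, and the buffer capacity is fixed at construction (changing the retention period at run time is not shown).

```python
from typing import Iterable, List


class AudioRingBuffer:
    """Fixed-capacity buffer holding the most recent `seconds` of audio."""

    def __init__(self, seconds: float, sample_rate: int) -> None:
        self.sample_rate = sample_rate
        self.capacity = int(seconds * sample_rate)
        self.buf: List[float] = [0.0] * self.capacity
        self.total = 0  # number of samples written since the stream started

    def write(self, samples: Iterable[float]) -> None:
        """Append samples, overwriting the oldest ones when full."""
        for s in samples:
            self.buf[self.total % self.capacity] = s
            self.total += 1

    def read_from(self, t_sec: float) -> List[float]:
        """Return samples from stream time t_sec up to now.

        If t_sec is older than the retained range, reading starts at the
        oldest sample still held in the buffer.
        """
        oldest = max(self.total - self.capacity, 0)
        first = max(int(round(t_sec * self.sample_rate)), oldest)
        return [self.buf[i % self.capacity] for i in range(first, self.total)]
```

Because every sample index corresponds to a stream time, the start time of a detected utterance can be converted directly into a read position, which is the time association the embodiment relies on.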
[0080]
The central processing unit 706 can record audio information on the memory 707 in the external storage device 708. As the external storage device 708, for example, a storage device using a semiconductor can be used.
[0081]
Next, the operation of the embodiment configured as described above will be described.
[0082]
The basic processing flow in the present embodiment is the same as the flowchart of FIG.
[0083]
An example in which the user utters “record” as a command to the voice recording device 703 and the voice recording device 703 appropriately processes input information from the microphone 704 in accordance with the content of the command will be described.
[0084]
Input information from the microphone 704 is control command information or controlled time-series information, and is supplied to the signal processing unit 705 as an audio signal. The signal processing unit 705 encodes the input signal as voice information and records it in the memory 707 so that it can be called in time. That is, the memory 707 stores both control command information and controlled time series information.
[0085]
The signal processing unit 705 obtains the feature amounts necessary for voice detection and voice recognition under the control of the central processing unit 706. Using the extracted feature amounts, the central processing unit 706 detects the speech section corresponding to the utterance of the user 701 in the audio input via the microphone 704, and then performs speech recognition on the detected speech section. Voice detection may also incorporate personal authentication of the user so that only the registered user's voice is detected. As the detection method, various known methods can be adopted, as described above.
[0086]
The microphone 704 may be composed of a plurality of microphones, such as a microphone array. In that case, the signal processing unit 705 may first separate the voice of the user 701 from the other sound 702 by an available sound source separation technique, such as delay-and-sum processing or independent component analysis, and then apply the means described in the embodiments.
[0087]
In addition, the central processing unit 706 may perform voice recognition using the information in the voice section after voice detection has finished, or may perform voice recognition sequentially from the time the voice section is first detected, up to the time the detection of the voice section is completed. Further, the central processing unit 706 may perform voice recognition by keyword spotting so that a control command word can be recognized even when an unnecessary word is added to it. In this case, the control command word “recording” can be recognized even if a filler such as “um” is added, as in “um, recording”.
[0088]
In this way, the control command information can be extracted from the voice input through the microphone 704, and its temporal correspondence with the controlled time-series information recorded in the memory 707, that is, the position of the voice section of the control command information, can be obtained. The voice information stored in the memory 707 can then be selected and processed according to the content of the control command information. For example, the voice, including the utterance itself, can be recorded from the time when the user uttered “recording”.
[0089]
As described above, in this embodiment, even when the control command information and the controlled time-series information are processed by the same signal processing system, the controlled time-series information can be processed with the control period (timing) and control content based on the control command.
[0090]
FIG. 9 is an explanatory view showing a sixth embodiment of the present invention. FIG. 9 is a conceptual diagram showing the correspondence between control command words, which are voice recognition results, and control contents.
[0091]
This embodiment is applied to an audio recording method using the apparatus of FIG. In the present embodiment, description will be made by paying attention to the four instructions shown in FIG.
[0092]
When a control command of the form “record from ○ seconds ago”, for example “record from three seconds ago”, is obtained from the voice recognition result, the central processing unit 706 (see FIG. 8) reads, from the voice information recorded in the memory 707, the voice information starting at the address corresponding to ○ seconds before the voice section detection time, and outputs it to the external storage device 708 for recording. Because the control command information and the controlled time-series information are associated in time, the voice information stored in the memory 707 can be recorded in this manner.
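Reading back audio from a point before the command was detected can be sketched as follows. This is a minimal illustration, assuming that the memory 707 behaves like a fixed-capacity buffer of samples addressed by an absolute sample index. The class `AudioRingBuffer`, the function `start_index_seconds_before`, and the sample rate are hypothetical and do not appear in the specification.

```python
from collections import deque


class AudioRingBuffer:
    """Keeps the most recent samples so that earlier audio can be read back."""

    def __init__(self, sample_rate, capacity_seconds):
        self.sample_rate = sample_rate
        self.samples = deque(maxlen=sample_rate * capacity_seconds)
        self.total_written = 0  # absolute index of the next sample to be written

    def write(self, chunk):
        chunk = list(chunk)
        self.samples.extend(chunk)
        self.total_written += len(chunk)

    def read_from(self, absolute_index):
        """Return the buffered samples starting at an absolute sample index."""
        oldest = self.total_written - len(self.samples)
        start = max(absolute_index, oldest)  # clamp if already overwritten
        return list(self.samples)[start - oldest:]


def start_index_seconds_before(detection_index, seconds, sample_rate):
    """Absolute sample index lying `seconds` before the detection index."""
    return max(0, detection_index - int(seconds * sample_rate))
```

For example, with a sample rate of 10 samples per second and a command detected at sample 50, “record from three seconds ago” corresponds to reading from sample 20 onward. If the requested start has already been overwritten, the sketch falls back to the oldest sample still held, which is one possible design choice rather than behavior stated in the text.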
[0093]
As a result, when there is a sound among the surrounding sounds that the user 701 wants to record, the user 701 can speak the command at the moment he or she thinks of it, and the sound can still be recorded.
[0094]
If the user simply utters “recording”, the central processing unit 706 starts voice recording from the start of voice detection. When the user utters “recording start”, the central processing unit 706 transfers the voice information recorded in the memory 707, from the point at which the end of the utterance is detected, to the external storage device 708 for recording. This makes it possible to handle both the case where the user 701 wants to record without including the control command word and the case where the user 701 wants to record immediately even if the control command word is included.
[0095]
When “recording standby” is uttered, the central processing unit 706 records in the external storage device 708 the sound recorded in the memory 707 from the next time a voice is detected. Specifically, it focuses on the voice feature quantity after the voice section recognized as “recording standby” and detects whether a new voice has been input. When a new voice input is detected, the voice information in the memory 707 from the time of detection onward is given to the external storage device 708 and recorded. As a result, no silence precedes the voice to be recorded, and the voice can be recorded from the time the user wanted recording to begin.
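The four instructions described in this embodiment differ only in where recording begins relative to the voice section of the command. A minimal sketch of that mapping follows. The data class `CommandSection`, the function `recording_start_time`, and the use of seconds as the time unit are assumptions made for illustration. The sketch also treats the “voice section detection time” as the start of the command's voice section, which is an interpretation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommandSection:
    command: str                # recognized control command word
    start: float                # time (s) at which the command utterance began
    end: float                  # time (s) at which the command utterance ended
    seconds_back: float = 0.0   # used only by "record from N seconds ago"


def recording_start_time(section: CommandSection,
                         next_speech_onset: Optional[float] = None) -> Optional[float]:
    """Return the time at which recording should begin, or None if not yet known."""
    if section.command == "record from N seconds ago":
        return max(0.0, section.start - section.seconds_back)
    if section.command == "recording":
        return section.start    # include the command utterance itself
    if section.command == "recording start":
        return section.end      # exclude the command utterance
    if section.command == "recording standby":
        # Wait for new speech detected after the command's voice section.
        if next_speech_onset is None or next_speech_onset < section.end:
            return None
        return next_speech_onset
    raise ValueError(f"unknown command: {section.command}")
```

For a command utterance spanning 10.0 s to 10.8 s, “recording” resolves to 10.0 s, “recording start” to 10.8 s, and “record from N seconds ago” with N = 3 to 7.0 s. “recording standby” stays unresolved until a later speech onset is supplied.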
[0096]
As described above, in the present embodiment, it is possible to record voice, which is the controlled time-series information, in an arbitrary time period, including a period before the control command word for recording is issued.
[0097]
In the present embodiment, an example of recording sound has been described. However, it is obvious that the end of recording and other functions can likewise be controlled for an arbitrary time period or timing.
[0098]
The present invention is not limited to the embodiments described above, and various modifications are conceivable.
[0099]
[Effects of the invention]
As described above, according to the present invention, when both a control command and its control target are time-series information, the two can be associated with each other in time. This makes the position on the sequence at which processing is applied to the control target controllable, and thereby improves the effectiveness of the control.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a time-series information control system according to a first embodiment of the present invention.
FIG. 2 is a block diagram showing a second embodiment of the present invention.
FIG. 3 is an explanatory diagram showing an overall configuration of a second embodiment.
FIG. 4 is a conceptual diagram showing a correspondence relationship between a speech recognition result and control content.
FIG. 5 is a flowchart for explaining the flow of processing.
FIG. 6 is an explanatory diagram showing a third embodiment of the present invention.
FIG. 7 is an explanatory diagram showing a fourth embodiment of the present invention.
FIG. 8 is an explanatory diagram showing a fifth embodiment of the present invention.
FIG. 9 is an explanatory view showing a sixth embodiment of the present invention.
[Explanation of symbols]
101 ... Control command information, 102 ... Controlled time-series information, 103 ... Time-series information receiving part, 104 ... Storage part, 105 ... Operation part, 106 ... Controlled time-series information processing part.

Claims

Input means for capturing control command information that is time-series information including control commands;
Control instruction extraction means for extracting the control instruction from the control instruction information captured by the input means;
Feature quantity extraction means for extracting time-series feature quantities from the control command information captured by the input means;
Association means for acquiring the temporal relationship between the time of occurrence of the control command included in the control command information captured by the input means and the position on the sequence at which processing according to the control command is applied to the controlled time-series information targeted by the control command, by comparing each time point of the time-series feature quantity of the control command information with each time point of the time-series feature quantity of the controlled time-series information;
Calculating means for controlling processing means capable of processing the controlled time-series information, the calculating means determining, according to the control command extracted from the control command information, the processing content of the processing that the processing means applies to the controlled time-series information, and determining, based on the time-series feature quantity of the control command extracted from the control command information and the temporal relationship acquired by the association means, the position on the sequence at which the processing according to the processing content is applied to the controlled time-series information, while eliminating any difference between the time of occurrence of the control command and the position on the sequence at which the processing is applied. A time-series information control system characterized by comprising the above means.

The time-series information control system according to claim 1, wherein at least one of the control command information and the controlled time-series information is voice information.

An input procedure for capturing control command information, which is time-series information including control commands,
A control command extraction procedure for extracting the control command from the captured control command information;
A feature quantity extraction procedure for extracting a time series feature quantity from the captured control command information;
An association procedure for acquiring the temporal relationship between the time of occurrence of the control command included in the captured control command information and the position on the sequence at which processing according to the control command is applied to the controlled time-series information targeted by the control command, by comparing each time point of the time-series feature quantity of the control command information with each time point of the time-series feature quantity of the controlled time-series information;
A calculation procedure for controlling processing means capable of processing the controlled time-series information, the calculation procedure determining, according to the control command extracted from the control command information, the processing content of the processing that the processing means applies to the controlled time-series information, and determining, based on the time-series feature quantity of the control command extracted from the control command information and the temporal relationship acquired in the association procedure, the position on the sequence at which the processing according to the processing content is applied to the controlled time-series information, while eliminating any difference between the time of occurrence of the control command and the position on the sequence at which the processing is applied. A time-series information control method characterized by comprising the above procedures.

On the computer,
An input processing procedure for capturing control command information that is time-series information including a control command;
A control command extraction processing procedure for extracting the control command from the captured control command information;
A feature amount extraction processing procedure for extracting a time-series feature amount from the captured control command information;
An association processing procedure for acquiring the temporal relationship between the time of occurrence of the control command included in the captured control command information and the position on the sequence at which processing according to the control command is applied to the controlled time-series information targeted by the control command, by comparing each time point of the time-series feature amount of the control command information with each time point of the time-series feature amount of the controlled time-series information;
A calculation processing procedure for controlling processing means capable of processing the controlled time-series information, the calculation processing procedure determining, according to the control command extracted from the control command information, the processing content of the processing that the processing means applies to the controlled time-series information, and determining, based on the time-series feature amount of the control command extracted from the control command information and the temporal relationship acquired in the association processing procedure, the position on the sequence at which the processing according to the processing content is applied to the controlled time-series information, while eliminating any difference between the time of occurrence of the control command and the position on the sequence at which the processing is applied. A time-series information control program for causing the computer to execute the above procedures.