JP2004294659A

JP2004294659A - Speech recognition device

Info

Publication number: JP2004294659A
Application number: JP2003085340A
Authority: JP
Inventors: Toshimitsu Minowa; 利光蓑輪; Kiyomi Sakamoto; 清美阪本; Atsushi Yamashita; 敦士山下; Atsushi Iizaka; 篤飯阪
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2003-03-26
Filing date: 2003-03-26
Publication date: 2004-10-21

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that performs speech recognition through a simple operation that reduces the physical and mental burden on the speaker.
SOLUTION: The speech recognition device comprises a sound signal output means 11 that receives sound including a speaker's speech and outputs a sound signal; a contact means 12 that the speaker touches when speaking; a speech section decision means 13 that decides the speech section in which the speaker is speaking, based on a speech signal included in the sound signal and on the contact state of the contact means 12; a speech recognition means 14 that recognizes the speech in the speech section; and a display means 15 that displays the speech recognition result, thereby enabling speech recognition through easy operation.
COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識装置に関し、さらに詳しくは、話者が発した音声を認識する音声認識装置に関する。
【０００２】
【従来の技術】
従来の音声認識装置は、図５に示すようなものが知られている。図５に示された音声認識装置１は、音声を入力する音声入力部２と、話者が操作するスイッチ３と、スイッチ３の状態を検出するスイッチ状態検出部４と、スイッチ３の状態に応じて音声の入力を制御する入力制御部５と、音声を認識する音声認識部６と、音声の認識結果を表示する表示部７とを備えている。
【０００３】
従来の音声認識装置１は、まず、スイッチ状態検出部４によって、話者がスイッチ３をオフからオンにしたか否かが判断される。スイッチ３がオフからオンにされたと判断されたときは、入力制御部５によって、音声の入力開始が音声入力部２に通知される。次いで、音声入力部２によって、音声が入力される。そして、音声認識部６によって、入力された音声が認識され、表示部７によって、認識結果が表示される。
【０００４】
引き続き、スイッチ状態検出部４によって、話者がスイッチ３をオンからオフにしたか否かが判断される。スイッチ３がオンからオフにされたと判断されたときは、入力制御部５によって、音声の入力終了が所定時間だけ遅れて音声入力部２に通知される。
【０００５】
以上のように、従来の音声認識装置１は、スイッチ状態検出部４によって、スイッチ３の状態を検出し、スイッチ３がオン状態のときに入力された音声を認識できるようになっている（例えば、特許文献１参照）。
【０００６】
【特許文献１】
特開２００２−１０８３９０号公報（第４−５頁、第１図）
【０００７】
【発明が解決しようとする課題】
しかしながら、このような従来の音声認識装置では、話者は発声中に継続してスイッチを押し続けなければならないので、話者の肉体的および精神的な負担が大きい煩雑な操作を必要とするという問題があった。
【０００８】
本発明は、このような問題を解決するためになされたものであり、話者の肉体的および精神的な負担を軽減した簡単な操作で音声認識を行うことができる音声認識装置を提供するものである。
【０００９】
【課題を解決するための手段】
本発明の音声認識装置は、話者の音声を含む音響を入力し音響信号を出力する音響信号出力手段と、前記話者が発声するときに前記話者に接触する接触手段と、前記音響信号に含まれる音声信号および前記接触手段の接触状態に基づいて前記話者が前記音声を発している音声区間を判定する音声区間判定手段と、前記音声区間の前記音声を認識する音声認識手段とを備えたことを特徴とする構成を有している。
【００１０】
この構成により、音声区間判定手段は、音響信号に含まれる音声信号および接触手段の接触状態に基づいて話者が音声を発している音声区間を判定し、音声認識手段は、音声区間判定手段によって判定された音声区間の音声を認識するので、発声中の話者の自然な動作による継続的または断続的な接触手段の接触によって音声区間が判定され、話者の肉体的および精神的な負担を軽減した簡単な操作で音声認識を行うことができる。
【００１１】
また、本発明の音声認識装置は、前記接触手段は、前記話者に接触する接触部と、前記話者が前記接触部を接触しているか否かを検出し、前記話者が前記接触部に接触している接触動作状態を示す接触動作状態信号および前記話者が前記接触部の接触を停止している接触停止状態を示す接触停止状態信号の何れかを出力する接触状態検出部とを備え、前記音声区間判定手段は、前記音声信号のパワーを算出するパワー算出部と、周囲の騒音レベルに応じて前記音声信号のパワー閾値を設定するパワー閾値設定部と、前記接触動作状態信号が出力された時点の近傍において前記音声信号のパワーが前記パワー閾値を越えた時点から所定の時間遡った時点を音声区間開始時点とし、前記接触停止状態信号が出力された時点の近傍において前記音声信号のパワーが前記パワー閾値を下回る時点から所定の時間経過した時点を音声区間終了時点とすることによって前記音声区間を判定する音声区間判定部とを備えたことを特徴とする構成を有している。
【００１２】
この構成により、音声区間判定部は、接触動作状態信号が出力された時点の近傍において音声信号のパワーがパワー閾値を越えた時点から所定の時間遡った時点を音声区間開始時点とし、接触停止状態信号が出力された時点の近傍において音声信号のパワーがパワー閾値を下回る時点から所定の時間経過した時点を音声区間終了時点とすることによって音声区間を判定するので、音声区間開始時点においては、語頭の無声子音および無声化母音等のパワーが低い音声が欠落することを防止することができ、音声区間終了時点においては、語尾の無声子音および無声化母音等のパワーが低い音声が欠落することを防止することができる。
【００１３】
また、本発明の音声認識装置は、前記音声区間判定手段は、前記接触動作状態に続く前記接触停止状態の継続時間が所定の閾値以下のときに入力された前記音声と前記接触停止状態の前に入力された前記音声とを同一の前記音声区間に含むようにしたことを特徴とする構成を有している。
【００１４】
この構成により、各話者の接触間隔に応じた音声入力に対応して音声区間の設定を行うことができる。
【００１５】
また、本発明の音声認識装置は、前記音声区間判定手段は、前記音響信号に前記音声信号が含まれているか否かを判定する音声信号判定部を備え、前記音声区間判定部は、前記音声信号判定部によって前記音響信号に前記音声信号が含まれていると判断されたとき、前記接触状態および前記音声信号のパワーに基づいて前記音声区間を判定するようにしたことを特徴とする構成を有している。
【００１６】
この構成により、音声区間判定部は、音声信号判定部によって音響信号に音声信号が含まれていると判断されたとき、接触状態および音声信号のパワーに基づいて音声区間を判定するので、例えば、周囲の騒音により音響信号のパワーの変動が大きい場合でも、音声区間の判定を確実に行うことができる。
【００１７】
また、本発明の音声認識装置は、前記音声区間判定部は、前記音響信号に前記音声信号が含まれていると判断された時点の近傍において前記音声信号のパワーが前記パワー閾値を越えた時点から所定の時間遡った時点を音声区間開始時点とし、前記接触停止状態信号が出力された時点の近傍において前記音声信号の前記パワーが前記パワー閾値を下回る時点から所定の時間経過した時点を音声区間終了時点として前記音声区間を判定するようにしたことを特徴とする構成を有している。
【００１８】
この構成により、音声区間判定部は、音響信号に音声信号が含まれていると判断された時点の近傍において音声信号のパワーがパワー閾値を越えた時点から所定の時間遡った時点を音声区間開始時点とし、接触停止状態信号が出力された時点の近傍において音声信号のパワーがパワー閾値を下回る時点から所定の時間経過した時点を音声区間終了時点として音声区間を判定するので、周囲の騒音を話者の音声と誤ることなく、音声区間開始時点においては、語頭の無声子音および無声化母音等のパワーが低い音声が欠落することを防止することができ、音声区間終了時点においては、語尾の無声子音および無声化母音等のパワーが低い音声が欠落することを防止することができる。
【００１９】
【発明の実施の形態】
以下、本発明の実施の形態について、図面を参照して説明する。
【００２０】
まず、本発明の実施の形態の音声認識装置の構成について説明する。
【００２１】
図１に示すように、本実施の形態の音声認識装置１０は、話者の音声を含む音響を入力し音響信号を出力する音響信号出力手段１１と、話者が発声するときに接触する接触手段１２と、音響信号に含まれる音声信号および接触手段１２の接触状態に基づいて話者が音声を発している音声区間を判定する音声区間判定手段１３と、音声区間の音声を認識する音声認識手段１４と、音声認識された結果を表示する表示手段１５とを備えている。
【００２２】
音響信号出力手段１１は、音響を集音し音響信号に変換するマイクロホン１１ａとアナログ信号をデジタル信号に変換するＡＤ変換部１１ｂとを備えている。
【００２３】
接触手段１２は、話者が接触する接触部１２ａと、話者が接触部１２ａを接触しているか否かを検出し、話者が接触部１２ａを接触している接触動作状態を示す接触動作状態信号および話者が接触部１２ａの接触を停止している接触停止状態を示す接触停止状態信号の何れかを出力する接触状態検出部１２ｂとを備えている。なお、以下の説明において、接触動作状態および接触停止状態は、それぞれ、接触状態および非接触状態といい、接触動作状態信号および接触停止状態信号は、それぞれ、接触信号および非接触信号という。
【００２４】
接触部１２ａは、例えば、キーボード、スイッチ、圧電素子、または感熱素子等によって構成されている。接触状態検出部１２ｂは、例えば、ＣＰＵ、ＲＡＭ、ＲＯＭ等によって構成されている。
【００２５】
音声区間判定手段１３は、例えば、ＣＰＵ、ＲＡＭ、ＲＯＭ等により構成され、音響信号のパワーを算出するパワー算出部１３ａと、音響信号に音声信号が含まれているか否かを判定する音声信号判定部１３ｂと、周囲の騒音レベルに応じて音声信号のパワー閾値を設定するパワー閾値設定部１３ｃと、所定の閾値および変数等を記憶する記憶部１３ｄと、接触状態検出部１２ｂから接触信号が出力された時点の近傍において音声信号のパワーがパワー閾値を越えた時点から所定の時間遡った時点を音声区間開始時点とし、接触状態検出部１２ｂから非接触信号が出力された時点の近傍において音声信号のパワーがパワー閾値を下回る時点から所定の時間経過した時点を音声区間終了時点として音声区間を判定する音声区間判定部１３ｅとを備えている。
【００２６】
なお、パワー算出部１３ａは、音響信号出力手段１１から出力された音響信号のパワーを、例えば、２０ｍｓｅｃ毎に算出するようになっている。また、音声信号判定部１３ｂは、音響信号出力手段１１から出力された音響信号に関する自己相関関数、ゼロクロス頻度、および低次ケプストラム係数のうち少なくとも一つを算出し、予め大量の音声で学習した判別係数によって、音響信号に音声信号が含まれているか否かの判定（以下、音声性の判定という。）を行い、判定結果に応じて音声を表す信号または非音声を表す信号の何れかを、例えば、２０ｍｓｅｃ毎に音声区間判定部１３ｅに出力するようになっている。
【００２７】
また、パワー閾値設定部１３ｃは、音響信号出力手段１１から出力された音響信号に含まれる周囲の騒音レベルに応じてパワー閾値を動的に設定するようになっている。なお、音響信号出力手段１１から出力された音響信号がパワー閾値を超えた場合、周囲の騒音レベルによるものなのか、あるいは、話者の発声によるものなのかの判定は、接触状態検出部１２ｂの検出結果に基づいて行うことができる。
【００２８】
音声認識手段１４は、例えば、ＣＰＵ、ＲＡＭ、ＲＯＭ等により構成され、音声区間判定手段１３によって判定された音声区間の音声を認識するようになっている。なお、音声認識手段１４は、図示しない認識語彙辞書を備えている。また、表示手段１５は、液晶ディスプレイ、ＣＰＵ、ＲＡＭ、ＲＯＭ等により構成され、例えば、文字によって音声認識結果を表示するようになっている。
【００２９】
次に、本実施の形態の音声認識装置１０の動作について、図２を参照して説明する。
【００３０】
図２において、まず、マイクロホン１１ａによって音響が集音され、ＡＤ変換部１１ｂによってＡＤ変換された音響信号が、パワー算出部１３ａ、音声信号判定部１３ｂ、および音声区間判定部１３ｅに出力される（ステップＳ２１）。次いで、パワー算出部１３ａによって、音響信号のパワーが算出され、音声信号判定部１３ｂによって、音声性が判定される（ステップＳ２２）。
【００３１】
引き続き、接触状態検出部１２ｂによって、話者が接触部１２ａに接触したか否かが検出され、接触した場合はさらに初めて接触したか否かが判断される（ステップＳ２３）。ここで、接触部１２ａを圧電素子で構成した場合は、話者が接触部１２ａに触れたときの圧力が接触部１２ａによって検出され、接触状態検出部１２ｂから接触信号が音声区間判定部１３ｅに出力される。一方、話者が接触部１２ａに触れていないときは、接触状態検出部１２ｂから非接触信号が音声区間判定部１３ｅに出力される。なお、前述の初めて接触とは、ある接触が始まった時刻から時間閾値、例えば、４００ｍｓｅｃ遡った時間内に接触がなかった場合の接触、または、後述のステップＳ２４においてＰ≧Ｐｔｈ１、かつ音声信号と判断されなかった場合以降において、ある接触が始まった時刻から時間閾値遡った時間内に接触がなかった場合の接触をいう。
【００３２】
ステップＳ２３において、初めて接触したと判断された場合は、音声区間判定部１３ｅによって、パワー算出部１３ａで算出されたパワー値Ｐがパワー閾値設定部１３ｃで設定された第１のパワー閾値Ｐｔｈ１以上か否かの判断と、音響信号出力手段１１から出力された音響信号の音声性の判定とが実行される（ステップＳ２４）。
【００３３】
ステップＳ２４において、Ｐ≧Ｐｔｈ１、かつ音声性が音声と判定された場合は、音声区間判定部１３ｅによって、音声区間開始時点が決定され（ステップＳ２５）、一方、Ｐ≧Ｐｔｈ１、かつ音声性が音声と判定されなかった場合は、ステップＳ２１に戻る。
【００３４】
一方、ステップＳ２３において、初めて接触したと判断されなかった場合は、接触状態検出部１２ｂによって、話者が接触部１２ａに接触したか否かが検出される（ステップＳ２６）。ステップＳ２６において、話者が接触部１２ａに接触したと検出された場合は、ステップＳ２１に戻り、話者が接触部１２ａに接触したと検出されなかった場合は、音声区間判定部１３ｅによって、記憶部１３ｄから時間閾値Ｔｔｈが読み出され、非接触状態が時間閾値Ｔｔｈ以上継続しているか否かが判断される（ステップＳ２７）。
【００３５】
ステップＳ２７において、非接触状態が時間閾値Ｔｔｈ以上継続していると判断された場合は、音声区間判定部１３ｅによって、パワー算出部１３ａで算出されたパワー値Ｐがパワー閾値設定部１３ｃで設定された第２のパワー閾値Ｐｔｈ２以下か否かが判断され（ステップＳ２８）、非接触状態が時間閾値Ｔｔｈ以上継続していると判断されなかった場合は、ステップＳ２１に戻る。
【００３６】
ステップＳ２８において、Ｐ≦Ｐｔｈ２と判断された場合は、音声区間判定部１３ｅによって、音声区間終了時点が決定され（ステップＳ２９）、Ｐ≦Ｐｔｈ２と判断されなかった場合は、ステップＳ２１に戻る。
【００３７】
さらに、音声認識手段１４によって、音声区間判定部１３ｅで判定された音声区間開始時点から音声区間終了時点までに含まれる音声が認識される（ステップＳ３０）。そして、表示手段１５によって、音声認識された結果が、例えば、文字により表示される（ステップＳ３１）。
【００３８】
ここで、音声区間を判定する過程について具体例を挙げて説明する。なお、音声区間を判定するための条件として、音響信号のパワー、音声性の判定結果、および接触状態は、それぞれ、図３（ａ）、図３（ｂ）、および図３（ｃ）に示されたものとして説明する。また、前述のパワー閾値Ｐｔｈ１およびパワー閾値Ｐｔｈ２は、それぞれ、６ｄＢおよび１ｄＢとする。また、前述の時間閾値Ｔｔｈは、４００ｍｓｅｃとし、パワー閾値Ｐｔｈ１およびパワー閾値Ｐｔｈ２によって音声区間の開始および終了を判断するときに基準とする時間は、５００ｍｓｅｃとする。
【００３９】
まず、図３（ａ）から図３（ｃ）までに示された内容について説明する。
【００４０】
図３（ａ）において、横軸を時間、縦軸を音響信号のパワーとし、２０ｍｓｅｃ毎に出力された音響信号のパワー値をプロットしたパワー曲線４１と、パワーのミニマムホールド値４２と、第１ポイント４３と、第２ポイント４４とが示されている。
【００４１】
図３（ｂ）は、音声信号判定部１３ｂによる音声性の判定結果を示しており、時刻ｔ３から時刻ｔ４までは音声と判定され、時刻ｔ３以前および時刻ｔ４以降は非音声と判定されていることを示している。
【００４２】
図３（ｃ）は、接触状態検出部１２ｂによって出力された接触を表す接触信号および非接触を表す非接触信号を示している。接触信号は、時刻ｔ５〜ｔ６、時刻ｔ７〜ｔ８、および時刻ｔ９〜ｔ１０の範囲で出力され、その他の範囲においては、非接触信号が出力されている。すなわち、図３（ａ）および図３（ｃ）は、話者が接触部１２ａに３回接触しながら発声したことを示している。
【００４３】
次に、音声区間開始時点の決定について説明する。
【００４４】
図３（ｃ）に示された時刻ｔ５〜ｔ６の範囲における接触によって、前述のステップＳ２３において初めて接触と判断され、ステップＳ２４に進む。図３（ａ）に示すように、音響信号のパワー値Ｐは、時刻ｔ３近傍より上昇し始め、時刻ｔ１の第１ポイント４３において、過去５００ｍｓｅｃ以上にわたりミニマムホールド値４２に対し初めて６ｄＢ（Ｐｔｈ１）以上となる。このとき、図３（ｂ）に示すように、音声性は音声と判定されている。
【００４５】
したがって、ステップＳ２３およびステップＳ２４における判断条件、すなわち、接触状態検出部１２ｂの検出結果は接触、Ｐ≧Ｐｔｈ１、および音声性の判定結果は音声という３つ条件が満たされたことにより、音声区間判定部１３ｅによって、音声区間が開始されたと判断され、図３（ｄ）に示すように、時刻ｔ１から６００ｍｓｅｃ遡った時刻ｔ１１が音声区間開始時点と決定される。
【００４６】
ここで、時刻ｔ１から６００ｍｓｅｃ遡った時点を音声区間開始時点とするのは、前述のパワーと音声性を用いた判定では、例えば、語頭および文頭等においてパワーが低くなって無声化したり、音声学的に無声化する音声を検出できず、頭切れが生じたりするのを防止するためである。
【００４７】
次に、音声区間終了時点の決定について説明する。
【００４８】
図３（ｃ）に示された時刻ｔ９〜ｔ１０の範囲における接触によって、前述のステップＳ２６において接触と判断され、ステップＳ２７に進む。図３（ｃ）に示すように、時刻ｔ１０以降において、時間閾値Ｔｔｈの４００ｍｓｅｃ以上継続して非接触と判断されるので、ステップＳ２７からステップＳ２８に進む。
【００４９】
音響信号のパワー値Ｐは、時刻ｔ９近傍より下降し始め、時刻ｔ２の第２ポイント４４において、過去５００ｍｓｅｃ以上にわたりミニマムホールド値４２に対し初めて１ｄＢ（Ｐｔｈ２）以下となる。
【００５０】
したがって、ステップＳ２７およびステップＳ２８における判断条件、すなわち、接触状態検出部１２ｂの検出結果は非接触およびＰ≦Ｐｔｈ２という２つ条件が満たされたことにより、音声区間判定部１３ｅによって、音声区間が終了したと判断され、図３（ｄ）に示すように、時刻ｔ２から４００ｍｓｅｃ経過した時刻ｔ１２が音声区間終了時点と決定される。
【００５１】
なお、接触状態検出部１２ｂによって非接触が検出された時刻から、時間閾値Ｔｔｈの４００ｍｓｅｃ以内に再び接触と検出された場合は、音声区間判定部１３ｅによって、話者の接触および発声は断続的に継続されていると判断され、非接触になった時刻以前の音声区間と、再び接触となった時刻以後の音声区間とが同一の音声区間とされる。
【００５２】
例えば、図３（ｃ）において、時刻ｔ６〜ｔ７の時間を２５０ｍｓｅｃ、時刻ｔ８〜ｔ９の時間を２００ｍｓｅｃとした場合、どちらの時間も時間閾値Ｔｔｈの４００ｍｓｅｃ以下なので、時刻ｔ５〜ｔ６、時刻ｔ７〜ｔ８、および時刻ｔ９〜ｔ１０に発せられた音声は同一区間の音声とされる。
【００５３】
したがって、話者が発声中に接触部１２ａを撫で続ける場合のみならず、話者が音節および音節の拍に合わせて接触部１２ａを叩く場合等でも、音声区間判定部１３ｅによって、音声区間を正確に判定することができる。
【００５４】
なお、音声区間の開始時点は、音声信号判定部１３ｂの判定結果に基づいて決定するように構成してもよい。すなわち、音声信号判定部１３ｂによって音声性が音声と判定された時点の近傍において音声信号のパワーがパワー閾値Ｐｔｈ１を越えた時点から所定の時間遡った時点を音声区間開始時点とするようにしてもよい。
【００５５】
次に、音声認識手段１４の処理について、図４を参照して説明する。
【００５６】
図４（ａ）および図４（ｂ）は、それぞれ、前述の図３（ａ）および図３（ｄ）と同じグラフを示しており、図４（ｃ）は、予め記憶された語彙の標準パターンと認識された結果とがよく一致した区間を模式的に例示したものである。
【００５７】
図４（ｃ）において、例えば、第１の標準パターン５１から第４の標準パターン５４までをそれぞれ、「東京」、「横浜」、「千葉」、および「仙台」とすると、第１の標準パターン５１の「東京」とよく一致した区間が４個の矢印で示されている。よく一致するか否かは、予め記憶された標準パターンと入力された音声のパターンとの一致度をスコアによって表し、このスコアが所定の閾値を超えるか否かによって判断される。最終的に音声区間終了時点ｔ１２の近傍においてスコアが最大になった標準パターンが認識結果とされる。図４（ｃ）においては、第２の標準パターン５２「横浜」のうち、太い矢印で表した第２の標準パターン５２ａが認識結果とされたことを示している。
【００５８】
すなわち、音声認識手段１４は、入力音声を随時認識するキーワードスポッティング型の動作を行い、音声区間判定手段１３によって判定された音声区間の音声区間開始時点を開始点とし、音声区間終了時点近傍を終了点とする語彙または文などの認識結果を出力するようになっている。
【００５９】
なお、一般に、話者は、自分の発声中の拍（モーラ）に合わせて接触部１２ａを叩くことが多く、撥音、促音、および長音等の特殊拍では叩かないことが多いが、接触状態検出部１２ｂからの出力を参照することによって特殊拍が含まれる音節の認識を容易に行うことができる。
【００６０】
また、話者が前述の時間閾値Ｔｔｈを任意に設定できる構成にすれば、各話者の発話速度に合わせた最適の時間閾値Ｔｔｈの設定を行うことができる。また、話者の発話速度を学習する発話速度学習手段を設け、この発話速度学習手段の学習結果に基づいて時間閾値Ｔｔｈを設定する構成としてもよい。
【００６１】
また、認識すべき語彙が少なく限定されている場合において、語彙のモーラ数および音節数等が異なる場合には、音声認識装置１０の構成を簡略化し、話者が接触部１２ａを叩いた回数のみを参照して音声認識するようにしてもよい。例えば、モーラ数で認識させる場合、７回の叩き入力があれば認識語彙辞書中の７モーラの単語、例えば、「経路探索」を認識結果とする。また、音節数で認識させる場合は、３回の叩き入力があれば認識語彙辞書中の３音節の単語、例えば、「コンピューター」を認識結果とする。さらに、漢字の数によっても認識させることができ、４回の叩き入力がある場合は、認識語彙辞書中の４字漢字の単語、例えば「経路探索」を認識結果とする。
【００６２】
また、音声認識装置１０に句点および読点等を通知するスイッチを設け、このスイッチが叩かれた時点を文節の切れ目または音声入力終了タイミング等とすることによって、より確実に音声の終了を検出することができる。
【００６３】
また、話者が接触部１２ａを叩いたり、撫でたりする際に発生する雑音が音声認識に悪影響を与える危険性を軽減するため、接触部１２ａから発生する雑音を予め学習させる構成とし、音声認識の際に、例えば、スペクトルサブトラクション法で影響を軽減すれば、より音声認識部の認識性能を安定化させることができる。
【００６４】
なお、例えば、ぬいぐるみに本発明の音声認識装置１０を適用する場合は、接触部１２ａの叩きの強度およびパターン等によって話者の喜怒哀楽などの感情を簡易に推定できるので、ぬいぐるみに対する親和性の向上を図ることもできる。例えば、接触部１２ａの叩きの速度が速く、強度が強い場合には怒りを表し、接触部１２ａの叩きの速度が遅い場合には悲しさ、または寂しさを表しているものと推定し、話者とぬいぐるみとのコミュニケーションを円滑化させることができる。
【００６５】
以上のように、本実施の形態の音声認識装置１０によれば、音声区間判定手段１３は、音響信号出力手段１１から出力された音響信号に含まれる音声信号および接触手段１２の接触状態に基づいて話者が音声を発している音声区間を判定する構成としたので、発声中の話者の自然な動作による継続的または断続的な接触手段１２の接触によって音声区間が判定され、話者の肉体的および精神的な負担を軽減した簡単な操作で音声認識を行うことができる。
【００６６】
また、例えば、ぬいぐるみに本発明の音声認識装置１０を適用する場合、ぬいぐるみに対する叩き、さすり、撫でる等の行為は、話者の負担にならず、しかも、これらの行為によって音声認識を行うことができ、話者とぬいぐるみとのコミュニケーションを円滑化させることができるので、単にぬいぐるみに触る場合よりも、ぬいぐるみに対する親近感および親和性等を話者に感じさせることができる。したがって、特に、話者が老人、子供、および孤独な人々等の場合は、話者の心を癒し、また、遊び心を刺激することができる。
【００６７】
【発明の効果】
以上説明したように、本発明によれば、話者の肉体的および精神的な負担を軽減した簡単な操作で音声認識を行うことができる音声認識装置を提供することができる。
【図面の簡単な説明】
【図１】本発明の実施の形態の音声認識装置のブロック図
【図２】本発明の実施の形態の音声認識装置の各ステップのフローチャート
【図３】（ａ）音響信号のパワー値をプロットしたパワー曲線を示す図
（ｂ）音声性の判定結果を示す図
（ｃ）接触信号および非接触信号を示す図
（ｄ）音声区間を示す図
【図４】（ａ）音響信号のパワー値をプロットしたパワー曲線を示す図
（ｂ）音声区間を示す図
（ｃ）語彙の標準パターンと認識された結果とがよく一致した区間を模式的に例示した図
【図５】従来の音声認識装置のブロック図
【符号の説明】
１０音声認識装置
１１音響信号出力手段
１１ａマイクロホン
１１ｂＡＤ変換部
１２接触手段
１２ａ接触部
１２ｂ接触状態検出部
１３音声区間判定手段
１３ａパワー算出部
１３ｂ音声信号判定部
１３ｃパワー閾値設定部
１３ｄ記憶部
１３ｅ音声区間判定部
１４音声認識手段
１５表示手段
４１パワー曲線
４２ミニマムホールド値
４３第１ポイント
４４第２ポイント
５１第１の標準パターン
５２第２の標準パターン
５２ａ認識された第２の標準パターン
５３第３の標準パターン
５４第４の標準パターン

[0001]
[Technical Field of the Invention]
The present invention relates to a voice recognition device, and more particularly, to a voice recognition device that recognizes voice uttered by a speaker.
[0002]
[Prior art]
A conventional voice recognition device such as the one shown in FIG. 5 is known. The voice recognition device 1 shown in FIG. 5 includes a voice input unit 2 for inputting voice, a switch 3 operated by a speaker, a switch state detection unit 4 for detecting the state of the switch 3, an input control unit 5 that controls voice input according to the state of the switch 3, a voice recognition unit 6 that recognizes voice, and a display unit 7 that displays the voice recognition result.
[0003]
In the conventional voice recognition device 1, the switch state detection unit 4 first determines whether the speaker has switched the switch 3 from off to on. When it determines that the switch 3 has been switched from off to on, the input control unit 5 notifies the voice input unit 2 of the start of voice input. Next, voice is input by the voice input unit 2. The input voice is then recognized by the voice recognition unit 6, and the recognition result is displayed on the display unit 7.
[0004]
Subsequently, the switch state detection unit 4 determines whether the speaker has switched the switch 3 from on to off. When it determines that the switch 3 has been switched from on to off, the input control unit 5 notifies the voice input unit 2 of the end of voice input after a delay of a predetermined time.
[0005]
As described above, the conventional voice recognition device 1 detects the state of the switch 3 with the switch state detection unit 4 and can recognize voice that is input while the switch 3 is on (see, for example, Patent Document 1).
[0006]
[Patent Document 1]
JP-A-2002-108390 (pages 4 to 5, FIG. 1)
[0007]
[Problems to be solved by the invention]
However, in such a conventional voice recognition device, the speaker must keep pressing the switch throughout the utterance, so there was a problem in that a cumbersome operation placing a large physical and mental burden on the speaker was required.
[0008]
The present invention has been made to solve this problem, and provides a voice recognition device that can perform voice recognition through a simple operation that reduces the physical and mental burden on the speaker.
[0009]
[Means for Solving the Problems]
The voice recognition device of the present invention comprises: acoustic signal output means that receives sound including a speaker's voice and outputs an acoustic signal; contact means that comes into contact with the speaker when the speaker speaks; voice section determination means that determines the voice section in which the speaker is uttering the voice, based on a voice signal included in the acoustic signal and on the contact state of the contact means; and voice recognition means that recognizes the voice in the voice section.
[0010]
With this configuration, the voice section determination means determines the voice section in which the speaker is uttering voice based on the voice signal included in the acoustic signal and the contact state of the contact means, and the voice recognition means recognizes the voice in the voice section so determined. The voice section is therefore determined from continuous or intermittent contact with the contact means made through the speaker's natural movements while speaking, and voice recognition can be performed through a simple operation that reduces the physical and mental burden on the speaker.
[0011]
Further, in the voice recognition device of the present invention, the contact means comprises a contact portion that comes into contact with the speaker, and a contact state detection unit that detects whether the speaker is in contact with the contact portion and outputs either a contact operation state signal, indicating a contact operation state in which the speaker is in contact with the contact portion, or a contact stop state signal, indicating a contact stop state in which the speaker has stopped contacting the contact portion. The voice section determination means comprises a power calculation unit that calculates the power of the voice signal; a power threshold setting unit that sets a power threshold for the voice signal according to the ambient noise level; and a voice section determination unit that determines the voice section by taking, as the voice section start time, the time a predetermined period before the time at which the power of the voice signal exceeds the power threshold near the time at which the contact operation state signal is output, and, as the voice section end time, the time a predetermined period after the time at which the power of the voice signal falls below the power threshold near the time at which the contact stop state signal is output.
[0012]
With this configuration, the voice section determination unit takes, as the voice section start time, the time a predetermined period before the time at which the power of the voice signal exceeds the power threshold near the time at which the contact operation state signal is output, and, as the voice section end time, the time a predetermined period after the time at which the power of the voice signal falls below the power threshold near the time at which the contact stop state signal is output. At the start of the voice section, this prevents low-power sounds at the beginning of a word, such as unvoiced consonants and devoiced vowels, from being lost; at the end of the voice section, it prevents the same kinds of low-power sounds at the end of a word from being lost.
[0013]
Further, in the voice recognition device of the present invention, the voice section determination means includes, in the same voice section, the voice input while a contact stop state following a contact operation state lasts no longer than a predetermined threshold and the voice input before that contact stop state.
[0014]
With this configuration, voice sections can be set in a way that accommodates voice input at each speaker's own contact interval.
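The merging rule above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function name `merge_contacts`, the millisecond interval representation, and the 400 ms default (which mirrors the example value of Tth given later in the embodiment) are all hypothetical choices, and the sketch merges contact intervals rather than the voice sections themselves.

```python
from typing import List, Tuple

# Hypothetical representation: one contact as (start_ms, end_ms).
Interval = Tuple[int, int]


def merge_contacts(contacts: List[Interval],
                   gap_threshold_ms: int = 400) -> List[Interval]:
    """Merge contacts separated by a non-contact gap of at most gap_threshold_ms.

    Assumption: 400 ms mirrors the example time threshold Tth; the claim
    itself only says "a predetermined threshold".
    """
    merged: List[Interval] = []
    for start, end in sorted(contacts):
        if merged and start - merged[-1][1] <= gap_threshold_ms:
            # Short release: treat it as part of the same utterance.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For three taps separated by gaps of 250 ms and 200 ms, the sketch returns a single interval, whereas a gap longer than the threshold keeps the taps separate.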
[0015]
Further, in the voice recognition device of the present invention, the voice section determination means comprises a voice signal determination unit that determines whether the acoustic signal contains the voice signal, and the voice section determination unit determines the voice section based on the contact state and the power of the voice signal when the voice signal determination unit determines that the acoustic signal contains the voice signal.
[0016]
With this configuration, the voice section determination unit determines the voice section based on the contact state and the power of the voice signal when the voice signal determination unit determines that the acoustic signal contains a voice signal. The voice section can therefore be determined reliably even when, for example, ambient noise causes the power of the acoustic signal to fluctuate widely.
[0017]
Further, in the voice recognition device of the present invention, the voice section determination unit determines the voice section by taking, as the voice section start time, the time a predetermined period before the time at which the power of the voice signal exceeds the power threshold near the time at which the acoustic signal is determined to contain the voice signal, and, as the voice section end time, the time a predetermined period after the time at which the power of the voice signal falls below the power threshold near the time at which the contact stop state signal is output.
[0018]
With this configuration, the voice section determination unit takes, as the voice section start time, the time a predetermined period before the time at which the power of the voice signal exceeds the power threshold near the time at which the acoustic signal is determined to contain a voice signal, and, as the voice section end time, the time a predetermined period after the time at which the power of the voice signal falls below the power threshold near the time at which the contact stop state signal is output. The device therefore does not mistake ambient noise for the speaker's voice. At the start of the voice section, it prevents low-power sounds at the beginning of a word, such as unvoiced consonants and devoiced vowels, from being lost; at the end of the voice section, it prevents the same kinds of low-power sounds at the end of a word from being lost.
[0019]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0020]
First, the configuration of the speech recognition device according to the embodiment of the present invention will be described.
[0021]
As shown in FIG. 1, the voice recognition device 10 of the present embodiment includes acoustic signal output means 11 that receives sound including a speaker's voice and outputs an acoustic signal; contact means 12 that the speaker touches when speaking; voice section determination means 13 that determines the voice section in which the speaker is uttering voice, based on the voice signal included in the acoustic signal and the contact state of the contact means 12; voice recognition means 14 that recognizes the voice in the voice section; and display means 15 that displays the voice recognition result.
[0022]
The acoustic signal output unit 11 includes a microphone 11a that collects sound and converts the sound into a sound signal, and an AD converter 11b that converts an analog signal into a digital signal.
[0023]
The contact means 12 includes a contact portion 12a that the speaker touches, and a contact state detection unit 12b that detects whether the speaker is touching the contact portion 12a and outputs either a contact operation state signal, indicating a contact operation state in which the speaker is touching the contact portion 12a, or a contact stop state signal, indicating a contact stop state in which the speaker has stopped touching the contact portion 12a. In the following description, the contact operation state and the contact stop state are referred to as the contact state and the non-contact state, respectively, and the contact operation state signal and the contact stop state signal are referred to as the contact signal and the non-contact signal, respectively.
[0024]
The contact portion 12a is configured by, for example, a keyboard, a switch, a piezoelectric element, a thermal element, or the like. The contact state detection unit 12b is configured by, for example, a CPU, a RAM, a ROM, and the like.
[0025]
The voice section determination means 13 is configured by, for example, a CPU, a RAM, a ROM, and the like, and includes a power calculation unit 13a that calculates the power of the acoustic signal; a voice signal determination unit 13b that determines whether the acoustic signal contains a voice signal; a power threshold setting unit 13c that sets a power threshold for the voice signal according to the ambient noise level; a storage unit 13d that stores predetermined thresholds, variables, and the like; and a voice section determination unit 13e that determines the voice section by taking, as the voice section start time, the time a predetermined period before the time at which the power of the voice signal exceeds the power threshold near the time at which the contact signal is output from the contact state detection unit 12b, and, as the voice section end time, the time a predetermined period after the time at which the power of the voice signal falls below the power threshold near the time at which the non-contact signal is output from the contact state detection unit 12b.
[0026]
The power calculation unit 13a calculates the power of the acoustic signal output from the acoustic signal output means 11, for example, every 20 msec. The voice signal determination unit 13b calculates at least one of an autocorrelation function, a zero-cross frequency, and low-order cepstrum coefficients of the acoustic signal output from the acoustic signal output means 11, and uses discriminant coefficients learned in advance from a large amount of speech to determine whether the acoustic signal contains a voice signal (hereinafter referred to as voiceness determination). According to the result, it outputs either a signal representing voice or a signal representing non-voice to the voice section determination unit 13e, for example, every 20 msec.
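As a rough illustration of the per-frame quantities mentioned above, the following Python sketch computes frame power in dB and a zero-crossing rate for each 20 msec frame. It rests on assumptions that are not in the patent: the function name `frame_features`, the 16 kHz sample rate, and the small offset added before the logarithm are hypothetical. The learned discriminant coefficients used by the voice signal determination unit 13b are not reproduced here.

```python
import math
from typing import List, Sequence, Tuple


def frame_features(samples: Sequence[float],
                   sample_rate: int = 16000,   # assumption: not given in the patent
                   frame_ms: int = 20) -> List[Tuple[float, float]]:
    """Return (power_db, zero_crossing_rate) for each complete frame."""
    frame_len = sample_rate * frame_ms // 1000
    feats: List[Tuple[float, float]] = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        # Small offset avoids log10(0) on digital silence.
        power_db = 10.0 * math.log10(mean_sq + 1e-12)
        # Count sign changes between neighbouring samples.
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a >= 0) != (b >= 0))
        feats.append((power_db, crossings / (frame_len - 1)))
    return feats
```

A real voiceness decision would feed features like these, possibly together with autocorrelation or cepstral features, into a classifier trained on speech; the sketch stops at feature extraction.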
[0027]
The power threshold setting unit 13c dynamically sets the power threshold according to the ambient noise level contained in the acoustic signal output from the acoustic signal output means 11. When the acoustic signal output from the acoustic signal output means 11 exceeds the power threshold, whether this is due to the ambient noise level or to the speaker's utterance can be determined based on the detection result of the contact state detection unit 12b.
[0028]
The voice recognizing means 14 includes, for example, a CPU, a RAM, a ROM, and the like, and recognizes the voice of the voice section determined by the voice section determining means 13. The voice recognition unit 14 includes a recognition vocabulary dictionary (not shown). The display means 15 includes a liquid crystal display, a CPU, a RAM, a ROM, and the like, and is configured to display a speech recognition result in characters, for example.
[0029]
Next, the operation of the speech recognition apparatus 10 according to the present embodiment will be described with reference to FIG.
[0030]
In FIG. 2, first, sound is collected by the microphone 11a, and the acoustic signal that has been AD-converted by the AD converter 11b is output to the power calculation unit 13a, the voice signal determination unit 13b, and the voice section determination unit 13e (step S21). Next, the power of the acoustic signal is calculated by the power calculation unit 13a, and the voiceness is determined by the voice signal determination unit 13b (step S22).
[0031]
Subsequently, the contact state detection unit 12b detects whether the speaker has touched the contact portion 12a and, if so, further determines whether this is a first contact (step S23). Here, when the contact portion 12a is configured by a piezoelectric element, the pressure applied when the speaker touches the contact portion 12a is detected by the contact portion 12a, and a contact signal is output from the contact state detection unit 12b to the voice section determination unit 13e. On the other hand, when the speaker is not touching the contact portion 12a, a non-contact signal is output from the contact state detection unit 12b to the voice section determination unit 13e. A first contact here means a contact for which no contact occurred within a time threshold, for example 400 msec, before the time at which the contact began; or, after a case in which step S24 (described later) did not find both P ≧ Pth1 and a voice signal, a contact for which no contact occurred within the time threshold before the time at which the contact began.
[0032]
If it is determined in step S23 that this is a first contact, the voice section determination unit 13e determines whether the power value P calculated by the power calculation unit 13a is equal to or greater than the first power threshold Pth1 set by the power threshold setting unit 13c, and determines the voiceness of the acoustic signal output from the acoustic signal output means 11 (step S24).
[0033]
In step S24, if P ≧ Pth1 and the voiceness is determined to be voice, the voice section determination unit 13e determines the voice section start time (step S25); if it is not the case that P ≧ Pth1 and the voiceness is determined to be voice, the process returns to step S21.
[0034]
On the other hand, if it is not determined in step S23 that this is a first contact, the contact state detection unit 12b detects whether the speaker has touched the contact portion 12a (step S26). In step S26, if it is detected that the speaker has touched the contact portion 12a, the process returns to step S21; if it is not, the voice section determination unit 13e reads the time threshold Tth from the storage unit 13d and determines whether the non-contact state has continued for the time threshold Tth or longer (step S27).
[0035]
If it is determined in step S27 that the non-contact state has continued for the time threshold Tth or longer, the voice section determination unit 13e determines whether the power value P calculated by the power calculation unit 13a is equal to or less than the second power threshold Pth2 set by the power threshold setting unit 13c (step S28); if it is not so determined, the process returns to step S21.
[0036]
If it is determined in step S28 that P ≦ Pth2, the voice section determination unit 13e determines the end point of the voice section (step S29). If it is not determined that P ≦ Pth2, the process returns to step S21.
[0037]
Further, the voice recognition means 14 recognizes voice included from the voice section start time to the voice section end time determined by the voice section determination unit 13e (step S30). Then, the result of the voice recognition is displayed by, for example, characters by the display means 15 (step S31).
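The decision flow of steps S23 to S29 can be sketched as a small loop over frames. This is a simplified illustration under assumptions, not the flowchart of FIG. 2 itself: the `Frame` class and `detect_section` function are hypothetical names, power is assumed to be already expressed in dB relative to the noise floor, the default thresholds mirror the example values used in the embodiment (6 dB, 1 dB, 400 msec), the "first contact" test is reduced to "no section is open yet", and the look-back and hang-over offsets applied to the start and end times are omitted.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Frame:
    t_ms: int            # frame time stamp
    rel_power_db: float  # power relative to the noise floor (assumed precomputed)
    is_voice: bool       # voiceness determination result
    touching: bool       # contact signal (True) or non-contact signal (False)


def detect_section(frames: Iterable[Frame],
                   pth1_db: float = 6.0,
                   pth2_db: float = 1.0,
                   tth_ms: int = 400) -> Optional[Tuple[int, int]]:
    """Return (start_ms, end_ms) of the first voice section, or None."""
    start_ms: Optional[int] = None
    last_touch_ms = 0
    for f in frames:
        if f.touching:
            last_touch_ms = f.t_ms
        if start_ms is None:
            # S23 to S25: contact, enough power, and voice all at once.
            if f.touching and f.rel_power_db >= pth1_db and f.is_voice:
                start_ms = f.t_ms
            continue
        # S26 to S29: released for at least Tth and power back near the floor.
        released_ms = f.t_ms - last_touch_ms
        if (not f.touching and released_ms >= tth_ms
                and f.rel_power_db <= pth2_db):
            return start_ms, f.t_ms
    return None
```

Keeping the last contact time rather than a boolean flag lets brief releases pass without ending the section, which is the behaviour the time threshold Tth is meant to provide.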
[0038]
Here, the process of determining a voice section will be described with a specific example. As the conditions for determining the voice section, the power of the acoustic signal, the voiceness determination result, and the contact state are assumed to be as shown in FIGS. 3A, 3B, and 3C, respectively. The power threshold Pth1 and the power threshold Pth2 are 6 dB and 1 dB, respectively. The time threshold Tth is 400 msec, and the reference time used when judging the start and end of the voice section with the power thresholds Pth1 and Pth2 is 500 msec.
[0039]
First, the contents shown in FIGS. 3A to 3C will be described.
[0040]
FIG. 3A shows, with time on the horizontal axis and the power of the acoustic signal on the vertical axis, a power curve 41 plotting the power values of the acoustic signal output every 20 msec, a minimum hold value 42 of the power, a first point 43, and a second point 44.
[0041]
FIG. 3B shows the result of the voiceness determination by the voice signal determination unit 13b: the signal is determined to be voice from time t3 to time t4 and non-voice before time t3 and after time t4.
[0042]
FIG. 3C illustrates a contact signal indicating contact and a non-contact signal indicating non-contact output by the contact state detection unit 12b. The contact signal is output in the range from time t5 to t6, time t7 to t8, and time t9 to t10, and in other ranges, a non-contact signal is output. That is, FIGS. 3A and 3C show that the speaker uttered while touching the contact portion 12a three times.
[0043]
Next, determination of the voice section start time will be described.
[0044]
Because of the contact in the range of times t5 to t6 shown in FIG. 3C, it is determined in step S23 that contact has been made, and the process proceeds to step S24. As shown in FIG. 3A, the power value P of the acoustic signal starts to increase near time t3 and, at the first point 43 at time t1, rises for the first time to 6 dB (Pth1) or more above the minimum hold value 42 over the past 500 msec or more. At this time, as shown in FIG. 3B, the voice-likeness determination result is voice.
[0045]
Therefore, the judgment conditions in step S23 and step S24 are satisfied, that is, the three conditions that the detection result of the contact state detection unit 12b is contact, that P ≧ Pth1, and that the voice-likeness determination result is voice. The voice section determination unit 13e accordingly determines that the voice section has started and, as shown in FIG. 3D, determines time t11, which is 600 msec earlier than time t1, as the voice section start time.
[0046]
The point 600 msec before time t1 is used as the voice section start time for the following reason. In the above-described determination using the power and the voice-likeness, sounds whose power is low, for example devoiced sounds at the beginning of a word or sentence, may not be detected. Moving the start time back prevents the beginning of the utterance from being cut off.
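By way of illustration only, the start determination described above can be sketched in Python as follows. The function name, the one-value-per-20-msec-frame layout, and the requirement that contact, voice-likeness, and the power rise all hold in the same frame are assumptions made for this sketch; the description itself only states the three conditions and the 600 msec look-back. The minimum hold value is approximated here by a rolling minimum over the past 500 msec.

```python
from typing import Optional, Sequence

FRAME_MS = 20       # power value output interval (from the description)
HOLD_MS = 500       # reference time for the minimum hold value
PTH1_DB = 6.0       # power threshold Pth1
LOOKBACK_MS = 600   # the start time is set this far before t1


def find_section_start(power_db: Sequence[float],
                       is_voice: Sequence[bool],
                       in_contact: Sequence[bool]) -> Optional[int]:
    """Return the voice section start time in msec, or None if no start is found.

    Each sequence holds one value per 20 msec frame (assumed layout).
    """
    hold_frames = HOLD_MS // FRAME_MS
    for i in range(len(power_db)):
        # Rolling minimum over the past 500 msec stands in for the minimum hold value.
        min_hold = min(power_db[max(0, i - hold_frames):i + 1])
        rise = power_db[i] - min_hold
        # Three conditions: contact, P >= Pth1, and the signal judged to be voice.
        if in_contact[i] and is_voice[i] and rise >= PTH1_DB:
            t1 = i * FRAME_MS
            return max(0, t1 - LOOKBACK_MS)  # corresponds to t11 = t1 - 600 msec
    return None
```

With a 10 dB rise first seen at frame 35 (t1 = 700 msec) while contact and voice are both present, the sketch returns 100 msec as the start time.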
[0047]
Next, determination of the end point of the voice section will be described.
[0048]
Because of the contact in the range from time t9 to t10 shown in FIG. 3C, it is determined in step S26 that contact has been made, and the process proceeds to step S27. As shown in FIG. 3C, after time t10 the non-contact state continues for the time threshold Tth of 400 msec or more, so the process proceeds from step S27 to step S28.
[0049]
The power value P of the acoustic signal starts to fall near time t9 and, at the second point 44 at time t2, falls for the first time to 1 dB (Pth2) or less above the minimum hold value 42 over the past 500 msec or more.
[0050]
Therefore, the judgment conditions in step S27 and step S28 are satisfied, that is, the two conditions that the detection result of the contact state detection unit 12b is non-contact and that P ≦ Pth2. The voice section determination unit 13e accordingly determines that the voice section has ended and, as shown in FIG. 3D, determines time t12, at which 400 msec has elapsed from time t2, as the voice section end time.
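The end determination can be sketched in the same hedged way. In this Python sketch, `rise_db` is assumed to already hold the power relative to the minimum hold value for each 20 msec frame, and the search for time t2 is assumed to begin at the frame where contact is released. Both are simplifications introduced here, not details given in the description.

```python
from typing import Optional, Sequence

FRAME_MS = 20        # assumed frame interval, matching the 20 msec power output
TTH_MS = 400         # time threshold Tth
PTH2_DB = 1.0        # power threshold Pth2
END_DELAY_MS = 400   # the end time is set this long after t2


def find_section_end(rise_db: Sequence[float],
                     in_contact: Sequence[bool]) -> Optional[int]:
    """Return the voice section end time in msec, or None if no end is found."""
    tth_frames = TTH_MS // FRAME_MS
    n = len(in_contact)
    for r in range(1, n):
        released = in_contact[r - 1] and not in_contact[r]
        if not released:
            continue
        after_release = in_contact[r:r + tth_frames]
        if len(after_release) < tth_frames or any(after_release):
            continue  # contact resumed within Tth, or not enough data yet
        # Non-contact lasted Tth or more: look for t2, where P <= Pth2.
        for i in range(r, n):
            if rise_db[i] <= PTH2_DB:
                return i * FRAME_MS + END_DELAY_MS  # corresponds to t12 = t2 + 400 msec
        return None
    return None
```

A release followed by renewed contact within 400 msec is skipped, so only the final release leads to an end time.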
[0051]
When contact is detected again within the time threshold Tth of 400 msec from the time when non-contact was detected by the contact state detection unit 12b, the voice section determination unit 13e determines that the speaker is making contact and uttering intermittently and that the voice section is continuing. The voice section before the non-contact and the voice section after the renewed contact are then treated as the same voice section.
[0052]
For example, in FIG. 3C, if the time from time t6 to time t7 is 250 msec and the time from time t8 to time t9 is 200 msec, both are equal to or less than the time threshold Tth of 400 msec. The sounds uttered at times t5 to t6, times t7 to t8, and times t9 to t10 are therefore sounds in the same voice section.
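As a minimal sketch of this grouping, assuming the contact ranges are available as (start, end) pairs in msec, contacts separated by a gap of Tth or less can be merged as follows. The function name and the data layout are hypothetical.

```python
from typing import List, Tuple


def merge_contacts(intervals: List[Tuple[int, int]],
                   tth_ms: int = 400) -> List[Tuple[int, int]]:
    """Merge contact ranges whose non-contact gap is tth_ms or less."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= tth_ms:
            # The gap is within Tth: treat it as the same voice section.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With gaps of 250 msec and 200 msec, as in the example above, three contact ranges collapse into a single range, whereas a gap longer than 400 msec starts a new one.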
[0053]
Therefore, the voice section determination unit 13e can accurately determine the voice section not only when the speaker keeps stroking the contact portion 12a during utterance, but also when the speaker taps the contact portion 12a in time with the syllables or beats of the utterance.
[0054]
Note that the start point of the voice section may be determined based on the determination result of the voice signal determination unit 13b. That is, the voice section start time may be set to a point a predetermined time earlier than the time at which the power of the voice signal exceeds the power threshold Pth1 near the time at which the voice signal determination unit 13b determines that the signal is voice.
[0055]
Next, the processing of the voice recognition means 14 will be described with reference to FIG.
[0056]
FIGS. 4 (a) and 4 (b) show the same graphs as FIGS. 3 (a) and 3 (d), respectively, and FIG. 4 (c) schematically illustrates the sections in which pre-stored standard patterns of the vocabulary match the recognized result well.
[0057]
In FIG. 4C, for example, if the first standard pattern 51 to the fourth standard pattern 54 are “Tokyo”, “Yokohama”, “Chiba”, and “Sendai”, respectively, the sections that match the first standard pattern 51 “Tokyo” well are indicated by four arrows. Whether a section matches well is expressed by a score representing the degree of coincidence between the pre-stored standard pattern and the input voice pattern, and is determined by whether this score exceeds a predetermined threshold. Finally, the standard pattern having the maximum score near the voice section end time t12 is determined as the recognition result. FIG. 4C shows that, among the matches for the second standard pattern 52 “Yokohama”, the second standard pattern 52a represented by a thick arrow is the recognition result.
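The final selection step can be illustrated with the following Python sketch. The `Match` record, the `tolerance_ms` window that defines "near" the end time, and the `min_score` threshold value are all hypothetical; the description does not give concrete values for them.

```python
from typing import List, NamedTuple, Optional


class Match(NamedTuple):
    word: str      # vocabulary entry of the standard pattern
    end_ms: int    # time at which the matched section ends
    score: float   # degree of coincidence with the input voice pattern


def pick_result(matches: List[Match], section_end_ms: int,
                tolerance_ms: int = 200, min_score: float = 0.5) -> Optional[str]:
    """Return the word with the maximum score ending near the voice section end time."""
    candidates = [m for m in matches
                  if m.score > min_score
                  and abs(m.end_ms - section_end_ms) <= tolerance_ms]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.score).word
```

A high-scoring match that ends far from the voice section end time is ignored, which reflects the role the end time t12 plays in the selection.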
[0058]
In other words, the voice recognition means 14 performs a keyword-spotting type operation: it recognizes the input voice continuously from the voice section start time determined by the voice section determination means 13, and outputs the recognition result of the word or sentence whose end point lies near the voice section end time.
[0059]
In general, a speaker often taps the contact portion 12a in time with the beats (morae) of his or her utterance, and often does not tap on special morae such as the moraic nasal, the geminate consonant, and the long vowel. By referring to the output from the contact state detection unit 12b, syllables that include special morae can therefore be recognized easily.
[0060]
Further, if the configuration is such that the speaker can arbitrarily set the above-mentioned time threshold Tth, it is possible to set the optimum time threshold Tth according to the utterance speed of each speaker. Further, an utterance speed learning means for learning the utterance speed of the speaker may be provided, and the time threshold value Tth may be set based on the learning result of the utterance speed learning means.
[0061]
In addition, when the words to be recognized are few and limited and differ from one another in their number of morae or syllables, the configuration of the voice recognition device 10 may be simplified so that voice recognition refers only to the number of times the speaker hits the contact portion 12a. The counts below refer to the original Japanese words. For example, when recognition is based on the number of morae and there are seven hit inputs, a seven-mora word in the recognition vocabulary dictionary, for example “route search”, is taken as the recognition result. When recognition is based on the number of syllables and there are three hit inputs, a three-syllable word in the recognition vocabulary dictionary, for example “computer”, is taken as the recognition result. Recognition can also be based on the number of kanji: if there are four hit inputs, a word written with four kanji in the recognition vocabulary dictionary, for example “route search”, is taken as the recognition result.
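A minimal sketch of this simplified, tap-count-only recognition is given below. The dictionary contents are illustrative: the seven-mora entry follows the example above, while the other entries and their mora counts (for the Japanese place names) are assumptions added for the sketch.

```python
from typing import Dict, Optional


def recognize_by_taps(tap_count: int, unit_counts: Dict[str, int]) -> Optional[str]:
    """Return the only word whose mora (or syllable, or kanji) count equals tap_count."""
    hits = [word for word, count in unit_counts.items() if count == tap_count]
    # The scheme relies on every word having a different count.
    return hits[0] if len(hits) == 1 else None


# Hypothetical recognition vocabulary keyed by mora count.
MORA_COUNTS = {"route search": 7, "Tokyo": 4, "Chiba": 2}
```

Here `recognize_by_taps(7, MORA_COUNTS)` selects "route search", and a count that matches no word, or more than one word, yields no result.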
[0062]
Further, by providing the voice recognition device 10 with a switch for signaling a period or a comma, and treating the point at which this switch is hit as a phrase break, as the speech input end timing, or the like, the end of the speech can be detected more reliably.
[0063]
Further, in order to reduce the risk that the noise generated when the speaker hits or strokes the contact portion 12a adversely affects the speech recognition, the noise generated from the contact portion 12a may be learned in advance and its influence reduced at the time of voice recognition by, for example, the spectral subtraction method. This further stabilizes the recognition performance of the voice recognition means 14.
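For illustration, a basic form of spectral subtraction on magnitude spectra can be sketched as follows. The function names, the list-of-bins layout, and the zero floor are assumptions of this sketch; the description names only the method and does not specify how the noise is learned or applied.

```python
from typing import List, Sequence


def learn_noise(noise_frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the magnitude spectra of frames that contain only contact noise."""
    n = len(noise_frames)
    return [sum(frame[k] for frame in noise_frames) / n
            for k in range(len(noise_frames[0]))]


def subtract_noise(frame_mag: Sequence[float],
                   noise_mag: Sequence[float],
                   floor: float = 0.0) -> List[float]:
    """Subtract the learned noise spectrum bin by bin, clamping at the floor."""
    return [max(m - n, floor) for m, n in zip(frame_mag, noise_mag)]
```

The learned spectrum would be estimated once from recordings of tapping or stroking alone, and then subtracted from each frame before recognition.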
[0064]
For example, when the voice recognition device 10 of the present invention is applied to a stuffed toy, the speaker's emotional state can easily be estimated from the strength and pattern of the tapping on the contact portion 12a. For example, fast and strong tapping of the contact portion 12a can be presumed to indicate anger, and slow tapping to indicate sadness or loneliness. Communication with the stuffed toy can thereby be facilitated.
[0065]
As described above, according to the voice recognition device 10 of the present embodiment, the voice section determination means 13 determines the voice section based on the voice signal included in the sound signal output from the sound signal output means 11 and the contact state of the contact means 12. The voice section is therefore determined from the continuous or intermittent contact with the contact means 12 that arises from the natural movement of the speaker while speaking, and speech recognition can be performed with a simple operation that reduces the physical and mental burden on the speaker.
[0066]
In addition, when the voice recognition device 10 of the present invention is applied to, for example, a stuffed toy, actions such as hitting, rubbing, and stroking the stuffed toy place no burden on the speaker, and voice recognition can be performed through these actions. Because this facilitates communication between the speaker and the stuffed toy, the speaker can feel closer to the stuffed toy and more attached to it than when simply touching it. Therefore, especially when the speaker is an elderly person, a child, a lonely person, or the like, the speaker can be comforted and his or her playfulness can be stimulated.
[0067]
[Effect of the Invention]
As described above, according to the present invention, it is possible to provide a voice recognition device capable of performing voice recognition by a simple operation with a reduced physical and mental burden on a speaker.
[Brief description of the drawings]
FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of each step of the voice recognition device according to the embodiment of the present invention.
FIG. 3A is a diagram showing a power curve in which power values of acoustic signals are plotted.
(B) A diagram showing the voice-likeness determination result.
(C) A diagram showing a contact signal and a non-contact signal.
(D) A diagram showing the voice section.
FIG. 4A is a diagram showing a power curve in which power values of acoustic signals are plotted.
(B) A diagram showing the voice section.
(C) A diagram schematically illustrating the sections in which the standard patterns of the vocabulary match the recognized result well.
FIG. 5 is a block diagram of a conventional voice recognition device.
[Explanation of symbols]
10 Voice recognition device
11 Sound signal output means
11a Microphone
11b AD converter
12 Contact means
12a Contact portion
12b Contact state detection unit
13 Voice section determination means
13a Power calculation unit
13b Voice signal determination unit
13c Power threshold setting unit
13d Storage unit
13e Voice section determination unit
14 Voice recognition means
15 Display means
41 Power curve
42 Minimum hold value
43 First point
44 Second point
51 First standard pattern
52 Second standard pattern
52a Recognized second standard pattern
53 Third standard pattern
54 Fourth standard pattern

Claims

A voice recognition device comprising: sound signal output means for inputting sound including a speaker's voice and outputting a sound signal; contact means that the speaker contacts when uttering; voice section determination means for determining a voice section in which the speaker is uttering the voice, based on a voice signal included in the sound signal and a contact state of the contact means; and voice recognition means for recognizing the voice in the voice section.

The voice recognition device according to claim 1, wherein the contact means includes a contact portion that the speaker contacts, and a contact state detection unit that detects whether or not the speaker is in contact with the contact portion and outputs either a contact operation state signal indicating a contact operation state in which the speaker is in contact with the contact portion or a contact stop state signal indicating a contact stop state in which the speaker has stopped contacting the contact portion; and the voice section determination means includes a power calculation unit that calculates the power of the voice signal, a power threshold setting unit that sets a power threshold of the voice signal in accordance with an ambient noise level, and a voice section determination unit that determines the voice section by taking, as a voice section start time, a point a predetermined time before the time at which the power of the voice signal exceeds the power threshold near the time at which the contact operation state signal is output, and taking, as a voice section end time, a point at which a predetermined time has elapsed from the time at which the power of the voice signal falls below the power threshold near the time at which the contact stop state signal is output.

The voice recognition device according to claim 2, wherein, when the duration of the contact stop state following the contact operation state is equal to or less than a predetermined threshold, the voice section determination means determines that the voice input after the contact stop state is included in the same voice section as the voice input before the contact stop state.

The voice recognition device according to claim 2 or 3, wherein the voice section determination means includes a voice signal determination unit that determines whether the sound signal includes the voice signal, and the voice section determination unit determines the voice section based on the contact state and the power of the voice signal when the voice signal determination unit determines that the sound signal includes the voice signal.

The voice recognition device according to claim 4, wherein the voice section determination unit determines the voice section by taking, as the voice section start time, a point a predetermined time before the time at which the power of the voice signal exceeds the power threshold near the time at which it is determined that the voice signal is included in the sound signal, and taking, as the voice section end time, a point at which a predetermined time has elapsed from the time at which the power of the voice signal falls below the power threshold near the time at which the contact stop state signal is output.