JP2015161718A

JP2015161718A - speech detection device, speech detection method and speech detection program

Info

Publication number: JP2015161718A
Application number: JP2014035316A
Authority: JP
Inventors: 弘行川原; Hiroyuki Kawahara; 雅紀鈴木; Masaki Suzuki
Original assignee: FERIX Inc
Current assignee: FERIX Inc
Priority date: 2014-02-26
Filing date: 2014-02-26
Publication date: 2015-09-07

Abstract

PROBLEM TO BE SOLVED: To provide a speech detection device, a speech detection method and a speech detection program capable of easily detecting a voice question to a system without receiving any special operation. SOLUTION: A speech detection device 1 includes: a voice acquisition part 11 for continuously acquiring voice data; a time measuring part 12 for measuring a silence time t1 in which a level of an acquired voice is lower than an utterance determination threshold value a1, a speech time t2 in which the level of the voice exceeds the utterance determination threshold value a1 after the lapse of the silence time t1, and a non-response time t3 in which the level of the voice is lower than the utterance determination threshold value a1 after the lapse of the speech time t2; and a determination part 13 by which, in the case where the silence time t1, the speech time t2 and the non-response time t3 meet a predetermined condition, the voice during the speech time t2 is determined to be a speech of a question to the device.

Description

The present invention relates to a device, a method, and a program for detecting an utterance from voice data.

Conventionally, some electronic devices such as car navigation devices and mobile terminals are equipped with a function that executes predetermined processing according to the result of voice recognition. In these devices, the voice recognition function is generally activated in response to a predetermined user operation such as pressing a button.

However, when the voice recognition function is activated by a user operation, convenience suffers, for example when both of the user's hands are occupied or when the user instructs a device that is difficult to operate directly. For this reason, methods have been proposed such as determining a question to the system using features of the voice data such as prosody (see Non-Patent Document 1) and determining the conversation state from the length of silent sections (see Patent Document 1).

Patent Document 1: JP 2005-196025 A

Non-Patent Document 1: Tomoyuki Yamagata, Atsushi Sako, Tetsuya Takiguchi, Yasuo Ariki, "System Request Detection Using Prosody and Speaker Change Information", 9th Spoken Language Symposium, SIG-SLP69, pp. 289-294, 2007-12

However, high-load processing that uses features of the voice data demands CPU performance and power, which can be a drawback for electronic devices such as mobile terminals. In addition, when the determination is based on the length of silent sections, it is difficult to single out only the questions addressed to the system.

An object of the present invention is to provide an utterance detection device, an utterance detection method, and an utterance detection program that can easily detect a voice question addressed to a system without receiving a special operation.

The utterance detection device according to the present invention includes: a voice acquisition unit that continuously acquires voice data; a time measuring unit that measures a first time during which the level of the acquired voice is below an utterance determination threshold, a second time during which the level of the voice exceeds the utterance determination threshold after the first time has elapsed, and a third time during which the level of the voice is below the utterance determination threshold after the second time has elapsed; and a determination unit that determines, when the first time, the second time, and the third time satisfy a predetermined condition, that the voice in the second time is an utterance of a question addressed to the device itself.

The predetermined condition may include that the first time is equal to or greater than a first time threshold, that the second time is within a range from a second time threshold to a third time threshold, and that the third time is equal to or greater than a fourth time threshold.

The utterance detection device may include an identification unit that uses a predetermined classifier to identify whether the pattern of the voice data in the second time is that of a human voice, and the determination unit may determine that the voice in the second time is an utterance of a question addressed to the device itself when the predetermined condition is satisfied and the identification unit identifies the pattern of the voice data in the second time as that of a human voice.

The time measuring unit may adjust the utterance determination threshold based on a statistical value of the voice level in the first time and the third time.

The utterance detection device may include an output unit that outputs a predetermined calculation result by voice when the voice in the second time is determined to be an utterance of a question addressed to the device itself.

The utterance detection device may include a voice analysis unit that executes predetermined voice analysis processing on subsequently acquired voice data when the voice in the second time is determined to be an utterance of a question addressed to the device itself.

The utterance detection device may include a storage unit that stores the voice data in the second time, and a voice analysis unit that executes predetermined voice analysis processing on the voice data stored in the storage unit when the voice in the second time is determined to be an utterance of a question addressed to the device itself.

In the utterance detection method according to the present invention, a computer executes: a voice acquisition step of continuously acquiring voice data; a time measuring step of measuring a first time during which the level of the acquired voice is below an utterance determination threshold, a second time during which the level of the voice exceeds the utterance determination threshold after the first time has elapsed, and a third time during which the level of the voice is below the utterance determination threshold after the second time has elapsed; and a determination step of determining, when the first time, the second time, and the third time satisfy a predetermined condition, that the voice in the second time is an utterance of a question addressed to the device itself.

The utterance detection program according to the present invention causes a computer to execute: a voice acquisition step of continuously acquiring voice data; a time measuring step of measuring a first time during which the level of the acquired voice is below an utterance determination threshold, a second time during which the level of the voice exceeds the utterance determination threshold after the first time has elapsed, and a third time during which the level of the voice is below the utterance determination threshold after the second time has elapsed; and a determination step of determining, when the first time, the second time, and the third time satisfy a predetermined condition, that the voice in the second time is an utterance of a question addressed to the device itself.

According to the present invention, a voice question addressed to the system can be easily detected without receiving a special operation.

FIG. 1 is a block diagram showing the functional configuration of the utterance detection device according to the first embodiment.
FIG. 2 is a diagram showing an example of voice data according to the first embodiment.
FIG. 3 is a flowchart showing the flow of the utterance detection process according to the first embodiment.
FIG. 4 is a block diagram showing the functional configuration of the utterance detection device according to the second embodiment.
FIG. 5 is a block diagram showing the functional configuration of the utterance detection device according to the third embodiment.
FIG. 6 is a block diagram showing the functional configuration of the utterance detection device according to the fourth embodiment.

[First Embodiment]
The first embodiment of the present invention is described below.
The utterance detection device 1 of this embodiment detects that acquired voice data is a question addressed to the device itself.

FIG. 1 is a block diagram showing the functional configuration of the utterance detection device 1.
The utterance detection device 1 includes a voice acquisition unit 11, a time measuring unit 12, a determination unit 13, and an output unit 14.
The voice acquisition unit 11 continuously acquires voice data.

FIG. 2 is a diagram showing an example of the voice data acquired by the voice acquisition unit 11.
The voice acquisition unit 11 continuously acquires, in time series, at least the amplitude that indicates the level of the voice.

The time measuring unit 12 measures a silence time t1 (first time) during which the acquired amplitude is below the utterance determination threshold a1, an utterance time t2 (second time) during which the amplitude exceeds a1 after t1 has elapsed, and a no-response time t3 (third time) during which the amplitude is below a1 after t2 has elapsed.

The silence time t1 is an index for determining that no conversation between people is in progress. That is, the time measuring unit 12 measures the silence time t1 on the premise that a certain minimum period of silence precedes a question addressed to the system (the utterance detection device 1).

The utterance time t2 is an index for determining that the voice is a question. That is, the time measuring unit 12 measures the utterance time t2 on the premise that a question addressed to the system is a shorter utterance than a question addressed to a person.

The no-response time t3 is an index for determining that the voice is a question addressed to the system and that no person responds to it. That is, the time measuring unit 12 measures the no-response time t3 on the premise that a question addressed to the system draws no response from a person for at least a certain period.

The time measuring unit 12 also adjusts the utterance determination threshold a1 based on a statistical value of the voice level during the silence time t1 and the no-response time t3, that is, of the ambient noise level when no one is speaking. Specifically, the time measuring unit 12 calculates, for example, the standard deviation of the voice data in the silence time t1 and the no-response time t3 to obtain the ambient sound level and the inter-individual error, and adjusts the utterance determination threshold a1 based on the result. The time measuring unit 12 thus raises the utterance determination threshold a1 as the ambient noise grows and lowers it as the ambient noise shrinks, so that the utterance time t2 can be measured correctly.
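The threshold adjustment described above can be sketched as follows. This is a minimal illustration, not the implementation of the time measuring unit 12: the source only states that a standard deviation of the quiet-period data is used, so the multiplier `k`, the lower bound `floor`, and the function name `adjust_threshold` are assumptions introduced here.

```python
import statistics


def adjust_threshold(quiet_amplitudes, k=3.0, floor=0.01):
    """Return an utterance determination threshold a1 scaled to ambient noise.

    quiet_amplitudes: amplitude samples collected during t1 and t3.
    k, floor: hypothetical tuning parameters (not given in the source).
    """
    if len(quiet_amplitudes) < 2:
        # Too little data to estimate the noise level; fall back to the floor.
        return floor
    # Standard deviation of the quiet-period samples as the noise estimate.
    noise_sigma = statistics.pstdev(quiet_amplitudes)
    # Louder surroundings give a larger sigma and therefore a larger a1.
    return max(floor, k * noise_sigma)


quiet_room = [0.0, 0.01, -0.01, 0.0]
noisy_room = [0.0, 0.2, -0.2, 0.1]
print(adjust_threshold(quiet_room) < adjust_threshold(noisy_room))
```

With these sample values the noisy room yields the larger threshold, which matches the behavior described for a1.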

When the silence time t1, the utterance time t2, and the no-response time t3 satisfy a predetermined condition, the determination unit 13 determines that the voice in the utterance time t2 is an utterance of a question addressed to the utterance detection device 1.
The predetermined condition includes that the silence time t1 is equal to or greater than a first time threshold (for example, 30 seconds), that the utterance time t2 is within the range from a second time threshold (for example, 0.5 seconds) to a third time threshold (for example, 2 seconds), and that the no-response time t3 is equal to or greater than a fourth time threshold (for example, 4 seconds).
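The predetermined condition can be written as a single predicate. The sketch below uses the example values from the text (30 s, 0.5 s, 2 s, 4 s) as defaults; the names `TimingThresholds` and `is_question_to_device` are hypothetical. The bounds are treated as inclusive here, following the "equal to or greater than" wording, whereas step S11 of the flowchart states the same test with strict inequalities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingThresholds:
    """Example threshold values from the description, in seconds."""
    min_silence: float = 30.0      # first time threshold (t1)
    min_speech: float = 0.5        # second time threshold (t2 lower bound)
    max_speech: float = 2.0        # third time threshold (t2 upper bound)
    min_no_response: float = 4.0   # fourth time threshold (t3)


def is_question_to_device(t1, t2, t3, th=TimingThresholds()):
    """True when (t1, t2, t3) satisfy the predetermined condition."""
    return (t1 >= th.min_silence
            and th.min_speech <= t2 <= th.max_speech
            and t3 >= th.min_no_response)


print(is_question_to_device(45.0, 1.2, 4.0))   # long silence, short utterance, no reply
print(is_question_to_device(10.0, 1.2, 5.0))   # silence before the utterance too short
```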

When the voice in the utterance time t2 is determined to be an utterance of a question addressed to the utterance detection device 1, the output unit 14 outputs the result of predetermined calculation processing as a response. The output unit 14 produces a predetermined response to the user's question, for example by obtaining the current time and reading it aloud.

FIG. 3 is a flowchart showing the flow of the utterance detection process performed by the utterance detection device 1.
This process is executed in parallel with the acquisition of voice data by the voice acquisition unit 11.

In step S1, the time measuring unit 12 starts measuring the silence time t1.
In step S2, the determination unit 13 determines whether the amplitude of the voice data exceeds the utterance determination threshold a1. If YES, an utterance has started, and the process proceeds to step S3. If NO, the silence is continuing, and the determination in step S2 is repeated.
In step S3, the time measuring unit 12 fixes the silence time t1.

In step S4, the time measuring unit 12 starts measuring the utterance time t2.
In step S5, the determination unit 13 determines whether the amplitude of the voice data exceeds the utterance determination threshold a1. If YES, the utterance is continuing, and the determination in step S5 is repeated. If NO, the utterance has ended, and the process proceeds to step S6.
In step S6, the time measuring unit 12 fixes the utterance time t2.

In step S7, the time measuring unit 12 starts measuring the no-response time t3.
In step S8, the determination unit 13 determines whether the amplitude of the voice data exceeds the utterance determination threshold a1. If YES, the process proceeds to step S10. If NO, the silence is continuing, and the process proceeds to step S9.

In step S9, the determination unit 13 determines whether the no-response time t3 exceeds the fourth time threshold (4 seconds). If YES, the process proceeds to step S10. If NO, the no-response period is continuing, and the process returns to step S8.
In step S10, the time measuring unit 12 fixes the no-response time t3.

In step S11, the determination unit 13 determines whether the fixed silence time t1, utterance time t2, and no-response time t3 are all within the predetermined time threshold ranges (t1 > 30 seconds, 2 seconds > t2 > 0.5 seconds, t3 > 4 seconds). If YES, the utterance detected in the utterance time t2 is judged to be a question addressed to the system, and the process proceeds to step S12. If NO, the utterance is judged not to be a question addressed to the system, and the process returns to step S1.
In step S12, the output unit 14 produces a predetermined response to the detected question, such as outputting the current time.
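Steps S1 to S11 amount to a three-state machine over the amplitude stream. The following sketch assumes the voice data arrives as one amplitude value per fixed-length frame; the function name `detect_questions`, the frame representation, and the return to step S1 after a detection are assumptions of this example, not details taken from the flowchart. The strict inequalities follow step S11.

```python
def detect_questions(amplitudes, frame_sec, a1,
                     min_silence=30.0, min_speech=0.5,
                     max_speech=2.0, min_no_response=4.0):
    """Return (start, end) frame indices of utterances judged to be questions.

    amplitudes: one amplitude value per frame (assumed input format).
    frame_sec:  duration of one frame in seconds.
    a1:         utterance determination threshold.
    """
    SILENCE, SPEECH, NO_RESPONSE = range(3)
    state = SILENCE
    n1 = n2 = n3 = 0          # frame counts behind t1, t2, t3
    start = end = 0
    hits = []
    for i, amp in enumerate(amplitudes):
        loud = abs(amp) > a1
        if state == SILENCE:              # S1-S3: measure t1
            if loud:
                state, start, n2 = SPEECH, i, 1
            else:
                n1 += 1
        elif state == SPEECH:             # S4-S6: measure t2
            if loud:
                n2 += 1
            else:
                state, end, n3 = NO_RESPONSE, i, 1
        else:                             # S7-S10: measure t3
            if not loud:
                n3 += 1
            t1, t2, t3 = n1 * frame_sec, n2 * frame_sec, n3 * frame_sec
            if loud or t3 > min_no_response:
                # S11: all three durations must fall in their ranges.
                if (t1 > min_silence and min_speech < t2 < max_speech
                        and t3 > min_no_response):
                    hits.append((start, end))
                # Back to S1; a loud frame immediately starts a new utterance.
                n1 = 0
                if loud:
                    state, start, n2 = SPEECH, i, 1
                else:
                    state = SILENCE
    return hits


# 31 s of silence, a 1 s utterance, then 4.5 s without a response (0.5 s frames).
stream = [0.0] * 62 + [0.5] * 2 + [0.0] * 9
print(detect_questions(stream, frame_sec=0.5, a1=0.1))
```

Counting frames as integers and converting to seconds only at comparison time avoids accumulating floating-point error in the running totals.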

According to this embodiment, the utterance detection device 1 determines that an utterance is a question addressed to the device itself when the silence time t1, the utterance time t2, and the no-response time t3 satisfy the predetermined condition. The utterance detection device 1 can therefore easily detect a voice question addressed to it without receiving a special operation such as a button press.

The utterance detection device 1 uses the length of the silence time t1 to determine that no conversation is in progress, the length of the utterance time t2 to determine that the utterance is a question, and the length of the no-response time t3 to determine that no person responds. The utterance detection device 1 can therefore easily detect a question addressed to it without performing high-load voice analysis.

In addition, because the utterance detection device 1 can adjust the utterance determination threshold a1 according to the ambient noise level, it can measure the utterance time t2 appropriately for the situation and detect questions more reliably.

Furthermore, because the utterance detection device 1 produces a predetermined response when it detects a question, it can execute processing triggered by the user's question. The user can therefore obtain a response, such as a reading of the current time, simply by asking aloud, without performing any special operation.

[Second Embodiment]
The second embodiment of the present invention is described below. Components that are the same as in the first embodiment are given the same reference signs, and their description is omitted or simplified.
The utterance detection device 1a of this embodiment confirms that the voice in the utterance time t2 is a human voice.

FIG. 4 is a block diagram showing the functional configuration of the utterance detection device 1a.
The utterance detection device 1a includes a voice acquisition unit 11, a time measuring unit 12, a determination unit 13, an output unit 14, and an identification unit 15.

The identification unit 15 uses a predetermined classifier, for example one based on machine learning, to identify whether the pattern of the voice data in the utterance time t2 is that of a human voice.
For example, the identification unit 15 inputs the frequency pattern obtained by a Fourier transform of the voice data to the classifier and distinguishes at least a human voice from other sounds (for example, footsteps, a knock on a door, a flushing toilet, a barking dog, or a mobile phone ringtone). Detailed analysis, such as determining which syllable of the Japanese syllabary was spoken (for example, distinguishing "a" from "ka"), is not required here.
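The frequency pattern fed to the classifier can be illustrated with a plain discrete Fourier transform. The sketch below covers only that feature-extraction step; the classifier itself is not specified in the text and is not shown. The function names, the use of a naive O(n²) DFT in place of an FFT, and the `dominant_frequency` helper (included only to inspect the result) are assumptions of this example.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Magnitudes of DFT bins 0..n/2 for a block of real-valued samples."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(samples)))
            for k in range(n // 2 + 1)]


def dominant_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest non-DC bin (inspection helper)."""
    spectrum = magnitude_spectrum(samples)
    k = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    return k * sample_rate / len(samples)


# A 200 Hz tone sampled at 1600 Hz for 64 samples lands exactly on bin 8.
tone = [math.sin(2 * math.pi * 200 * t / 1600) for t in range(64)]
print(dominant_frequency(tone, 1600))
```

In a device, the list returned by `magnitude_spectrum` (or a coarser summary of it) would be the input vector of the classifier that separates human voices from other sounds.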

Accordingly, the determination unit 13 determines that the voice in the utterance time t2 is an utterance of a question addressed to the utterance detection device 1a when the conditions on t1 to t3 of the first embodiment are satisfied and the identification unit 15 identifies the pattern of the voice data in the utterance time t2 as that of a human voice.

According to this embodiment, the utterance detection device 1a identifies the pattern of the voice data of the utterance and determines that it is a human voice. Because the classifier only distinguishes a human voice from other sounds, the utterance detection device 1a can improve the accuracy of question detection without performing high-load voice analysis.

[Third Embodiment]
The third embodiment of the present invention is described below. Components that are the same as in the first or second embodiment are given the same reference signs, and their description is omitted or simplified.
The utterance detection device 1b of this embodiment analyzes the content of an utterance and executes processing according to the analysis result.

FIG. 5 is a block diagram showing the functional configuration of the utterance detection device 1b.
The utterance detection device 1b includes a voice acquisition unit 11, a time measuring unit 12, a determination unit 13, an output unit 14, an identification unit 15, and a voice analysis unit 16.

When the voice in the utterance time t2 is determined to be an utterance of a question addressed to the utterance detection device 1b, the voice analysis unit 16 executes predetermined voice analysis processing on the voice data that is input next.
Specifically, for example, after the output unit 14 outputs a response such as "Did you call?", the voice analysis unit 16 analyzes in detail the voice data newly acquired by the voice acquisition unit 11 and determines the content of the utterance.

According to this embodiment, when the utterance detection device 1b detects a question addressed to it, it executes analysis processing on the voice uttered next. The user can therefore activate the detailed voice recognition function by a voice question addressed to the utterance detection device 1b. As a result, the utterance detection device 1b does not need to keep a high-load voice analysis function running at all times, which reduces the load and saves power, particularly in a portable terminal.

[Fourth Embodiment]
The fourth embodiment of the present invention is described below. Components that are the same as in the first to third embodiments are given the same reference signs, and their description is omitted or simplified.
The utterance detection device 1c of this embodiment analyzes the content of an utterance and executes processing according to the analysis result.

FIG. 6 is a block diagram showing the functional configuration of the utterance detection device 1c.
The utterance detection device 1c includes a voice acquisition unit 11, a time measuring unit 12, a determination unit 13, an output unit 14, an identification unit 15, a voice analysis unit 16, and a storage unit 17.

The storage unit 17 temporarily stores the voice data in the utterance time t2.
When the voice in the utterance time t2 is determined to be an utterance of a question addressed to the utterance detection device 1c, the voice analysis unit 16 executes predetermined voice analysis processing on the voice data stored in the storage unit 17.

According to this embodiment, the utterance detection device 1c stores the voice data of the utterance time t2 and, when it determines that the utterance is a question addressed to it, executes analysis processing on the stored voice data. The user therefore no longer needs to speak again for voice analysis after asking the question, which improves convenience.
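The store-then-analyze behavior of this embodiment can be sketched as a small buffer that holds the t2 segment until the determination is known. The class name `SpeechStore`, its methods, and the `analyzer` callback are hypothetical stand-ins for the storage unit 17 and the voice analysis unit 16; the text does not specify their interfaces.

```python
from typing import Callable, List, Optional, Sequence


class SpeechStore:
    """Temporarily holds the voice data of the most recent utterance time t2."""

    def __init__(self) -> None:
        self._segment: List[float] = []

    def begin(self) -> None:
        """Start a new t2 segment, discarding any earlier one."""
        self._segment = []

    def append(self, amplitude: float) -> None:
        """Record one sample while the utterance is in progress."""
        self._segment.append(amplitude)

    def analyze_if_question(
        self,
        is_question: bool,
        analyzer: Callable[[Sequence[float]], str],
    ) -> Optional[str]:
        """Run the (expensive) analyzer only when the timing test passed."""
        segment, self._segment = self._segment, []
        if not is_question:
            return None
        return analyzer(segment)


store = SpeechStore()
store.begin()
for sample in (0.3, 0.5, 0.4):
    store.append(sample)
# The lambda stands in for a real speech recognizer.
print(store.analyze_if_question(True, lambda seg: f"{len(seg)} samples analyzed"))
```

Because the analyzer runs only after the t1, t2, and t3 test has passed, the costly analysis stays idle for ordinary conversation while the user's original words are still available when it is needed.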

Embodiments of the present invention have been described above, but the present invention is not limited to these embodiments. The effects described in the embodiments are merely the most suitable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

The flow of the utterance detection process is one example; any method may be used as long as it detects an utterance that satisfies the above conditions on the silence time t1, the utterance time t2, and the no-response time t3.
The utterance detection device 1 may end the utterance detection process when the silence time t1 continues for longer than a predetermined period. When a human voice is identified in the utterance time t2, the time until the utterance detection process ends may be extended.

The utterance detection method of the utterance detection device 1 is realized by software. When it is realized by software, the program constituting the software is installed in an information processing device (the utterance detection device 1). The program may be recorded on a removable medium such as a CD-ROM and distributed to users, or may be distributed by being downloaded to a user's computer via a network.
The present invention is applicable to various electronic devices such as navigation devices and clock devices, and the utterance detection program may also be distributed embedded in a hardware chip that can be installed in various electronic devices.

Description of Symbols
1, 1a, 1b, 1c: Utterance detection device
11: Voice acquisition unit
12: Time measuring unit
13: Determination unit
14: Output unit
15: Identification unit
16: Voice analysis unit
17: Storage unit
a1: Utterance determination threshold
t1: Silence time
t2: Utterance time
t3: No-response time

Claims

1. An utterance detection device comprising:
a voice acquisition unit that continuously acquires voice data;
a time measuring unit that measures a first time during which the level of the acquired voice is below an utterance determination threshold, a second time during which the level of the voice exceeds the utterance determination threshold after the first time has elapsed, and a third time during which the level of the voice is below the utterance determination threshold after the second time has elapsed; and
a determination unit that determines, when the first time, the second time, and the third time satisfy a predetermined condition, that the voice in the second time is an utterance of a question addressed to the device itself.

2. The utterance detection device according to claim 1, wherein the predetermined condition includes that the first time is equal to or greater than a first time threshold, that the second time is within a range from a second time threshold to a third time threshold, and that the third time is equal to or greater than a fourth time threshold.

3. The utterance detection device according to claim 1, further comprising an identification unit that uses a predetermined classifier to identify whether the pattern of the voice data in the second time is that of a human voice,
wherein the determination unit determines that the voice in the second time is an utterance of a question addressed to the device itself when the predetermined condition is satisfied and the identification unit identifies the pattern of the voice data in the second time as that of a human voice.

4. The utterance detection device according to claim 1, wherein the time measuring unit adjusts the utterance determination threshold based on a statistical value of the voice level in the first time and the third time.

5. The utterance detection device according to any one of claims 1 to 4, further comprising an output unit that outputs a predetermined calculation result by voice when the voice in the second time is determined to be an utterance of a question addressed to the device itself.

6. The utterance detection device according to claim 5, further comprising a voice analysis unit that executes predetermined voice analysis processing on subsequently acquired voice data when the voice in the second time is determined to be an utterance of a question addressed to the device itself.

7. The utterance detection device according to any one of claims 1 to 5, further comprising:
a storage unit that stores the voice data in the second time; and
a voice analysis unit that executes predetermined voice analysis processing on the voice data stored in the storage unit when the voice in the second time is determined to be an utterance of a question addressed to the device itself.

8. An utterance detection method in which a computer executes:
a voice acquisition step of continuously acquiring voice data;
a time measuring step of measuring a first time during which the level of the acquired voice is below an utterance determination threshold, a second time during which the level of the voice exceeds the utterance determination threshold after the first time has elapsed, and a third time during which the level of the voice is below the utterance determination threshold after the second time has elapsed; and
a determination step of determining, when the first time, the second time, and the third time satisfy a predetermined condition, that the voice in the second time is an utterance of a question addressed to the device itself.

9. An utterance detection program for causing a computer to execute:
a voice acquisition step of continuously acquiring voice data;
a time measuring step of measuring a first time during which the level of the acquired voice is below an utterance determination threshold, a second time during which the level of the voice exceeds the utterance determination threshold after the first time has elapsed, and a third time during which the level of the voice is below the utterance determination threshold after the second time has elapsed; and
a determination step of determining, when the first time, the second time, and the third time satisfy a predetermined condition, that the voice in the second time is an utterance of a question addressed to the device itself.