JP2015004928A

JP2015004928A - Response target voice determination device, response target voice determination method, and response target voice determination program

Info

Publication number: JP2015004928A
Application number: JP2013131650A
Authority: JP
Inventors: 隆行荒川; Takayuki Arakawa
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2013-06-24
Filing date: 2013-06-24
Publication date: 2015-01-08
Anticipated expiration: 2033-06-24
Also published as: JP6171617B2

Abstract

PROBLEM TO BE SOLVED: To provide a response target voice determination device, a response target voice determination method, and a response target voice determination program that can determine, with high accuracy and without losing usability, a voice uttered by a user to a robot or a device with the intention of providing input. SOLUTION: A response target utterance determination unit 11 is provided which: detects that a silent interval duration, which is the length of the silent interval that follows an utterance by a user, has become longer than a prescribed silent interval duration threshold, and then determines that the utterance was made by the user to the system; and detects that the silent interval duration is shorter than the silent interval duration threshold, and then determines that the utterance was made by the user to a target other than the system.

Description

The present invention relates to a response target speech determination device, a response target speech determination method, and a response target speech determination program that, in a spoken dialogue system, detect utterances made to the system and determine whether or not each utterance is to be treated as a response target.

In a system that realizes conversation between a human and a robot using speech recognition technology, or a system that recognizes voice commands uttered by a user and provides information according to their content, malfunction caused by sounds that are not response targets is a problem. Hereinafter, these systems are simply referred to as the system. In addition, speech that the user utters to a robot or device with the intention of providing input is referred to as response target speech.

Patent Document 1 discloses a speech validity determination device (apparatus and method for determining relevance of input speech) that performs face recognition, determines the direction in which the user's face is turned, and treats as response targets only utterances made while the face is turned toward the system.

Non-Patent Document 1 describes a technique (voice spotter) that allows voice commands to be input to a system during a conversation between humans. With the voice spotter, the user can explicitly notify the system of the speech to be processed by producing a special utterance that differs from normal speech, such as hesitating with a voiced (filled) pause and then deliberately speaking in a high-pitched voice.

US Patent Application Publication No. 2012/0259638

Masataka Goto, Koji Kitayama, Katsunobu Ito, Tetsunori Kobayashi, "Voice Spotter: A voice input interface that can use voice recognition during human conversation", Journal of Information Processing Society of Japan, Mar. 2007, Vol. 48, No. 3, pp. 1274

The technique described in Patent Document 1 has the problem that, when humans converse with each other while looking at the system, an accurate determination cannot always be made from face direction alone. In particular, when the system displays some information on a display or the like and users are expected to discuss that content, speech validity determination based on face direction is unreliable. The technique described in Non-Patent Document 1 has the problem that the user must produce a special utterance to notify the system of the speech to be processed, which impairs usability.

Accordingly, an object of the present invention is to provide a response target speech determination device, a response target speech determination method, and a response target speech determination program that can determine, accurately and without impairing usability, speech that a user utters to a robot or device with the intention of providing input.

A response target speech determination device according to the present invention includes a response target utterance determination unit that detects that a silence segment duration, which is the length of the silence segment following an utterance by a user, has become longer than a predetermined silence segment duration threshold and determines that the utterance was made by the user to the system, and that detects that the silence segment duration is shorter than the silence segment duration threshold and determines that the utterance was made by the user to something other than the system.

A response target speech determination method according to the present invention includes detecting that a silence segment duration, which is the length of the silence segment following an utterance by a user, has become longer than a predetermined silence segment duration threshold and determining that the utterance was made by the user to the system, and detecting that the silence segment duration is shorter than the silence segment duration threshold and determining that the utterance was made by the user to something other than the system.

A response target speech determination program according to the present invention causes a computer to execute processing of detecting that a silence segment duration, which is the length of the silence segment following an utterance by a user, has become longer than a predetermined silence segment duration threshold and determining that the utterance was made by the user to the system, and of detecting that the silence segment duration is shorter than the silence segment duration threshold and determining that the utterance was made by the user to something other than the system.

According to the present invention, speech that a user utters to a robot or device with the intention of providing input can be determined accurately and without impairing usability.

It is a block diagram showing the configuration of the first embodiment of the response target speech determination device according to the present invention.
It is a flowchart showing the operation of the first embodiment of the response target speech determination device.
It is an explanatory diagram showing the frequency distribution of silence segment durations in conversation between humans.
It is a block diagram showing the configuration of the second embodiment of the response target speech determination device according to the present invention.
It is a flowchart showing the operation of the second embodiment of the response target speech determination device.
It is an explanatory diagram showing the frequency distributions of voice feature amounts extracted from utterances made to the system and of voice feature amounts extracted from utterances made outside the system.
It is a block diagram showing the configuration of the third embodiment of the response target speech determination device according to the present invention.
It is a block diagram showing the configuration of the fourth embodiment of the response target speech determination device according to the present invention.
It is a block diagram showing the configuration of the fifth embodiment of the response target speech determination device according to the present invention.
It is a block diagram showing the configuration of the sixth embodiment of the response target speech determination device according to the present invention.
It is a block diagram showing the minimum configuration of the response target speech determination device according to the present invention.
It is a block diagram showing another minimum configuration of the response target speech determination device according to the present invention.

Embodiment 1.
A first embodiment of the present invention will be described below with reference to the drawings.

図１は、本発明による応答対象音声判定装置の第１の実施形態の構成を示すブロック図である。 FIG. 1 is a block diagram showing a configuration of a first embodiment of a response target speech determination device according to the present invention.

As shown in FIG. 1, the response target speech determination device includes an input sound signal acquisition unit 101, an input sound signal cutout unit 102, a voice segment determination threshold storage unit 103, a voice segment determination unit 104, a voice segment sound signal storage unit 105, a silence segment duration threshold storage unit 106, and a response target utterance determination unit 107.

入力音信号取得部１０１は、入力音信号の時系列（time series of input sound signal）を取得する。入力音信号取得部１０１は、音声入力装置、例えばマイクロホンを用いて入力音信号の時系列を取得する。 The input sound signal acquisition unit 101 acquires a time series of input sound signals. The input sound signal acquisition unit 101 acquires a time series of input sound signals using a sound input device, for example, a microphone.

入力音信号切り出し部１０２は、入力音信号を入力とし、フレームごとに切り出した音信号を出力する。 The input sound signal cutout unit 102 receives an input sound signal and outputs a sound signal cut out for each frame.

音声区間判定閾値格納部１０３は、予め定められた音声区間判定に係わる閾値（pre-determined threshold for voice activity detection）を格納する。 The voice segment determination threshold storage unit 103 stores a predetermined threshold for voice segment determination (pre-determined threshold for voice activity detection).

The voice segment determination unit 104 receives the sound signal cut out for each frame and the threshold for voice segment determination, and determines whether the frame belongs to a voice segment (active voice segment) or to a silence segment in which no voice is present (silence segment).

音声区間音信号保存部１０５は、音声区間判定部１０４で音声区間と判定された音信号を保存する。 The voice segment sound signal storage unit 105 stores the sound signal determined as the voice segment by the voice segment determination unit 104.

沈黙区間継続長閾値格納部１０６は、予め定められた沈黙区間継続長閾値（pre-determined threshold of duration of silence segment）を格納する。 The silence interval duration threshold storage unit 106 stores a predetermined silence interval duration threshold (pre-determined threshold of duration of silence segment).

The response target utterance determination unit 107 compares the duration of the segment determined to be a silence segment by the voice segment determination unit 104 with the silence segment duration threshold, and determines whether to respond, treating the voice segment preceding the silence segment as the response target, or to hold the response.

なお、入力音信号取得部１０１、入力音信号切り出し部１０２、音声区間判定部１０４および応答対象発声判定部１０７は、例えば、応答対象音声判定プログラムに従って動作するコンピュータによって実現される。この場合、ＣＰＵが応答対象音声判定プログラムを読み込み、そのプログラムに従って、入力音信号取得部１０１、入力音信号切り出し部１０２、音声区間判定部１０４および応答対象発声判定部１０７として動作する。また、入力音信号取得部１０１、入力音信号切り出し部１０２、音声区間判定部１０４および応答対象発声判定部１０７が別々のハードウェアで実現されていてもよい。 The input sound signal acquisition unit 101, the input sound signal cutout unit 102, the speech segment determination unit 104, and the response target utterance determination unit 107 are realized by, for example, a computer that operates according to a response target speech determination program. In this case, the CPU reads the response target voice determination program, and operates as the input sound signal acquisition unit 101, the input sound signal cutout unit 102, the voice segment determination unit 104, and the response target utterance determination unit 107 according to the program. Further, the input sound signal acquisition unit 101, the input sound signal cutout unit 102, the voice segment determination unit 104, and the response target utterance determination unit 107 may be realized by separate hardware.

The voice segment determination threshold storage unit 103, the voice segment sound signal storage unit 105, and the silence segment duration threshold storage unit 106 are specifically realized by a storage device included in the response target speech determination device, such as an optical disk device, a magnetic disk device, or a memory.

次に、本実施形態の動作を説明する。 Next, the operation of this embodiment will be described.

図２は、応答対象音声判定装置の第１の実施形態の動作を示すフローチャートである。図３は、人間同士の会話における沈黙区間継続長の頻度分布を示す説明図である。 FIG. 2 is a flowchart showing the operation of the first embodiment of the response target speech determination device. FIG. 3 is an explanatory diagram showing the frequency distribution of the duration of silence intervals in a conversation between humans.

図２に示すように、まず、入力音信号切り出し部１０２は、入力音信号取得部１０１が取得した入力音の時系列（time series of input sound signal）を入力する。そして、入力音信号切り出し部１０２は、入力音の時系列から単位時間のフレーム分の波形データを切り出す（ステップＳ１０１）。 As shown in FIG. 2, first, the input sound signal cutout unit 102 inputs a time series of input sound acquired by the input sound signal acquisition unit 101. Then, the input sound signal cutout unit 102 cuts out waveform data for a unit time frame from the time series of the input sound (step S101).

For example, the input sound signal cutout unit 102 acquires analog data captured by a microphone or the like as Linear-PCM digital data with a sampling frequency of 8000 Hz and 16 quantization bits, and cuts out 256 points of waveform data every 10 milliseconds. Note that the input sound signal cutout unit 102 may cut out waveform data using a different sampling frequency, number of quantization bits, cutout interval, number of cutout points, and so on.
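The frame cutout in step S101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the device's actual implementation: the function name `cut_frames` and its parameter names are hypothetical, the numeric defaults are the example values from the text, and dropping a trailing partial frame is an assumption the text does not address.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 8000   # sampling frequency from the example in the text
FRAME_LENGTH = 256      # samples cut out per frame
FRAME_SHIFT_MS = 10     # a new frame is cut out every 10 ms


def cut_frames(samples: Sequence[int],
               frame_length: int = FRAME_LENGTH,
               shift_ms: int = FRAME_SHIFT_MS,
               sample_rate_hz: int = SAMPLE_RATE_HZ) -> List[List[int]]:
    """Split linear PCM samples into overlapping fixed-length frames."""
    shift = sample_rate_hz * shift_ms // 1000  # 80 samples at 8000 Hz
    frames = []
    start = 0
    # Assumption: a trailing partial frame is dropped, not zero-padded.
    while start + frame_length <= len(samples):
        frames.append(list(samples[start:start + frame_length]))
        start += shift
    return frames
```

With these values, consecutive frames overlap by 176 samples, because 256 samples are taken while the window advances by only 80.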

応答対象音声判定装置は、以下に示すステップＳ１０２〜Ｓ１０４の処理をフレーム単位で行う。 The response target speech determination device performs the processes of steps S102 to S104 shown below for each frame.

In step S102, the voice segment determination unit 104 performs voice segment determination on the input sound cut out for each frame in step S101. One conceivable determination method is to compute a quantity such as amplitude power and compare it with the threshold stored in the voice segment determination threshold storage unit 103. The amplitude power Pt is calculated by the following equation (1).
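Equation (1) itself is not reproduced in this text. A common definition of frame amplitude power that is consistent with the symbols N and xt used here is given below as an assumption; the normalization by N and the index range of the sum are not confirmed by the source.

```latex
P_t = \frac{1}{N} \sum_{i=t}^{t+N-1} x_i^{2}
```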

N is the number of sample points included in one frame, and xt is the value of the input sound data (waveform data) at time t. The voice segment determination unit 104 determines that a frame is in the voice state if its amplitude power is greater than the threshold, and in the silence state if its amplitude power is smaller than the threshold. Although amplitude power is used here, other features such as the zero-crossing count, the likelihood ratio between a speech model and a non-speech model, the pitch frequency, or the signal-to-noise ratio may be used. A run of consecutive unit-time frames determined to be in the voice state is defined as a voice segment, and a run of consecutive unit-time frames determined to be in the silence state is defined as a silence segment.
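The per-frame decision and the grouping of frames into segments can be sketched as below. The function names are hypothetical, the mean-square form of the amplitude power is an assumption, and treating a power exactly equal to the threshold as silence is also an assumption, since the text only covers the greater and smaller cases.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def amplitude_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (assumed form of the power)."""
    return sum(x * x for x in frame) / len(frame)


def label_frames(frames: Sequence[Sequence[float]],
                 threshold: float) -> List[bool]:
    """True = voice state, False = silence state.

    Assumption: power equal to the threshold counts as silence.
    """
    return [amplitude_power(frame) > threshold for frame in frames]


def group_segments(labels: Sequence[bool]) -> List[Tuple[bool, int, int]]:
    """Merge runs of equal labels into (is_voice, first_frame, last_frame)."""
    segments = []
    index = 0
    for is_voice, run in groupby(labels):
        length = len(list(run))
        segments.append((is_voice, index, index + length - 1))
        index += length
    return segments
```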

The start of a voice segment is the point at which a run of the silence state is interrupted and changes to the voice state. This point is also the end of a silence segment. The end of a voice segment is the point at which a run of the voice state is interrupted and changes to the silence state. This point is also the start of a silence segment. Voice segments and silence segments are thus fixed at the points where a run of the same state is interrupted.

Here, hangover processing may be applied so that short voice segments do not occur: after a change from the silence state to the voice state, the change is not accepted as the start of a voice segment (the end of a silence segment) unless the voice state persists for a certain length. Likewise, hangover processing may be applied so that short silence segments do not occur: after a change from the voice state to the silence state, the change is not accepted as the end of a voice segment (the start of a silence segment) unless the silence state persists for a certain length.
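One way to realize the hangover processing described above is sketched below. The function name `apply_hangover` and the two minimum-length parameters are hypothetical. The sketch also assumes that the stream starts in the silence state, and that frames of a confirmed run are relabeled from the first frame of that run.

```python
from typing import List, Sequence


def apply_hangover(labels: Sequence[bool],
                   min_voice_frames: int,
                   min_silence_frames: int) -> List[bool]:
    """Suppress state changes that do not persist long enough.

    labels: per-frame decisions, True = voice, False = silence.
    """
    smoothed: List[bool] = []
    state = False   # assumption: the stream starts in the silence state
    pending = 0     # length of the current unconfirmed run of the other state
    for label in labels:
        if label == state:
            # The candidate run was too short: absorb it into the current state.
            smoothed.extend([state] * pending)
            pending = 0
            smoothed.append(state)
            continue
        pending += 1
        needed = min_voice_frames if label else min_silence_frames
        if pending >= needed:
            # The new state persisted long enough: accept the change.
            state = label
            smoothed.extend([state] * pending)
            pending = 0
    # A trailing unconfirmed run stays in the previous state.
    smoothed.extend([state] * pending)
    return smoothed
```

In this sketch, a single voice frame surrounded by silence is relabeled as silence, and a single silent frame inside speech is relabeled as voice, so neither creates a segment boundary.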

以降の処理は、着目するフレーム、つまり処理対象のフレームが音声区間に含まれるか、沈黙区間に含まれるかによって、分岐する。 Subsequent processing branches depending on whether the frame of interest, that is, the frame to be processed is included in the speech interval or the silence interval.

音声区間判定部１０４が着目するフレームが音声区間に含まれると判定した場合、音声区間判定部１０４は、当該フレームの音信号を音声区間音信号保存部１０５に保存する（ステップＳ１０３）。この後、ステップＳ１０１の処理から、次のフレームに対する処理が行われる。 When the speech segment determination unit 104 determines that the frame of interest is included in the speech segment, the speech segment determination unit 104 stores the sound signal of the frame in the speech segment sound signal storage unit 105 (step S103). Thereafter, the processing for the next frame is performed from the processing of step S101.

When the voice segment determination unit 104 determines that the frame of interest belongs to a silence segment, the response target utterance determination unit 107 compares the duration of the silence segment that has continued from the end of the immediately preceding voice segment up to the frame with the silence segment duration threshold. If the duration is greater than the threshold, the response target utterance determination unit 107 determines that the immediately preceding voice segment is a response target. Otherwise, it determines that the response is to be held (step S104).

応答対象発声判定部１０７は、直前の音声区間を応答対象音声と判定した場合、対応する音声区間の音信号を音声区間音信号保存部１０５より取得し、出力する（ステップＳ１０５）。つまり、応答対象発声判定部１０７は、応答対象音声を出力する。 When the response target utterance determination unit 107 determines that the immediately preceding voice segment is the response target speech, the response target utterance determination unit 107 acquires and outputs the sound signal of the corresponding voice segment from the voice segment sound signal storage unit 105 (step S105). That is, the response target utterance determination unit 107 outputs response target speech.

応答対象発声判定部１０７が応答保留と判定した場合、ステップＳ１０１の処理から、次のフレームに対する処理が行われる。 When the response target utterance determination unit 107 determines that the response is on hold, processing for the next frame is performed from the processing of step S101.

Whether to determine a response target or to hold the response is decided using the following formulas (2) and (3), where L is the length of the continuing silence and Th is the silence segment duration threshold.

L > Th: the immediately preceding voice segment is determined to be response target speech, and a response is made ... Formula (2)
L ≤ Th: the response is held ... Formula (3)

When a voice segment starts while the response is on hold, the response target utterance determination unit 107 regards the preceding voice segment as an utterance that is not a response target and rejects it. Note that, because no voice segment precedes the first silence segment, no response target speech is determined for it even if formula (2) is satisfied.
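The frame-by-frame flow of steps S103 to S105, including the rejection of a held segment, can be sketched as a small state machine. The class `ResponseTargetJudge`, the `Decision` enum, and measuring L and Th in frames are all assumptions made for illustration; the text does not fix these details.

```python
from enum import Enum
from typing import Any, List, Optional, Tuple


class Decision(Enum):
    NONE = "none"        # nothing to decide for this frame
    HOLD = "hold"        # L <= Th: keep waiting
    RESPOND = "respond"  # L > Th: previous voice segment is a response target
    REJECT = "reject"    # held segment discarded because new voice started


class ResponseTargetJudge:
    def __init__(self, threshold_frames: int) -> None:
        self.threshold_frames = threshold_frames  # Th, in frames (assumed unit)
        self.voice_frames: List[Any] = []         # plays the role of storage unit 105
        self.in_voice = False
        self.on_hold = False
        self.silence_frames = 0                   # L, in frames

    def step(self, frame: Any,
             is_voice: bool) -> Tuple[Decision, Optional[List[Any]]]:
        if is_voice:
            decision = Decision.NONE
            if not self.in_voice:
                if self.on_hold:
                    # New voice during the hold: reject the held segment.
                    self.voice_frames = []
                    self.on_hold = False
                    decision = Decision.REJECT
                self.in_voice = True
            self.voice_frames.append(frame)
            return decision, None
        if self.in_voice:
            # A voice segment has just ended: start measuring the silence.
            self.in_voice = False
            self.on_hold = True
            self.silence_frames = 0
        if not self.on_hold:
            # Leading silence, or silence after a response: nothing to judge.
            return Decision.NONE, None
        self.silence_frames += 1
        if self.silence_frames > self.threshold_frames:
            segment, self.voice_frames = self.voice_frames, []
            self.on_hold = False
            return Decision.RESPOND, segment
        return Decision.HOLD, None
```

Calling `step` once per frame returns `RESPOND` together with the stored voice frames only when the silence outlasts the threshold. A leading silence never triggers a response, because no voice segment precedes it.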

The silence segment duration threshold may be obtained in advance by experiment by a user or the like. For example, from the silence segment durations and their frequencies in utterances directed outside the system, such as conversations between humans, as shown in FIG. 3, the user sets the silence segment duration threshold so that human-to-human conversation is rarely misjudged as response target speech.
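One possible way to derive such a threshold from measured data is sketched below. The text only says the threshold is found by experiment, so the rule used here (pick the threshold so that at most a chosen fraction of human-to-human pauses exceed it) is an assumption, as are the function name and the `max_false_accept_rate` parameter.

```python
import math
from typing import Sequence


def choose_silence_threshold(human_pause_durations: Sequence[float],
                             max_false_accept_rate: float = 0.05) -> float:
    """Return a threshold that few human-to-human pauses exceed.

    human_pause_durations: silence segment durations measured in
    conversations between humans (any consistent time unit).
    """
    if not human_pause_durations:
        raise ValueError("at least one measured duration is required")
    ordered = sorted(human_pause_durations)
    # Number of pauses that are allowed to be longer than the threshold.
    allowed = math.floor(max_false_accept_rate * len(ordered))
    # Largest duration that must not exceed the threshold.
    return ordered[len(ordered) - allowed - 1]
```

A lower `max_false_accept_rate` yields a longer threshold, which makes false responses to human-to-human conversation rarer but makes the user wait longer for a response.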

システムは、応答対象音声判定装置が出力した応答対象音声に対し、処理を実行する。例えば、システムが、応答対象音声に対し、音声認識を行い、音声認識より出力されたテキストに応じた応答をユーザに返すことが考えられる。また、システムが、応答保留とされている音声区間に対しても音声認識を行い、仮の音声認識出力テキストとして保持しておき、応答対象音声判定装置が応答対象音声と判定したときに、保持された仮の音声認識出力テキストを有効にすることも考えられる。 The system performs processing on the response target voice output by the response target voice determination device. For example, it is conceivable that the system performs speech recognition on the response target speech and returns a response corresponding to the text output from the speech recognition to the user. In addition, the system performs voice recognition for the voice section that is put on hold, and holds it as a temporary voice recognition output text, and holds it when the response target voice determination device determines that it is a response target voice. It is also conceivable to validate the provisional speech recognition output text.

以上に説明したように、本実施形態では、発声後の沈黙区間の継続長を用いて応答対象音声か否かを判定する。そのため、ユーザは発声後、システムの応答を待つだけでよい。従って、本実施形態によれば、ユーザビリティの高い音声対話ユーザインターフェースを提供することができる。 As described above, in the present embodiment, it is determined whether or not the voice is a response target voice by using the duration of the silence period after the utterance. Therefore, the user only has to wait for a system response after speaking. Therefore, according to the present embodiment, it is possible to provide a voice interaction user interface with high usability.

Embodiment 2.
A second embodiment of the present invention will be described below with reference to the drawings.

図４は、本発明による応答対象音声判定装置の第２の実施形態の構成を示すブロック図である。 FIG. 4 is a block diagram showing a configuration of the second embodiment of the response target speech determination device according to the present invention.

As shown in FIG. 4, the response target speech determination device of the second embodiment includes, in addition to the configuration of the first embodiment, a voice feature amount calculation unit 201, a voice feature amount threshold/weight storage unit 202, and a second silence segment duration threshold calculation unit 203.

音声特徴量算出部２０１は、音声区間判定部１０４で音声区間と判定された音信号を入力とし、音声特徴量を算出し出力する。 The voice feature quantity calculation unit 201 receives the sound signal determined as the voice section by the voice section determination unit 104, calculates the voice feature quantity, and outputs it.

音声特徴量閾値・重み格納部２０２は、予め定められた音声特徴量の閾値および重み（pre-determined threshold and weight for prosody features）を格納する。 The voice feature amount threshold / weight storage unit 202 stores a predetermined threshold value and weight for a voice feature amount (pre-determined threshold and weight for prosody features).

The second silence segment duration threshold calculation unit 203 receives the voice feature amount, the threshold and weight for the voice feature amount, and the silence segment duration threshold, and calculates and outputs a second silence segment duration threshold.

なお、音声特徴量算出部２０１および第二の沈黙区間継続長閾値算出部２０３は、例えば、応答対象音声判定プログラムに従って動作するコンピュータによって実現される。この場合、ＣＰＵが応答対象音声判定プログラムを読み込み、そのプログラムに従って、音声特徴量算出部２０１および第二の沈黙区間継続長閾値算出部２０３として動作する。また、音声特徴量算出部２０１および第二の沈黙区間継続長閾値算出部２０３が別々のハードウェアで実現されていてもよい。 Note that the audio feature quantity calculation unit 201 and the second silence interval duration threshold calculation unit 203 are realized by, for example, a computer that operates according to a response target audio determination program. In this case, the CPU reads the response target sound determination program, and operates as the sound feature amount calculation unit 201 and the second silence interval duration threshold value calculation unit 203 according to the program. Further, the audio feature amount calculation unit 201 and the second silence interval duration threshold value calculation unit 203 may be realized by separate hardware.

また、音声特徴量閾値・重み格納部２０２、は、具体的には、応答対象音声判定装置が備える光ディスク装置や磁気ディスク装置、メモリ等の記憶装置によって実現される。 The voice feature value threshold / weight storage unit 202 is specifically realized by a storage device such as an optical disk device, a magnetic disk device, or a memory included in the response target voice determination device.

図５は、応答対象音声判定装置の第２の実施形態の動作を示すフローチャートである。図６は、システムに対しなされた発声から抽出された音声特徴量と、システム外に対してなされた発声から抽出された音声特徴量の頻度分布を示す説明図である。 FIG. 5 is a flowchart showing the operation of the second embodiment of the response target speech determination device. FIG. 6 is an explanatory diagram showing the frequency distribution of speech feature values extracted from utterances made to the system and speech feature values extracted from utterances made outside the system.

ステップＳ２０１〜Ｓ２０３の処理は、第１の実施形態におけるステップＳ１０１〜Ｓ１０３の処理と同様である。 The processing in steps S201 to S203 is the same as the processing in steps S101 to S103 in the first embodiment.

音声区間判定部１０４が、着目するフレームが音声区間に含まれると判定し、ステップＳ２０３の処理を実行した後、音声特徴量算出部２０１は、着目するフレームの音信号から音声特徴量を算出する（ステップＳ２０４）。音声特徴量は、音信号から抽出される特徴量である。音声特徴量は、例えば、音声区間における振幅パワーや、その平均値、分散値、最大値、最小値や、フォルマント周波数や、ケプストラム、といった音声認識で広く用いられている特徴量である。また、音声特徴量として、音声区間の継続長を用いることも考えられる。 After the speech segment determination unit 104 determines that the frame of interest is included in the speech segment and executes the process of step S203, the speech feature amount calculation unit 201 calculates the speech feature amount from the sound signal of the frame of interest. (Step S204). The voice feature amount is a feature amount extracted from the sound signal. The speech feature amount is a feature amount widely used in speech recognition such as amplitude power in a speech section, average value, variance value, maximum value, minimum value, formant frequency, and cepstrum. It is also conceivable to use the duration of the speech segment as the speech feature amount.
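As an illustration of segment-level features of this kind, the sketch below computes the statistics of amplitude power named in the text plus the segment duration. The function name, the dictionary keys, the use of population variance, and the 10 ms frame shift default are assumptions. Formant and cepstrum features are omitted.

```python
from statistics import mean, pvariance
from typing import Dict, Sequence


def segment_features(frame_powers: Sequence[float],
                     frame_shift_ms: int = 10) -> Dict[str, float]:
    """Summarize one voice segment from its per-frame amplitude powers."""
    return {
        "mean_power": mean(frame_powers),
        # Assumption: population variance; the text does not say which kind.
        "variance": pvariance(frame_powers),
        "max_power": max(frame_powers),
        "min_power": min(frame_powers),
        # Duration of the voice segment, from the number of frames.
        "duration_ms": len(frame_powers) * frame_shift_ms,
    }
```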

Next, the second silence segment duration threshold calculation unit 203 calculates the second silence segment duration threshold from the voice feature amount calculated in step S204, the threshold and weight of the voice feature amount stored in the voice feature amount threshold/weight storage unit 202, and the silence segment duration threshold stored in the silence segment duration threshold storage unit 106 (step S205). Specifically, letting Th1 be the silence segment duration threshold, F the voice feature amount, ThF the voice feature amount threshold, and wF the voice feature amount weight, the second silence segment duration threshold calculation unit 203 calculates the second silence segment duration threshold Th2 by the following formula (4).

Th2 = Th1 − sgn × wF × (F − ThF) ... Formula (4)

Here, sgn takes the value +1 or -1. It is set to +1 for a feature whose value becomes larger for utterances directed to the system, and to -1 for a feature whose value becomes smaller for utterances directed to the system. For example, users are thought to speak more loudly when addressing the system, so sgn is +1 when loudness is used as the feature. Users are also thought to speak more slowly to the system, so sgn is -1 when speaking rate is used as the feature.
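Formula (4) and the role of sgn can be checked with a short sketch. The function and parameter names are hypothetical, and the units in the usage note (milliseconds, decibels, syllables per second) are illustrative assumptions only.

```python
def second_silence_threshold(th1: float,
                             feature: float,
                             feature_threshold: float,
                             weight: float,
                             sgn: int) -> float:
    """Th2 = Th1 - sgn * wF * (F - ThF), as in formula (4)."""
    if sgn not in (1, -1):
        raise ValueError("sgn must be +1 or -1")
    return th1 - sgn * weight * (feature - feature_threshold)
```

With Th1 = 1000 ms, a loudness of 70 dB against a threshold of 60 dB, a weight of 20 ms per dB, and sgn = +1, the result is 800 ms, so a loud utterance shortens the required silence. With speaking rate as the feature (sgn = -1), a rate of 4 syllables per second against a threshold of 6 and a weight of 50 gives 900 ms, so slow speech also shortens it.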

Note that the response target speech determination device may perform the processing of steps S203 to S205 for each frame, or all at once at the end of the voice segment (the start of the silence segment).

第二の沈黙区間継続長閾値算出部２０３がステップＳ２０５の処理を実行した後、ステップＳ１０１の処理から、次のフレームに対する処理が行われる。 After the second silence interval duration threshold calculation unit 203 executes the process of step S205, the process for the next frame is performed from the process of step S101.

When the voice segment determination unit 104 determines in step S202 that the frame of interest belongs to a silence segment, the response target utterance determination unit 107 compares the duration of the silence segment with the second silence interval duration threshold, instead of with the silence interval duration threshold, and determines whether to respond to the immediately preceding voice segment as a response target or to hold the response (step S206).

応答対象発声判定部１０７は、直前の音声区間を応答対象音声と判定した場合、対応する音声区間の音信号を音声区間音信号保存部１０５より取得し、出力する（ステップＳ２０７）。つまり、応答対象発声判定部１０７は、応答対象音声を出力する。 When the response target utterance determination unit 107 determines that the immediately preceding voice segment is the response target speech, the response target utterance determination unit 107 acquires and outputs the sound signal of the corresponding voice segment from the voice segment sound signal storage unit 105 (step S207). That is, the response target utterance determination unit 107 outputs response target speech.

応答対象発声判定部１０７が応答保留と判定した場合、ステップＳ２０１の処理から、次のフレームに対する処理が行われる。 When the response target utterance determination unit 107 determines that the response is on hold, processing for the next frame is performed from the processing of step S201.

The voice feature amount threshold may be determined in advance by experiment, for example by the user. As shown in FIG. 6, the threshold is chosen from the frequency distribution of voice feature amounts extracted from utterances directed at the system and that of voice feature amounts extracted from utterances directed at something other than the system, so as to separate the two distributions as well as possible.
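One possible way to pick such a threshold from experimental samples is sketched below. The description only says the two distributions should be separated as well as possible; the error-counting criterion, the function name, and the sample values are assumptions, and the sketch further assumes a feature that is larger for system-directed speech.

```python
def choose_feature_threshold(system_values, other_values):
    """Pick the candidate threshold with the fewest misclassified samples.

    Assumes system-directed utterances tend to have LARGER feature values,
    so a sample counts as system-directed when value >= threshold.
    """
    candidates = sorted(set(system_values) | set(other_values))
    best_threshold, best_errors = None, None
    for c in candidates:
        errors = (sum(v < c for v in system_values)
                  + sum(v >= c for v in other_values))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = c, errors
    return best_threshold


# Hypothetical loudness samples from a calibration session.
system_directed = [62, 65, 70, 72]
not_system_directed = [50, 55, 58, 63]
th_f = choose_feature_threshold(system_directed, not_system_directed)
```

When several candidates tie, this sketch keeps the smallest one; other tie-breaking rules would be equally defensible.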

As described above, in this embodiment, when the comparison between the voice feature amount and the voice feature amount threshold indicates that the user's utterance was likely directed at the system, the second silence interval duration threshold is made shorter than the silence interval duration threshold. This allows response target speech to be determined with little delay. Conversely, when the comparison indicates that the utterance was unlikely to have been directed at the system, the second silence interval duration threshold is made longer than the silence interval duration threshold. This lengthens the time for which the response is held, which makes it more likely that the user makes the next utterance while the response is held and that the response is therefore rejected.

Thus, in this embodiment, whether an utterance is response target speech is determined based on the second silence interval duration threshold and the duration of the silence segment following the utterance. Therefore, according to this embodiment, response target speech can be determined in a way that takes into account the loudness, speaking rate, and similar characteristics of the user's voice when addressing the system, and a spoken dialogue user interface with higher usability can be provided.

Embodiment 3.
Hereinafter, a third embodiment of the present invention will be described with reference to the drawings.

図７は、本発明による応答対象音声判定装置の第３の実施形態の構成を示すブロック図である。 FIG. 7 is a block diagram showing the configuration of the third embodiment of the response target speech determination apparatus according to the present invention.

As shown in FIG. 7, the response target speech determination device according to the third embodiment includes, in addition to the configuration of the first embodiment, a video signal acquisition unit 301, a video feature amount calculation unit 302, a video feature amount threshold/weight storage unit 303, and a second silence interval duration threshold calculation unit 304.

映像信号取得部３０１は、カメラなどを用いて映像信号を取得する。 The video signal acquisition unit 301 acquires a video signal using a camera or the like.

The video feature amount calculation unit 302 receives the video signal and information on the segments determined by the voice segment determination unit 104 (hereinafter referred to as segment information), and calculates and outputs a video feature amount for the voice segment, the silence segment, or both. The segment information includes the voice segment duration and the like.

映像特徴量閾値・重み格納部３０３は、予め定められた映像特徴量の閾値および重みを格納する。 The video feature amount threshold / weight storage unit 303 stores a predetermined threshold and weight of the video feature amount.

第二の沈黙区間継続長閾値算出部３０４は、映像特徴量と沈黙区間継続長閾値と映像特徴量閾値と映像特徴量重みとを入力とし、第二の沈黙区間継続長閾値を算出し出力する。 The second silence interval duration threshold calculation unit 304 receives the image feature amount, the silence interval duration threshold value, the image feature amount threshold value, and the image feature amount weight, and calculates and outputs the second silence interval duration threshold value. .

なお、映像信号取得部３０１、映像特徴量算出部３０２および第二の沈黙区間継続長閾値算出部３０４は、例えば、応答対象音声判定プログラムに従って動作するコンピュータによって実現される。この場合、ＣＰＵが応答対象音声判定プログラムを読み込み、そのプログラムに従って、映像信号取得部３０１、映像特徴量算出部３０２および第二の沈黙区間継続長閾値算出部３０４として動作する。また、映像信号取得部３０１、映像特徴量算出部３０２および第二の沈黙区間継続長閾値算出部３０４が別々のハードウェアで実現されていてもよい。 Note that the video signal acquisition unit 301, the video feature amount calculation unit 302, and the second silence interval duration threshold calculation unit 304 are realized by, for example, a computer that operates according to a response target audio determination program. In this case, the CPU reads the response target audio determination program, and operates as the video signal acquisition unit 301, the video feature amount calculation unit 302, and the second silence interval duration threshold calculation unit 304 according to the program. In addition, the video signal acquisition unit 301, the video feature amount calculation unit 302, and the second silence interval duration threshold value calculation unit 304 may be realized by separate hardware.

また、映像特徴量閾値・重み格納部３０３は、具体的には、応答対象音声判定装置が備える光ディスク装置や磁気ディスク装置、メモリ等の記憶装置によって実現される。 The video feature amount threshold / weight storage unit 303 is specifically realized by a storage device such as an optical disk device, a magnetic disk device, or a memory included in the response target sound determination device.

The video feature amount calculation unit 302 calculates a video feature amount from the video signal obtained by the video signal acquisition unit 301. Possible video feature amounts include the user's face, line of sight, and body orientation. The video feature amount calculation unit 302 may average the video feature amount over the voice segment. It may also use the video feature amount to obtain the time during which the user is facing the system, or the ratio of that time to the voice segment duration.

The second silence interval duration threshold calculation unit 304 calculates the second silence interval duration threshold from the silence interval duration threshold and the video feature amount threshold. Letting Th1 be the silence interval duration threshold, F the video feature amount, ThF the video feature amount threshold, and wF the video feature amount weight, the second silence interval duration threshold Th2 is calculated by equation (5) below.

Th2 = Th1 - wF × (F - ThF) ... Equation (5)

Using the calculated second silence interval duration threshold Th2, the response target utterance determination unit 107 determines, by the same method as in the first embodiment, whether to respond to the immediately preceding voice segment as a response target or to hold the response.

映像特徴量は、沈黙区間でも算出可能である。映像特徴量算出部３０２が音声区間と沈黙区間と別々に特徴量を算出する場合、式（５）は、式（６）のように変形される。 The video feature amount can be calculated even in the silent section. When the video feature quantity calculation unit 302 calculates the feature quantity separately for the voice section and the silence section, Expression (5) is transformed into Expression (6).

Th2 = Th1 - wFv × (Fv - ThFv) - wFs × (Fs - ThFs) ... Equation (6)

ここで、Ｆｖは音声区間の映像特徴量を示す。ＴｈＦｖは音声区間の映像特徴量閾値を示す。ｗＦｖは音声区間の映像特徴量重みを示す。Ｆｓは沈黙区間の映像特徴量を示す。ＴｈＦｓは沈黙区間の映像特徴量閾値を示す。ｗＦｓは沈黙区間の映像特徴量の重みを示す。 Here, Fv indicates the video feature amount of the audio section. ThFv indicates a video feature amount threshold value in an audio section. wFv indicates the video feature amount weight of the audio section. Fs indicates the video feature amount in the silent section. ThFs indicates a video feature amount threshold value in the silent section. wFs indicates the weight of the video feature amount in the silent section.
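Equation (6) can be sketched as follows. The choice of video feature (the fraction of each segment during which the user faces the system) and every numeric value are assumptions made only for this example.

```python
def second_silence_threshold_video(th1, f_voice, th_f_voice, w_voice,
                                   f_silence, th_f_silence, w_silence):
    """Equation (6): Th2 = Th1 - wFv*(Fv - ThFv) - wFs*(Fs - ThFs)."""
    return (th1
            - w_voice * (f_voice - th_f_voice)
            - w_silence * (f_silence - th_f_silence))


# Hypothetical case: the user faced the system for 90% of the voice segment
# and for the whole of the silence segment observed so far.
th2_video = second_silence_threshold_video(
    th1=1.0,
    f_voice=0.9, th_f_voice=0.5, w_voice=0.5,
    f_silence=1.0, th_f_silence=0.5, w_silence=0.4,
)
```

Because both features exceed their thresholds here, each term shortens the threshold, so a user who keeps looking at the system gets a faster response.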

なお、音声区間判定部１０４において、音信号のみから音声区間、沈黙区間を判定することに加えて、映像を用いて音声区間、沈黙区間を判定するようにしてもよい。例えば、映像を用いて口の大きさや動きなどを解析し、口が小さいとき、または口が動いていないときには、沈黙区間と判定するといったことが考えられる。 Note that the audio section determination unit 104 may determine the audio section and the silence section using video in addition to determining the audio section and the silence section from only the sound signal. For example, it is conceivable to analyze the size and movement of the mouth using an image, and when the mouth is small or when the mouth is not moving, it is determined that it is a silence interval.

As described above, in this embodiment, the second silence interval duration threshold calculated from the video feature amount is used to determine whether to respond to the immediately preceding voice segment as a response target or to hold the response. Response target speech can therefore be determined in a way that takes into account the user's face, line of sight, body orientation, and the like. As a result, a user who wants a response from the system only needs to look toward the system during and after the utterance and remain silent for a while. Therefore, according to this embodiment, a spoken dialogue user interface with higher usability can be provided.

Embodiment 4.
Hereinafter, a fourth embodiment of the present invention will be described with reference to the drawings.

図８は、本発明による応答対象音声判定装置の第４の実施形態の構成を示すブロック図である。 FIG. 8 is a block diagram showing a configuration of the fourth embodiment of the response target speech determination device according to the present invention.

図８に示すように、第４の実施形態における応答対象音声判定装置は、第１の実施形態の構成に加えて、対話活性度算出部４０１と、対話活性度閾値・重み格納部４０２と、第二の沈黙区間継続長閾値算出部４０３とを備える。 As shown in FIG. 8, in addition to the configuration of the first embodiment, the response target speech determination apparatus according to the fourth embodiment includes a dialog activity level calculation unit 401, a dialog activity level threshold / weight storage unit 402, And a second silence interval duration threshold value calculation unit 403.

The dialogue activity level calculation unit 401 calculates a dialogue activity level (conversation activity) from the temporal relationship among the multiple voice segments and silence segments obtained by the voice segment determination unit 104. In this embodiment, the dialogue activity level calculation unit 401 uses the frequency of switching between voice segments and silence segments as that temporal relationship. The voice segments and silence segments used in the calculation are, for example, those that lie within a fixed length of time going back from the frame of interest.
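The switching frequency over a recent window could be computed along the following lines. This is only a sketch: the per-frame boolean representation, the frame length, the window length, and the unit (switches per second) are assumptions not fixed by the description.

```python
def switching_frequency(frame_is_voice, frame_seconds, window_seconds):
    """Count voice/silence switches per second over the most recent window.

    frame_is_voice: per-frame decisions, True for a voice segment frame and
    False for a silence segment frame, oldest first.
    """
    n = max(1, int(round(window_seconds / frame_seconds)))
    recent = frame_is_voice[-n:]
    # A switch is any pair of adjacent frames with different labels.
    switches = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
    duration = len(recent) * frame_seconds
    return switches / duration if duration > 0 else 0.0


# Hypothetical 2-second history of 100 ms frames: voice, silence, voice, silence.
history = [True] * 5 + [False] * 5 + [True] * 5 + [False] * 5
activity = switching_frequency(history, frame_seconds=0.1, window_seconds=2.0)
```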

対話活性度閾値・重み格納部４０２は、予め定められた対話活性度の閾値および重みを格納する。 The dialogue activity level threshold / weight storage unit 402 stores a threshold and a weight of a predetermined dialogue activity level.

The second silence interval duration threshold calculation unit 403 receives the silence interval duration threshold stored in the silence interval duration threshold storage unit 106, the dialogue activity level, and the threshold and weight of the dialogue activity level, and calculates and outputs the second silence interval duration threshold.

なお、対話活性度算出部４０１および第二の沈黙区間継続長閾値算出部４０３は、例えば、応答対象音声判定プログラムに従って動作するコンピュータによって実現される。この場合、ＣＰＵが応答対象音声判定プログラムを読み込み、そのプログラムに従って、対話活性度算出部４０１および第二の沈黙区間継続長閾値算出部４０３として動作する。また、対話活性度算出部４０１および第二の沈黙区間継続長閾値算出部４０３が別々のハードウェアで実現されていてもよい。 Note that the dialogue activity level calculation unit 401 and the second silence interval duration threshold value calculation unit 403 are realized by, for example, a computer that operates according to a response target voice determination program. In this case, the CPU reads the response target voice determination program and operates as the dialogue activity calculation unit 401 and the second silence interval duration threshold calculation unit 403 according to the program. Moreover, the dialogue activity level calculation unit 401 and the second silence interval duration threshold value calculation unit 403 may be realized by separate hardware.

また、対話活性度閾値・重み格納部４０２は、具体的には、応答対象音声判定装置が備える光ディスク装置や磁気ディスク装置、メモリ等の記憶装置によって実現される。 The interactive activity threshold / weight storage unit 402 is specifically realized by a storage device such as an optical disk device, a magnetic disk device, or a memory included in the response target voice determination device.

Embodiment 5.
Hereinafter, a fifth embodiment of the present invention will be described with reference to the drawings.

図９は、本発明による応答対象音声判定装置の第５の実施形態の構成を示すブロック図である。 FIG. 9 is a block diagram showing the configuration of the fifth exemplary embodiment of the response target speech determination device according to the present invention.

図９に示すように、第５の実施形態における応答対象音声判定装置は、第１の実施形態の構成に加えて、複数音信号取得部５０１と、入力音信号切り出し部５０２と、音声区間判定部５０３と、対話活性度算出部５０４と、対話活性度閾値・重み格納部５０５と、第二の沈黙区間継続長閾値算出部５０６とを備える。 As shown in FIG. 9, in addition to the configuration of the first embodiment, the response target speech determination device according to the fifth embodiment includes a multiple sound signal acquisition unit 501, an input sound signal cutout unit 502, and a speech segment determination. Unit 503, dialogue activity level calculation unit 504, dialogue activity level threshold / weight storage unit 505, and second silence interval duration threshold value calculation unit 506.

複数音信号取得部５０１は、複数の音声入力装置、例えばマイクロホンを用いて、話者や方向ごとに複数チャネルの入力音信号を取得する。 The multi-sound signal acquisition unit 501 acquires multi-channel input sound signals for each speaker and direction using a plurality of sound input devices, for example, microphones.

入力音信号切り出し部５０２は、複数チャネルの入力音信号を入力とし、それぞれフレームごとに切り出した音信号を出力する。 The input sound signal cutout unit 502 receives input sound signals of a plurality of channels and outputs sound signals cut out for each frame.

The voice segment determination unit 503 includes a plurality of voice segment detection units (voice activity detection units VAD1 to VADN). Using VAD1 to VADN, the voice segment determination unit 503 receives the multi-channel sound signals cut out for each frame and the threshold for voice segment determination stored in the voice segment determination threshold storage unit 103, and determines for each channel whether the frame belongs to a voice segment (active voice segment) or to a silence segment in which no voice is present.

対話活性度算出部５０４は、音声区間判定部５０３で求まった音声区間と沈黙区間の時間的関係性から、対話活性度（conversation activity）を算出する。 The conversation activity level calculation unit 504 calculates the conversation activity (conversation activity) from the temporal relationship between the voice period and the silence period obtained by the voice period determination unit 503.

対話活性度閾値・重み格納部５０５は、予め定められた対話活性度の閾値および重みを格納する。 The interactive activity threshold / weight storage unit 505 stores a predetermined threshold and weight of interactive activity.

第二の沈黙区間継続長閾値算出部５０６は、沈黙区間継続長閾値格納部１０６に格納されている沈黙区間継続長閾値と、対話活性度と、対話活性度閾値・重みとを入力とし、第二の沈黙区間継続長閾値を算出し出力する。 The second silence interval duration threshold value calculation unit 506 receives the silence interval duration threshold value stored in the silence interval duration threshold storage unit 106, the dialogue activity level, the dialogue activity level threshold / weight, Calculate and output the second silence interval duration threshold.

In this embodiment, the dialogue activity level calculation unit 504 calculates the dialogue activity level using the temporal relationship among the multiple voice segments and silence segments obtained for the multiple channels. Possible temporal relationships include, for example, the frequency of switching between voice segments and silence segments, the frequency with which voice segments on different channels overlap, and the speaker entropy calculated from the utterance occupancy rates.

発話の占有率は、話者を問わず音声区間と判定した区間のうち、特定の話者が発声している音声区間の長さの割合である。発話者エントロピーSは、以下に示す式（７）を用いて算出される。 The occupancy rate of the utterance is the ratio of the length of the voice section in which a specific speaker is speaking out of the sections determined to be voice sections regardless of the speaker. The speaker entropy S is calculated using the following equation (7).

S = - Σ_i P_i log P_i ... Equation (7)

ここで、P_iはi番目の発話者の発話の占有率を示す。例えば３人の話者がいて、１番目の話者（話者Ａ）の音声区間継続長が５秒、２番目の話者（話者Ｂ）の音声区間継続長が２秒、３番目の話者（話者Ｃ）の音声区間継続長が１秒であったとき、P_1、P_2、P_3は、式（８）から式（１０）に示す値となる。また、発話者エントロピーSは、式（１１）で算出される。 Here, P_i represents the occupancy rate of the utterance of the i-th speaker. For example, if there are three speakers, the first speaker (speaker A) has a voice duration of 5 seconds, the second speaker (speaker B) has a voice duration of 2 seconds, and the third When the duration of the speech section of the speaker (speaker C) is 1 second, P_1, P_2, and P_3 are values shown in equations (8) to (10). Further, the speaker entropy S is calculated by the equation (11).

P_1 = 5/8 ... Equation (8)
P_2 = 2/8 ... Equation (9)
P_3 = 1/8 ... Equation (10)
S = - P_1 log(P_1) - P_2 log(P_2) - P_3 log(P_3) ... Equation (11)
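The worked example of equations (8) to (11) can be reproduced with a short sketch. The description does not state the base of the logarithm; the natural logarithm is assumed here, and the function name is hypothetical.

```python
import math


def speaker_entropy(voice_durations):
    """Equation (7): S = -sum_i P_i * log(P_i), with P_i the occupancy rate.

    voice_durations: total voice segment duration per speaker, in seconds.
    The natural logarithm is an assumption; the source leaves the base open.
    """
    total = sum(voice_durations)
    if total <= 0:
        return 0.0
    # Speakers with zero duration contribute nothing (p * log p -> 0).
    shares = [d / total for d in voice_durations if d > 0]
    return -sum(p * math.log(p) for p in shares)


# Speakers A, B and C spoke for 5 s, 2 s and 1 s: P = 5/8, 2/8, 1/8.
s = speaker_entropy([5.0, 2.0, 1.0])
```

Under the natural-log assumption the example evaluates to roughly 0.90. The entropy is 0 when a single speaker holds the floor and largest when all speakers talk for equal time.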

Note that the multiple sound signal acquisition unit 501, the input sound signal cutout unit 502, the voice segment determination unit 503, the dialogue activity level calculation unit 504, and the second silence interval duration threshold calculation unit 506 are realized by, for example, a computer that operates according to a response target speech determination program. In this case, the CPU reads the response target speech determination program and, according to the program, operates as the multiple sound signal acquisition unit 501, the input sound signal cutout unit 502, the voice segment determination unit 503, the dialogue activity level calculation unit 504, and the second silence interval duration threshold calculation unit 506. Alternatively, these units may be realized by separate pieces of hardware.

また、対話活性度閾値・重み格納部５０５は、具体的には、応答対象音声判定装置が備える光ディスク装置や磁気ディスク装置、メモリ等の記憶装置によって実現される。 The interactive activity threshold / weight storage unit 505 is specifically realized by a storage device such as an optical disk device, a magnetic disk device, or a memory included in the response target voice determination device.

Embodiment 6.
Hereinafter, a sixth embodiment of the present invention will be described with reference to the drawings.

図１０は、本発明による応答対象音声判定装置の第６の実施形態の構成を示すブロック図である。 FIG. 10 is a block diagram showing the configuration of the sixth embodiment of the response target speech determination apparatus according to the present invention.

図１０に示すように、第６の実施形態における応答対象音声判定装置は、第２の実施形態の構成に加えて、最大遅延時間格納部６０１を備える。 As illustrated in FIG. 10, the response target speech determination device according to the sixth exemplary embodiment includes a maximum delay time storage unit 601 in addition to the configuration of the second exemplary embodiment.

なお、最大遅延時間格納部６０１は、具体的には、応答対象音声判定装置が備える光ディスク装置や磁気ディスク装置、メモリ等の記憶装置によって実現される。 Note that the maximum delay time storage unit 601 is specifically realized by a storage device such as an optical disk device, a magnetic disk device, or a memory included in the response target sound determination device.

In this embodiment, the response target utterance determination unit 107 compares the second silence interval duration threshold calculated by the second silence interval duration threshold calculation unit 203 with the maximum delay time. In this embodiment, the maximum delay time is the maximum value of the delay until the system responds to the user. When the second silence interval duration threshold is longer than the maximum delay time, the response target utterance determination unit 107 rejects the immediately preceding voice segment as not being a response target. Specifically, letting L be the length of the continuing silence, Th the second silence interval duration threshold, and D the maximum delay time, the response target utterance determination unit 107 uses equations (12) to (14) below to determine whether the immediately preceding voice segment is response target speech.

D < Th: the immediately preceding voice segment is judged not to be response target speech and is rejected ... Equation (12)
D ≥ Th and L ≤ Th: the response is held ... Equation (13)
D ≥ Th and L > Th: the immediately preceding voice segment is judged to be response target speech and a response is made ... Equation (14)
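The three-way decision of equations (12) to (14) maps directly onto a small function. The function name, the string labels, and the example values are assumptions for illustration only.

```python
def judge_response(silence_length, th, max_delay):
    """Apply equations (12)-(14).

    silence_length: L, length of the continuing silence
    th:             Th, second silence interval duration threshold
    max_delay:      D, maximum delay time
    """
    if max_delay < th:            # Equation (12): waiting would exceed D
        return "reject"
    if silence_length <= th:      # Equation (13): not silent long enough yet
        return "hold"
    return "respond"              # Equation (14): D >= Th and L > Th


# Hypothetical values in seconds.
decisions = [
    judge_response(silence_length=0.5, th=2.5, max_delay=2.0),
    judge_response(silence_length=0.5, th=0.8, max_delay=2.0),
    judge_response(silence_length=1.0, th=0.8, max_delay=2.0),
]
```

Note that the rejection in equation (12) depends only on Th and D, so it can be decided as soon as Th is known, without waiting for the silence to continue.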

応答対象発声判定部１０７が直前の音声区間を応答対象音声外と判定し棄却した場合、応答対象音声判定装置またはシステムが、ユーザに通知することが考えられる。例えば、「発声が聞き取れませんでした」というメッセージを通知することが考えられる。 When the response target utterance determination unit 107 determines that the immediately preceding voice section is outside the response target voice and rejects it, the response target voice determination device or system may notify the user. For example, it may be possible to notify a message that “the utterance could not be heard”.

なお、上記の各実施形態は複数組み合わせて用いることもできる。 Note that a plurality of the above embodiments can be used in combination.

次に、本発明の概要を説明する。図１１は、本発明による応答対象音声判定装置の最小構成を示すブロック図である。図１２は、本発明による応答対象音声判定装置の他の最小構成を示すブロック図である。 Next, the outline of the present invention will be described. FIG. 11 is a block diagram showing the minimum configuration of the response target speech determination device according to the present invention. FIG. 12 is a block diagram showing another minimum configuration of the response target speech determination device according to the present invention.

As shown in FIG. 11, the response target speech determination device according to the present invention includes a response target utterance determination unit 11 (corresponding to the response target utterance determination unit 107 shown in FIG. 1) that, upon detecting that the silence interval duration, which is the length of the silence segment following an utterance by the user, has become longer than a predetermined silence interval duration threshold, determines that the utterance was made by the user to the system, and, upon detecting that the silence interval duration is shorter than the silence interval duration threshold, determines that the utterance was made by the user to something other than the system.

そのような構成によれば、発声後の沈黙区間の継続長を用いて応答対象音声か否かを判定するため、ユーザは発声後、システムの応答を待つだけでよい。従って、ユーザビリティの高い音声対話ユーザインターフェースを提供することができる。 According to such a configuration, since it is determined whether or not the voice is a response target voice using the duration of the silence period after the utterance, the user only has to wait for a system response after the utterance. Therefore, it is possible to provide a voice interaction user interface with high usability.

The device may further include: a voice segment determination unit 12 (corresponding to the voice segment determination unit 104 shown in FIG. 4 or FIG. 7) that determines voice segments and silence segments in the time series of the sound signal collected by a voice input device; a feature amount calculation unit 13 (corresponding to the voice feature amount calculation unit 201 shown in FIG. 4 or the video feature amount calculation unit 302 shown in FIG. 7) that extracts a feature amount corresponding to the voice segment, the silence segment, or both; and a second silence interval duration threshold calculation unit 14 (corresponding to the second silence interval duration threshold calculation unit 203 shown in FIG. 4 or the second silence interval duration threshold calculation unit 304 shown in FIG. 7) that obtains a second silence interval duration threshold from the feature amount, a predetermined threshold and weight of the feature amount, and a predetermined first silence interval duration threshold (corresponding to the silence interval duration threshold). The response target utterance determination unit 11 may then make its determination using the second silence interval duration threshold. With such a configuration, response target speech can be determined with little delay and without degrading the accuracy of the determination, and a spoken dialogue user interface with higher usability can be provided.

また、特徴量算出部１３が、音声区間に対応する音信号から音声特徴量を１つ以上抽出し、第二の沈黙区間継続長閾値算出部１４が、音声特徴量を用いてもよい。そのような構成によれば、ユーザがシステムに対して話すときの声の大きさや話す速度などを考慮した応答対象音声の判定を行うことができ、よりユーザビリティの高い音声対話ユーザインターフェースを提供することできる。 Further, the feature quantity calculation unit 13 may extract one or more voice feature quantities from the sound signal corresponding to the voice section, and the second silence section duration threshold calculation unit 14 may use the voice feature quantity. According to such a configuration, it is possible to determine the response target voice in consideration of the loudness and speaking speed when the user speaks to the system, and it is possible to provide a voice interaction user interface with higher usability.

The feature amount calculation unit 13 may also extract a video feature amount from the video corresponding to the voice segment, from the video corresponding to the silence segment, or from the video corresponding to both, and the second silence interval duration threshold calculation unit 14 may use one or more video feature amounts. With such a configuration, response target speech can be determined in a way that takes into account the user's face, line of sight, body orientation, and the like. As a result, a user who wants a response from the system only needs to look toward the system during and after the utterance and remain silent for a while. A spoken dialogue user interface with higher usability can therefore be provided.

As shown in FIG. 12, the device may also include a dialogue activity level calculation unit 15 (corresponding to the dialogue activity level calculation unit 504 shown in FIG. 9). In that case, the voice segment determination unit 12 (corresponding to the voice segment determination unit 503 shown in FIG. 9) determines voice segments and silence segments for each of the time series of the multi-channel sound signals collected by a plurality of voice input devices, the dialogue activity level calculation unit 15 calculates a dialogue activity level from the temporal relationship among the voice segments and silence segments of the multiple channels, and the second silence interval duration threshold calculation unit 14 (corresponding to the second silence interval duration threshold calculation unit 506 shown in FIG. 9) may calculate the second silence interval duration threshold from the dialogue activity level, a predetermined threshold and weight of the dialogue activity level, and a predetermined first silence interval duration threshold. With such a configuration, response target speech can be determined in a way that takes into account the dialogue activity level calculated from the temporal relationship among multiple voice segments and silence segments.

Also, a maximum delay time storage unit 16 (corresponding to the maximum delay time storage unit 601 shown in FIG. 10) may be provided that stores in advance a maximum delay time, which is the maximum value of the delay before the system responds to the user. The response target utterance determination unit 11 may then detect that the second silence interval duration threshold is longer than the maximum delay time and reject the utterance by the user as a non-response-target utterance. With such a configuration, for example, when the user's voice is difficult to hear, that is, when the voice section could not be recognized correctly, the user can be notified of that fact.
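The determination with the maximum delay time can be sketched as below. The comparison rules follow the description (a silence longer than the threshold means the utterance was addressed to the system, and a threshold longer than the maximum delay time leads to rejection). The names `Decision` and `judge_utterance`, and the treatment of a silence exactly equal to the threshold as not addressed to the system, are assumptions of this sketch.

```python
from enum import Enum


class Decision(Enum):
    RESPONSE_TARGET = "response_target"          # addressed to the system
    NOT_RESPONSE_TARGET = "not_response_target"  # addressed to something else
    REJECTED = "rejected"                        # threshold exceeds maximum delay


def judge_utterance(
    silence_duration: float,
    second_threshold: float,
    max_delay: float,
) -> Decision:
    """Classify an utterance from the silence that follows it (seconds)."""
    # Waiting longer than the maximum delay is not allowed, so an utterance
    # whose threshold exceeds it is rejected outright.
    if second_threshold > max_delay:
        return Decision.REJECTED
    if silence_duration > second_threshold:
        return Decision.RESPONSE_TARGET
    # Equality is not specified in the text; it is treated as
    # "not a response target" here by assumption.
    return Decision.NOT_RESPONSE_TARGET


addressed = judge_utterance(silence_duration=2.5, second_threshold=2.0, max_delay=3.0)
ignored = judge_utterance(silence_duration=1.0, second_threshold=2.0, max_delay=3.0)
rejected = judge_utterance(silence_duration=5.0, second_threshold=4.0, max_delay=3.0)
```

A caller could use the `REJECTED` result to trigger the notification to the user mentioned above.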

Description of Symbols

11 Response target utterance determination unit
12 Voice section determination unit
13 Feature amount calculation unit
14 Second silence interval duration threshold calculation unit
15 Dialogue activity calculation unit
16 Maximum delay time storage unit
101 Input sound signal acquisition unit
102 Input sound signal extraction unit
103 Voice section determination threshold storage unit
104 Voice section determination unit
105 Voice section sound signal storage unit
106 Silence interval duration threshold storage unit
107 Response target utterance determination unit
201 Voice feature amount calculation unit
202 Voice feature amount threshold/weight storage unit
203 Second silence interval duration threshold calculation unit
301 Video signal acquisition unit
302 Video feature amount calculation unit
303 Video feature amount threshold/weight storage unit
304 Second silence interval duration threshold calculation unit
401 Dialogue activity calculation unit
402 Dialogue activity threshold/weight storage unit
403 Second silence interval duration threshold calculation unit
501 Multiple sound signal acquisition unit
502 Input sound signal extraction unit
503 Voice section determination unit
504 Dialogue activity calculation unit
505 Dialogue activity threshold/weight storage unit
506 Second silence interval duration threshold calculation unit
601 Maximum delay time storage unit

Claims

A response target voice determination device comprising a response target utterance determination unit that detects that a silence interval duration, which is the length of the silence interval following an utterance by a user, is longer than a predetermined silence interval duration threshold and thereby determines that the utterance was made by the user to the system, and that detects that the silence interval duration is shorter than the silence interval duration threshold and thereby determines that the utterance was made by the user to a target other than the system.

The response target voice determination device according to claim 1, further comprising:
a voice section determination unit that determines a voice section and a silence section for a time series of sound signals collected by a voice input device;
a feature amount calculation unit that extracts feature amounts corresponding to the voice section, the silence section, or both sections; and
a second silence interval duration threshold calculation unit that obtains a second silence interval duration threshold from the feature amounts, a predetermined feature amount threshold and weight, and a predetermined first silence interval duration threshold,
wherein the response target utterance determination unit performs the determination using the second silence interval duration threshold.

The response target voice determination device according to claim 2, wherein the feature amount calculation unit extracts one or more voice feature amounts from the sound signal corresponding to the voice section, and
the second silence interval duration threshold calculation unit uses the voice feature amounts.

The response target voice determination device according to claim 2, wherein the feature amount calculation unit extracts video feature amounts from the video corresponding to the voice section, from the video corresponding to the silence section, or from the video corresponding to both sections, and
the second silence interval duration threshold calculation unit uses one or more of the video feature amounts.

The response target voice determination device according to claim 2, further comprising a dialogue activity calculation unit, wherein
the voice section determination unit determines a voice section and a silence section for each time series of sound signals of a plurality of channels collected by a plurality of voice input devices,
the dialogue activity calculation unit calculates a dialogue activity level from the temporal relationship between the voice sections and the silence sections of the plurality of channels, and
the second silence interval duration threshold calculation unit calculates the second silence interval duration threshold from the dialogue activity level, a predetermined dialogue activity threshold and weight, and the predetermined first silence interval duration threshold.

The response target voice determination device according to any one of the preceding claims, further comprising a maximum delay time storage unit that stores in advance a maximum delay time, which is the maximum value of the delay before the system responds to the user,
wherein the response target utterance determination unit detects that the second silence interval duration threshold is longer than the maximum delay time and rejects the utterance by the user as a non-response-target utterance.

A response target voice determination method comprising: detecting that a silence interval duration, which is the length of the silence interval following an utterance by a user, is longer than a predetermined silence interval duration threshold and thereby determining that the utterance was made by the user to the system; and detecting that the silence interval duration is shorter than the silence interval duration threshold and thereby determining that the utterance was made by the user to a target other than the system.

The response target voice determination method according to claim 7, further comprising:
determining a voice section and a silence section for a time series of sound signals collected by a voice input device;
extracting feature amounts corresponding to the voice section, the silence section, or both sections; and
obtaining a second silence interval duration threshold from the feature amounts, a predetermined feature amount threshold and weight, and a predetermined first silence interval duration threshold,
wherein the determination of the utterance by the user is performed using the second silence interval duration threshold.

A response target voice determination program causing a computer to execute
a process of detecting that a silence interval duration, which is the length of the silence interval following an utterance by a user, is longer than a predetermined silence interval duration threshold and thereby determining that the utterance was made by the user to the system, and of detecting that the silence interval duration is shorter than the silence interval duration threshold and thereby determining that the utterance was made by the user to a target other than the system.

The response target voice determination program according to claim 9, further causing the computer to execute:
a process of determining a voice section and a silence section for a time series of sound signals collected by a voice input device;
a process of extracting feature amounts corresponding to the voice section, the silence section, or both sections; and
a process of obtaining a second silence interval duration threshold from the feature amounts, a predetermined feature amount threshold and weight, and a predetermined first silence interval duration threshold,
wherein the process of determining the utterance by the user is executed using the second silence interval duration threshold.