JP2015191219A

JP2015191219A - Voice processing system, voice processing method, and program

Info

Publication number: JP2015191219A
Application number: JP2014070717A
Authority: JP
Inventors: 健花沢; Takeshi Hanazawa; 玲史近藤; Reishi Kondou
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2014-03-31
Filing date: 2014-03-31
Publication date: 2015-11-02

Abstract

PROBLEM TO BE SOLVED: To provide a voice processing system, a voice processing method, and a program that can accurately detect voice during output of a response signal even when both the response signal and the input voice are speech. SOLUTION: The voice processing system includes: response selection means for selecting a response voice; band selection means for selecting at least a part of the band excluding the frequency band of the selected response voice; and voice detection means for performing voice detection on the input signal in at least a part of the selected band.

Description

The present invention relates to a voice processing system, a voice processing method, and a program.

Voice response and voice dialogue systems offer a barge-in function, which lets the user interrupt with voice input while a response signal or response voice is being output. Barge-in is used, for example, when the user rephrases after a slip of the tongue or a speech recognition error, enters the next input immediately without waiting for the response, or starts over after a change of mind. However, when the response signal leaks into the microphone used for voice input, the performance of voice detection and speech recognition degrades.

To address this, there is a method that focuses on the frequency characteristics of the response signal and assigns larger weights to frequency bands that contain less of the response signal (Patent Document 1). The method described in Patent Document 1 weights each frequency band based on multi-channel system sounds, that is, response signals (for example, beeps or music), and performs voice/non-voice discrimination by applying smaller weights to bands that are more likely to contain the response signal. The method described in Patent Document 1 can therefore suppress the degradation of voice detection and related performance when the overlap between the frequency bands of the response signal and the input voice is small.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2012-189907 (JP2012-189907)

When both the response signal and the input voice are speech, such as a human voice, as in a voice dialogue system, the frequency bands of the response voice and the input voice overlap heavily. The technique described in Patent Document 1 therefore has the problem that the performance of voice detection and the like is lowered.

[Object of the invention]
An object of the present invention is to provide a voice processing system, a voice processing method, and a program capable of accurately detecting voice during output of a response signal even when both the response signal and the input voice are speech.

The present invention is a voice processing system comprising: response selection means for selecting a response voice; band selection means for selecting at least a part of the band excluding the frequency band of the selected response voice; and voice detection means for performing voice detection on an input signal in at least a part of the selected band.

The present invention is a voice processing method comprising: a response selection step of selecting a response voice; a band selection step of selecting at least a part of the band excluding the frequency band of the selected response voice; and a voice detection step of performing voice detection on an input signal in at least a part of the selected band.

The present invention is a program that causes a computer to execute: a response selection step of selecting a response voice; a band selection step of selecting at least a part of the band excluding the frequency band of the selected response voice; and a voice detection step of performing voice detection on an input signal in at least a part of the selected band.

With the voice processing system according to the present invention, voice detection can be performed accurately during output of the response signal even when both the response signal and the input voice are speech, as in a voice dialogue system.

FIG. 1 is a hardware configuration diagram according to the first embodiment of the present invention.
FIG. 2 is a block diagram according to the first embodiment of the present invention.
FIG. 3 is a flowchart according to the first embodiment of the present invention.
FIG. 4 is a block diagram according to the second embodiment of the present invention.
FIG. 5 is a conceptual diagram according to the second embodiment of the present invention.
FIG. 6 is a flowchart according to the second embodiment of the present invention.
FIG. 7 is a block diagram according to the third embodiment of the present invention.
FIG. 8 is a flowchart according to the third embodiment of the present invention.

Embodiment 1.
FIG. 1 is a hardware configuration diagram of a voice processing system 1 according to the first embodiment of the present invention. As shown in FIG. 1, the voice processing system 1 includes a CPU 10, a memory 12, a hard disk drive (HDD) 14, a communication interface (IF) 16 that performs data communication via a network (not shown), a display device 18 such as a display, and an input device 20 that includes a keyboard and a pointing device such as a mouse. These components are connected to one another through a bus 22 and exchange data with one another. Note that the hardware configuration of the voice processing system 1 is not limited to this configuration and can be changed as appropriate. Further, the voice processing system 1 need not consist of a single computer system and may consist of a plurality of computer systems.

FIG. 2 is a block diagram showing the configuration of the voice processing system according to the first embodiment of the present invention.

As shown in FIG. 2, the voice processing system according to the first embodiment includes response selection means 111, band selection means 121, and voice detection means 131.

The response selection means 111 selects a response voice whose frequency band is known in advance and notifies the band selection means 121 of it. Here, a response voice is a voice output by the voice processing system. For example, a response voice is a voice whose content replies to the content of a given input voice.

The band selection means 121 selects a band that excludes the frequencies of the response voice selected by the response selection means 111, and notifies the voice detection means 131 of band information, that is, information on the selected band. The band selection means 121 may select a band that excludes at least a part of the frequencies of the response voice. For example, the band selection means 121 may select a band that excludes those frequencies in which the response voice is most strongly present.
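As a non-limiting illustration, the band selection described above can be sketched in Python as the complement of the response voice bands within a detection range. The `(low, high)` band representation in Hz, the function name `select_band`, and the 0 to 8000 Hz range are assumptions made for this sketch and do not appear in the disclosure.

```python
from typing import List, Tuple

Band = Tuple[float, float]  # (low_hz, high_hz); hypothetical representation


def select_band(response_bands: List[Band],
                full_band: Band = (0.0, 8000.0)) -> List[Band]:
    """Return the parts of full_band not covered by any response band."""
    lo, hi = full_band
    selected: List[Band] = []
    cursor = lo  # lowest frequency not yet known to be excluded
    for b_lo, b_hi in sorted(response_bands):
        b_lo, b_hi = max(b_lo, lo), min(b_hi, hi)
        if b_hi <= b_lo:
            continue  # band lies outside the detection range
        if b_lo > cursor:
            selected.append((cursor, b_lo))
        cursor = max(cursor, b_hi)
    if cursor < hi:
        selected.append((cursor, hi))
    return selected


# A response voice assumed to occupy 300 Hz to 3400 Hz.
band_info = select_band([(300.0, 3400.0)])
```

Here `band_info` plays the role of the band information passed to the voice detection means; overlapping response bands are merged implicitly by the moving `cursor`.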

The voice detection means 131 performs voice detection on the input signal using the band information. The voice detection means 131 may perform voice detection using at least a part of the selected band.
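The disclosure does not fix a particular detection criterion. As one possible sketch, assuming a simple energy threshold applied only to DFT bins inside the selected band, detection could look like the following. The function names, the threshold, and the toy tones standing in for response and input voice are all illustrative assumptions.

```python
import cmath
import math
from typing import List, Sequence, Tuple

Band = Tuple[float, float]  # (low_hz, high_hz); hypothetical representation


def band_energy(frame: Sequence[float], sample_rate: float,
                bands: List[Band]) -> float:
    """Sum the DFT power of `frame` over bins that fall inside `bands`."""
    n = len(frame)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if not any(lo <= freq < hi for lo, hi in bands):
            continue  # bin lies in an excluded (response voice) band
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        energy += abs(coeff) ** 2 / n
    return energy


def detect_voice(frame: Sequence[float], sample_rate: float,
                 bands: List[Band], threshold: float) -> bool:
    """Assumed criterion: energy in the selected bands exceeds a threshold."""
    return band_energy(frame, sample_rate, bands) > threshold


# Toy signals: a 1000 Hz "response" tone and a 3500 Hz "input" tone.
RATE, N = 8000.0, 80
response = [math.sin(2 * math.pi * 1000 * t / RATE) for t in range(N)]
user = [math.sin(2 * math.pi * 3500 * t / RATE) for t in range(N)]
mixed = [r + u for r, u in zip(response, user)]
selected = [(3400.0, 4000.0)]  # band outside the response tone

response_only_detected = detect_voice(response, RATE, selected, 1.0)
barge_in_detected = detect_voice(mixed, RATE, selected, 1.0)
```

Because the response tone lies outside the selected band, it contributes essentially nothing to the measured energy, so only the frame that also contains the input tone is flagged. A practical detector would use an FFT and a speech-specific criterion rather than this naive DFT.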

Next, the operation of the first embodiment for carrying out the present invention will be described in detail. FIG. 3 is a flowchart showing an example of the operation of the first embodiment.

The response selection means 111 selects a response voice whose frequency band is known in advance and notifies the voice detection means 131 of the selected response voice (step 101). The band selection means 121 selects a band that excludes at least a part of the frequency band of the response voice selected by the response selection means 111, and notifies the voice detection means 131 of band information, that is, information on the selected band (step 102). The voice detection means 131 performs voice detection in at least a part of the selected band (step 103).

When the voice detection means 131 detects voice, the process returns to step 101 (step 104).

According to the present embodiment, voice detection on the input signal uses a band other than the frequency band of the response voice, so voice in the input signal can be detected even while the response signal is being output.

Embodiment 2.
FIG. 4 is a block diagram showing the configuration of a voice processing system according to the second embodiment of the present invention. The voice processing system according to the second embodiment corresponds to a first example. The voice processing system according to the second embodiment may be a voice dialogue device.

The voice processing system of the present embodiment includes a voice response device 2, response voice storage means 212, output means 162, and input means 172. The voice response device 2 includes response selection means 112, band selection means 122, voice detection means 132, voice recognition means 142, and voice reproduction means 152.

The response selection means 112 selects a response voice from among one or more response voices that are stored in the response voice storage means 212 and whose frequency bands are known in advance. The response selection means 112 then notifies the band selection means 122 and the voice reproduction means 152 of the selected response voice.

The band selection means 122 selects a band that excludes the frequency band of the response voice selected by the response selection means 112, and notifies the voice detection means 132 of the band information.

The band selection means 122 may report, as the band information, the band that excludes every band containing the response voice over its whole duration, or it may report, together with time information, the band that excludes the bands containing the response voice in each fixed time unit. For example, as shown in FIG. 5, the band selection means 122 may extract the bands containing the response voice per unit time (for example, per processing frame) and set the complementary bands as the band information. In that case, the response selection means 112 may select the response voice so that the frequency band used immediately before continues to be used as far as possible. As a result, when the same user makes voice inputs in succession, the overlap between the frequency bands of the response voice and the input voice becomes smaller. Using the band that excludes every band containing the response voice as the band information allows voice detection at low cost, whereas updating the band information in fixed time units allows more accurate voice detection.
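The two ways of forming band information described above can be contrasted in a short Python sketch. It assumes the response voice is summarized as, for each processing frame, the set of subband indices it occupies; this representation and the function names are hypothetical.

```python
from typing import List, Set, Tuple


def static_band_info(occupied_per_frame: List[Set[int]],
                     n_subbands: int) -> Set[int]:
    """Subbands free of response voice over the whole utterance (low cost)."""
    used = set().union(*occupied_per_frame)
    return set(range(n_subbands)) - used


def framewise_band_info(occupied_per_frame: List[Set[int]],
                        n_subbands: int) -> List[Tuple[int, Set[int]]]:
    """Per-frame complement, paired with the frame index as time information."""
    all_bands = set(range(n_subbands))
    return [(t, all_bands - occupied)
            for t, occupied in enumerate(occupied_per_frame)]


# Response voice whose occupied subbands drift upward over three frames.
occupied = [{0, 1}, {1, 2}, {2, 3}]
static_info = static_band_info(occupied, n_subbands=6)
framewise_info = framewise_band_info(occupied, n_subbands=6)
```

In this toy case the static variant leaves only subbands 4 and 5 for detection, while the frame-wise variant leaves four subbands in every frame, which illustrates why the time-varying approach can detect voice more accurately at the price of more bookkeeping.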

The band selection means 122 may also divide the frequency band subject to voice detection into a plurality of subbands in advance and select the applicable subbands discretely. Furthermore, the band selection means 122 weights each subband according to the amount of response voice it contains. The band selection means 122 may assign a smaller weight to a subband that contains more of the response voice. Techniques for performing voice detection per subband are well known, so their description is omitted here.
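One way to realize such weighting, offered only as a sketch, is to map the response voice energy in each subband to a weight between 0 and 1 that decreases linearly with that energy. The linear rule and the name `subband_weights` are assumptions; the disclosure only requires that subbands containing more response voice receive smaller weights.

```python
from typing import List


def subband_weights(response_energy: List[float]) -> List[float]:
    """Weight each subband: more response voice energy gives a smaller weight."""
    peak = max(response_energy, default=0.0)
    if peak <= 0.0:
        # No response voice anywhere: every subband is fully trusted.
        return [1.0] * len(response_energy)
    return [1.0 - e / peak for e in response_energy]


# Hypothetical response voice energy in four subbands.
weights = subband_weights([8.0, 4.0, 0.0, 2.0])
```

With these inputs the subband carrying the most response voice receives weight 0, and the empty subband receives weight 1.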

The voice detection means 132 receives the input voice from the input means 172 described later and the band information from the band selection means 122, and performs voice detection on the input voice.

When the voice detection means 132 detects voice, it may notify the voice reproduction means 152 described later. On receiving the notification, the voice reproduction means 152 may stop playback. The voice processing system according to the present embodiment thus stops playing the response voice as soon as voice is correctly detected, which allows subsequent processing such as voice detection and speech recognition to be performed more accurately.
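The interaction between detection and playback can be sketched as a frame loop that stops emitting response frames once the detector fires. The frame-synchronous loop, the string stand-ins for audio frames, and the amplitude-based placeholder detector are simplifying assumptions for illustration only.

```python
from typing import Callable, List, Sequence


def play_with_barge_in(response_frames: Sequence[str],
                       mic_frames: Sequence[Sequence[float]],
                       detect: Callable[[Sequence[float]], bool]) -> List[str]:
    """Play response frames in order; stop as soon as voice is detected."""
    played: List[str] = []
    for out_frame, in_frame in zip(response_frames, mic_frames):
        if detect(in_frame):
            break  # detection notified: playback stops immediately
        played.append(out_frame)
    return played


def placeholder_detect(frame: Sequence[float]) -> bool:
    """Stand-in detector; a real one would use the selected band."""
    return max(abs(x) for x in frame) > 0.5


played = play_with_barge_in(
    ["r0", "r1", "r2", "r3"],
    [[0.0], [0.1], [0.9], [0.0]],
    placeholder_detect,
)
```

In this example the user starts speaking during the third frame, so only the first two response frames are played.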

When the band information received from the band selection means 122 is weighted per subband, the voice detection means 132 may change the detection threshold according to the weight. For example, the voice detection means 132 may treat the detection result in a heavily weighted subband as more reliable. This allows the voice detection means 132 to detect voice more accurately. The voice recognition means 142 performs speech recognition on the voice input through the input means 172 described later. Further, the response selection means 112 selects a response voice based on the recognition result of the voice recognition means 142.
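A weighted decision of this kind might be sketched as a weighted vote across subbands, where a subband with a smaller weight needs more energy to count and contributes less to the final decision. The specific rule (threshold divided by weight, majority of total weight) is an assumption chosen for illustration; the disclosure leaves the exact rule open.

```python
from typing import List


def weighted_detection(input_energy: List[float], weights: List[float],
                       base_threshold: float = 1.0,
                       decision_ratio: float = 0.5) -> bool:
    """Combine per-subband decisions, trusting heavily weighted subbands more."""
    total = sum(weights)
    if total <= 0.0:
        return False  # no usable subband
    votes = 0.0
    for energy, w in zip(input_energy, weights):
        if w <= 0.0:
            continue  # subband dominated by response voice: ignored
        # Assumed rule: a less trusted subband needs more energy to count.
        if energy > base_threshold / w:
            votes += w
    return votes / total >= decision_ratio


weights = [0.0, 0.5, 1.0, 0.75]
with_user = weighted_detection([9.0, 1.5, 3.0, 2.0], weights)
response_only = weighted_detection([9.0, 1.5, 0.0, 0.5], weights)
```

The large energy in the first subband is ignored in both calls because its weight is 0, so the decision depends on the subbands where the response voice is weak.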

The voice reproduction means 152 causes the output means 162 to play the response voice selected by the response selection means 112.

The output means 162 outputs signals such as voice. The output means 162 may be, for example, a speaker.

The input means 172 inputs signals such as voice. The input means 172 may be, for example, a microphone.

The response voice storage means 212 stores response voices.

Next, the operation of the second embodiment for carrying out the present invention will be described in detail. FIG. 6 is a flowchart showing an example of the operation of the second embodiment.

The response selection means 112 selects, from the response voice storage means 212, a response voice whose frequency band is known in advance, and notifies the voice reproduction means 152 and the band selection means 122 of it (step 201). For example, at system startup the response selection means 112 may select a response voice suited to opening a dialogue, such as "Hello". The band selection means 122 selects a band that excludes the frequency band of the response voice notified by the response selection means 112, and notifies the voice detection means 132 of the band information (step 202). The voice reproduction means 152 plays the response voice notified by the response selection means 112 through the output means 162 (step 203).

The voice detection means 132 receives the input voice from the input means 172 and the band information from the band selection means 122, and performs voice detection on the input voice (step 204). When the voice detection means 132 detects voice, the voice recognition means 142 performs speech recognition using the voice detection result, and the process returns to step 201 (steps 205 and 206).

According to the present embodiment, voice detection on the input signal is performed in a band other than the frequency band of the response voice, so voice in the input signal can be detected even while the response signal is being output. In particular, when the frequency bands of the response voice and the input voice tend to overlap, changing the voice detection band to follow the temporal change of the response voice band allows voice to be detected more accurately. In addition, selecting the response voice so that the frequency band used immediately before continues to be used as far as possible makes it possible to avoid frequency band overlap more reliably when the same user makes voice inputs in succession.
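The preference for continuing to use the previous frequency band can be sketched as choosing, among candidate response voices, the one whose occupied subbands overlap most with those of the preceding response voice. The candidate names and the subband-set representation are hypothetical; in practice the candidates would presumably be recordings of the same message with different frequency characteristics.

```python
from typing import Dict, Set


def select_response(candidates: Dict[str, Set[int]],
                    previous_bands: Set[int]) -> str:
    """Pick the candidate sharing the most subbands with the previous response."""
    return max(candidates,
               key=lambda name: len(candidates[name] & previous_bands))


# Hypothetical candidates, each mapped to the subbands it occupies.
candidates = {
    "resp_a": {0, 1, 2},
    "resp_b": {3, 4, 5},
    "resp_c": {1, 2, 3},
}
chosen = select_response(candidates, previous_bands={1, 2, 3})
```

Keeping the response voice in the same subbands from turn to turn keeps the detection band stable as well, which is what helps when the same user speaks repeatedly.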

Embodiment 3.
FIG. 7 is a block diagram showing the configuration of a voice processing system according to the third embodiment of the present invention. The voice processing system according to the present embodiment corresponds to a second example.

The voice processing system of the present embodiment includes a voice response device 3, response voice storage means 213, output means 163, and input means 173. The voice response device 3 includes response selection means 113, band selection means 123, voice detection means 133, voice recognition means 143, voice reproduction means 153, and scenario reference means 183.

The response selection means 113, band selection means 123, voice detection means 133, voice reproduction means 153, output means 163, input means 173, and response voice storage means 213 have the same functions as the response selection means 112, band selection means 122, voice detection means 132, voice reproduction means 152, output means 162, input means 172, and response voice storage means 212, respectively, so their description is omitted.

The voice recognition means 143 recognizes the voice input from the input means 173 using the voice detection result notified by the voice detection means 133. The voice recognition means 143 notifies the scenario reference means 183 of the recognition result.

The scenario reference means 183 refers to the scenario storage means 223 and notifies the response selection means 113 of the scenario corresponding to the recognition result notified by the voice recognition means 143.

The response selection means 113 selects a response voice according to the notified scenario.

The scenario storage means 223 stores scenarios that indicate the content of the response corresponding to each speech recognition result.

Next, the operation of the third embodiment for carrying out the present invention will be described in detail. FIG. 8 is a flowchart showing an example of the operation of the third embodiment.

First, the response selection means 113 selects a response voice from among one or more response voices that are stored in the response voice storage means 213 and whose frequency bands are known in advance. The response selection means 113 then notifies the band selection means 123 and the voice reproduction means 153 of the selected response voice (step 301).

The band selection means 123 selects a band that excludes the frequency band of the response voice notified by the response selection means 113, and notifies the voice detection means 133 of the selected band (step 302).

The voice reproduction means 153 plays the response voice notified by the response selection means 113 through the output means 163 (step 303).

The voice detection means 133 receives the input voice from the input means 173 and the band information from the band selection means 123, performs voice detection, and, when voice is detected, notifies the voice recognition means 143 of the detection result (step 304).

When the voice detection means 133 detects voice, the voice recognition means 143 performs speech recognition using the voice detection result. The voice recognition means 143 then notifies the scenario reference means 183 of the recognition result (steps 305 and 306). Techniques for performing speech recognition using a voice detection result are well known, so their description is omitted here.

The scenario reference means 183 refers to the scenario storage means 223 and, if response content corresponding to the recognition result exists, notifies the response selection means 113 of it, and the process returns to step 301 (steps 307 and 308).
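As a sketch of the scenario lookup in steps 307 and 308, the scenario storage can be modeled as a mapping from recognition results to response identifiers, with the lookup returning nothing when no scenario matches. The dictionary contents and the names `SCENARIOS` and `refer_scenario` are hypothetical examples, not part of the disclosure.

```python
from typing import Dict, Optional

# Hypothetical scenario storage: recognition result -> response identifier.
SCENARIOS: Dict[str, str] = {
    "hello": "greeting_reply",
    "what is the weather": "weather_reply",
}


def refer_scenario(recognition_result: str,
                   scenarios: Dict[str, str] = SCENARIOS) -> Optional[str]:
    """Return the response for a recognition result, or None if none exists."""
    return scenarios.get(recognition_result)


matched = refer_scenario("hello")
unmatched = refer_scenario("unknown utterance")
```

Only a non-`None` result would be passed on to the response selection means; how the system behaves when no scenario matches is not specified in the text above.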

According to the present embodiment, even a voice processing system that responds based on scenarios can detect voice in the input signal while the response signal is being output.

In the flowcharts used in the above description, the processes are listed in a particular order, but the order in which the processes of each embodiment are executed is not limited to the order listed. In each embodiment, the order of the illustrated steps can be changed as long as the change does not impair the content. The embodiments and modifications described above can also be combined as long as their contents do not conflict.

Description of symbols
1 Voice processing system
2 Voice response device
3 Voice response device
10 CPU
12 Memory
14 HDD
16 Communication IF
18 Display device
20 Input device
22 Bus
111, 112, 113 Response selection means
121, 122, 123 Band selection means
131, 132, 133 Voice detection means
142, 143 Voice recognition means
152, 153 Voice reproduction means
162, 163 Output means
172, 173 Input means
183 Scenario reference means
212, 213 Response voice storage means
223 Scenario storage means

Claims

1. A voice processing system comprising:
response selection means for selecting a response voice;
band selection means for selecting at least a part of the band excluding the frequency band of the selected response voice; and
voice detection means for performing voice detection on an input signal in at least a part of the selected band.

2. The voice processing system according to claim 1, wherein said band selecting means selects a band excluding a frequency band of said selected response voice on a time unit basis.

3. The voice processing system according to claim 1, wherein the band selection means weights each subband so that the weight becomes smaller as the amount of the selected response voice contained in the subband becomes larger, and
the voice detection means performs voice detection on the input voice based on the weight assigned to each subband.

4. The voice processing system according to any one of claims 1 to 3, wherein the response selection means selects the response voice so that it includes many of the frequency bands contained in the immediately preceding response voice.

5. A voice processing method comprising:
a response selection step of selecting a response voice;
a band selection step of selecting at least a part of the band excluding the frequency band of the selected response voice; and
a voice detection step of performing voice detection on an input signal in at least a part of the selected band.

6. The voice processing method according to claim 5, wherein the band selection step selects a band excluding a frequency band of the selected response voice in units of time.

7. The voice processing method according to claim 5 or 6, wherein the band selection step weights each subband so that the weight becomes smaller as the amount of the selected response voice contained in the subband becomes larger, and
the voice detection step performs voice detection on the input voice based on the weight assigned to each subband.

8. A program for causing a computer to execute:
a response selection step of selecting a response voice;
a band selection step of selecting at least a part of the band excluding the frequency band of the selected response voice; and
a voice detection step of performing voice detection on an input signal in at least a part of the selected band.

9. The program according to claim 8, wherein said band selecting step selects a band excluding a frequency band of said selected response voice in units of time.

10. The program according to claim 8 or 9, wherein the band selection step weights each subband so that the weight becomes smaller as the amount of the selected response voice contained in the subband becomes larger, and
the voice detection step performs voice detection on the input voice based on the weight assigned to each subband.