JP2015191220A

JP2015191220A - Voice processing system, voice processing method, and program

Info

Publication number: JP2015191220A
Application number: JP2014070718A
Authority: JP
Inventors: 健花沢; Takeshi Hanazawa; 玲史近藤; Reishi Kondou
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2014-03-31
Filing date: 2014-03-31
Publication date: 2015-11-02

Abstract

PROBLEM TO BE SOLVED: To provide a voice processing system, a voice processing method, and a program that can accurately detect voice during output of a response signal even when both the response signal and the input are voice. SOLUTION: A voice processing system includes: voice detection means for detecting voice using the frequency band used for voice detection of an input voice; band estimation means for notifying a response selection unit of a band excluding the frequency band used for the voice detection of the input voice; and response selection means for selecting, from among response voices whose frequency bands are known beforehand, a response voice that contains many components of the notified band.

Description

The present invention relates to a voice processing system, a voice processing method, and a program.

Voice response and voice dialogue systems provide a barge-in function, which lets the user interrupt a response signal or response voice that is being output and start voice input. Barge-in is used, for example, to restate an utterance after a slip of the tongue or a recognition error, to give the next input immediately without waiting for the response, or to start over after a change of mind. However, the response signal leaks into the microphone used for voice input, which degrades the performance of voice detection and voice recognition.

To address this, there is a technique that focuses on the frequency characteristics of the response signal and assigns larger weights to frequency bands in which the response signal has little energy (Patent Document 1). The method described in Patent Document 1 weights each frequency band based on the multi-channel system sound, that is, the response signal (for example, a beep or music), and performs voice/non-voice discrimination by applying smaller weights to bands that are more likely to be contained in the response signal. The method can therefore limit the degradation of voice detection and similar processing when the frequency bands of the response signal and the input voice overlap only slightly.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2012-189907

When both the response signal and the input are voice, such as human speech, as in a voice dialogue system, the frequency bands of the response voice and the input voice overlap heavily. The technique described in Patent Document 1 therefore has the problem that the performance of voice detection and similar processing degrades.

[Object of the invention]
An object of the present invention is to provide a voice processing system, a voice processing method, and a program that can accurately detect voice while a response signal is being output, even when both the response signal and the input are voice.

The present invention is a voice processing system comprising: voice detection means for performing voice detection using the frequency band used for voice detection of an input voice; band estimation means for notifying a response selection unit of a band excluding the frequency band used for voice detection of the input voice; and response selection means for selecting, from among response voices whose frequency bands are known in advance, a response voice that contains many components of the notified band.

The present invention is also a voice processing method comprising: a voice detection step of performing voice detection using the frequency band used for voice detection of an input voice; a band estimation step of notifying a response selection unit of a band excluding the frequency band used for voice detection of the input voice; and a response selection step of selecting, from among response voices whose frequency bands are known in advance, a response voice that contains many components of the notified band.

The present invention is also a program that causes a computer to execute: a voice detection step of performing voice detection using the frequency band used for voice detection of an input voice; a band estimation step of notifying a response selection unit of a band excluding the frequency band used for voice detection of the input voice; and a response selection step of selecting, from among response voices whose frequency bands are known in advance, a response voice that contains many components of the notified band.

Even when both the response signal and the input are voice, as in a voice dialogue system, voice detection can be performed accurately while the response signal is being output.

FIG. 1 is a hardware configuration diagram according to the first embodiment of the present invention.
FIG. 2 is a block diagram according to the first embodiment of the present invention.
FIG. 3 is a flowchart according to the first embodiment of the present invention.
FIG. 4 is a block diagram according to the second embodiment of the present invention.
FIG. 5 is a conceptual diagram according to the second embodiment of the present invention.
FIG. 6 is a flowchart according to the second embodiment of the present invention.
FIG. 7 is a block diagram according to the third embodiment of the present invention.
FIG. 8 is a flowchart according to the third embodiment of the present invention.

Embodiment 1.
FIG. 1 is a hardware configuration diagram of a voice processing system 1 according to the first embodiment of the present invention. As shown in FIG. 1, the voice processing system 1 includes a CPU 10, a memory 12, a hard disk drive (HDD) 14, a communication interface (IF) 16 that communicates data via a network (not shown), a display device 18 such as a display, and an input device 20 that includes a keyboard and a pointing device such as a mouse. These components are connected to one another through a bus 22 and exchange data with one another. The hardware configuration of the voice processing system 1 is not limited to this configuration and can be changed as appropriate. The voice processing system 1 also need not consist of a single computer system and may consist of a plurality of computer systems.

FIG. 2 is a block diagram showing the configuration of the voice processing system according to the first embodiment of the present invention.

As shown in FIG. 2, the voice processing system according to the first embodiment includes voice detection means 111, band estimation means 121, and response selection means 131.

The voice detection means 111 receives an input voice and performs voice detection. When voice is detected, the voice detection means 111 notifies the band estimation means 121 of detection band information indicating the frequency band that contains the voice.

The band estimation means 121 selects a band that excludes at least part of the frequency band indicated by the detection band information, and notifies the response selection means 131 of the selected band as estimated band information.

The response selection means 131 selects a response voice that contains many components of the band indicated by the estimated band information. Here, a voice "contains many components" of a band when the amount (gain) of the voice in that band is large. A response voice is a voice output by the voice processing system, for example a voice whose content responds to the content of an input voice.

Next, the operation of the first embodiment of the present invention will be described in detail. FIG. 3 is a flowchart showing an example of the operation of the first embodiment.

The voice detection means 111 receives an input voice, performs voice detection, and notifies the detection band information (step 101). The band estimation means 121 selects a band that excludes at least part of the frequency band indicated by the detection band information notified from the voice detection means 111, and notifies the response selection means 131 of the estimated band information (step 102). The response selection means 131 selects a response voice that contains many components of the band indicated by the estimated band information (step 103).

According to the present embodiment, a response voice is selected that contains many components of bands other than the frequency band used for voice detection of the input voice (for example, the immediately preceding voice detection). The input signal can therefore be detected accurately even while the response signal is being output. For example, when the input voice is a male voice and voice detection is performed in the male voice frequency band, selecting a female response voice allows the input voice to be detected accurately.
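Steps 101 to 103 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the number of subbands, the `ResponseVoice` record, the energy profiles, and the scoring rule (the share of a response's energy that falls inside the notified band) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

NUM_SUBBANDS = 8  # assumed number of subbands; the source fixes no value


@dataclass(frozen=True)
class ResponseVoice:
    name: str
    band_energy: tuple  # known energy per subband, length NUM_SUBBANDS


def estimate_band(detection_band):
    """Step 102: return the subbands NOT used for voice detection."""
    return set(range(NUM_SUBBANDS)) - set(detection_band)


def band_share(voice, band):
    """Share of the voice's total energy that lies inside `band`."""
    total = sum(voice.band_energy)
    if total == 0:
        return 0.0
    return sum(voice.band_energy[i] for i in band) / total


def select_response(candidates, estimated_band):
    """Step 103: pick the candidate with the largest share in the band."""
    return max(candidates, key=lambda v: band_share(v, estimated_band))


# Hypothetical example: the input voice was detected in subbands 0 to 2.
detection_band = {0, 1, 2}                      # result of step 101
estimated_band = estimate_band(detection_band)  # step 102

candidates = [
    ResponseVoice("low_pitched", (5, 6, 4, 2, 1, 0, 0, 0)),
    ResponseVoice("high_pitched", (0, 1, 2, 5, 6, 4, 2, 1)),
]
chosen = select_response(candidates, estimated_band)
print(chosen.name)  # high_pitched
```

Using an energy share rather than absolute energy keeps a loud recording from winning merely because it is loud; other scoring rules would fit the description equally well.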

Embodiment 2.
FIG. 4 is a block diagram showing the configuration of a voice processing system according to the second embodiment of the present invention. The voice processing system according to the second embodiment corresponds to the first example. The voice processing system according to the second embodiment need not include the voice recognition means 142 described later.

The voice processing system of this embodiment includes a voice response device 2, response voice storage means 212, output means 162, and input means 172. The voice response device 2 includes voice detection means 112, band estimation means 122, response selection means 132, voice recognition means 142, and voice reproduction means 152.

The voice detection means 112 receives an input voice from the input means 172 described later and performs voice detection. When voice is detected, the voice detection means 112 notifies the band estimation means 122 of detection band information. The voice detection means 112 may also notify the voice recognition means 142 of the voice detection result.

The voice detection means 112 may also receive the estimated band information described later from the band estimation means 122 and perform voice detection in the frequency band designated by that band information. When the same user gives voice input repeatedly, detecting voice in the frequency band in which the immediately preceding or an earlier voice was detected improves detection accuracy. At the start of a dialogue, band information from past detection results may not exist. In that case, for example at the start of voice detection, the voice detection means 112 may perform voice detection over the entire frequency band.

The voice detection means 112 may divide the frequency band targeted for voice detection into a plurality of subbands (partial bands) in advance and notify the band estimation means 122 of the subbands that contain voice as the detection band information. The voice detection means 112 may further weight each subband according to the amount (gain) of input voice it contains, and notify the band estimation means 122 of detection band information in which subbands containing more voice carry larger weights. Performing voice detection on subbands is a known technique, so its description is omitted here.
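The subband weighting described above can be sketched as follows. The sketch assumes that a power spectrum of one input frame is already available and uses a simple energy threshold as a stand-in for the actual voice detector, which the source leaves to known techniques; the function names, the threshold, and the normalization of weights to the strongest subband are hypothetical.

```python
def subband_energies(power_spectrum, num_subbands):
    """Split a power spectrum into equal-width subbands and sum each one.

    Trailing bins that do not fill a whole subband are ignored.
    """
    width = len(power_spectrum) // num_subbands
    return [
        sum(power_spectrum[i * width:(i + 1) * width])
        for i in range(num_subbands)
    ]


def detection_band_info(power_spectrum, num_subbands=4, threshold=1.0):
    """Return (subbands judged to contain voice, weight per subband).

    A subband is treated as containing voice when its energy exceeds
    `threshold` (a placeholder rule). Weights are the energies normalized
    so that the strongest subband has weight 1.0.
    """
    energies = subband_energies(power_spectrum, num_subbands)
    peak = max(energies)
    weights = [e / peak if peak > 0 else 0.0 for e in energies]
    active = [i for i, e in enumerate(energies) if e > threshold]
    return active, weights


# Hypothetical 8-bin power spectrum of one input frame.
spectrum = [0.1, 0.2, 3.0, 5.0, 2.0, 2.0, 0.3, 0.1]
active, weights = detection_band_info(spectrum)
print(active, weights)
```

In this example the second and third subbands exceed the threshold, and the second subband, which holds the most energy, receives the largest weight.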

The band estimation means 122 selects a band that excludes at least part of the frequency band indicated by the detection band information and notifies the response selection means 132 of it as estimated band information. The band estimation means 122 may also notify the voice detection means 112 of the estimated band information. In that case, the voice detection means 112 may perform voice detection on the next input voice at frequencies other than those indicated by the estimated band information. The band estimation means 122 may notify estimated band information estimated from the immediately preceding voice detection result, or may notify, as the estimated band information, a frequency band obtained by smoothing the frequency bands estimated from several of the most recent voice detection results.
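One way to smooth over several recent detection results is sketched below. The source does not specify the smoothing method, so the rule used here is an assumption: a subband is excluded from the estimated band when it was used in at least half of the most recent detections. The `BandEstimator` class and its parameters are hypothetical.

```python
from collections import deque


class BandEstimator:
    """Keeps recent detection bands and reports the subbands that were
    rarely used for detection (the estimated band)."""

    def __init__(self, num_subbands, history=3, min_ratio=0.5):
        self.num_subbands = num_subbands
        self.recent = deque(maxlen=history)  # most recent detection bands
        self.min_ratio = min_ratio

    def update(self, detection_band):
        """Record one detection result and return the estimated band."""
        self.recent.append(set(detection_band))
        return self.estimate()

    def estimate(self):
        """Subbands used in fewer than `min_ratio` of recent detections."""
        n = len(self.recent)
        if n == 0:
            return set(range(self.num_subbands))  # nothing to exclude yet
        usage = [
            sum(i in band for band in self.recent) / n
            for i in range(self.num_subbands)
        ]
        return {i for i, u in enumerate(usage) if u < self.min_ratio}


estimator = BandEstimator(num_subbands=6)
estimator.update({0, 1})
estimator.update({1, 2})
estimated = estimator.update({1, 2, 5})
print(sorted(estimated))  # [0, 3, 4, 5]
```

Subband 5 appears in only one of the three detections, so the vote treats it as noise and keeps it in the estimated band, whereas subbands 1 and 2 are consistently used by the speaker and are excluded.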

The response selection means 132 selects, from the response voice storage means 212 described later, a response voice that is appropriate as a response and contains many components of the band indicated by the estimated band information notified from the band estimation means 122. For example, as shown in FIG. 5, the response selection means 132 may select a response voice that consists mainly of components in the band of the estimated band information (the band excluding the band used for voice detection). The response selection means 132 notifies the voice reproduction means 152 of the selected response voice. The response selection means 132 may select, from among the plurality of corresponding responses, a response voice that contains many components of the estimated band information. The response selection means 132 may also select the response voice based on the voice recognition result of the voice recognition means 142 described later.

Furthermore, when the estimated band information is given per subband with weights, the response selection means 132 may select the response voice according to the weights. For example, by lowering the priority of responses that contain many components in heavily weighted subbands, the response selection means 132 enables more accurate voice detection of the input signal. As an example, suppose the frequency axis is divided into eight subbands B1 to B8, where B1 has weight 0, B2 to B3 have large weights, B4 to B5 have small weights, B6 has a large weight, and B7 to B8 have medium weights. In this case, the response selection means 132 preferentially selects, from the response voice candidates, a response that has few components in subbands B2 to B3 and B6 and many components in subbands B4 to B5.
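The B1 to B8 example can be worked through numerically as follows. The numeric values assigned to "large", "medium", and "small", the candidate energy profiles, and the penalty function are assumptions made for illustration; the source states only the qualitative preference.

```python
# Assumed numeric values for the qualitative weights in the text.
LARGE, MEDIUM, SMALL = 1.0, 0.5, 0.1

# Weights for subbands B1..B8 as in the example above.
weights = [0.0, LARGE, LARGE, SMALL, SMALL, LARGE, MEDIUM, MEDIUM]


def overlap_penalty(band_energy, weights):
    """Weighted share of a candidate's energy; lower means less overlap
    with the bands where the input voice is strong."""
    total = sum(band_energy)
    if total == 0:
        return 0.0
    return sum(w * e for w, e in zip(weights, band_energy)) / total


def rank_candidates(candidates, weights):
    """Return candidate names ordered from most to least preferred."""
    return sorted(
        candidates, key=lambda name: overlap_penalty(candidates[name], weights)
    )


# Hypothetical per-subband energy (B1..B8) of three candidate recordings.
candidates = {
    "voice_a": [0, 4, 4, 1, 1, 0, 0, 0],  # mostly B2-B3
    "voice_b": [0, 0, 1, 4, 4, 0, 1, 0],  # mostly B4-B5
    "voice_c": [0, 0, 0, 1, 1, 4, 2, 2],  # mostly B6-B8
}
ranking = rank_candidates(candidates, weights)
print(ranking)  # ['voice_b', 'voice_c', 'voice_a']
```

The candidate concentrated in B4 to B5 ranks first, matching the preference stated in the text. Note that this penalty would also favor a candidate concentrated in B1 (weight 0); whether that is desirable depends on why B1 carries no weight, which the source does not explain.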

The voice recognition means 142 performs voice recognition using the result of the voice detection performed by the voice detection means 112.

The voice reproduction means 152 causes the output means 162 to reproduce the response voice selected by the response selection means 132. The voice reproduction means 152 may be notified by the voice detection means 112 of the timing at which voice detection started, and may stop reproduction on receiving that notification. Because reproduction of the response voice then stops as soon as voice is detected, the voice detection means 112 can perform subsequent voice detection with higher accuracy.

The output means 162 outputs signals such as voice. The output means 162 may be, for example, a speaker.

The input means 172 inputs signals such as voice. The input means 172 may be, for example, a microphone.

The response voice storage means 212 stores, for example, a plurality of responses corresponding to each predetermined input signal.

Next, the operation of the second embodiment of the present invention will be described in detail. FIG. 6 is a flowchart showing an example of the operation of the second embodiment.

The voice detection means 112 receives an input voice, performs voice detection, and notifies the detection band information (step 201). The band estimation means 122 selects a band that excludes at least part of the band indicated by the detection band information and notifies the response selection means 132 of the estimated band information (step 202). The response selection means 132 selects, from the response voice storage means 212, a response voice that is appropriate as a response and contains many components of the estimated band information (step 203). The voice reproduction means 152 causes the output means 162 to reproduce the response voice selected by the response selection means 132 (step 204). The voice detection means 112 performs voice detection on the next input voice and, when voice is detected, notifies the band estimation means 122 of the detection band information, and the process returns to step 202 (step 205).

After step 202, the voice recognition means 142 may perform voice recognition using the voice detection result. In that case, in step 203, the response selection means 132 may select the response voice based on the voice recognition result.

According to the present embodiment, the input voice can be detected accurately while the response voice is being output.

In addition, when the same user gives voice input repeatedly, detecting voice in the frequency band in which the immediately preceding or an earlier voice was detected improves detection accuracy.

In addition, because the voice processing system according to the present embodiment weights each subband at the time of voice detection, it can also select response voices that include bands with small weights, that is, bands in which the gain of the input voice is small. The variety of usable response voices can therefore be expanded while limiting the loss of voice detection accuracy.

Embodiment 3.
FIG. 7 is a block diagram showing the configuration of a voice processing system according to the third embodiment of the present invention. The voice processing system according to the present embodiment corresponds to the second example.

The voice processing system of this embodiment includes a voice response device 3, response voice storage means 213, output means 163, and input means 173. The voice response device 3 includes voice detection means 113, band estimation means 123, response selection means 133, voice recognition means 143, voice reproduction means 153, and scenario reference means 183.

The voice detection means 113, band estimation means 123, response selection means 133, voice reproduction means 153, output means 163, input means 173, and response voice storage means 213 have the same functions as the voice detection means 112, band estimation means 122, response selection means 132, voice reproduction means 152, output means 162, input means 172, and response voice storage means 212, respectively, so their description is omitted.

The voice recognition means 143 recognizes the voice input from the input means 173 using the voice detection result notified from the voice detection means 113. The voice recognition means 143 notifies the scenario reference means 183 of the recognition result.

The scenario reference means 183 refers to scenario storage means 223 and notifies the response selection means 133 of the scenario corresponding to the recognition result notified from the voice recognition means 143.

The response selection means 133 selects a response voice according to the notified scenario.

The scenario storage means 223 stores scenarios that indicate the content of the response corresponding to each voice recognition result. A scenario may be specified at the text level, or may be written in a meta-level representation that allows freedom of phrasing and vocabulary for the same content.

When a scenario is specified at the text level, the response voices for the same text should differ in speaker voice quality and speaking style so that their frequency bands do not overlap one another. When a scenario is written in a meta-level representation, response voices with the same content can exploit differences in phrasing and vocabulary in addition to differences in voice quality, so that their frequency bands overlap even less. For example, the expressions 「〜してください」 ("please do ...") and 「〜をお願いします」 ("I would like to ask you to ..."), both of which express a request, may differ in the frequency bands they mainly use.

Next, the operation of the third embodiment of the present invention will be described in detail. FIG. 8 is a flowchart showing an example of the operation of the third embodiment.

First, the band estimation means 123 performs a first band estimation and notifies the response selection means 133 of the estimated band information (step 301). Next, the response selection means 133 selects a response voice (step 302). Next, the voice reproduction means 153 causes the output means 163 to reproduce the response voice notified from the response selection means 133 (step 303). The voice detection means 113 performs voice detection on the input voice received from the input means 173 (step 304). When voice is detected, the voice detection means 113 notifies the voice recognition means 143 of the detection result (step 305). The band estimation means 123 performs a second band estimation and notifies the response selection means 133 of the estimated band information (step 306). The voice recognition means 143 performs voice recognition using the voice detection result and notifies the scenario reference means 183 of the voice recognition result (step 307). The scenario reference means 183 refers to the scenario storage means 223 (step 308). If a scenario corresponding to the recognition result notified from the voice recognition means 143 exists, the scenario reference means 183 notifies the response selection means 133 of that scenario, and the process returns to step 302 (step 309).
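The scenario lookup and response selection of steps 302 and 306 to 309 can be sketched as follows. Voice detection, voice recognition, and playback are left out, and the scenario table, the stored recordings with their per-subband energies, and all function names are hypothetical placeholders rather than anything given in the source.

```python
# Hypothetical scenario storage: recognition result -> response content.
SCENARIOS = {
    "book a flight": "ask_destination",
    "tokyo": "ask_date",
}

# Hypothetical response voice storage: each content has several recordings
# (different speakers or phrasings) with known per-subband energy.
RESPONSES = {
    "ask_destination": {"dest_low": [6, 3, 1, 0], "dest_high": [0, 1, 3, 6]},
    "ask_date": {"date_low": [5, 4, 1, 0], "date_high": [0, 1, 4, 5]},
}

NUM_SUBBANDS = 4


def estimate_band(detection_band):
    """Steps 301/306: subbands not used for the latest voice detection."""
    return set(range(NUM_SUBBANDS)) - set(detection_band)


def select_response(content, estimated_band):
    """Step 302: among recordings of `content`, maximize the energy share
    inside the estimated band."""
    variants = RESPONSES[content]

    def share(name):
        energy = variants[name]
        total = sum(energy)
        return sum(energy[i] for i in estimated_band) / total if total else 0.0

    return max(variants, key=share)


def dialogue_turn(recognized_text, detection_band):
    """Steps 306 to 309, then 302: look up the scenario and choose a
    recording. Returns None when no scenario matches the recognition result."""
    content = SCENARIOS.get(recognized_text)
    if content is None:
        return None
    return select_response(content, estimate_band(detection_band))


# A user whose voice was detected in the two lowest subbands.
print(dialogue_turn("book a flight", {0, 1}))    # dest_high
print(dialogue_turn("tokyo", {0, 1}))            # date_high
print(dialogue_turn("unknown request", {0, 1}))  # None
```

The scenario decides what to say, and the estimated band decides which recording of that content to play, which mirrors the division of roles between the scenario reference means and the response selection means.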

In the flowcharts used in the above description, the processes are listed in order, but the order in which the processes are executed in each embodiment is not limited to the order listed. In each embodiment, the order of the illustrated steps can be changed to the extent that doing so causes no problem in substance. The embodiments and modifications described above can also be combined to the extent that their contents do not conflict.

DESCRIPTION OF SYMBOLS
1 Voice processing system
2 Voice response device
3 Voice response device
10 CPU
12 Memory
14 HDD
16 Communication IF
18 Display device
20 Input device
22 Bus
111, 112, 113 Voice detection means
121, 122, 123 Band estimation means
131, 132, 133 Response selection means
142, 143 Voice recognition means
152, 153 Voice reproduction means
162, 163 Output means
172, 173 Input means
183 Scenario reference means
212, 213 Response voice storage means
223 Scenario storage means

Claims

1. A voice processing system comprising:
voice detection means for performing voice detection using the frequency band used for voice detection of an input voice;
band estimation means for notifying a response selection unit of a band excluding the frequency band used for voice detection of the input voice; and
response selection means for selecting, from among response voices whose frequency bands are known in advance, a response voice that contains many components of the notified band.

2. The voice processing system according to claim 1, wherein the voice detection means performs voice detection on a frequency band divided in advance into a plurality of partial bands, and the band estimation means notifies the partial bands, among the plurality of partial bands, other than the partial bands used for voice detection.

3. The voice processing system, wherein the voice detection means performs weighting according to the amount of input voice contained in each partial band, and the response selection means selects a response voice that contains little voice in partial bands with large weights.

4. The voice processing system according to any one of claims 1 to 3, wherein the response selection means selects the response voice so that a large number of frequency bands including the immediately previous response voice are included.

5. A voice processing method comprising:
a voice detection step of performing voice detection using the frequency band used for voice detection of an input voice;
a band estimation step of notifying a response selection unit of a band excluding the frequency band used for voice detection of the input voice; and
a response selection step of selecting, from among response voices whose frequency bands are known in advance, a response voice that contains many components of the notified band.

6. The voice processing method according to claim 5, wherein the voice detection step performs voice detection on a frequency band divided in advance into a plurality of partial bands, and the band estimation step notifies the partial bands, among the plurality of partial bands, other than the partial bands used for voice detection.

7. The voice processing method according to claim 6, wherein the voice detection step performs weighting according to the amount of input voice contained in each partial band, and the response selection step selects a response voice that contains little voice in partial bands with large weights.

8. A program that causes a computer to execute:
a voice detection step of performing voice detection using the frequency band used for voice detection of an input voice;
a band estimation step of notifying a response selection unit of a band excluding the frequency band used for voice detection of the input voice; and
a response selection step of selecting, from among response voices whose frequency bands are known in advance, a response voice that contains many components of the notified band.

9. The program according to claim 8, wherein the voice detection step performs voice detection on a frequency band divided in advance into a plurality of partial bands, and the band estimation step notifies the partial bands, among the plurality of partial bands, other than the partial bands used for voice detection.

10. The program according to claim 9, wherein the voice detection step performs weighting according to the amount of input voice contained in each partial band, and the response selection step selects a response voice that contains little voice in partial bands with large weights.