JP6922551B2

JP6922551B2 - Voice processing device, voice processing program, and voice processing method

Info

Publication number: JP6922551B2
Application number: JP2017161459A
Authority: JP
Inventors: 尚也川畑
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2017-08-24
Filing date: 2017-08-24
Publication date: 2021-08-18
Anticipated expiration: 2037-08-24
Also published as: JP2019041225A

Description

The present invention relates to a voice processing device, a voice processing program, and a voice processing method, and can be applied, for example, to call start processing in a video conferencing system, a conference call system, or the like.

In recent years, there have been increasing opportunities to talk and communicate with remote locations, for example in video conferences and telework, using remote call systems such as video conferencing systems and telephone conferencing systems.

In a remote call system, a connection to a remote location is usually established either by entering or selecting a contact such as a telephone number on the system's screen, or by touching the image of the other party displayed on the screen.

Further, Patent Document 1 proposes a communication support robot system in which a remote call system is built into a robot to support communication between close relatives and elderly people living alone.

In the communication support robot system described in Patent Document 1, touching the image of a close relative or elderly person displayed on a touch panel display connects the user to that party and starts the call.

Patent Document 1: Japanese Unexamined Patent Publication No. 2015-184597

However, in the communication support robot system described in Patent Document 1, starting a call through a connection operation such as touching the touch panel display, or through a connection command, differs from starting an actual face-to-face conversation, so the sense of presence (the feeling of conversing face to face) is very low.

Suppose instead that the voice recognition system mounted on the communication support robot is used so that, for example, calling the name of the other party feeds the spoken call (hereinafter, the "calling voice") into the voice recognition system, the connection destination is determined from the recognition result, and the connection is started. Even then, the calling voice is input to the voice recognition process, and the other party is determined and connected only from the recognition result, so the calling voice itself is never conveyed to the other party. The other party is therefore connected abruptly, feels discomfort or anxiety, and the sense of presence does not improve.

Therefore, for video conferencing systems and the like in which a connection to the other party is made by a calling voice, a voice processing device is desired that conveys the calling voice to the other party and then starts the call.

The voice processing device of the first aspect of the present invention comprises: (1) a buffer unit, which is a buffer for having the connection command voice, transmitted after connection to the other party, reproduced on the other party's side, and which holds an input signal including the connection command voice for a certain period; (2) a voice recognition unit that performs voice recognition on the input signal; (3) a command determination unit that uses the result of the voice recognition unit to determine whether the input signal is the connection command voice; and (4) an output switching unit that, when the command determination unit determines that the input signal is the connection command voice, outputs the input signal held in the buffer unit and, once the voice held in the buffer unit has been output, switches to outputting the input signal.

The voice processing program of the second aspect of the present invention causes a computer to function as: (1) a buffer unit, which is a buffer for having the connection command voice, transmitted after connection to the other party, reproduced on the other party's side, and which holds an input signal including the connection command voice for a certain period; (2) a voice recognition unit that performs voice recognition on the input signal; (3) a command determination unit that uses the result of the voice recognition unit to determine whether the input signal is the connection command voice; and (4) an output switching unit that, when the command determination unit determines that the input signal is the connection command voice, outputs the input signal held in the buffer unit and, once the voice held in the buffer unit has been output, switches to outputting the input signal.

The voice processing method of the third aspect of the present invention uses a buffer unit, a voice recognition unit, a command determination unit, and an output switching unit, wherein: (1) the buffer unit is used to have the connection command voice, transmitted after connection to the other party, reproduced on the other party's side, and holds an input signal including the connection command voice for a certain period; (2) the voice recognition unit performs voice recognition on the input signal; (3) the command determination unit uses the result of the voice recognition unit to determine whether the input signal is the connection command voice; and (4) the output switching unit, when the command determination unit determines that the input signal is the connection command voice, outputs the input signal held in the buffer unit and, once the voice held in the buffer unit has been output, switches to outputting the input signal.

According to the present invention, when connecting to the other party in a video conferencing system or the like, the calling voice is conveyed to the other party. This reproduces a situation close to a face-to-face conversation, in which the conversation begins with the calling voice, so that both parties can feel a high sense of presence.

FIG. 1 is a block diagram showing the configuration of the voice processing device according to the first embodiment.
FIG. 2 is an explanatory diagram showing an example of the command list unit according to the first embodiment.
FIG. 3 is a block diagram showing the configuration of the voice processing device according to the second embodiment.
FIG. 4 is a block diagram showing the configuration of the voice processing device according to the third embodiment.
FIG. 5 is a block diagram showing a configuration in which the voice processing device according to a modified embodiment is applied to a video conferencing system.

(A) First Embodiment
Embodiments of the voice processing device, the voice processing program, and the voice processing method of the present invention are described in detail below with reference to the drawings.

The first embodiment illustrates a case in which the voice processing device, voice processing program, and voice processing method of the present invention described above are applied to the microphone input unit of, for example, a video conferencing system or a telephone conferencing system.

(A-1) Configuration of the First Embodiment
FIG. 1 is a block diagram showing the configuration of the voice processing device 100 according to the first embodiment.

The voice processing device 100 of the first embodiment may, for example, be built as a dedicated board, may be realized by writing a voice processing program to a DSP (digital signal processor), or may be realized by a CPU and software (a voice processing program) executed by the CPU. Functionally, it can be represented as in FIG. 1.

In FIG. 1, the voice processing device 100 according to the first embodiment of the present invention includes a microphone 101, a microphone amplifier 102, an AD converter 103, and a call processing unit 104.

The microphone 101 picks up human speech and other sounds.

The microphone amplifier 102 amplifies the input signal picked up by the microphone 101.

The AD converter 103 converts the signal amplified by the microphone amplifier 102 from an analog signal to a digital signal. Hereinafter, the signal converted by the AD converter 103 is referred to as the "microphone input signal".

The call processing unit 104 outputs the incoming microphone input signal to the output terminal and, at the same time, stores the microphone input signal in the audio buffer. In addition, the call processing unit 104 performs voice recognition on the microphone input signal. When the recognition result matches one of the commands in the command list unit, it switches to outputting the sound signal stored in the audio buffer for a certain period, and once that output is complete it outputs the microphone input signal again.

Next, the detailed configuration of the call processing unit 104 is described.

The call processing unit 104 includes an input terminal 105, an audio buffer unit 106, a voice recognition unit 107, a command list unit 108, a command determination unit 109, an output switching unit 110, and an output terminal 111.

The input terminal 105 is an interface through which the microphone input signal is input to the call processing unit 104.

The audio buffer unit 106 is a buffer that holds the microphone input signal for a certain period.

The voice recognition unit 107 performs voice recognition on the microphone input signal and outputs the recognition result.

The command list unit 108 is a list in which commands are held. In the command list unit 108, the list of commands is held, for example, as a text file as shown in FIG. 2. FIG. 2 is only an example; various values and formats can be used for the content and format of the held data.

The command determination unit 109 determines whether the voice recognition result exists in the command list of the command list unit 108, and outputs the determination result.

The output switching unit 110 decides which sound signal to output based on the command determination result, and outputs that sound signal.

The output terminal 111 is an interface that outputs the sound signal of the call processing unit 104.

(A-2) Operation of the First Embodiment
The operation of the voice processing device 100 according to the first embodiment of the present invention is described in detail below.

First, when the voice processing device 100 starts operating, an analog sound signal in which sound signals such as a speaker's voice and environmental sounds are superimposed is input to the microphone 101.

The input signal picked up by the microphone 101 is amplified by the microphone amplifier 102, converted from an analog signal to a digital signal by the AD converter 103, and input to the input terminal 105 of the call processing unit 104 as the microphone input signal x(n).

When a signal starts to be input to the input terminal 105 of the call processing unit 104, the call processing unit 104 first outputs the microphone input signal x(n) to the output switching unit 110.

When the voice processing device 100 is operating, the output switching unit 110 outputs a silent signal as the output signal y(n) to the output terminal 111, as shown in equation (1) below.
y(n) = 0 ... (1)

At the same time, the call processing unit 104 stores the microphone input signal x(n) at the write position write_index of the audio buffer buffer(n) of the audio buffer unit 106, in accordance with equation (2). After storing it, the call processing unit 104 advances (increments) the write position write_index, as shown in equation (3).

BUFFER_SIZE in equation (3) is the length of the audio buffer of the audio buffer unit 106.
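
Equations (2) and (3) are not reproduced in this text. Assuming the audio buffer is a circular buffer whose write position wraps around at BUFFER_SIZE (an assumption based on the description of storing and then incrementing write_index), a reconstruction consistent with that description might read:

```latex
\begin{align}
\mathrm{buffer}(\mathrm{write\_index}) &= x(n) \tag{2} \\
\mathrm{write\_index} &= (\mathrm{write\_index} + 1) \bmod \mathrm{BUFFER\_SIZE} \tag{3}
\end{align}
```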

Further, the call processing unit 104 simultaneously has the voice recognition unit 107 perform voice recognition on the microphone input signal x(n), and outputs the recognition result to the command determination unit 109.

The command determination unit 109 compares the voice recognition result with the list of commands held in the command list unit 108, determines whether the recognition result exists in the list of commands, and outputs the determination result to the output switching unit 110.
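
The command determination step can be sketched as a simple membership test. This is only an illustration under stated assumptions: the function names, the one-command-per-line text format, and the example commands are hypothetical (the contents of FIG. 2 are not reproduced here), and the recognition result is assumed to arrive as a plain text string.

```python
def load_command_list(text: str) -> set:
    """Parse a command list held as plain text, one command per line (assumed format)."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def is_connection_command(recognition_result: str, commands: set) -> bool:
    """Return True when the recognized text exactly matches a listed command."""
    return recognition_result.strip() in commands


# Hypothetical command list contents, for illustration only.
COMMAND_LIST_TEXT = """\
hello tanaka
call the osaka office
"""

commands = load_command_list(COMMAND_LIST_TEXT)
matched = is_connection_command("hello tanaka", commands)
unmatched = is_connection_command("good morning", commands)
```

Exact string matching is the simplest choice; a practical system might normalize the recognized text or tolerate recognition errors, which the source does not specify.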

When the command determination unit 109 determines that the recognition result of the voice recognition unit 107 does not exist in the list of commands of the command list unit 108, the output switching unit 110 continues to output the silent signal to the output terminal 111.

On the other hand, when the command determination unit 109 determines that the recognition result of the voice recognition unit 107 exists in the list of commands of the command list unit 108, the output switching unit 110 calculates the read position read_index of the audio buffer unit 106 in accordance with equation (4).

LEN in equation (4) is the length of the microphone input signal held in the audio buffer unit 106 that is to be reproduced. Various methods can be used to determine LEN. For example, it may be a constant, such as the same length as the buffer size of the audio buffer unit 106 (LEN = BUFFER_SIZE). Alternatively, voice section processing may be applied to the microphone input signal held in the audio buffer unit 106 to obtain the length of the calling voice, and that length may be used as LEN.
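
Equation (4) is not reproduced in this text. One plausible form, assuming a circular buffer and that reproduction starts LEN samples before the current write position (both assumptions, with the modulo taken as non-negative), is:

```latex
\begin{equation}
\mathrm{read\_index} = (\mathrm{write\_index} - \mathrm{LEN}) \bmod \mathrm{BUFFER\_SIZE} \tag{4}
\end{equation}
```

With LEN = BUFFER_SIZE this reduces to read_index = write_index, that is, reproduction starts from the oldest sample in the buffer.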

Then, as shown in equation (5), the output switching unit 110 outputs the sound signal held in the audio buffer unit 106 as the output signal y(n) to the output terminal 111 for a certain period (for example, for the duration of LEN), and advances (increments) the read position read_index as shown in equation (6).
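
Equations (5) and (6) are not reproduced in this text. A reconstruction consistent with equation (8) of the second embodiment, which uses the term buffer(read_index), and with an assumed wrap-around of the read position at BUFFER_SIZE, might read:

```latex
\begin{align}
y(n) &= \mathrm{buffer}(\mathrm{read\_index}) \tag{5} \\
\mathrm{read\_index} &= (\mathrm{read\_index} + 1) \bmod \mathrm{BUFFER\_SIZE} \tag{6}
\end{align}
```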

After outputting the sound signal held in the audio buffer unit 106 for the certain period, the output switching unit 110 outputs the microphone input signal x(n) as the output signal y(n) to the output terminal 111, as shown in equation (7) below.
y(n) = x(n) ... (7)
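
The operation of the first embodiment can be sketched as a per-sample routine. This is a minimal illustration, not the patented implementation: the class name `CallProcessor`, the `command_detected` flag (standing in for the result of the command determination unit 109), and the modulo index arithmetic are assumptions, since equations (2) to (6) are not reproduced in this text.

```python
class CallProcessor:
    """Sketch of the call processing unit: silence, then buffered replay, then live input."""

    def __init__(self, buffer_size: int, replay_len: int):
        self.buffer = [0.0] * buffer_size   # audio buffer (circular, assumed)
        self.size = buffer_size             # BUFFER_SIZE
        self.replay_len = replay_len        # LEN
        self.write_index = 0
        self.read_index = 0
        self.remaining = 0                  # buffered samples still to replay
        self.connected = False              # True once the replay has finished

    def process(self, x: float, command_detected: bool = False) -> float:
        # Store the sample and advance the write position (equations (2), (3)).
        self.buffer[self.write_index] = x
        self.write_index = (self.write_index + 1) % self.size

        # On detection, start replay LEN samples back (assumed form of equation (4)).
        if command_detected and not self.connected and self.remaining == 0:
            self.read_index = (self.write_index - self.replay_len) % self.size
            self.remaining = self.replay_len

        if self.remaining > 0:
            y = self.buffer[self.read_index]                     # equation (5)
            self.read_index = (self.read_index + 1) % self.size  # equation (6)
            self.remaining -= 1
            if self.remaining == 0:
                self.connected = True
            return y

        # Equation (7) after the replay, equation (1) before any command.
        return x if self.connected else 0.0


proc = CallProcessor(buffer_size=8, replay_len=3)
flags = [False, False, False, True, False, False, False]
out = [proc.process(float(n + 1), f) for n, f in enumerate(flags)]
```

In this run the command is flagged on the fourth sample, so the output is silent for three samples, replays the three most recent buffered samples, and then passes the live input through. The live samples that arrive during the replay are not output, which is the gap the later embodiments address.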

(A-3) Effects of the First Embodiment
As described above, according to the first embodiment, the voice processing device 100 holds the microphone input signal in the audio buffer while simultaneously performing voice recognition, and outputs the calling voice held in the audio buffer when the microphone input signal is determined to be a calling voice. Because the voice processing device 100 can thus convey the calling voice to the other party, a state close to a face-to-face conversation can be reproduced, and the conversation can be started with a high sense of presence.

(B) Second Embodiment
Next, a second embodiment of the voice processing device, the voice processing program, and the voice processing method of the present invention is described in detail with reference to the drawings.

The second embodiment illustrates a case in which the sound output method of the voice processing device of the present invention differs from that of the first embodiment.

(B-1) Configuration of the Second Embodiment
FIG. 3 is a block diagram showing the configuration of the voice processing device 200 according to the second embodiment.

The voice processing device 200 of the second embodiment differs from the voice processing device 100 of the first embodiment in that it has a signal addition unit 202 in place of the output switching unit 110. The other components are the same as, or correspond to, the components of the voice processing device 100 of FIG. 1 according to the first embodiment. In FIG. 3, components that are the same as, or correspond to, those of the voice processing device 100 according to the first embodiment are given the same reference numerals.

Detailed descriptions of components that are the same as, or correspond to, those of the first embodiment would be redundant and are omitted here.

In FIG. 3, the voice processing device 200 according to the second embodiment of the present invention includes a microphone 101, a microphone amplifier 102, an AD converter 103, and a call processing unit 201.

The call processing unit 201 includes an input terminal 105, an audio buffer unit 106, a voice recognition unit 107, a command list unit 108, a command determination unit 109, a signal addition unit 202, and an output terminal 111.

The signal addition unit 202 decides, based on the command determination result, whether to output the microphone input signal or a signal obtained by adding the microphone input signal and the signal held in the audio buffer unit, and outputs the chosen sound signal.

(B-2) Operation of the Second Embodiment
The basic voice processing operation of the voice processing device 200 according to the second embodiment is the same as the voice processing described in the first embodiment.

The following description focuses on the processing of the signal addition unit 202, which is the point of difference from the first embodiment.

When the command determination unit 109 determines that the recognition result of the voice recognition unit 107 does not exist in the list of commands of the command list unit 108, the signal addition unit 202 outputs the microphone input signal x(n) to the output terminal 111. On the other hand, when the command determination unit 109 determines that the recognition result of the voice recognition unit 107 exists in the list of commands of the command list unit 108, the signal addition unit 202 calculates the read position read_index of the audio buffer unit 106 in accordance with equation (4) described above.

Then, in accordance with equation (8) below, the signal addition unit 202 adds the sound signal held in the audio buffer unit 106 and the microphone input signal, outputs the sum as the output signal y(n) to the output terminal 111 for a certain period, and advances (increments) the read position read_index in accordance with equation (6) described above.
y(n) = x(n) + buffer(read_index) ... (8)

After outputting the summed signal for the certain period, the signal addition unit 202 outputs the microphone input signal x(n) as the output signal y(n) to the output terminal 111, in accordance with equation (7) described above.
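
The behaviour of the signal addition unit 202 can be sketched in the same per-sample style. As before this is only an illustration: the class name `AddingCallProcessor`, the `command_detected` flag, and the modulo index arithmetic are assumptions. Unlike the first embodiment, the live input is always part of the output, so no live samples are dropped.

```python
class AddingCallProcessor:
    """Sketch of the second embodiment: mix the buffered calling voice into the live input."""

    def __init__(self, buffer_size: int, replay_len: int):
        self.buffer = [0.0] * buffer_size   # audio buffer (circular, assumed)
        self.size = buffer_size             # BUFFER_SIZE
        self.replay_len = replay_len        # LEN
        self.write_index = 0
        self.read_index = 0
        self.remaining = 0                  # buffered samples still to mix in

    def process(self, x: float, command_detected: bool = False) -> float:
        # Store the sample and advance the write position.
        self.buffer[self.write_index] = x
        self.write_index = (self.write_index + 1) % self.size

        # On detection, start reading LEN samples back (assumed form of equation (4)).
        if command_detected and self.remaining == 0:
            self.read_index = (self.write_index - self.replay_len) % self.size
            self.remaining = self.replay_len

        if self.remaining > 0:
            y = x + self.buffer[self.read_index]                 # equation (8)
            self.read_index = (self.read_index + 1) % self.size  # equation (6)
            self.remaining -= 1
            return y

        return x                                                 # equation (7)


adder = AddingCallProcessor(buffer_size=8, replay_len=2)
adder_flags = [False, False, True, False, False]
adder_out = [adder.process(float(n + 1), f) for n, f in enumerate(adder_flags)]
```

A practical mixer would probably also guard against clipping when the two signals are summed; the source does not discuss this.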

(B-3) Effects of the Second Embodiment
As described above, according to the second embodiment, when the voice recognition result exists in the command list, the voice processing device 200 outputs a signal obtained by adding the signal held in the audio buffer to the microphone input signal. As a result, the voice processing device 200 can output the calling voice with little sound delay and without interrupting the sound signal.

(C) Third Embodiment
Next, a third embodiment of the voice processing device, the voice processing program, and the voice processing method of the present invention is described in detail with reference to the drawings.

The third embodiment illustrates a case in which the sound output method of the voice processing device of the present invention differs from those of the first and second embodiments.

(C-1) Configuration of the Third Embodiment
FIG. 4 is a block diagram showing the configuration of the voice processing device 300 according to the third embodiment.

The voice processing device 300 of the third embodiment differs from the voice processing device 100 of the first embodiment and the voice processing device 200 of the second embodiment in that it has a delay recovery unit 303 in addition to the configuration of the voice processing device 100 of the first embodiment, and in that it has an output switching unit 302 in place of the output switching unit 110. In FIG. 4, components that are the same as, or correspond to, those of the voice processing device 100 according to the first embodiment are given the same reference numerals.

In FIG. 4, the voice processing device 300 according to the third embodiment of the present invention includes a microphone 101, a microphone amplifier 102, an AD converter 103, and a call processing unit 301.

The call processing unit 301 includes an input terminal 105, a delay recovery unit 303, an audio buffer unit 106, a voice recognition unit 107, a command list unit 108, a command determination unit 109, an output switching unit 302, and an output terminal 111.

The output switching unit 302 decides which sound signal to output based on the command determination result, and outputs that sound signal. It also notifies the delay recovery unit 303 of whether the output sound signal has been switched.

The delay recovery unit 303 adjusts the timing at which the microphone input signal is passed from the input terminal 105 to the output switching unit 302. For example, the delay recovery unit 303 delays the incoming microphone input signal by a predetermined time and then outputs it.

(C-2) Operation of the Third Embodiment
The basic voice processing operation of the voice processing device 300 according to the third embodiment is the same as the call processing described in the first embodiment.

The following description focuses on the processing of the output switching unit 302 and the delay recovery unit 303, which are the points of difference from the first embodiment.

When a signal starts to be input to the input terminal 105 of the call processing unit 301, the call processing unit 301 outputs the microphone input signal x(n) to the delay recovery unit 303.

When the voice processing device 300 starts operating, the delay recovery unit 303 outputs the microphone input signal x(n) to the output switching unit 302.

When the voice processing device 300 is operating, the output switching unit 302 outputs a silent signal as the output signal y(n) to the output terminal 111, as shown in equation (1) described above.

When the command determination unit 109 determines that the recognition result of the voice recognition unit 107 does not exist in the list of commands of the command list unit 108, the output switching unit 302 continues to output the silent signal to the output terminal 111.

On the other hand, when the command determination unit 109 determines that the recognition result of the voice recognition unit 107 exists in the list of commands of the command list unit 108, the output switching unit 302 calculates the read position read_index of the audio buffer unit 106 in accordance with equation (4) described above.

Then, as shown in equation (5) described above, the output switching unit 302 outputs the sound signal held in the audio buffer unit 106 as the output signal y(n) to the output terminal 111 for a certain period, and advances (increments) the read position read_index as shown in equation (6) described above.

After outputting the sound signal held in the audio buffer unit 106 for the certain period, the output switching unit 302 outputs to the delay recovery unit 303 a signal indicating that the output is complete and that the output sound signal has been switched.

When the sound signal output from the output switching unit 302 is switched, the delay recovery unit 303 performs delay recovery processing on the microphone input signal x(n) for a predetermined time and outputs the result to the output switching unit 302. The delay recovery time may be determined, for example, in consideration of the processing times of the voice recognition unit 107 and the command determination unit 109, or may be set equal to the length of the calling voice output from the audio buffer unit. As the delay recovery processing, the delay recovery unit 303 may output a microphone input signal x'(n) obtained by applying speech rate conversion to the microphone input signal x(n), and then output the microphone input signal x(n) once the delay corresponding to the delay recovery time has been recovered. Alternatively, it may perform voice activity detection on the microphone input signal x(n), output a microphone input signal x''(n) from which silence amounting to the delay recovery time has been deleted, and then output the microphone input signal x(n) once the delay corresponding to the delay recovery time has been recovered. Further, the processing of the delay recovery unit 303 may be performed after the processing of the output switching unit 302.
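
The silence-deletion variant of delay recovery can be illustrated with a short sketch. This is not the patented method in detail: the function name `remove_silence_for_delay` is hypothetical, and a fixed amplitude threshold stands in for voice activity detection, which in practice would be more elaborate. The sketch drops silent samples until the requested amount of delay has been recovered and leaves the rest of the signal untouched.

```python
def remove_silence_for_delay(x, delay, threshold=0.01):
    """Drop up to `delay` low-amplitude samples from x, preserving sample order.

    Returns the shortened signal (corresponding to x''(n)) and the number of
    samples of delay actually recovered.
    """
    out = []
    recovered = 0
    for sample in x:
        if recovered < delay and abs(sample) < threshold:
            recovered += 1   # skip one silent sample, recovering one sample of delay
            continue
        out.append(sample)
    return out, recovered


delayed_signal = [0.5, 0.0, 0.0, 0.4, 0.0, 0.3]
shortened, recovered = remove_silence_for_delay(delayed_signal, delay=2)
```

Here two silent samples are removed and the third is kept, because the requested delay has already been recovered by then. If the input contains too little silence, `recovered` stays below `delay`, and a caller would need to keep applying the processing to later input.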

When the delay recovery processing is complete, the output switching unit 302 outputs the microphone input signal x(n) as the output signal y(n) to the output terminal 111, as shown in equation (7) above.
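The order in which the output switching unit 302 selects its sources after a command has been determined can be sketched as a small state machine. The following Python sketch is only illustrative and rests on assumptions: it works sample by sample, and the names `Phase`, `OutputSwitcher`, `recovery_samples`, and `step` are hypothetical rather than terms of the specification.

```python
from enum import Enum
from typing import Sequence


class Phase(Enum):
    BUFFER = "buffer"      # replaying the calling voice held in the buffer
    RECOVERY = "recovery"  # outputting the delay-recovered microphone signal
    LIVE = "live"          # passing x(n) straight through as y(n)


class OutputSwitcher:
    """Chooses the source of y(n) after the command has been determined."""

    def __init__(self, buffered: Sequence[float], recovery_samples: int) -> None:
        self._buffered = list(buffered)
        self._pos = 0
        self._recovery_left = recovery_samples
        self.phase = Phase.BUFFER

    def _update_phase(self) -> None:
        if self._pos < len(self._buffered):
            self.phase = Phase.BUFFER
        elif self._recovery_left > 0:
            self.phase = Phase.RECOVERY
        else:
            self.phase = Phase.LIVE

    def step(self, recovered: float, mic: float) -> float:
        """Return one output sample y(n) for the current phase."""
        self._update_phase()
        if self.phase is Phase.BUFFER:
            out = self._buffered[self._pos]
            self._pos += 1
        elif self.phase is Phase.RECOVERY:
            out = recovered
            self._recovery_left -= 1
        else:
            out = mic
        return out
```

With a two-sample buffer and a two-sample recovery period, the first two outputs come from the buffer, the next two from the delay-recovered signal, and all later outputs are the microphone input itself.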

(C-3) Effects of the Third Embodiment
As described above, according to the third embodiment, because the voice processing device 300 includes the delay recovery unit 303, the delay equal to the duration of the calling voice signal output from the audio buffer unit 106 can be recovered. As a result, compared with the first embodiment, the calling voice can be output with even less interruption of the sound signal.

(D) Other Embodiments
Various modifications have already been described in each of the above embodiments, and the present invention can also be applied to the following modified embodiments.

(D-1) The voice processing device described in each of the above embodiments may be mounted, for example, on a device such as that shown in FIG. 5, which starts a call in response to a spoken command when a video call or conference call is to be started. In FIG. 5, the connection determination unit 402 determines whether to connect to the network 405 (for example, the other party's videophone) based on the voice recognition result from the voice recognition unit 107 and the command determination result from the command determination unit 109, and outputs the connection determination result to the NW communication unit 404 via the output terminal 403. The NW communication unit 404 performs connection processing with the network 405 based on the connection determination result. After connection, the voice processing device exchanges voice with the network 405 via the NW communication unit 404. Voice received from the network 405 passes through the NW communication unit 404, is converted from a digital signal to an analog signal by the DA converter 406, is amplified by the speaker amplifier 407, and is output from the speaker 408.
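The decision made by the connection determination unit 402 can be sketched in Python as follows. The specification does not state how a destination is chosen, so the table of destinations, the substring matching, and the names `determine_connection`, `is_command`, and `destinations` are all assumptions made purely for illustration.

```python
from typing import Dict, Optional


def determine_connection(recognized_text: str,
                         is_command: bool,
                         destinations: Dict[str, str]) -> Optional[str]:
    """Return the address to connect to, or None if no connection is made.

    recognized_text: result of the voice recognition unit (assumed to be text).
    is_command:      result of the command determination unit.
    destinations:    hypothetical table mapping spoken names to addresses.
    """
    if not is_command:
        return None  # not a connection command: do not connect
    normalized = recognized_text.strip().lower()
    for name, address in destinations.items():
        if name.lower() in normalized:
            return address  # first destination named in the utterance
    return None  # a command was detected but no known destination was named
```

A caller would pass the returned address to the network communication side and start the call only when the result is not `None`.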

100 ... voice processing device, 101 ... microphone, 102 ... microphone amplifier, 103 ... AD converter, 104 ... call processing unit, 105 ... input terminal, 106 ... audio buffer unit, 107 ... voice recognition unit, 108 ... command list unit, 109 ... command determination unit, 110 ... output switching unit, 111 ... output terminal, 200 ... voice processing device, 201 ... call processing unit, 202 ... signal addition unit, 300 ... voice processing device, 301 ... call processing unit, 302 ... output switching unit, 303 ... delay recovery unit, 402 ... connection determination unit, 403 ... output terminal, 404 ... NW communication unit, 405 ... network, 406 ... DA converter, 407 ... speaker amplifier, 408 ... speaker.

Claims

A voice processing device comprising:
a buffer unit that serves as a buffer for reproducing, on the other party's side, the connection command voice transmitted after connecting to the other party, and that holds an input signal including the connection command voice for a certain period of time;
a voice recognition unit that performs voice recognition on the input signal;
a command determination unit that determines, using the result of the voice recognition unit, whether the input signal is the connection command voice; and
an output switching unit that, when the command determination unit determines that the input signal is the connection command voice, outputs the input signal held in the buffer unit and, after outputting the voice held in the buffer unit, switches to outputting the input signal.

The voice processing device according to claim 1, wherein, when the command determination unit determines that the input signal is the connection command voice, the output switching unit adds the current input signal to the input signal held in the buffer unit and outputs the sum.

The voice processing device according to claim 1, further comprising a delay recovery unit that adjusts the timing at which the input signal is input to the output switching unit.

A voice processing program that causes a computer to function as:
a buffer unit that serves as a buffer for reproducing, on the other party's side, the connection command voice transmitted after connecting to the other party, and that holds an input signal including the connection command voice for a certain period of time;
a voice recognition unit that performs voice recognition on the input signal;
a command determination unit that determines, using the result of the voice recognition unit, whether the input signal is the connection command voice; and
an output switching unit that, when the command determination unit determines that the input signal is the connection command voice, outputs the input signal held in the buffer unit and, after outputting the voice held in the buffer unit, switches to outputting the input signal.

A voice processing method using a buffer unit, a voice recognition unit, a command determination unit, and an output switching unit, wherein:
the buffer unit serves as a buffer for reproducing, on the other party's side, the connection command voice transmitted after connecting to the other party, and holds an input signal including the connection command voice for a certain period of time;
the voice recognition unit performs voice recognition on the input signal;
the command determination unit determines, using the result of the voice recognition unit, whether the input signal is the connection command voice; and
when the command determination unit determines that the input signal is the connection command voice, the output switching unit outputs the input signal held in the buffer unit and, after outputting the voice held in the buffer unit, switches to outputting the input signal.