JP2006252458A

JP2006252458A - Voice signal processor for processing voice signals of a plurality of speakers, and program

Info

Publication number: JP2006252458A
Application number: JP2005071636A
Authority: JP
Inventors: Tsutomu Kobayashi (小林 力)
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2005-03-14
Filing date: 2005-03-14
Publication date: 2006-09-21

Abstract

PROBLEM TO BE SOLVED: To provide a means for enlivening a conference and improving the progress of proceedings in an audio conference system without increasing the burden on participants.

SOLUTION: A voice signal receiving section 141 of a server device 14 receives voice signals of users 19 from respective terminal units 11. A response voice signal selecting section 1455 of the server device 14 determines whether conditions included in a response condition database 1444 are met. When the conditions are met, the response voice signal selecting section 1455 reads the response voice signal associated with those conditions from a response voice signal group 1445 and delivers it to a response voice signal mixing section 1457. The response voice signal mixing section 1457 mixes the response voice signal delivered from the response voice signal selecting section 1455 with the voice signal representing the speech of each user 19, as mixed by a spoken voice signal mixing section 1456, and then transmits the mixed signal to each terminal unit 11.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a voice signal processing technique for holding audio conferences using audio equipment.

There is a known technique for enlivening a telephone call: one of the users on the call operates a control unit provided on the telephone, which transmits a sound effect signal to the telephone of the other user. Patent Documents 1 to 3, for example, disclose such prior art.
Patent Document 1: JP-A-11-55379
Patent Document 2: JP 2002-51116 A
Patent Document 3: JP 2004-112109 A

In an audio conference system that realizes audio conferences using audio equipment such as headsets, the above prior art can be used to encourage livelier conversation in a conference. With that prior art, however, one of the users participating in the conference has to perform an operation to transmit a suitable sound effect signal to the other users' terminal devices at a suitable time. The user in that role therefore cannot concentrate on the conference.

In addition, an audio conference usually needs a moderator who restrains participants when several users speak at once and who prompts participants to summarize the discussion as the scheduled end time approaches. The user in that role likewise cannot concentrate on the conference.

In view of the above, an object of the present invention is to provide means that enliven a conference and improve the progress of proceedings in an audio conference system without increasing the burden on participants.

To achieve the above object, the present invention provides a voice signal processing device comprising: storage means for storing response voice signals, each representing a voice for a response; input means for receiving voice signals, each output from one of a plurality of terminal devices; detection means for detecting a signal indicating silence in a received voice signal on the basis of the level of each portion of that voice signal; timing means for measuring how long the signal indicating silence detected by the detection means continues; selection means for selecting a response voice signal corresponding to the time measured by the timing means; and output means for outputting the response voice signal selected by the selection means.

With a voice signal processing device configured in this way, a suitable response voice signal is output automatically according to the length of the silent period, which enlivens the audio conference and improves the progress of proceedings.
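The silence-driven selection described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the frame length, the level threshold, the peak-level test, and the rule of picking the longest configured duration already reached are all assumptions, as are the function names and the message ID "020".

```python
from typing import Iterable, Mapping, Optional, Sequence

SILENCE_LEVEL = 0.02   # assumed peak-amplitude threshold (hypothetical value)
FRAME_SECONDS = 0.5    # assumed duration of one frame (hypothetical value)


def frame_is_silent(frame: Sequence[float], threshold: float = SILENCE_LEVEL) -> bool:
    """Detection: a frame counts as silent when its peak level is below the threshold."""
    return max((abs(sample) for sample in frame), default=0.0) < threshold


def trailing_silence_seconds(frames: Iterable[Sequence[float]]) -> float:
    """Timing: length of the silent run at the end of the frame sequence."""
    seconds = 0.0
    for frame in frames:
        # Any non-silent frame resets the timer.
        seconds = seconds + FRAME_SECONDS if frame_is_silent(frame) else 0.0
    return seconds


def select_response(silence_seconds: float,
                    responses: Mapping[float, str]) -> Optional[str]:
    """Selection: pick the response for the longest configured duration already reached."""
    reached = [limit for limit in responses if limit <= silence_seconds]
    return responses[max(reached)] if reached else None


frames = [[0.5, -0.4], [0.0, 0.01], [0.0], [0.005], [0.0]]
silence = trailing_silence_seconds(frames)
print(silence, select_response(silence, {2.0: "001", 10.0: "020"}))
```

Keeping detection, timing, and selection as separate functions mirrors the separate "means" in the text, so each part could be replaced independently.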

In the voice signal processing device, the storage means may further store key phrase data indicating key phrases, and the device may further comprise speech recognition means for identifying the phrases contained in the utterance represented by a voice signal received by the input means, by performing speech recognition on that voice signal, and detection means for detecting, among the phrases identified by the speech recognition means, a key phrase indicated by the key phrase data. The selection means may then be configured to select the response voice signal corresponding to the key phrase data indicating the detected key phrase.

With this configuration, a suitable response voice signal is output automatically in response to a particular utterance, which enlivens the audio conference and improves the progress of proceedings.

In the voice signal processing device, the selection means may be configured to select a response voice signal corresponding to the terminal device that output a voice signal received by the input means within a predetermined past period.

With this configuration, a response voice signal suited to the participant is output automatically, which is preferable.

The voice signal processing device may further comprise timing means for measuring the time elapsed from a specific point in time, and the selection means may be configured to select a response voice signal corresponding to the time measured by that timing means.

With this configuration, the device can, for example, automatically output a response voice signal a predetermined time before the scheduled end of the conference to notify participants that the end is approaching, which improves convenience.

The voice signal processing device may further comprise timing means for measuring how long a voice signal output from one terminal device and received by the input means continues, and the selection means may be configured to select a response voice signal corresponding to the time measured by that timing means.

With this configuration, the device can, for example, restrain a participant who keeps speaking for a long time and prompt other participants to speak, which improves convenience.

In the voice signal processing device, the selection means may be configured to select a predetermined response voice signal when a voice signal output from another terminal device is received while a voice signal output from one terminal device is being received continuously by the input means.

With this configuration, when several participants speak at the same time, the device can, for example, first restrain those utterances and then put the proceedings in order, which improves convenience.

In the voice signal processing device, when the input means receives a voice signal while the output means is outputting a response voice signal, the output means may be configured not to output at least part of the portion of that response voice signal that has not yet been output.

This configuration avoids the problem of an automatically output response voice signal interfering with a participant's speech.

In the voice signal processing device, the output means may be configured to output the response voice signal only to the terminal device that output a voice signal received by the input means within a predetermined past period.

This configuration avoids the problem of the response voice signal disturbing anyone other than the current speaker.

The voice signal processing device may further comprise mixing means for mixing the voice signals received by the input means with the response voice signal selected by the selection means, and the output means may be configured to output the voice signal obtained by that mixing.

With this configuration, the voice signals representing the participants' speech and the automatically selected response voice signal are transmitted to the participants' terminal devices together, so the system can be built more simply than if they were output separately.
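As a rough illustration of the mixing step, the sketch below sums the participants' signals and the response signal sample by sample and clips the result. The patent does not specify the mixing method; sample-wise addition with hard clipping, the `mix` function name, and the normalized sample range of -1.0 to 1.0 are assumptions made for this example.

```python
from typing import List, Sequence


def mix(signals: Sequence[Sequence[float]], limit: float = 1.0) -> List[float]:
    """Sum signals sample by sample and clip each result to [-limit, limit].

    Shorter signals are treated as silent once they run out of samples.
    """
    length = max((len(signal) for signal in signals), default=0)
    mixed = []
    for i in range(length):
        total = sum(signal[i] for signal in signals if i < len(signal))
        mixed.append(max(-limit, min(limit, total)))
    return mixed


speech_a = [0.5, 0.5]      # speech from one participant
speech_b = [0.75, -0.25]   # speech from another participant
response = [0.25]          # shorter response signal selected by the server
print(mix([speech_a, speech_b, response]))
```

Because a single mixed stream is produced, each terminal only needs one receive path, which is the simplification the paragraph above refers to.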

In the voice signal processing device, the storage means may store, in addition to or instead of the response voice signals, response text data representing the text of response messages; the selection means may select, in addition to or instead of the response voice signal corresponding to the time measured by the timing means, the response text data corresponding to that time; and the output means may be configured to output the response text data selected by the selection means.

With this configuration, participants can be notified of the automatically selected response message as text, which in some situations is more convenient than notification by voice.

In the voice signal processing device, the storage means may store, instead of the response voice signals, response identification data for distinguishing one response voice signal from the others; the selection means may select, when a voice signal received by the input means satisfies a predetermined condition, the response identification data corresponding to that condition instead of the response voice signal corresponding to it; and the output means may be configured to output the response identification data selected by the selection means.

With this configuration, the voice signal processing device transmits to the terminal devices response identification data, which is smaller than a response voice signal. Combined with terminal devices that store the response voice signals in advance or can obtain them from elsewhere, this conveniently automates tasks such as moderating the audio conference.

The present invention also provides a program that causes a computer to execute the processing performed by the above voice signal processing device.

[Embodiment]
[1. Configuration of the audio conference system]
The audio conference system 1 according to an embodiment of the present invention allows users in different locations to hold a conference by voice. The audio conference system 1 also has a function that automatically transmits response voice signals to each terminal device of the participating users at an appropriate time. Examples are a voice signal conveying a back-channel response (aizuchi, a short interjection showing that the listener is following) and a voice signal prompting the conference to move forward.

FIG. 1 is a block diagram showing the configuration of the audio conference system 1. The audio conference system 1 comprises a network 10 that interconnects a plurality of communication devices, a plurality of terminal devices 11 each connected to the network 10, a headset 13 connected to each terminal device 11, and a server device 14 connected to the network 10.

Each terminal device 11 and headset 13 is used by one of the users 19 of the audio conference system 1. The number of users who can take part in a conference over the audio conference system 1, that is, the number of terminal devices 11 and headsets 13, can be changed freely, and the set of participating users may change while a conference is in progress.

As shown in FIG. 1, when different users 19 and the terminal devices 11 and headsets 13 they use need to be distinguished from one another, the suffix "-n" is appended, as in user 19-n, terminal device 11-n, and headset 13-n, where "n" is an arbitrary natural number. When they do not need to be distinguished, they are simply called user 19, terminal device 11, and headset 13.

The network 10 comprises one or more relay devices interconnected by wire or wirelessly and relays data between different communication devices. The network 10 may be an open network with no restriction on users, such as the Internet, or it may be an intranet, a LAN (Local Area Network) that uses a communication protocol other than the Internet Protocol, or the like.

The terminal device 11 transmits a voice signal representing the voice of its user 19 to the server device 14. It also receives from the server device 14 a composite voice signal in which the voice signals representing the voices of the other users 19 are mixed with a response voice signal that the server device 14 selected according to those voice signals.

The terminal device 11 comprises a microphone amplifier 111 that amplifies the voice signal representing the user's speech input from the microphone of the headset 13, an A/D converter 112 that converts the voice signal output from the microphone amplifier 111 from analog to digital, and a voice signal transmission unit 113 that transmits the voice signal output from the A/D converter 112 to the server device 14 via the network 10.

The terminal device 11 also comprises a voice signal reception unit 114 that receives the composite voice signal from the server device 14 via the network 10, a D/A converter 115 that converts the voice signal output from the voice signal reception unit 114 from digital to analog, and a headphone amplifier 116 that amplifies the voice signal output from the D/A converter 115 and outputs it to the headphones of the headset 13.

The terminal device 11 further comprises a control signal transmission unit 117 that transmits the various control signals generated by the control unit 118 to the server device 14 via the network 10, the control unit 118, which controls each component of the terminal device 11, a display unit 119 that notifies the user of messages and the like by displaying text and graphics, an operation unit 120 with which the user performs various operations on the terminal device 11, and a storage unit 121 that stores the control program and application programs specifying the processing of the control unit 118 and that also serves as a work area for the other components. The storage unit 121 also stores in advance a terminal ID, an identifier that identifies the terminal device 11 within the network 10.

The headset 13 comprises a microphone that generates an analog voice signal representing the speech of the user 19 and outputs it to the terminal device 11, and headphones that convert the analog voice signal input from the terminal device 11 into sound.

The server device 14 receives a voice signal from each of the terminal devices 11. When the content of the received voice signals satisfies a predetermined condition, it selects the response voice signal corresponding to that condition, mixes the voice signals received from the terminal devices 11 with the selected response voice signal to generate a composite voice signal, and transmits the composite voice signal to each terminal device 11.

The server device 14 comprises a voice signal reception unit 141 that receives voice signals representing the speech of the users 19 from the terminal devices 11 via the network 10, a control signal reception unit 142 that receives various control signals from the terminal devices 11 via the network 10, a voice signal transmission unit 143 that transmits the generated composite voice signal to each terminal device 11 via the network 10, a storage unit 144 that stores the control program and application programs specifying the processing of the control unit 145 together with various data and that also serves as a work area for the other components, and the control unit 145, which controls each component of the server device 14.

The storage unit 144 of the server device 14 stores in advance a server ID, an identifier that identifies the server device 14 within the network 10. The storage unit 144 also stores in advance: a conference DB 1441, a database (hereinafter "DB") holding multiple conference data records that describe conference schedules, participants, and so on; a user DB 1442 holding multiple user data records with information about the users of the audio conference system 1; a key phrase DB 1443 holding multiple key phrase data records for the key phrases used in the response voice signal selection process (described later); a response condition DB 1444 holding multiple response condition data records that define the conditions for selecting response voice signals; and a response voice signal group 1445 containing multiple response voice signals.

FIG. 2 illustrates the contents of the conference DB 1441. Each conference data record in the conference DB 1441 has the following fields: the date of the conference, its time slot, its agenda, a group of user IDs identifying the participants, an aizuchi mode indicating whether response voice signals that convey a back-channel response (aizuchi) are transmitted to the terminal devices of all participants or only to that of the current speaker, and a response style indicating the style of the response voice signals.

The user ID group in a conference data record contains several of the user IDs of the user data in the user DB 1442, indicating that the users identified by that user data are participants in the conference described by the record. The aizuchi mode is either "aizuchi to everyone", meaning that response voice signals conveying a back-channel response are transmitted to the terminal devices of all participants, or "aizuchi to the speaker", meaning that they are transmitted only to the terminal device of the current speaker. The response style is, for example, "formal" or "frank", and indicates which set of response voice signals to use among sets that convey the same response messages with different wording.
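To make the record layout concrete, a conference data record and the effect of the aizuchi mode could be modeled as below. This is only a sketch: the class and field names, the mode values "all" and "speaker", and the sample record (including the user ID "0311") are hypothetical and do not come from FIG. 2.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConferenceRecord:
    date: str
    time_slot: str
    agenda: str
    user_ids: List[str]     # participants, as user IDs from the user DB
    aizuchi_mode: str       # "all" or "speaker" (hypothetical encoding)
    response_style: str     # e.g. "formal" or "frank"


def aizuchi_recipients(record: ConferenceRecord, current_speaker_id: str) -> List[str]:
    """Return the user IDs whose terminals should receive a back-channel response."""
    if record.aizuchi_mode == "all":
        return list(record.user_ids)
    # "speaker" mode: only the current speaker, provided they are a participant.
    return [current_speaker_id] if current_speaker_id in record.user_ids else []


record = ConferenceRecord("2005-03-14", "10:00-11:00", "Budget review",
                          ["0425", "0311"], "speaker", "formal")
print(aizuchi_recipients(record, "0425"))
```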

FIG. 3 illustrates the contents of the user DB 1442. Each user data record in the user DB 1442 has the following fields: a user ID identifying the user, the user's name, the user's job title, the user's password, and the terminal ID of the terminal device 11 the user is currently using. The terminal ID is blank while the user 19 described by the record is not using the audio conference system 1. During the connection establishment process between the terminal device 11 and the server device 14 (described later), the server device 14 stores the terminal ID it obtains from the terminal device 11 in the corresponding user data record.

FIG. 4 illustrates the contents of the key phrase DB 1443. Each key phrase data record in the key phrase DB 1443 has two fields: the content of the key phrase and the type of the key phrase. The content is the wording of an utterance, such as "〜です。" ("... desu.", a declarative sentence ending). The type is one of "assertion", "opinion", "question", and so on, and indicates the intent with which a speaker utters the key phrase.

FIG. 5 shows the contents of the response condition DB 1444. Each response condition data record in the response condition DB 1444 has the following fields: a condition for selecting a response voice signal; response identification data specifying the response voice signal to be selected when the condition is satisfied; "aizuchi", indicating whether the response specified by the response identification data is a back-channel response; "interrupt", indicating whether the response specified by the response identification data should be delivered even while one of the users 19 is speaking; and "XXX", indicating what the wildcard stands for when the response identification data contains a wildcard.

For example, the condition "key phrase: type = assertion and silence duration = 2 seconds" in a response condition data record means that a key phrase whose "type" field in the key phrase DB 1443 is "assertion" was detected in the speech of one of the users 19 and that silence then continued for 2 seconds. Conditions can include a variety of parameters, for example: "speaking duration", the time for which the same user 19 has been speaking continuously; "conference duration", the elapsed time of the whole conference; "scheduled conference time", the scheduled length of the whole conference; "speaking user: title", the job title of the user 19 who spoke last; and "number of simultaneous speakers", the number of users 19 currently speaking.
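The example condition above could be evaluated along the following lines. This is a sketch under several assumptions: key phrases are matched as sentence endings on already-recognized text, "silence duration = 2 seconds" is read as "at least 2 seconds", and the paired response ID "001" is chosen only for illustration. Of the key phrase rows, only "です。" appears in the source; the other two are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical key phrase DB rows: (sentence ending, type).
# Longer endings come first so that "ですか？" is not mistaken for another type.
KEY_PHRASES: List[Tuple[str, str]] = [
    ("ですか？", "question"),
    ("と思います。", "opinion"),
    ("です。", "assertion"),
]


def key_phrase_type(utterance: str) -> Optional[str]:
    """Return the type of the first key phrase that the recognized utterance ends with."""
    for ending, phrase_type in KEY_PHRASES:
        if utterance.endswith(ending):
            return phrase_type
    return None


@dataclass
class ResponseCondition:
    phrase_type: str        # required key phrase type, e.g. "assertion"
    silence_seconds: float  # required silence after the utterance
    response_id: str        # response identification data to select

    def is_met(self, last_utterance: str, silence_seconds: float) -> bool:
        # Assumption: the condition holds once the silence has lasted at least this long.
        return (key_phrase_type(last_utterance) == self.phrase_type
                and silence_seconds >= self.silence_seconds)


condition = ResponseCondition("assertion", 2.0, "001")
print(condition.is_met("これは重要です。", 2.0))
```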

As another example, the response identification data "007" in a response condition data record means that, when the condition is satisfied, the response voice signal identified by message ID "007" is selected from the response voice signals in the response voice signal group 1445. The response identification data "RANDOM(001,002,003)" means that, when the condition is satisfied, one of the response voice signals identified by message IDs "001", "002", and "003" is selected at random.
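A resolver for these two forms of response identification data might look like the sketch below. The function name `resolve_response_id` and the regular-expression parsing are assumptions; the source only describes the notation, not how it is processed.

```python
import random
import re


def resolve_response_id(spec: str, rng: random.Random) -> str:
    """Resolve response identification data to a single message ID.

    "007"                 -> "007"
    "RANDOM(001,002,003)" -> one of the listed IDs, chosen at random
    """
    match = re.fullmatch(r"RANDOM\(([^)]*)\)", spec.strip())
    if match:
        choices = [item.strip() for item in match.group(1).split(",")]
        return rng.choice(choices)
    return spec.strip()


rng = random.Random()
print(resolve_response_id("007", rng))
print(resolve_response_id("RANDOM(001,002,003)", rng))
```

Passing the random number generator in as a parameter keeps the random choice reproducible in tests when a seeded `random.Random` is supplied.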

As a further example, consider the response identification data "XXX: name + (if XXX: title = general staff then 017, otherwise 018)" in a response condition data record whose "XXX" field is "random user". It means that the response voice signal giving the name of a user 19 selected at random from the users 19 taking part in the conference, and one of the response voice signals identified by message IDs "017" and "018", are selected and joined in that order. If the job title of the randomly selected user 19 is general staff, the response voice signal identified by message ID "017" is selected; otherwise the one identified by message ID "018" is selected.

Various values are possible for the "XXX" field of a response condition data record, for example: "first-speaking user", which, when several users 19 are currently speaking, denotes the user 19 who started speaking first; "random user", which denotes a user 19 selected at random; and "shortest-speaking user", which denotes the user 19 with the shortest cumulative speaking time so far.
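The wildcard resolution and the name-plus-message sequence described above can be sketched as follows. The `Participant` class, its fields, the wildcard labels as strings, the tuple representation of the sequence, and the user ID "0311" are all hypothetical; the sketch hard-codes the single "017"/"018" rule from the example rather than parsing the notation.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Participant:
    user_id: str
    title: str
    total_spoken_seconds: float
    speaking_since: Optional[float] = None  # start time of the current utterance, if any


def pick_wildcard_user(kind: str, participants: List[Participant],
                       rng: random.Random) -> Participant:
    """Resolve the "XXX" wildcard to one participant."""
    if kind == "random user":
        return rng.choice(participants)
    if kind == "shortest-speaking user":
        return min(participants, key=lambda p: p.total_spoken_seconds)
    if kind == "first-speaking user":
        # Raises ValueError if nobody is currently speaking.
        speaking = [p for p in participants if p.speaking_since is not None]
        return min(speaking, key=lambda p: p.speaking_since)
    raise ValueError(f"unknown wildcard kind: {kind}")


def build_sequence(user: Participant) -> List[Tuple[str, str]]:
    """Name signal first, then message 017 for general staff or 018 otherwise."""
    message_id = "017" if user.title == "general staff" else "018"
    return [("name", user.user_id), ("message", message_id)]


participants = [
    Participant("0425", "general staff", 120.0, speaking_since=5.0),
    Participant("0311", "manager", 30.0, speaking_since=3.0),
]
chosen = pick_wildcard_user("shortest-speaking user", participants, random.Random())
print(build_sequence(chosen))
```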

FIG. 6 shows the structure of the response voice signals in the response voice signal group 1445. FIG. 6 presents this structure schematically. In practice, the response voice signal group 1445 contains multiple response voice signals, each consisting of voice waveform data for a response message or a name shown in FIG. 6, associated with a message ID or a user ID respectively.

応答音声信号は、大きく応答メッセージを示す応答音声信号と、ユーザ１９の各々の氏名を示す応答音声信号に区分される。応答メッセージを示す応答音声信号は、さらに応答スタイル「フォーマル」、「フランク」等の各々のグループに区分される。これらのグループは、会議ＤＢ１４４１のフィールド「応答スタイル」において指定されるグループである。応答メッセージを示す応答音声信号は、メッセージＩＤにより互いに識別される。図６の例による場合、例えば、メッセージＩＤ「００１」は、応答メッセージ「ええ、ええ」という相槌の音声を示す応答音声信号に対応している。 The response voice signal is roughly divided into a response voice signal indicating a response message and a response voice signal indicating the name of each user 19. The response voice signal indicating the response message is further divided into groups such as response styles “formal” and “frank”. These groups are groups specified in the field “response style” of the conference DB 1441. Response voice signals indicating response messages are distinguished from each other by a message ID. In the case of the example of FIG. 6, for example, the message ID “001” corresponds to a response voice signal indicating the voice of the response message “Yes, yeah”.

ユーザ１９の氏名を示す応答音声信号は、ユーザＩＤにより互いに識別される。このユーザＩＤはユーザＤＢ１４４２におけるユーザＩＤを示している。図６の例による場合、例えば、ユーザＩＤ「０４２５」は、氏名「ササキコウジ」を呼ぶ音声を示す応答音声信号に対応している。 The response voice signals indicating the name of the user 19 are distinguished from each other by the user ID. This user ID indicates a user ID in the user DB 1442. In the case of the example of FIG. 6, for example, the user ID “0425” corresponds to a response voice signal indicating a voice calling the name “Koji Sasaki”.

図１に戻り、サーバ装置１４の記憶部１４４に記憶される他のデータを説明する。記憶部１４４には、音声会議システム１を用いた音声会議が開始されると、当該音声会議の開始から現在までの時間を示す会議継続時間データ１４４６、各々のユーザ１９に関し音声会議の開始から現在までの発言時間の履歴を示す発言継続時間データ１４４７、最後にいずれかのユーザ１９が発言を行った後に無言状態が継続している時間を示す無言継続時間データ１４４８、そして各々のユーザ１９の発言を示す音声信号を個別に過去の所定時間分だけ一時的に記憶するデータバッファ１４４９が記憶される。これらのデータは音声会議の継続中に一時的に記憶部１４４に記憶され、音声会議の終了に伴い記憶部１４４から削除される。 Returning to FIG. 1, the other data stored in the storage unit 144 of the server device 14 will be described. When an audio conference using the audio conference system 1 starts, the storage unit 144 stores conference duration data 1446 indicating the time elapsed from the start of the audio conference to the present, speech duration data 1447 indicating, for each user 19, the history of speaking times from the start of the audio conference to the present, silence duration data 1448 indicating how long the silent state has continued since any user 19 last spoke, and a data buffer 1449 that temporarily stores, individually for each user 19, the most recent predetermined length of the audio signal indicating that user's speech. These data are held temporarily in the storage unit 144 while the audio conference continues and are deleted from the storage unit 144 when the audio conference ends.

サーバ装置１４の制御部１４５は、各々のユーザ１９の音声信号のレベルが例えば所定の閾値を超えるか否かを判定することにより当該音声信号からユーザ１９の発言を示す部分を検出し検出結果を計時部１４５３に引き渡す発言信号検出部１４５１と、すべてのユーザ１９の音声信号のレベルが例えば所定の閾値を下回るか否かを判定することにより音声会議における無言状態を検出し検出結果を計時部１４５３に引き渡す無言信号検出部１４５２と、音声会議の開始時点からの経過時間、各々のユーザ１９の発言の継続時間およびいずれのユーザ１９も発言を行わない無言状態の継続時間を計測しその結果をそれぞれ会議継続時間データ１４４６、発言継続時間データ１４４７および無言継続時間データ１４４８として記憶部１４４に記憶させる計時部１４５３と、各々のユーザ１９の音声信号からキーフレーズを示す部分を検出するキーフレーズ検出部１４５４を備えている。 The control unit 145 of the server device 14 includes: a speech signal detection unit 1451 that detects the portions of each user 19's audio signal that indicate speech, for example by determining whether the level of the audio signal exceeds a predetermined threshold, and passes the detection result to the timer unit 1453; a silence signal detection unit 1452 that detects a silent state in the audio conference, for example by determining whether the levels of the audio signals of all users 19 are below a predetermined threshold, and passes the detection result to the timer unit 1453; the timer unit 1453, which measures the time elapsed since the start of the audio conference, the duration of each user 19's speech, and the duration of the silent state in which no user 19 is speaking, and stores the results in the storage unit 144 as conference duration data 1446, speech duration data 1447, and silence duration data 1448, respectively; and a key phrase detection unit 1454 that detects the portions of each user 19's audio signal that indicate a key phrase.

キーフレーズ検出部１４５４はキーフレーズを検出するため、例えばユーザ１９の音声信号に対しＦＦＴ(高速フーリエ変換)処理を施し、音声信号の有する周波数の分布において振幅がピークとなる周波数を特徴量として取り出す等し、予め記憶されている音声波形の同種の特徴量と比較することにより、音声信号に含まれる個々の音声内容を認識する。続いて、キーフレーズ検出部１４５４は認識した音声内容とキーフレーズＤＢ１４４３に含まれるフィールド「内容」のデータとを比較し、それらの一致を判定することにより、キーフレーズを検出する。以上はキーフレーズ検出部１４５４が周波数領域における特徴量に基づく音声認識法を用いる場合の例であるが、キーフレーズ検出部１４５４がキーフレーズの検出のために用いる音声認識法はこれに限られず、例えば時間−周波数領域における特徴量に基づく音声認識法や確率モデルによる音声認識法など、他のいずれの方法を用いてもよい。 To detect a key phrase, the key phrase detection unit 1454, for example, applies FFT (fast Fourier transform) processing to the audio signal of a user 19, extracts as features such quantities as the frequencies at which the amplitude peaks in the frequency distribution of the audio signal, and compares them with features of the same kind taken from audio waveforms stored in advance, thereby recognizing the individual speech content contained in the audio signal. The key phrase detection unit 1454 then compares the recognized speech content with the data in the field “content” of the key phrase DB 1443 and detects a key phrase by determining whether they match. The above is an example in which the key phrase detection unit 1454 uses a speech recognition method based on features in the frequency domain, but the speech recognition method used for key phrase detection is not limited to this; any other method may be used, such as a speech recognition method based on features in the time-frequency domain or a speech recognition method based on a probabilistic model.
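As a rough illustration of the frequency-domain feature mentioned above, the following Python sketch finds the frequency at which the amplitude spectrum of a short frame peaks and matches it against stored reference values. For brevity it uses a naive discrete Fourier transform from the standard library rather than an FFT; the two compute the same spectrum. The function names, the single-peak feature, and the nearest-value matching are simplifying assumptions, and a practical recognizer would use far richer features.

```python
import cmath
import math
from typing import Dict, List


def sine_tone(freq: float, sample_rate: float, n: int) -> List[float]:
    """Generate n samples of a sine wave, used here only as test input."""
    return [math.sin(2 * math.pi * freq * t / sample_rate) for t in range(n)]


def peak_frequency(samples: List[float], sample_rate: float) -> float:
    """Return the frequency (Hz) of the largest non-DC spectral component."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip the DC bin, stop below Nyquist
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


def closest_template(feature: float, templates: Dict[str, float]) -> str:
    """Return the label of the stored feature nearest to the observed one."""
    return min(templates, key=lambda label: abs(templates[label] - feature))
```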

制御部１４５は、さらに、計時部１４５３により計測された時間やキーフレーズ検出部１４５４により検出されたキーフレーズ等が応答条件ＤＢ１４４４に含まれる条件を充たすか否かを判定し、条件を充たす応答条件データの応答識別データにより特定される応答音声信号を応答音声信号群１４４５から選択して応答音声信号ミキシング部１４５７に引き渡す応答音声信号選択部１４５５と、データバッファ１４４９に記憶された各々のユーザ１９の音声信号をミキシングして応答音声信号ミキシング部１４５７および音声信号送信部１４３に出力する発言音声信号ミキシング部１４５６と、応答音声信号選択部１４５５から引き渡される応答音声信号と発言音声信号ミキシング部１４５６から受け取る合成音声信号をミキシングして音声信号送信部１４３に出力する応答音声信号ミキシング部１４５７を備えている。 The control unit 145 further includes: a response audio signal selection unit 1455 that determines whether the times measured by the timer unit 1453, the key phrases detected by the key phrase detection unit 1454, and the like satisfy a condition included in the response condition DB 1444, selects from the response audio signal group 1445 the response audio signal specified by the response identification data of the response condition data whose condition is satisfied, and passes it to the response audio signal mixing unit 1457; a speech audio signal mixing unit 1456 that mixes the audio signals of the users 19 stored in the data buffer 1449 and outputs the result to the response audio signal mixing unit 1457 and the audio signal transmission unit 143; and the response audio signal mixing unit 1457, which mixes the response audio signal passed from the response audio signal selection unit 1455 with the synthesized audio signal received from the speech audio signal mixing unit 1456 and outputs the result to the audio signal transmission unit 143.

［２．音声会議システムの動作］ [2. Operation of the audio conference system]
以下、複数のユーザ１９が音声会議システム１を用いて音声会議を行う場合の音声会議システム１の動作を説明する。まず、音声会議に参加するユーザ１９は各自の端末装置１１を操作して、端末装置１１とサーバ装置１４との間に通信コネクションを確立させる。通信コネクションの確立において、ユーザ１９は自分のユーザＩＤおよびパスワードを端末装置１１に入力し、端末装置１１の制御部１１８はユーザにより入力されたユーザＩＤおよびパスワードを含む制御信号を生成し制御信号送信部１１７を介してサーバ装置１４に送信する。その制御信号には端末装置１１の端末ＩＤが含まれている。 Hereinafter, the operation of the audio conference system 1 when a plurality of users 19 conduct an audio conference using the audio conference system 1 will be described. First, each user 19 participating in the audio conference operates his or her own terminal device 11 to establish a communication connection between the terminal device 11 and the server device 14. In establishing the communication connection, the user 19 enters his or her user ID and password into the terminal device 11, and the control unit 118 of the terminal device 11 generates a control signal containing the entered user ID and password and transmits it to the server device 14 via the control signal transmission unit 117. The control signal also contains the terminal ID of the terminal device 11.

サーバ装置１４の制御部１４５は端末装置１１から制御信号を受信すると、受信した制御信号に含まれるユーザＩＤおよびパスワードを含むユーザデータをユーザＤＢ１４４２（図３参照）から検索し、検索したユーザデータのフィールド「端末ＩＤ」に制御信号に含まれる端末ＩＤを格納する。制御部１４５がユーザデータの検索に失敗した場合、サーバ装置１４は端末装置１１に対しユーザＩＤおよびパスワードの再入力を促すメッセージの表示を指示する制御信号を送信し、正しいユーザＩＤおよびパスワードの組合せを含む制御信号を端末装置１１から受信するまで、以下の動作を行わない。 When receiving the control signal from the terminal device 11, the control unit 145 of the server device 14 searches the user DB 1442 (see FIG. 3) for user data including the user ID and password included in the received control signal, and In the field “terminal ID”, the terminal ID included in the control signal is stored. If the control unit 145 fails to retrieve user data, the server device 14 transmits a control signal instructing the terminal device 11 to display a message prompting the user ID and password to be re-input, and the correct combination of the user ID and password. The following operation is not performed until a control signal including is received from the terminal device 11.
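The lookup and terminal ID registration described above might be sketched in Python as follows. The dictionary-based user DB, the field names, and the `authenticate` function are assumptions made for illustration; the plain-text password comparison simply mirrors the description and is not a recommendation for a real system.

```python
from typing import Dict, Optional

UserDB = Dict[str, Dict[str, Optional[str]]]


def authenticate(user_db: UserDB, user_id: str, password: str,
                 terminal_id: str) -> bool:
    """Return True and record the terminal ID if the credentials match.

    On failure nothing is stored; the caller would then ask the terminal
    device to prompt for the user ID and password again.
    """
    record = user_db.get(user_id)
    if record is None or record["password"] != password:
        return False
    record["terminal_id"] = terminal_id  # field "terminal ID" of the user data
    return True
```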

上記のように端末装置１１とサーバ装置１４との間に通信コネクションが確立され、サーバ装置１４によるユーザＩＤとパスワードに基づくユーザ１９の本人認証が成功すると、ユーザ１９は自分の発言を示す音声信号を端末装置１１からサーバ装置１４に送信すると同時に、他の端末装置１１のユーザ１９の発言を示す音声信号をサーバ装置１４から受信することが可能となる。 When the communication connection between the terminal device 11 and the server device 14 has been established as described above and the server device 14 has successfully authenticated the user 19 on the basis of the user ID and password, the user 19 can transmit an audio signal indicating his or her own speech from the terminal device 11 to the server device 14 and, at the same time, receive from the server device 14 audio signals indicating the speech of the users 19 of the other terminal devices 11.

例えば、ユーザ１９−１の発言を示す音声はヘッドセット１３−１のマイク部により音声信号に変換され、端末装置１１−１のマイクアンプ１１１、Ａ／Ｄコンバータ１１２および音声信号送信部１１３を介してサーバ装置１４に送信される。サーバ装置１４の音声信号受信部１４１は端末装置１１−１からユーザ１９−１の音声信号を受信するとともに、音声会議に参加する他のユーザ１９の音声信号をそれらのユーザ１９の端末装置１１から受け取り、それらを個別にデータバッファ１４４９に一時的に記憶させる。データバッファ１４４９に一時的に記憶された音声信号は、発言音声信号ミキシング部１４５６によりミキシングされ合成音声信号として音声信号送信部１４３に出力され、音声信号送信部１４３から音声会議に参加する全てのユーザ１９の端末装置１１に送信される。 For example, the voice of the user 19-1 is converted into an audio signal by the microphone unit of the headset 13-1 and transmitted to the server device 14 via the microphone amplifier 111, the A/D converter 112, and the audio signal transmission unit 113 of the terminal device 11-1. The audio signal receiving unit 141 of the server device 14 receives the audio signal of the user 19-1 from the terminal device 11-1, receives the audio signals of the other users 19 participating in the audio conference from their terminal devices 11, and temporarily stores each of them individually in the data buffer 1449. The audio signals temporarily stored in the data buffer 1449 are mixed by the speech audio signal mixing unit 1456, output to the audio signal transmission unit 143 as a synthesized audio signal, and transmitted from the audio signal transmission unit 143 to the terminal devices 11 of all users 19 participating in the audio conference.
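The specification does not state how the mixing itself is carried out. One common approach, shown here purely as an assumed sketch, is to add the buffered signals sample by sample and clip the sum to the range of signed 16-bit samples. The function name `mix` and the 16-bit sample format are assumptions.

```python
from typing import List, Sequence


def mix(signals: Sequence[Sequence[int]], limit: int = 32767) -> List[int]:
    """Sum signals sample by sample and clip to the signed 16-bit range."""
    if not signals:
        return []
    length = max(len(s) for s in signals)
    mixed = []
    for i in range(length):
        # Signals shorter than the longest one contribute silence past their end.
        total = sum(s[i] for s in signals if i < len(s))
        mixed.append(max(-limit - 1, min(limit, total)))
    return mixed
```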

そのようにサーバ装置１４から送信される合成音声信号は、端末装置１１−１において音声信号受信部１１４により受信され、Ｄ／Ａコンバータ１１５およびヘッドフォンアンプ１１６を介してヘッドセット１３−１のヘッドフォン部に出力される。ヘッドセット１３−１のヘッドフォン部は端末装置１１−１から受け取った合成音声信号を音声に変換し出力する。その結果、ユーザ１９−１は自分の発言を音声会議に参加する他のユーザ１９に伝達するとともに、他のユーザ１９の発言を聞くことができ、音声会議が成立する。 The synthesized audio signal transmitted from the server device 14 is received by the audio signal receiving unit 114 in the terminal device 11-1, and the headphone unit of the headset 13-1 via the D / A converter 115 and the headphone amplifier 116. Is output. The headphone unit of the headset 13-1 converts the synthesized voice signal received from the terminal device 11-1 into voice and outputs it. As a result, the user 19-1 can transmit his / her speech to the other users 19 participating in the audio conference and can listen to the speech of the other user 19 to establish the audio conference.

ところで、サーバ装置１４は上記の処理に加えて、発言音声信号ミキシング部１４５６により生成されるユーザ１９の発言を示す合成音声信号に対し、さらに必要に応じて適切な応答音声信号をミキシングして端末装置１１に送信する機能を有している。以下、その機能に関するサーバ装置１４の動作を説明する。 In addition to the above processing, the server device 14 has a function of further mixing an appropriate response audio signal, as needed, into the synthesized audio signal indicating the speech of the users 19 generated by the speech audio signal mixing unit 1456, and transmitting the result to the terminal devices 11. The operation of the server device 14 relating to that function is described below.

会議ＤＢ１４４１（図２参照）に含まれる会議データのフィールド「日付」および「時間帯」により示される音声会議の開始時間になると、計時部１４５３はその開始時間を基準時刻とする経過時間の計測を開始し、その計測結果を順次、会議継続時間データ１４４６として記憶部１４４に記憶させる。 When the start time of the audio conference indicated by the fields “date” and “time zone” of the conference data included in the conference DB 1441 (see FIG. 2) is reached, the timer unit 1453 measures the elapsed time using the start time as a reference time. The measurement results are sequentially stored in the storage unit 144 as conference duration data 1446.

また、発言信号検出部１４５１はデータバッファ１４４９に新たに記憶される音声信号を常時監視しており、あるユーザ１９の音声信号のレベルが所定の時間以上、所定の閾値を継続して下回った後、所定の閾値を所定の時間だけ継続して上回った場合、発言開始を示す発言開始データを生成し、計時部１４５３に引き渡す。また、発言信号検出部１４５１は発言開始データの生成後に、音声信号のレベルが所定の時間以上、所定の閾値を継続して下回った場合、発言終了を示す発言終了データを生成し、計時部１４５３に引き渡す。計時部１４５３は発言信号検出部１４５１から発言開始データおよび発言終了データを受け取ると、それらのデータに基づきユーザ１９の発言時間帯を示すデータを、その音声信号の送信元の端末装置１１ごとに発言継続時間データ１４４７として記憶部１４４に記憶させる。ところで、発言信号検出部１４５１が音声信号のうち発言を示す部分の開始タイミングおよび終了タイミングを特定する方法は上記のものに限られず、例えば判定に用いる閾値や時間を所定の規則に従い可変とする等、様々な方法が考えられる。 The speech signal detection unit 1451 constantly monitors the audio signals newly stored in the data buffer 1449. When the level of the audio signal of a user 19 has stayed below a predetermined threshold for at least a predetermined time and then stays above the predetermined threshold for a predetermined time, it generates speech start data indicating the start of speech and passes it to the timer unit 1453. After generating the speech start data, when the level of the audio signal stays below the predetermined threshold for at least a predetermined time, the speech signal detection unit 1451 generates speech end data indicating the end of speech and passes it to the timer unit 1453. On receiving the speech start data and speech end data from the speech signal detection unit 1451, the timer unit 1453 stores, in the storage unit 144 as speech duration data 1447, data indicating the speaking periods of the user 19 based on those data, separately for each terminal device 11 from which the audio signal was transmitted. The method by which the speech signal detection unit 1451 identifies the start and end timing of the portions of the audio signal that indicate speech is not limited to the above; various methods are conceivable, such as varying the thresholds and times used for the determination according to predetermined rules.
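The start and end detection described above amounts to a small state machine with hold times. The following Python sketch shows one possible form, working on one level value per fixed-length frame for a single user; the class name, the frame-based timing, and the event strings are assumptions made for illustration.

```python
from typing import Optional


class SpeechDetector:
    """Per-user detector: quiet for a while, then loud for a while -> start."""

    def __init__(self, threshold: float, quiet_frames: int, loud_frames: int):
        self.threshold = threshold
        self.quiet_frames = quiet_frames  # frames of quiet needed to arm or end
        self.loud_frames = loud_frames    # frames of sound needed to confirm start
        self.quiet_run = 0
        self.loud_run = 0
        self.armed = False     # True once a long enough quiet stretch was seen
        self.speaking = False

    def feed(self, level: float) -> Optional[str]:
        """Process one frame level; return "start", "end", or None."""
        if level > self.threshold:
            self.loud_run += 1
            self.quiet_run = 0
            if (not self.speaking and self.armed
                    and self.loud_run >= self.loud_frames):
                self.speaking = True
                self.armed = False
                return "start"
            return None
        if self.loud_run > 0 and not self.speaking:
            # A burst too short to count as speech breaks the sequence.
            self.armed = False
        self.quiet_run += 1
        self.loud_run = 0
        if self.quiet_run >= self.quiet_frames:
            self.armed = True
            if self.speaking:
                self.speaking = False
                return "end"
        return None
```

A timer would turn each "start"/"end" pair into one speaking period for the terminal device that sent the signal.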

また、無言信号検出部１４５２はデータバッファ１４４９に新たに記憶される音声信号を常時監視しており、音声会議に参加中の全てのユーザ１９の音声信号が所定の時間以上、所定の閾値を継続して下回った場合、無言状態の開始を示す無言開始データを生成し、計時部１４５３に引き渡す。また、無言信号検出部１４５２は無言開始データの生成後に、いずれかのユーザ１９の音声信号が所定の時間以上、所定の閾値を継続して上回った場合、無言状態の終了を示す無言終了データを生成し、計時部１４５３に引き渡す。計時部１４５３は無言信号検出部１４５２から無言開始データおよび無言終了データを受け取ると、それらのデータに基づき最後に無言開始データを受け取った時点から現時点までの経過時間を示すデータを、無言継続時間データ１４４８として記憶部１４４に記憶させる。無言信号検出部１４５２が音声信号のうち無言を示す部分の開始タイミングおよび終了タイミングを特定する方法が上記のものに限られない点は、上述した発言信号検出部１４５１の場合と同様である。 The silence signal detection unit 1452 likewise constantly monitors the audio signals newly stored in the data buffer 1449. When the audio signals of all users 19 participating in the audio conference stay below a predetermined threshold for at least a predetermined time, it generates silence start data indicating the start of the silent state and passes it to the timer unit 1453. After generating the silence start data, when the audio signal of any user 19 stays above the predetermined threshold for at least a predetermined time, the silence signal detection unit 1452 generates silence end data indicating the end of the silent state and passes it to the timer unit 1453. On receiving the silence start data and silence end data from the silence signal detection unit 1452, the timer unit 1453 stores, in the storage unit 144 as silence duration data 1448, data indicating the time elapsed from the most recent receipt of silence start data to the present. As with the speech signal detection unit 1451 described above, the method by which the silence signal detection unit 1452 identifies the start and end timing of the silent portions of the audio signals is not limited to the above.
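The silence detection and the resulting silence duration could be combined as in the Python sketch below, which again assumes one level value per user per frame and counts time in frames. The class name `SilenceTimer` and its interface are hypothetical.

```python
from typing import List, Optional, Sequence


class SilenceTimer:
    """Tracks the conference-wide silent state and how long it has lasted."""

    def __init__(self, threshold: float, hold_frames: int, user_count: int):
        self.threshold = threshold
        self.hold_frames = hold_frames
        self.loud_runs: List[int] = [0] * user_count  # per-user loud streaks
        self.quiet_run = 0                            # frames with everyone quiet
        self.silent_since: Optional[int] = None       # frame of silence start

    def feed(self, frame_index: int, levels: Sequence[float]) -> int:
        """Process one frame for all users; return silence duration in frames."""
        for i, level in enumerate(levels):
            self.loud_runs[i] = self.loud_runs[i] + 1 if level > self.threshold else 0
        if all(level < self.threshold for level in levels):
            self.quiet_run += 1
        else:
            self.quiet_run = 0
        if self.silent_since is None:
            if self.quiet_run >= self.hold_frames:
                self.silent_since = frame_index   # silence start data
        elif any(run >= self.hold_frames for run in self.loud_runs):
            self.silent_since = None              # silence end data
        if self.silent_since is None:
            return 0
        return frame_index - self.silent_since
```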

応答音声信号選択部１４５５は、上記のように計時部１４５３により順次更新される会議継続時間データ１４４６、発言継続時間データ１４４７および無言継続時間データ１４４８を常時監視しており、それらのデータが更新されると、更新されたデータに基づき応答条件ＤＢ１４４４（図５参照）に含まれる応答条件データのうち、フィールド「条件」により示される条件が充たされるものを検索する。応答音声信号選択部１４５５は、その検索に成功した場合、検索した応答条件データに含まれる応答識別データにより示されるメッセージＩＤおよびユーザＩＤに基づき、応答音声信号群１４４５から該当する応答音声信号を読み出す。その際、応答音声信号選択部１４５５は応答音声信号群１４４５（図６参照）に含まれる応答音声信号のうち、現在行われている音声会議に対応する会議データ（図２参照）のフィールド「応答スタイル」により示されるスタイルに応じたグループから応答音声信号を読み出す。 The response voice signal selection unit 1455 constantly monitors the conference duration data 1446, the speech duration data 1447, and the silent duration data 1448 that are sequentially updated by the timing unit 1453 as described above, and these data are updated. Then, based on the updated data, search is made for response condition data included in the response condition DB 1444 (see FIG. 5) that satisfies the condition indicated by the field “condition”. When the search is successful, the response audio signal selection unit 1455 reads out the corresponding response audio signal from the response audio signal group 1445 based on the message ID and the user ID indicated by the response identification data included in the searched response condition data. . At that time, the response audio signal selection unit 1455 selects the field “response” of the conference data (see FIG. 2) corresponding to the currently held audio conference among the response audio signals included in the response audio signal group 1445 (see FIG. 6). The response audio signal is read from the group corresponding to the style indicated by “style”.

一方、キーフレーズ検出部１４５４もまた、データバッファ１４４９に新たに記憶される音声信号を常時監視しており、既に説明したように、各々のユーザ１９の発言を示す音声信号からキーフレーズＤＢ１４４３（図４参照）に含まれるキーフレーズデータにより示されるキーフレーズの検出を行う。キーフレーズ検出部１４５４によりキーフレーズの検出が行われると、検出されたキーフレーズに対応するキーフレーズデータの種類に基づき応答条件ＤＢ１４４４に含まれる応答条件データのうち、フィールド「条件」により示される条件が充たされるものを検索する。応答音声信号選択部１４５５は、その検索に成功した場合、検索した応答条件データに含まれる応答識別データにより示されるメッセージＩＤおよびユーザＩＤに基づき、応答音声信号群１４４５から該当する応答音声信号を読み出す。その際も、応答音声信号選択部１４５５は会議データのフィールド「応答スタイル」により示されるスタイルに応じたグループから応答音声信号を読み出す。 On the other hand, the key phrase detection unit 1454 also constantly monitors the voice signal newly stored in the data buffer 1449, and as described above, the key phrase DB 1443 (FIG. 4) is detected. The key phrase indicated by the key phrase data included in (4) is detected. When the key phrase is detected by the key phrase detection unit 1454, the condition indicated by the field “condition” in the response condition data included in the response condition DB 1444 based on the type of the key phrase data corresponding to the detected key phrase. Search for items that satisfy. When the search is successful, the response audio signal selection unit 1455 reads out the corresponding response audio signal from the response audio signal group 1445 based on the message ID and the user ID indicated by the response identification data included in the searched response condition data. . At that time, the response audio signal selection unit 1455 reads the response audio signal from the group corresponding to the style indicated by the field “response style” of the conference data.

例えば、図２に例示される会議ＤＢ１４４１に含まれる第１行目の会議データにより示される音声会議が行われている際に、図５に例示される応答条件ＤＢ１４４４に含まれる第２行目の応答条件データの条件が充たされた場合、応答音声信号選択部１４５５は応答音声信号群１４４５のうち、応答スタイル「フォーマル」のグループに含まれる応答音声信号群の中からメッセージＩＤ「００７」で識別される応答音声信号、すなわちメッセージ「今のご質問にお答えできる方がいらっしゃいましたら、ご発言願います。」を示す音声信号を読み出すことになる。 For example, when the audio conference indicated by the first row conference data included in the conference DB 1441 illustrated in FIG. 2 is being performed, the second row included in the response condition DB 1444 illustrated in FIG. When the condition of the response condition data is satisfied, the response audio signal selection unit 1455 uses the message ID “007” from the response audio signal group included in the response style “formal” group in the response audio signal group 1445. The response voice signal to be identified, that is, the voice signal indicating the message "If you can answer the current question, please speak."
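The search and read-out steps described above might look roughly like the following Python sketch. Representing each condition as a callable over a state dictionary, and the signal group as a dictionary keyed by response style and then by message ID, are assumptions for illustration; the actual contents of FIG. 5 are not reproduced here.

```python
from typing import Any, Callable, Dict, List

Condition = Dict[str, Any]


def matching_conditions(conditions: List[Condition],
                        state: Dict[str, float]) -> List[Condition]:
    """Return the response condition records whose condition holds for state."""
    return [c for c in conditions if c["condition"](state)]


def read_signal(signal_groups: Dict[str, Dict[str, bytes]],
                style: str, message_id: str) -> bytes:
    """Read a response signal from the group for the conference's style."""
    return signal_groups[style][message_id]
```

For instance, with a hypothetical record whose condition is "silence has lasted at least 10 seconds" and whose message ID is "007", a conference with response style "formal" would read the signal stored under `signal_groups["formal"]["007"]`.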

応答音声信号選択部１４５５は、応答条件データ（図５参照）のフィールド「割り込み」が「Ｙｅｓ」であるものに従い上記の応答音声信号の読み出しを行った場合には、読み出した応答音声信号の応答音声信号ミキシング部１４５７への引き渡しを即時に開始する。一方、応答音声信号選択部１４５５は、応答条件データのフィールド「割り込み」が「Ｎｏ」であるものに従い上記の応答音声信号の読み出しを行った場合には、無言継続時間データ１４４８により無言状態であると判断される場合には即時に、無言状態ではないと判断される場合には無言状態になるのを待って、読み出した応答音声信号の応答音声信号ミキシング部１４５７への引き渡しを開始する。それにより、例えば相槌等の応答メッセージはいずれのユーザ１９も発言を行っていない間に伝達される一方、例えば複数人による同時発言を制する場合のように割り込みを要する応答メッセージは、たとえいずれかのユーザ１９が発言を行っていても伝達される。 When the response audio signal selection unit 1455 has read out a response audio signal according to response condition data (see FIG. 5) whose field “interrupt” is “Yes”, it immediately starts passing the read response audio signal to the response audio signal mixing unit 1457. When it has read out a response audio signal according to response condition data whose field “interrupt” is “No”, it starts passing the signal to the response audio signal mixing unit 1457 immediately if the silence duration data 1448 indicates a silent state, and otherwise waits until the silent state is reached. As a result, response messages such as backchannel responses (aizuchi) are conveyed while no user 19 is speaking, whereas response messages that need to interrupt, for example one that restrains several people speaking at once, are conveyed even if a user 19 is speaking.
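The effect of the field “interrupt” on delivery timing can be sketched as a small queue in Python. The class `ResponseQueue` and its method names are assumptions; the sketch only decides when a signal may be handed to the mixing stage and does not handle the audio itself.

```python
from collections import deque
from typing import Deque, List


class ResponseQueue:
    """Holds non-interrupting responses until the conference falls silent."""

    def __init__(self) -> None:
        self.pending: Deque[str] = deque()

    def submit(self, signal_id: str, interrupt: bool, silent_now: bool) -> List[str]:
        """Return the signals to deliver immediately (possibly none)."""
        if interrupt or silent_now:
            return [signal_id]
        self.pending.append(signal_id)  # wait for the silent state
        return []

    def on_silence(self) -> List[str]:
        """Called when the silent state begins; release everything waiting."""
        released = list(self.pending)
        self.pending.clear()
        return released
```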

応答音声信号ミキシング部１４５７は、上記のように応答音声信号選択部１４５５から応答音声信号を受け取ると、受け取った応答音声信号を発言音声信号ミキシング部１４５６から受け取った合成音声信号とミキシングし、その結果得られる合成音声信号を音声信号送信部１４３に出力する。そのように応答音声信号ミキシング部１４５７から出力される合成音声信号は、音声会議に参加中のユーザ１９の発言を示す音声信号に対し、会議開始からの経過時間、各々のユーザ１９の累積発言時間、無言状態の継続時間、同時発言者数等に応じて、適する内容のメッセージを示す音声信号がミキシングされたものである。 When the response audio signal mixing unit 1457 receives the response audio signal from the response audio signal selection unit 1455 as described above, the response audio signal mixing unit 1457 mixes the received response audio signal with the synthesized audio signal received from the speech audio signal mixing unit 1456, and as a result. The resultant synthesized speech signal is output to the speech signal transmission unit 143. As described above, the synthesized voice signal output from the response voice signal mixing unit 1457 is an elapsed time from the start of the conference and an accumulated speech time of each user 19 with respect to the voice signal indicating the speech of the user 19 participating in the audio conference. A voice signal indicating a message having a suitable content is mixed according to the duration of the silent state, the number of simultaneous speakers, and the like.

ところで、現在行われている音声会議に対応する会議データ（図２参照）に含まれるフィールド「相槌モード」が「全員へ相槌」である場合、音声信号送信部１４３は音声会議に参加中の全てのユーザ１９の端末装置１１に対し、常時、応答音声信号ミキシング部１４５７から受け取った合成音声信号、すなわち応答音声信号がミキシングされた会議音声を示す音声信号を送信する。 When the field “backchannel (aizuchi) mode” included in the conference data (see FIG. 2) corresponding to the audio conference in progress is “backchannel to all”, the audio signal transmission unit 143 always transmits, to the terminal devices 11 of all users 19 participating in the audio conference, the synthesized audio signal received from the response audio signal mixing unit 1457, that is, the audio signal representing the conference audio with the response audio signal mixed in.

一方、現在行われている音声会議に対応する会議データに含まれるフィールド「相槌モード」が「発言者へ相槌」である場合、音声信号送信部１４３は、応答条件データ（図５参照）のフィールド「相槌」が「Ｙｅｓ」であるものに従い応答音声信号選択部１４５５から応答音声信号が応答音声信号ミキシング部１４５７に引き渡される間、発言継続時間データ１４４７により最後に発言を行ったユーザ１９の端末装置１１に対しては応答音声信号ミキシング部１４５７から受け取った合成音声信号を送信する一方、その他のユーザ１９の端末装置１１に対しては発言音声信号ミキシング部１４５６から受け取った合成音声信号、すなわち応答音声信号のミキシングされていない音声信号を送信する。ただし、応答条件データのフィールド「相槌」が「Ｎｏ」であるものに従い応答音声信号選択部１４５５から応答音声信号が応答音声信号ミキシング部１４５７に引き渡される間は、音声信号送信部１４３は全てのユーザ１９の端末装置１１に対し、応答音声信号ミキシング部１４５７から受け取った合成音声信号を送信する。その結果、相槌の音声については発言中のユーザ１９（正確には発言の途中で「間」を置いたユーザ１９）のみが耳にすることとなり、他のユーザ１９が相槌の音声により邪魔されることがない。 On the other hand, when the field “backchannel (aizuchi) mode” included in the conference data corresponding to the audio conference in progress is “backchannel to speaker”, then while a response audio signal is being passed from the response audio signal selection unit 1455 to the response audio signal mixing unit 1457 according to response condition data (see FIG. 5) whose field “backchannel” is “Yes”, the audio signal transmission unit 143 transmits the synthesized audio signal received from the response audio signal mixing unit 1457 to the terminal device 11 of the user 19 who, according to the speech duration data 1447, spoke last, and transmits to the terminal devices 11 of the other users 19 the synthesized audio signal received from the speech audio signal mixing unit 1456, that is, the audio signal without the response audio signal mixed in. However, while a response audio signal is being passed from the response audio signal selection unit 1455 to the response audio signal mixing unit 1457 according to response condition data whose field “backchannel” is “No”, the audio signal transmission unit 143 transmits the synthesized audio signal received from the response audio signal mixing unit 1457 to the terminal devices 11 of all users 19. As a result, the backchannel audio is heard only by the user 19 who is speaking (more precisely, the user 19 who has paused in the middle of speaking), and the other users 19 are not disturbed by it.
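The per-terminal choice between the two synthesized signals can be summarized in a few lines of Python. The mode strings, the return values, and the function name `choose_stream` are labels invented for this sketch.

```python
def choose_stream(mode: str, is_backchannel: bool,
                  terminal_user: str, last_speaker: str) -> str:
    """Pick which mixed signal a terminal receives while a response plays.

    "with_response": output of the response audio signal mixing stage.
    "speech_only":   output of the speech audio signal mixing stage.
    """
    if mode == "to_all" or not is_backchannel:
        return "with_response"
    # Mode "to_speaker": only the user who spoke last hears the backchannel.
    return "with_response" if terminal_user == last_speaker else "speech_only"
```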

さらに、応答音声信号選択部１４５５は、相槌等の音声の再生中に新たにいずれかのユーザ１９が発言を開始した場合、その発言が相槌等の音声によって聞き手にとって聞き取りづらくなることのないように、必要に応じて応答音声信号の再生を停止する。より具体的には、応答音声信号選択部１４５５は、応答条件データ（図５参照）のフィールド「割り込み」が「Ｎｏ」であるものに従い応答音声信号ミキシング部１４５７に応答音声信号の引き渡しを行っている時に発言継続時間データ１４４７が更新され、いずれかのユーザ１９により発言が開始されたと判定すると、その時点で引き渡していた応答音声信号をその後、応答音声信号ミキシング部１４５７に引き渡さない。 Further, when a user 19 newly begins to speak while audio such as a backchannel response (aizuchi) is being played, the response audio signal selection unit 1455 stops playback of the response audio signal as necessary so that the backchannel audio does not make that speech hard for listeners to hear. More specifically, when the speech duration data 1447 is updated while the response audio signal selection unit 1455 is passing a response audio signal to the response audio signal mixing unit 1457 according to response condition data (see FIG. 5) whose field “interrupt” is “No”, and the selection unit determines that a user 19 has begun to speak, it stops passing the remainder of that response audio signal to the response audio signal mixing unit 1457.

ただし、応答音声信号選択部１４５５は発言開始の判定を行った時に即時に応答音声信号の引き渡しを停止する代わりに、例えば応答音声信号により示される音声の次の音節の区切り部分までは引き渡しを継続し、その後、引き渡しを停止するようにしてもよい。その場合、例えば応答音声信号選択部１４５５により応答メッセージ「なるほど、そうですね。」の最初の「なる」までの音声信号が応答音声信号ミキシング部１４５７に引き渡された時点でいずれかのユーザ１９の発言があったとすると、応答音声信号選択部１４５５は「〜ほど、」までの音声信号を応答音声信号ミキシング部１４５７に引き渡した後、残りの「そうですね。」の音声信号の引き渡しをキャンセルする。その結果、相槌等の応答音声が不自然に中断される不都合が回避される。 However, instead of stopping delivery of the response audio signal immediately on determining that speech has started, the response audio signal selection unit 1455 may, for example, continue delivery up to the next syllable boundary in the audio indicated by the response audio signal and stop delivery after that. In that case, suppose for example that a user 19 speaks at the point when the response audio signal selection unit 1455 has passed the audio signal for the response message “Naruhodo, sou desu ne.” (roughly, “I see, that's right.”) up to the initial “Naru” to the response audio signal mixing unit 1457. The response audio signal selection unit 1455 then passes the audio signal up to “-hodo,” to the response audio signal mixing unit 1457 and cancels delivery of the remaining “sou desu ne.” This avoids the inconvenience of response audio such as a backchannel response (aizuchi) being cut off unnaturally.
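Stopping at the next boundary rather than immediately could be implemented, for instance, with a sorted list of boundary positions stored alongside each response audio signal. The specification does not say how boundaries are stored; the sample-index representation and the function name `stop_index` below are assumptions.

```python
from bisect import bisect_left
from typing import Sequence


def stop_index(boundaries: Sequence[int], delivered: int, total: int) -> int:
    """Return the sample index at which delivery should stop.

    boundaries: sorted sample indices of the boundaries in the response audio.
    delivered:  number of samples already passed on when speech was detected.
    total:      length of the response audio in samples.
    """
    i = bisect_left(boundaries, delivered)  # first boundary at or after delivered
    return boundaries[i] if i < len(boundaries) else total
```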

一方、応答条件データ（図５参照）のフィールド「割り込み」が「Ｙｅｓ」であるものに従い応答音声信号ミキシング部１４５７に応答音声信号の引き渡しを行っている時にいずれかのユーザ１９により発言が開始されたと判定される場合には、応答音声信号選択部１４５５はその時点で引き渡していた応答音声信号の引き渡しを停止することなく、最後までその応答音声信号を応答音声信号ミキシング部１４５７に引き渡す。そのため、例えば重要な応答メッセージがユーザ１９に伝達されている途中においていずれかのユーザ１９が発言を開始によりその応答メッセージが最後まで伝達されない、といった不都合が回避される。 On the other hand, when the response voice signal is delivered to the response voice signal mixing unit 1457 according to the field “interrupt” of the response condition data (see FIG. 5) being “Yes”, the user 19 starts to speak. If it is determined that the response audio signal is selected, the response audio signal selection unit 1455 passes the response audio signal to the response audio signal mixing unit 1457 to the end without stopping the transfer of the response audio signal that has been transferred at that time. Therefore, for example, an inconvenience that one of the users 19 starts to speak while the important response message is being transmitted to the user 19 and the response message is not transmitted to the end is avoided.

以上のように、音声会議システム１によれば、音声会議に参加するいずれのユーザ１９の負担も増加させることなく、自動的にサーバ装置１４から各々の端末装置１１に送信される応答音声信号により、会議の活性化や、議事進行の改善が図られる。 As described above, according to the audio conference system 1, the response audio signals automatically transmitted from the server device 14 to the terminal devices 11 stimulate the conference and improve the progress of proceedings without increasing the burden on any of the users 19 participating in the audio conference.

ところで、上記の音声会議システム１においては、サーバ装置１４は各々の端末装置１１から受信した全ての音声信号をミキシングして得られる合成音声信号を各々の端末装置１１に送信するものとして説明したが、端末装置１１−ｎに対しては、端末装置１１−ｎ以外の端末装置１１から受信した音声信号のみをミキシングして送信するようにしてもよい。その場合、ユーザ１９−ｎの発言がヘッドセット１３−ｎのヘッドフォン部からエコーのように発音される不都合がなくなる。 In the voice conference system 1 described above, the server apparatus 14 has been described as transmitting a synthesized voice signal obtained by mixing all voice signals received from each terminal apparatus 11 to each terminal apparatus 11. For the terminal device 11-n, only the audio signal received from the terminal device 11 other than the terminal device 11-n may be mixed and transmitted. In that case, there is no inconvenience that the speech of the user 19-n is pronounced like an echo from the headphone unit of the headset 13-n.
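This variation is often called a "mix-minus" feed. One way to compute it, sketched below in Python under the assumption that all signals have the same length, is to sum every signal once and subtract each terminal's own contribution. The function name `mix_minus` and the dictionary keyed by terminal ID are assumptions.

```python
from typing import Dict, List


def mix_minus(signals_by_terminal: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """For each terminal, mix the signals of all the other terminals."""
    terminals = list(signals_by_terminal)
    if not terminals:
        return {}
    length = len(signals_by_terminal[terminals[0]])
    # Sum all signals once, then remove each terminal's own samples.
    total = [sum(signals_by_terminal[t][i] for t in terminals)
             for i in range(length)]
    return {t: [total[i] - signals_by_terminal[t][i] for i in range(length)]
            for t in terminals}
```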

また、上記の音声会議システム１においては、相槌モードは会議データにより特定され、各々の端末装置１１について相槌モードを異ならせることはできないものとして説明したが、相槌モードを各々の端末装置１１に応じて変更可能としてもよい。また、上記の音声会議システム１においては、応答スタイルも会議データにより特定され、各々の端末装置１１について応答スタイルを異ならせることはできないものとして説明したが、応答スタイルを各々の端末装置１１に応じて変更可能としてもよい。 In the audio conference system 1 described above, the backchannel (aizuchi) mode is specified by the conference data and cannot differ between terminal devices 11, but the backchannel mode may instead be made changeable for each terminal device 11. Likewise, in the audio conference system 1 described above, the response style is also specified by the conference data and cannot differ between terminal devices 11, but the response style may instead be made changeable for each terminal device 11.

また、上記の音声会議システム１において、キーフレーズＤＢ１４４３および応答条件ＤＢ１４４４の内容を各々のユーザ１９もしくは音声会議システム１の管理者等により変更可能とすることにより、例えば特定のキーフレーズを特定のユーザ１９が発言することにより、その発言に対応する応答メッセージの再生を可能とするようにしてもよい。その場合、例えば議事進行役のユーザ１９は、予め登録しておいたキーフレーズを発言することにより、自分の発言としては言い出しにくい議事進行のための発言を音声会議システム１に行わせる、といったことが可能となり便利である。 Further, in the above audio conference system 1, the contents of the key phrase DB 1443 and the response condition DB 1444 can be changed by each user 19 or the administrator of the audio conference system 1. When 19 speaks, the response message corresponding to the comment may be reproduced. In that case, for example, the user 19 of the proceeding facilitator makes the voice conference system 1 make a speech for proceeding the proceeding that is difficult to say as his speech by speaking a key phrase registered in advance. Is possible and convenient.

In the audio conference system 1 described above, the terminal devices 11 and the server device 14 are connected to each other via the network 10. However, the terminal devices 11 and the server device 14 may instead be connected to each other by, for example, dedicated lines. Also, although the signals exchanged between the terminal devices 11 and the server device 14 are described above as digital signals, they may be analog signals.

In the audio conference system 1 described above, the server device 14 transmits the response message to the terminal devices 11 as an audio signal. However, the server device 14 may instead transmit the response message to the terminal devices 11 as text data, for example. In that case, a terminal device 11 that has received the text data of the response message can notify the user 19 of the response message by displaying the characters indicated by the text data on the display unit 119. Alternatively, the terminal device 11 that has received the text data may generate, by voice synthesis processing, an audio signal of the response message indicated by the text data and output the generated audio signal to the headset 13, thereby notifying the user 19 of the response message.

Furthermore, the response audio signal group 1445 may, for example, be stored in each terminal device 11 instead of in the server device 14. In that configuration, instead of selecting a response audio signal, the response audio signal selection unit 1455 of the server device 14 transmits to the terminal device 11, as a control signal, data identifying the response audio signal, namely the content of the field "response identification data" contained in the response condition data (see FIG. 5) whose condition has been satisfied. The terminal device 11 then reads the response audio signal from the response audio signal group 1445 stored in the storage unit 121, in accordance with the response identification data contained in the control signal received from the server device 14, and outputs it to the headset 13.
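The exchange in this variation can be illustrated with the sketch below. The JSON encoding of the control signal, the field name `response_id`, and the `Terminal` class are assumptions made only for illustration; the embodiment does not specify the format of the control signal.

```python
import json
from typing import Dict, Optional


def build_control_signal(response_id: str) -> bytes:
    """Server side: wrap the response identification data in a control signal.

    JSON is an assumed encoding chosen for readability.
    """
    return json.dumps({"response_id": response_id}).encode("utf-8")


class Terminal:
    """Terminal side: holds its own copy of the response audio signal group."""

    def __init__(self, response_signals: Dict[str, bytes]) -> None:
        self._response_signals = dict(response_signals)

    def handle_control_signal(self, payload: bytes) -> Optional[bytes]:
        """Return the locally stored response audio signal named by the
        control signal, or None if the identifier is unknown."""
        message = json.loads(payload.decode("utf-8"))
        return self._response_signals.get(message.get("response_id"))
```

Only a short identifier crosses the connection in this arrangement, and the audio data itself is read from the terminal's own storage.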

In the audio conference system 1 described above, the server device 14 provides the terminal devices 11 with a service of transmitting response audio signals. However, the server device may be omitted, for example. In that case, with the terminal devices 11 connected peer-to-peer, each terminal device 11 has functions equivalent to those of the server device 14, and the terminal devices 11 synchronize the necessary data with one another so that each terminal device 11 selects and plays back the response audio signal itself.

In the audio conference system 1 described above, the server device 14 mixes the audio signals representing the speech of the users 19 with the response audio signal and then transmits the result to the terminal devices 11. However, the server device 14 may instead transmit to the terminal devices 11 a group of audio signals that can be distinguished from one another. In that case, the audio signals received from the server device 14 are mixed in the terminal device 11, which allows more flexible use of the audio signals; for example, the user 19 of each terminal device 11 can change the mixing balance among the audio signals as desired. In addition, the server device 14 may exclude the audio signal received from the terminal device 11-n from the group of audio signals transmitted to the terminal device 11-n.
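Terminal-side mixing with a user-adjustable balance might be sketched as follows, assuming each distinguishable audio signal arrives as a named frame of floating-point samples in the range -1.0 to 1.0. The function `mix_with_gains`, the stream names, and the clipping step are hypothetical choices for illustration.

```python
from typing import Dict, List


def mix_with_gains(
    streams: Dict[str, List[float]], gains: Dict[str, float]
) -> List[float]:
    """Mix named frames into one frame, applying a per-stream gain.

    Streams without an entry in `gains` are mixed at unity gain. The result
    is clipped to the range -1.0 to 1.0.
    """
    length = max((len(samples) for samples in streams.values()), default=0)
    mixed = [0.0] * length
    for name, samples in streams.items():
        gain = gains.get(name, 1.0)
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    # Clip so that a loud combination does not exceed the valid sample range.
    return [max(-1.0, min(1.0, value)) for value in mixed]
```

A user who finds the response audio too prominent could, for instance, set its gain to 0.5 while leaving the other participants at unity gain.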

The terminal devices 11 and the server device 14 may each be implemented by dedicated hardware, or by causing a general-purpose computer capable of inputting and outputting audio signals to execute processing in accordance with an application program.

FIG. 1 is a block diagram showing the configuration of an audio conference system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the contents of the conference DB according to the embodiment of the present invention.
FIG. 3 is a diagram illustrating the contents of the user DB according to the embodiment of the present invention.
FIG. 4 is a diagram illustrating the contents of the key phrase DB according to the embodiment of the present invention.
FIG. 5 is a diagram showing the contents of the response condition DB according to the embodiment of the present invention.
FIG. 6 is a diagram showing the structure of a response audio signal included in the response audio signal group according to the embodiment of the present invention.

Explanation of symbols

1: audio conference system; 10: network; 11: terminal device; 13: headset; 14: server device; 111: microphone amplifier; 112: A/D converter; 113, 143: audio signal transmission unit; 114, 141: audio signal reception unit; 115: D/A converter; 116: headphone amplifier; 117: control signal transmission unit; 118, 145: control unit; 119: display unit; 120: operation unit; 121, 144: storage unit; 142: control signal reception unit; 1441: conference DB; 1442: user DB; 1443: key phrase DB; 1444: response condition DB; 1445: response audio signal group; 1446: conference duration data; 1447: speech duration data; 1448: silence duration data; 1449: data buffer; 1451: speech signal detection unit; 1452: silence signal detection unit; 1453: timing unit; 1454: key phrase detection unit; 1455: response audio signal selection unit; 1456: speech audio signal mixing unit; 1457: response audio signal mixing unit.

Claims

An audio signal processing apparatus comprising:
storage means for storing a response audio signal representing a response voice;
input means for receiving audio signals, each of which is output from one of a plurality of terminal devices;
detection means for detecting a signal indicating silence in the audio signals, based on the level of each portion of the audio signals received by the input means;
time measuring means for measuring the time for which the signal indicating silence detected by the detection means continues;
selection means for selecting a response audio signal corresponding to the time measured by the time measuring means; and
output means for outputting the response audio signal selected by the selection means.

The audio signal processing apparatus according to claim 1, wherein
the storage means further stores key phrase data indicating a key phrase,
the apparatus further comprises voice recognition means for identifying phrases contained in the speech represented by an audio signal received by the input means, by performing voice recognition processing on that audio signal, and detection means for detecting the key phrase indicated by the key phrase data among the phrases identified by the voice recognition means, and
the selection means selects a response audio signal corresponding to the key phrase data indicating the key phrase detected by the detection means.

The audio signal processing apparatus according to claim 1, wherein the selection means selects a response audio signal corresponding to a terminal device that was the output source of an audio signal received by the input means within a predetermined past period.

The audio signal processing apparatus according to claim 1, further comprising time measuring means for measuring the time elapsed from a specific point in time,
wherein the selection means selects a response audio signal corresponding to the time measured by that time measuring means.

The audio signal processing apparatus according to claim 1, further comprising time measuring means for measuring the time for which an audio signal output from one terminal device and received by the input means continues,
wherein the selection means selects a response audio signal corresponding to the time measured by that time measuring means.

The audio signal processing apparatus according to claim 1, wherein the selection means selects a predetermined response audio signal when, while an audio signal output from one terminal device is being continuously received by the input means, an audio signal output from another terminal device is received by the input means.

The audio signal processing apparatus according to claim 1, wherein, when an audio signal is received by the input means while the output means is outputting a response audio signal, the output means does not output at least part of the not-yet-output portion of the response audio signal being output.

The audio signal processing apparatus according to claim 1, wherein the output means outputs the response audio signal with, as its only destination, a terminal device that was the output source of an audio signal received by the input means within a predetermined past period.

The audio signal processing apparatus according to claim 1, wherein
the storage means stores, in addition to or instead of the response audio signal, response text data indicating the characters of a response message,
the selection means selects, in addition to or instead of the response audio signal corresponding to the time measured by the time measuring means, response text data corresponding to the time measured by the time measuring means, and
the output means outputs the response text data selected by the selection means.

A program for causing a computer to execute:
a process of storing a response audio signal representing a response voice;
a process of receiving audio signals, each of which is output from one of a plurality of terminal devices;
a process of detecting a signal indicating silence in the audio signals, based on the level of each portion of the received audio signals;
a process of measuring the time for which the detected signal indicating silence continues;
a process of selecting a response audio signal corresponding to the measured time; and
a process of outputting the selected response audio signal.
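The sequence of processes recited above (level-based silence detection, measurement of the silence duration, and selection of a response according to that duration) can be pictured with the following sketch. It is a simplified illustration under stated assumptions, not the claimed program: silence is judged by comparing the peak level of a frame with a fixed threshold, every frame has the same duration, and responses are identified by strings. The names `is_silent`, `SilenceResponder`, and the rule format are hypothetical.

```python
from typing import List, Optional, Sequence, Set, Tuple


def is_silent(frame: Sequence[float], threshold: float) -> bool:
    """Judge a frame silent when its peak absolute level is below threshold."""
    return max((abs(sample) for sample in frame), default=0.0) < threshold


class SilenceResponder:
    """Tracks how long all terminals stay silent and picks a response."""

    def __init__(
        self, frame_seconds: float, threshold: float, rules: List[Tuple[float, str]]
    ) -> None:
        self.frame_seconds = frame_seconds
        self.threshold = threshold
        # Each rule is (minimum silence in seconds, response identifier).
        self.rules = sorted(rules)
        self._silent_frames = 0
        self._fired: Set[str] = set()

    def process(self, frames: Sequence[Sequence[float]]) -> Optional[str]:
        """Take one frame per terminal for the current time slot and return
        the identifier of a response to output, or None."""
        if not all(is_silent(frame, self.threshold) for frame in frames):
            # Someone spoke: restart the measurement and re-arm all rules.
            self._silent_frames = 0
            self._fired.clear()
            return None
        self._silent_frames += 1
        silent_seconds = self._silent_frames * self.frame_seconds
        for min_seconds, response_id in self.rules:
            if silent_seconds >= min_seconds and response_id not in self._fired:
                self._fired.add(response_id)
                return response_id
        return None
```

Each rule fires at most once per stretch of silence, so a prompt is not repeated on every frame while the participants remain quiet.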