JP6701573B2 - Audio processing device, audio/video output device, and remote conference system

Info

Publication number: JP6701573B2
Application number: JP2016152762A
Authority: JP
Inventors: 亮人相場; 吉田　実; 実吉田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2016-08-03
Filing date: 2016-08-03
Publication date: 2020-05-27
Anticipated expiration: 2036-08-03
Also published as: US9959881B2; US20180040332A1; JP2018022028A

Description

The present invention relates to an audio processing device having sound acquisition means for acquiring sounds arriving from each of a plurality of mutually different positions, and to an audio/video output device and a remote conference system that include the audio processing device.

Audio processing devices are conventionally known that have position specifying means for specifying, from among a plurality of mutually different positions, the sound source position at which a sound source exists, and audio signal output means for outputting an audio signal in which the sound arriving from the sound source position is emphasized.

For example, the audio processing device described in Patent Document 1 specifies the sound source position based on differences in the timing at which sound reaches a plurality of microphones, and emphasizes the sound from that position by selectively outputting only that sound. This configuration is said to allow the sound from the sound source to be output in a manner distinguished from noise.

However, when the sound source position changes, for example when people seated at different positions exchange opinions, the sound from the new sound source position is left unemphasized until the position specifying means specifies that position as the sound source position. Meanwhile, the sound from the position that was the sound source position until just before the change (now nearly silent) is emphasized instead, so the sound arriving from the actual sound source position becomes unclear.

To solve the above problem, the present invention provides an audio processing device having sound acquisition means for acquiring sounds arriving from each of a plurality of mutually different positions, position specifying means for specifying, from among the plurality of positions, a sound source position at which a sound source exists, and audio signal output means for outputting an audio signal in which the sound arriving from the sound source position specified by the position specifying means is emphasized, wherein the audio signal output means is configured so that, among the plurality of positions, a position with a history of having been the sound source position within a predetermined time in the past is treated as a non-sound-source emphasis target position, that is, a position whose sound is emphasized regardless of the result of sound source position specification by the position specifying means.

The present invention has the excellent effect of preventing the sound arriving from a sound source position from becoming unclear immediately after that position switches from a non-sound-source position to a sound source position.

FIG. 1 is a block diagram showing the main part of the electric circuit of the remote conference system according to the first embodiment.
FIG. 2 is a block diagram showing the main part of the internal configuration of the audio processing unit in the audio/video output device of the remote conference system.
FIG. 3 is a schematic diagram showing the present location conference room in a state where the position specifying unit 11 correctly recognizes the speaker's position as the sound source position.
FIG. 4 is a schematic diagram showing the present location conference room immediately after the speaker has changed from person A to person C, in a state where person C's position is not yet correctly recognized as the sound source position.
FIG. 5 is a schematic diagram illustrating an example of the speaker history and speaking periods within the retroactive time T.
FIG. 6 is a schematic diagram illustrating the relationship between the non-sound-source emphasis target positions and the sound source position in the audio/video output device of the remote conference system.
FIG. 7 is a schematic diagram illustrating an example in which, when several persons speak simultaneously, the position of each speaker is treated as a sound source position.
FIG. 8 is a flowchart showing the processing flow executed by the emphasis processing unit 12a of the remote conference system according to the first example.
FIG. 9 is a schematic diagram illustrating a second example of the speaker history and speaking periods within the retroactive time T.
FIG. 10 is a flowchart showing an example of the processing flow executed by the emphasis processing unit 12a of the remote conference system according to the second example.
FIG. 11 is a schematic diagram illustrating a third example of the speaker history and speaking periods within the retroactive time T.
FIG. 12 is a flowchart showing the processing flow executed by the emphasis processing unit 12a of the remote conference system according to the third example.
FIG. 13 is a schematic diagram showing the present location conference room in a state where emphasis of the sound from the sound source position has been stopped.
FIG. 14 is a flowchart showing the processing flow executed by the emphasis processing unit 12a of the remote conference system according to the fourth example.

A first embodiment of a remote conference system to which the present invention is applied is described below.
First, the basic configuration of the remote conference system according to the first embodiment is described. FIG. 1 is a block diagram showing the main part of the electric circuit of the remote conference system according to the first embodiment.

In FIG. 1, the remote conference system includes a plurality of audio/video output devices 100. Placed at locations remote from one another, these audio/video output devices 100 can communicate with each other over a network line 200 such as the Internet.

The audio/video output device 100 includes a CPU (Central Processing Unit) 1, a memory 2, a video processing unit 3, a speaker 4, a network interface (hereinafter, network I/F) 5, and the like. It also includes an image sensor interface (hereinafter, image sensor I/F) 6, a video output interface (hereinafter, video output I/F) 7, an audio input/output interface (hereinafter, audio input/output I/F) 8, a camera 9, an audio processing unit 10, a microphone array 20, and the like.

The CPU 1 is an arithmetic unit that realizes the various functions of the audio/video output device 100 by executing processing based on, for example, programs and data read from the memory 2.

The memory 2 is a storage device such as a RAM (Random Access Memory), a ROM (Read Only Memory), an HDD (Hard Disk Drive), or a flash memory. The memory 2 stores the software, numerical data, calculation data, image data, and audio data needed for the various processes executed by the CPU 1.

The video processing unit 3 performs various kinds of video processing on video data or video signals. The audio processing unit 10 performs various kinds of audio processing on audio data or audio signals. The video processing unit 3 and the audio processing unit 10 each include a processor such as a DSP (Digital Signal Processor).

The network I/F 5 is an interface for communicably connecting to the network line 200.

The image sensor I/F 6 is an interface that takes in the video signal output from the camera 9 as predetermined video data.

The video output I/F 7 is an interface for sending video signals to an externally installed image display device 211 such as an LCD (Liquid Crystal Display) monitor or a projector.

The audio input/output I/F 8 takes in the audio signal input via the microphone array 20 as predetermined audio data. It also converts the audio data to be output into an audio signal that the speaker 4 can reproduce.

The system bus 21 carries address signals, data signals, and various control signals.

In the remote conference system according to the first embodiment configured as above, the combination of the CPU 1, the memory 2, the microphone array 20 serving as sound acquisition means, the audio input/output I/F 8, the audio processing unit 10, and the like constitutes an audio processing device. Likewise, the combination of the CPU 1, the memory 2, the image sensor I/F 6, the camera 9, the video processing unit 3, and the like constitutes a video processing device.

Hereinafter, of the two audio/video output devices 100 shown in FIG. 1, the location of the one whose internal circuit configuration is shown is called the present location conference room, and the location of the other audio/video output device 100 is called the remote conference room.

In the audio/video output device 100 in the present location conference room, the video signal acquired by the camera 9 is sent to the video processing unit 3 via the image sensor I/F 6 and the system bus 21, and the audio signal acquired by the microphone array 20 is sent to the audio processing unit 10 via the audio input/output I/F 8 and the system bus 21. Meanwhile, the CPU 1 sends the audio signal received from the audio/video output device 100 in the remote conference room to the audio processing unit 10 via the system bus 21, and sends the video signal received from that device to the video processing unit 3 via the system bus 21.

The video processing unit 3 of the audio/video output device 100 in the present location conference room generates a composite video signal from the video signal obtained by the camera 9, which serves as imaging means, and the video signal received from the audio/video output device 100 in the remote conference room. Specifically, the composite video signal displays a composite image in which a small framed image based on the former video signal is overlaid on part of a large image based on the latter video signal. The video processing unit 3 sends the composite video signal to the image display device 211 via the system bus 21 and the video output I/F 7, and the image display device 211 displays the composite video.

The audio processing unit 10 of the audio/video output device 100 in the present location conference room generates a synthesized audio signal by combining the audio signal obtained by the microphone array 20, which serves as sound acquisition means, with the audio signal received from the audio/video output device 100 in the remote conference room. It then outputs the synthesized audio signal to the speaker 4 via the system bus 21 and the audio input/output I/F 8. As a result, the speaker 4 outputs both the voices of the conference participants in the remote conference room and the voices of the conference participants in the present location conference room.

The audio/video output device 100 in the remote conference room performs the same processing as the one in the present location conference room. Conference participants in the present location conference room can therefore exchange opinions with the participants in the remote conference room while watching video of them on the image display device 211. Likewise, conference participants in the remote conference room can exchange opinions with the participants in the present location conference room while watching video of them on the image display device installed in the remote conference room.

The microphone array 20 includes a plurality of microphones with mutually different directivities, and thereby individually and selectively acquires the sounds arriving from each of a plurality of mutually different positions around the audio/video output device 100. It outputs the audio signal acquired by each microphone individually. Hereinafter, this set of audio signals is called the array audio signal.

FIG. 2 is a block diagram showing the main part of the internal configuration of the audio processing unit 10. The audio processing unit 10 has a position specifying unit 11 serving as position specifying means and an audio signal output unit 12 serving as audio signal output means. The audio signal output unit 12 includes an emphasis processing unit 12a and a synthesis unit 12b.

In FIG. 2, the audio signal sent from the remote conference room is input to the synthesis unit 12b. The array audio signal acquired by the microphone array 20 in the present location conference room, on the other hand, is input in parallel to the position specifying unit 11 and the emphasis processing unit 12a.

The position specifying unit 11 specifies, as the sound source audio signal, the audio signal with the highest sound intensity among the plurality of audio signals included in the array audio signal. It then outputs to the emphasis processing unit 12a, as a sound source position signal, a signal indicating which of the microphones in the microphone array 20 corresponds to that sound source audio signal. In other words, the position specifying unit 11 functions as position specifying means that specifies, from among a plurality of mutually different positions, the sound source position at which a sound source exists.
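The selection rule described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the position specifying unit 11: the function names, the use of RMS level as the measure of sound intensity, and the `silence_threshold` parameter are all introduced here for the example.

```python
import math


def rms(samples):
    """Root-mean-square level of one block of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def identify_source_mic(array_block, silence_threshold=0.01):
    """Return the index of the loudest microphone, or None if all are quiet.

    array_block: one list of samples per microphone (a block of the
        array audio signal).
    silence_threshold: hypothetical level below which no speaker is
        assumed to be present (not specified in the source text).
    """
    levels = [rms(channel) for channel in array_block]
    if not levels:
        return None
    loudest = max(range(len(levels)), key=levels.__getitem__)
    return loudest if levels[loudest] >= silence_threshold else None
```

The returned index plays the role of the sound source position signal; returning `None` corresponds to the situation in which no sound source position signal is output.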

Based on the plurality of audio signals included in the array audio signal, the emphasis processing unit 12a generates an audio signal in which, among the sounds reproduced by those audio signals, the sound corresponding to the sound source position signal sent from the position specifying unit 11 is emphasized over the sounds that do not correspond to it. It outputs that audio signal to the synthesis unit 12b and also sends it, via the system bus 21 and the like, to the audio/video output device 100 in the remote conference room.
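One simple way to picture this emphasis is a weighted mix of the microphone channels. The sketch below is a hedged illustration only: the source does not state how the emphasis is computed, and the function name and the `boost` and `cut` gain parameters are assumptions made for this example.

```python
def emphasize(array_block, emphasized, boost=1.0, cut=0.1):
    """Mix microphone channels into one signal, favoring emphasized channels.

    array_block: one list of samples per microphone, all of equal length.
    emphasized: set of microphone indices whose sound is to be emphasized.
    boost, cut: hypothetical gains for emphasized and other channels.
    """
    if not array_block:
        return []
    length = len(array_block[0])
    mixed = [0.0] * length
    for index, channel in enumerate(array_block):
        gain = boost if index in emphasized else cut
        for i in range(length):
            mixed[i] += gain * channel[i]
    return mixed
```

Passing a set rather than a single index lets the same function cover the case where several positions are emphasized at once.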

The synthesis unit 12b generates a synthesized audio signal by combining the audio signal sent from the remote conference room with the audio signal sent from the emphasis processing unit 12a, and outputs it to the speaker 4 via the system bus 21 and the like.

With this configuration, in the present location conference room, the sound arriving from the sound source position where a person is speaking, among the plurality of positions around the audio/video output device 100, is output with more emphasis than the sound arriving from non-sound-source positions where no one is speaking. The voice of the person speaking can thus be output clearly, with the noise arriving from non-sound-source positions removed, to the speaker 4 and to the audio/video output device 100 in the remote conference room.

However, this configuration has the problem that, when the person speaking changes, the new speaker's voice is not output at sufficient strength immediately after the change and therefore becomes unclear.

FIG. 3 is a schematic diagram showing the present location conference room in a state where the position specifying unit 11 correctly recognizes the speaker's position as the sound source position. In FIG. 3, a position enclosed in a triangular frame is a position selected as a sound emphasis target. In this example, person A is speaking and the position specifying unit 11 of the audio/video output device 100 correctly recognizes person A's position as the sound source position, so person A's position is appropriately selected as the sound emphasis target.

FIG. 4 is a schematic diagram showing the present location conference room immediately after the speaker has changed from person A to person C. In FIG. 4, person C is speaking in place of person A, who has finished speaking. The sound source position is therefore person C's position, but the position specifying unit 11 still recognizes the position of person A, who was speaking until a moment ago, as the sound source position, and recognizes person C's position as a non-sound-source position with no speaker. As a result, the emphasis processing unit 12a treats the position of person A, who is not speaking, as the sound emphasis target and does not treat the position of person C, who is speaking, as a sound emphasis target. It outputs an audio signal in which the noise arriving from person A's position is emphasized over person C's voice, which makes person C's voice unclear. In short, the voice from the new sound source position remains unclear until the position specifying unit 11 correctly recognizes that position as the sound source position.

Next, the characteristic configuration of the remote conference system according to the first embodiment is described.
The emphasis processing unit 12a of the remote conference system according to the first embodiment treats, among the plurality of positions corresponding to the respective microphones of the microphone array 20, any position with a history of having been the sound source position within a predetermined time in the past as a non-sound-source emphasis target position. A non-sound-source emphasis target position is a position whose sound is emphasized regardless of the result of sound source position specification by the position specifying unit 11; more precisely, it is a position whose sound is emphasized even when the position specifying unit 11 has not specified it as the sound source position.

FIG. 5 is a schematic diagram illustrating an example of the speaker history and speaking periods within the retroactive time T. In FIG. 5, time advances in the direction of the arrow labeled t. The retroactive time T is the period extending from the present back to a point a predetermined time in the past. The periods labeled A, B, and C are the periods during which person A, person B, and person C were speaking, and unlabeled periods are periods with no speaker. In the example shown, three positions have a history of having been the sound source position within the retroactive time T: the position of person A (hereinafter, position A), the position of person B (hereinafter, position B), and the position of person C (hereinafter, position C). In this case, the emphasis processing unit 12a performs the following processing for each of positions A, B, and C. Even when the position specifying unit 11 does not currently specify the position as the sound source position (that is, even when no sound source position signal corresponding to the position is being sent), the emphasis processing unit 12a outputs an audio signal in which the sound from the position is emphasized (amplified) to the same degree as the sound arriving from a position specified as the sound source position. For example, as shown in FIG. 6, even when only person A is speaking, positions B and C are sound emphasis targets in addition to position A.
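The history-based selection described above can be sketched as a sliding window over past sound source positions. This is a minimal sketch under assumptions, not the implementation of the emphasis processing unit 12a: the class name `SourceHistory`, its methods, the use of timestamps in seconds, and the requirement that records arrive in time order are all hypothetical.

```python
from collections import deque


class SourceHistory:
    """Remembers which positions were sound sources within a lookback window."""

    def __init__(self, lookback_seconds):
        self.lookback = lookback_seconds  # plays the role of retroactive time T
        self.events = deque()  # (timestamp, position) pairs, oldest first

    def record(self, timestamp, positions):
        """Store the positions specified as sound sources at `timestamp`.

        Assumes calls are made with non-decreasing timestamps.
        """
        for position in positions:
            self.events.append((timestamp, position))

    def emphasis_targets(self, now, current_sources=()):
        """Positions to emphasize: current sources plus recent past sources."""
        while self.events and self.events[0][0] < now - self.lookback:
            self.events.popleft()  # drop entries older than the window
        recent = {position for _, position in self.events}
        return recent | set(current_sources)
```

For instance, with a 10-second window in which positions A, B, and C each became a source, `emphasis_targets` returns all three positions even while only A is currently specified, which mirrors the situation shown in FIG. 6.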

With this configuration, for each of positions A, B, and C, the voice is reliably emphasized and output even immediately after the position switches from a non-sound-source position to the sound source position. The voices of person A, person B, and person C are thus kept from becoming unclear immediately after each begins to speak. Furthermore, because noise arriving from positions other than these three is not emphasized, mixing of noise into the speaker's voice is also suppressed.

Although the above example specifies the sound source position based on the audio signals sent from the respective microphones of the microphone array 20, the sound source position may be specified by other configurations. For example, face recognition using a well-known technique may be performed on the video signal acquired by the camera 9 to identify the persons in the present location conference room and their positions, and the position of a face showing active mouth movement (that is, a speaker) may be specified as the sound source position. In this case, the position specifying unit 11 may output, as the sound source position signal, the angle of the direction of the sound source position, with a predetermined position around the audio/video output device 100 taken as the 0 [deg] reference. In addition, when the microphone array 20 is used as sound acquisition means, the emphasis processing unit 12a may be configured to specify, among the microphones of the microphone array 20, the audio signal sent from the microphone corresponding to that angle as the audio signal from the sound source position.
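The mapping from an angle to a microphone might look like the sketch below. The source does not describe the microphone layout, so this example assumes, purely for illustration, that the microphones are evenly spaced around the device and that microphone 0 points at the 0 [deg] reference; the function name `mic_for_angle` is likewise hypothetical.

```python
def mic_for_angle(angle_deg, mic_count):
    """Index of the microphone whose direction is nearest to angle_deg.

    Assumes mic_count microphones evenly spaced over 360 degrees, with
    microphone 0 aimed at the 0 [deg] reference direction.
    """
    if mic_count <= 0:
        raise ValueError("mic_count must be positive")
    sector = 360.0 / mic_count
    return int(round((angle_deg % 360.0) / sector)) % mic_count
```

The final modulo wraps angles just below 360 [deg] back to microphone 0, so a source at 350 [deg] is assigned to the microphone facing the reference direction.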

Also, although the above example selects, from the audio signals sent from the respective microphones of the microphone array, the audio signal from a specific microphone and emphasizes it over the other audio signals, the following configuration may be used instead. A plurality of variable-directivity microphones whose sound collection directivity can be changed are provided as sound acquisition means. The emphasis processing unit 12a is then configured to emphasize the sound arriving from the sound source position, or from a position with a history of having been the sound source position, by aiming the sound collection directivity of those microphones at that position.

When a plurality of persons speak at the same time, each of the positions corresponding to those persons may be specified as a sound source position, or only the position of the person who spoke with the highest voice intensity may be specified as the sound source position.

Table 1 shows the relationship among the positions, the number of simultaneous speakers, and the sound source position signal output from the position specifying unit 11.

In the example shown in Table 1, an information signal giving the angle [deg] of the direction of a position is output from the position specifying unit 11 as the sound source position signal. When position A, position B, position C, or position D is specified as the sound source position, the position specifying unit 11 outputs 30, 120, 210, or 350, respectively, as the sound source position signal. In Case 1, person A at position A and person C at position C are speaking at the same time. In Case 2, only person B at position B is speaking. In Case 3, person C at position C and person D at position D are speaking at the same time. In Case 4, no one is speaking, and no sound source position signal is output from the position specifying unit 11.

When two or more persons speak at the same time, as in Case 1 and Case 3, the position specifying unit 11 may be configured to specify all of the positions corresponding to those persons as sound source positions and to output a sound source position signal for each position. In Case 1, for example, as shown in FIG. 7, both position A, where the speaking person A is located, and position C, where the speaking person C is located, are specified as sound source positions, and the position specifying unit 11 outputs two sound source position signals, 30 [deg] and 210 [deg].

Alternatively, when two or more persons speak at the same time, the position specifying unit 11 may be configured to specify only the position of the person who speaks with the highest voice intensity as the sound source position and to output the sound source position signal corresponding to that position. In case 1, for example, when the person A speaks with a higher voice intensity than the person C, only 30 [deg], corresponding to the A position, is output from the position specifying unit 11 as the sound source position signal.
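
The two output policies of the position specifying unit 11 can be illustrated with a minimal Python sketch. The angle values are those given in the text; the dictionary name, the function name, the `loudest_only` flag, and the intensity values are hypothetical and serve only as illustration.

```python
# Angles [deg] of positions A to D as given in the text.
# The names below are hypothetical and not part of the described device.
POSITION_ANGLE_DEG = {"A": 30, "B": 120, "C": 210, "D": 350}


def source_position_signals(intensity_by_position, loudest_only=False):
    """Return the angle signals for the positions where speech is detected.

    intensity_by_position maps a position name to the voice intensity
    measured there; only positions with an active speaker appear in it.
    """
    if not intensity_by_position:
        return []  # case 4: nobody speaks, so no signal is output
    if loudest_only:
        # Policy 2: only the position with the highest voice intensity.
        loudest = max(intensity_by_position, key=intensity_by_position.get)
        return [POSITION_ANGLE_DEG[loudest]]
    # Policy 1: every position with an active speaker.
    return sorted(POSITION_ANGLE_DEG[p] for p in intensity_by_position)


case3 = {"C": 0.4, "D": 0.9}  # persons C and D speak at the same time
print(source_position_signals(case3))                     # [210, 350]
print(source_position_signals(case3, loudest_only=True))  # [350]
print(source_position_signals({"B": 0.6}))                # [120]
print(source_position_signals({}))                        # []
```

Which policy is preferable depends on how much noise can be tolerated when several positions are emphasized at once.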

Next, examples in which more characteristic configurations are added to the remote conference system according to the first embodiment will be described. Unless otherwise noted below, the configuration of the remote conference system according to each example is the same as that of the first embodiment. In the following, a position that the emphasis processing unit 12a decides to treat as a sound emphasis target on the basis of the history within the retroactive time T, even though the position specifying unit 11 has not specified it as a sound source position, is referred to as a non-sound source emphasis target position.

[First example]
The emphasis processing unit 12a of the remote conference system according to the first example selects, from among the positions that have a history of being a sound source position within the retroactive time T, only those positions that satisfy the following condition as non-sound source emphasis target positions. The condition is that the number of times the position has switched from not being a sound source position to being a sound source position within the retroactive time T (hereinafter referred to as the switching history number Na) exceeds a predetermined threshold, or is equal to or greater than the threshold. One example of such a condition is that the switching history number Na within the retroactive time T is two or more. When this example condition is adopted, in the example shown in FIG. 5, only the A position and the B position among the A position, the B position, and the C position are selected as non-sound source emphasis target positions.

FIG. 8 is a flowchart showing the processing flow executed by the emphasis processing unit 12a of the remote conference system according to the first example. The emphasis processing unit 12a first reads the history data for the retroactive time T stored in the storage circuit (step 1; hereinafter, "step" is abbreviated as S). Then, from among the positions that have a history of being a sound source position within the retroactive time T, it selects all positions whose switching history number Na is two or more as non-sound source emphasis target positions (S2). Next, after specifying the sound source position on the basis of the sound source position signal sent from the position specifying unit 11 (S3), it emphasizes the sounds sent from the non-sound source emphasis target positions and from the sound source position over the other sounds (S4). The processing flow then returns to S1.
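
Steps S1 to S4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the history is assumed to be a time-ordered list of `(time, position)` samples with a single sound source per sample, and the function names, the sample data, and the window handling are hypothetical rather than taken from the described device.

```python
from collections import Counter


def switching_counts(history, now, t_window):
    """Count, per position, the switches from 'not the sound source' to
    'sound source' (Na) among samples with now - t_window <= time <= now.

    history is a time-ordered list of (time, position) samples, where
    position is None while nobody speaks (an assumed data layout).
    """
    counts = Counter()
    previous = None
    for time, position in history:
        in_window = now - t_window <= time <= now
        if in_window and position is not None and position != previous:
            counts[position] += 1
        previous = position
    return counts


def emphasis_targets(history, now, t_window, current_sources, min_count=2):
    """Return every position whose sound should be emphasized."""
    counts = switching_counts(history, now, t_window)             # S1
    non_source = {p for p, n in counts.items() if n >= min_count}  # S2
    return non_source | set(current_sources)                      # S3, S4


# Hypothetical history: A and B each become the source twice, C once.
history = [(0, "A"), (1, None), (2, "B"), (3, "A"),
           (4, "B"), (5, "C"), (6, None)]
print(sorted(emphasis_targets(history, now=6, t_window=10,
                              current_sources=[])))  # ['A', 'B']
```

With a shorter window, for example `t_window=2`, no position reaches a count of two in this history, so only the current sound source positions would be emphasized.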

With this configuration, it is possible to suppress the deterioration in output sound quality that would be caused by continuing to emphasize, over a long time, the small noise sent from the position of a person who speaks infrequently.

[Second example]
In the remote conference system according to the second example, in order to avoid the increase in noise that would result from selecting too many non-sound source emphasis target positions, at most one non-sound source emphasis target position is selected, as needed. When a plurality of persons speak at the same time, only the position with the highest voice intensity is treated as the sound source position.

The emphasis processing unit 12a of the audio/video output device 100 selects the non-sound source emphasis target position on the basis of the most recent trend in the switching of the sound source position. Specifically, from among the positions that have a history of being a sound source position within the retroactive time T, it focuses on the position, other than the current sound source position, that was most recently a sound source position (hereinafter referred to as the latest history position). The latest history position is set as the non-sound source emphasis target position when the following condition is satisfied: the sound source position has been switching alternately only between the latest history position and the current sound source position, and the number of such switches, the number of alternations Nb, exceeds a predetermined threshold or is equal to or greater than the threshold. This is because, when the condition is satisfied, the two persons are likely to be engaged in an active exchange of opinions, so the sound source position is likely to keep switching between them relatively frequently.

In the example shown in FIG. 5, when the current sound source position is the A position, the latest history position is the B position. Between the second block from the right in the figure (the time period in which the person B is speaking) and the fourth block (the time period in which the person A is speaking), there is a block with no speaker (hereinafter referred to as a blank block or blank time). When the blank time is less than a predetermined value, the blank time is ignored, and the second block and the fourth block are evaluated as to whether they constitute an alternating switch between the sound source position and the latest history position. When the blank time is equal to or greater than the predetermined value, it is determined that there is no alternating switch between the second block and the fourth block. Accordingly, the number of alternations Nb is four when the blank time is less than the predetermined value and one when the blank time is equal to or greater than the predetermined value. If the number of alternations Nb exceeds the threshold, or is equal to or greater than the threshold, the B position is set as the non-sound source emphasis target position.

In the example shown in FIG. 5, when the current sound source position is instead the B position, the latest history position is the A position. In that case, the number of alternations Nb is five if the blank time is less than the predetermined value and two if the blank time is equal to or greater than the predetermined value.

FIG. 9 is a schematic diagram for explaining a second example of the speaker history and the speaking time periods within the retroactive time T. In this second history example, when the current sound source position is the A position, the latest history position is the B position. When the blank time that is third from the right in the figure is less than the predetermined value, the number of alternations Nb is two. When the blank time is equal to or greater than the predetermined value, the number of alternations Nb is one.

In the second history example, when the current sound source position is instead the B position, the latest history position is the A position. In that case, the number of alternations Nb is three if the blank time is less than the predetermined value and two if the blank time is equal to or greater than the predetermined value.
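
One possible way of counting the number of alternations Nb is sketched below in Python. It is an interpretation, not the method of the figures: the history is assumed to be an oldest-first list of `(position, duration)` blocks that already lies within the retroactive time T, a blank block is `(None, duration)`, and all names are hypothetical. The sample sequence is made up so that the counts agree with the values quoted for FIG. 5; the figure itself is not reproduced here.

```python
def latest_history_position(blocks):
    """Most recent speaking position that differs from the current one."""
    current = blocks[-1][0]
    for position, _ in reversed(blocks[:-1]):
        if position is not None and position != current:
            return position
    return None


def alternation_count(blocks, blank_limit):
    """Count switches between the current sound source position and the
    latest history position, walking back from the newest block.

    Blanks shorter than blank_limit are ignored; a longer blank or a
    third position ends the count.
    """
    if not blocks or blocks[-1][0] is None:
        return 0  # no current sound source position
    current = blocks[-1][0]
    partner = latest_history_position(blocks)
    nb = 0
    last_seen = current
    for position, duration in reversed(blocks[:-1]):
        if position is None:
            if duration >= blank_limit:
                break      # long blank: the exchange is considered over
            continue       # short blank: ignored
        if position not in (current, partner):
            break          # a third position interrupts the exchange
        if position != last_seen:
            nb += 1
            last_seen = position
    return nb


# Hypothetical blocks, oldest first; the blank lasts 3 time units.
blocks = [("A", 5), ("B", 4), ("A", 6), (None, 3), ("B", 5), ("A", 4)]
print(alternation_count(blocks, blank_limit=5))  # 4 (blank ignored)
print(alternation_count(blocks, blank_limit=2))  # 1 (blank ends the count)
```

Appending a further block `("B", 2)` makes the B position the current sound source position, and the same function then yields 5 and 2 for the two blank limits.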

FIG. 10 is a flowchart showing an example of the processing flow executed by the emphasis processing unit 12a of the remote conference system according to the second example. In this flowchart, the emphasis processing unit 12a first determines whether a current sound source position exists (S1). When none exists (N in S1), it returns the processing flow to S1 without selecting a non-sound source emphasis target position (S9). As a result, none of the positions is emphasized and the sound levels of all positions are the same, so the voice of the next speaker can be picked up well.

On the other hand, when a current sound source position exists (Y in S1), the emphasis processing unit 12a obtains the above-described number of alternations Nb on the basis of the past history (S2). It then obtains the total Na' of the switching history number Na of the current sound source position and the switching history number Na of the latest history position (S3). The latest history position is the position that was most recently a sound source position among the positions other than the current sound source position, and is therefore different from the current sound source position.

Having obtained the total Na', the emphasis processing unit 12a determines whether the total Na' exceeds a predetermined threshold β (S4). When it does (Y in S4), the threshold α, which is later compared with the total Na', is set to a relatively small value (S5). When it does not (N in S4), the threshold α is set to a relatively large value (S6). By changing the value of the threshold α in this way, when the speakers are alternating within a short time because of an active exchange of opinions (Y in S4), the threshold α is made relatively small so that the latest history position is more readily selected as the non-sound source emphasis target position. This further suppresses the loss of clarity at the beginning of each speaker's voice when the speakers alternate within a short time.

Having set the threshold α, the emphasis processing unit 12a determines whether the total Na' exceeds the threshold α (S7). When it does (Y in S7), the latest history position is set as the non-sound source emphasis target position (S8), and the processing flow returns to S1. When it does not (N in S7), the processing flow returns to S1 without a non-sound source emphasis target position being selected (S9).
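
The decision flow of FIG. 10 can be summarized in a short Python sketch. It follows the flow as worded in the text, in which the total Na' is compared first with β and then with α, and it assumes that a total Na' above β corresponds to the active-exchange branch in which α is set small. The function name, the parameter names, and the numeric values are hypothetical.

```python
def select_non_source_target(current, latest_history, na,
                             beta, alpha_small, alpha_large):
    """Return the latest history position if it should be emphasized,
    or None if no non-sound source emphasis target position is selected.

    na maps each position to its switching history number Na within the
    retroactive time T. Step S2 (the number of alternations Nb) is not
    needed for the comparisons below as the text words them, so it is
    omitted from this sketch.
    """
    if current is None:                                   # S1: N
        return None                                       # S9
    na_total = na.get(current, 0) + na.get(latest_history, 0)   # S3
    # S4 to S6: active exchange -> small alpha, otherwise large alpha.
    alpha = alpha_small if na_total > beta else alpha_large
    if na_total > alpha:                                  # S7
        return latest_history                             # S8
    return None                                           # S9


na = {"A": 3, "B": 2, "C": 1}  # hypothetical switching history numbers
print(select_non_source_target("A", "B", na, beta=4,
                               alpha_small=2, alpha_large=6))  # B
print(select_non_source_target("A", "C", na, beta=4,
                               alpha_small=2, alpha_large=6))  # None
```

In the first call Na' is 5, which exceeds β, so the small α applies and the B position is selected. In the second call Na' is 4, so the large α applies and no position is selected.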

With this configuration as well, it is possible to suppress the deterioration in output sound quality that would be caused by continuing to emphasize, over a long time, the small noise sent from the position of a person who speaks infrequently.

In the example described above, the total Na' is used as the criterion for deciding whether the threshold α is set to a relatively small value or a relatively large value. Alternatively, the total time during which speech was made simultaneously at the sound source position and the latest history position may be used as the criterion. In the example shown in FIG. 11, for instance, the total time within the retroactive time T during which speech was made simultaneously at the A position, which is the current sound source position, and the B position, which is the latest history position, is 1000 blocks + 1300 blocks = 2300 blocks. When this total of 2300 blocks is equal to or greater than a predetermined value, the threshold α may be set to a relatively small value. Here, a block is a unit of time.

[Third example]
The emphasis processing unit 12a of the remote conference system according to the third example selects a non-sound source emphasis target position when the frequency of switching of the sound source position exceeds a predetermined threshold, or is equal to or greater than the threshold, and a significant difference is found among the switching history numbers Na of the positions. When the switching frequency does not exceed or reach the threshold, the exchange of opinions is not very dense, so a relatively long blank time is likely to occur when the speaker changes, during which it is determined that there is no current sound source position. In that case no position is emphasized and the beginning of the next speaker's speech can be picked up well, so no non-sound source emphasis target position is selected. When no significant difference is found among the switching frequencies of the positions, most of the participants are speaking about equally often even if the exchange of opinions is relatively dense, so setting each speaker's position as a non-sound source emphasis target position would increase the amount of noise mixed in.

In the remote conference system according to the third example, the total Na" of the switching history numbers Na of the positions within the retroactive time T is used as the index of the switching frequency of the sound source position. That is, a non-sound source emphasis target position is selected when the total Na" exceeds a predetermined threshold γ, or is equal to or greater than the threshold γ, and a significant difference is found among the switching history numbers Na of the positions.

The selection condition is that the position has the largest switching history number Na in the group of positions other than the current sound source position (hereinafter referred to as the non-sound source position group). Whether a significant difference exists among the switching history numbers Na is determined on the basis of the difference between the maximum switching history number Na in the non-sound source position group and the average of the switching history numbers Na of all positions (hereinafter referred to as the significant difference determination value X). Specifically, when the significant difference determination value X is less than a predetermined value, or is equal to or less than the predetermined value, it is determined that there is no significant difference.

Table 2 below shows an example of the switching history number Na of each position within the retroactive time T.

In Table 2, the total Na" is 7. Assume that this value of 7 exceeds the threshold γ, or is equal to or greater than the threshold γ. In this case, if the current sound source position is the B position, the position with the largest switching history number Na in the non-sound source position group (the A position, the C position, and the D position) is the A position. Since the average of the switching history numbers Na of the positions is 1.75 (7/4), the significant difference determination value X, which is the difference between the switching history number Na of the A position (= 4) and the average (= 1.75), is 2.25.

Table 3 below shows a second example of the switching history number Na of each position within the retroactive time T.

Compared with Table 2, the differences among the switching history numbers Na of the positions in this second example are small, although the total Na" is 7, the same as in Table 2. Assume that this value of 7 exceeds the threshold γ, or is equal to or greater than the threshold γ. In this case, if the current sound source position is the B position, the positions with the largest switching history number Na in the non-sound source position group (the A position, the C position, and the D position) are the A position and the C position. Since the average of the switching history numbers Na of the positions is 1.75 (7/4), the significant difference determination value X, which is the difference between the switching history number Na of the A position or the C position (= 2) and the average (= 1.75), is 0.25. Because this is a relatively small value, it is likely to be less than, or equal to or less than, the predetermined value, in which case no non-sound source emphasis target position is selected.

FIG. 12 is a flowchart showing the processing flow executed by the emphasis processing unit 12a of the remote conference system according to the third example. In this flowchart, the emphasis processing unit 12a first determines whether a current sound source position exists (S1). When none exists (N in S1), it returns the processing flow to S1 without selecting a non-sound source emphasis target position (S7). As a result, none of the positions is emphasized and the sound levels of all positions are the same, so the voice of the next speaker can be picked up well.

On the other hand, when a current sound source position exists (Y in S1), the emphasis processing unit 12a obtains the switching history number Na of each position within the retroactive time T and their total Na" (S2). It then determines whether the total Na" exceeds the threshold γ (S3). When it does not (N in S3), the processing flow returns to S1 without a non-sound source emphasis target position being selected (S7), so that only the sound of the sound source position is emphasized. Because the exchange of opinions is not very active, it is likely that a certain blank time occurs after the current speaker stops speaking, that emphasis is removed from all positions during that time, and that the next speaker begins only afterwards. Therefore, even if no non-sound source emphasis target position is selected, the beginning of the next speaker's speech is unlikely to be rendered unclear.

On the other hand, when the total Na" exceeds the threshold γ (Y in S3), the emphasis processing unit 12a obtains the above-described significant difference determination value X (S4) and determines whether it is less than the predetermined value (S5). When it is not less than the predetermined value (N in S5), the numbers of times the individual persons have spoken are regarded as significantly different, the position with the largest switching history number Na in the non-sound source position group is set as the non-sound source emphasis target position (S6), and the processing flow returns to S1. When it is less than the predetermined value (Y in S5), the numbers of times the individual persons have spoken are regarded as not significantly different, and the processing flow returns to S1 without a non-sound source emphasis target position being selected (S7).
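
The flow of FIG. 12, including the significant difference determination value X, can be sketched in Python as follows. The sketch follows the rule stated earlier that a value of X below the predetermined value means no significant difference. The function names, the thresholds `gamma` and `x_min`, and the per-position counts are hypothetical; the counts are chosen only so that they agree with the totals, maxima, and averages quoted in the text, since the tables themselves are not reproduced here. At least two positions are assumed.

```python
def significance_value(na, current):
    """X = (largest Na in the non-sound source position group)
         - (average Na over all positions)."""
    group = [n for p, n in na.items() if p != current]
    return max(group) - sum(na.values()) / len(na)


def select_target(na, current, gamma, x_min):
    """Return the non-sound source emphasis target position, or None."""
    if current is None:                       # S1: N
        return None                           # S7
    if sum(na.values()) <= gamma:             # S2, S3: N
        return None                           # S7
    x = significance_value(na, current)       # S4
    if x < x_min:                             # S5: Y, no significant difference
        return None                           # S7
    group = {p: n for p, n in na.items() if p != current}
    # S6: on a tie, the first position in dictionary order is returned.
    return max(group, key=group.get)


first = {"A": 4, "B": 2, "C": 1, "D": 0}    # hypothetical, total 7
second = {"A": 2, "B": 2, "C": 2, "D": 1}   # hypothetical, total 7
print(significance_value(first, "B"))                  # 2.25
print(significance_value(second, "B"))                 # 0.25
print(select_target(first, "B", gamma=5, x_min=1.0))   # A
print(select_target(second, "B", gamma=5, x_min=1.0))  # None
```

With these assumed thresholds, the first set of counts leads to the A position being emphasized in addition to the current sound source position, while the second set leads to no additional position.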

The threshold γ may be set to either a relatively small value or a relatively large value on the basis of a predetermined criterion, in the same manner as the threshold α in the remote conference system according to the second example.

[Fourth example]
As in the third example, the emphasis processing unit 12a of the remote conference system according to the fourth example selects a non-sound source emphasis target position when the frequency of switching of the sound source position exceeds a predetermined threshold, or is equal to or greater than the threshold, and a significant difference is found among the switching history numbers Na of the positions. The total Na" of the switching history numbers Na of the positions within the retroactive time T is again used as the index of the switching frequency of the sound source position. That is, a non-sound source emphasis target position is selected when the total Na" exceeds a predetermined threshold γ, or is equal to or greater than the threshold γ, and a significant difference is found among the switching history numbers Na of the positions.

The selection condition is that, within the group of positions other than the current sound source position (hereinafter referred to as the non-sound source position group), the switching history number Na of the position exceeds a predetermined threshold, or is equal to or greater than the threshold. With this configuration as well, it is possible to suppress the deterioration in output sound quality that would be caused by continuing to emphasize, over a long time, the small noise sent from the position of a person who speaks infrequently.

This threshold may be set to either a relatively small value or a relatively large value on the basis of a predetermined criterion, in the same manner as the threshold α in the remote conference system according to the second example.

Next, a second embodiment of the remote conference system to which the present invention is applied will be described. The basic configuration of the remote conference system according to the second embodiment is the same as that of the remote conference system according to the first embodiment, so its description is omitted.

The emphasis processing unit 12a of the remote conference system according to the second embodiment stops the processing that emphasizes the sound sent from the sound source position when the frequency of switching of the sound source position within the retroactive time T exceeds a predetermined threshold, or is equal to or greater than the threshold. As a result, as shown in FIG. 13, no position is emphasized and the output intensities of the sounds of the positions are made roughly equal. With this configuration, the sound sent from a position that has just switched from a non-sound source position to the sound source position is output at the same intensity as the sounds of the other positions, which suppresses the loss of clarity at the beginning of the voice immediately after the switch.

FIG. 14 is a flowchart showing the processing flow executed by the emphasis processing unit 12a. In this flowchart, the emphasis processing unit 12a first determines whether a current sound source position exists (S1). When none exists (N in S1), the current state, in which the sound of no position is emphasized, is maintained and the processing flow loops back to S1. When one exists (Y in S1), the emphasis processing unit 12a obtains the switching history number Na of each position within the retroactive time T and their total Na" (S2). It then determines whether the total Na" exceeds the threshold γ (S3). When it does not (N in S3), the processing that emphasizes the sound from the sound source position is continued (S5), and the processing flow returns to S1. When it does (Y in S3), the emphasis of the sound from the sound source position is stopped (S4), and the processing flow returns to S1.
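
The flow of FIG. 14 reduces to a single on/off decision, which the following Python sketch illustrates. The function names, the `boost` factor, and the sample counts are hypothetical; the text does not state how strongly the sound source position is emphasized.

```python
def emphasis_enabled(current, na, gamma):
    """Return True while the sound from the current sound source position
    should be emphasized.

    na maps each position to its switching history number Na within the
    retroactive time T.
    """
    if current is None:              # S1: N, nothing is emphasized
        return False
    na_total = sum(na.values())      # S2: total Na"
    return na_total <= gamma         # S3: Y -> stop (S4), N -> continue (S5)


def output_gains(positions, current, na, gamma, boost=2.0):
    """Per-position output gain; boost is an assumed emphasis factor."""
    enabled = emphasis_enabled(current, na, gamma)
    return {p: (boost if enabled and p == current else 1.0)
            for p in positions}


na = {"A": 3, "B": 3, "C": 2}  # hypothetical counts, total 8
print(output_gains("ABC", "A", na, gamma=10))  # {'A': 2.0, 'B': 1.0, 'C': 1.0}
print(output_gains("ABC", "A", na, gamma=6))   # {'A': 1.0, 'B': 1.0, 'C': 1.0}
```

When the total exceeds γ, every position receives the same gain, which corresponds to the equalized output intensities described for FIG. 13.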

The threshold γ may be set to either a relatively small value or a relatively large value on the basis of a predetermined criterion, in the same manner as the threshold α in the remote conference system according to the second example.

What has been described above is merely an example, and each of the following aspects provides its own specific effect.
[Aspect A]
Aspect A is an audio processing device having sound acquisition means (for example, the microphone array 20) for acquiring sounds sent from each of a plurality of mutually different positions, position specifying means (for example, the position specifying unit 11) for specifying, from among the plurality of positions, a sound source position at which a sound source exists, and audio signal output means (for example, the audio signal output unit 12) for outputting an audio signal in which the sound sent from the sound source position specified by the position specifying means is emphasized, characterized in that the audio signal output means is configured such that, among the plurality of positions, a position having a history of being the sound source position within a predetermined past time (for example, the retroactive time T) is treated as a non-sound source emphasis target position, that is, a position whose sound is emphasized regardless of the result of the sound source position specification by the position specifying means.

In aspect A, among the plurality of positions, the sound sent from a position that has a history of being the sound source position within the predetermined past time is emphasized even if the position specifying means does not currently specify that position as the sound source position. As a result, for a position with such a history, the sound sent from it is emphasized and output even immediately after the position switches from a non-sound source position to the sound source position. Therefore, for any position that has a history of being the sound source position within the predetermined past time, the sound immediately after the switch from a non-sound source position to the sound source position can be kept from becoming unclear.

[Aspect B]
Aspect B is characterized in that, in Aspect A, the audio signal output means is configured so that, among the positions having a history of having been the sound source position within the predetermined time, a position satisfying the condition that the number of times it switched from not being the sound source position to being the sound source position within the predetermined time (for example, the switching history count Na) exceeds a predetermined threshold, or is equal to or greater than the threshold, is treated as the non-sound-source emphasis target position. This configuration suppresses the deterioration in output sound quality that would result from continuing to emphasize, over a long time, small noises sent from the position of a person who speaks infrequently.
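As a rough sketch of the Aspect B condition, the following Python counts, per position, how often it became the sound source within the window and keeps only positions whose count reaches a threshold. The event-list representation, the function names, and the choice of "equal to or greater than" (rather than "exceeds") are assumptions for illustration.

```python
def switch_in_counts(events, now, retro_time):
    """Count, per position, the switches into the source state within the window.

    `events` is a hypothetical time-ordered list of (time, position) pairs, one per
    identification result of the position specifying step.
    """
    counts = {}
    previous = None
    for t, pos in events:
        # A switch-in happens when the identified source differs from the previous one.
        if pos != previous and now - t <= retro_time:
            counts[pos] = counts.get(pos, 0) + 1
        previous = pos
    return counts


def targets_by_count(events, now, retro_time, threshold):
    """Positions, other than the current source, whose count is at least `threshold`."""
    counts = switch_in_counts(events, now, retro_time)
    current = events[-1][1] if events else None
    return {p for p, n in counts.items() if n >= threshold and p != current}
```

A position that spoke only once in the window falls below a threshold of 2 and is therefore not emphasized while silent.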

[Aspect C]
Aspect C is characterized in that, in Aspect A, the audio signal output means is configured as follows. Among the positions having a history of having been the sound source position within the predetermined time, consider the position that differs from the current sound source position and corresponds to the most recent history (for example, the latest history position). When the recent switching trend satisfies the condition that the sound source position alternates only between that position and the current sound source position, and the number of such switches exceeds a predetermined threshold or is equal to or greater than the threshold, the position corresponding to the most recent history is treated as the non-sound-source emphasis target position. This configuration also suppresses the deterioration in output sound quality that would result from continuing to emphasize, over a long time, small noises sent from the position of a person who speaks infrequently.
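The alternation test of Aspect C can be sketched as follows. The function name `alternating_partner` and the input format (the sequence of source positions within the window, oldest first, with consecutive duplicates already removed) are assumptions for illustration, as is the use of "equal to or greater than" for the threshold.

```python
def alternating_partner(sequence, threshold):
    """Return the latest history position if the recent trend is a two-way alternation.

    `sequence` is a hypothetical list of source positions in switching order.
    Returns None when the condition of Aspect C is not satisfied.
    """
    if len(sequence) < 2:
        return None
    current, partner = sequence[-1], sequence[-2]
    if current == partner:
        return None  # input was not deduplicated; no switch to evaluate
    run = 0
    # Walk backwards while the entries alternate current, partner, current, ...
    for i, pos in enumerate(reversed(sequence)):
        expected = current if i % 2 == 0 else partner
        if pos != expected:
            break
        run += 1
    switches = run - 1  # a run of n entries contains n - 1 switches
    return partner if switches >= threshold else None
```

With the sequence C, A, B, A, B, the trailing run A, B, A, B contains three switches, so position A qualifies when the threshold is 3.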

[Aspect D]
Aspect D is characterized in that, in Aspect A, the audio signal output means is configured so that, when the frequency of switching of the sound source position exceeds a predetermined threshold or becomes equal to or greater than the threshold, the position having the largest switching history count among the group of positions excluding the current sound source position (for example, the non-sound-source position group) is treated as the non-sound-source emphasis target position, where the switching history count is the number of times a position switched from not being the sound source position to being the sound source position within the predetermined time. This configuration also suppresses the deterioration in output sound quality that would result from continuing to emphasize, over a long time, small noises sent from the position of a person who speaks infrequently.
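A minimal sketch of the Aspect D selection is shown below. It assumes, for illustration only, that the switching frequency is measured as the total number of switches recorded within the predetermined time and that the history is available as a plain list of positions; the function name is hypothetical.

```python
from collections import Counter


def most_frequent_non_source(switch_ins, current, freq_threshold):
    """Pick the non-current position with the largest switching history count.

    `switch_ins` is a hypothetical list with one entry per switch into the source
    state within the predetermined time. Returns None when switching is infrequent
    or when no other position has any history.
    """
    if len(switch_ins) < freq_threshold:
        return None
    counts = Counter(p for p in switch_ins if p != current)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Only one position is selected here, which keeps the number of emphasized positions small even when many people are speaking.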

[Aspect E]
Aspect E is characterized in that, in Aspect A, the audio signal output means is configured so that, when the frequency of switching of the sound source position exceeds a predetermined threshold or becomes equal to or greater than the threshold, a position whose switching history count exceeds a predetermined value, or is equal to or greater than the predetermined value, is treated as the non-sound-source emphasis target position. This configuration also suppresses the deterioration in output sound quality that would result from continuing to emphasize, over a long time, small noises sent from the position of a person who speaks infrequently.
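The Aspect E rule differs from picking a single maximum in that every position reaching a minimum count qualifies. The sketch below assumes, for illustration, that the switching frequency is the total number of switches within the predetermined time and that the current source is excluded from the result because it is emphasized anyway; the names are hypothetical.

```python
from collections import Counter


def frequent_positions(switch_ins, current, freq_threshold, min_count):
    """Non-current positions whose switching history count is at least `min_count`.

    `switch_ins` is a hypothetical list with one entry per switch into the source
    state within the predetermined time. Returns an empty set when the overall
    switching frequency is below `freq_threshold`.
    """
    if len(switch_ins) < freq_threshold:
        return set()
    counts = Counter(switch_ins)
    return {p for p, n in counts.items() if n >= min_count and p != current}
```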

[Aspect F]
Aspect F is characterized in that, in Aspect D or E, the audio signal output means is configured so that, when the frequency of switching of the sound source position exceeds a predetermined threshold or becomes equal to or greater than the threshold, sound is not emphasized for any of the plurality of positions if the difference between the switching history count of the position having the largest switching history count among the group of positions excluding the current sound source position and the average of the switching history counts of all positions is below a predetermined threshold, or is equal to or less than the threshold. This configuration avoids the increase in noise that would result from treating many positions as non-sound-source emphasis target positions.
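The comparison in Aspect F can be sketched as a single predicate. The sketch assumes that the high switching frequency has already been confirmed by the caller, that counts are supplied for all positions (including zero counts), and that "equal to or less than" is the chosen comparison; the function name is hypothetical.

```python
def should_disable_emphasis(counts, current, diff_threshold):
    """Return True when no single non-current position stands out from the average.

    `counts` is a hypothetical dict mapping every position to its switching
    history count within the predetermined time.
    """
    others = [n for p, n in counts.items() if p != current]
    if not others:
        return False
    average = sum(counts.values()) / len(counts)
    # A small gap means the speakers take turns about equally often.
    return max(others) - average <= diff_threshold
```

When four positions each have a count of 3, the gap is 0 and emphasis is disabled; when one position has 6 and the average is 2, the gap of 4 keeps emphasis enabled for a threshold of 1.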

[Aspect G]
Aspect G is an audio processing device having sound acquisition means for acquiring sounds sent from each of a plurality of mutually different positions, position specifying means for specifying, from among the plurality of positions, a sound source position where a sound source exists, and audio signal output means for outputting an audio signal in which the sound sent from the sound source position specified by the position specifying means is emphasized, characterized in that the audio signal output means is configured to stop the processing that emphasizes the sound sent from the sound source position specified by the position specifying means when the frequency with which the sound source position switches among two or more of the positions within a predetermined past period exceeds a predetermined threshold or becomes equal to or greater than the threshold. In this configuration, when opinions are being actively exchanged, the sound of no position is emphasized, and the output intensities of the sounds from the respective positions are made approximately equal to one another. As a result, the sound sent from a position that has just switched from a non-sound-source position to the sound source position is output at the same intensity as the sounds from the other positions, so the beginning of the speech immediately after the switch is kept from becoming unclear.
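The behavior of Aspect G can be sketched as a gain table that falls back to uniform gains when switching is frequent. The function name, the gain values (1.0 for unemphasized, a `boost` factor for the emphasized source), and the measurement of frequency as a count of switch timestamps within the period are all assumptions for illustration.

```python
def position_gains(positions, current, switch_times, now, period,
                   freq_threshold, boost=2.0):
    """Return a hypothetical per-position gain table.

    `switch_times` holds the times at which the sound source position switched.
    If the number of switches within the last `period` seconds reaches
    `freq_threshold`, emphasis is stopped and every position gets the same gain.
    """
    recent = sum(1 for t in switch_times if 0 <= now - t <= period)
    if recent >= freq_threshold:
        return {p: 1.0 for p in positions}
    return {p: (boost if p == current else 1.0) for p in positions}
```

With uniform gains, a new speaker is never quieter than the previous one, which is the effect described above for lively exchanges.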

[Aspect H]
Aspect H is an audio/video output device that outputs, to the outside, a video signal processed by a video processing device and an audio signal processed by an audio processing device, characterized in that the audio processing device of any one of Aspects A to G is used as the audio processing device.

[Aspect I]
Aspect I is characterized in that, in Aspect H, communication means for communicating with an external device via a communication line is provided.

[Aspect J]
Aspect J is a remote conference system in which sound and video are communicated via a communication line among a plurality of audio/video output devices installed at mutually different locations so that the people at the respective locations can hold a conference, characterized in that the audio/video output device of Aspect I is used as at least one of the plurality of audio/video output devices.

1: CPU (part of the audio processing device and the video processing device)
2: Memory (part of the audio processing device and the video processing device)
3: Video processing unit (part of the video processing device)
6: Image sensor I/F (part of the video processing device)
8: Audio input/output I/F (part of the audio processing device)
10: Audio processing unit (part of the audio processing device)
11: Position specifying unit (position specifying means)
12: Audio signal output unit (audio signal output means)
12a: Emphasis processing unit
12b: Synthesis unit
20: Microphone array (part of the audio processing device, sound acquisition means)

Japanese Patent No. 4760160

Claims

1. An audio processing device comprising: sound acquisition means for acquiring sounds sent from each of a plurality of mutually different positions; position specifying means for specifying, from among the plurality of positions, a sound source position where a sound source exists; and audio signal output means for outputting an audio signal in which the sound sent from the sound source position specified by the position specifying means is emphasized,
wherein the audio signal output means is configured so that, among the plurality of positions, a position having a history of having been the sound source position within a predetermined past time is treated as a non-sound-source emphasis target position, that is, a position whose sound is emphasized regardless of the sound source position specifying result of the position specifying means.

2. The audio processing device according to claim 1,
wherein the audio signal output means is configured so that, among the positions having a history of having been the sound source position within the predetermined time, a position satisfying the condition that the number of times it switched from not being the sound source position to being the sound source position within the predetermined time exceeds a predetermined threshold, or is equal to or greater than the threshold, is treated as the non-sound-source emphasis target position.

3. The audio processing device according to claim 1,
wherein the audio signal output means is configured so that, when the recent switching trend of the sound source position satisfies the condition that the sound source position alternates only between the current sound source position and the position that, among the positions having a history of having been the sound source position within the predetermined time, differs from the current sound source position and corresponds to the most recent history, and that the number of such switches exceeds a predetermined threshold or is equal to or greater than the threshold, the position corresponding to the most recent history is treated as the non-sound-source emphasis target position.

4. The audio processing device according to claim 1,
wherein the audio signal output means is configured so that, when the frequency of switching of the sound source position exceeds a predetermined threshold or becomes equal to or greater than the threshold, the position having the largest switching history count among the group of positions excluding the current sound source position is treated as the non-sound-source emphasis target position, the switching history count being the number of times a position switched from not being the sound source position to being the sound source position within the predetermined time.

5. The audio processing device according to claim 1,
wherein the audio signal output means is configured so that, when the frequency of switching of the sound source position exceeds a predetermined threshold or becomes equal to or greater than the threshold, a position whose switching history count exceeds a predetermined value, or is equal to or greater than the predetermined value, is treated as the non-sound-source emphasis target position.

6. The audio processing device according to claim 4 or 5,
wherein the audio signal output means is configured so that, when the frequency of switching of the sound source position exceeds a predetermined threshold or becomes equal to or greater than the threshold, sound is not emphasized for any of the plurality of positions if the difference between the switching history count of the position having the largest switching history count among the group of positions excluding the current sound source position and the average of the switching history counts of all positions is below a predetermined threshold, or is equal to or less than the threshold.

7. An audio processing device comprising: sound acquisition means for acquiring sounds sent from each of a plurality of mutually different positions; position specifying means for specifying, from among the plurality of positions, a sound source position where a sound source exists; and audio signal output means for outputting an audio signal in which the sound sent from the sound source position specified by the position specifying means is emphasized,
wherein the audio signal output means is configured to stop the processing that emphasizes the sound sent from the sound source position specified by the position specifying means when the frequency with which the sound source position switches among two or more of the positions within a predetermined past period exceeds a predetermined threshold or becomes equal to or greater than the threshold.

8. An audio/video output device that outputs, to the outside, a video signal processed by a video processing device and an audio signal processed by an audio processing device,
wherein the audio processing device according to claim 1 is used as the audio processing device.

9. The audio/video output device according to claim 8,
further comprising communication means for communicating with an external device via a communication line.

10. A remote conference system in which sound and video are communicated via a communication line among a plurality of audio/video output devices installed at mutually different locations so that the people at the respective locations can hold a conference,
wherein the audio/video output device according to claim 9 is used as at least one of the plurality of audio/video output devices.