JPH0449757A

JPH0449757A - Conference speech terminal equipment

Info

Publication number: JPH0449757A
Application number: JP2160492A
Authority: JP
Inventors: Masaharu Shimada; 正治島田; Shinji Hayashi; 伸二林
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1990-06-18
Filing date: 1990-06-18
Publication date: 1992-02-19
Anticipated expiration: 2012-10-15
Also published as: JP2662825B2

Abstract

PURPOSE: To maintain good speech quality even when the state changes from one single-talker state to another single-talker state, by providing spread imparting means that imparts a sense of sound image spread between the sound image localizations of the two talkers. CONSTITUTION: The spread imparting means comprises an adder 9, a delay circuit (DEL) 11, a high-pass filter (HPF) 12, a switching circuit 6b, an adder 10, and a subtractor 13. When the state changes from one single-talker state to another, the spread imparting means treats the interval within a fixed time after the end of the received control signal indicating the first transmission source as a multi-talker state in which two talkers exist simultaneously, and imparts a sense of sound image spread between the sound image localizations of the two talkers. This makes it possible to create, through sound image localization, the atmosphere of a natural meeting in which the talkers are gathered together.

Description

DETAILED DESCRIPTION OF THE INVENTION [Industrial field of application] The present invention is used in conference call terminal devices installed at a plurality of call terminals interconnected by an audio line for conference calls.

In particular, the present invention relates to a conference call terminal device that, when each talker terminal identifies the transmission source among a plurality of talkers, prevents the sense of unnaturalness caused by the change in sound image localization when the state transitions from one single-talker state to another single-talker state.

[Conventional technology]

When talking with many sites, as in a telephone conference, or when several people gather in each of two conference rooms and the two sites converse with each other, an ordinary handset telephone has the drawback that listeners cannot tell who is speaking, which causes confusion.

One way to resolve this difficulty is binaural listening through headphones; that is, methods have been proposed that use headphones to localize sound images outside the head. An example is described in "Spatial Acoustics" by Blauert, Morimoto, and Goto, published by Kajima Publishing. In this method, the head-diffraction impulse responses $H_L$ and $H_R$ are convolved in advance with the inverse characteristics $R_L$ and $R_R$ of the receiver-to-ear-canal-entrance impulse responses, and the resulting total impulse responses $h_L$ (for the left ear) and $h_R$ (for the right ear) are convolved with the signal to form the inputs to the binaural receivers. For each of the left and right inputs, the total impulse response $h$ can be expressed by the convolution $h = H * R$.

FIG. 5 is an explanatory diagram of the out-of-head sound image localization technique. Reference numeral 18 denotes the listener, 17 the binaural receivers, and 21, 22, 23, 24, and 25 the sound image localizations of talkers A, B, C, D, and E, respectively.

With reference to FIG. 5, the following describes how sound image localizations are generated for a conversation among six sites or six talkers (A, B, C, D, E, F), including the listener. Here, sound image localization means the position from which a voice appears to originate.

The sound image localizations 21, 22, 23, 24, and 25 of talkers A, B, C, D, and E are placed at equal divisions in front of the listener. In video conferences and multi-site audio conferences, connections among six sites on average are said to be the most common, so a listener has five other parties, or five connected sites. As the number n of positions increases, the rate at which the talker is correctly identified from the sound image position decreases. To simplify the discussion, five positions are used here: 90 degrees left, 45 degrees left, center, 45 degrees right, and 90 degrees right.
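The equal division of the frontal arc can be sketched as follows. This is a minimal illustration, not part of the patent: the function name `talker_azimuths`, the sign convention (negative angles to the left), and the assumption that talkers A to E are ordered from left to right are all hypothetical.

```python
def talker_azimuths(talkers, left_deg=-90.0, right_deg=90.0):
    """Spread talkers evenly over the frontal arc.

    Negative angles are to the listener's left (assumed convention).
    """
    n = len(talkers)
    if n == 1:
        return {talkers[0]: 0.0}
    step = (right_deg - left_deg) / (n - 1)
    return {name: left_deg + i * step for i, name in enumerate(talkers)}


# Five remote talkers, assumed to be ordered from left to right.
positions = talker_azimuths(["A", "B", "C", "D", "E"])
# {'A': -90.0, 'B': -45.0, 'C': 0.0, 'D': 45.0, 'E': 90.0}
```

With more talkers the step between neighbouring positions shrinks, which is consistent with the remark that identification from position becomes harder as n grows.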

First, the impulse responses from the desired sound image position to the listener's outer ears are measured. FIG. 5 shows the case in which talker F is the listener 18 and the sound image is localized at the position of talker C; the impulse responses in this case are $H_{CL}$ and $H_{CR}$. Impulse responses have conventionally been measured in an anechoic room. However, according to the literature (Hayashi, "Sound field reproduction using headphones and its subjective evaluation," Acoustical Society of Japan, Technical Committee on Hearing, H-89-19, 1989), a soundproof room in which the walls behind and beside the subject and the ceiling are treated with sound-absorbing material so that they produce no strong reflections is reported to be superior to an anechoic room for producing a natural out-of-head sensation, because it sounds more natural and better approximates a real sound field. Broadband noise is radiated from a loudspeaker, and a probe-tube microphone or a small microphone of about 1/8 inch is placed at the entrance of each of the subject's left and right ear canals. The signals at these three points are converted from analog to digital simultaneously, and the impulse responses can then be calculated by the cross-spectral method.
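The cross-spectral estimate mentioned above can be illustrated with a small sketch. It assumes an idealized setup that the patent does not specify: the noise is split into segments, each recorded segment is the circular convolution of the excitation with the unknown response, and a plain DFT is used instead of an FFT for brevity. All function names are hypothetical.

```python
import cmath
import random


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def circular_convolve(x, h):
    n = len(x)
    return [sum(h[j] * x[(t - j) % n] for j in range(len(h))) for t in range(n)]


def estimate_impulse_response(x_segments, y_segments):
    """Average cross- and auto-spectra over segments, then H = S_xy / S_xx."""
    n = len(x_segments[0])
    s_xy = [0j] * n
    s_xx = [0.0] * n
    for x, y in zip(x_segments, y_segments):
        X, Y = dft(x), dft(y)
        for k in range(n):
            s_xy[k] += X[k].conjugate() * Y[k]
            s_xx[k] += abs(X[k]) ** 2
    H = [s_xy[k] / s_xx[k] for k in range(n)]
    return [v.real for v in idft(H)]


random.seed(0)
true_h = [0.0, 1.0, 0.5, -0.25]           # made-up response, for illustration only
noise = [[random.gauss(0.0, 1.0) for _ in range(32)] for _ in range(4)]
recorded = [circular_convolve(x, true_h) for x in noise]
estimate = estimate_impulse_response(noise, recorded)
```

In a real measurement the recorded signals also contain uncorrelated noise, and averaging over many segments is what makes the ratio of spectra robust; the noiseless example here only shows the arithmetic.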

To obtain the inverse characteristics $R_L$ and $R_R$ of the headset, the sound pressure in the ear canal is measured in response to the electrical input of the headset's electroacoustic transducer. Broadband noise is applied as the electrical input to the headset, which is worn on the subject's ears; a hole is made in the ear pad of the headset receiver, and a probe tube or 1/8-inch microphone is inserted through it to pick up the sound pressure waveform at the ear canal entrance. The electrical input signal and the electrical signal converted from the ear canal sound pressure are converted from analog to digital, and the inverse filter characteristics are obtained from them. This is well known as the inverse filter design method based on the time-domain least-squares error.
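A time-domain least-squares inverse filter can be sketched as below. This is one common formulation and is only assumed to correspond to the method the text calls well known: it finds the filter `g` of a given length that minimizes the squared error between `r * g` and a (possibly delayed) unit impulse by solving the normal equations. The two-tap headset response is made up for illustration.

```python
def convolve(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def least_squares_inverse(r, taps, delay=0):
    """Return g minimizing ||r * g - delta(delay)||^2 over filters of length `taps`."""
    rows = len(r) + taps - 1
    # Convolution matrix: C[i][j] = r[i - j]
    C = [[r[i - j] if 0 <= i - j < len(r) else 0.0 for j in range(taps)]
         for i in range(rows)]
    d = [1.0 if i == delay else 0.0 for i in range(rows)]
    CtC = [[sum(C[i][p] * C[i][q] for i in range(rows)) for q in range(taps)]
           for p in range(taps)]
    Ctd = [sum(C[i][p] * d[i] for i in range(rows)) for p in range(taps)]
    return solve(CtC, Ctd)


headset = [1.0, 0.5]                       # hypothetical receiver-to-ear response
inverse = least_squares_inverse(headset, taps=16)
equalized = convolve(headset, inverse)     # should be close to a unit impulse
```

For responses that are not minimum phase, a nonzero `delay` usually gives a much smaller residual error.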

Accordingly, the total impulse response is obtained as the convolution $h = R * H$, which is the same quantity as $h = H * R$ because convolution is commutative.

Now, let $P$ be the received audio signal and $Q$ the sound pressure signal input to the listener's outer ear from the receiver. The audio signal $P$ can then be converted into the outer-ear input $Q$ by the convolution $Q = P * h$.
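The two convolutions, $h = H * R$ per ear and $Q = P * h$, can be written out directly. The sketch below is illustrative only: `binaural_render` is a hypothetical name, and the short impulse responses are made-up numbers rather than measured data.

```python
def convolve(a, b):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def binaural_render(p, H_left, H_right, R_left, R_right):
    """Return (Q_left, Q_right) for a received mono signal p."""
    h_left = convolve(H_left, R_left)      # h_L = H_L * R_L
    h_right = convolve(H_right, R_right)   # h_R = H_R * R_R
    return convolve(p, h_left), convolve(p, h_right)  # Q = P * h


p = [1.0, 0.0, -1.0]
q_left, q_right = binaural_render(
    p,
    H_left=[1.0, 0.5], H_right=[0.0, 0.8],   # toy head-diffraction responses
    R_left=[1.0], R_right=[1.0, -0.2],       # toy headset inverse filters
)
```

Because $h$ depends only on the sound image position and the headset, it can be computed once per position and stored, leaving a single convolution per ear at run time.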

As a result of the above digital processing, the sound that the listener hears from the receivers has the same waveform as the sound that would be heard in the real space for which it was computed in advance. As a psychoacoustic response, the listener therefore has the sensation of talking with the other party at a distance of about 1.5 m in front of them, and perceives a natural conversation environment as if actually in that space. The out-of-head sensation is part of this natural impression. The listener can thus enjoy long calls comfortably, without the sense of pressure or the auditory and psychological fatigue that usually accompany headset calls.

As a means of conveying control information that indicates the transmission source together with the transmitted audio signal, there is the transmission source control information encoding method described in CCITT (International Telegraph and Telephone Consultative Committee) Recommendation G.722 and in a paper (Shimada and Suzuki, "A site-identifying sound image generation method for multi-site audio conference communication systems," Transactions of the Institute of Electronics, Information and Communication Engineers, vol. J70-B, No. 9, 1987). That paper presents a sound image localization theory for a room space using multiple loudspeakers.

In a conference call system in which the audio signals of the talkers are summed and transmitted over the audio line, and in which several talkers may speak at the same time, the presence or absence of each talker's audio can be determined from the transmission source control information signal, but once the audio signals have been summed they cannot be separated into the individual talkers' signals. Even if separation were possible, the technology for matching the feature parameters of a separated signal to a talker's voice (speaker recognition) and generating a sound image at that talker's pre-assigned position has not yet reached a practical level.

Furthermore, in the method of localizing sound images outside the head with headphones, when several talkers speak simultaneously the sound image moves to the center of the corresponding localizations, which feels unnatural to the listener. The paper by Shimada et al. discusses only how to control multiple sound sources, that is, multiple loudspeakers allocated in advance according to the transmission source control signal; it makes no proposal for generating sound images from only the left and right receiver drive signals, as with a headset. Focusing on this point, one of the present inventors has already proposed a method characterized by imparting a sense of sound image spread between the sound image localizations when several talkers speak simultaneously (Shimada, "Conference call terminal device," Japanese Patent Application No. Hei 2-112341, filed April 27, 1990).

[Problem to be solved by the invention]

However, when the state changes abruptly from one single-talker state to another single-talker state, the impulse response $h$ is being convolved with the audio signal, so switching at that moment can introduce noise into the received sound or make the perceived movement of the sound image unnatural, and call quality may be degraded.

An object of the present invention is to eliminate this drawback and provide a headset conference call device that maintains good call quality even when the state transitions from one single-talker state to another.

[Means for solving the problem]

The present invention is a conference call terminal device comprising: sound image generating means for generating, by two electroacoustic converting means corresponding to a person's two ears, a sound image for an audio signal received from an audio line for conference calls; and sound image localization means for localizing the sound image generated by the sound image generating means according to a control signal received together with the audio signal from the audio line. The device is characterized in that the sound image localization means includes spread imparting means which, when the state transitions from one single-talker state to another single-talker state, treats the interval within a fixed time after the end of the received control signal indicating the first transmission source as a multi-talker state in which two talkers exist simultaneously, and imparts a sense of sound image spread between the sound image localizations of the two talkers.

[Operation]

When the state transitions from one single-talker state to another, the spread imparting means treats the interval within a fixed time after the end of the received control signal indicating the first transmission source as a multi-talker state in which two talkers exist simultaneously, and imparts a sense of sound image spread between the sound image localizations of the two talkers.

Therefore, even when the state changes abruptly from one single-talker state to another, the sound image localization can create a more natural atmosphere, as if the talkers were gathered together in an ordinary meeting.

[Example]

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a block diagram showing one embodiment of the present invention.

As shown in FIG. 1, this embodiment is a conference call terminal device comprising sound image generating means and sound image localization means. The sound image generating means, which generates a sound image for the audio signal received from the audio line for conference calls by means of binaural receivers 17 serving as the two electroacoustic converting means corresponding to a person's two ears, consists of a digital-to-analog converter (D/A) 14, two low-pass filters (LPF) 15, and two amplifiers (AMP) 16. The sound image localization means, which localizes the sound image generated by the sound image generating means according to the transmission source control signal 32 received together with the audio signal from the audio line, consists of a signal separation circuit 1, a switching control circuit 2, a linear encoding circuit 3, switching circuits (1) 4a and 4b, a memory circuit 5, a switching circuit (2) 6a, and adders 7 and 8.

As the characteristic feature of the present invention, the sound image localization means further includes spread imparting means consisting of an adder 9, a delay circuit (DEL) 11, a high-pass filter (HPF) 12, a switching circuit 6b, an adder 10, and a subtractor 13. When the state transitions from one single-talker state to another single-talker state, this means treats the interval within a fixed time after the end of the received control signal indicating the first transmission source as a multi-talker state in which two talkers exist simultaneously, and imparts a sense of sound image spread between the sound image localizations of the two talkers.
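The text names the parts of the spread imparting means but not how they are wired together. The sketch below shows one plausible arrangement, stated purely as an assumption: the left and right signals are summed (adder 9), delayed (DEL 11), high-pass filtered (HPF 12), and the result is added to the left channel (adder 10) and subtracted from the right channel (subtractor 13) while the switch is on. The filter coefficient and delay length are arbitrary illustrative values.

```python
def high_pass(x, alpha=0.9):
    """First-order high-pass filter: y[n] = alpha * (y[n-1] + x[n] - x[n-1])."""
    y, prev_x, prev_y = [], 0.0, 0.0
    for v in x:
        prev_y = alpha * (prev_y + v - prev_x)
        prev_x = v
        y.append(prev_y)
    return y


def delay(x, samples):
    """Delay a sequence by `samples`, keeping the original length."""
    return ([0.0] * samples + list(x))[:len(x)]


def impart_spread(left, right, spread_on, delay_samples=8, alpha=0.9):
    """Assumed wiring of adder 9, DEL 11, HPF 12, adder 10 and subtractor 13."""
    if not spread_on:                      # switch off: signals pass unchanged
        return list(left), list(right)
    mixed = [l + r for l, r in zip(left, right)]            # adder 9
    side = high_pass(delay(mixed, delay_samples), alpha)    # DEL 11 -> HPF 12
    out_left = [l + s for l, s in zip(left, side)]          # adder 10
    out_right = [r - s for r, s in zip(right, side)]        # subtractor 13
    return out_left, out_right


left = [1.0] + [0.0] * 15
right = [0.0] * 16
wide_left, wide_right = impart_spread(left, right, spread_on=True)
```

Adding a component to one ear and subtracting the same component from the other lowers the correlation between the two ear signals, which is a standard way to make a sound image seem wider; in this arrangement the sum of the two channels is unchanged.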

FIG. 2 is an explanatory diagram of the frame format of the transmission source control signal and shows how the 64 kbps channel is subdivided under CCITT Recommendation H.221. The table shows an example code format for the sound image localization control signal.

The basic service of ISDN (Integrated Services Digital Network) provides 2B+D channels: two information channels (B) at 64 kbps and one information control channel (D) at 16 kbps. One method uses a single 64 kbps digital link provided between subscribers, divides its bit rate, and delivers each talker's control information signal and audio signal to the listener simultaneously; this is done using the H.221 (G.722) frame format recommended by CCITT. The method applies to communication systems that can transmit not only audio but also digital image signals and data signals simultaneously over a single 64 kbps digital link. The 64 kbps channel is organized into octets (units of 8 bits) transmitted at 8 kHz; the eighth bit of each octet is called the service channel, which is further divided into three parts as shown in FIG. 2. The frame alignment signal (FAS) forms a multiframe in which a basic 8-bit frame pattern is generated in alternate frames of the service channel. The bit allocation signal (BAS) specifies the various communication modes. The application channel (AC) has 64 bits per frame and can be used for data service communication at transmission rates of 6.4 kbps or less.
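The rates in this framing follow from simple arithmetic, sketched below. The inputs are the 8 kHz octet rate, one service-channel bit per octet, and an application channel of 64 bits per frame running at 6.4 kbps; the frame length of 80 octets is derived from those figures. The helper `split_octet` assumes that the eighth bit is the least significant bit of the octet, which the text does not state.

```python
OCTET_RATE_HZ = 8000        # octets per second on the channel
BITS_PER_OCTET = 8
AC_BITS_PER_FRAME = 64      # application channel bits in one frame
AC_RATE_BPS = 6400          # application channel rate

channel_rate_bps = OCTET_RATE_HZ * BITS_PER_OCTET                # 64 kbps
service_channel_rate_bps = OCTET_RATE_HZ * 1                     # one bit per octet
payload_rate_bps = channel_rate_bps - service_channel_rate_bps   # left for audio
frames_per_second = AC_RATE_BPS // AC_BITS_PER_FRAME
octets_per_frame = OCTET_RATE_HZ // frames_per_second
# Service-channel bits per frame that remain for FAS and BAS.
fas_bas_bits_per_frame = octets_per_frame - AC_BITS_PER_FRAME


def split_octet(octet):
    """Return (seven payload bits, service-channel bit).

    Assumes bit 8 of the octet is the least significant bit.
    """
    return octet >> 1, octet & 1
```

The split gives 56 kbps for the audio payload and 8 kbps for the service channel, matching the 56 kbps plus 8 kbps division used for the audio and control signals.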

Table: Code format example

As an implementation example of this method, the paper by Shimada et al. uses the code format shown in the table for the transmission source control information signal. The transmission source control information signals corresponding to talkers A, B, C, D, E, and F are given in order as "100000", "010000", "001000", "000100", and "000010".

The paper states that these codes are desirable because the transmission source control information signals form mutually orthogonal code sequences, which simplifies the logical processing even when there are several simultaneous talkers.

In the teleconference communication system, the audio signal of each active talker is transmitted together with the corresponding transmission source control information signal. The node device that interconnects the sites performs logical processing on the control information signals and delivers the audio signal and the transmission source control information signal to the listeners. On the receiving side, the audio is reproduced at the sound image position, among positions divided in advance by direction, that corresponds to the transmission source control information signal, so that sound images are generated as if the conference participants were seated around a single table. When there are several simultaneous talkers, the transmission source control signal contains two or more logic "1" bits.
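Because each code has a single "1" in its own position, the codes of simultaneous talkers can be merged and later taken apart without ambiguity. The sketch below assumes that the node's "logical processing" is a bitwise OR, which the text does not spell out; the function names are hypothetical, and only the five codes listed in the text are included.

```python
SOURCE_CODES = {
    "A": "100000",
    "B": "010000",
    "C": "001000",
    "D": "000100",
    "E": "000010",
}


def combine(codes, width=6):
    """Merge the codes of all active talkers (assumed to be a bitwise OR)."""
    value = 0
    for code in codes:
        value |= int(code, 2)
    return format(value, "0{}b".format(width))


def active_talkers(code):
    """List the talkers whose bit is set in a received control code."""
    return [name for name, c in SOURCE_CODES.items() if int(c, 2) & int(code, 2)]


def are_orthogonal(codes):
    """True if no two codes share a '1' position."""
    values = [int(c, 2) for c in codes]
    return all(values[i] & values[j] == 0
               for i in range(len(values)) for j in range(i + 1, len(values)))


merged = combine([SOURCE_CODES["A"], SOURCE_CODES["B"]])   # "110000"
talkers = active_talkers(merged)                           # ["A", "B"]
```

A merged code with two or more "1" bits is exactly the multi-talker indication described above.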

Before describing the specific operation of this embodiment, its basic operation is explained using the conference call state transition diagram shown in FIG. 3.

As shown in FIG. 3, a conference call is generally in one of three states: (1) the single-talker state (one person is speaking; this state normally accounts for 80 to 90% of the whole conference call), (2) the multi-talker state (including back-channel responses and laughter from several people), and (3) the all-silent state (no one is speaking). The arrows in the figure show the transitions among them. Together these make up the call states of a conference.
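With one bit per talker in the control code, the three call states can be told apart by counting the "1" bits. This is a minimal sketch; the state names are hypothetical, and treating an all-zero code as the all-silent state is an assumption.

```python
def call_state(code):
    """Classify a control code by the number of active talkers it marks."""
    ones = code.count("1")
    if ones == 0:
        return "all_silent"       # assumed: no bit set means nobody is speaking
    if ones == 1:
        return "single_talker"
    return "multi_talker"


states = [call_state(c) for c in ["100000", "110000", "000000", "010000"]]
```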

In the multi-talker state, various sound image localization methods are conceivable, for example imparting a sense of sound image spread, as already described above. In the all-silent state, the sound pressure fed to the headphone receivers is not set to zero; instead, the room noise and line noise summed from the talkers are passed through as they are. The problematic case is therefore an abrupt transition from a single talker to another single talker.

The basic idea of the present invention is that, when the transmission source control signal changes from a first talker to a second talker, the end of the first talker's state is extended by a fixed time, and if this extension overlaps the start of the second talker, a sense of sound image spread is imparted (the dotted-line transition in FIG. 3).

The value of this fixed time is determined from, for example, the time required for the convolution operation described above and the unnaturalness perceived by the listener.

This embodiment is built on the basic idea above. While an audio signal is being transmitted from one talker terminal, sound images whose apparent positions differ for each remote talker are generated according to the received control signal. The method enables teleconferencing by connecting different sites and multiple talkers; generating the sound image at a different position for each talker makes talker identification easy, and even when the talker changes abruptly the conference feels natural, as if the participants were seated around a single table.

Next, the specific operation of this embodiment is described with reference to the timing chart shown in FIG. 4.

FIG. 4 shows the operation of switching circuits (1) and (2).

In FIG. 3, the audio signal plus transmission source control information 31 arrives in a frame format conforming to CCITT Recommendation G.722 or H.221 mentioned above, with both transmitted simultaneously.

That is, within the 64 kbps digital transmission path, an information bit rate of 56 kbps and a control information rate of 8 kbps are carried separately, and the control signal described above is sent using this 8 kbps. The signal separation circuit 1 separates the two signals, and the linear encoding circuit 3 converts the audio signal into a digital linear code so that digital signal processing can be applied. Meanwhile, the switching control circuit 2 determines from the transmission source control signal 32 whether the audio just received comes from a single talker or from several talkers. Next, according to the command from the switching control circuit 2, switching circuits (1) 4a and 4b select the impulse response $h$ used to generate the sound image localization of the talker corresponding to the audio signal. This impulse response information ($h$) is prepared in order to generate sound image localization outside the head: the memory circuit 5 stores the impulse response information measured in advance for each sound image localization (the sound images of talkers A, B, C, D, and E in FIG. 5), and switching circuits (1) 4a and 4b select from it. FIG. 1 shows the state of switching circuits (1) 4a and 4b during a transition from talker A to talker B.

In the single-talker case (one talker), switching circuits (2) 6a and 6b are "off". The signal passes through adders 7, 8, and 10 and subtractor 13, is converted to an analog signal by the digital-to-analog converter 14, and is fed to the low-pass filters 15, which remove quantization noise. The result passes through the amplifiers 16, which set the optimum volume for the binaural receivers 17 of the headphone set, and the processed signal with out-of-head sound image localization is delivered to the binaural receivers 17 of the headphone set worn by the listener 18.

Next, an example is described in which the state changes abruptly from the single-talker state of talker A to the single-talker state of talker B.

In FIG. 4, the transmission source control signals 32 corresponding to talkers A, B, C, D, E, and F are "100000", "010000", "001000", "000100", and "000010" in order, so the transmission source control signal 32 is initially "100000" and then changes to "010000". As described above, switching circuits (1) 4a and 4b are at this time connected to terminal A, so the impulse responses $h_{AR}$ and $h_{AL}$ are selected and convolved with the input audio signal. However, as shown in FIG. 4, if another talker, here talker B, begins speaking after a delay of only one frame, the computation delay and the movement of the sound image localization produce a sense of unnaturalness.

To prevent this, talker A is regarded as remaining in the talker state from the moment talker A finishes until a fixed time has elapsed. If talker B begins speaking within this fixed time, the situation is interpreted as a multi-talker state in which talkers A and B are speaking simultaneously, and the computation that imparts a sense of sound image spread is performed.
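The hold rule described above can be sketched frame by frame. This is only an illustration: it assumes the control signal has already been decoded into a set of active talkers per frame, and the hold length of three frames is an arbitrary value, since the text leaves the fixed time to be chosen from the convolution time and listening tests.

```python
def apply_hold(frames, hold_frames):
    """Extend each talker's state for `hold_frames` frames after they stop.

    frames: list of sets, the talkers marked active in each received frame.
    Returns the list of effective talker sets after the hold is applied.
    """
    last_active = {}
    effective = []
    for t, active in enumerate(frames):
        for name in active:
            last_active[name] = t
        held = {name for name, last in last_active.items()
                if 0 < t - last <= hold_frames}
        effective.append(set(active) | held)
    return effective


# Talker A stops, one silent frame follows, then talker B starts.
received = [{"A"}, {"A"}, set(), {"B"}, {"B"}, {"B"}, {"B"}]
effective = apply_hold(received, hold_frames=3)
```

In this example the frames in which B is active and A is still held contain two talkers, so they are handled as a multi-talker state instead of as an abrupt switch from A to B.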

Specifically, switching circuits (1) 4a and 4b turn terminal B "on" at the midpoint of the fixed time, and at the same moment switching circuits (2) 6a and 6b are turned "on". Switching circuits (2) 6a and 6b are then turned "off" when the fixed time ends.

By using the per-source information extracted from the transmission source control signal 32 in this way to create a state in which each talker is held for the fixed time, unnatural movement of the sound image localization and noise caused by interrupting the computation midway can be prevented.
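The switch timing during the hold interval can be expressed as a function of the time elapsed since the first talker's control signal ended. The sketch below is a hypothetical reading of that sequence: switching circuits (1) move to the new talker's terminal at the midpoint, switching circuits (2) are on from the midpoint until the hold time ends, and the choice of which boundary instants count as "on" is an assumption.

```python
def switch_states(t, hold_time, previous="A", new="B"):
    """Return (terminal selected by circuits (1), whether circuits (2) are on).

    t: time since the previous talker's control signal ended (same unit as
    hold_time). Boundary handling is assumed, not taken from the text.
    """
    half = hold_time / 2
    selected = previous if t < half else new
    spread_on = half <= t < hold_time
    return selected, spread_on


# Hold time of 10 time units, sampled at a few instants.
timeline = [switch_states(t, 10) for t in (0, 4, 5, 9, 10)]
```

Delaying the impulse response change to the midpoint lets the convolution with the first talker's response finish before the second talker's response is selected, while the spread processing covers the changeover.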

[Effect of the invention]

As described above, even when a single talker in a conference call changes abruptly to another single talker, the present invention holds the state of the previous single talker for a fixed time. The sound image localization therefore creates a more natural atmosphere, as if the talkers were gathered together in an ordinary meeting, and good call quality can be maintained.

[Brief explanation of the drawing]

FIG. 1 is a block diagram showing one embodiment of the present invention. FIG. 2 is an explanatory diagram showing the frame format of its transmission source control signal. FIG. 3 is its conference call state transition diagram. FIG. 4 is a timing chart showing its operation. FIG. 5 is an explanatory diagram showing the method of out-of-head sound image localization.

1: signal separation circuit; 2: switching control circuit; 3: linear encoding circuit; 4a, 4b: switching circuits (1); 5: memory circuit; 6a, 6b: switching circuits (2); 7 to 10: adder circuits; 11: delay circuit (DEL); 12: high-pass filter (HPF); 13: subtractor circuit; 14: digital-to-analog converter (D/A); 15: low-pass filter (LPF); 16: amplifier (AMP); 17: binaural receivers; 18: listener; 21 to 25: sound image positions; 31: audio plus transmission source control signal; 32: transmission source control signal.

Patent applicant: Nippon Telegraph and Telephone Corporation. Agent: Patent attorney Naotaka Ide (井出直孝).

Claims

[Claims] 1. A conference call terminal device comprising: sound image generating means for generating, by two electroacoustic converting means corresponding to a person's two ears, a sound image for an audio signal received from an audio line for conference calls; and sound image localization means for localizing the sound image generated by the sound image generating means according to a control signal received together with the audio signal from the audio line, characterized in that the sound image localization means includes spread imparting means which, when the state transitions from one single-talker state to another single-talker state, treats the interval within a fixed time after the end of the received control signal indicating the first transmission source as a multi-talker state in which two talkers exist simultaneously, and imparts a sense of sound image spread between the sound image localizations of the two talkers.