JP4914319B2

JP4914319B2 - COMMUNICATION VOICE PROCESSING METHOD, DEVICE THEREOF, AND PROGRAM THEREOF

Info

Publication number: JP4914319B2
Application number: JP2007241378A
Authority: JP
Inventors: 賢一野口; 陽一羽田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-09-18
Filing date: 2007-09-18
Publication date: 2012-04-11
Anticipated expiration: 2027-09-18
Also published as: JP2009075160A

Description

The present invention relates to an audio processing method, an apparatus therefor, and a program therefor, for use in video communication and audio communication that connect remote locations over a communication link.

In recent years, with the development of communication networks, communication environments that connect remote locations over always-on lines have become increasingly available. At home, such always-on environments are used for watch-over services or for communication between family members living apart. In the workplace, they are used for collaborative work between remote sites and for teleworking from home. In an always-on environment, however, the situation at both ends is continuously picked up by microphones and information is continuously collected by various sensors, so the privacy of both the sender and the listener may be infringed.

Conventionally, the video communication apparatus disclosed in Patent Document 1 is known as a way of carrying out always-on communication while avoiding this problem. The conventional video communication apparatus is briefly described with reference to FIG. 20. The video communication apparatus 100 is a communication terminal composed of input means 10, output means 20, communication control means 30, terminal status information processing means 40, input/output control processing means 50, audio processing means 60, and video processing means 70. The input means 10 includes imaging means 11, audio input means 12, proximity detection means 13, and operation means 14. The proximity detection means 13 is a distance sensor using infrared rays, ultrasonic waves, or the like, and measures the distance between the video communication apparatus 100 and the user. The proximity information detected by the proximity detection means 13 is input to the terminal status information processing means 40. Based on the proximity information, the terminal status information processing means 40 controls the operation of the video processing means 70 and the audio processing means 60 via the input/output control processing means 50. In response to a control signal from the input/output control processing means 50, the video processing means 70 lowers the sharpness of the video. The audio processing means 60 likewise lowers the clarity of the audio or reduces the volume. When the user is close to the video communication apparatus 100 and wants to communicate actively with the other party, the clarity of the video and audio is raised. Conversely, when the user is far from the video communication apparatus 100 and is reluctant to communicate, the clarity of the video and audio is lowered or the volume is reduced.
Patent Document 1: JP 2006-140747 A (FIG. 1)

Patent Document 1 does not describe any specific processing for lowering the clarity of the audio in the conventional communication apparatus. In general, lowering the clarity of audio makes it hard to hear, the listener often finds it unpleasant, and the apparatus is inconvenient to use for communication. With the method of reducing the volume, the content of the conversation can still be understood if the listener listens closely. The conventional communication apparatus was therefore inadequate in terms of both privacy protection and usability.

The present invention has been made in view of these points. Focusing in particular on the audio processing apparatus, its object is to provide a communication voice processing apparatus, a method therefor, and a program therefor that avoid infringing privacy while still allowing the other party's situation to be grasped, and that do not make the listener uncomfortable.

The communication voice processing apparatus according to the present invention comprises a frequency conversion unit, a clarity setting unit, a band division unit, a feature amount calculation unit, a filter unit, and a frequency inverse conversion unit. The frequency conversion unit converts the input audio signal into a frequency-domain signal. The clarity setting unit sets the clarity. The band division unit performs frequency analysis of the input audio signal and generates band-divided signals divided into predetermined frequency bands. The feature amount calculation unit calculates a feature amount for each band-divided signal. For each of the frequency bands, the filter unit multiplies the feature amount of the band-divided signal by a predetermined acoustic frequency signal other than the input audio signal. The frequency inverse conversion unit converts the output signal of the filter unit into a time-domain signal.

The communication voice processing apparatus of the present invention outputs, as its output signal, an acoustic signal such as noise, environmental sound, or musical sound that has been filtered by the feature amounts of the input audio signal picked up by the microphone unit. By choosing an acoustic signal that is pleasant to listen to, the privacy of the sender (oneself) can therefore be protected without making the listener uncomfortable. In addition, although the details of the speech are concealed by the acoustic signal filtered with the feature amounts of the input audio signal, the atmosphere of the situation, including the fact that a conversation is taking place, can still be conveyed while privacy is protected. Furthermore, because the apparatus performs its processing using only the microphone input signal, it can easily be incorporated into communication devices and systems.

Embodiments of the present invention are described below with reference to the drawings. Identical components in the drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example of the functional configuration of Embodiment 1 of the communication voice processing apparatus of the present invention. Its operation flow is shown in FIG. 2. The communication voice processing apparatus 210 comprises a microphone unit 1, a band division unit 2, a feature amount calculation unit 3, a filter unit 4, a clarity setting unit 5, and an output unit 6. The communication control means that enables the always-on connection, and the means for receiving and reproducing audio from the listener side (the other communication voice processing apparatus) transmitted via a network (not shown), are not essential parts of the present invention and are omitted. The microphone unit 1 picks up ambient sound and outputs the input audio signal x(n) to the band division unit 2 and the output unit 6 (step S1 in FIG. 2). Here, n denotes discrete time sampled at a predetermined sampling interval. The A/D converter that digitizes the audio signal is omitted from FIG. 1. The band division unit 2 receives the input audio signal x(n), performs frequency analysis by, for example, filter bank analysis, and generates band-divided signals x_m(n) divided into predetermined frequency bands (step S2). Here m denotes the number of the divided band. For example, with m = 16, band-divided signals x_m(n) for 16 equal-width bands are generated. The division need not be equal; division on a logarithmic scale or division according to auditory characteristics may also be used. The larger the value of m, the closer the acoustic characteristics of the output signal x_c(n) of the filter unit 4, described later, come to those of the input audio signal x(n). The band division unit 2 supplies the band-divided signals x_m(n) to the feature amount calculation unit 3.

The feature amount calculation unit 3 calculates, for example, the average power of each band-divided signal x_m(n) and outputs the feature amount P_m of each band (step S3). The average power may be smoothed in the time direction, which prevents abrupt changes in the average power along the time axis. For example, let k be the current processing frame, P_tmp(k, m) the band average power measured in the current frame, and P(k-1, m) the band average power of the previous frame. The band average power P(k, m) may then be obtained by Equation (1).

$$P(k,m) = (1-\alpha)\,P_{\mathrm{tmp}}(k,m) + \alpha\,P(k-1,m) \qquad (1)$$

Here α is the time constant for smoothing. The feature amount P_m of each band is supplied as one input of the filter unit 4. As the other input of the filter unit 4, an acoustic signal z(n) such as noise, environmental sound, or musical sound is supplied from outside. The filter coefficient generation unit 4a of the filter unit 4 generates filter coefficients D_m corresponding to the feature amounts P_m. The filter unit 4 computes D_m * z_m(n) for each filter coefficient D_m to obtain x_c(n) (step S4), where * denotes convolution. If the acoustic signal z(n) is a sound that is not unpleasant even when heard for a long time, for example a bubble sound, x_c(n) becomes an output signal that carries the average power fluctuation of the band-divided signals x_m(n). The output unit 6 receives the input audio signal x(n), the output signal x_c(n) of the filter unit 4, and the control signal c(n), and outputs the output signal y(n) (step S6). When the control signal c(n) takes a value from 0 to 1, the output signal y(n) is calculated, for example, by Equation (2).

$$y(n) = (1-c(n))\,x(n) + c(n)\,x_c(n) \qquad (2)$$

The control signal c(n) may be a value set in steps or a fixed value. The output signal y(n) is transmitted, via communication control means (not shown), to the listener-side communication voice processing apparatus connected at the other end of the network.
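As an illustration of Equations (1) and (2), the following is a minimal sketch in plain Python. The function names `smooth_band_power` and `mix_output`, and the representation of signals as Python lists, are assumptions made for this example and do not appear in the patent.

```python
def smooth_band_power(p_tmp, p_prev, alpha):
    """Equation (1): recursive smoothing of the per-band average power.

    p_tmp  -- band powers measured in the current frame, P_tmp(k, m)
    p_prev -- smoothed band powers of the previous frame, P(k-1, m)
    alpha  -- smoothing constant; larger values weight the past more
    """
    return [(1.0 - alpha) * cur + alpha * prev
            for cur, prev in zip(p_tmp, p_prev)]


def mix_output(x, x_c, c):
    """Equation (2): blend the input x(n) with the filtered signal x_c(n).

    c = 0 passes the input through unchanged; c = 1 outputs only x_c(n).
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    return [(1.0 - c) * xn + c * xcn for xn, xcn in zip(x, x_c)]
```

With `c` fixed at 1 the second function returns only the filtered signal, which corresponds to the setting used for FIG. 3(b).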

FIGS. 3(a) and 3(b) show the waveforms of the input audio signal x(n) and the output signal y(n) processed by the communication voice processing apparatus 210 of Embodiment 1. In FIGS. 3(a) and 3(b), the horizontal axis is time (seconds) and the vertical axis is amplitude. FIG. 3(a) is the waveform of a female voice from a speech database sampled at 16 kHz. A bubble sound was used as the acoustic signal z(n), which is not shown. FIG. 3(b) is the waveform of the output signal y(n) when the control signal c(n) = 1. The value of m is 16. In the silent intervals of the input audio signal x(n) around 4, 8, and 12 seconds, the output signal y(n) is also silent. Looking at the shape of the time envelope of each waveform, the output signal y(n) and the input audio signal x(n) are seen to be similar. In this example c(n) = 1, so the output signal y(n) equals the output signal x_c(n) of the filter unit 4 (Equation (2)). That output signal carries the average power fluctuation of each frequency band of the input audio signal x(n), so the meaning of the words cannot be understood but the atmosphere of the speech is conveyed. The fact that someone is talking can thus be conveyed while the content of the conversation is concealed. Furthermore, because the output signal y(n) is a bubble sound filtered by the feature amounts of the input audio signal, it is not unpleasant even when heard for a long time.

Although filter bank analysis was used above as the frequency analysis means of the band division unit 2, a short-time Fourier transform or a wavelet transform may be used instead. Also, although the acoustic signal z(n) was described as being supplied from outside, a signal storage unit 7 may be provided inside the communication voice processing apparatus 210, as indicated by the broken line in FIG. 1, and the acoustic signal z(n) may be stored there.

Embodiment 1 generates the output signal y(n) by changing the proportions of the filter unit output and the input audio signal according to the clarity set in the clarity setting unit 5. For convenience, this function is hereinafter called the privacy function. In Embodiment 1 the privacy function operates at all times and cannot be disabled. The configuration of Embodiment 1 may therefore be inconvenient when, for example, the communication voice processing apparatus is used to watch over an infant, for whom privacy protection is unnecessary. Embodiment 2, in which the privacy function can be selectively turned ON and OFF, is described next.

FIG. 4 shows an example of the functional configuration of the communication voice processing apparatus 220 of Embodiment 2 of the present invention. Its operation flow is shown in FIG. 5. The communication voice processing apparatus 220 performs the privacy function on frequency-domain signals. Compared with Embodiment 1, Embodiment 2 adds a switch unit 41, a processing control unit 42, a frequency conversion unit 43, and a frequency inverse conversion unit 47, and the operations of the band division unit 44 and the output unit 45 differ. Only the newly added components and the parts whose operation differs from Embodiment 1 are described here.

The switch unit 41 is an input device with a common user interface such as buttons, a touch sensor, or a dial, and outputs a switch signal p(n) to the processing control unit 42 and the output unit 45. The switch signal p(n) turns the privacy function ON when p(n) = "1" and OFF when p(n) = "0" ("1" and "0" denote logic levels).

The processing control unit 42 receives the discretized input audio signal x(n) output by the microphone unit 1 and the switch signal p(n), and decides its output accordingly. When p(n) = "1", the processing control unit 42 passes the input audio signal x(n) to the frequency conversion unit 43 (Yes in step S42 of FIG. 5). The frequency conversion unit 43 converts the input audio signal x(n) into a frequency-domain signal X(ω) using, for example, a short-time Fourier transform. The short-time Fourier transform cuts out a segment of the input audio signal x(n) with a window function of fixed length and applies a fast Fourier transform (FFT) to it to calculate X(ω). Example FFT settings are a 16 kHz sampling rate, 256 samples, and a shift length of 1/2. The band division unit 44 divides the frequency-domain signal X(ω) into, for example, 16 equal parts to generate band-divided signals X_m(ω) (step S44). The feature amount calculation unit 3 is the same as in Embodiment 1 and outputs the feature amount P_m of each band as one input of the filter unit 46. The other input of the filter unit 46 receives, from outside, an acoustic frequency signal Z(ω) of noise, environmental sound, musical sound, or the like, occupying the same frequency band as the frequency signal from which the band-divided signals X_m(ω) were obtained. The filter unit 46 generates Z_m(ω) by dividing Z(ω) into as many bands as there are feature amounts P_m, and filters Z_m(ω) with the feature amount P_m (step S46). That is, it computes P_m · Z_m(ω) for each band to obtain X_cm(ω), and outputs X_c(ω), the sum of X_cm(ω) over all bands, to the frequency inverse conversion unit 47. The frequency band of the acoustic frequency signal Z(ω) need not coincide with that of the input audio signal; the number of divisions may be adjusted to suit the desired band.
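The per-band filtering of step S46 can be sketched as follows, under stated assumptions: one frame of each spectrum is held as a list of complex bins, the bands are of nearly equal width, and the feature amount is the mean squared magnitude of the bins in a band. The helper names `band_edges`, `band_average_power`, and `filter_spectrum` are hypothetical and chosen only for this illustration.

```python
def band_edges(num_bins, num_bands):
    """Split bin indices 0..num_bins-1 into num_bands nearly equal ranges."""
    if num_bands < 1 or num_bins < num_bands:
        raise ValueError("need at least one bin per band")
    return [(b * num_bins // num_bands, (b + 1) * num_bins // num_bands)
            for b in range(num_bands)]


def band_average_power(X, edges):
    """Feature amount P_m: mean of |X(w)|^2 over the bins of each band."""
    return [sum(abs(v) ** 2 for v in X[lo:hi]) / (hi - lo)
            for lo, hi in edges]


def filter_spectrum(X, Z, num_bands):
    """Scale each band of Z(w) by the feature amount P_m measured on X(w).

    The bands do not overlap, so writing each scaled band into its own
    bin range is equivalent to summing X_cm(w) over all bands.
    """
    if len(X) != len(Z):
        raise ValueError("X and Z must have the same number of bins")
    edges = band_edges(len(X), num_bands)
    P = band_average_power(X, edges)
    X_c = [0j] * len(Z)
    for (lo, hi), p in zip(edges, P):
        for i in range(lo, hi):
            X_c[i] = p * Z[i]
    return X_c
```

A silent input frame gives P_m = 0 in every band, so the returned spectrum is all zeros, which matches the observation that silent intervals of the input stay silent in the output.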

The frequency inverse conversion unit 47 converts X_c(ω) output by the filter unit 46 into a time-domain signal x_c(n) using, for example, an inverse short-time Fourier transform (step S47).

The output unit 45 receives the control signal c(n), the switch signal p(n), the input audio signal x(n), and the output signal x_c(n) of the frequency inverse conversion unit 47, and synthesizes and outputs the output signal y(n) (step S45). When p(n) = "1", the output unit 45 synthesizes y(n) by adding x(n) and x_c(n) in proportions that depend on the value of the control signal c(n) (step S451). When p(n) = "0", it outputs the input audio signal x(n) unchanged as the output signal, y(n) = x(n) (step S452, picked-up signal output processing). In other words, the privacy function is OFF.
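The switching behavior of the output unit 45 can be sketched as below. This assumes that the weighting for p(n) = "1" follows the same form as Equation (2) of Embodiment 1, which the text suggests but does not state explicitly; the function name `output_unit` is hypothetical.

```python
def output_unit(x, x_c, c, p):
    """Return y(n) for one block of samples.

    p -- switch signal: 1 enables the privacy function, 0 disables it
    c -- control signal in [0, 1] from the clarity setting unit
    """
    if p == 0:
        # Privacy function OFF (step S452): y(n) = x(n)
        return list(x)
    # Privacy function ON (step S451): weighted sum, assumed as in Eq. (2)
    return [(1.0 - c) * xn + c * xcn for xn, xcn in zip(x, x_c)]
```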

In this way, the privacy function can be turned ON and OFF by the switch signal p(n) output by the switch unit 41, which improves the usability of the communication voice processing apparatus of the present invention. Instead of storing the acoustic frequency signal Z(ω) itself, time-domain audio data may be recorded and frequency-analyzed to generate Z(ω) each time the filter unit 46 performs filtering. Also, although the filter unit 46 was described as computing Z_m(ω) by dividing Z(ω) according to the number of feature amounts P_m, Z_m(ω) may be recorded in advance. Finally, the filter operation of the filter unit 46 was described as computing P_m · Z_m(ω) for each band, but simply multiplying the acoustic frequency signal Z_m(ω) by the feature amount P_m can distort X_c(ω). In that case, a normalizing coefficient W may be introduced so that the average power over all bands and the average power of X_c(ω) become equal, and W · P_m · Z_m(ω) may be computed for each band.
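One possible reading of the normalizing coefficient W is sketched below. It assumes that "the average power over all bands" refers to the input spectrum X(ω), so that W rescales X_c(ω) to the same mean power as X(ω); the text does not spell this out, and the function names are hypothetical.

```python
import math


def mean_power(S):
    """Mean of |S(w)|^2 over all bins of one frame."""
    return sum(abs(v) ** 2 for v in S) / len(S)


def normalization_coefficient(X, X_c):
    """Amplitude gain W that matches the mean power of X_c to that of X."""
    pc = mean_power(X_c)
    if pc == 0.0:
        return 0.0  # nothing to scale in a silent frame
    return math.sqrt(mean_power(X) / pc)


def normalize(X, X_c):
    """Apply W to every bin, i.e. W * P_m * Z_m(w) in each band."""
    W = normalization_coefficient(X, X_c)
    return [W * v for v in X_c]
```

Because W multiplies amplitudes, it is the square root of the power ratio.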

The ability to turn the privacy function ON and OFF can also be achieved by adding the switch unit 41 to the configuration of Embodiment 1 and replacing the output unit 6 with the output unit 45.

FIG. 6 shows an example of the functional configuration of the communication voice processing apparatus 230 of Embodiment 3 of the present invention. Its operation flow is shown in FIG. 7. In Embodiment 3, a sensor unit 61 is provided in place of the clarity setting unit 5 of Embodiment 2. The rest of the configuration is the same as in Embodiment 2. The operations of the sensor unit 61 and the output unit 62 are described here.

The sensor unit 61 detects information about the surrounding environment and outputs sensor data s(n), representing the distance between the communication voice processing apparatus 230 and the speaker, to the output unit 62 (step S61). A microphone, a CCD image sensor, a temperature sensor, an ultrasonic sensor, an infrared sensor, or the like can be used as the detection device of the sensor unit 61. With any of these detection devices, the distance between the communication voice processing apparatus 230 and the speaker can be detected with an ordinary circuit configuration. Because distance detection is readily achieved with existing technology, a description of a specific configuration is omitted. The detected sensor data s(n) is converted into a value β(n) from 0 to 1 by, for example, a β(n) conversion unit 61a provided in the sensor unit 61, and is input to the output unit 62.

The output unit 62 receives β(n), the input audio signal x(n), the switch signal p(n), and the output signal x_c(n) of the frequency inverse conversion unit 47, and synthesizes and outputs the output signal y(n) (step S62). The output unit 62 synthesizes the output signal y(n) as the value calculated by Equation (3) (step S621).

$$y(n) = (1-\beta(n))\,x(n) + \beta(n)\,x_c(n) \qquad (3)$$

The operation of the output unit 62 with respect to the switch signal p(n) is the same as in Embodiment 2. Providing the sensor unit 61 in place of the clarity setting unit 5 makes it possible to change the clarity of the output signal y(n) automatically according to the distance between the communication voice processing apparatus 230 and the speaker. When the speaker is close to the communication voice processing apparatus 230, β(n) is small, so the input audio signal x(n) makes up a large share of the output signal y(n) and y(n) is clear. Conversely, when the speaker is far from the communication voice processing apparatus 230, the filtered output signal x_c(n) of the frequency inverse conversion unit 47 makes up a large share, so y(n) is unclear. When the switch signal p(n) = "0", the input audio signal x(n) is output unchanged as the output signal y(n) (step S622).
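The text leaves the conversion from sensor data to β(n) open. As one hedged illustration, the sketch below maps a measured distance linearly onto [0, 1] and applies Equation (3). The linear mapping, the parameters `near_m` and `far_m`, and their default values are assumptions for this example only.

```python
def beta_from_distance(distance_m, near_m=0.5, far_m=3.0):
    """Map a speaker distance in meters to beta in [0, 1].

    At or below near_m beta is 0 (clear speech); at or beyond far_m
    beta is 1 (only the filtered signal). Values in between are
    interpolated linearly.
    """
    if far_m <= near_m:
        raise ValueError("far_m must be greater than near_m")
    t = (distance_m - near_m) / (far_m - near_m)
    return min(1.0, max(0.0, t))


def mix_by_distance(x, x_c, distance_m):
    """Equation (3) with beta derived from the measured distance."""
    beta = beta_from_distance(distance_m)
    return [(1.0 - beta) * xn + beta * xcn for xn, xcn in zip(x, x_c)]
```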

Providing the sensor unit 61 thus makes it possible to vary the clarity of the output signal y(n) automatically. The β(n) conversion unit 61a may instead be provided in the output unit 62.

FIG. 8 shows, as Embodiment 4, a communication voice processing apparatus 240 that is a modification of Embodiment 2. Its operation flow is shown in FIG. 9. Embodiment 4 differs from Embodiment 2 (FIG. 4) in that the control signal c(n) set in the clarity setting unit 5 is input to the band division unit 81, and in the operation of the output unit 82. The other parts are the same as in Embodiment 2.

The band division unit 81 receives the frequency-domain signal X(ω) from the frequency conversion unit 43 and the control signal c(n) from the clarity setting unit 5, and outputs the band-divided signals X_m(ω) to the feature amount calculation unit 3 (step S61 in FIG. 9). The value of m, which was described as, for example, m = 16 in Embodiment 2, is here determined by the control signal c(n). For example, when c(n) takes a value from 0 to 1, m is brought closer to 32 as c(n) becomes smaller and closer to 16 as c(n) becomes larger. When m is large and the number of bands is high, the bubble sound in the output signal x_c(n) of the frequency inverse conversion unit 47 becomes an audio signal whose characteristics are closer to those of the input audio signal x(n). Conversely, when m is small, the output signal x_c(n) of the frequency inverse conversion unit 47 is closer to the bubble sound. The output unit 82 outputs the output signal x_c(n) of the frequency inverse conversion unit 47 when the switch signal p(n) = "1", and outputs the input audio signal x(n) when p(n) = "0".
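The text gives only the endpoints of the relation between c(n) and m (close to 32 for small c(n), close to 16 for large c(n)). A minimal sketch of one such mapping, assuming linear interpolation and rounding to an integer band count, is shown below; the function name and the parameters `m_max` and `m_min` are hypothetical.

```python
def band_count_from_control(c, m_max=32, m_min=16):
    """Choose the number of bands m from the control signal c in [0, 1].

    c = 0 gives m_max bands and c = 1 gives m_min bands; intermediate
    values are interpolated linearly and rounded (assumed mapping).
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    return round(m_max - (m_max - m_min) * c)
```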

Therefore, when the switch signal p(n) = "1" and the value of the control signal c(n) is small, the output signal x_c(n) of the frequency inverse conversion unit 47 that is superimposed on the output signal y(n) is closer to the bubble sound, so the clarity of y(n) decreases. When the value of c(n) is large, x_c(n) has characteristics close to those of the input audio signal x(n), so the clarity of y(n) improves. Thus, in the configuration of Embodiment 4 as well, the clarity of the output signal y(n) can be controlled by the value of the control signal c(n) set in the clarity setting unit 5.

FIG. 10 shows, as a fifth embodiment, a communication voice processing apparatus 250 that is a modification of the third embodiment. FIG. 11 shows its operation flow. The fifth embodiment differs from the third embodiment (FIG. 6) in that the sensor data s(n) output by the sensor unit 61 is input to the band division unit 101, and in that the output unit is the output unit 82. The other parts are the same as in the third embodiment.

The band division unit 101 receives the frequency-domain signal X(ω) from the frequency conversion unit 43 and the sensor data s(n) from the sensor unit 61, and outputs the band-divided signal X_m(ω) to the feature amount calculation unit 3 (step S101 in FIG. 11). In this embodiment, the value of m in the band-divided signal X_m(ω) is determined by the sensor data s(n). For example, if s(n) becomes smaller as the distance between the communication voice processing apparatus 250 and the speaker becomes shorter, the band division unit 101 brings m closer to 32 as s(n) decreases and closer to 16 as s(n) increases. The output unit 82 is the one described in the fourth embodiment (FIG. 8); it combines the input speech signal x(n) with the output signal x_c(n) of the inverse frequency conversion unit 47 and outputs the output signal y(n). When m is large, the sound of x_c(n) can be brought close to the input speech signal x(n). When m is small, x_c(n) becomes close to the babble sound. The intelligibility of the output signal y(n) can therefore be controlled by the same operation as in the fourth embodiment.

FIG. 12 shows, as a sixth embodiment, a communication voice processing apparatus 260 that is a modification of the fourth embodiment. FIG. 13 shows its operation flow. The sixth embodiment differs from the fourth embodiment (FIG. 8) in that the control signal c(n) output by the intelligibility setting unit 5 is input to the inverse frequency conversion unit 121, and in that the frequency-domain signal X(ω) output by the frequency conversion unit 43 and the acoustic frequency signal Z(ω) are also input to the inverse frequency conversion unit 121. The rest of the configuration is the same as in the fourth embodiment.

Normally, when a frequency-domain signal is converted into a time-domain signal by the inverse short-time Fourier transform, the amplitude characteristic and the phase characteristic of that frequency-domain signal are used. The inverse frequency conversion unit 121 of this embodiment, however, varies the phase characteristic S_c(ω) of the output signal X_c(ω) of the filter unit 46 according to the control signal c(n). Using the varied phase characteristic S_c(ω), it performs the inverse short-time Fourier transform and converts the output signal X_c(ω) of the filter unit 46 into the time-domain signal x_c(n) (step S121 in FIG. 13). The phase characteristic is varied by changing the mix of the phase characteristic S_X(ω) of the frequency-domain signal X(ω) output by the frequency conversion unit 43 and the phase characteristic S_Z(ω) of the acoustic frequency signal Z(ω), as shown in Equation (4).

S_c(ω) = (1 − c(n))·S_X(ω) + c(n)·S_Z(ω)   (4)
When the value of the control signal c(n) is small, the phase characteristic S_c(ω) of the time-domain signal x_c(n) approaches the phase characteristic S_X(ω) of the frequency-domain signal X(ω) of the input speech signal. Conversely, when c(n) is large, it approaches the phase characteristic S_Z(ω) of the acoustic frequency signal Z(ω), for example that of a babble sound. In this way, the intelligibility of the output signal can also be controlled by having the inverse frequency conversion unit 121 vary, according to the control signal c(n), the phase characteristic used when converting the frequency-domain signal back into a time-domain signal.
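Equation (4) can be illustrated per frequency bin as below. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `mix_phase` is hypothetical, the bins are plain Python complex numbers, and the phases are interpolated directly without any unwrapping, which the text does not discuss.

```python
import cmath


def mix_phase(filtered, x_bins, z_bins, c):
    """Keep the magnitude of each filtered bin and replace its phase.

    The new phase follows Equation (4):
        S_c = (1 - c) * S_X + c * S_Z
    where S_X and S_Z are the phases of the input-speech bin and of the
    acoustic-signal bin. Phase wrapping is ignored (an assumption).
    """
    if not (len(filtered) == len(x_bins) == len(z_bins)):
        raise ValueError("all bin lists must have the same length")
    out = []
    for f, x, z in zip(filtered, x_bins, z_bins):
        s_x = cmath.phase(x)
        s_z = cmath.phase(z)
        s_c = (1.0 - c) * s_x + c * s_z
        out.append(cmath.rect(abs(f), s_c))
    return out
```

With c = 0 the result carries the phase of the input speech, and with c = 1 it carries the phase of the acoustic signal, matching the two limiting cases described above. The returned bins would then be passed to an inverse short-time Fourier transform.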

FIG. 14 shows, as a seventh embodiment, a communication voice processing apparatus 270 that is a modification of the fourth embodiment. FIG. 15 shows its operation flow. The seventh embodiment differs from the fourth embodiment (FIG. 8) in that the acoustic frequency signal input to the filter unit 46 is not Z(ω) itself but the acoustic frequency signal Za(ω) generated by the weighted addition unit 131.

The weighted addition unit 131 receives the frequency-domain signal X(ω) output by the frequency conversion unit 43, the control signal c(n) set in the intelligibility setting unit 5, and the acoustic frequency signal Z(ω), and generates the acoustic frequency signal Za(ω) (step S131 in FIG. 15). When the control signal c(n) takes a value from 0 to 1, for example, Za(ω) is calculated by Equation (5).

Za(ω) = (1 − c(n))·X(ω) + c(n)·Z(ω)   (5)
In this way, the weighted addition unit 131 generates the acoustic frequency signal Za(ω) as a weighted sum of the frequency-domain signal X(ω) and the acoustic frequency signal Z(ω). Consequently, when the value of the control signal c(n) is small, Za(ω) is a signal close to the input speech signal x(n). Conversely, when c(n) is large and Z(ω) is, for example, a babble sound, Za(ω) is a signal close to that babble sound. A configuration provided with the weighted addition unit 131 can thus also control the intelligibility of the output signal y(n).
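The weighted addition of Equation (5) amounts to a per-bin linear blend. The sketch below is an assumption-laden illustration: the function name `weighted_add` is hypothetical, and the spectra are represented as equal-length lists of complex bins.

```python
def weighted_add(x_bins, z_bins, c):
    """Blend two spectra bin by bin following Equation (5).

    Za = (1 - c) * X + c * Z, with c expected in [0, 1].
    """
    if len(x_bins) != len(z_bins):
        raise ValueError("X and Z must have the same number of bins")
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    return [(1.0 - c) * x + c * z for x, z in zip(x_bins, z_bins)]
```

Setting c = 0 returns the input-speech spectrum unchanged and c = 1 returns the acoustic signal unchanged, which are the two limiting cases described in the paragraph above.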

FIG. 16 shows, as an eighth embodiment, a communication voice processing apparatus 280 that is a modification of the third embodiment. FIG. 17 shows its operation flow. The eighth embodiment differs from the third embodiment (FIG. 6) in that the sensor unit 61 is replaced with a power calculation unit 151, and in its output unit 152.

The power calculation unit 151 receives the frequency-domain signal X(ω) output by the frequency conversion unit 43 and calculates the power data Ps(n) given by Equation (6) (step S151 in FIG. 17).

Ps(n) = Σ(X(ω)²)   (6)
Ps(n) is the total power within one frame over which the frequency conversion unit 43 performs, for example, the short-time Fourier transform. Ps(n) therefore grows as the amplitude of the input speech signal x(n) picked up by the microphone unit 1 grows. A large Ps(n) generally means that the speaker is close to the communication voice processing apparatus 280, and a small Ps(n) means that the speaker is far away. Qualitatively, Ps(n) can be handled in the same way as the sensor data of the third embodiment.

The value of Ps(n) is input to the output unit 152 and converted to a value from 0 to 1 by the normalization unit 152a inside the output unit 152. For example, the normalization unit 152a normalizes Ps(n) so that the range between its minimum and maximum values corresponds to 1. After Ps(n) has been normalized, the operation is the same as that of the output unit 45 of the second embodiment. The operation of the output unit 62 in response to the switch signal p(n) is also the same as in the third embodiment.
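The frame power of Equation (6) and the normalization to the range 0 to 1 can be sketched as follows. Several points are assumptions: the squared magnitude |X(ω)|² is used because the bins are complex, the minimum and maximum are tracked as running values over the frames seen so far (the text does not say how they are obtained), and the names `frame_power` and `MinMaxNormalizer` are hypothetical.

```python
def frame_power(x_bins):
    """Total power of one frame: the sum of squared bin magnitudes."""
    return sum(abs(x) ** 2 for x in x_bins)


class MinMaxNormalizer:
    """Scale power values to [0, 1] using the running minimum and maximum."""

    def __init__(self):
        self.lo = None
        self.hi = None

    def update(self, p):
        # Track the smallest and largest power observed so far.
        self.lo = p if self.lo is None else min(self.lo, p)
        self.hi = p if self.hi is None else max(self.hi, p)
        span = self.hi - self.lo
        # Until two distinct values have been seen, the span is zero.
        return 0.0 if span == 0 else (p - self.lo) / span
```

The normalized value can then play the role that the control signal plays in the other embodiments, so that a louder (typically closer) speaker shifts the intelligibility automatically.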

Therefore, even a configuration that provides the power calculation unit 151 in place of the sensor unit 61 can automatically change the intelligibility of the output signal according to the distance between the communication voice processing apparatus 280 and the speaker.

FIG. 18 shows an example of the functional configuration of a communication voice processing apparatus 290 according to a ninth embodiment of the present invention. FIG. 19 shows its operation flow. The ninth embodiment adds a voice section detection unit 171 and a processing control unit 42 to the first embodiment (FIG. 1) and uses the output unit 45 as its output unit.

The voice section detection unit 171 receives the input speech signal x(n) picked up by the microphone unit, determines whether or not the signal is in a voice section, and outputs the determination result d(n) to the processing control unit 42 and the output unit 45 (step S171 in FIG. 19). The determination result d(n) is, for example, a signal that becomes d(n) = "1" when the short-time power of the input speech signal x(n) has exceeded a threshold for at least a fixed time, which is judged to be a voice section. In non-voice sections, d(n) = "0".
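The detection rule described above (short-time power above a threshold for at least a fixed time) can be sketched frame by frame. The frame length, the threshold, the use of non-overlapping frames, and the function name `detect_voice` are all assumptions introduced for illustration; the text specifies none of them.

```python
def detect_voice(x, frame_len, threshold, min_frames):
    """Return one flag d per frame of the sample list x.

    d is 1 once the mean power of the frame has stayed above `threshold`
    for at least `min_frames` consecutive frames, and 0 otherwise.
    Non-overlapping frames are used; trailing samples that do not fill a
    frame are ignored.
    """
    flags = []
    run = 0  # number of consecutive frames above the threshold
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        run = run + 1 if power > threshold else 0
        flags.append(1 if run >= min_frames else 0)
    return flags
```

Requiring several consecutive frames keeps a short click or bump from being treated as speech, at the cost of a short delay before d becomes 1.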

Using the output signal d(n) of the voice section detection unit 171 in place of the switch signal p(n), the processing control unit 42 and the output unit 45 operate in the same way as in the second embodiment. Providing the voice section detection unit 171 in place of the switch unit 41 in this way allows the privacy function to operate automatically. The ninth embodiment has been described using an example in which the voice section detection unit 171 is added to the first embodiment (FIG. 1), but the switch unit 41 of the other embodiments can also be replaced with the voice section detection unit 171. In that case, the privacy function can be made to operate automatically only during voice sections.

The embodiments described above are applied to the communication voice processing apparatuses on the sender side and the receiver side of a communication. The apparatus and method of the present invention are not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the invention. For example, the idea of the invention may be applied to a network server placed partway along the line, or the processing may be performed only on the receiver side. Besides the configurations of the embodiments described above, configurations using a channel vocoder based on a filter bank, a formant vocoder, a pattern matching vocoder, a correlation vocoder, a phase vocoder, a maximum likelihood vocoder, a homomorphic vocoder, an ACOR vocoder, a linear prediction vocoder, or an LSP vocoder are also conceivable. For example, a channel vocoder based on a filter bank controls intelligibility by controlling the number of channels, and a linear prediction vocoder controls intelligibility by controlling the linear prediction order. The processes described for the above apparatus and method need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as required.

When the processing means of the above apparatus are realized by a computer, the processing content of the functions that each apparatus should have is described by a program. Executing this program on the computer realizes the processing means of each apparatus on the computer.

The program describing this processing content can be stored on a computer-readable storage medium. The computer-readable storage medium may be of any kind, such as a magnetic storage device, an optical disc, a magneto-optical storage medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic storage device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), or a CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical storage medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable storage medium such as a DVD or CD-ROM on which the program is stored. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be realized in hardware.

FIG. 1 is a diagram showing an example of the functional configuration of the communication voice processing apparatus 210 according to the first embodiment of the present invention.
FIG. 2 is a diagram showing the operation flow of the communication voice processing apparatus 210.
FIG. 3 is a diagram showing the signal waveforms of the input speech signal x(n) and the output signal y(n) processed by the communication voice processing apparatus 210.
FIG. 4 is a diagram showing an example of the functional configuration of the communication voice processing apparatus 220 according to the second embodiment of the present invention.
FIG. 5 is a diagram showing the operation flow of the communication voice processing apparatus 220.
FIG. 6 is a diagram showing an example of the functional configuration of the communication voice processing apparatus 230 according to the third embodiment of the present invention.
FIG. 7 is a diagram showing the operation flow of the communication voice processing apparatus 230.
FIG. 8 is a diagram showing the communication voice processing apparatus 240 according to the fourth embodiment of the present invention.
FIG. 9 is a diagram showing the operation flow of the communication voice processing apparatus 240.
FIG. 10 is a diagram showing the communication voice processing apparatus 250 according to the fifth embodiment of the present invention.
FIG. 11 is a diagram showing the operation flow of the communication voice processing apparatus 250.
FIG. 12 is a diagram showing the communication voice processing apparatus 260 according to the sixth embodiment of the present invention.
FIG. 13 is a diagram showing the operation flow of the communication voice processing apparatus 260.
FIG. 14 is a diagram showing the communication voice processing apparatus 270 according to the seventh embodiment of the present invention.
FIG. 15 is a diagram showing the operation flow of the communication voice processing apparatus 270.
FIG. 16 is a diagram showing the communication voice processing apparatus 280 according to the eighth embodiment of the present invention.
FIG. 17 is a diagram showing the operation flow of the communication voice processing apparatus 280.
FIG. 18 is a diagram showing the communication voice processing apparatus 290 according to the ninth embodiment of the present invention.
FIG. 19 is a diagram showing the operation flow of the communication voice processing apparatus 290.
FIG. 20 is a diagram showing the functional configuration of the conventional video communication apparatus 100.

Claims

A communication voice processing apparatus comprising:
a frequency conversion unit that converts an input speech signal into a frequency-domain signal;
an intelligibility setting unit that sets an intelligibility;
a band division unit that outputs band-divided signals obtained by dividing the frequency-domain signal into a plurality of frequency bands, the number of divisions being larger as the intelligibility is lower and smaller as the intelligibility is higher;
a feature amount calculation unit that calculates a feature amount for each of the band-divided signals;
a filter unit that, for each of the plurality of frequency bands, multiplies the feature amount of the band-divided signal by a predetermined acoustic frequency signal other than the input speech signal; and
an inverse frequency conversion unit that converts the output signal of the filter unit into a time-domain signal.

A communication voice processing apparatus comprising:
a frequency conversion unit that converts an input speech signal into a frequency-domain signal;
a band division unit that outputs band-divided signals obtained by dividing the frequency-domain signal into predetermined frequency bands;
a feature amount calculation unit that calculates a feature amount for each of the band-divided signals;
a filter unit that, for each of the plurality of frequency bands, multiplies the feature amount of the band-divided signal by a predetermined acoustic frequency signal other than the input speech signal;
an intelligibility setting unit that sets an intelligibility; and
an inverse frequency conversion unit that varies the phase characteristic of the output signal of the filter unit so that it is closer to the phase characteristic of the input speech signal as the intelligibility is higher and closer to the phase characteristic of the predetermined acoustic frequency signal other than the input speech signal as the intelligibility is lower, and converts the output signal into a time-domain signal.

A communication voice processing apparatus comprising:
a frequency conversion unit that converts an input speech signal into a frequency-domain signal;
a band division unit that outputs band-divided signals obtained by dividing the frequency-domain signal into predetermined frequency bands;
a feature amount calculation unit that calculates a feature amount for each of the band-divided signals;
an intelligibility setting unit that sets an intelligibility;
a weighted addition unit that performs weighted addition of the output signal of the frequency conversion unit and a predetermined acoustic frequency signal other than the input speech signal, such that the weight of the output signal of the frequency conversion unit is larger as the intelligibility is higher and smaller as the intelligibility is lower;
a filter unit that, for each of the plurality of frequency bands, multiplies the feature amount of the band-divided signal by the output signal of the weighted addition unit; and
an inverse frequency conversion unit that converts the output signal of the filter unit into a time-domain signal.

A communication voice processing method comprising:
a frequency conversion step in which a frequency conversion unit converts an input speech signal into a frequency-domain signal;
an intelligibility setting step in which an intelligibility setting unit sets an intelligibility;
a band division step in which a band division unit outputs band-divided signals obtained by dividing the frequency-domain signal into a plurality of frequency bands, the number of divisions being larger as the intelligibility is lower and smaller as the intelligibility is higher;
a feature amount calculation step in which a feature amount calculation unit calculates a feature amount for each of the band-divided signals;
a filter step in which a filter unit, for each of the plurality of frequency bands, multiplies the feature amount of the band-divided signal by a predetermined acoustic frequency signal other than the input speech signal; and
an inverse frequency conversion step in which an inverse frequency conversion unit converts the output signal of the filter unit into a time-domain signal.

A communication voice processing method comprising:
a frequency conversion step in which a frequency conversion unit converts an input speech signal into a frequency-domain signal;
a band division step in which a band division unit outputs band-divided signals obtained by dividing the frequency-domain signal into predetermined frequency bands;
a feature amount calculation step in which a feature amount calculation unit calculates a feature amount for each of the band-divided signals;
a filter step in which a filter unit, for each of the plurality of frequency bands, multiplies the feature amount of the band-divided signal by a predetermined acoustic frequency signal other than the input speech signal;
an intelligibility setting step in which an intelligibility setting unit sets an intelligibility; and
an inverse frequency conversion step in which an inverse frequency conversion unit varies the phase characteristic of the output signal of the filter unit so that it is closer to the phase characteristic of the input speech signal as the intelligibility is higher and closer to the phase characteristic of the predetermined acoustic frequency signal other than the input speech signal as the intelligibility is lower, and converts the output signal into a time-domain signal.

A communication voice processing method comprising:
a frequency conversion step in which a frequency conversion unit converts an input speech signal into a frequency-domain signal;
a band division step in which a band division unit outputs band-divided signals obtained by dividing the frequency-domain signal into predetermined frequency bands;
a feature amount calculation step in which a feature amount calculation unit calculates a feature amount for each of the band-divided signals;
an intelligibility setting step in which an intelligibility setting unit sets an intelligibility;
a weighted addition step in which a weighted addition unit performs weighted addition of the output signal of the frequency conversion step and a predetermined acoustic frequency signal other than the input speech signal, such that the weight of the output signal of the frequency conversion step is larger as the intelligibility is higher and smaller as the intelligibility is lower;
a filter step in which a filter unit filters the output signal of the weighted addition unit according to the feature amount for each band; and
an inverse frequency conversion step in which an inverse frequency conversion unit converts the output signal of the filter unit into a time-domain signal.

A program for causing a computer to function as the communication voice processing apparatus according to any one of claims 1 to 3.