JP2018110301A

JP2018110301A - Telephone calling device and echo suppression program

Info

Publication number: JP2018110301A
Application number: JP2016256823A
Authority: JP
Inventors: Kaori Endo (遠藤 香緒里)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2016-12-28
Filing date: 2016-12-28
Publication date: 2018-07-12

Abstract

PROBLEM TO BE SOLVED: To efficiently and correctly determine whether a microphone input sound in a telephone calling device is in a double-talk state.
SOLUTION: A telephone calling device comprises a received sound processing part, a component extraction part, a transfer characteristic learning part, and a determination part. The received sound processing part removes components of a predetermined frequency band from a received voice received from an external device and outputs the result to a receiver. The component extraction part extracts the components of the predetermined frequency band from an input voice input from a microphone. The transfer characteristic learning part learns the transfer characteristic of an echo path from the receiver to the microphone on the basis of the received voice and the input voice. The determination part causes the transfer characteristic learning part to learn the transfer characteristic when the magnitude of the components of the predetermined frequency band extracted from the input voice is equal to or less than a threshold.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a call device and an echo suppression program.

Some call devices, such as mobile terminals, suppress the echo component contained in the microphone input sound picked up by the device's own microphone and transmit the echo-suppressed input sound to another call device. Here, the echo component is the part of the microphone input sound that originates from the received sound output from the device's own receiver.

A call device of this kind learns the transfer characteristic of the echo path during a call from the received sound and the microphone input sound, and suppresses the echo component contained in the microphone input sound based on that transfer characteristic. However, if the transfer characteristic is learned in a double-talk state, in which the microphone input sound contains sound components other than the received sound, an incorrect transfer coefficient is learned and it becomes difficult to suppress the echo component correctly. For this reason, a call device that suppresses echo components learns the transfer characteristic from the received sound and the microphone input sound, and suppresses the echo component, only when the microphone input sound is not in a double-talk state.

One method of determining whether the microphone input sound is in a double-talk state is to capture the movement of the user's mouth with a camera (see, for example, Patent Document 1).

Another method of determining whether the microphone input sound is in a double-talk state is based on the volume ratio between the audio output signal and the audio input signal (see, for example, Patent Document 2).

Furthermore, there is a method that outputs to the echo path a signal obtained by removing a component of a specific frequency from the received signal, and obtains a control parameter for suppressing the echo component by using the component at the same specific frequency detected from the transmitted signal (see, for example, Patent Document 3). In this method, a noise power is obtained from the power of the specific frequency component detected from the transmitted signal, a combined power including the noise and the echo component is obtained from the power of a frequency component that contains the echo component, and the control parameter is then obtained using the noise power and the combined power.

Patent Document 1: JP 2000-295338 A
Patent Document 2: JP 2007-053512 A
Patent Document 3: JP 2011-130170 A

When the movement of the user's mouth is captured with a camera, the double-talk state cannot be determined correctly unless the user's mouth is within the imaging range of the camera. When the double-talk state is detected from the volume ratio between the audio output signal and the audio input signal, it cannot be determined correctly if the level of the user's voice or of the ambient noise contained in the audio input signal is low. Furthermore, when the control parameter is obtained using the noise power derived from the power of the specific frequency component detected from the transmitted signal and the combined power derived from the power of the frequency component containing the echo component, the processing load increases.

In one aspect, an object of the present invention is to efficiently and correctly determine whether the microphone input sound in a call device is in a double-talk state.

A call device according to one aspect includes a received sound processing unit, a component extraction unit, a transfer characteristic learning unit, and a determination unit. The received sound processing unit removes the components of a predetermined frequency band from the received voice received from an external device and outputs the result to a receiver. The component extraction unit extracts the components of the predetermined frequency band from the input voice input from a microphone. The transfer characteristic learning unit learns the transfer characteristic of the echo path from the receiver to the microphone based on the received voice and the input voice. The determination unit causes the transfer characteristic learning unit to learn the transfer characteristic when the magnitude of the components of the predetermined frequency band extracted from the input voice is equal to or less than a threshold.

According to the above aspect, it is possible to efficiently and correctly determine whether the microphone input sound in the call device is in a double-talk state.

A diagram showing the functional configuration of a call device according to an embodiment.
A graph explaining the mask information.
A flowchart explaining the processing performed by the call device according to the embodiment.
A flowchart explaining the processing that performs frequency analysis on a frame of the received sound.
A flowchart explaining the processing that determines the mask frequency to be removed from the received sound.
A flowchart (part 1) explaining the processing that calculates the first mask frequency.
A flowchart (part 2) explaining the processing that calculates the first mask frequency.
A graph explaining the method of calculating the first mask frequency.
A flowchart (part 1) explaining the processing that calculates the second mask frequency.
A flowchart (part 2) explaining the processing that calculates the second mask frequency.
A graph explaining the method of calculating the second mask frequency.
A flowchart explaining the processing that removes the mask frequency components from the received sound and outputs the result.
A flowchart explaining the processing that determines whether the state is double talk.
A diagram showing the hardware configuration of a mobile terminal.
A diagram showing the hardware configuration of a computer.

FIG. 1 is a diagram showing the functional configuration of a call device according to an embodiment.
As shown in FIG. 1, the call device 1 according to the present embodiment includes a transmission/reception unit 110, a received sound analysis unit 120, a mask frequency determination unit 130, a received sound processing unit 140, a component extraction unit 150, a double-talk determination unit 160, and an input sound processing unit 170. The call device 1 also includes a mask information storage unit 191, an analysis result holding unit 192, and a transfer characteristic storage unit 193. In addition, the call device 1 includes a receiver 2 and a microphone 3.

The transmission/reception unit 110 receives an audio signal (received sound) from another call device 4, and transmits to the other call device 4 an audio signal obtained by suppressing the echo component contained in the audio signal picked up by the microphone 3 (microphone input sound). The echo component is the component of the sound output from the receiver 2 that is contained in the microphone input sound. In this specification, the receiver 2 is given as the element that outputs the received sound (audio signal) received from the other call device 4 to the outside of the call device 1 as sound waves, but the call device 1 may instead include an element with a function equivalent to that of the receiver 2, such as a speaker.

The received sound analysis unit 120 performs frequency analysis on the received sound received from the other call device 4 and calculates the frequency peaks.

The mask frequency determination unit 130 calculates the frequency to be removed from the received sound (the mask frequency) based on the result of the frequency analysis of the received sound. The mask frequency is the frequency of a component that is masked by other frequency components of the received sound, so that removing it from the received sound does not affect perception. The mask frequency determination unit 130 calculates the mask frequency based on the mask information stored in the mask information storage unit 191 and the frequency analysis results held in the analysis result holding unit 192.

The received sound processing unit 140 generates a processed received sound by removing the mask frequency component from the received sound. The received sound processing unit 140 outputs the generated processed received sound to the receiver 2 and outputs information indicating the mask frequency to the component extraction unit 150.

The component extraction unit 150 extracts the mask frequency component contained in the microphone input sound acquired from the microphone 3.

The double-talk determination unit 160 determines whether the state is double talk based on the frequency component extracted from the microphone input sound. The double-talk state is a state in which the microphone input sound contains the voice of a user speaking toward the microphone 3 of the call device 1 or noise occurring around the call device 1.

The input sound processing unit 170 generates a processed input sound, in which the echo component contained in the microphone input sound is suppressed, based on the determination result of the double-talk determination unit 160 and the transfer characteristic of the echo path from the receiver 2 to the microphone 3. When the state is not double talk, the input sound processing unit 170 learns the transfer characteristic of the echo path and then removes the echo component contained in the microphone input sound based on the learned transfer characteristic. At this time, the input sound processing unit 170 stores the learned transfer characteristic in the transfer characteristic storage unit 193. When the state is double talk, the input sound processing unit 170 removes the echo component contained in the microphone input sound based on the transfer characteristic stored in the transfer characteristic storage unit 193 (that is, the transfer characteristic learned when the state was not double talk). The input sound processing unit 170 includes a learning determination unit 171, a transfer characteristic learning unit 172, and an echo component suppression unit 173. The learning determination unit 171 determines whether to learn the transfer characteristic based on the determination result of the double-talk determination unit 160. The transfer characteristic learning unit 172 learns the transfer characteristic of the echo path based on the received sound and the microphone input sound. The echo component suppression unit 173 generates, based on the transfer characteristic, a processed input sound in which the echo component contained in the microphone input sound is suppressed. The processed input sound with the echo component suppressed is transmitted to the other call device 4 via the transmission/reception unit 110.
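The division of roles among units 171 to 173 can be illustrated with a minimal Python sketch. The source does not state which learning algorithm is used, so the normalized LMS (NLMS) update below is a substituted, commonly used technique, and the class name `EchoSuppressor`, the parameters `step` and `eps`, and the two-tap echo path in the example are all assumptions made for illustration. The list `weights` plays the role of the transfer characteristic held in the transfer characteristic storage unit 193.

```python
import random


class EchoSuppressor:
    """Hypothetical stand-in for units 171-173, using NLMS as the learner."""

    def __init__(self, taps, step=0.5, eps=1e-8):
        self.weights = [0.0] * taps  # learned transfer characteristic (unit 193)
        self.step = step
        self.eps = eps

    def process(self, reference, mic_sample, double_talk):
        """reference: the latest `taps` received samples, newest first."""
        echo_estimate = sum(w * x for w, x in zip(self.weights, reference))
        error = mic_sample - echo_estimate  # echo-suppressed output sample
        if not double_talk:
            # Learn only when the state is not double talk (unit 171 decides).
            norm = sum(x * x for x in reference) + self.eps
            self.weights = [w + self.step * error * x / norm
                            for w, x in zip(self.weights, reference)]
        return error


# Assumed echo path for the example: mic = 0.5 * x[n] + 0.25 * x[n - 1].
rng = random.Random(0)
received = [rng.uniform(-1.0, 1.0) for _ in range(500)]
suppressor = EchoSuppressor(taps=2)
for n in range(1, len(received)):
    reference = [received[n], received[n - 1]]
    mic = 0.5 * received[n] + 0.25 * received[n - 1]
    suppressor.process(reference, mic, double_talk=False)
learned = list(suppressor.weights)

# During double talk the stored characteristic is reused and left unchanged;
# the near-end contribution (0.3 here) passes through to the output.
output = suppressor.process([1.0, 0.0], 0.5 + 0.3, double_talk=True)
```

In this sketch the weights are frozen while `double_talk` is true, which corresponds to suppressing the echo with the previously stored transfer characteristic.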

As described above, the mask frequency determination unit 130 in the call device 1 of the present embodiment refers to the mask information and calculates the frequency that is masked by other frequency components of the sound contained in the received sound (the mask frequency). There are two kinds of frequency components that can mask the mask frequency component: the peak frequency component in the same frame, and the component of a past frame that forms a peak in the time variation of the same frequency. For this reason, the present embodiment uses first mask information and second mask information as the mask information. The first mask information indicates the relationship between the distance (frequency difference) from the peak frequency within the frame to be processed and the maximum volume (power) that is masked at each frequency. The peak frequency is a frequency at which the frequency spectrum takes a local maximum. The second mask information indicates the relationship between the distance (time difference) from the peak time in the time variation of the same frequency component over the frames of a predetermined past period and the maximum volume that is masked. The peak time is the time at which the time variation of the same frequency component in the frequency spectra of the predetermined period takes a local maximum.

FIG. 2 is a graph explaining the mask information.
FIG. 2(a) is a graph explaining the first mask information among the mask information referred to by the mask frequency determination unit 130. In the graph of FIG. 2(a), the horizontal axis is the difference from the peak frequency and the vertical axis is the power.

As described above, the first mask information indicates the relationship between the distance from the peak frequency (the frequency difference from the peak frequency) and the maximum volume (power) that is masked. The first mask information is given, for example, by the function G(f) shown in FIG. 2(a). The function G(f) takes the power PW2 when the difference f from the peak frequency is 0 (that is, G(0) = PW2) and decreases as the absolute value of the difference f from the peak frequency increases. The function G(f) can be set by known statistical processing.

As described above, the value calculated by the function G(f) is the maximum volume (power) that is masked. Therefore, when the power of a given frequency in the power spectrum of the frame to be processed is smaller than the maximum volume calculated by the function G(f), the component at that frequency is masked by the peak frequency component. In other words, a frequency component whose power in the power spectrum of the frame to be processed is smaller than the maximum volume calculated by the function G(f) is difficult for a person to perceive, and removing it from the frame to be processed has little effect on perception.

In the graph of FIG. 2(a), the power P2 at the frequency whose difference f from the peak frequency is F1 is larger than the maximum volume G(F1) calculated by the function G(f). When the power of a given frequency in the power spectrum of the frame to be processed is equal to or greater than the maximum volume calculated by the function G(f) in this way, a person can perceive that frequency component.

In contrast, in the graph of FIG. 2(a), the power P3 at the frequency whose difference f from the peak frequency is -F1 is smaller than the maximum volume G(-F1) calculated by the function G(f). When the power of a given frequency in the power spectrum of the frame to be processed is smaller than the maximum volume calculated by the function G(f) in this way, it is difficult for a person to perceive that frequency component. In particular, the larger the difference between the maximum volume calculated by the function G(f) and the power in the power spectrum (the mask amount), the more difficult it is to perceive the frequency component masked by the peak frequency component.

In the present embodiment, among the frequencies masked by the peak frequency component, the frequency with the largest mask amount is taken as the first mask frequency. This makes it possible to minimize the effect on perception of removing the first mask frequency component from the received sound.

When calculating the first mask frequency based on the first mask information, the mask frequency determination unit 130 translates the function G(f) based on, for example, the peak frequency in the frequency spectrum of the frame to be processed and the component P1 at that peak frequency.
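The selection of the first mask frequency can be sketched in Python as follows. The source does not give the shape of G(f) or the exact form of the translation, so the linear decay in decibels, the parameters `drop_db` and `slope_db`, and the function names are hypothetical. The sketch assumes that translating G(f) means centering it on the peak bin and offsetting it by the peak power, and it computes the mask amount as the maximum masked level minus the actual power.

```python
def masked_level_db(peak_bin, peak_db, k, drop_db=10.0, slope_db=3.0):
    """Hypothetical G(f) translated to a peak: highest level at bin k that the peak masks."""
    return peak_db - drop_db - slope_db * abs(k - peak_bin)


def first_mask_frequency(power_db, peak_bins):
    """Return (bin, mask amount) for the masked bin with the largest mask amount, or None."""
    best = None
    for k, level in enumerate(power_db):
        if k in peak_bins:
            continue  # a peak itself is never a removal candidate
        for p in peak_bins:
            amount = masked_level_db(p, power_db[p], k) - level  # mask amount
            if amount > 0 and (best is None or amount > best[1]):
                best = (k, amount)
    return best


# One frame with a single spectral peak at bin 2 (levels in dB).
power_db = [40.0, 35.0, 60.0, 30.0, 28.0, 25.0]
result = first_mask_frequency(power_db, peak_bins=[2])
```

With these assumed numbers, bin 3 lies 17 dB below the translated curve, which is the largest mask amount in the frame, so bin 3 is selected.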

FIG. 2(b) is a graph explaining the second mask information among the mask information referred to by the mask frequency determination unit 130. In the graph of FIG. 2(b), the horizontal axis is the time difference from the frame of the temporal peak and the vertical axis is the power.

As described above, the second mask information indicates the relationship between the distance from the peak in the time variation of the same frequency component over a plurality of frames (the temporal peak), that is, the time difference from the temporal peak, and the maximum volume (power) that is masked. The second mask information is given, for example, by the function H(t) shown in FIG. 2(b). The function H(t) takes the power PW2 when the time difference t from the temporal peak is 0 (that is, H(0) = PW2) and decreases as the time difference t from the temporal peak increases. The function H(t) can be set by known statistical processing.

As described above, the value calculated by the function H(t) is the maximum volume (power) that is masked. Therefore, when the power of a given frequency in the power spectrum of the frame to be processed is smaller than the maximum volume calculated by the function H(t), the component at that frequency is masked by the same frequency component of a past frame. In other words, a frequency component whose power in the power spectrum of the frame to be processed is smaller than the maximum volume calculated by the function H(t) is difficult for a person to perceive, and removing it from the frame to be processed has little effect on perception.

In the graph of FIG. 2(b), the power P3 of the same frequency component in the frame whose time difference t from the frame containing the temporal peak is T2 is larger than the maximum volume H(T2) calculated by the function H(t). When the power of a given frequency in the frame to be processed is larger than the maximum volume calculated by the function H(t) in this way, a person can perceive that frequency component.

In contrast, in the graph of FIG. 2(b), the power P2 of the same frequency component in the frame whose time difference t from the frame containing the temporal peak is T1 is smaller than the maximum volume H(T1) calculated by the function H(t). When the power of a given frequency in the power spectrum of the frame to be processed is smaller than the maximum volume calculated by the function H(t) in this way, it is difficult for a person to perceive that frequency component. In particular, the larger the difference between the maximum volume calculated by the function H(t) and the power in the power spectrum (the mask amount), the more difficult it is to perceive the frequency component masked by the same frequency component of the past frame.

In the present embodiment, among the frequencies in the frame to be processed that are masked by the same frequency component of a past frame (the temporal peak), the frequency with the largest mask amount is taken as the second mask frequency. This makes it possible to minimize the effect on perception of removing the second mask frequency component from the received sound.

When calculating the second mask frequency based on the second mask information, the mask frequency determination unit 130 translates the function H(t) based on, for example, the time difference from the frame containing the temporal peak to the frame currently being processed and the power P1 of that temporal peak.
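The temporal case can be sketched in the same way. Here too the shape of H(t) is not given in the source, so the linear decay per frame, the parameters `drop_db` and `decay_db`, and the function names are hypothetical. As a simplification, the sketch treats the largest past value of each frequency bin as its temporal peak (the peak of the same frequency component over time) rather than searching for every local maximum.

```python
def temporally_masked_level_db(peak_db, frames_since_peak, drop_db=6.0, decay_db=4.0):
    """Hypothetical H(t) translated to a temporal peak of power peak_db."""
    return peak_db - drop_db - decay_db * frames_since_peak


def second_mask_frequency(history_db):
    """history_db: per-frame spectra in dB, oldest first; the last entry is the current frame.

    Returns (bin, mask amount) for the bin with the largest mask amount, or None.
    """
    current, past = history_db[-1], history_db[:-1]
    if not past:
        return None
    best = None
    for k, level in enumerate(current):
        column = [frame[k] for frame in past]  # time variation of bin k
        t_peak = max(range(len(column)), key=column.__getitem__)
        frames_since_peak = len(past) - t_peak
        amount = temporally_masked_level_db(column[t_peak], frames_since_peak) - level
        if amount > 0 and (best is None or amount > best[1]):
            best = (k, amount)
    return best


history_db = [
    [30.0, 50.0, 20.0],
    [32.0, 70.0, 22.0],
    [31.0, 40.0, 45.0],
    [30.0, 35.0, 21.0],  # current frame
]
result = second_mask_frequency(history_db)
```

In this example bin 1 peaked at 70 dB two frames earlier, so its current level of 35 dB lies 21 dB below the translated curve and bin 1 is selected.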

After a call connection with the other call device 4 is established, the call device 1 performs processing that outputs the received sound received from the other call device 4 from the receiver 2, and processing that suppresses the echo component contained in the microphone input sound input from the microphone 3 and transmits the result to the other call device 4. After the call connection with the other call device 4 is established, the call device 1 according to the present embodiment performs the processing shown in FIG. 3.

FIG. 3 is a flowchart explaining the processing performed by the call device according to the embodiment.
The call device 1 first selects a frame to be processed from the received sound (step S1) and performs frequency analysis on that frame (step S2). Steps S1 and S2 are performed by the received sound analysis unit 120. The received sound analysis unit 120 converts the audio waveform of the frame to be processed into a frequency spectrum according to a known analysis method, and calculates the values of the peak frequencies and the number of peaks. The received sound analysis unit 120 also causes the analysis result holding unit 192 to hold the result of the frequency analysis of the frame to be processed. In doing so, the received sound analysis unit 120 causes the analysis result holding unit 192 to hold the frequency analysis results for a predetermined number of past frames, including the frame currently being processed. In the following description, a frequency band in the frequency spectrum, or in the power spectrum calculated from the frequency spectrum, is also referred to simply as a "frequency".
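As an illustration of steps S1 and S2, the following Python sketch converts one frame into a power spectrum, picks the peak frequencies, and keeps a bounded history of results. The source only states that a known analysis method is used, so the naive DFT, the strict-local-maximum peak rule, the `floor` parameter, and the names `power_spectrum`, `find_peaks`, and `HISTORY_FRAMES` are all assumptions made for this sketch.

```python
import cmath
import math
from collections import deque


def power_spectrum(frame):
    """Return the power of DFT bins 0..N/2 of a real-valued frame (naive DFT)."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        powers.append(abs(acc) ** 2)
    return powers


def find_peaks(powers, floor=1e-6):
    """Return the bins that are strict local maxima and exceed a small floor."""
    return [k for k in range(1, len(powers) - 1)
            if powers[k] > floor and powers[k - 1] < powers[k] > powers[k + 1]]


HISTORY_FRAMES = 8  # assumed size of the analysis result holding unit
history = deque(maxlen=HISTORY_FRAMES)

# A 32-sample frame holding a strong tone at bin 4 and a weaker tone at bin 10.
frame = [math.sin(2 * math.pi * 4 * i / 32) + 0.3 * math.sin(2 * math.pi * 10 * i / 32)
         for i in range(32)]
powers = power_spectrum(frame)
peaks = find_peaks(powers)  # peak frequencies; len(peaks) is the number of peaks
history.append(powers)      # oldest entries are dropped once the deque is full
```

A practical implementation would normally use an FFT and a window function; the naive DFT is used here only to keep the sketch free of external dependencies.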

Next, the call device 1 calculates the mask frequencies of the frame to be processed in the received sound based on the result of the frequency analysis of that frame (step S3). Step S3 is performed by the mask frequency determination unit 130. The mask frequency determination unit 130 in the call device 1 of the present embodiment calculates the first mask frequency and the second mask frequency based on the mask information stored in the mask information storage unit 191 and the frequency analysis results for the predetermined number of frames held in the analysis result holding unit 192. The first mask frequency is the frequency with the largest mask amount among the frequencies masked by a frequency component that forms a peak in the frequency spectrum of the frame to be processed. The second mask frequency is the frequency with the largest mask amount among the frequencies masked by the frequency component of a frame that forms a peak in the time variation of the same frequency component in the frequency spectra of the predetermined number of frames. A frequency masked by a peak frequency component is a frequency whose spectrum value (power) is smaller than the maximum volume (function G(f) or function H(t)) set according to the distance from the peak frequency or from the peak frame. The mask amount is the value obtained by subtracting the power from the maximum volume that is masked.

Next, the communication device 1 removes the mask frequency components from the frequency spectrum of the frame to be processed and outputs the result (step S4). The process of step S4 is performed by the received sound processing unit 140. The received sound processing unit 140 sets the spectrum values of the first mask frequency and the second mask frequency in the frequency spectrum to 0, and then inversely transforms the frequency spectrum into a speech waveform to generate a processed received sound. The received sound processing unit 140 outputs the processed received sound to the receiver 2 and outputs information indicating the mask frequencies to the component extraction unit 150. As a result, the user of the communication device 1 hears the received sound 5 from which the first mask frequency and the second mask frequency have been removed. However, as described above, a mask frequency component is a component that is masked by other frequency components and is difficult for the user of the communication device 1 to perceive. For this reason, outputting the received sound 5 from which the mask frequency components have been removed has very little influence on the perception of the user of the communication device 1.
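The removal in step S4 can be illustrated with a minimal Python sketch. It is not the implementation described here: the naive transform `dft`/`idft`, the function name `remove_mask_bins`, the 16-sample test frame, and the zeroing of the mirrored bin (done so that the inverse transform stays real-valued) are all assumptions made for illustration.

```python
import cmath
import math

def dft(x):
    """Naive DFT, standing in for whichever 'known' transform the device uses."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * b * k / n) for k in range(n))
            for b in range(n)]

def idft(spec):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(spec)
    return [sum(spec[b] * cmath.exp(2j * cmath.pi * b * k / n) for b in range(n)).real / n
            for k in range(n)]

def remove_mask_bins(frame, mask_bins):
    """Zero the given bins and return the processed waveform (step S4)."""
    spec = dft(frame)
    n = len(spec)
    for b in mask_bins:
        spec[b] = 0
        # Assumption: also zero the mirrored bin so the output stays real-valued.
        spec[(n - b) % n] = 0
    return idft(spec)

# Hypothetical 16-sample frame: a tone in bin 1 plus a weaker tone in bin 3.
N = 16
frame = [math.cos(2 * math.pi * k / N) + 0.5 * math.cos(2 * math.pi * 3 * k / N)
         for k in range(N)]
cleaned = remove_mask_bins(frame, [3])  # only the bin-1 tone should remain
```

In this toy frame, removing bin 3 leaves the bin-1 tone untouched, which mirrors the idea that only the masked component is taken out of the received sound.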

Next, the communication device 1 extracts the mask frequency components from the frame to be processed in the microphone input sound (step S5). The process of step S5 is performed by the component extraction unit 150. The component extraction unit 150 converts the frame to be processed in the microphone input sound input from the microphone 3 from a speech waveform into a frequency spectrum, and extracts from that frequency spectrum the components of the mask frequencies received from the received sound processing unit 140. When the received sound processing unit 140 has removed the components of the first mask frequency and the second mask frequency from the received sound, the component extraction unit 150 extracts the component of the first mask frequency and the component of the second mask frequency from the frequency spectrum of the microphone input sound.

Next, based on the values of the mask frequency components extracted from the microphone input sound, the communication device 1 determines whether or not the sound included in the frame to be processed is in a double talk state (step S6). The process of step S6 is performed by the double talk determination unit 160. The double talk determination unit 160 calculates a power based on the mask frequency components extracted from the frequency spectrum of the microphone input sound, and determines that the frame to be processed is in a double talk state when that power is greater than a threshold. When the component of the first mask frequency and the component of the second mask frequency have been extracted from the microphone input sound, the double talk determination unit 160 determines, for example, that the frame to be processed is in a double talk state when the power of the first mask frequency and the power of the second mask frequency are greater than the threshold.
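The decision in step S6 can be sketched as follows. This is a hedged illustration only: the function name `is_double_talk`, the use of squared magnitude as the power, the reading that both powers must exceed the threshold, and the sample values are assumptions.

```python
def is_double_talk(mic_spectrum, first_mask_bin, second_mask_bin, threshold):
    """Return True when both mask-frequency bins of the microphone spectrum
    carry more power than the threshold (assumed reading of step S6)."""
    # Assumption: power is the squared magnitude of the complex spectrum value.
    p_first = abs(mic_spectrum[first_mask_bin]) ** 2
    p_second = abs(mic_spectrum[second_mask_bin]) ** 2
    return p_first > threshold and p_second > threshold

# Hypothetical spectra: the receiver emits nothing in bins 2 and 4, so any
# sizable power there is taken to come from the near-end talker.
echo_only = [3 + 0j, 1 + 1j, 0.01 + 0j, 2 + 0j, 0.02j]
with_near_end = [3 + 0j, 1 + 1j, 1.5 + 0j, 2 + 0j, 1.2j]

echo_only_result = is_double_talk(echo_only, 2, 4, threshold=0.1)
near_end_result = is_double_talk(with_near_end, 2, 4, threshold=0.1)
```

Because the mask frequency components were removed from the received sound before output, the echo contributes little to those bins, so power observed there in the microphone input points to sound other than the echo.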

Next, the communication device 1 performs processing for removing the echo component from the microphone input sound (steps S7 to S9) based on the determination result of the double talk determination unit 160. The processes of steps S7 to S9 are performed by the input sound processing unit 170. The input sound processing unit 170 first determines, from the determination result of the double talk determination unit 160, whether or not the sound included in the frame to be processed is in a double talk state (step S7). The process of step S7 is performed by the learning determination unit 171 of the input sound processing unit 170.

When the frame to be processed is not in a double talk state (step S7; NO), the input sound processing unit 170 performs a process of learning the transfer characteristic of the echo path (step S8), and then performs a process of suppressing the echo component of the microphone input sound based on the transfer characteristic (step S9). The process of step S8 is performed by the transfer characteristic learning unit 172 of the input sound processing unit 170. The transfer characteristic learning unit 172 learns the transfer characteristic of the echo path from the receiver 2 to the microphone 3 according to a known learning method. For example, the transfer characteristic learning unit 172 calculates the transfer characteristic at each frequency based on the frequency spectrum of the frame to be processed in the received sound and the frequency spectrum of the frame to be processed in the microphone input sound. The process of step S9 is performed by the echo component suppression unit 173 of the input sound processing unit 170. The echo component suppression unit 173 suppresses the echo component included in the microphone input sound by applying the transfer characteristic to each frequency component in the frequency spectrum of the microphone input sound. When the process of step S8 has been performed, the echo component suppression unit 173 suppresses the echo component included in the microphone input sound based on the transfer characteristic learned in step S8.

That is, when the frame to be processed is not in a double talk state (step S7; NO), the learning determination unit 171 of the input sound processing unit 170 next causes the transfer characteristic learning unit 172 to learn the transfer characteristic. Having learned the transfer characteristic based on the frequency spectrum of the frame currently being processed, the transfer characteristic learning unit 172 stores the learned transfer characteristic in the transfer characteristic storage unit 193 and passes it to the echo component suppression unit 173. On receiving the transfer characteristic from the transfer characteristic learning unit 172, the echo component suppression unit 173 suppresses the echo component included in the frame to be processed in the microphone input sound based on the received transfer characteristic. The echo component suppression unit 173 applies the transfer characteristic to each frequency component in the frequency spectrum of the microphone input sound to suppress the echo component included in the microphone input sound.

On the other hand, when the frame to be processed is in a double talk state (step S7; YES), the input sound processing unit 170 skips the process of step S8 and performs the process of step S9. That is, when the frame to be processed is in a double talk state, the input sound processing unit 170 does not learn the transfer characteristic. The learning determination unit 171 of the input sound processing unit 170 next causes the echo component suppression unit 173 to suppress the echo component included in the microphone input sound. At this time, the learning determination unit 171 causes the echo component suppression unit 173 to acquire, from the transfer characteristic storage unit 193, the transfer characteristic learned based on a frame preceding the frame currently being processed (a past frame). The echo component suppression unit 173 suppresses the echo component included in the frame to be processed in the microphone input sound based on the transfer characteristic acquired from the transfer characteristic storage unit 193.
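The branch between learning and reusing the transfer characteristic (steps S7 to S9) can be sketched as below. The description only refers to a "known learning method", so the per-bin ratio estimator and the subtraction of the estimated echo used here are assumptions chosen for brevity, as are the function name `process_frame` and the sample spectra.

```python
def process_frame(recv_spec, mic_spec, double_talk, stored_h):
    """Return (processed microphone spectrum, transfer characteristic to store)."""
    if not double_talk:
        # Step S8 (assumed estimator): per-bin ratio of microphone to received
        # spectrum; bins with no received energy keep the stored value.
        h = [m / r if abs(r) > 1e-12 else old
             for m, r, old in zip(mic_spec, recv_spec, stored_h)]
    else:
        # Step S8 is skipped: reuse the characteristic learned from past frames.
        h = list(stored_h)
    # Step S9 (assumed form): subtract the estimated echo h * received per bin.
    out = [m - hk * r for m, hk, r in zip(mic_spec, h, recv_spec)]
    return out, h

# Hypothetical two-bin example with an echo path gain of 0.5 in both bins.
recv = [1 + 0j, 2 + 0j]
stored = [0j, 0j]
out1, stored = process_frame(recv, [0.5 + 0j, 1 + 0j], False, stored)  # echo only
out2, stored = process_frame(recv, [0.8 + 0j, 1 + 0j], True, stored)   # double talk
```

In the second call the stored characteristic is left unchanged, so the near-end component (0.3 in the first bin) survives the suppression instead of being absorbed into the estimate.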

After suppressing the echo component included in the microphone input sound, the communication device 1 transmits the microphone input sound in which the echo component has been suppressed (the processed input sound) to the other communication device 4 (step S10). The process of step S10 is performed by the transmission/reception unit 110. The transmission/reception unit 110 performs predetermined processing on the microphone input sound in which the echo component has been suppressed by the input sound processing unit 170, and then transmits it to the other communication device 4.

After transmitting the microphone input sound to the other communication device 4, the communication device 1 determines whether the call with the other communication device 4 is continuing (step S11). The process of step S11 is performed by, for example, the transmission/reception unit 110 or the received sound analysis unit 120. When the call is continuing (step S11; YES), the communication device 1 repeats the processes from step S1 onward. When the call has ended (step S11; NO), the communication device 1 ends the above processing.

The flowchart of FIG. 3 shows steps S2 to S10 being performed on the frame selected in step S1, followed by steps S1 to S10 on the next frame to be processed. However, the communication device 1 may pipeline the processes of steps S1 to S10.

Of the processes of steps S1 to S11 performed by the communication device 1, the process of step S2, that is, the frequency analysis of a frame of the received sound, is performed by the received sound analysis unit 120 as described above. The received sound analysis unit 120 performs, for example, the process shown in FIG. 4.

FIG. 4 is a flowchart illustrating the process of performing frequency analysis on a frame of the received sound.

As shown in FIG. 4, the received sound analysis unit 120 first converts the speech waveform of the frame to be processed in the received sound into a frequency spectrum (step S201). In the process of step S201, the received sound analysis unit 120 converts the speech waveform into a frequency spectrum according to a known conversion method. For example, the received sound analysis unit 120 converts the speech waveform into a frequency spectrum by a fast Fourier transform.

Next, the received sound analysis unit 120 calculates the power spectrum of the frame to be processed based on the frequency spectrum (step S202). In the process of step S202, the received sound analysis unit 120 calculates the power of each frequency band in the frequency spectrum according to a known calculation method.
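Steps S201 and S202 can be illustrated with a short Python sketch. It assumes that the power of a band is the squared magnitude of its spectrum value, and it uses a naive discrete Fourier transform in place of a fast Fourier transform so that the example needs only the standard library; the function name `power_spectrum` and the 8-sample frame are hypothetical.

```python
import cmath
import math

def power_spectrum(frame):
    """Return |X(b)|^2 for bins 0..n/2 of a real-valued frame."""
    n = len(frame)
    powers = []
    for b in range(n // 2 + 1):
        # Step S201 (naive DFT here; an FFT gives the same values faster).
        x_b = sum(frame[k] * cmath.exp(-2j * cmath.pi * b * k / n) for k in range(n))
        # Step S202 (assumed definition of power: squared magnitude).
        powers.append(abs(x_b) ** 2)
    return powers

# Hypothetical 8-sample frame holding a single tone in bin 2.
frame = [math.cos(2 * math.pi * 2 * k / 8) for k in range(8)]
powers = power_spectrum(frame)
```

For this frame the power concentrates in bin 2, which is the kind of local maximum the subsequent peak extraction looks for.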

Next, the received sound analysis unit 120 extracts the peak frequencies in the frame currently being processed based on the power spectrum (step S203). In the process of step S203, the received sound analysis unit 120 extracts, according to a known extraction method, the frequency bands that are peaks (local maxima) in the power spectrum of the frame currently being processed. For example, the received sound analysis unit 120 sets the i-th frequency band as a peak frequency when the power P(i,t) of the i-th frequency band in the power spectrum of the frame at time t satisfies the following expressions (1-1) and (1-2).

P(i,t) - P(i-1,t) > TH1 ... (1-1)
P(i,t) - P(i+1,t) > TH1 ... (1-2)

The determination threshold TH1 in expressions (1-1) and (1-2) can be set as appropriate.
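Expressions (1-1) and (1-2) amount to a local-maximum test with a margin of TH1. A minimal Python sketch follows, assuming 0-based band indices and skipping the two edge bands, which lack one neighbor; the function name `peak_bands` and the sample values are hypothetical.

```python
def peak_bands(power, th1):
    """Return the indices i whose power exceeds both neighbors by more than th1."""
    return [i for i in range(1, len(power) - 1)
            if power[i] - power[i - 1] > th1      # expression (1-1)
            and power[i] - power[i + 1] > th1]    # expression (1-2)

# Hypothetical power spectrum: band 3 is a local maximum, but its margin of 0.5
# is below TH1 = 1, so only bands 1 and 5 count as peak frequencies.
peaks = peak_bands([0, 5, 1, 1.5, 1, 7, 2], th1=1)
```

A larger TH1 keeps only pronounced peaks, while a value near 0 accepts every local maximum.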

Next, the received sound analysis unit 120 extracts the temporal peaks of each frequency band based on the time change of the power spectrum (step S204). In the process of step S204, the received sound analysis unit 120 refers to the power spectra of a plurality of frames including the frame currently being processed and, for each frequency band, extracts the time of the frame at which the time change of the power P(i,t) is a peak (local maximum). For example, among the powers of the i-th frequency band in the power spectra of the frames, the received sound analysis unit 120 sets as a temporal peak the frame (time t-1) whose power spectrum contains a power P(i,t-1) satisfying the following expressions (2-1) and (2-2).

P(i,t-1) - P(i,t-2) > TH1 ... (2-1)
P(i,t-1) - P(i,t) > TH1 ... (2-2)
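Expressions (2-1) and (2-2) apply the same margin test along the time axis instead of the frequency axis. The sketch below is a hedged illustration that takes the power spectra of three consecutive frames as plain lists; the function name `temporal_peak_bands` and the sample values are hypothetical.

```python
def temporal_peak_bands(p_t2, p_t1, p_t, th1):
    """Return the bands i for which the frame at time t-1 is a temporal peak.

    p_t2, p_t1 and p_t hold P(i,t-2), P(i,t-1) and P(i,t) for every band i.
    """
    return [i for i in range(len(p_t))
            if p_t1[i] - p_t2[i] > th1    # expression (2-1)
            and p_t1[i] - p_t[i] > th1]   # expression (2-2)

# Hypothetical three-band example: only band 1 rises and then falls by more
# than TH1 = 1 around the frame at time t-1.
bands = temporal_peak_bands([2, 1, 4], [2.5, 6, 5], [2, 3, 6], th1=1)
```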

Next, the received sound analysis unit 120 causes the analysis result holding unit 192 to hold the frequency spectrum of the frame currently being processed, information indicating the peak frequencies, and information indicating the temporal peaks (step S205). On completing the process of step S205, the received sound analysis unit 120 ends the frequency analysis of the frame currently being processed.

When the frequency analysis of the frame to be processed in the received sound is completed, the communication device 1 next performs a process of determining the mask frequencies to be removed from that frame (step S3). The process of step S3 is performed by the mask frequency determination unit 130. The mask frequency determination unit 130 performs the process shown in FIG. 5 as the process of step S3.

FIG. 5 is a flowchart illustrating the process of determining the mask frequencies to be removed from the received sound.

In the process of step S3, the mask frequency determination unit 130 first sets a variable i, which designates a frequency band (frequency bin) in the frequency spectrum, to 1 (step S301).

Next, the mask frequency determination unit 130 calculates the first mask frequency Max_FM_freq, which is masked by a peak frequency within the frame, based on the peak frequencies and the first mask information (step S302). The first mask frequency is the frequency at which the mask amount is largest among the frequencies masked by the peak frequency components. The mask amount is the value obtained by subtracting the power in the frequency spectrum from the maximum power (volume) masked by a peak frequency component. Based on the first mask information, the mask frequency determination unit 130 calculates the mask amount of the i-th frequency band for each peak frequency in the frame to be processed. It then holds the frequency band at which the calculated mask amount is largest as the first mask frequency.

Next, the mask frequency determination unit 130 calculates the second mask frequency Max_TM_freq based on the temporal peaks and the second mask information (step S303). The second mask frequency is the frequency at which the mask amount is largest among the frequencies that are masked, in the frame to be processed, by the same frequency component of a past frame. The mask frequency determination unit 130 calculates the mask amount in the frame to be processed for each temporal peak in the time change of the i-th frequency band. It then holds the frequency band at which the calculated mask amount is largest as the second mask frequency.

On completing the processes of steps S302 and S303, the mask frequency determination unit 130 next updates the variable i to i+1 (step S304) and determines whether or not the updated variable i is larger than an end value N (step S305). The end value N is the total number of frequency bands in the frequency spectrum. That is, in step S305, the mask frequency determination unit 130 determines whether or not the processes of steps S302 and S303 have been performed for all the frequency bands in the frequency spectrum. When i ≦ N (step S305; NO), the mask frequency determination unit 130 performs the processes from step S302 onward for the frequency band that is now the i-th. When i > N (step S305; YES), the mask frequency determination unit 130 determines the first mask frequency Max_FM_freq and the second mask frequency Max_TM_freq as the mask frequencies (step S306). On completing the process of step S306, the mask frequency determination unit 130 ends the process of determining the mask frequencies for the frame currently being processed.

FIG. 6A is a flowchart (part 1) illustrating the process of calculating the first mask frequency. FIG. 6B is a flowchart (part 2) illustrating the process of calculating the first mask frequency.

In the process of calculating the first mask frequency, the mask frequency determination unit 130 first determines, as shown in FIG. 6A, whether or not the variable i designating the frequency band is 1 (step S30201). When i = 1 (step S30201; YES), the mask frequency determination unit 130 next sets the maximum mask amount Max_FM in the frame to be processed and the first mask frequency Max_FM_freq each to 0 (step S30202). The mask frequency determination unit 130 then sets a variable n to 1, and sets the maximum mask amount Max_FM_T in the i-th frequency band and the provisional mask frequency Max_FM_freq_T each to 0 (step S30203). Here, the variable n is a value that identifies a peak frequency in the frame to be processed. On the other hand, when i > 1 (step S30201; NO), the mask frequency determination unit 130 skips the process of step S30202 and proceeds to the process of step S30203.

After the process of step S30203, the mask frequency determination unit 130 calculates the distance (frequency difference) from the n-th peak frequency to the i-th frequency (step S30204).

Next, the mask frequency determination unit 130 calculates the maximum volume FM_TH(n,i) that is masked in the i-th frequency band by the component of the n-th peak frequency (step S30205). In step S30205, the mask frequency determination unit 130 calculates the maximum volume FM_TH(n,i) based on the first mask information, which indicates the relationship between the frequency difference and the maximum volume that is masked. The mask frequency determination unit 130 acquires the first mask information from the mask information storage unit 191.

Next, the mask frequency determination unit 130 calculates the mask amount FM(n,i) = FM_TH(n,i) - P(i,t) based on the power P(i,t) of the i-th frequency in the frame to be processed and the maximum volume FM_TH(n,i) (step S30206).

Next, the mask frequency determination unit 130 determines whether or not the calculated mask amount FM(n,i) is larger than the current maximum mask amount Max_FM_T in the i-th frequency band (step S30207). When FM(n,i) > Max_FM_T (step S30207; YES), the mask frequency determination unit 130 updates the maximum mask amount Max_FM_T and the provisional mask frequency Max_FM_freq_T to the mask amount FM(n,i) and the variable i, respectively (step S30208). The mask frequency determination unit 130 then updates the variable n to n+1 (step S30209) and determines whether or not the updated variable n is larger than an end value N_FP (step S30210). The end value N_FP is the number of peak frequencies in the frame to be processed. On the other hand, when FM(n,i) ≦ Max_FM_T (step S30207; NO), the mask frequency determination unit 130 skips the process of step S30208 and performs the process of step S30209 and the determination of step S30210.

When n ≦ N_FP (step S30210; NO), the mask frequency determination unit 130 performs the processes from step S30204 onward.

When n > N_FP (step S30210; YES), the mask frequency determination unit 130 next determines, as shown in FIG. 6B, whether or not the maximum mask amount Max_FM_T at the i-th frequency is larger than the maximum mask amount Max_FM in the frame to be processed (step S30211).

When Max_FM_T > Max_FM (step S30211; YES), the mask frequency determination unit 130 updates the maximum mask amount Max_FM in the frame to be processed and the first mask frequency Max_FM_freq to Max_FM_T and Max_FM_freq_T, respectively (step S30212). Having performed the process of step S30212, the mask frequency determination unit 130 ends the process of calculating the first mask frequency while holding the updated maximum mask amount Max_FM and first mask frequency Max_FM_freq.

On the other hand, when Max_FM_T ≦ Max_FM (step S30211; NO), the mask frequency determination unit 130 skips the process of step S30212 and ends the process of calculating the first mask frequency. In this case (Max_FM_T ≦ Max_FM), the mask frequency determination unit 130 ends the process of calculating the first mask frequency while holding the maximum mask amount Max_FM and the first mask frequency Max_FM_freq obtained in the processing up to the (i-1)-th frequency.

FIG. 7 is a graph illustrating the method of calculating the first mask frequency. In the graph of FIG. 7, the horizontal axis represents frequency in the power spectrum of the frame to be processed, and the vertical axis represents power.

The thick curve PS in FIG. 7 is the power spectrum of the frame to be processed, and includes a first peak P1, a second peak P2, and a third peak P3. When calculating the mask amount for the i-th frequency in the power spectrum PS according to the flowcharts of FIGS. 6A and 6B, the mask frequency determination unit 130 performs the following processing.

The mask frequency determination unit 130 first calculates the frequency difference F1 (= i - IP1) between the frequency IP1 of the first peak P1 and the i-th frequency (step S30204).

Next, the mask frequency determination unit 130 calculates the maximum volume of the i-th frequency that is masked by the first peak P1 (step S30205). At this time, the mask frequency determination unit 130 sets a first function G1(f) based on the frequency IP1 and power PWP1 of the first peak P1 and the first mask information. The first function G1(f) indicates the relationship, within the same frame, between the frequency difference from the frequency IP1 of the first peak P1 and the maximum volume that is masked. The mask frequency determination unit 130 calculates the maximum volume FM_TH(1,i) for the i-th frequency band based on the first function G1(f).

Next, the mask frequency determination unit 130 calculates the mask amount FM(1,i) = FM_TH(1,i) - P(i,t) (step S30206).

Next, the mask frequency determination unit 130 sets the maximum mask amount and the provisional mask frequency to the mask amount FM(1,i) and the variable i, respectively (step S30208).

The mask frequency determination unit 130 then updates the variable n identifying the peak to 2 (step S30209) and performs the processes from step S30204 onward. That is, after calculating the frequency difference -F2 between the second peak P2 and the i-th frequency band, the mask frequency determination unit 130 sets a second function G2(f) based on the frequency and power of the second peak P2 and the first mask information. The second function G2(f) indicates the relationship between the frequency difference from the frequency IP2 of the second peak P2 and the maximum volume that is masked. The mask frequency determination unit 130 calculates the maximum volume FM_TH(2,i) for the i-th frequency band based on the second function G2(f), and calculates the mask amount FM(2,i) = FM_TH(2,i) - P(i,t). In the example shown in FIG. 7, the mask amount FM(2,i) due to the second peak P2 is larger than the mask amount FM(1,i) due to the first peak P1. The mask frequency determination unit 130 therefore sets the maximum mask amount and the provisional mask frequency to the mask amount FM(2,i) and the variable i, respectively (step S30208).

After this, the mask frequency determination unit 130 updates the variable n identifying the peak to 3 (step S30209) and performs the processes from step S30204 onward. At this time, the mask frequency determination unit 130 sets a third function G3(f) based on the frequency IP3 and power of the third peak P3 and the first mask information. The third function G3(f) indicates the relationship between the frequency difference from the frequency IP3 of the third peak P3 and the maximum volume that is masked. In the example shown in FIG. 7, the maximum volume FM_TH(3,i) for the i-th frequency band calculated based on the third function G3(f) is smaller than the power P(i,t) of the i-th frequency band, so the mask amount FM(3,i) is negative and does not exceed FM(2,i). For this reason, the mask frequency determination unit 130 leaves the maximum mask amount Max_FM_T in the i-th frequency band and the provisional mask frequency Max_FM_freq_T at the mask amount FM(2,i) and the variable i, respectively, and performs the determination of step S30211. Here, when the maximum mask amount Max_FM_T at the i-th frequency is larger than the maximum mask amount Max_FM obtained in the processing up to the (i-1)-th frequency, the mask frequency determination unit 130 performs the process of step S30212. That is, the mask frequency determination unit 130 updates the maximum mask amount Max_FM in the frame to be processed and the first mask frequency Max_FM_freq to the maximum mask amount Max_FM_T in the i-th frequency band and the variable i, respectively. On the other hand, when the maximum mask amount Max_FM_T in the i-th frequency band is equal to or less than the maximum mask amount Max_FM obtained in the processing up to the (i-1)-th frequency band, the mask frequency determination unit 130 omits the process of step S30212. That is, the mask frequency determination unit 130 leaves the maximum mask amount Max_FM in the frame to be processed at the maximum value found in the processing up to the (i-1)-th frequency band, and leaves the first mask frequency Max_FM_freq at the corresponding band number, which is one of the values from 1 to i-1.

本実施形態に係るマスク周波数決定部１３０は、図５に示したように、処理対象のフレームについての周波数スペクトルにおける周波数帯毎に、ステップＳ３０２（ステップＳ３０２０１〜Ｓ３０２１２）の処理を行う。これにより、処理対象のフレームにおいてマスク量FM(n,i)が最大値となる周波数帯が、第１のマスク周波数となる。 As shown in FIG. 5, the mask frequency determination unit 130 according to the present embodiment performs the processing of step S302 (steps S30201 to S30212) for each frequency band in the frequency spectrum for the processing target frame. As a result, the frequency band in which the mask amount FM (n, i) has the maximum value in the processing target frame becomes the first mask frequency.
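
The per-band, per-peak search described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the masking function G_n(f) comes from the first mask information, whose shape is not given in this section, so the sketch assumes a hypothetical linear roll-off (`offset_db`, `slope_db_per_band`). The function name, the use of dB values, and the convention that a returned band number of 0 means "no masked band" are also assumptions.

```python
def first_mask_frequency(power, peaks, slope_db_per_band=10.0, offset_db=5.0):
    """Return (Max_FM_freq, Max_FM) for one frame.

    power : per-band power P(i, t) in dB; power[0] is band 1 of the text.
    peaks : band numbers (1-based) of the spectral peaks in this frame.
    The linear roll-off below is a hypothetical stand-in for G_n(f).
    """
    max_fm, max_fm_freq = 0.0, 0              # frame-level maximum so far
    for i in range(1, len(power) + 1):        # loop over frequency bands
        max_fm_t = 0.0                        # per-band maximum (Max_FM_T)
        for ip in peaks:                      # loop over peaks n = 1..N
            # Hypothetical G_n(f): maximum masked volume at band i.
            fm_th = power[ip - 1] - offset_db - slope_db_per_band * abs(i - ip)
            fm = fm_th - power[i - 1]         # FM(n,i) = FM_TH(n,i) - P(i,t)
            if fm > max_fm_t:
                max_fm_t = fm
        if max_fm_t > max_fm:                 # S30211 / S30212
            max_fm, max_fm_freq = max_fm_t, i
    return max_fm_freq, max_fm
```

With powers `[60, 20, 45, 12, 50]` and peaks in bands 1 and 5, band 2 is the most deeply masked band under these assumed parameters, so it would be chosen as the first mask frequency.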

図８Ａは、第２のマスク周波数を算出する処理の内容を説明するフローチャート（その１）である。図８Ｂは、第２のマスク周波数を算出する処理の内容を説明するフローチャート（その２）である。 FIG. 8A is a flowchart (part 1) for explaining the contents of the process of calculating the second mask frequency. FIG. 8B is a flowchart (part 2) illustrating the content of the process of calculating the second mask frequency.

第２のマスク周波数を算出する処理において、マスク周波数決定部１３０は、まず、図８Ａに示すように、周波数帯を指定する変数ｉが１であるか否かを判定する（ステップＳ３０３０１）。ｉ＝１である場合（ステップＳ３０３０１；ＹＥＳ）、マスク周波数決定部１３０は、次に、処理対象のフレームにおけるマスク量の最大値Max_TM、及び第２のマスク周波数Max_TM_freqを、それぞれ、０に設定する（ステップＳ３０３０２）。その後、マスク周波数決定部１３０は、変数ｎを１に設定するとともに、ｉ番目の周波数帯におけるマスク量の最大値Max_TM_Tと、暫定マスク周波数Max_TM_freq_Tを、それぞれ、０に設定する（ステップＳ３０３０３）。ここで、変数ｎは、複数フレームのそれぞれにおけるｉ番目の周波数成分の時間変化から抽出された継時ピークを識別する値である。一方、ｉ＞１である場合（ステップＳ３０３０１；ＮＯ）、マスク周波数決定部１３０は、ステップＳ３０３０２の処理をスキップし、次に、ステップＳ３０３０３の処理を行う。 In the process of calculating the second mask frequency, the mask frequency determination unit 130 first determines whether or not the variable i specifying the frequency band is 1 as shown in FIG. 8A (step S30301). When i = 1 (step S30301; YES), the mask frequency determination unit 130 next sets the maximum mask amount Max_TM and the second mask frequency Max_TM_freq in the processing target frame to 0, respectively. (Step S30302). Thereafter, the mask frequency determining unit 130 sets the variable n to 1, and sets the maximum mask amount Max_TM_T and the provisional mask frequency Max_TM_freq_T to 0 in the i-th frequency band, respectively (step S30303). Here, the variable n is a value for identifying the successive peak extracted from the time change of the i-th frequency component in each of the plurality of frames. On the other hand, if i> 1 (step S30301; NO), the mask frequency determination unit 130 skips the process of step S30302 and then performs the process of step S30303.

ステップＳ３０３０３の処理の後、マスク周波数決定部１３０は、ｎ番目の継時ピークを含むフレームから処理対象のフレームまでの距離（時間差）を算出する（ステップＳ３０３０４）。 After the process of step S30303, the mask frequency determination unit 130 calculates the distance (time difference) from the frame including the nth successive peak to the frame to be processed (step S30304).

次に、マスク周波数決定部１３０は、処理対象のフレームのｉ番目の周波数帯において、ｎ番目の継時ピークの成分によりマスクされる最大音量TM_TH(n,i)を算出する（ステップＳ３０３０５）。ステップＳ３０３０５では、マスク周波数決定部１３０は、時間差とマスクされる最大音量との関係を示す第２のマスク情報に基づいて、最大音量TM_TH(n,i)を算出する。マスク周波数決定部１３０は、マスク情報記憶部１９１から第２のマスク情報を取得する。 Next, the mask frequency determination unit 130 calculates the maximum volume TM_TH (n, i) masked by the n-th peak component in the i-th frequency band of the processing target frame (step S30305). In step S30305, the mask frequency determination unit 130 calculates the maximum volume TM_TH (n, i) based on the second mask information indicating the relationship between the time difference and the maximum volume to be masked. The mask frequency determination unit 130 acquires second mask information from the mask information storage unit 191.

次に、マスク周波数決定部１３０は、処理対象のフレームにおけるｉ番目の周波数帯のパワーP(i,t)と、最大音量TM_TH(n,i)とに基づいて、マスク量TM(n,i)＝TM_TH(n,i)−P(i,t)を算出する（ステップＳ３０３０６）。 Next, the mask frequency determination unit 130 calculates the mask amount TM(n,i) = TM_TH(n,i) − P(i,t) based on the power P(i,t) of the i-th frequency band in the frame being processed and the maximum volume TM_TH(n,i) (step S30306).

次に、マスク周波数決定部１３０は、算出したマスク量TM(n,i)がｉ番目の周波数帯における現時点でのマスク量の最大値Max_TM_Tよりも大きいか否かを判定する（ステップＳ３０３０７）。TM(n,i)＞Max_TM_Tである場合（ステップＳ３０３０７；ＹＥＳ）、マスク周波数決定部１３０は、マスク量の最大値Max_TM_T、及び暫定マスク周波数Max_TM_freq_Tを、それぞれ、マスク量TM(n,i)、及び変数ｉに更新する（ステップＳ３０３０８）。その後、マスク周波数決定部１３０は、変数ｎをｎ＋１に更新し（ステップＳ３０３０９）、更新後の変数ｎが終値N_TPよりも大きいか否かを判定する（ステップＳ３０３１０）。終値N_TPは、ｉ番目の周波数成分の時間変化における継時ピークの個数である。一方、TM(n,i)≦Max_TM_Tである場合（ステップＳ３０３０７；ＮＯ）、マスク周波数決定部１３０は、ステップＳ３０３０８の処理をスキップし、ステップＳ３０３０９の処理，及びステップＳ３０３１０の判定を行う。 Next, the mask frequency determination unit 130 determines whether the calculated mask amount TM(n,i) is larger than the current maximum mask amount Max_TM_T in the i-th frequency band (step S30307). When TM(n,i) > Max_TM_T (step S30307; YES), the mask frequency determination unit 130 updates the maximum mask amount Max_TM_T and the provisional mask frequency Max_TM_freq_T to the mask amount TM(n,i) and the variable i, respectively (step S30308). Thereafter, the mask frequency determination unit 130 updates the variable n to n+1 (step S30309) and determines whether the updated variable n is larger than the end value N_TP (step S30310). The end value N_TP is the number of successive peaks in the time change of the i-th frequency component. On the other hand, when TM(n,i) ≤ Max_TM_T (step S30307; NO), the mask frequency determination unit 130 skips the process of step S30308 and performs the process of step S30309 and the determination of step S30310.

ｎ≦N_TPである場合（ステップＳ３０３１０；ＮＯ）、マスク周波数決定部１３０は、ステップＳ３０３０４以降の処理を行う。 When n ≦ N_TP is satisfied (step S30310; NO), the mask frequency determination unit 130 performs the processing after step S30304.

ｎ＞N_TPである場合（ステップＳ３０３１０；ＹＥＳ）、マスク周波数決定部１３０は、次に、図８Ｂに示すように、ｉ番目の周波数帯におけるマスク量の最大値Max_TM_Tが、処理対象のフレームにおける現時点でのマスク量の最大値Max_TMよりも大きいか否かを判定する（ステップＳ３０３１１）。 When n > N_TP (step S30310; YES), the mask frequency determination unit 130 next determines, as shown in FIG. 8B, whether the maximum mask amount Max_TM_T in the i-th frequency band is larger than the current maximum mask amount Max_TM in the frame being processed (step S30311).

Max_TM_T＞Max_TMである場合（ステップＳ３０３１１；ＹＥＳ）、マスク周波数決定部１３０は、処理対象のフレームにおけるマスク量の最大値Max_TM、及び第２のマスク周波数Max_TM_freqを、それぞれ、Max_TM_T、及びMax_TM_freq_Tに更新する（ステップＳ３０３１２）。ステップＳ３０３１２の処理を行った場合、マスク周波数決定部１３０は、更新したマスク量の最大値Max_TM、及び第２のマスク周波数Max_TM_freqを保持した状態で、第２のマスク周波数を算出する処理を終了する。 When Max_TM_T > Max_TM (step S30311; YES), the mask frequency determination unit 130 updates the maximum mask amount Max_TM and the second mask frequency Max_TM_freq in the frame being processed to Max_TM_T and Max_TM_freq_T, respectively (step S30312). After performing the process of step S30312, the mask frequency determination unit 130 ends the process of calculating the second mask frequency while holding the updated maximum mask amount Max_TM and second mask frequency Max_TM_freq.

一方、Max_TM_T≦Max_TMである場合（ステップＳ３０３１１；ＮＯ）、マスク周波数決定部１３０は、ステップＳ３０３１２の処理をスキップし、第２のマスク周波数を算出する処理を終了する。この場合（Max_TM_T≦Max_TMである場合）、マスク周波数決定部１３０は、ｉ−１番目の周波数帯までの処理におけるマスク量の最大値Max_TM、及び第２のマスク周波数Max_TM_freqを保持した状態で、第２のマスク周波数を算出する処理を終了する。 On the other hand, when Max_TM_T ≤ Max_TM (step S30311; NO), the mask frequency determination unit 130 skips the process of step S30312 and ends the process of calculating the second mask frequency. In this case (when Max_TM_T ≤ Max_TM), the mask frequency determination unit 130 ends the process of calculating the second mask frequency while holding the maximum mask amount Max_TM and the second mask frequency Max_TM_freq obtained in the processing up to the (i−1)-th frequency band.

図９は、第２のマスク周波数の算出方法を説明するグラフ図である。 FIG. 9 is a graph illustrating a method for calculating the second mask frequency.
図９のグラフ図において、横軸は受話音における時刻（フレーム）であり、縦軸はパワーである。 In the graph of FIG. 9, the horizontal axis represents time (frame) in the received sound, and the vertical axis represents power.

図９の太い曲線ＰＳｉは、複数フレームにおけるｉ番目の周波数帯のパワーの時間変化を表しており、第１の継時ピークＰ１、第２の継時ピークＰ２、及び第３の継時ピークＰ３を含む。図８Ａ及び図８Ｂのフローチャートに沿ってパワースペクトルＰＳｉにおける時刻ｔでのマスク量を算出する場合、マスク周波数決定部１３０は、以下の処理を行う。 The thick curve PSi in FIG. 9 represents the time change of the power of the i-th frequency band over a plurality of frames, and includes a first successive peak P1, a second successive peak P2, and a third successive peak P3. When calculating the mask amount at time t in the power spectrum PSi according to the flowcharts of FIGS. 8A and 8B, the mask frequency determination unit 130 performs the following processing.

マスク周波数決定部１３０は、まず、第１の継時ピークＰ１を含むフレームと、現在処理対象となっているフレームとの時間差Ｔ１（＝ｔ−ＴＰ１）を算出する（ステップＳ３０３０４）。 First, the mask frequency determination unit 130 calculates the time difference T1 (= t − TP1) between the frame including the first successive peak P1 and the frame currently being processed (step S30304).

次に、マスク周波数決定部１３０は、第１の継時ピークＰ１によりマスクされるｉ番目の周波数帯の最大音量を算出する（ステップＳ３０３０５）。このとき、マスク周波数決定部１３０は、第１の継時ピークＰ１の時刻ＴＰ１及びパワーＰＷＰａと、第２のマスク情報とに基づいて、第１の関数Ｈ１（Ｔ）を設定する。第１の関数Ｈ１（Ｔ）は、時間方向における、第１の継時ピークＰ１の時刻ＴＰ１からの時間差と、マスクされる最大音量との関係を示す関数である。その後、マスク周波数決定部１３０は、第１の関数Ｈ１（Ｔ）に基づいて、現在処理対象となっている時刻ｔのフレームにおけるｉ番目の周波数帯についての最大音量TM_TH(1,i)を算出する。 Next, the mask frequency determination unit 130 calculates the maximum volume of the i-th frequency band masked by the first successive peak P1 (step S30305). At this time, the mask frequency determination unit 130 sets the first function H1(T) based on the time TP1 and the power PWPa of the first successive peak P1 and the second mask information. The first function H1(T) indicates the relationship, in the time direction, between the time difference from the time TP1 of the first successive peak P1 and the maximum volume that is masked. The mask frequency determination unit 130 then calculates, based on the first function H1(T), the maximum volume TM_TH(1,i) for the i-th frequency band in the frame at time t currently being processed.

次に、マスク周波数決定部１３０は、マスク量TM(1,i)＝TM_TH(1,i)−P(i,t)を算出する（ステップＳ３０３０６）。図９に示した例では、第１の継時ピークＰ１に基づく最大音量TM_TH(1,i)が、時刻ｔのフレームにおけるパワーP(i,t)よりも小さい。このため、マスク周波数決定部１３０は、ステップＳ３０３０８を省略し、次に、第２の継時ピークＰ２についてのステップＳ３０３０４以降の処理を行う。しかしながら、第２の継時ピークＰ２についての第２の関数Ｈ２（Ｔ）に基づいて算出される最大音量TM_TH(2,i)は、時刻ｔのフレームにおけるパワーP(i,t)よりも小さい。このため、マスク周波数決定部１３０は、ステップＳ３０３０８を省略し、次に、第３の継時ピークＰ３についてのステップＳ３０３０４以降の処理を行う。ここで、第２の関数Ｈ２（Ｔ）は、時間方向における、第２の継時ピークＰ２の時刻ＴＰ２からの時間差と、マスクされる最大音量との関係を示す関数である。 Next, the mask frequency determination unit 130 calculates the mask amount TM(1,i) = TM_TH(1,i) − P(i,t) (step S30306). In the example shown in FIG. 9, the maximum volume TM_TH(1,i) based on the first successive peak P1 is smaller than the power P(i,t) in the frame at time t. The mask frequency determination unit 130 therefore omits step S30308 and next performs the processing from step S30304 onward for the second successive peak P2. The maximum volume TM_TH(2,i) calculated based on the second function H2(T) for the second successive peak P2 is, however, also smaller than the power P(i,t) in the frame at time t. The mask frequency determination unit 130 therefore omits step S30308 again and next performs the processing from step S30304 onward for the third successive peak P3. Here, the second function H2(T) indicates the relationship, in the time direction, between the time difference from the time TP2 of the second successive peak P2 and the maximum volume that is masked.

第３の継時ピークＰ３についての第３の関数Ｈ３（Ｔ）に基づいて算出される最大音量ＴＭ＿ＴＨ（３，ｉ）は、時刻ｔのフレームにおけるパワーＰ（ｉ，ｔ）よりも大きい。このため、マスク周波数決定部１３０は、ステップＳ３０３０８において、処理対象のフレームにおけるマスク量の最大値、及び暫定マスク周波数を、それぞれ、マスク量ＴＭ（３，ｉ），及び変数ｉとする。ここで、第３の関数Ｈ３（Ｔ）は、時間方向における、第３の継時ピークＰ３の時刻ＴＰ３からの時間差と、マスクされる最大音量との関係を示す関数である。 The maximum volume TM_TH(3,i) calculated based on the third function H3(T) for the third successive peak P3 is larger than the power P(i,t) in the frame at time t. Therefore, in step S30308, the mask frequency determination unit 130 sets the maximum mask amount and the provisional mask frequency in the frame being processed to the mask amount TM(3,i) and the variable i, respectively. Here, the third function H3(T) indicates the relationship, in the time direction, between the time difference from the time TP3 of the third successive peak P3 and the maximum volume that is masked.

その後、マスク周波数決定部１３０は、時刻ｔのフレームのｉ番目の周波数帯におけるマスク量の最大値Max_TM_T、及び暫定マスク周波数Max_TM_freq_Tを、それぞれ、マスク量TM(3,i)，及び変数ｉとし、ステップＳ３０３１１の判定を行う。ここで、ｉ番目の周波数帯におけるマスク量の最大値Max_TM_Tが、ｉ−１番目の周波数帯までの処理におけるマスク量の最大値Max_TMよりも大きいと、マスク周波数決定部１３０は、ステップＳ３０３１２の処理を行う。すなわち、マスク周波数決定部１３０は、処理対象のフレームにおけるマスク量の最大値Max_TM及び第２のマスク周波数Max_TM_freqを、それぞれ、ｉ番目の周波数帯におけるマスク量の最大値Max_TM_T、及び変数ｉに更新する。一方、ｉ番目の周波数帯におけるマスク量の最大値Max_TM_Tが、ｉ−１番目の周波数帯までの処理におけるマスク量の最大値Max_TM以下であると、マスク周波数決定部１３０は、ステップＳ３０３１２の処理を省略する。すなわち、マスク周波数決定部１３０は、処理対象のフレームにおけるマスク量の最大値Max_TM及び第２のマスク周波数Max_TM_freqを、それぞれ、ｉ−１番目の周波数帯までの処理におけるマスク量の最大値Max_TM_T、及び１からｉ−１までのいずれかのままとする。 Thereafter, the mask frequency determination unit 130 sets the maximum mask amount Max_TM_T and the provisional mask frequency Max_TM_freq_T in the i-th frequency band of the frame at time t to the mask amount TM(3,i) and the variable i, respectively, and performs the determination of step S30311. Here, when the maximum mask amount Max_TM_T in the i-th frequency band is larger than the maximum mask amount Max_TM obtained in the processing up to the (i−1)-th frequency band, the mask frequency determination unit 130 performs the process of step S30312. That is, the mask frequency determination unit 130 updates the maximum mask amount Max_TM and the second mask frequency Max_TM_freq in the frame being processed to the maximum mask amount Max_TM_T in the i-th frequency band and the variable i, respectively. On the other hand, when the maximum mask amount Max_TM_T in the i-th frequency band is equal to or less than the maximum mask amount Max_TM obtained in the processing up to the (i−1)-th frequency band, the mask frequency determination unit 130 omits the process of step S30312. That is, the mask frequency determination unit 130 leaves the maximum mask amount Max_TM and the second mask frequency Max_TM_freq in the frame being processed at, respectively, the maximum value obtained in the processing up to the (i−1)-th frequency band and one of the values from 1 to i−1.

本実施形態に係るマスク周波数決定部１３０は、上記のように、処理対象のフレームについての周波数スペクトルにおける周波数帯毎に、ステップＳ３０３（ステップＳ３０３０１〜Ｓ３０３１２）の処理を行う。これにより、処理対象のフレームにおいてマスク量TM(n,i)が最大値となる周波数帯が、第２のマスク周波数となる。 As described above, the mask frequency determination unit 130 according to the present embodiment performs the processing of step S303 (steps S30301 to S30312) for each frequency band in the frequency spectrum for the processing target frame. As a result, the frequency band in which the mask amount TM (n, i) has the maximum value in the processing target frame becomes the second mask frequency.
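
The temporal-masking search of steps S30301 to S30312 can be sketched in the same way. Again this is only an illustration under stated assumptions: the function H_n(T) is derived from the second mask information, which is not specified in this section, so a hypothetical linear decay (`decay_db_per_frame`) is used in its place. The function name, the dictionary layout for the successive peaks, and the dB units are likewise assumptions.

```python
def second_mask_frequency(power, temporal_peaks, t, decay_db_per_frame=8.0):
    """Return (Max_TM_freq, Max_TM) for the frame at time t.

    power          : per-band power P(i, t) in dB; power[0] is band 1.
    temporal_peaks : {band number: [(peak frame time, peak power in dB), ...]}
    The linear decay below is a hypothetical stand-in for H_n(T).
    """
    max_tm, max_tm_freq = 0.0, 0                       # S30302
    for i in range(1, len(power) + 1):
        max_tm_t, max_tm_freq_t = 0.0, 0               # S30303
        for tp, peak_power in temporal_peaks.get(i, []):
            dt = t - tp                                # S30304: time difference
            tm_th = peak_power - decay_db_per_frame * dt   # S30305 (assumed H_n)
            tm = tm_th - power[i - 1]                  # S30306: TM(n,i)
            if tm > max_tm_t:                          # S30307
                max_tm_t, max_tm_freq_t = tm, i        # S30308
        if max_tm_t > max_tm:                          # S30311
            max_tm, max_tm_freq = max_tm_t, max_tm_freq_t  # S30312
    return max_tm_freq, max_tm
```

As in the flowchart, the inner loop keeps only the largest mask amount per band, and the outer comparison keeps the band whose mask amount is largest across the frame.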

上記の処理により第１のマスク周波数及び第２のマスク周波数を算出した後、通話装置１は、受話音から当該マスク周波数の成分を除去して出力する（ステップＳ４）。マスク周波数の成分を除去する処理は、受話音加工部１４０が行う。受話音加工部１４０は、例えば、図１０に示す処理を行う。 After calculating the first mask frequency and the second mask frequency by the above processing, the communication device 1 removes the components at those mask frequencies from the received sound and outputs the result (step S4). The process of removing the mask frequency components is performed by the received sound processing unit 140, which performs, for example, the process shown in FIG. 10.

図１０は、マスク周波数の成分を受話音から除去して出力する処理の内容を説明するフローチャートである。 FIG. 10 is a flowchart for explaining the contents of the process of removing the mask frequency component from the received sound and outputting it.

マスク周波数の成分を受話音から除去して出力する処理において、受話音加工部１４０は、まず、受話音の周波数スペクトルにおける第１のマスク周波数Max_FM_freqの値を０に変更する（ステップＳ４０１）。 In the process of removing the mask frequency component from the received sound and outputting it, the received sound processing unit 140 first changes the value of the first mask frequency Max_FM_freq in the frequency spectrum of the received sound to 0 (step S401).

次に、受話音加工部１４０は、受話音の周波数スペクトルにおける第２のマスク周波数Max_TM_freqの値を０に変更する（ステップＳ４０２）。 Next, the received sound processing unit 140 changes the value of the second mask frequency Max_TM_freq in the frequency spectrum of the received sound to 0 (step S402).

次に、受話音加工部１４０は、第１のマスク周波数及び第２のマスク周波数の値を０にした周波数スペクトルを音声波形に逆変換する（ステップＳ４０３）。ステップＳ４０３では、受話音加工部１４０は、ステップＳ２０１で音声波形を周波数スペクトルに変換する際の変換方法と対応する逆変換方法により、受話音の周波数スペクトルを音声波形に変換する。 Next, the received sound processing unit 140 inversely converts the frequency spectrum in which the values of the first mask frequency and the second mask frequency are set to 0 into a speech waveform (step S403). In step S403, the received sound processing unit 140 converts the frequency spectrum of the received sound into a speech waveform by an inverse conversion method corresponding to the conversion method used when converting the speech waveform into a frequency spectrum in step S201.

その後、受話音加工部１４０は、音声波形をレシーバ２に出力するとともに、第１のマスク周波数及び第２のマスク周波数の情報を成分抽出部１５０に出力し（ステップＳ４０４）、ステップＳ４の処理を終了する。 After that, the received sound processing unit 140 outputs the voice waveform to the receiver 2 and outputs the information of the first mask frequency and the second mask frequency to the component extraction unit 150 (step S404), and performs the process of step S4. finish.
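
Steps S401 to S403 amount to zeroing two spectral bins and transforming back. The sketch below illustrates this with a naive DFT so that it needs only the standard library; the patent does not say which transform is used, and the function names are hypothetical. Zeroing the mirrored bin `n - k` as well is an assumption made here so that the inverse transform of a real-valued frame stays real; the text only states that the value at each mask frequency is set to 0. Bin indices are 0-based DFT indices in this sketch.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (stands in for the transform of S201)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]

def idft(spec):
    """Naive inverse DFT returning the real part of each sample (S403)."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * m / n)
                for k in range(n)).real / n
            for m in range(n)]

def remove_mask_components(frame, max_fm_freq, max_tm_freq):
    """Zero the two mask-frequency bins of a received frame and resynthesize."""
    spec = dft(frame)
    n = len(spec)
    for k in (max_fm_freq, max_tm_freq):      # S401 and S402
        spec[k] = 0
        spec[(n - k) % n] = 0                 # mirrored bin (assumption, see above)
    return idft(spec)                         # S403: back to a waveform
```

For a 16-sample frame containing tones in bins 1 and 3, calling `remove_mask_components(frame, 3, 5)` would leave only the bin-1 tone.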

このように、本実施形態に係る通話装置１は、第１のマスク周波数の成分及び第２のマスク周波数の成分を受話音から除去した加工受話音を生成し、当該加工受話音をレシーバ２から通話装置１の外部空間に出力する（放射する）。したがって、レシーバ２から出力された受話音をマイクロフォン３で収音した場合、マイク入力音における第１のマスク周波数及び第２のマスク周波数と対応する成分はほぼ０となる。ところが、レシーバ２から出力された受話音をマイクロフォン３で収音する際に、通話装置１の利用者が発話していたり、通話装置１の周囲に騒音が発生していたりすると、マイク入力音には受話音以外の音声が含まれる。マイク入力音に通話装置１の利用者の音声や、通話装置１の周囲に騒音が含まれるダブルトークの状態である場合、当該マイク入力音には、受話音から除去したマスク周波数の成分が含まれることが多い。このため、本実施形態の通話装置１では、マイク入力音における、受話音から除去したマスク周波数の成分に基づいて、マイク入力音がダブルトークの状態であるか否かを判定する。したがって、通話装置１では、上記のように、マイク入力音に含まれるマスク周波数の成分を抽出し（ステップＳ５）、ダブルトークの状態であるか否かを判定する（ステップＳ６）。ステップＳ５の処理は成分抽出部１５０が行い、ステップＳ６の処理はダブルトーク判定部１６０が行う。成分抽出部１５０は、マイク入力音における処理対象のフレームを音声波形から周波数スペクトルに変換した後、第１のマスク周波数及び第２のマスク周波数の成分を抽出する。ダブルトーク判定部１６０は、図１１の処理を行って、処理対象のフレームがダブルトークの状態であるか否かを判定する。 As described above, the communication device 1 according to the present embodiment generates a processed received sound by removing the first mask frequency component and the second mask frequency component from the received sound, and outputs (emits) the processed received sound from the receiver 2 to the space outside the communication device 1. Therefore, when the received sound output from the receiver 2 is picked up by the microphone 3, the components of the microphone input sound corresponding to the first mask frequency and the second mask frequency are almost zero. However, if the user of the communication device 1 is speaking, or noise is present around the communication device 1, while the microphone 3 picks up the received sound output from the receiver 2, the microphone input sound contains sound other than the received sound. When the microphone input sound is in a double talk state, in which it contains the voice of the user of the communication device 1 or noise around the communication device 1, it often contains components at the mask frequencies that were removed from the received sound. For this reason, the communication device 1 of the present embodiment determines whether the microphone input sound is in a double talk state based on the components of the microphone input sound at the mask frequencies removed from the received sound. Accordingly, as described above, the communication device 1 extracts the mask frequency components contained in the microphone input sound (step S5) and determines whether the sound is in a double talk state (step S6). The component extraction unit 150 performs the process of step S5, and the double talk determination unit 160 performs the process of step S6. The component extraction unit 150 converts the frame being processed in the microphone input sound from a speech waveform into a frequency spectrum and then extracts the components at the first mask frequency and the second mask frequency. The double talk determination unit 160 performs the process of FIG. 11 to determine whether the frame being processed is in a double talk state.
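
As a rough illustration of step S5, the two mask-frequency components can be read out of a microphone frame as follows. The text describes converting the whole frame to a frequency spectrum and then extracting the two components; this sketch instead evaluates the DFT only at the two bins of interest, which yields the same two values. That shortcut, the function name, and the 0-based bin indexing are assumptions of the sketch, not details from the patent.

```python
import cmath
import math

def extract_mask_components(mic_frame, max_fm_freq, max_tm_freq):
    """Return the complex spectrum values of a microphone frame at the
    first and second mask frequencies (0-based DFT bin indices)."""
    n = len(mic_frame)

    def bin_value(k):
        # Single-bin DFT: X[k] = sum_m x[m] * exp(-j*2*pi*k*m/n)
        return sum(x * cmath.exp(-2j * math.pi * k * m / n)
                   for m, x in enumerate(mic_frame))

    return bin_value(max_fm_freq), bin_value(max_tm_freq)
```

If the microphone picks up only the processed received sound, both returned values should be close to zero; near-end speech or noise tends to put energy back into those bins.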

図１１は、ダブルトークであるか否かを判定する処理の内容を説明するフローチャートである。 FIG. 11 is a flowchart for explaining the contents of the process for determining whether or not it is double talk.

ダブルトーク判定部１６０は、まず、マイク入力音の周波数スペクトルから抽出した第１のマスク周波数Max_FM_freqの値に基づいて、第１のパワーＰＦを算出する（ステップＳ６０１）。次に、ダブルトーク判定部１６０は、マイク入力音の周波数スペクトルから抽出した第２のマスク周波数Max_TM_freqの値に基づいて、第２のパワーＰＴを算出する（ステップＳ６０２）。ステップＳ６０１及びＳ６０２では、ダブルトーク判定部１６０は、既知の算出方法に基づいて、第１のパワーＰＦ及び第２のパワーＰＴを算出する。 First, the double talk determination unit 160 calculates the first power PF based on the value of the first mask frequency Max_FM_freq extracted from the frequency spectrum of the microphone input sound (step S601). Next, the double talk determination unit 160 calculates the second power PT based on the value of the second mask frequency Max_TM_freq extracted from the frequency spectrum of the microphone input sound (step S602). In steps S601 and S602, the double talk determination unit 160 calculates the first power PF and the second power PT based on a known calculation method.

次に、ダブルトーク判定部１６０は、第１のパワーＰＦが第１のパワー閾値ＴＨ２よりも大きいか否かを判定する（ステップＳ６０３）。第１のパワー閾値ＴＨ２は、適宜選択可能である。ＰＦ≦ＴＨ２である場合（ステップＳ６０３；ＮＯ）、ダブルトーク判定部１６０は、マイク入力音における処理対象のフレームがダブルトークではないと判断し、ダブルトークであるか否かを示すフラグDT_flagを０にする（ステップＳ６０５）。 Next, the double talk determination unit 160 determines whether or not the first power PF is larger than the first power threshold TH2 (step S603). The first power threshold TH2 can be selected as appropriate. When PF ≦ TH2 is satisfied (step S603; NO), the double talk determination unit 160 determines that the processing target frame in the microphone input sound is not double talk, and sets a flag DT_flag indicating whether or not it is double talk to 0. (Step S605).

一方、ＰＦ＞ＴＨ２である場合（ステップＳ６０３；ＹＥＳ）、ダブルトーク判定部１６０は、次に、第２のパワーＰＴが第２のパワー閾値ＴＨ３よりも大きいか否かを判定する（ステップＳ６０４）。第２のパワー閾値ＴＨ３は、適宜選択可能である。ＰＴ≦ＴＨ３である場合（ステップＳ６０４；ＮＯ）、ダブルトーク判定部１６０は、マイク入力音における処理対象のフレームがダブルトークではないと判断し、ダブルトークであるか否かを示すフラグDT_flagを０にする（ステップＳ６０５）。 On the other hand, if PF> TH2 (step S603; YES), the double-talk determining unit 160 next determines whether or not the second power PT is larger than the second power threshold TH3 (step S604). . The second power threshold TH3 can be selected as appropriate. When PT ≦ TH3 is satisfied (step S604; NO), the double talk determining unit 160 determines that the frame to be processed in the microphone input sound is not double talk, and sets a flag DT_flag indicating whether or not it is double talk to 0. (Step S605).

これに対し、ＰＴ＞ＴＨ３である場合（ステップＳ６０４；ＹＥＳ）、ダブルトーク判定部１６０は、マイク入力音における処理対象のフレームがダブルトークであると判断し、ダブルトークであるか否かを示すフラグDT_flagを１にする（ステップＳ６０６）。 On the other hand, when PT> TH3 (step S604; YES), the double talk determining unit 160 determines that the frame to be processed in the microphone input sound is double talk, and indicates whether or not it is double talk. The flag DT_flag is set to 1 (step S606).

ステップＳ６０５又はＳ６０６において処理対象のフレームに対するフラグを設定すると、ダブルトーク判定部１６０は、当該フレームに対する判定処理を終了する。 When the flag for the processing target frame is set in step S605 or S606, the double talk determination unit 160 ends the determination processing for the frame.
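
The decision logic of FIG. 11 (steps S601 to S606) can be sketched as below. The text says only that the powers PF and PT are obtained by a known calculation method; the squared magnitude of the complex bin value used here is an assumption, as is the function name.

```python
def judge_double_talk(fm_component, tm_component, th2, th3):
    """Return DT_flag: 1 for double talk, 0 otherwise.

    fm_component, tm_component : spectrum values of the microphone input
        at the first and second mask frequencies.
    th2, th3 : first and second power thresholds (TH2, TH3).
    """
    pf = abs(fm_component) ** 2      # S601: first power PF (assumed |X|^2)
    pt = abs(tm_component) ** 2      # S602: second power PT (assumed |X|^2)
    if pf > th2 and pt > th3:        # S603 and S604 both YES
        return 1                     # S606: double talk
    return 0                         # S605: not double talk
```

Both powers must exceed their thresholds for the frame to be flagged, so a single noisy bin does not by itself stop the transfer characteristic from being learned.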

このようにマイク入力音における処理対象のフレームがダブルトークの状態であるか否かをフラグDT_flagの値で示した場合、入力音加工部１７０の学習判定部１７１は、当該フラグDT_flagの値により伝達特性を学習するか否かを判定する。フラグDT_flagが０の場合（すなわちダブルトークの状態ではない場合）、学習判定部１７１は、伝達特性学習部１７２に伝達特性を学習させる。一方、フラグDT_flagが１の場合（すなわちダブルトークの状態である場合）、学習判定部１７１は、伝達特性を学習する処理を省略し、エコー成分抑圧部１７３に、過去フレームに対する処理で学習した伝達特性に基づくエコー成分の抑圧を行わせる。 When the value of the flag DT_flag thus indicates whether the frame being processed in the microphone input sound is in a double talk state, the learning determination unit 171 of the input sound processing unit 170 determines, from the value of the flag DT_flag, whether to learn the transfer characteristic. When the flag DT_flag is 0 (that is, not in a double talk state), the learning determination unit 171 causes the transfer characteristic learning unit 172 to learn the transfer characteristic. On the other hand, when the flag DT_flag is 1 (that is, in a double talk state), the learning determination unit 171 omits the process of learning the transfer characteristic and causes the echo component suppression unit 173 to suppress the echo component based on the transfer characteristic learned in the processing of past frames.
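
The gating of learning by DT_flag can be illustrated with a deliberately simple stand-in for the transfer characteristic. The per-band gain estimate, the smoothing factor, the spectral-subtraction style of suppression, and all names below are assumptions chosen to keep the sketch short; the patent does not describe the learning or suppression algorithms in this section. The sketch also assumes that suppression runs on every frame, after learning when learning is allowed.

```python
class EchoPathEstimator:
    """Hypothetical per-band echo gain model (not the patent's method)."""

    def __init__(self, n_bands, smoothing=0.5):
        self.gain = [0.0] * n_bands       # estimated |mic| / |received| per band
        self.smoothing = smoothing

    def learn(self, recv_mag, mic_mag):
        """Update the gain estimate from one frame of magnitudes."""
        for k, (r, m) in enumerate(zip(recv_mag, mic_mag)):
            if r > 0:
                self.gain[k] = (self.smoothing * self.gain[k]
                                + (1.0 - self.smoothing) * (m / r))

    def suppress(self, recv_mag, mic_mag):
        """Subtract the estimated echo magnitude, clamped at zero."""
        return [max(m - g * r, 0.0)
                for m, g, r in zip(mic_mag, self.gain, recv_mag)]


def process_frame(estimator, dt_flag, recv_mag, mic_mag):
    """Learn only when DT_flag is 0, then suppress with the current estimate."""
    if dt_flag == 0:
        estimator.learn(recv_mag, mic_mag)
    return estimator.suppress(recv_mag, mic_mag)
```

With `dt_flag == 1` the stored gains are left untouched, so near-end speech cannot distort the estimate; suppression still uses what was learned from earlier frames.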

以上のように、本実施形態に係る通話装置１では、受話音に含まれる音声において人の知覚に影響を与えないマスク周波数の成分を除去してからレシーバ２を介して通話装置１の外部に出力する。このため、マイクロフォン３で収音したマイク入力音における上記のマスク周波数に有意な音声成分が含まれる場合、当該成分は、通話装置１の利用者の音声及び通話装置１の周囲の雑音等のダブルトーク成分であるとみなすことができる。しかも、受話音に含まれるマスク周波数の成分を除去したことにより、ダブルトーク成分の音量が小さい場合でも、ダブルトーク成分の有無を判定することが可能となり、マイク入力音がダブルトークの状態であるか否かを精度良く判定することが可能となる。加えて、本実施形態に係る通話装置１では、他の通話装置４から受信してレシーバ２に出力する受話音に含まれるマスク周波数の成分を除去し、当該マスク周波数の成分を除去した受話音をレシーバ２から出力する。このため、本実施形態に係る通話装置１では、受話音に基づいて生成した擬似エコー信号を利用してエコー成分の抑圧を制御する場合に比べて、処理量を低減させることが可能となる。よって、本実施形態によれば、通話装置におけるマイク入力音がダブルトークの状態であるか否かを効率よく、かつ正しく判定することが可能となる。これにより、ダブルトークの状態であるか否かの誤判定による誤った伝達特性の学習や、不適切な伝達特性によるエコー成分の抑圧を防ぐことが可能となる。 As described above, the communication device 1 according to the present embodiment removes from the received sound the mask frequency components that do not affect human perception, and then outputs the sound to the outside of the communication device 1 via the receiver 2. Therefore, when the microphone input sound picked up by the microphone 3 contains a significant sound component at a mask frequency, that component can be regarded as a double talk component, such as the voice of the user of the communication device 1 or noise around the communication device 1. Moreover, because the mask frequency components have been removed from the received sound, the presence or absence of a double talk component can be determined even when its volume is low, so whether the microphone input sound is in a double talk state can be determined accurately. In addition, the communication device 1 according to the present embodiment removes the mask frequency components from the received sound that it receives from another communication device 4 and outputs to the receiver 2, and outputs the received sound with those components removed from the receiver 2. The processing amount can therefore be reduced compared with the case where suppression of the echo component is controlled using a pseudo echo signal generated from the received sound. Hence, according to the present embodiment, whether the microphone input sound in the communication device is in a double talk state can be determined efficiently and correctly. This makes it possible to prevent learning of an erroneous transfer characteristic caused by a misjudgment of the double talk state, and to prevent echo suppression based on an inappropriate transfer characteristic.

なお、図５のフローチャートは、受話音から除去するマスク周波数を決定する処理の一例に過ぎない。マスク周波数を決定する処理は、図５のフローチャートに沿った処理に限らず、適宜変更可能である。例えば、図５のフローチャートにおけるステップＳ３０２の処理と、ステップＳ３０３の処理とは、順不同であり、ステップＳ３０３を先に行ってもよい。また、ステップＳ３０２の処理とステップＳ３０３の処理とを並列に行ってもよい。また、マスク周波数を決定する処理は、例えば、第１のマスク周波数と第２のマスク周波数とが同一周波数となった場合に、マスク量が３番目に大きい第３のマスク周波数を抽出する処理を含むものでもよい。更に、マスク周波数を決定する処理は、第１のマスク周波数及び第２のマスク周波数のいずれか一方のみを決定する処理であってもよい。また、第１のマスク周波数を算出する際には、例えば、ピーク周波数からの周波数差が所定の範囲内である周波数帯、すなわちピーク周波数の近傍の周波数帯のみを処理対象としてもよい。 Note that the flowchart of FIG. 5 is merely an example of the process for determining the mask frequencies to be removed from the received sound. The process for determining the mask frequencies is not limited to the process of the flowchart of FIG. 5 and may be changed as appropriate. For example, the processes of step S302 and step S303 in the flowchart of FIG. 5 may be performed in either order, so step S303 may be performed first. The processes of step S302 and step S303 may also be performed in parallel. The process for determining the mask frequencies may also include, for example, a process of extracting a third mask frequency, which has the third largest mask amount, when the first mask frequency and the second mask frequency are the same frequency. Further, the process for determining the mask frequencies may determine only one of the first mask frequency and the second mask frequency. When calculating the first mask frequency, the processing may be limited, for example, to frequency bands whose frequency difference from the peak frequency is within a predetermined range, that is, to frequency bands near the peak frequency.

また、マスク周波数を算出する際に参照するマスク情報は、図２に示した関数Ｇ（ｆ），Ｈ（ｔ）に限らず、適宜変更可能である。 The mask information referred to when calculating the mask frequencies is not limited to the functions G(f) and H(t) shown in FIG. 2 and may be changed as appropriate.

また、図１１のフローチャートは、ダブルトークであるか否かを判定する処理の一例に過ぎない。ダブルトークであるか否かを判定する処理は、図１１のフローチャートに沿った処理に限らず、適宜変更可能である。例えば、ステップＳ６０１で第１のパワーＰＦを算出した後、ダブルトーク判定部１６０は、続けて、ステップＳ６０３の判定を行ってもよい。このようにすることで、例えば、第１のパワーＰＦに基づいてダブルトークではないと判定した場合に、第２のパワーＰＴを算出する処理を省略することが可能となり、ダブルトーク判定部１６０における処理量を低減することが可能となる。また、ステップＳ６０３及びＳ６０４の判定の代わりに、ダブルトーク判定部１６０は、例えば、ＰＦ＞ＴＨ２、又はＰＴ＞ＴＨ３である場合に処理対象のフレームがダブルトークの状態であると判定してもよい。更に、ダブルトーク判定部１６０と学習判定部１７１とを１つにまとめて、ステップＳ６０５，Ｓ６０６において伝達特性を学習するか否かを決定してもよい。 The flowchart of FIG. 11 is likewise merely an example of the process for determining whether double talk is present. The process for determining whether double talk is present is not limited to the process of the flowchart of FIG. 11 and may be changed as appropriate. For example, after calculating the first power PF in step S601, the double talk determination unit 160 may immediately perform the determination of step S603. In this way, when it is determined from the first power PF that double talk is not present, the process of calculating the second power PT can be omitted, which reduces the processing amount in the double talk determination unit 160. Instead of the determinations of steps S603 and S604, the double talk determination unit 160 may also determine that the frame being processed is in a double talk state when, for example, PF > TH2 or PT > TH3. Furthermore, the double talk determination unit 160 and the learning determination unit 171 may be combined into one unit, and whether to learn the transfer characteristic may be decided in steps S605 and S606.
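
The two variations of the double talk determination mentioned above can be sketched as follows. The function names and the idea of passing the PT calculation as a callable are assumptions made for illustration; the thresholds are the TH2 and TH3 of the text.

```python
def judge_double_talk_lazy(pf, compute_pt, th2, th3):
    """Variation 1: test PF first and compute PT only when it is needed.

    compute_pt is a zero-argument callable returning the second power PT,
    so its cost is avoided whenever PF <= TH2.
    """
    if pf <= th2:
        return 0                       # not double talk; PT is never computed
    return 1 if compute_pt() > th3 else 0


def judge_double_talk_either(pf, pt, th2, th3):
    """Variation 2: flag double talk when either power exceeds its threshold."""
    return 1 if (pf > th2 or pt > th3) else 0
```

The first variation gives the same result as the flow of FIG. 11 with less work on frames without double talk, while the second is more sensitive and would flag more frames, at the likely cost of skipping some frames that could have been used for learning.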

また、マイク入力音がダブルトークの状態であるか否かの判定結果は、エコー成分の抑圧に用いる伝達特性を学習するか否かの判定に限らず、他の処理を行うか否かの判定に利用することも可能である。例えば、ダブルトークの状態であるか否かの判定結果は、電話会議システム等で音声データを話者毎に分類する処理に利用可能である。 In addition, the determination result of whether or not the microphone input sound is in a double talk state is not limited to determining whether or not to learn the transfer characteristic used for suppressing the echo component, and determining whether or not to perform other processing. It is also possible to use it. For example, the determination result as to whether or not the state is a double talk state can be used for a process of classifying voice data for each speaker in a telephone conference system or the like.

上記の実施形態に係る通話装置１は、例えば、携帯電話端末やスマートフォン等の通話機能を備えた移動端末に所定のエコー抑圧プログラムを実行させることにより実現可能である。 The call device 1 according to the above-described embodiment can be realized, for example, by causing a mobile terminal having a call function such as a mobile phone terminal or a smartphone to execute a predetermined echo suppression program.

FIG. 12 is a diagram illustrating the hardware configuration of a mobile terminal.
As illustrated in FIG. 12, the mobile terminal 7 includes a processor 701, a memory 702, an input device 703, a display device 704, an RF transceiver 705, an antenna 706, a receiver 2, and a microphone 3.

The processor 701 is a Central Processing Unit (CPU), a Micro Processing Unit (MPU), or the like. The processor 701 controls the overall operation of the mobile terminal 7 by executing various programs including an operating system. The processor 701 also executes various application programs, such as an echo suppression program including the processes of steps S1 to S11 in FIG. 3. The processor 701 includes, for example, a baseband processing circuit 701a and an application processing circuit 701b.

The memory 702 includes a Read Only Memory (ROM) and a Random Access Memory (RAM), neither of which is shown. The ROM of the memory 702 stores in advance, for example, a predetermined basic control program that the processor 701 reads when the mobile terminal 7 starts up. The RAM of the memory 702 is used by the processor 701 as a working storage area as needed when it executes various programs. The memory 702 can also store the various programs executed by the processor 701 and various data, including mask information, transfer characteristics, and analysis results for a plurality of frames of the received sound. The memory 702 may be a non-volatile memory built into the mobile terminal 7, such as a flash memory (including a Solid State Drive (SSD)), or may be, for example, a memory card (flash memory) conforming to the Secure Digital (SD) standard.

The input device 703 is, for example, a keyboard device or a touch panel device. When the operator (user) of the mobile terminal 7 performs a predetermined operation on the input device 703, the input device 703 transmits the input information associated with that operation to the processor 701. The input device 703 can be used, for example, to input commands to start or end a call.

The display device 704 is, for example, a liquid crystal display device. The display device 704 can be used to display the name, telephone number, or other information identifying the other party to the call.

The RF transceiver 705 connects the mobile terminal 7 to a network such as a telephone network and controls the transmission and reception of voice data between the mobile terminal 7 and another call device 4 via the network. The RF transceiver 705 converts radio waves received by the antenna 706 into electric signals and passes them to the processor 701, and causes the antenna 706 to output electric signals received from the processor 701 as radio waves.

In the mobile terminal 7, for example, when the operator inputs a call start command using the input device 703 or the like, the processor 701 reads and executes the echo suppression program stored in a non-transitory recording medium such as the RAM of the memory 702. When the echo suppression program includes the processes of steps S1 to S11 in FIG. 3, the mobile terminal 7 repeats a process of removing the mask frequency component from the received sound and a process of suppressing the echo component in the microphone input sound. While executing the echo suppression program, the processor 701 functions (operates) as the received sound analysis unit 120, the mask frequency determination unit 130, the received sound processing unit 140, the component extraction unit 150, the double talk determination unit 160, and the input sound processing unit 170 of the call device 1 in FIG. 1. During that time, the RF transceiver 705 and the antenna 706 function as the transmission/reception unit 110 of the call device 1 in FIG. 1, and the memory 702 functions as the mask information storage unit 191, the analysis result holding unit 192, and the transfer characteristic storage unit 193.

Note that the mobile terminal 7 may include an arithmetic device separate from the processor 701, such as a Digital Signal Processor (DSP), and may have that device perform part of the processing of steps S1 to S11 in FIG. 3.

The call device 1 according to the present embodiment is not limited to the mobile terminal 7 described above. It can also be realized by causing a computer capable of transmitting and receiving voice data over a network such as the Internet to execute a predetermined echo suppression program.

FIG. 13 is a diagram illustrating the hardware configuration of a computer.
As illustrated in FIG. 13, the computer 8 includes a processor 801, a main storage device 802, an auxiliary storage device 803, an input device 804, an output device 805, an input/output interface 806, a communication control device 807, and a medium driving device 808. These elements 801 to 808 of the computer 8 are connected to one another by a bus 810 so that data can be exchanged between them.

The processor 801 is a CPU, an MPU, or the like. The processor 801 controls the overall operation of the computer 8 by executing various programs including an operating system. The processor 801 also executes various application programs, such as a voice processing program including the processes of steps S1 to S11 in FIG. 3.

The main storage device 802 includes a Read Only Memory (ROM) and a Random Access Memory (RAM), neither of which is shown. The ROM of the main storage device 802 stores in advance, for example, a predetermined basic control program that the processor 801 reads when the computer 8 starts up. The RAM of the main storage device 802 is used by the processor 801 as a working storage area as needed when it executes various programs. The RAM of the main storage device 802 can be used, for example, to store mask information, transfer characteristics, and analysis results for a plurality of frames of the received sound.

The auxiliary storage device 803 has a larger capacity than the RAM of the main storage device 802 and is, for example, a Hard Disk Drive (HDD) or a non-volatile memory such as a flash memory (including a Solid State Drive (SSD)). The auxiliary storage device 803 can store the various programs executed by the processor 801 and various data. For example, it can store an echo suppression program including the processes of steps S1 to S11 in FIG. 3. It can also store mask information, transfer characteristics, analysis results for a plurality of frames of the received sound, and the received sound and microphone input sound themselves.

The input device 804 is, for example, a keyboard device or a touch panel device. When the operator (user) of the computer 8 performs a predetermined operation on the input device 804, the input device 804 transmits the input information associated with that operation to the processor 801. The input device 804 can be used, for example, to input commands to start or end a call.

The output device 805 is, for example, a display device such as a liquid crystal display device. The output device 805 can be used to display the name, telephone number, or other information identifying the other party to the call.

The input/output interface 806 connects the computer 8 to other electronic devices. The input/output interface 806 includes, for example, a phone jack or a connector conforming to the Universal Serial Bus (USB) standard. The input/output interface 806 can be used, for example, to connect the computer 8 to a headset 10 that includes the receiver 2 and the microphone 3.

The communication control device 807 connects the computer 8 to a network such as the Internet and controls communication between the computer 8 and other communication equipment via the network. The communication control device 807 can be used, for example, to transmit and receive voice data between the computer 8 and another call device 4.

The medium driving device 808 reads programs and data recorded on the portable recording medium 9 and writes data stored in the auxiliary storage device 803 and the like to the portable recording medium 9. For example, a memory card reader/writer supporting one or more standards can be used as the medium driving device 808. When a memory card reader/writer is used as the medium driving device 808, the portable recording medium 9 can be a memory card of a standard supported by the reader/writer, for example a memory card (flash memory) conforming to the Secure Digital (SD) standard. A flash memory with a USB connector can also be used as the portable recording medium 9. Furthermore, when the computer 8 is equipped with an optical disc drive usable as the medium driving device 808, the various optical discs recognizable by that drive can be used as the portable recording medium 9. Such optical discs include, for example, a Compact Disc (CD), a Digital Versatile Disc (DVD), and a Blu-ray Disc (Blu-ray is a registered trademark). The portable recording medium 9 can store, for example, an echo suppression program including the processes of steps S1 to S11 in FIG. 3. It can also store mask information, transfer characteristics, analysis results for a plurality of frames of the received sound, and the received sound and microphone input sound themselves.

In the computer 8, for example, when the operator inputs a call start command using the input device 804 or the like, the processor 801 reads and executes the echo suppression program stored in a non-transitory recording medium such as the auxiliary storage device 803. When the echo suppression program includes the processes of steps S1 to S11 in FIG. 3, the computer 8 repeats a process of removing the mask frequency component from the received sound and outputting the result, and a process of suppressing the echo component in the microphone input sound and outputting the result. While executing this program, the processor 801 functions (operates) as the received sound analysis unit 120, the mask frequency determination unit 130, the received sound processing unit 140, the component extraction unit 150, the double talk determination unit 160, and the input sound processing unit 170 of the call device 1 in FIG. 1. During that time, the communication control device 807 functions as the transmission/reception unit 110 of the call device 1 in FIG. 1, and recording media such as the RAM of the main storage device 802, the auxiliary storage device 803, and the portable recording medium 9 function as the mask information storage unit 191, the analysis result holding unit 192, and the transfer characteristic storage unit 193.

Note that the computer 8 operated as the call device 1 need not include all of the elements 801 to 808 illustrated in FIG. 13; some elements may be omitted depending on the application and conditions. For example, the computer 8 may omit the medium driving device 808.

The following supplementary notes are further disclosed with respect to the embodiments described above.
(Supplementary note 1)
A call device comprising:
a received sound processing unit that removes a component of a predetermined frequency band from a received voice received from an external device and outputs the result to a receiver;
a component extraction unit that extracts the component of the predetermined frequency band from an input voice input from a microphone;
a transfer characteristic learning unit that learns a transfer characteristic of an echo path from the receiver to the microphone based on the received voice and the input voice; and
a determination unit that causes the transfer characteristic learning unit to learn the transfer characteristic when the magnitude of the component of the predetermined frequency band extracted from the input voice is equal to or less than a threshold.
(Supplementary note 2)
The call device according to supplementary note 1, further comprising a frequency determination unit that determines the predetermined frequency band based on a component of a frequency band that is a peak in a frequency spectrum of the received voice.
(Supplementary note 3)
The call device according to supplementary note 2, wherein the frequency determination unit determines, as the predetermined frequency band, a frequency band in which the magnitude of the component in the frequency spectrum is smaller than a maximum volume, based on the relationship between the frequency difference from the frequency band that is a peak in one frequency spectrum and the maximum volume masked by the component of the peak frequency band.
(Supplementary note 4)
The call device according to supplementary note 3, wherein the frequency determination unit determines, as the predetermined frequency band, the frequency band whose difference from the maximum volume is largest among a plurality of frequency bands in which the magnitude of the component in the frequency spectrum is smaller than the maximum volume.
(Supplementary note 5)
The call device according to supplementary note 3, wherein the frequency determination unit determines, as the predetermined frequency band, the frequency band whose frequency difference from the peak frequency band is within a predetermined range and whose difference from the maximum volume is largest, among a plurality of frequency bands in which the magnitude of the component in the frequency spectrum is smaller than the maximum volume.
(Supplementary note 6)
The call device according to supplementary note 3, wherein the frequency determination unit extracts, from among the frequency bands in a plurality of the frequency spectra that are consecutive in the time direction, frequency bands in which the magnitude of the component in the current frequency spectrum is smaller than the maximum volume, based on the relationship between the time difference from the time at which the magnitude of the component peaks in its change over time and the maximum volume masked by the component at the peak time, and determines the predetermined frequency band from among the extracted frequency bands.
(Supplementary note 7)
The call device according to supplementary note 6, wherein the frequency determination unit determines, as the predetermined frequency band, the frequency band whose difference from the maximum volume is largest among the plurality of frequency bands in which the magnitude of the component in the current frequency spectrum is smaller than the maximum volume.
(Supplementary note 8)
The call device according to supplementary note 1, further comprising an echo component suppression unit that suppresses an echo component included in the input voice based on the transfer characteristic.
(Supplementary note 9)
An echo suppression program that causes a computer to execute a process comprising:
removing a component of a predetermined frequency band from a received voice received from an external device and outputting the result to a receiver;
extracting the component of the predetermined frequency band from an input voice input from a microphone;
learning a transfer characteristic of an echo path from the receiver to the microphone based on the received voice and the input voice when the magnitude of the component of the predetermined frequency band extracted from the input voice is equal to or less than a threshold; and
suppressing an echo component included in the input voice based on the transfer characteristic.
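
The band selection described in notes 2 to 4 above can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the linear masking curve in `masked_max_level`, its default slope, and all function and parameter names are hypothetical, and levels are assumed to be given in dB per frequency band.

```python
from typing import List, Optional


def masked_max_level(peak_db: float, band_distance: int,
                     slope_db_per_band: float = 10.0) -> float:
    """Hypothetical masking curve: the maximum masked level drops linearly
    with the distance (in bands) from the peak band."""
    return peak_db - slope_db_per_band * abs(band_distance)


def choose_mask_band(spectrum_db: List[float]) -> Optional[int]:
    """Return the index of the band to remove, or None if no band is masked."""
    peak = max(range(len(spectrum_db)), key=lambda k: spectrum_db[k])
    best_band, best_margin = None, 0.0
    for k, level in enumerate(spectrum_db):
        if k == peak:
            continue
        margin = masked_max_level(spectrum_db[peak], k - peak) - level
        # Keep only bands below the masked maximum (margin > 0) and prefer
        # the one with the largest margin, as in note 4.
        if margin > best_margin:
            best_band, best_margin = k, margin
    return best_band
```

Restricting the loop to bands within a fixed distance of `peak` would give the variation in note 5.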

1, 4 Call device
2 Receiver
3 Microphone
5, 501 Processed received sound
6 User's voice
7 Mobile terminal
8 Computer
9 Portable recording medium
10 Headset
110 Transmission/reception unit
120 Received sound analysis unit
130 Mask frequency determination unit
140 Received sound processing unit
150 Component extraction unit
160 Double talk determination unit
170 Input sound processing unit
171 Learning determination unit
172 Transfer characteristic learning unit
173 Echo component suppression unit
191 Mask information storage unit
192 Analysis result holding unit
193 Transfer characteristic storage unit
701, 801 Processor
701a Baseband processing circuit
701b Application processing circuit
702 Memory
703, 804 Input device
704 Display device
705 RF transceiver
706 Antenna
802 Main storage device
803 Auxiliary storage device
805 Output device
806 Input/output interface
807 Communication control device
808 Medium driving device

Claims

A call device comprising:
a received sound processing unit that removes a component of a predetermined frequency band from a received voice received from an external device and outputs the result to a receiver;
a component extraction unit that extracts the component of the predetermined frequency band from an input voice input from a microphone;
a transfer characteristic learning unit that learns a transfer characteristic of an echo path from the receiver to the microphone based on the received voice and the input voice; and
a determination unit that causes the transfer characteristic learning unit to learn the transfer characteristic when the magnitude of the component of the predetermined frequency band extracted from the input voice is equal to or less than a threshold.

The call device according to claim 1, further comprising a frequency determination unit that determines the predetermined frequency band based on a component of a frequency band that is a peak in a frequency spectrum of the received voice.

The call device according to claim 2, wherein the frequency determination unit determines, as the predetermined frequency band, a frequency band in which the magnitude of the component in the frequency spectrum is smaller than a maximum volume, based on the relationship between the frequency difference from the frequency band that is a peak in one frequency spectrum and the maximum volume masked by the component of the peak frequency band.

The call device according to claim 3, wherein the frequency determination unit extracts, from among the frequency bands in a plurality of the frequency spectra that are consecutive in the time direction, frequency bands in which the magnitude of the component in the current frequency spectrum is smaller than the maximum volume, based on the relationship between the time difference from the time at which the magnitude of the component peaks in its change over time and the maximum volume masked by the component at the peak time, and determines the predetermined frequency band from among the extracted frequency bands.

The call device according to claim 4, wherein the frequency determination unit determines, as the predetermined frequency band, the frequency band whose difference from the maximum volume is largest among the plurality of frequency bands in which the magnitude of the component in the current frequency spectrum is smaller than the maximum volume.
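
The time-direction selection recited in the two preceding claims can be illustrated with a short sketch. It assumes band levels in dB and a masked level that decays linearly with the number of frames since the peak. That decay model, its default rate, and the names `temporal_mask_margins`, `history_db`, and `decay_db_per_frame` are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, List


def temporal_mask_margins(history_db: List[List[float]],
                          decay_db_per_frame: float = 6.0) -> Dict[int, float]:
    """history_db[t][k] is the level of band k in frame t; the last frame is
    the current one. Returns {band: margin} for bands whose current level is
    below the level still masked by that band's earlier peak."""
    current = len(history_db) - 1
    margins: Dict[int, float] = {}
    for k in range(len(history_db[0])):
        series = [frame[k] for frame in history_db]
        peak_t = max(range(len(series)), key=lambda t: series[t])
        if peak_t == current:
            continue  # the current frame is itself the peak; nothing masks it
        # Hypothetical decay: the masked level falls with the time difference.
        masked_level = series[peak_t] - decay_db_per_frame * (current - peak_t)
        margin = masked_level - series[current]
        if margin > 0:
            margins[k] = margin
    return margins
```

Given a non-empty result, `max(margins, key=margins.get)` picks the band with the largest margin, which corresponds to the choice in the last claim above.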

An echo suppression program that causes a computer to execute a process comprising:
removing a component of a predetermined frequency band from a received voice received from an external device and outputting the result to a receiver;
extracting the component of the predetermined frequency band from an input voice input from a microphone;
learning a transfer characteristic of an echo path from the receiver to the microphone based on the received voice and the input voice when the magnitude of the component of the predetermined frequency band extracted from the input voice is equal to or less than a threshold; and
suppressing an echo component included in the input voice based on the transfer characteristic.
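
The four steps of the process recited above can be sketched for a single frame of frequency-domain data. This is a minimal illustration under assumptions: spectra are lists of complex bins, the band magnitude is measured as summed power, and the learning step is a simple per-bin smoothing of the microphone-to-receiver ratio chosen for brevity, since this passage does not specify the learning method. The names `process_frame`, `mask_bins`, `transfer`, and `step` are hypothetical.

```python
from typing import List, Sequence


def process_frame(received: Sequence[complex], mic: Sequence[complex],
                  mask_bins: Sequence[int], transfer: List[complex],
                  threshold: float, step: float = 0.5) -> List[complex]:
    """Process one frame of spectra; `transfer` is updated in place."""
    # 1. Remove the mask band from the received spectrum before playback.
    played = [0j if k in mask_bins else r for k, r in enumerate(received)]
    # 2. Power of the microphone spectrum inside the removed band.
    band_power = sum(abs(mic[k]) ** 2 for k in mask_bins)
    # 3. Learn only when that band is nearly empty, i.e. no near-end talker.
    if band_power <= threshold:
        for k, p in enumerate(played):
            if abs(p) > 1e-12:
                transfer[k] += step * (mic[k] / p - transfer[k])
    # 4. Subtract the estimated echo from the microphone spectrum.
    return [m - h * p for m, h, p in zip(mic, transfer, played)]
```

Because the removed band carries no echo, any energy found there in the microphone signal must come from the near-end talker, which is why it can gate the update in step 3.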