JP6232710B2 - Sound recording device

Info

Publication number: JP6232710B2
Application number: JP2013033558A
Authority: JP
Inventors: 茂出木 敏雄
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2013-02-22
Filing date: 2013-02-22
Publication date: 2017-11-22
Anticipated expiration: 2033-02-22
Also published as: JP2014164039A

Description

The present invention relates to a concealment technique for preventing conversational speech exchanged at medical institutions (for example, reception counters of dispensing pharmacies), consultation counters of financial institutions and insurance companies, interview rooms of law firms, counters of mobile phone shops, restaurants used for business dinners, and similar places from being overheard by people in waiting rooms, other interview rooms, or nearby seats. In particular, it relates to a technique for clarifying recorded conversational speech in order to evaluate measures against speech leakage in conference rooms and the like that are divided by physical partitions (plate-like members made of sound-absorbing or sound-insulating material) used in combination with such concealment.

Conversations exchanged at medical institutions (for example, reception counters of dispensing pharmacies), consultation counters of financial institutions and insurance companies, interview rooms of law firms, counters of mobile phone shops, restaurants used for business dinners, and similar places often contain personal information or confidential corporate information that should not be heard by third parties. Nevertheless, many facilities have so far relied only on simple partitions. To keep conversations at such facilities from leaking, a concealment data generation device has been developed that strengthens the masking effect on the speech signal while keeping the timbre of the reproduced music equivalent to the original sound, so that the prescribed masking effect is obtained even when the music is played at reduced volume (see Patent Document 1).

The above technique, combined with physical partitions, has been shown to be effective mainly in preventing leakage of conversational speech from partitioned conference rooms, and it is in practical use at various sites. Meanwhile, voice recorders capable of recording conversation have been miniaturized and built into mobile phones and smartphones, and various speech enhancement and noise removal tools that make the content of unclear recorded speech intelligible are widely available. As a result, it has become easy to eavesdrop with high performance without being noticed by others. It has therefore become necessary to evaluate quantitatively how well speech recorded by a voice recorder is concealed in a facility where leakage countermeasures are in place.

As a method of speech enhancement and noise removal that makes the content of recorded speech intelligible, the spectral subtraction method (see Non-Patent Document 1), which is also used in Patent Document 2, is known for noise (including masking sound) added to the recorded speech; the noise can be reduced if the noise component can be identified. Because the noise component is recorded mixed with the speech, the sound component in intervals where the speech is silent is judged to be noise, and if that noise is stationary it can also be removed in the mixed intervals. Patent Document 3 proposes a method of clarifying speech buried in in-vehicle noise using a second-order IIR filter. Patent Document 4 proposes a technique that improves intelligibility by emphasizing consonants.

Patent Document 1: JP 2012-226113 A
Patent Document 2: WO 99/50825
Patent Document 3: JP 2007-295347 A
Patent Document 4: Japanese Patent No. 4876245

Non-Patent Document 1: S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. ASSP, Vol. 27, pp. 113-120, 1979.

However, the conventional techniques described above cannot clarify speech that has been attenuated by passing through a partition, and manual correction for each frequency band with a graphic equalizer or the like has been necessary.

Accordingly, an object of the present invention is to provide a recorded voice clarification device capable of improving, in stages, the intelligibility of speech recorded through a partition or the like.

To solve the above problem, a first aspect of the present invention provides a recorded voice clarification device that improves the intelligibility of a target audio signal, which is obtained by recording and is to be corrected, by using a reference audio signal obtained by a separate recording. The device comprises: target speech analysis means that performs frequency analysis on the target audio signal in units of predetermined frames along the time axis and calculates a target speech intensity spectrum S(f, τ), which is the intensity spectrum of the target audio signal, and a target speech average value spectrum Sav(f) consisting, for each value of the variable f, of the average over a plurality of frames (for example, all frames), where f is a variable based on frequency (a variable determined from frequency, such as a variable proportional to frequency or a variable that takes the logarithm of the physical frequency, like a MIDI note number) and τ denotes the τ-th frame; reference speech analysis means that performs frequency analysis on the reference audio signal in units of predetermined frames along the time axis and calculates a reference speech average value spectrum Hav(f) consisting, for each value of f, of the average over a plurality of frames (for example, all frames) of the reference audio signal; noise component spectrum creation means that creates a noise component spectrum N(f) using the target speech average value spectrum Sav(f); modulation component spectrum creation means that creates a modulation component spectrum G(f) based on the value obtained, for each f, by dividing Hav(f) by the value obtained by subtracting N(f) from Sav(f); and audio signal correction means that, in each frame τ and for each f, subtracts the created noise component spectrum N(f) from S(f, τ) at a predetermined ratio α (0 ≤ α ≤ 1), multiplies the subtracted value by the created modulation component spectrum G(f) at a predetermined ratio β (0 ≤ β ≤ 1), and applies a time-dimension transform to the multiplied values of each frame τ, thereby creating a corrected audio signal in which the target audio signal has been corrected.

According to the first aspect of the present invention, frequency analysis is performed on the target audio signal in units of predetermined frames to calculate the target speech average value spectrum Sav(f), which consists, for each value of f, of the average over a plurality of frames (for example, all frames), and frequency analysis is likewise performed on the reference audio signal to calculate the reference speech average value spectrum Hav(f). The noise component spectrum N(f) is created for each f using Sav(f), and the modulation component spectrum G(f) is created based on the value obtained by dividing Hav(f) by Sav(f) minus N(f). In each frame τ and for each f, N(f) is subtracted from the target speech intensity spectrum S(f, τ) at the ratio α (0 ≤ α ≤ 1), the subtracted value is multiplied by G(f) at the ratio β (0 ≤ β ≤ 1), and the multiplied values of each frame τ are transformed to the time dimension to create the corrected audio signal. By changing α and β in steps at predetermined intervals, the intelligibility of speech recorded through a partition or the like can therefore be improved in stages. Although Sav(f) and Hav(f) are defined as averages over a plurality of frames, in practice it is preferable to use all frames. For computational convenience, however, the average may exclude the first frame, the last frame, or some other frames.

In a second aspect of the present invention, the target speech analysis means calculates, in addition to the target speech average value spectrum Sav(f), a minimum value spectrum Smin(f) represented, for each value of f, by the frame in which the intensity of the target audio signal is smallest, and the noise component spectrum creation means creates the noise component spectrum N(f) based on the value obtained by averaging, for each corresponding f, a value based on Smin(f) and a value based on Sav(f).

According to the second aspect of the present invention, the minimum value spectrum Smin(f), represented for each value of f by the frame in which the intensity of the target audio signal is smallest, is calculated, and the noise component spectrum N(f) is created based on the value obtained by averaging, for each corresponding f, a value based on Smin(f) and a value based on the target speech average value spectrum Sav(f). The corrected audio signal can therefore be created at high speed by analyzing all frames of the target audio signal.
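The second aspect can be illustrated with a minimal sketch. It assumes the intensity spectra are already available as one list of non-negative magnitudes per frame, and it takes the "values based on" Smin(f) and Sav(f) to be the spectra themselves, as in the FIG. 4 description. The function name `noise_spectrum` is hypothetical.

```python
def noise_spectrum(frames):
    """Return (Sav, Smin, N) for a list of per-frame intensity spectra.

    frames[tau][f] is the intensity S(f, tau). All frames are used for
    the average, which the text describes as the preferred choice.
    """
    bins = list(zip(*frames))               # regroup values by frequency bin f
    s_av = [sum(b) / len(b) for b in bins]  # average over frames, per bin
    s_min = [min(b) for b in bins]          # minimum over frames, per bin
    # Assumption: N(f) is the plain mean of Sav(f) and Smin(f).
    noise = [(a + m) / 2 for a, m in zip(s_av, s_min)]
    return s_av, s_min, noise
```

Because Smin(f) never exceeds Sav(f), this N(f) always lies between the two, so Sav(f) minus N(f) cannot become negative.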

In a third aspect of the present invention, the target speech analysis means performs frequency analysis only on the portions of the target audio signal in which speech is present, and the noise component spectrum creation means uses the target speech average value spectrum Sav(f) itself as the noise component spectrum N(f).

According to the third aspect of the present invention, frequency analysis is performed only on the intervals of the target audio signal in which speech is present, and the target speech average value spectrum Sav(f), which corresponds to the average over the intervals of stationary noise superimposed on the speech, is used as the noise component spectrum N(f). Non-stationary noise in noise-only intervals where no speech is present is thereby excluded from N(f), so that a highly accurate corrected audio signal can be created, and the corrected audio signal can be created at high speed because only the portions in which conversation is actually recorded are analyzed.

In a fourth aspect of the present invention, when the audio signal correction means subtracts the created noise component spectrum N(f) from the target speech intensity spectrum S(f, τ) at the predetermined ratio α (0 ≤ α ≤ 1) in each frame τ and for each f, it applies a correction that sets the subtracted value to 0 if that value is negative.

According to the fourth aspect of the present invention, when the noise component spectrum N(f) is subtracted from the target speech intensity spectrum S(f, τ) at the predetermined ratio α (0 ≤ α ≤ 1) in each frame τ and for each f, a negative result is set to 0. This prevents the creation of a corrected audio signal that would violate natural law and could not exist in nature.

In a fifth aspect of the present invention, the noise component spectrum creation means defines the noise component spectrum N(f) over a predetermined frequency range whose lower limit is f = f1 and whose upper limit is f = f2 (for example, f1 is the value corresponding to 200 Hz and f2 is the value corresponding to 6000 Hz); the modulation component spectrum creation means defines the modulation component spectrum G(f) over the same range; and the audio signal correction means, within the same range and for each frame τ, subtracts the created N(f) from the target speech intensity spectrum S(f, τ) at the predetermined ratio α and multiplies the subtracted value by the created G(f) at the predetermined ratio β.

According to the fifth aspect of the present invention, the spectral processing for correcting the audio signal is applied to a predetermined frequency range, so that the audio signal can be corrected with high accuracy while the portions outside the speech band, which consist mainly of noise, are excluded.
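As a rough illustration of the fifth aspect, the sketch below converts the example limits of 200 Hz and 6000 Hz into a range of spectrum indices. It assumes that f is a linear DFT bin index, so that bin k corresponds to k × sample_rate / frame_size Hz; the sampling rate and frame size shown in the usage note are hypothetical and are not taken from the text. All arguments must be integers.

```python
def band_bin_range(sample_rate, frame_size, f_low=200, f_high=6000):
    """Return (f1, f2) as inclusive DFT bin indices for the given band."""
    # Smallest bin whose frequency is at or above f_low (integer ceiling).
    low_bin = -(-f_low * frame_size // sample_rate)
    # Largest bin at or below f_high, capped at the Nyquist bin.
    high_bin = min(f_high * frame_size // sample_rate, frame_size // 2)
    return low_bin, high_bin
```

For instance, with a hypothetical 16000 Hz recording analyzed in 512-sample frames, `band_bin_range(16000, 512)` selects the bins from 7 to 192, and bins outside that range would be left out of the noise subtraction and modulation steps.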

According to the present invention, the intelligibility of speech recorded through a partition or the like can be improved in stages by setting predetermined parameters step by step. Conversely, the intelligibility of the recorded speech can be evaluated quantitatively from the parameter values that were set when the speech was corrected to a clear state.
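The stepwise evaluation described above might be organized as in the following sketch. The text does not specify the step interval or the order in which α and β are varied, so the even grid and its ordering are assumptions, and `is_clear` is a hypothetical callback standing in for a listener's judgment of whether the corrected speech is intelligible.

```python
def parameter_steps(num_steps):
    """Return (alpha, beta) pairs covering [0, 1] x [0, 1] in even steps."""
    values = [i / num_steps for i in range(num_steps + 1)]
    return [(alpha, beta) for alpha in values for beta in values]


def first_clear_setting(steps, is_clear):
    """Return the first (alpha, beta) judged clear, or None if none is."""
    for alpha, beta in steps:
        if is_clear(alpha, beta):
            return alpha, beta
    return None
```

Under this reading, a recording that only becomes intelligible at large α and β would count as well concealed, while one that is already clear near (0, 0) would count as poorly concealed.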

FIG. 1 is a diagram showing a speech propagation path model for recording speech acquired through a partition.
FIG. 2 is a diagram showing a speech propagation path model for recording speech acquired without passing through a partition.
FIG. 3 is a diagram showing an outline of the processing according to the present invention.
FIG. 4 is a diagram showing the method of calculating the noise component spectrum N(f) and the modulation component spectrum G(f).
FIG. 5 is a hardware configuration diagram of a recorded voice clarification device according to an embodiment of the present invention.
FIG. 6 is a functional block diagram showing the configuration of the recorded voice clarification device according to an embodiment of the present invention.
FIG. 7 is a flowchart showing an outline of the processing of the recorded voice clarification device according to an embodiment of the present invention.
FIG. 8 is a diagram showing the waveform of the target audio signal s(i).
FIG. 9 is a diagram showing the waveform of the reference audio signal h(i).
FIG. 10 is a diagram showing the waveforms of the target speech average value spectrum Sav(f) and the reference speech average value spectrum Hav(f).
FIG. 11 is a diagram showing the waveform of the modulation component spectrum G(f).
FIG. 12 is a diagram showing the waveform of the noise component spectrum N(f).
FIG. 13 is a diagram showing the waveform of the corrected audio signal c(i).
FIG. 14 is a diagram showing the waveforms of the target speech average value spectrum Sav(f) and the corrected speech average value spectrum Cav(f).

Preferred embodiments of the present invention are described in detail below with reference to the drawings.
<1. Speech propagation path model used in the present invention>
First, the speech propagation path model used in the present invention is described. FIG. 1 is a diagram showing a speech propagation path model for recording speech acquired through a partition. As shown in FIG. 1, the present invention uses a propagation path model in which the source speech signal (conversation sound) C(f, τ) propagates through a partition made of a material with frequency characteristic A(f), environmental noise (including masking sound) N(f) is added to the propagated sound, and the sound S(f, τ) = C(f, τ)·A(f) + N(f) leaks out. Here f is a variable based on frequency and τ is the frame number of a frame containing a predetermined number of samples in the frequency analysis. A(f) is a scalar value, while S(f, τ), C(f, τ), and N(f) are complex numbers. The environmental noise source is limited to stationary noise N(f), such as air-conditioning sound, and to sound that propagates directly without passing through the partition, such as masking sound. In the present invention, speech recorded through the partition is treated as the target audio signal to be clarified.

FIG. 2 is a diagram showing a speech propagation path model for recording speech acquired without passing through a partition. In the propagation path model shown in FIG. 2, environmental noise (including masking sound) N(f) is added to the source speech signal (conversation sound) C(f, τ), and the sound H(f, τ) = C(f, τ) + N(f) is heard. Here H(f, τ), C(f, τ), and N(f) are complex numbers. The environmental noise source is limited to stationary noise N(f), such as air-conditioning sound, and to sound that propagates directly without passing through the partition, such as masking sound. In the present invention, speech recorded without passing through the partition is treated as the reference audio signal that is referred to when clarifying the target audio signal.
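The two propagation models suggest why the modulation component spectrum G(f) of the first aspect, the reference average divided by the noise-reduced target average, can act as an approximate inverse of the partition characteristic A(f). The derivation below is only an informal sketch under stated assumptions: frame averages are treated as additive (cross-terms and phase are ignored), the reference speech C' is allowed to differ from the target speech C but is assumed to have a similar average spectrum, and the noise is assumed to be small compared with the unattenuated reference speech.

```latex
\begin{align}
S(f,\tau) &= C(f,\tau)\,A(f) + N(f), &
H(f,\tau) &= C'(f,\tau) + N(f) \\
S_{av}(f) - N(f) &\approx A(f)\,C_{av}(f), &
H_{av}(f) &\approx C'_{av}(f) + N(f) \\
G(f) &= \frac{H_{av}(f)}{S_{av}(f) - N(f)}
      \approx \frac{C'_{av}(f) + N(f)}{A(f)\,C_{av}(f)}
      \approx \frac{1}{A(f)}
\end{align}
```

With these approximations, (S(f, τ) − N(f))·G(f) ≈ C(f, τ)·A(f)/A(f) = C(f, τ), which corresponds to full correction of the target signal.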

<2. Outline of processing according to the present invention>
Next, an outline of the processing according to the present invention is described. FIG. 3 is a diagram showing this outline. In the present invention, the unclear target audio signal s(i) acquired through a partition is clarified to obtain a corrected audio signal c(i) that is an estimate of the source speech signal. First, the target audio signal s(i), which is the recorded speech, is transformed to the frequency dimension to obtain the target speech intensity spectrum S(f, τ). Next, complex spectrum subtraction of the noise component spectrum N(f) is performed to obtain the noise-removed spectrum S(f, τ) − α·N(f). The result is then multiplied by the modulation component spectrum G(f), which performs complex spectrum modulation and yields the corrected speech spectrum C(f, τ). Finally, an inverse transform to the time dimension is performed to obtain the corrected audio signal c(i). Changing the parameters α and β in steps changes the degree of correction of c(i), and the values of α and β that were set when the signal was corrected to a clearly audible level allow the intelligibility of the recorded speech to be evaluated quantitatively.
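The processing chain just outlined can be sketched end to end in plain Python. Several points are assumptions rather than details given in the text: the frames are non-overlapping and unwindowed, a naive DFT stands in for the frequency analysis, the subtraction is applied to the magnitude while the phase of S(f, τ) is kept, negative magnitudes are set to 0 as in the fourth aspect, and "multiplying G(f) at the ratio β" is read as the gain 1 + β·(G(f) − 1). The names `dft`, `idft`, and `clarify` are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * i / n)
                 for k in range(n)) / n).real
            for i in range(n)]


def clarify(signal, noise, gain, alpha, beta, frame_size):
    """Return the corrected signal c(i) for the target signal s(i).

    noise[f] and gain[f] hold N(f) and G(f) for every DFT bin of a frame;
    they should be symmetric around the Nyquist bin so the output stays real.
    Samples beyond the last full frame are dropped.
    """
    corrected_signal = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        spectrum = dft(signal[start:start + frame_size])
        corrected = []
        for f, value in enumerate(spectrum):
            # Subtract alpha * N(f) from the magnitude, clamping at 0.
            magnitude = max(abs(value) - alpha * noise[f], 0.0)
            # Assumed reading of beta: blend between no gain and G(f).
            magnitude *= 1.0 + beta * (gain[f] - 1.0)
            corrected.append(cmath.rect(magnitude, cmath.phase(value)))
        corrected_signal.extend(idft(corrected))
    return corrected_signal
```

With α = 0 and β = 0 the sketch returns the input unchanged, and with α = 1 and β = 1 it applies the full noise subtraction and the full gain G(f), so the two parameters span the range between no correction and full correction.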

FIG. 4 shows the method of calculating the noise component spectrum N(f) and the modulation component spectrum G(f) used in FIG. 3. The average value spectrum and the minimum value spectrum over all frames are obtained from the target speech intensity spectrum S(f, τ) after the frequency-dimension transform and are denoted Sav(f) and Smin(f), respectively. The average value spectrum over all frames is obtained from the reference speech intensity spectrum H(f, τ) after the frequency-dimension transform and is denoted Hav(f). As shown in the figure, N(f) is calculated as the average of Sav(f) and Smin(f), and G(f) is calculated by dividing Hav(f) by the value obtained by subtracting N(f) from Sav(f).
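The division that produces G(f) can be sketched as follows. The text does not say what happens when Sav(f) minus N(f) is zero, which occurs whenever a bin has the same intensity in every frame; leaving such bins unmodified with a gain of 1.0 is an assumption made here, and the function name `modulation_spectrum` and the `floor` threshold are hypothetical.

```python
def modulation_spectrum(h_av, s_av, noise, floor=1e-12):
    """Return G(f) = Hav(f) / (Sav(f) - N(f)) for each bin f."""
    gain = []
    for h, s, n in zip(h_av, s_av, noise):
        denominator = s - n
        if denominator <= floor:
            # Assumption: no usable speech energy in this bin, so leave it as is.
            gain.append(1.0)
        else:
            gain.append(h / denominator)
    return gain
```

A gain above 1 boosts the bins that the partition attenuated most, which matches the role of G(f) as a correction toward the reference spectrum.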

<3.1. Device configuration>
The recorded voice clarification device according to the present invention is described concretely below. FIG. 5 is a hardware configuration diagram of a recorded voice clarification device according to an embodiment of the present invention. The device can be realized with a general-purpose computer and, as shown in FIG. 5, comprises: a CPU (Central Processing Unit) 1; a RAM (Random Access Memory) 2, which is the main memory of the computer; a large-capacity storage device 3 (for example, a hard disk or flash memory) for storing the programs executed by the CPU 1 and data; a key input I/F (interface) 4 for a keyboard, mouse, and the like; a portable storage device 5 that accepts an external device (such as a data storage medium) or a removable storage medium used in a voice recorder, such as an SD memory card, Memory Stick, or CD, in order to transfer recorded speech to the storage device 3; a display output I/F (interface) 6 for sending information to a display device; and a USB I/F 7 to which a voice recorder with a USB memory function is attached directly, or to which a voice recorder is connected via a USB cable, in order to transfer recorded speech to the storage device 3. These components are connected to one another via a bus. An audio input/output I/F 8 located outside the general-purpose computer is also connected to the USB I/F 7 via a USB cable, and a microphone 9a for audio input and a speaker 9b for audio output are connected to the audio input/output I/F 8 via an analog audio signal cable or an optical digital audio cable. Although the figure shows the audio input/output I/F 8 located outside the general-purpose computer, it is also common to connect it directly to the computer's internal bus, without going through the USB I/F 7, and to place it inside the computer. When accuracy is required for audio measurement, as in this embodiment, it is nevertheless preferable to place the audio input/output I/F 8 outside the computer so that it is not affected by noise from mechanical vibration, such as that produced by the hard disk of the storage device 3.

FIG. 6 is a functional block diagram showing the configuration of the recorded voice clarification device according to this embodiment. In FIG. 6, 10 is the target speech analysis means, 20 is the noise component spectrum creation means, 30 is the reference speech analysis means, 40 is the modulation component spectrum creation means, 50 is the audio signal correction means, 55 is the parameter setting means, 60 is the storage means, 61 is the target audio signal storage unit, 62 is the reference audio signal storage unit, and 63 is the corrected audio signal storage unit. The target audio signal and the reference audio signal recorded on a voice recorder are loaded into the target audio signal storage unit 61 and the reference audio signal storage unit 62 via the portable storage device 5 or the USB I/F 7 of FIG. 5. The device shown in FIG. 6 basically handles monaural audio signals. When the target is a stereo audio signal, the sum of the channels is used and the signal is processed as a monaural audio signal.
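The stereo handling mentioned above amounts to summing the channels sample by sample, as in this minimal sketch. The function name `to_mono` and the list-of-channels layout are hypothetical; the text states only that the summed value of the channels is used.

```python
def to_mono(channels):
    """Sum equally long channels sample by sample into one monaural signal."""
    return [sum(samples) for samples in zip(*channels)]
```

For example, `to_mono([left, right])` yields one sample per stereo sample pair. Note that summing rather than averaging can exceed the original sample range, so a fixed-point implementation would need headroom.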

The target speech analysis means 10 reads the target audio signal to be clarified, performs frequency analysis such as a Fourier transform, and converts the signal from the time dimension to the frequency dimension to generate a complex spectrum. The noise component spectrum creation means 20 creates the noise component spectrum N(f) from the spectrum generated by the target speech analysis means 10. The reference speech analysis means 30 reads the reference audio signal, performs frequency analysis such as a Fourier transform, and converts the signal from the time dimension to the frequency dimension to generate a complex spectrum. Here, the reference audio signal is an audio signal recorded under almost the same conditions and at almost the same time as the target audio signal, which can be heard clearly without correction and serves as the target (model) of the correction. The speaker, conversation content, and recording length of the reference audio signal may be entirely different from those of the target audio signal, but it is desirable that, as far as possible, the signal be recorded with the same model of voice recorder, at the same time, and at the same location with the partition removed, as in FIG. 2. If it is difficult to prepare a reference audio signal under these conditions, clear conversation by a suitable speaker may instead be recorded with a voice recorder of similar specifications in an environment close to the recording location (an environment such as a music recording studio is unrealistic and unsuitable). The modulation component spectrum creation means 40 creates the modulation component spectrum G(f) based on the spectrum generated by the target speech analysis means 10, the spectrum generated by the reference speech analysis means 30, and the noise component spectrum N(f). The audio signal correction means 50 subtracts the noise component spectrum N(f) at the predetermined ratio α (0 ≤ α ≤ 1) set as a parameter, multiplies the subtracted value by the created modulation component spectrum G(f) at the predetermined ratio β (0 ≤ β ≤ 1) set as a parameter, and applies an inverse transform such as an inverse Fourier transform to the multiplied value to convert it from the frequency dimension back to the time dimension, thereby creating a corrected audio signal in which the target audio signal has been clarified. The parameter setting means 55 sets the parameters α and β used by the noise component spectrum creation means 20 and the modulation component spectrum creation means 40, and is realized by an input device such as a mouse or keyboard together with the key input I/F 4.

The storage means 60 has the target audio signal storage unit 61, which stores the target audio signal to be clarified, the reference audio signal storage unit 62, which stores the reference audio signal, and the corrected audio signal storage unit 63, which stores the corrected audio signal. It also stores other data and programs needed for processing. The target audio signal is obtained by recording under the propagation path model of FIG. 1, in which sound passes through the partition. The reference audio signal is obtained by recording under the propagation path model of FIG. 2, in which sound does not pass through the partition. The target audio signal and the reference audio signal are recorded under exactly the same conditions except for the presence or absence of the partition.

In practice, each of the means shown in FIG. 6 is realized by installing a dedicated program on hardware such as a computer and its peripheral devices, as shown in FIG. 5. That is, the computer carries out the function of each means in accordance with the dedicated program.

The storage device 3 of FIG. 5 holds a dedicated program that operates the CPU 1 and causes the computer to function as the recorded-voice clarification device. By executing this program, the CPU 1 realizes the functions of the target speech analysis means 10, the noise component spectrum creation means 20, the reference speech analysis means 30, the modulation component spectrum creation means 40, and the audio signal correction means 50. The storage device 3 functions as the storage means 60, which includes the target audio signal storage unit 61, the reference audio signal storage unit 62, and the corrected audio signal storage unit 63.

<3.2. Processing operation>
Next, the processing operation of the recorded-voice clarification device shown in FIGS. 5 and 6 is described with reference to the flowchart of FIG. 7. First, the target speech analysis means 10 reads the target audio signal from the target audio signal storage unit 61 and performs frequency analysis on it to convert it to the frequency dimension (step S1). Specifically, the target speech analysis means 10 first reads a predetermined number N of samples as one frame from the target audio signal S(i) stored in the target audio signal storage unit 61 (i is a serial number assigned to every sample: i = 0, 1, 2, ...). The number of samples N per frame processed by the device can be set as appropriate. In this embodiment, N = 4096 for a sampling frequency Fs = 44100 Hz. Samples are therefore read sequentially, 4096 at a time, as one frame.

When the samples are read, all of them may be assigned to frames, but in this embodiment only samples in sections judged to contain speech are assigned to frames. A section judged to contain speech is what remains after the non-speech sections are excluded. A non-speech section is either a near-silent section, in which samples whose signal value is below a predetermined level continue for a predetermined number of samples (a predetermined time), or a section in which the signal value reaches the predetermined level but the operator, on listening, can hear only noise. Accordingly, when the target speech analysis means 10 reads a predetermined number of consecutive samples whose signal values are below the predetermined level, it excludes those samples from the frames. The predetermined level can be set as appropriate with regard to the level judged to be silence, and the predetermined number of consecutive samples can be set as appropriate with regard to the length of a section judged to be silent. After the silent sections are removed, the operator listens to the whole signal and manually removes the noise-only sections in which no vowel or consonant components of human conversational speech can be heard. As a result, only sections containing speech are assigned to frames.
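The automatic part of this step (dropping runs of low-level samples) can be sketched as follows. This is only an illustration under stated assumptions: the function name `drop_silent_runs` and the parameters `level` and `min_run` are hypothetical stand-ins for the "predetermined level" and "predetermined number" in the text, and the operator's manual removal of noise-only sections is not modeled.

```python
def drop_silent_runs(samples, level, min_run):
    """Remove every run of at least `min_run` consecutive samples whose
    absolute value is below `level`; shorter quiet runs are kept."""
    kept = []
    run = []  # current run of below-level samples
    for x in samples:
        if abs(x) < level:
            run.append(x)
        else:
            if len(run) < min_run:
                kept.extend(run)  # too short to count as a silent section
            run = []
            kept.append(x)
    if len(run) < min_run:
        kept.extend(run)  # trailing quiet run that is too short to drop
    return kept


signal = [0.5, 0.0, 0.01, 0.0, 0.4, 0.0, -0.3]
print(drop_silent_runs(signal, level=0.05, min_run=3))
```

In the toy signal, the three quiet samples after 0.5 form a run long enough to be dropped, while the single quiet sample before -0.3 is kept.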

In this embodiment, odd-numbered and even-numbered frames are set so that they overlap each other by a predetermined number of samples (N/2 = 2048 in this embodiment). If the odd-numbered frames are denoted A1, A2, A3, ... and the even-numbered frames B1, B2, B3, ... from the beginning, A1 covers samples 1 to 4096, A2 covers samples 4097 to 8192, A3 covers samples 8193 to 12288, B1 covers samples 2049 to 6144, B2 covers samples 6145 to 10240, and B3 covers samples 10241 to 14336. Processing may start from the even-numbered frames, but the description below takes processing from the odd-numbered frames as an example. The number of samples shared between odd-numbered and even-numbered frames can be set as appropriate, and may also be zero.
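The frame layout above can be checked with a short sketch. The helper `frame_ranges` is a hypothetical name introduced for illustration; it lists 1-based inclusive sample ranges for full frames only and assumes the overlap is smaller than the frame length.

```python
def frame_ranges(total_samples, n=4096, overlap=2048):
    """Return (first, last) 1-based inclusive sample ranges of full frames."""
    hop = n - overlap
    if hop <= 0:
        raise ValueError("overlap must be smaller than the frame length")
    ranges = []
    start = 0
    while start + n <= total_samples:
        ranges.append((start + 1, start + n))
        start += hop
    return ranges


ranges = frame_ranges(14336)
odd_frames = ranges[0::2]   # A1, A2, A3
even_frames = ranges[1::2]  # B1, B2, B3
print(odd_frames)
print(even_frames)
```

With an overlap of 0 the same helper yields back-to-back frames, which corresponds to the non-overlapping case mentioned in the text.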

Let t be the sample number within each frame of N samples and τ the frame number. The target audio signal S(i) that was read is then converted into a set of Ts target speech frames s(t, τ) (t = 0, ..., N−1; τ = 0, ..., Ts−1).

Next, the target speech analysis means 10 performs frequency analysis on each frame to obtain a complex spectrum for each frame. The frequency analysis converts the frame from the time dimension to the frequency dimension and is performed using a window function. Various known methods such as the Fourier transform or the wavelet transform can be used, provided that the method yields a complex spectrum. This embodiment is described using the Fourier transform as an example.

In general, a Fourier transform must be applied to a signal that has been divided into segments of a predetermined length. If the transform is applied directly to such a segment, spurious harmonic components arise. For this reason, a Fourier transform is usually preceded by a window function called a Hanning window: the signal values are weighted by the window, and the transform is then applied to the weighted values.

This embodiment also uses a Hanning window function W(t). W(t) is set to take its maximum value 1 at the central sample number N/2 and its minimum value 0 at sample number 0 or N−1 near both ends. Which sample number gives the maximum depends on the design of W(t); in this embodiment W(t) is defined by [Formula 1] below. The Fourier transform of a frame is applied to the frame multiplied by this Hanning window function W(t).

As described above, frames are read with overlap in this embodiment; that is, odd-numbered and even-numbered frames share a predetermined number of samples. In this embodiment, the Hanning window function W(t) is defined by [Formula 1] below.

[Formula 1]
$$W(t) = 0.5 - 0.5\cos\!\left(\frac{2\pi t}{N}\right), \qquad 0 \le t \le N-1$$

In this embodiment, odd-numbered and even-numbered acoustic frames are read with an overlap of a predetermined number of samples. Therefore, when the corrected frames are restored to a time-series audio signal, adding the overlapping samples of the windowed odd-numbered frames and the windowed even-numbered frames must return approximately the original values. For this reason, W(t) is defined so that, in the portion where an odd-numbered frame and an even-numbered frame overlap, the sum of the two window functions is a constant 1 for every sample.
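The constant-sum property can be verified numerically for the window of [Formula 1] and a shift of N/2. The sketch below is an illustration only; the function name `hanning` is an assumption, not a name from the source.

```python
import math


def hanning(t, n):
    """Hanning window of [Formula 1]: W(t) = 0.5 - 0.5*cos(2*pi*t/N)."""
    return 0.5 - 0.5 * math.cos(2.0 * math.pi * t / n)


n = 4096
# In the overlap, sample t + N/2 of one frame coincides with sample t
# of the next frame, so the two window values are added together.
max_deviation = max(
    abs(hanning(t, n) + hanning(t + n // 2, n) - 1.0) for t in range(n // 2)
)
print(max_deviation < 1e-12)
```

The identity follows from cos(x + π) = −cos(x): the two cosine terms cancel and the two constants 0.5 add up to 1.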

When the target speech analysis means 10 performs the Fourier transform on the odd-numbered and even-numbered frames, it processes each target speech frame s(t, τ) (t = 0, ..., N−1; τ = 0, ..., Ts−1) with the window function W(t) in accordance with [Formula 2] below, obtaining the real part Sr(f, τ) and the imaginary part Si(f, τ) of the transformed data.

[Formula 2]
$$S_r(f,\tau) = \sum_{t=0}^{N-1} W(t)\, s(t,\tau)\, \cos(2\pi f t / N)$$
$$S_i(f,\tau) = \sum_{t=0}^{N-1} W(t)\, s(t,\tau)\, \sin(2\pi f t / N)$$

In [Formula 2], t is the serial number assigned to the N samples within the τ-th of the Ts frames and takes the integer values t = 0, 1, 2, ..., N−1. τ takes the integer values τ = 0, 1, 2, ..., Ts−1. f is the frequency multiplied by N/Fs, a serial number assigned in ascending order, and takes the integer values f = 0, 1, 2, ..., N/2 (Si(f, τ), however, has values only in the range f = 0, ..., N/2−1). With a sampling frequency Fs = 44100 Hz and N = 4096, a difference of one in f corresponds to a frequency difference of about 10.8 Hz. The variable f is a value based on frequency; in this embodiment it is proportional to frequency.
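As a rough illustration of [Formula 1] and [Formula 2], the sketch below evaluates the windowed sums directly on a 16-sample toy frame and computes the bin spacing Fs/N. The function names are assumptions made for this example, and the direct O(N²) sum is chosen for readability; a practical implementation would normally use an FFT.

```python
import math


def windowed_dft(frame):
    """Apply the Hanning window of [Formula 1], then the sums of [Formula 2].

    Returns (sr, si), each indexed by f = 0 .. N/2.
    """
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * t / n) for t in range(n)]
    sr, si = [], []
    for f in range(n // 2 + 1):
        angle = [2.0 * math.pi * f * t / n for t in range(n)]
        sr.append(sum(window[t] * frame[t] * math.cos(angle[t]) for t in range(n)))
        si.append(sum(window[t] * frame[t] * math.sin(angle[t]) for t in range(n)))
    return sr, si


def bin_spacing_hz(fs, n):
    """Frequency difference between adjacent values of f."""
    return fs / n


# A cosine with exactly two cycles in a 16-sample toy frame.
toy = [math.cos(2.0 * math.pi * 2 * t / 16) for t in range(16)]
sr, si = windowed_dft(toy)
print(round(bin_spacing_hz(44100, 4096), 2))  # 10.77
```

Because of the window, the energy of the toy cosine at f = 2 spreads into the neighbouring bins f = 1 and f = 3, which is the expected behaviour of a Hanning window.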

Executing [Formula 2] yields a complex spectrum for each windowed frame. The target speech analysis means 10 then uses the obtained Sr(f, τ) and Si(f, τ) to execute [Formula 3] below and calculate the target speech intensity spectrum S(f, τ).

[Formula 3]
$$S(f,\tau) = \left\{ S_r(f,\tau)^2 + S_i(f,\tau)^2 \right\}^{1/2}$$

Furthermore, the target speech analysis means 10 uses the calculated target speech intensity spectrum S(f, τ) to execute [Formula 4] below. This yields the target speech minimum spectrum Smin(f), which is the minimum of S(f, τ) over τ = 0, 1, 2, ..., Ts−1, and the target speech average spectrum Sav(f), which is the average over the same range.

[Formula 4]
$$S_{\min}(f) = \min_{\tau = 0,\dots,T_s-1} S(f,\tau)$$
$$S_{av}(f) = \frac{1}{T_s} \sum_{\tau=0}^{T_s-1} S(f,\tau)$$

In [Formula 4], the MIN term denotes the smallest value of S(f, τ) as τ varies from 0 to Ts−1. The Σ term is the sum of S(f, τ) as τ varies from 0 to Ts−1, so Sav(f) is the average of S(f, τ) over all τ from 0 to Ts−1.
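[Formula 3] and [Formula 4] reduce to a per-bin magnitude followed by a per-bin minimum and mean across frames. A minimal sketch, with hypothetical function names and a nested-list layout `frames_intensity[tau][f]` assumed for illustration:

```python
import math


def intensity(sr, si):
    """[Formula 3]: magnitude of each complex bin."""
    return [math.hypot(r, i) for r, i in zip(sr, si)]


def min_and_average(frames_intensity):
    """[Formula 4]: per-bin minimum and mean over all frames.

    frames_intensity[tau][f] holds S(f, tau).
    """
    ts = len(frames_intensity)
    bins = range(len(frames_intensity[0]))
    smin = [min(frame[f] for frame in frames_intensity) for f in bins]
    sav = [sum(frame[f] for frame in frames_intensity) / ts for f in bins]
    return smin, sav


print(intensity([3.0, 0.0], [4.0, 2.0]))             # [5.0, 2.0]
print(min_and_average([[1.0, 4.0], [3.0, 2.0]]))     # ([1.0, 2.0], [2.0, 3.0])
```

The same `min_and_average` helper can be applied to the reference frames when only the average is needed.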

Next, the reference speech analysis means 30 reads the reference audio signal from the reference audio signal storage unit 62 and performs frequency analysis on it to convert it to the frequency dimension (step S2). Specifically, the reference speech analysis means 30 first reads a predetermined number N of samples as one frame from the reference audio signal stored in the reference audio signal storage unit 62. The number of samples N per frame can be set as appropriate. In this embodiment, N = 4096 for a sampling frequency Fs = 44100 Hz, so samples are read sequentially, 4096 at a time, as one frame. The reference speech analysis means 30 basically processes the signal in the same way as the target speech analysis means 10 does when it reads the target audio signal and sets up the frames.

The reference audio signal has been edited in advance so that the recording contains no silent or non-speech sections. The reference speech analysis means 30 therefore does not perform the silent-section judgment carried out by the target speech analysis means 10, and reads every sample of the reference audio signal into the frames. As in the target speech analysis means 10, odd-numbered and even-numbered frames are set so that they overlap each other by a predetermined number of samples (N/2 = 2048 in this embodiment).

Next, like the target speech analysis means 10, the reference speech analysis means 30 performs frequency analysis on each frame to obtain a complex spectrum for each frame, converting from the time dimension to the frequency dimension. Here too, the reference speech analysis means 30 uses the Hanning window function W(t) of [Formula 1]. Various known methods such as the Fourier transform or the wavelet transform can be used, provided that the method yields a complex spectrum. This embodiment is described using the Fourier transform as an example.

When the reference speech analysis means 30 performs the Fourier transform on the odd-numbered and even-numbered frames, it processes each reference speech frame h(t, τ) (t = 0, ..., N−1; τ = 0, ..., Th−1) with the window function W(t) in accordance with [Formula 5] below, obtaining the real part Hr(f, τ) and the imaginary part Hi(f, τ) of the transformed data.

[Formula 5]
$$H_r(f,\tau) = \sum_{t=0}^{N-1} W(t)\, h(t,\tau)\, \cos(2\pi f t / N)$$
$$H_i(f,\tau) = \sum_{t=0}^{N-1} W(t)\, h(t,\tau)\, \sin(2\pi f t / N)$$

In [Formula 5], t is the serial number assigned to the N samples within the τ-th of the Th frames and takes the integer values t = 0, 1, 2, ..., N−1. τ takes the integer values τ = 0, 1, 2, ..., Th−1. f is proportional to frequency, is a serial number assigned in ascending order, and takes the integer values f = 0, 1, 2, ..., N/2 (Hi(f, τ), however, has values only in the range f = 0, ..., N/2−1).

Executing [Formula 5] yields a complex spectrum for each windowed frame. The reference speech analysis means 30 then uses the obtained Hr(f, τ) and Hi(f, τ) to execute [Formula 6] below and calculate the reference speech intensity spectrum H(f, τ).

[Formula 6]
$$H(f,\tau) = \left\{ H_r(f,\tau)^2 + H_i(f,\tau)^2 \right\}^{1/2}$$

Furthermore, the reference speech analysis means 30 uses the calculated reference speech intensity spectrum H(f, τ) to execute [Formula 7] below and calculate the reference speech average spectrum Hav(f), which is the average of H(f, τ) over τ = 0, 1, 2, ..., Th−1.

[Formula 7]
$$H_{av}(f) = \frac{1}{T_h} \sum_{\tau=0}^{T_h-1} H(f,\tau)$$

In [Formula 7], the Σ term is the sum of H(f, τ) as τ varies from 0 to Th−1, so Hav(f) is the average of H(f, τ) over all τ from 0 to Th−1.

Next, the noise component spectrum creation means 20 creates the noise component spectrum (step S3). The method differs depending on whether silent sections were excluded from the target audio signal in step S1. When silent sections were excluded, the noise component spectrum creation means 20 executes [Formula 8] below for each f from f1 to f2 inclusive (0 ≤ f1 < f2 ≤ N/2−1) to calculate the noise component spectrum N(f).

[Formula 8]
$$N(f) = S_{av}(f)$$

In [Formula 8], Sav(f) is the target speech average spectrum calculated by the target speech analysis means 10 in step S1. When silent sections were excluded from the target audio signal in step S1, the noise component spectrum N(f) is, as [Formula 8] shows, the target speech average spectrum Sav(f) itself. When silent sections were not excluded in step S1, the noise component spectrum creation means 20 instead executes [Formula 9] below for each f from f1 to f2 inclusive (0 ≤ f1 < f2 ≤ N/2−1) to calculate N(f).

[Formula 9]
$$N(f) = \frac{S_{\min}(f) + S_{av}(f)}{2}$$

In [Formula 9], Smin(f) is the target speech minimum spectrum calculated by the target speech analysis means 10 in step S1. When silent sections were not excluded from the target audio signal in step S1, the noise component spectrum N(f) is, as [Formula 9] shows, the mean of the target speech minimum spectrum Smin(f) and the target speech average spectrum Sav(f).
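The two cases of step S3 can be written as one small function. This is a sketch only; the function name and the boolean flag `silence_removed` are assumptions introduced to select between [Formula 8] and [Formula 9].

```python
def noise_spectrum(smin, sav, silence_removed):
    """Step S3: noise component spectrum N(f) for the bins supplied.

    silence_removed=True  -> [Formula 8]: N(f) = Sav(f)
    silence_removed=False -> [Formula 9]: N(f) = (Smin(f) + Sav(f)) / 2
    """
    if silence_removed:
        return list(sav)
    return [(lo + av) / 2.0 for lo, av in zip(smin, sav)]


print(noise_spectrum([1.0, 2.0], [3.0, 4.0], silence_removed=False))  # [2.0, 3.0]
print(noise_spectrum([1.0, 2.0], [3.0, 4.0], silence_removed=True))   # [3.0, 4.0]
```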

The range f1 to f2 over which the noise component spectrum creation means 20 calculates N(f) is preferably the range in which the voice band is concentrated. In this embodiment, f1 = 200N/Fs and f2 = 6000N/Fs are therefore set so that the calculation range of N(f) is 200 Hz to 6000 Hz. Restricting the calculation range to the voice band excludes low-frequency and high-frequency noise outside that band.
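For the values in this embodiment, f1 = 200N/Fs and f2 = 6000N/Fs are not integers, so some rounding is needed to obtain bin indices. The source does not state the rounding rule; the sketch below assumes rounding inward (ceiling for f1, floor for f2) and clamps f2 to N/2−1. The function name `band_bins` is hypothetical.

```python
import math


def band_bins(fs, n, low_hz=200.0, high_hz=6000.0):
    """Bin indices f1, f2 covering low_hz..high_hz (inward rounding assumed)."""
    f1 = math.ceil(low_hz * n / fs)
    f2 = min(math.floor(high_hz * n / fs), n // 2 - 1)
    return f1, f2


print(band_bins(44100, 4096))  # (19, 557)
```

With a bin spacing of about 10.8 Hz, bins 19 to 557 span roughly 205 Hz to 5997 Hz.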

Next, the modulation component spectrum creation means 40 creates the modulation component spectrum (step S4). Specifically, it executes [Formula 10] below for each f from f1 to f2 inclusive (0 ≤ f1 < f2 ≤ N/2−1) to calculate the modulation component spectrum G(f).

[Formula 10]
$$G(f) = \frac{H_{av}(f)}{S_{av}(f) - N(f)}$$

In [Formula 10], Hav(f) is the reference speech average spectrum calculated by the reference speech analysis means 30 in step S2. As [Formula 10] shows, the modulation component spectrum G(f) is obtained by dividing Hav(f) by the difference between the target speech average spectrum Sav(f), calculated by the target speech analysis means 10 in step S1, and the noise component spectrum N(f).
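A direct evaluation of [Formula 10] has to deal with bins where Sav(f) − N(f) is zero or negative, which happens in every bin if N(f) is set equal to Sav(f) as in [Formula 8]. The source does not say how such bins are handled, so the guard below (returning a gain of 0) is purely an assumption of this sketch, as are the function name and the `eps` threshold.

```python
def modulation_spectrum(hav, sav, noise, eps=1e-12):
    """[Formula 10]: G(f) = Hav(f) / (Sav(f) - N(f)) per bin.

    Bins whose denominator is not above `eps` get G(f) = 0.0;
    this guard is an assumption, not part of the source text.
    """
    gains = []
    for h, s, n in zip(hav, sav, noise):
        denom = s - n
        gains.append(h / denom if denom > eps else 0.0)
    return gains


print(modulation_spectrum([6.0, 1.0], [5.0, 2.0], [2.0, 2.0]))  # [2.0, 0.0]
```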

Next, the audio signal correction means 50 removes the noise component (step S5). Specifically, it first executes [Formula 11] below for each f from f1 to f2 inclusive (0 ≤ f1 < f2 ≤ N/2−1) to calculate the noise-removed spectrum S′(f, τ).

[Formula 11]
$$S'(f,\tau) = S(f,\tau) - \alpha \cdot N(f)$$
However, when S′(f, τ) < 0, S′(f, τ) is set to 0.

In [Formula 11], S(f, τ) is the target speech intensity spectrum calculated by the target speech analysis means 10 in step S1, and α is a correction coefficient set by the parameter setting means 55, a real value with 0 ≤ α ≤ 1. As [Formula 11] shows, the noise-removed spectrum S′(f, τ) is obtained by multiplying the noise component spectrum N(f), created by the noise component spectrum creation means 20 in step S3, by the correction coefficient α and subtracting the product from the target speech intensity spectrum S(f, τ).
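[Formula 11] is a per-bin spectral subtraction with clamping at zero. A minimal sketch for one frame, with a hypothetical function name:

```python
def subtract_noise(s_frame, noise, alpha):
    """[Formula 11]: S'(f) = S(f) - alpha * N(f), clamped at 0, for one frame."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha <= 1")
    return [max(s - alpha * n, 0.0) for s, n in zip(s_frame, noise)]


print(subtract_noise([5.0, 1.0], [2.0, 4.0], alpha=0.5))  # [4.0, 0.0]
```

With α = 0 the frame is returned unchanged, and with α = 1 the full noise estimate is subtracted.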

Next, the audio signal correction means 50 performs the modulation processing (step S6). Specifically, it executes [Formula 12] below for each f from f1 to f2 inclusive (0 ≤ f1 < f2 ≤ N/2−1) to calculate the corrected speech intensity spectrum C(f, τ).

[Formula 12]
$$C(f,\tau) = S'(f,\tau) \cdot G(f) \cdot \beta$$

In [Formula 12], S′(f, τ) is the noise-removed spectrum calculated by the audio signal correction means 50 in accordance with [Formula 11], G(f) is the modulation component spectrum calculated by the modulation component spectrum creation means 40 in step S4, and β is a correction coefficient set by the parameter setting means 55, a real value with 0 ≤ β ≤ 1. As [Formula 12] shows, the corrected speech intensity spectrum C(f, τ) is the product of the noise-removed spectrum S′(f, τ), the modulation component spectrum G(f), and the correction coefficient β.

Because the result must later be converted back to the time dimension, the scalar corrected speech intensity spectrum C(f, τ) calculated by [Formula 12] is assumed to have the same phase as the spectrum of the target audio signal. On this assumption, the audio signal correction means 50 executes [Formula 13] below for each f from f1 to f2 inclusive (0 ≤ f1 < f2 ≤ N/2−1), converting the scalar C(f, τ) into the complex-valued corrected complex spectrum Cr(f, τ), Ci(f, τ).

[Formula 13]
$$C_r(f,\tau) = S_r(f,\tau) \cdot \frac{C(f,\tau)}{S(f,\tau)}$$
$$C_i(f,\tau) = S_i(f,\tau) \cdot \frac{C(f,\tau)}{S(f,\tau)}$$

As [Formula 13] shows, the corrected complex spectrum Cr(f, τ), Ci(f, τ) is obtained by multiplying the real part Sr(f, τ) and the imaginary part Si(f, τ), calculated by the target speech analysis means 10 in step S1, by the intensity ratio C(f, τ)/S(f, τ), that is, the corrected speech intensity spectrum divided by the target speech intensity spectrum.
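[Formula 12] and [Formula 13] together scale each complex bin by a real factor, which changes the magnitude and leaves the phase untouched. The sketch below handles one frame. The function name is hypothetical, and the treatment of bins with S(f, τ) = 0 (output set to 0) is an assumption, since the source does not address division by zero.

```python
def apply_gain_keep_phase(sr, si, s, s_clean, g, beta):
    """Steps S6 and the phase step for one frame.

    sr, si  : real and imaginary parts from [Formula 2]
    s       : intensity S from [Formula 3]
    s_clean : noise-removed intensity S' from [Formula 11]
    g       : modulation component spectrum G from [Formula 10]
    """
    cr, ci = [], []
    for r, i, mag, clean, gain in zip(sr, si, s, s_clean, g):
        c = clean * gain * beta                # [Formula 12]
        ratio = c / mag if mag > 0.0 else 0.0  # zero guard is an assumption
        cr.append(r * ratio)                   # [Formula 13]
        ci.append(i * ratio)
    return cr, ci


print(apply_gain_keep_phase([3.0], [4.0], [5.0], [2.5], [2.0], beta=0.5))
# ([1.5], [2.0])
```

In the example the bin (3, 4) has magnitude 5, the corrected magnitude is 2.5, and the output (1.5, 2.0) keeps the same direction in the complex plane.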

Once the corrected complex spectrum Cr(f, τ), Ci(f, τ) has been obtained, the audio signal correction means 50 converts it back to the time dimension, restoring the original time-series format, to create the corrected audio signal (step S7). This inverse conversion must correspond to the method used by the target speech analysis means 10. In this embodiment the target speech analysis means 10 applies a Fourier transform, so the audio signal correction means 50 applies an inverse Fourier transform.

Specifically, for each frame, the audio signal correcting means 50 uses the real part Cr(f, τ) and the imaginary part Ci(f, τ) of the corrected complex spectrum to perform processing according to [Formula 14] below and calculates the corrected audio signal c(t, τ).

[Formula 14]
c(t, τ) = 1/N · {Σ_f Cr(f, τ) · cos(2πft/N) − Σ_f Ci(f, τ) · sin(2πft/N)} + c(t + N/2, τ−1)

In [Formula 14] above, Σ_{f=0,…,N/2} is abbreviated as Σ_f to keep the expression from becoming cumbersome. The term "+c(t + N/2, τ−1)" in [Formula 14] adds the data of the immediately preceding frame, c(t, τ−1), when that data exists, to account for the overlap of N/2 samples on the time axis. [Formula 14] thus yields the corrected audio signal c(t, τ). Because c(t, τ) is a frame-based representation, changing the sample number from t within the frame to i over the whole signal (i = τ × N/2 + t) allows the corrected audio signal to be written as c(i). The audio signal correcting means 50 stores the obtained corrected audio signal in the corrected audio signal storage unit 63.
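The per-frame inverse transform and the N/2-sample overlap described for [Formula 14] can be sketched in Python as follows. This is an illustrative sketch, not the implementation of the embodiment: the function names `inverse_frame` and `overlap_add` are hypothetical, frames are assumed to be plain lists, and the recursive term c(t + N/2, τ−1) is realized by accumulating each frame into a single output buffer at index i = τ × N/2 + t, which adds the same pairs of samples.

```python
import math

def inverse_frame(cr, ci, n):
    """Evaluate the bracketed sums of [Formula 14], scaled by 1/N, for t = 0..N-1.

    cr, ci hold the bins f = 0..len(cr)-1 of one frame.
    """
    frame = []
    for t in range(n):
        acc = 0.0
        for f in range(len(cr)):
            angle = 2.0 * math.pi * f * t / n
            acc += cr[f] * math.cos(angle) - ci[f] * math.sin(angle)
        frame.append(acc / n)
    return frame

def overlap_add(frames, n):
    """Sum frames that overlap by N/2 samples, using i = tau * N/2 + t."""
    if not frames:
        return []
    hop = n // 2
    out = [0.0] * (hop * (len(frames) - 1) + n)
    for tau, frame in enumerate(frames):
        for t, value in enumerate(frame):
            out[tau * hop + t] += value  # same sample pairs as "+ c(t + N/2, tau - 1)"
    return out

# Two identical frames of length N = 4 containing only a bin at f = 0.
frame = inverse_frame([4.0, 0.0, 0.0], [0.0, 0.0, 0.0], 4)
signal = overlap_add([frame, frame], 4)
```

In this small example every sample of `frame` is 1.0, so the two overlapping middle samples of `signal` sum to 2.0 while the first and last N/2 samples stay at 1.0.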

The intelligibility can be checked by reproducing the corrected audio signal on a playback device and having a person listen to it. Comparing the corrected audio signal c(i) with the original target audio signal s(i) by ear shows that the corrected audio signal c(i) is clearer than the target audio signal s(i). When the corrected audio signal c(i) is created with the coefficients α and β varied stepwise by the parameter setting means 55, it can be confirmed that the corrected audio signal c(i) is clarified stepwise in accordance with the coefficients α and β.

<4. Experimental example>
FIGS. 8 to 14 show waveforms of the audio signals, spectra, and the like processed by the apparatus for clarifying recorded speech according to the above embodiment. FIG. 8 shows the waveform of the target audio signal s(i); the horizontal axis is time and the vertical axis is amplitude. FIG. 9 shows the waveform of the reference audio signal h(i); the horizontal axis is time and the vertical axis is amplitude. FIG. 10 shows the target speech average value spectrum Sav(f) and the reference speech average value spectrum Hav(f); the horizontal axis is frequency and the vertical axis is energy. FIG. 11 shows the modulation spectrum G(f); the horizontal axis is frequency and the vertical axis is modulation intensity. FIG. 12 shows the noise component spectrum N(f); the horizontal axis is frequency and the vertical axis is energy. FIG. 13 shows the waveform of the corrected audio signal c(i); the horizontal axis is time and the vertical axis is amplitude. FIG. 14 shows the target speech average value spectrum Sav(f) and the corrected speech average value spectrum Cav(f); the horizontal axis is frequency and the vertical axis is energy. In the above embodiment, the corrected speech average value spectrum Cav(f) is not calculated explicitly (the corrected spectrum is calculated as complex values and therefore cannot be plotted directly); for FIG. 14 it was calculated deliberately for comparison with the target speech average value spectrum Sav(f).

A preferred embodiment of the present invention has been described above, but the present invention is not limited to that embodiment, and various modifications are possible. For example, in the above embodiment the frequency range in which correction is substantially performed is 200 Hz to 6000 Hz, but this range can be narrowed or widened as appropriate according to the frequency characteristics of the voice recorder. For example, when a voice recorder whose bandwidth is restricted to the telephone line band is used, the frequency range is limited to 300 Hz to 3400 Hz.
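As a rough illustration of how such a frequency range might map onto the limits f1 and f2, the following Python sketch assumes that the variable f is the bin index of an N-point Fourier transform, so that bin f corresponds to f × Fs / N Hz at sampling frequency Fs. The function name `band_to_bins` and the values Fs = 44100 Hz and N = 4096 are hypothetical and are not taken from this section; only the constraint 0 ≦ f1 < f2 ≦ N/2−1 comes from the text.

```python
import math

def band_to_bins(low_hz, high_hz, sample_rate, n):
    """Return (f1, f2): the bin indices that lie inside [low_hz, high_hz].

    Assumes bin f corresponds to f * sample_rate / n Hz and clips the
    result to 0 <= f1 < f2 <= n/2 - 1.
    """
    f1 = max(0, math.ceil(low_hz * n / sample_rate))
    f2 = min(n // 2 - 1, math.floor(high_hz * n / sample_rate))
    if f1 >= f2:
        raise ValueError("band too narrow for this frame length")
    return f1, f2

# Hypothetical settings: 44.1 kHz sampling and 4096-sample frames.
wide_band = band_to_bins(200.0, 6000.0, 44100, 4096)
telephone_band = band_to_bins(300.0, 3400.0, 44100, 4096)
```

Narrowing the band from 200 Hz to 6000 Hz down to 300 Hz to 3400 Hz simply raises f1 and lowers f2, so fewer bins are corrected.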

In the above embodiment, the variable f is a value proportional to frequency, but it may instead take a form that is logarithmic with respect to physical frequency, like a MIDI note number. Any variable other than a proportional or logarithmic one may also be used, as long as it changes in close association with changes in frequency.
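For reference, the logarithmic relation between physical frequency and a MIDI note number can be sketched in Python as follows. The sketch uses the standard convention that note number 69 corresponds to 440 Hz with 12 notes per octave; the function names `hz_to_midi_note` and `midi_note_to_hz` are hypothetical, and the text does not specify how such a variable would be quantized or used in the embodiment.

```python
import math

def hz_to_midi_note(freq_hz):
    """Map a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)

def midi_note_to_hz(note):
    """Map a MIDI note number back to a frequency in Hz."""
    return 440.0 * 2.0 ** ((note - 69.0) / 12.0)

# Doubling the frequency always adds 12 to the note number.
octave_step = hz_to_midi_note(880.0) - hz_to_midi_note(440.0)
```

With such a variable, equal steps in f correspond to equal frequency ratios rather than equal frequency differences.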

DESCRIPTION OF SYMBOLS
1 ... CPU (Central Processing Unit)
2 ... RAM (Random Access Memory)
3 ... Storage device
4 ... Key input I/F
5 ... Portable storage device
6 ... Display output I/F
7 ... USB-I/F
8 ... Audio input/output I/F
9a ... Microphone
9b ... Speaker
10 ... Target speech analysis means
20 ... Noise component spectrum creating means
30 ... Reference speech analysis means
40 ... Modulation component spectrum creating means
50 ... Audio signal correcting means
55 ... Parameter setting means
60 ... Storage means
61 ... Target audio signal storage unit
62 ... Reference audio signal storage unit
63 ... Corrected audio signal storage unit

Claims

An apparatus for clarifying recorded speech that improves the intelligibility of a target audio signal, which is obtained by recording and is to be corrected, by using a reference audio signal obtained by a separate recording, the apparatus comprising:
target speech analysis means for performing frequency analysis on the target audio signal in predetermined frame units along the time axis and calculating, where f is a variable based on frequency and τ denotes the τ-th frame, a target speech intensity spectrum S(f, τ) that is an intensity spectrum of the target audio signal, a target speech average value spectrum Sav(f) composed of the average over a plurality of frames for each variable f of the target audio signal, and a minimum value spectrum Smin(f) represented, for each variable f of the target audio signal, by the frame in which the intensity is minimum;
reference speech analysis means for performing frequency analysis on the reference audio signal in predetermined frame units along the time axis and calculating a reference speech average value spectrum Hav(f) composed of the average over a plurality of frames for each variable f of the reference audio signal;
noise component spectrum creating means for creating a noise component spectrum N(f) based on the average, for each corresponding variable f, of a value based on the minimum value spectrum Smin(f) and a value based on the target speech average value spectrum Sav(f);
modulation component spectrum creating means for creating a modulation component spectrum G(f) based on, for each variable f, the value obtained by dividing the reference speech average value spectrum Hav(f) by the value obtained by subtracting the noise component spectrum N(f) from the target speech average value spectrum Sav(f); and
audio signal correcting means for creating a corrected audio signal, in which the target audio signal is corrected, by, in each frame τ and for each variable f, subtracting the created noise component spectrum N(f) at a predetermined ratio α (0 ≦ α ≦ 1) from the target speech intensity spectrum S(f, τ), multiplying the subtracted value by the created modulation component spectrum G(f) at a predetermined ratio β (0 ≦ β ≦ 1), and converting the multiplied value of each frame τ into the time dimension.

The apparatus for clarifying recorded speech according to claim 1, wherein the audio signal correcting means, when subtracting the created noise component spectrum N(f) at the predetermined ratio α (0 ≦ α ≦ 1) from the target speech intensity spectrum S(f, τ) for each variable f in each frame τ, corrects the subtracted value to 0 if the subtracted value is negative.

The apparatus for clarifying recorded speech according to claim 1 or claim 2, wherein:
the noise component spectrum creating means defines the noise component spectrum N(f) in a predetermined frequency range with the variable f = f1 as a lower limit and the variable f = f2 as an upper limit;
the modulation component spectrum creating means defines the modulation component spectrum G(f) in the predetermined frequency range with the variable f = f1 as a lower limit and the variable f = f2 as an upper limit; and
the audio signal correcting means, within the predetermined frequency range with the variable f = f1 as a lower limit and the variable f = f2 as an upper limit, subtracts the created noise component spectrum N(f) at the predetermined ratio α from the target speech intensity spectrum S(f, τ) for each frame τ, and multiplies the subtracted value by the created modulation component spectrum G(f) at the predetermined ratio β.

A program for causing a computer to function as the apparatus for clarifying recorded speech according to any one of claims 1 to 3.