JP2007219188A

JP2007219188A - Consonant processing device, speech information transmission device, and consonant processing method

Info

Publication number: JP2007219188A
Application number: JP2006040187A
Authority: JP
Inventors: Yasuyoshi Nakajima; 祥好中島; Tatsuro Yasutake; 達朗安武
Original assignee: Kyushu University NUC
Current assignee: Kyushu University NUC
Priority date: 2006-02-17
Filing date: 2006-02-17
Publication date: 2007-08-30
Anticipated expiration: 2026-02-17
Also published as: JP4876245B2

Abstract

PROBLEM TO BE SOLVED: To provide an inexpensive consonant processing device, an inexpensive speech information transmission device, and a consonant processing method that transmit speech information in near real time, require only simple signal processing, and let even elderly or hearing-impaired listeners distinguish consonants in noisy environments.

SOLUTION: The device comprises: a frame division part 1 that extracts a frame signal for each of a plurality of time frames from an input speech signal; a power calculation part 2 that calculates an average power or a sound pressure level for each frame signal; a comparison part 3 that compares the average powers or sound pressure levels of the frame signals with one another; a consonant determination part 4 that determines, from the comparison result of the comparison part 3, whether the speech signal is a consonant or a syllable end point; and an amplification part that amplifies the amplification target point or amplification target width of the speech signal when the consonant determination part 4 determines that the speech signal is a consonant or a syllable end point, and does not amplify it otherwise.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a consonant processing device that transmits speech information in real time, uses simple signal processing, makes consonants or syllable end points easy to hear in noisy environments and for elderly or hearing-impaired listeners, and can be manufactured at low cost; to a speech information transmission device equipped with this device; and to a consonant processing method.

Elderly people with reduced hearing and hearing-impaired people are inevitably less able to catch spoken words than the general population. For them, it is difficult to hear accurately the content of speech in noisy public spaces, such as public-address announcements at airports, guidance announcements in trains and buses, and the voice prompts of vending machines and ATMs. Evacuation guidance in an emergency is a particular concern: if its content is not understood, a serious accident may result.

For elderly and hearing-impaired people whose ability to catch speech has declined, wireless broadcasting systems have been developed in which the user carries a receiver and speech is transmitted over a communication medium such as radio waves or infrared light. In such a system, the speaker talks in a quiet place into a microphone placed close to the mouth, and the clear speech is sent directly to the user over the communication medium. The user can therefore always hear clear speech, whatever the noise environment.

Hearing aids and other assistive listening devices have also been proposed with a noise reduction function that suppresses ambient noise and a consonant enhancement function that emphasizes only the consonants in speech, as described later. Consonant enhancement in particular was developed in view of the fact that consonants have a smaller amplitude than vowels, which makes the consonant portions hard for elderly and hearing-impaired people to hear.

This phenomenon, however, is not limited to hearing loss. Even for people with normal hearing, speech from announcement broadcasting equipment such as public-address and guidance systems, mobile phones, and other speech information transmission devices becomes inaudible in noisy environments. The volume could be raised by fitting a higher-output loudspeaker or earphone, but this may exceed the tolerable limit of hearing, is constrained by the size of the device, and increases sound distortion.

For this reason, several speech enhancement methods have been proposed to make speech easier to hear. Speech enhancement here means amplifying the spectral amplitude in a predetermined band of the speech frequency spectrum to improve intelligibility for the listener: the power of the predetermined frequency band is amplified while the spectral amplitude of bands other than the amplified band is attenuated (see Patent Document 1).

The mechanism of speech production is as follows. When the vocal cords vibrate, a speech wave is generated in the vocal tract, which extends from the vocal cords to the lips, and this wave is radiated as speech by way of the lips, tongue, and so on. That is, when the vocal cords vibrate at a constant period (the pitch period), the vocal tract makes the air from the lungs resonate according to the shape (for example, the width) of the throat and so on, producing a vowel. Changing that shape articulates vowels such as "a", "i", and "u", which are radiated as speech waves. The lips, tongue, and so on produce plosive, fricative, nasal, and other sounds in the mouth, which form the consonants; normally a consonant and a vowel are combined and radiated into space.

Japanese is a CV (consonant-vowel) language. For example, the Japanese sound "ka" consists of the consonant (C) "k" and the vowel (V) "a", which are combined and radiated. English, for its part, is known to be a CVC (consonant-vowel-consonant) language in which sounds are often arranged in the order consonant, vowel, consonant.

Accordingly, in Japanese the sounds of every row of the kana table, apart from "n" (ん) and the small "tsu" (っ), are pronounced as such consonant (C) and vowel (V) combinations or the like. In most cases, for each sound the flow from the vocal cords is first obstructed by the lips, tongue, and so on to articulate the consonant, and the vowel is then voiced strongly with the flow unobstructed.

Consonants therefore have a smaller amplitude than vowels, and speech from a speech information transmission device can be masked by noise and become inaudible, depending on the surroundings. To address this, speech information transmission devices such as hearing aids that allow speech to be distinguished clearly have been proposed (see Patent Document 2).

The speech information transmission device of Patent Document 2 comprises a microphone that receives speech from outside, a speech signal processing unit that generates a consonant clarification signal from the input speech signal, a carrier signal generation unit that generates a carrier signal, an amplitude modulation unit that amplitude-modulates the carrier signal according to the consonant clarification signal, and a vibrator that transmits mechanical vibration based on the amplitude-modulated output signal. In the speech signal processing unit, a consonant extraction unit extracts the consonant portions contained in the speech signal, and a repetition processing unit repeats each extracted consonant portion several times and adds it to the speech signal to generate the consonant clarification signal.

Between a consonant portion of a speech signal and the following vowel portion there is a VOT (voice onset time) of several tens of milliseconds. The VOT is the time from the release of the consonant burst until the vocal cords begin to vibrate, and it is close to silence. Its amplitude is therefore small compared with the onset of the consonant portion and with the vowel. By setting a suitable reference value and identifying as the VOT any region in which the amplitude stays at or below that value for at least a predetermined time (for example, about 10 ms), the VOT can be distinguished from the remainder of the consonant portion and from the vowel portion, and the end of the consonant portion can be located.

Similarly, a silent interval of several tens of milliseconds or more normally exists between a vowel portion and the consonant portion that follows it. Identifying this silent interval in the same way as the VOT locates the start of the next consonant portion.
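The threshold-and-duration rule that Patent Document 2 is described as using for the VOT and for silent intervals can be sketched as follows. This is a minimal illustration only: the function name `find_quiet_runs`, its parameters, and the use of the absolute sample value as the "amplitude" are assumptions made for this sketch and are not taken from either patent.

```python
def find_quiet_runs(samples, sample_rate, threshold, min_duration_s=0.010):
    """Return (start, end) index pairs, end exclusive, of every run in which
    |sample| stays at or below `threshold` for at least `min_duration_s`.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    min_len = int(round(min_duration_s * sample_rate))
    runs = []
    run_start = None
    for i, x in enumerate(samples):
        if abs(x) <= threshold:
            if run_start is None:
                run_start = i  # a quiet run begins here
        else:
            # A loud sample ends the run; keep it only if it lasted long enough.
            if run_start is not None and i - run_start >= min_len:
                runs.append((run_start, i))
            run_start = None
    # Handle a quiet run that reaches the end of the signal.
    if run_start is not None and len(samples) - run_start >= min_len:
        runs.append((run_start, len(samples)))
    return runs
```

Note that the rule can only confirm a run after the minimum duration has elapsed, so the signal has to be buffered for at least that long before a decision is available.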

Like Patent Document 2, Patent Document 1 mentioned above proposes a speech enhancement device that improves the intelligibility of received speech on a mobile phone or the like and suppresses the degradation of speech quality and the increase in noise that occur when the input speech contains noise.

The speech enhancement device of Patent Document 1 comprises a speech quality estimation unit that estimates the speech quality of the input speech signal and outputs a speech quality estimate (an estimated S/N ratio), and a speech enhancement processing unit that, according to that estimate, adjusts the vocal tract characteristics of the input speech signal (amplifying formants and attenuating antiformants) and enhances the residual signal of the input speech signal (pitch enhancement). The residual signal is the source signal obtained by removing the linearly predictable part from the speech wave; computing its autocorrelation yields the pitch period of the source.
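As an illustration of how an autocorrelation can yield a pitch period, the sketch below picks the lag with the largest autocorrelation value. It assumes the residual signal has already been obtained by some linear-prediction step that is not shown, and the function name and the lag search range are hypothetical; it is not the procedure of Patent Document 1.

```python
def pitch_period_by_autocorrelation(residual, min_lag, max_lag):
    """Return the lag, in samples, within [min_lag, max_lag] at which the
    (unnormalized) autocorrelation of `residual` is largest.

    Hypothetical helper; requires max_lag < len(residual).
    """
    best_lag, best_value = min_lag, float("-inf")
    n = len(residual)
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag`.
        value = sum(residual[i] * residual[i + lag] for i in range(n - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag
```

The search range should exclude lag 0 (where the autocorrelation is always maximal) and stay below twice the expected period, otherwise a multiple of the true period may be selected.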

Patent Document 1: JP 2005-331783 A
Patent Document 2: JP 2005-287600 A

As explained above, consonants are weaker than vowels, so speech from a speech information transmission device can be masked by noise and become inaudible, depending on the surroundings.

The wireless broadcasting system described above can be used only in the limited public spaces where such a system is installed, and the user must carry a receiver. Moreover, because the system is large and expensive, it is difficult to install in every public space, and it is also difficult for every user to carry a receiver, so widespread adoption is unlikely.

Hearing aids and similar devices have problems as well. First, they are useless unless the user is wearing them, and the input to a hearing aid is a mixture of speech and ambient noise. The noise suppression and consonant enhancement mechanisms must therefore suppress only the noise, or emphasize only the consonants, within that mixture. These mechanisms sometimes fail, depending on the type of noise for example, and consonant enhancement is difficult even in quiet conditions. Conventionally, detection accuracy has therefore been raised by using several cues in parallel to detect consonants, such as the amplitude envelope, the silent interval that accompanies the burst of a voiceless stop consonant, and other frequency-domain information. The resulting processing is complex, however, which has been an obstacle to enhancing consonants and transmitting speech information in real time or in near real time (quasi-real time).

The speech enhancement device of Patent Document 1 estimates the S/N ratio of the input speech signal and, according to it, applies a positive gain to the power of each formant centered on its formant frequency and a negative gain to the power of each antiformant centered on its antiformant frequency; it also enhances the pitch to make speech easier to hear. However, the processing is complex, time-consuming, and costly, and many problems remain before speech information can be transmitted in near real time. Nor does the device of Patent Document 1 exploit the property that Japanese speech consists of consonant-vowel combinations in which the consonant is weaker than the vowel.

The speech information transmission device of Patent Document 2, by contrast, extracts the consonant portions contained in the speech signal, repeats each extracted consonant portion several times, and adds it to the speech signal to generate a consonant clarification signal. This makes speech easier to distinguish, but because the consonant portion of every sound is repeated, a delay arises at each consonant portion, and these delays accumulate, so speech information cannot be transmitted in near real time. Identifying the VOT and the silent intervals also requires information about them, so the speech signal must first be stored and then processed. In that its processing is complex and time-consuming, the device is no different from that of Patent Document 1.

Thus the conventional techniques detect consonants by using several cues in parallel, which makes the consonant enhancement processing very complex and prevents speech information from being transmitted in real time or near real time. These techniques also require the enhanced sound to be stored in advance, which makes them hard to use in speech information transmission devices that demand flexibility.

An object of the present invention is therefore to provide a consonant processing device and a speech information transmission device that transmit speech information in near real time, use simple signal processing, make consonants and syllable end points easy to hear in noisy environments and for elderly or hearing-impaired listeners, and can be manufactured at low cost.

A further object of the present invention is to provide a consonant processing method that transmits speech information in near real time, uses simple signal processing, and makes consonants and syllable end points easy to hear in noisy environments and for elderly or hearing-impaired listeners.

The principal feature of the consonant processing device of the present invention is that it comprises: a frame division unit that extracts a frame signal for each of a plurality of time frames from an input speech signal; a power calculation unit that calculates the average power of each frame signal; a comparison unit that compares the average powers of the frame signals with one another; a consonant determination unit that determines, from the comparison result of the comparison unit, whether the amplification target point or amplification target width of the speech signal is a consonant or a syllable end point; and an amplification unit that amplifies the amplification target point or amplification target width of the speech signal when the consonant determination unit judges it to be a consonant or a syllable end point, and does not amplify it otherwise.

With the consonant processing device, speech information transmission device, and consonant processing method of the present invention, consonants can be enhanced merely by extracting a plurality of frame signals with a plurality of time frames and calculating and comparing the average powers of those frame signals. There is no need to run various processes in parallel, speech information can be transmitted in near real time, and the signal processing is simple. Consonants or syllable end points become easier to hear under noise or when speech competes with other acoustic signals, and also for hearing-impaired and elderly listeners. The overall level of the speech can therefore be lowered without loss of clarity, which prevents environmental noise from increasing. In addition, the consonant processing device and speech information transmission device can be manufactured at low cost.

A first aspect of the present invention is a consonant processing device comprising: a frame division unit that extracts a frame signal for each of a plurality of time frames from an input speech signal; a power calculation unit that calculates the average power of each frame signal; a comparison unit that compares the average powers of the frame signals with one another; a consonant determination unit that determines, from the comparison result of the comparison unit, whether the amplification target point or amplification target width of the speech signal is a consonant or a syllable end point; and an amplification unit that amplifies the amplification target point or amplification target width of the speech signal when the consonant determination unit judges it to be a consonant or a syllable end point, and does not amplify it otherwise. With this configuration, consonants can be enhanced merely by extracting a plurality of frame signals with a plurality of time frames and calculating and comparing their average powers. There is no need to run various processes in parallel, speech information can be transmitted in near real time, and the signal processing is simple. Consonants or syllable end points become easier to hear under noise or when speech competes with other acoustic signals, and also for hearing-impaired and elderly listeners. The overall level of the speech can therefore be lowered without loss of clarity, environmental noise is prevented from increasing, and a consonant processing device that can be manufactured at low cost is provided.
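The chain of frame division, power calculation, comparison, determination, and amplification can be sketched as below. This is a rough illustration under stated assumptions, not the patented implementation: the two-adjacent-frame layout, the decision rule "power strictly rises across the boundary", the fixed linear gain, and all names (`average_power`, `enhance_consonants`, `frame_len`, `target_width`, `gain`) are choices made for this sketch.

```python
def average_power(frame):
    """Power calculation: mean of the squared samples of one frame."""
    return sum(x * x for x in frame) / len(frame)


def enhance_consonants(samples, frame_len, target_width, gain):
    """Compare two adjacent frames at each frame boundary and scale the
    samples around the boundary by `gain` where the decision rule fires.

    Hypothetical sketch; names and the decision rule are assumptions.
    """
    out = list(samples)
    half = target_width // 2
    # Frame division: step one frame at a time so every boundary is
    # examined exactly once.
    for boundary in range(frame_len, len(samples) - frame_len + 1, frame_len):
        prev_power = average_power(samples[boundary - frame_len:boundary])
        next_power = average_power(samples[boundary:boundary + frame_len])
        # Comparison and determination. ASSUMPTION: a strict rise in
        # average power marks the amplification target.
        if prev_power < next_power:
            lo = max(0, boundary - half)
            hi = min(len(samples), boundary + half)
            # Amplification: scale only the target width, leave the rest.
            for i in range(lo, hi):
                out[i] = samples[i] * gain
    return out
```

Because each decision needs only the two frames on either side of a boundary, the added delay of such a scheme is bounded by roughly one frame length, which is consistent with the near-real-time aim stated in the text.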

A second aspect of the present invention is a consonant processing device comprising: a frame division unit that extracts a frame signal for each of a plurality of time frames from an input speech signal; a power calculation unit that calculates the average power of each frame signal; a comparison unit that compares the average powers of the frame signals with one another; a consonant determination unit that determines, from the comparison result of the comparison unit, whether the speech signal is a consonant or a syllable end point; an amplification degree determination unit that sets the amplification degree of the amplification target point or amplification target width of the speech signal in the amplifying direction when the consonant determination unit judges a consonant or a syllable end point, and decides that the speech signal is not to be amplified otherwise; and an amplification unit that amplifies the speech signal according to the amplification degree so determined. In addition to the effects of the first aspect, this configuration lets the amplification degree determination unit adjust the amplification degree, providing a consonant processing device whose output is still easier to hear.

A third aspect of the present invention depends on the first or second aspect and is a consonant processing device in which the comparison unit makes the comparison by calculating the difference between the average powers of the frame signals expressed in decibels. Because only a difference is computed, the signal processing is easy and speech information can be transmitted in near real time.

A fourth aspect of the present invention depends on the first or second aspect and is a consonant processing device in which the comparison unit makes the comparison by calculating the ratio of the average powers of the frame signals. Because only a ratio is computed, the signal processing is easy and speech information can be transmitted in near real time.

A fifth aspect of the present invention is the consonant processing device of any one of the first to fourth aspects in which the time frames include a time frame whose extraction width allows a consonant to be extracted, and the amplification target point or amplification target width is set at the center of the extraction width of that time frame. This suits VCV-type signal processing, and the configuration is simple and amplifies effectively.

A sixth aspect of the present invention is the consonant processing device of any one of the first to fourth aspects in which, when the time frames include two consecutive time frames, the amplification target point or amplification target width is set at the boundary between the two time frames. This suits CV-type signal processing, and the configuration is simple and amplifies effectively.
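The two placements of the amplification target (center of one frame, or boundary between two consecutive frames) reduce to simple index arithmetic. The sketch below expresses positions as sample indices; the function names and the clipping of the target width to the signal length are assumptions made for illustration.

```python
def target_center_of_frame(frame_start, frame_len):
    """Center placement: the target sits at the middle of the frame's
    extraction width (hypothetical helper)."""
    return frame_start + frame_len // 2


def target_boundary_of_frames(first_frame_start, first_frame_len):
    """Boundary placement: the target sits where the first of two
    consecutive frames ends and the second begins (hypothetical helper)."""
    return first_frame_start + first_frame_len


def target_span(center, width, signal_len):
    """Return the (lo, hi) sample range, hi exclusive, of a target width
    centered on `center`, clipped to the signal."""
    start = center - width // 2
    lo = max(0, start)
    hi = min(signal_len, start + width)
    return lo, hi
```

For a frame starting at sample 100 with 40 samples, the center placement gives sample 120 and the boundary placement gives sample 140.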

A seventh aspect of the present invention is the consonant processing device of the third aspect in which the amplitude of the speech signal at the amplification target point or amplification target width is amplified when the difference between the average powers expressed in decibels is 0 or less, and is not amplified when that difference is greater than 0. In addition to the effects of the preceding aspects, this further simplifies the signal processing.

An eighth aspect of the present invention is the consonant processing device of the fourth aspect (claim 4) in which the amplitude of the speech signal at the amplification target point or amplification target width is amplified when the ratio between the average powers is 1 or less, and is not amplified when that ratio is greater than 1. In addition to the effects of the preceding aspects, this further simplifies the signal processing.
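The decibel-difference threshold of 0 and the ratio threshold of 1 express the same condition, because 10·log10(Pa) − 10·log10(Pb) = 10·log10(Pa/Pb), which is 0 or less exactly when Pa/Pb is 1 or less. The sketch below shows both tests side by side. Which frame plays the role of `power_a` and which `power_b` is not fixed here and is left as an assumption; the function names and the reference power of 1.0 are likewise illustrative.

```python
import math


def level_db(power, reference=1.0):
    """Express a positive average power in decibels relative to `reference`."""
    return 10.0 * math.log10(power / reference)


def should_amplify_by_difference(power_a, power_b):
    """Decibel test: amplify when the level difference is 0 dB or less."""
    return level_db(power_a) - level_db(power_b) <= 0.0


def should_amplify_by_ratio(power_a, power_b):
    """Ratio test: amplify when the power ratio is 1 or less."""
    return power_a / power_b <= 1.0
```

Both functions require strictly positive powers; a frame of pure digital silence would need a small floor value before either test is applied.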

A ninth aspect of the present invention is the consonant processing device of any one of the first and third to sixth aspects in which, instead of amplifying the amplification target point or amplification target width of the speech signal when the consonant determination unit judges a consonant or a syllable end point, the amplification unit suppresses it. This configuration produces speech that is deliberately hard to hear, which can be used for hearing tests and listening training.

A tenth aspect of the present invention is the consonant processing device of any one of the second to sixth aspects in which, instead of setting the amplification degree of the amplification target point or amplification target width of the speech signal in the amplifying direction when the consonant determination unit judges a consonant or a syllable end point, the amplification degree determination unit sets it in the suppressing direction. With this configuration the amplification target point or amplification target width is suppressed, producing speech that is deliberately hard to hear, which can be used for hearing tests and listening training.
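Amplification and suppression of the target can share one code path if the amplification degree is treated as a signed value in decibels. In the sketch below a positive `gain_db` amplifies the target span and a negative one suppresses it; this signed-decibel representation and the names `db_to_amplitude` and `process_target` are assumptions made for illustration.

```python
def db_to_amplitude(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def process_target(samples, lo, hi, gain_db):
    """Scale samples[lo:hi] by `gain_db` and leave the rest untouched.

    Hypothetical helper: gain_db > 0 amplifies, gain_db < 0 suppresses.
    """
    factor = db_to_amplitude(gain_db)
    return [x * factor if lo <= i < hi else x for i, x in enumerate(samples)]
```

For example, `process_target(samples, lo, hi, -20.0)` reduces the amplitude of the target span to about one tenth, while `+20.0` raises it tenfold.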

An eleventh aspect of the present invention is the consonant processing device of any one of the first to eighth aspects in which a filter unit that passes predetermined frequency components is provided ahead of the point where the speech signal enters the frame division unit. This increases the clarity of the consonant enhancement.
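The text does not say here which frequency components the filter unit passes. Purely as an illustration of placing a filter ahead of frame division, the sketch below uses a first-order pre-emphasis filter, y[n] = x[n] − a·x[n−1], a common speech front end that attenuates low frequencies relative to high ones. The choice of this filter, the coefficient 0.97, and the function name are assumptions, not the patent's design.

```python
def pre_emphasis(samples, coefficient=0.97):
    """First-order high-frequency emphasis: y[n] = x[n] - a * x[n-1].

    Hypothetical stand-in for the filter unit; x[-1] is taken as 0.
    """
    out = []
    previous = 0.0
    for x in samples:
        out.append(x - coefficient * previous)
        previous = x
    return out
```

The filtered samples would then be passed to the frame division step in place of the raw input. A slowly varying input is strongly attenuated, while a rapidly alternating one is passed with a gain close to 2.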

A twelfth aspect of the present invention depends on any one of the first to eleventh aspects and is a consonant processing device in which the amplification unit amplifies the physical sound pressure according to a recruitment-compensation characteristic that matches the loudness perceived by a listener with sensorineural hearing loss to the loudness perceived by a listener with normal hearing. Consonants or syllable end points can thus be amplified to a level that a normal-hearing listener would find easy to hear.

A thirteenth aspect of the present invention is a speech information transmission device comprising the consonant processing device according to any one of the first to twelfth aspects and a speaker that outputs consonant-enhanced speech based on the consonant-processed speech signal from the consonant processing device. With this configuration, there is no need to run various processes in parallel, speech information can be transmitted in near real time, and the signal processing is simple. Consonants and syllable end points become easier to hear under noise, when speech competes with other acoustic signals, and for listeners who are hard of hearing or elderly. The overall speech level can therefore be lowered without loss of clarity, which prevents an increase in environmental noise, and an inexpensive speech information transmission device can be provided.

A fourteenth aspect of the present invention is a consonant processing method comprising: extracting a frame signal from an input speech signal in each of a plurality of time frames; calculating the average power of each frame signal; comparing the average powers of the frame signals with one another; determining, based on the comparison result, whether the amplification target point or amplification target width of the speech signal is a consonant or a syllable end point; amplifying the amplification target point or amplification target width when it is determined to be a consonant or a syllable end point; and not amplifying it when it is determined not to be a consonant. With this configuration, there is no need to run various processes in parallel, speech information can be transmitted in near real time, and the signal processing is simple. Consonants and syllable end points become easier to hear under noise, when speech competes with other acoustic signals, and for listeners who are hard of hearing or elderly, so the overall speech level can be lowered and an increase in environmental noise can be prevented.

A fifteenth aspect of the present invention is the consonant processing method according to the fourteenth aspect, wherein, instead of amplifying the amplification target point or amplification target width of the speech signal when a consonant or a syllable end point is determined, the amplification target point or amplification target width is conversely suppressed when a consonant or a syllable end point is determined. With this configuration, hard-to-hear speech can be produced and used for hearing tests and listening training.

A thirteenth aspect of the present invention depends on the twelfth aspect and is a consonant processing method wherein the amplification degree amplifies the physical sound pressure in accordance with a correction characteristic for the recruitment phenomenon, which matches the loudness perceived by a person with sensorineural hearing loss to the loudness perceived by a person with normal hearing. A consonant or a syllable end point can thereby be amplified to a state that a person with normal hearing finds easy to hear.

(Example 1)
Hereinafter, a consonant processing device, a speech information transmission device, and a consonant processing method according to Example 1 of the present invention will be described.

The consonant enhancement performed by the sound enhancement processing device of Example 1 is suited to languages whose syllables have a CV structure, as in Japanese, that is, languages in which a consonant (C) is frequently followed immediately by a vowel (V). Although this specification uses the term consonant enhancement, the enhancement covers not only consonants but also syllable end points.

FIG. 1 is a configuration diagram of the consonant processing device of Example 1 of the present invention and a speech information transmission device equipped with it, FIG. 2 is an explanatory diagram of the processing of the consonant processing device of Example 1, and FIG. 7(a) is an explanatory diagram of the amplification degree during amplification in Example 1.

In FIG. 1, reference numeral 1 denotes a frame division unit that, when a speech signal is input, extracts a frame signal in each of a plurality of time frames as shown in FIG. 2. The frame division unit 1 comprises a first time frame 1a for extracting a frame signal whose width is about 1/3 of the length of a consonant, a second time frame 1b that contains the first time frame 1a and has an extraction width capable of capturing a consonant, and a third time frame 1c that contains the second time frame 1b and can capture about 1 to 3 times the length of a syllable.

The first time frame 1a, the second time frame 1b, and the third time frame 1c each multiply the signal by a window function such as a rectangular window or a Hamming window. Example 1 uses a rectangular window.

That is, for the speech signal at time t = T, the first time frame 1a is the window function w_1(t) = 1 for T - τ_1 ≤ t ≤ T + τ_1 and w_1(t) = 0 otherwise; the second time frame 1b is the window function w_2(t) = 1 for T - τ_2 ≤ t ≤ T + τ_2 and w_2(t) = 0 otherwise; and the third time frame 1c is likewise the window function w_3(t) = 1 for T - τ_3 ≤ t ≤ T + τ_3 and w_3(t) = 0 otherwise. All values are in ms.

The center positions (t = T) of the first time frame 1a, the second time frame 1b, and the third time frame 1c do not all have to coincide, but in the speech signal waveform shown in FIG. 1 they do, and this center position is the amplification target point of the speech signal for these time frames. Instead of amplifying a single point, a predetermined width may be amplified; in this specification that width is called the amplification target width. The amplification position or amplification target width is preferably set at least at the center position of the second time frame 1b. Placing the second time frame 1b toward the rear end of the third time frame 1c may increase the processing speed, so the center position of the second time frame 1b is preferably placed toward that rear end.

The window parameters τ_1, τ_2, and τ_3 are determined empirically. In Example 1 they are set to about τ_1 = 7.5 ms, about τ_2 = 25 ms (long enough to capture a consonant), and about τ_3 = 200 ms (long enough to capture about 1 to 4 times the length of a syllable). In Japanese, a consonant is generally a few tens of ms long, and one syllable is about 100 to 400 ms long.

Accordingly, for the speech signal p(t), the frame signals are extracted as y_1(t) = w_1(t)·p(t) from the first time frame 1a, y_2(t) = w_2(t)·p(t) from the second time frame 1b, and y_3(t) = w_3(t)·p(t) from the third time frame 1c. For a digital signal, taking y_3(t) as an example, the intervals T - τ_3 ≤ t < T and T < t ≤ T + τ_3 each contain N time-series values, which together with the value at t = T give (2N + 1) time-series values in total for the calculation. The same holds for the time-series values of y_1(t) and y_2(t), which reuse values that overlap the input time-series values of y_3(t).
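The frame extraction described above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `extract_frame`, the use of NumPy, and the 16 kHz sampling rate are assumptions, since the source specifies neither a sampling rate nor a software interface. With a rectangular window, multiplying by w_i(t) reduces to slicing the 2N + 1 samples centered on t = T.

```python
import numpy as np

SAMPLE_RATE_HZ = 16000  # assumed; the source gives no sampling rate


def extract_frame(p, center, tau_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return the 2N+1 samples of p centered on index `center`.

    A rectangular window equals 1 inside [T - tau, T + tau] and 0 outside,
    so y(t) = w(t) * p(t) is simply the slice of p inside the window.
    """
    n = int(round(tau_ms * sample_rate_hz / 1000.0))  # N samples per side
    if center - n < 0 or center + n >= len(p):
        raise ValueError("window extends past the signal")
    return np.asarray(p, dtype=float)[center - n : center + n + 1]


# Hypothetical test signal: one second of samples, analysis point at 0.5 s.
p = np.arange(SAMPLE_RATE_HZ, dtype=float)
T = SAMPLE_RATE_HZ // 2
y1 = extract_frame(p, T, 7.5)    # first time frame 1a
y2 = extract_frame(p, T, 25.0)   # second time frame 1b
y3 = extract_frame(p, T, 200.0)  # third time frame 1c
print(len(y1), len(y2), len(y3))
```

Because all three frames share the same center here, the samples of y1 and y2 are a subset of those of y3, which mirrors the statement that the shorter frames reuse overlapping input values.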

In this way, the consonant processing device of Example 1 performs A/D conversion in the frame division unit 1 or elsewhere and carries out all processing, including the determination of the amplification degree described later, digitally in a digital circuit or processor; analog processing with analog circuits is also possible. When a processor is used, a memory (not shown) is provided to store the program and setting values, which are read out for the calculations.

Next, in FIG. 1, reference numeral 2 denotes a power calculation unit that calculates the average power of the frame signals y_1(t), y_2(t), and y_3(t) extracted in the first time frame 1a, the second time frame 1b, and the third time frame 1c. Reference numeral 2a denotes a first power calculation unit, which computes the average power L_1, the decibel expression of the average power P_1 of the squared amplitude of y_1(t) output from the first time frame 1a. Similarly, 2b denotes a second power calculation unit, which computes the average power L_2, the decibel expression of the average power P_2 of y_2(t) output from the second time frame 1b. Further, 2c denotes a third power calculation unit, which computes the average power L_3, the decibel expression of the average power P_3 of y_3(t) output from the third time frame 1c. The average power P_i (i = 1, 2, 3) is given by (Equation 1), where 2N + 1 is the total number of time-series values in each case. The average powers L_1, L_2, and L_3 are in dB.
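As a rough sketch of the power calculation, the following Python function computes a decibel level from one frame. It assumes that Equation 1 is the ordinary mean of the squared sample values over the 2N + 1 samples, which is consistent with the description but is not reproduced in this text. The reference level of 1.0, the function name `average_power_db`, and the small `eps` guard against log(0) are also assumptions.

```python
import numpy as np


def average_power_db(frame, eps=1e-12):
    """Decibel level L of a frame, assuming P = mean(y**2) and L = 10*log10(P)."""
    frame = np.asarray(frame, dtype=float)
    mean_power = np.mean(frame ** 2)          # assumed form of Equation 1
    return 10.0 * np.log10(mean_power + eps)  # eps avoids log10(0) on silence


# A full-scale constant frame sits near 0 dB; one tenth of the amplitude
# sits near -20 dB, because power scales with amplitude squared.
loud = average_power_db(np.ones(241))
soft = average_power_db(0.1 * np.ones(241))
print(round(loud, 3), round(soft, 3))
```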

In the following, the description uses differences of the decibel-expressed average powers L_i (i = 1, 2, 3), but the calculation can also use the ratios K_ij = P_i / P_j (i, j = 1, 2, 3; i < j) of the average powers P_i. The case using the ratio K_ij is described in Example 4 below. Furthermore, the same effect can be obtained by computing the difference P_i - P_j of the average powers P_i themselves instead of taking their logarithms to obtain the decibel values L_i (i = 1, 2, 3). A detailed description of that case is omitted, since it follows the decibel-based description.
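Under the standard definition L_i = 10·log10(P_i), a difference of levels and a ratio of powers carry the same information, since L_i - L_j = 10·log10(K_ij). The short Python sketch below illustrates the conversion. The function names are hypothetical, and the source defers its own treatment of K_ij to Example 4, so this is only an illustration of the decibel identity.

```python
import math


def db_difference_to_power_ratio(delta_db):
    """Power ratio K_ij = P_i / P_j equivalent to a level difference L_i - L_j."""
    return 10.0 ** (delta_db / 10.0)


def power_ratio_to_db_difference(ratio):
    """Level difference L_i - L_j in dB equivalent to a power ratio K_ij."""
    return 10.0 * math.log10(ratio)


# A 5 dB level threshold corresponds to a power ratio of about 3.16,
# and a -20 dB difference corresponds to a power ratio of 0.01.
k_threshold = db_difference_to_power_ratio(5.0)
k_minus20 = db_difference_to_power_ratio(-20.0)
print(round(k_threshold, 3), round(k_minus20, 4))
```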

Next, reference numeral 3 denotes a comparison unit that calculates and compares the differences between the average powers L_1, L_2, and L_3 of the frame signals, and 4 denotes a consonant determination unit that determines, based on the comparison result of the comparison unit 3, whether the speech signal is a consonant. Reference numeral 5 denotes a delay unit that delays the speech signal, or buffers the data, for the time needed for processing up to amplification, and 6 denotes an amplification unit that changes the amplification degree of the amplification target point of the speech signal when the consonant determination unit 4 determines a consonant and leaves the amplification degree unchanged when it determines that the signal is not a consonant.

Reference numeral 10 denotes the consonant processing device of Example 1, which emphasizes consonants in an input speech signal and outputs the result; it is implemented as, for example, an application-specific integrated circuit. Reference numeral 11 denotes a microphone for speech input, 12 denotes a speaker for speech output, and 20 denotes a speech information transmission device equipped with the consonant processing device 10.

The speech information transmission device 20 outputs the speech signal whose consonants have been enhanced by the consonant processing device 10 from the speaker 12. It can be used in the wireless broadcasting systems mentioned above, in announcement equipment such as public address and guidance systems, in portable information devices such as mobile terminals, in other speech information transmission devices, and in hearing aids. When the speech information transmission device 20 has no microphone 11, for example for the guide voice of a vending machine or an ATM, the consonant processing device 10 may process a prerecorded sound signal.

The consonant processing device 10 of Example 1 is effective for languages with a CV structure, in which a consonant is followed by a vowel, as in Japanese. Exploiting this structure, the comparison unit 3 compares the average powers of the frame signals according to the following criteria, and the consonant determination unit 4 determines whether a consonant or a syllable end point is present.

First, when the decibel-expressed average power L_1 is higher than the decibel-expressed average power L_2 by a predetermined threshold (5 dB in Example 1) or more, that is, L_1 > L_2 + 5, the amplitude has increased only over a very narrow width of about 15 ms (about 1/3 of the length of a consonant), so the increase is regarded as an increase in noise. The comparison unit 3 calculates L_1 - L_2 and determines whether it is greater than the threshold or not. If it is greater than the threshold, the consonant determination unit 4 judges the speech signal to be noise. If it is at or below the threshold, the next criterion is applied.

Second, when L_1 - L_2 is at or below the threshold (5 dB) and L_2 < L_3, the average power L_2 over the 50 ms of the second time frame 1b (slightly longer than a consonant) is lower than the average power L_3 over the 400 ms of the third time frame 1c (the length of several syllables), so a consonant or a syllable end point is considered to be present there.

In other words, when a consonant is followed by a vowel, the consonant or syllable end point has a lower average power than the vowel, so the levels L_2 and L_3 are compared, and if L_2 is the smaller, a consonant or a syllable end point is presumed to lie in the second time frame 1b. This situation is shown in FIG. 2.

In FIG. 2, the average power L_2 of the frame signal extracted with the window function w_2(t) is small, and the average power L_3 of the frame signal extracted with the window function w_3(t) is larger than L_2. It can therefore be presumed that the frame signal extracted in the second time frame 1b is a consonant or a syllable end point and that a vowel exists on both sides of it, or before it, or after it. In this case the comparison unit 3 calculates L_2 - L_3, and if L_2 < L_3 the consonant determination unit 4 determines that the frame signal of the second time frame 1b is a consonant or a syllable end point, and amplification is performed.
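The two criteria above amount to a short decision rule on the three decibel levels. The Python sketch below is one way to write it, assuming the levels L_1, L_2, and L_3 have already been computed; the function name `classify_point` and the returned labels are hypothetical and chosen only for illustration.

```python
def classify_point(l1_db, l2_db, l3_db, noise_threshold_db=5.0):
    """Classify the amplification target point from the three frame levels.

    l1_db, l2_db, l3_db: decibel levels of the short, medium, and long frames.
    """
    # Criterion 1: a rise confined to the ~15 ms frame is treated as noise.
    if l1_db - l2_db > noise_threshold_db:
        return "noise"
    # Criterion 2: the ~50 ms frame is weaker than the ~400 ms surroundings.
    if l2_db < l3_db:
        return "consonant_or_syllable_end"
    return "other"


print(classify_point(-10.0, -20.0, -15.0))  # short burst 10 dB above L2
print(classify_point(-28.0, -30.0, -15.0))  # quiet stretch between vowels
print(classify_point(-10.0, -10.0, -15.0))  # L2 not below L3
```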

Based on the determination by the consonant determination unit 4 described above, the amplification unit 6 of Example 1 therefore performs no amplification when L_1 - L_2 > 5, and when L_1 - L_2 < 5 and L_2 - L_3 lies in the range 0 to -20 dB it amplifies by a constant amplification degree λ_0, for example 10 dB. However, even when L_1 - L_2 < 5, if L_2 - L_3 < -20 the signal becomes hard to distinguish from noise, so the amplification unit 6 weakens the amplification. The consonant determination unit 4 may make this decision instead. FIG. 7(a) illustrates this amplification characteristic. Consonant enhancement is thus achieved easily with a very simple configuration. FIG. 7(a) is only an example: if amplification stops abruptly, the amplification degree changes discontinuously and the speech sounds unnatural, so it is preferable to reduce it as indicated by the dash-dot line, or even more smoothly.
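The amplification rule of Example 1 can be sketched as a small gain function. The constant gain of 10 dB and the 5 dB and -20 dB boundaries come from the description; the linear taper that reaches 0 dB at -40 dB is purely an assumption, because the text only says that the amplification is weakened below -20 dB and that a smooth reduction is preferable. The name `example1_gain_db` and the parameter `taper_end_db` are hypothetical.

```python
def example1_gain_db(l1_db, l2_db, l3_db, lambda0_db=10.0,
                     noise_threshold_db=5.0, taper_end_db=-40.0):
    """Gain in dB for the amplification target point under Example 1."""
    if l1_db - l2_db > noise_threshold_db:
        return 0.0                      # judged to be noise: no amplification
    diff = l2_db - l3_db
    if -20.0 <= diff < 0.0:
        return lambda0_db               # constant amplification degree
    if taper_end_db < diff < -20.0:
        # Assumed linear taper from lambda0 at -20 dB down to 0 at taper_end_db.
        return lambda0_db * (diff - taper_end_db) / (-20.0 - taper_end_db)
    return 0.0                          # not a consonant, or too weak


print(example1_gain_db(-28.0, -30.0, -20.0))  # inside the 0 to -20 dB band
print(example1_gain_db(-48.0, -50.0, -20.0))  # inside the assumed taper
```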

As described above, the consonant processing device, speech information transmission device, and consonant processing method of Example 1 achieve consonant enhancement simply by extracting a plurality of frame signals with a plurality of time frames and calculating and comparing their average powers. There is therefore no need to run various processes in parallel, speech information can be transmitted in near real time, and the signal processing is simple, with few logical decisions. Consonants become easier to hear under noise, when speech competes with other acoustic signals, and for listeners who are hard of hearing or elderly, so the overall speech level can be lowered without loss of clarity and an increase in environmental noise can be prevented. In addition, a consonant processing device such as a consonant enhancement processing device, and a speech information transmission device, can be provided that are simple in configuration and inexpensive to manufacture.

(Example 2)
Hereinafter, a consonant processing device, a speech information transmission device, and a consonant processing method according to Example 2 of the present invention will be described. The consonant processing device and speech information transmission device of Example 2 adjust the amplification degree according to the comparison result of the comparison unit when a consonant is determined. The consonant enhancement of Example 2 is also particularly suited to CV-type languages.

FIG. 3 is a configuration diagram of the consonant processing device of Example 2 of the present invention and a speech information transmission device equipped with it, FIG. 4(a) is an explanatory diagram of the recruitment phenomenon, FIG. 4(b) compares a sound heard in a quiet environment with the same sound heard in noise, and FIG. 7(b) is an explanatory diagram of the amplification degree during amplification in Example 2. The same reference numerals denote the same components in Example 2 and Example 1, and the processing of the speech signal is also the same as in Example 1 except that the amplification degree is determined according to the comparison result, so their description is omitted.

In FIG. 3, reference numeral 1 denotes the frame division unit, 1a the first time frame, 1b the second time frame, and 1c the third time frame. Reference numeral 2 denotes the power calculation unit, 2a the first power calculation unit, 2b the second power calculation unit, 2c the third power calculation unit, 3 the comparison unit, 4 the consonant determination unit, 5 the delay unit, and 6 the amplification unit. Reference numeral 10 denotes the consonant processing device of Example 2, 11 the microphone, 12 the speaker, and 20 the speech information transmission device of Example 2. These have the same configuration as in Example 1.

The distinguishing feature of Example 2 is that, whereas Example 1 amplifies with a constant amplification degree λ_0, Example 2 adjusts the amplification degree λ according to the comparison result of the comparison unit 3. In FIG. 3, reference numeral 7 denotes an amplification degree determination unit that determines the value of the amplification degree λ.

The amplification degree determination unit 7 determines the amplification degree of the amplification target point or amplification target width of the speech signal when the comparison unit 3 determines a consonant or a syllable end point, and decides not to amplify the speech signal when it is determined not to be a consonant or a syllable end point. When L_2 < L_3 in decibel terms and the speech signal has been judged not to be noise, the amplification degree λ is made larger the larger the level difference between L_2 and L_3.

The amplification degree λ used in Example 2 is described next. It adopts a characteristic similar to the correction characteristic for the recruitment phenomenon in people with sensorineural hearing loss. The recruitment phenomenon is the phenomenon in which, for sounds within a certain range of levels, loudness (the perceived magnitude) grows more with an increase in the sound pressure of a physical stimulus for a person with sensorineural hearing loss than for a person with normal hearing.

For this reason, in correcting for the recruitment phenomenon, as shown in FIG. 4(a), sounds within a predetermined range of levels are corrected more strongly the softer they are and less strongly the louder they are. This matches the human auditory system and allows weak, hard-to-hear sounds to be corrected to a level that is easy to hear. FIG. 4(b) shows the result of an experiment on how much the subjectively perceived sound differs between a quiet environment and noise. In the figure, P is the strength of a sound heard in a quiet environment, and N is the strength of the same sound heard in white noise. The result shows that N has a characteristic similar to the recruitment phenomenon of FIG. 4(a).

Accordingly, the amplification in Example 2 adjusts the amplification degree of the speech signal at the amplification target point as follows. First, when the decibel-expressed average power L_1 is higher than the decibel-expressed average power L_2 by a predetermined threshold (5 dB in Example 2) or more, that is, L_1 > L_2 + 5, the amplitude has increased only over a very narrow range of about 15 ms (about 1/3 of the length of a consonant), so the increase is regarded as an increase in noise. L_1 - L_2 is calculated and compared with the threshold. If it is greater than the threshold, the consonant determination unit 4 judges the speech signal to be noise. If it is at or below the threshold, the next criterion is applied.

Second, when L_1 - L_2 is at or below the threshold (5 dB) and L_3 - 20 < L_2 < L_3, the consonant determination unit 4 determines a consonant or a syllable end point and sets the amplification degree to λ = c·(L_3 - L_2), where c = 0.9. The same can be expressed with the ratio K_23 = P_2 / P_3 of the average powers P_2 and P_3 instead of the decibel difference (L_3 - L_2); in that case λ = (K_23^(1/2))^d. The meaning of the coefficient c is explained in Example 4 with reference to FIG. 8. The larger the level difference between L_3 and L_2 (or the larger K_23), the greater the amplification, so that a 20 dB difference can be compressed to 2 dB. The amplification degree λ reaches its maximum of 18 dB when L_2 - L_3 = -20 dB.
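The arithmetic behind "a 20 dB difference compressed to 2 dB" can be checked with a few lines of Python. The sketch below assumes the standard relations L = 10·log10(P) and amplitude factor = 10^(λ/20); the function names are hypothetical. The source's ratio form uses an exponent d whose definition is deferred to Example 4, so it is not modeled here, and the amplitude factor below follows only from the decibel definition.

```python
C = 0.9  # coefficient c given for Example 2


def gain_db(l2_db, l3_db, c=C):
    """Amplification degree lambda in dB, valid for L3 - 20 < L2 < L3."""
    return c * (l3_db - l2_db)


def amplitude_factor(p2, p3, c=C):
    """Linear amplitude factor for the same gain, computed from average powers.

    10 ** (lambda / 20) with lambda = c * 10 * log10(p3 / p2)
    simplifies to (p3 / p2) ** (c / 2).
    """
    return (p3 / p2) ** (c / 2.0)


l2, l3 = -20.0, 0.0          # consonant 20 dB below its surroundings
lam = gain_db(l2, l3)        # gain applied to the consonant
residual = (l3 - l2) - lam   # level gap that remains after amplification
print(lam, residual)
```

With c = 0.9 the gain is nine tenths of the level gap, so the gap that remains after amplification is one tenth of the original.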

Furthermore, when L_1 - L_2 is at or below the threshold (5 dB) and L_2 < L_3 - 20, that is, when L_2 is 20 dB or more below L_3, the power is small compared with the surrounding speech signal, and forcing amplification would make the signal hard to distinguish from noise, so the amplification degree is reduced gradually. For example, as in FIG. 7(b), the amplification degree λ is lowered by 4.5 dB for every 10 dB decrease in L_2 - L_3 and becomes 0 when L_2 - L_3 is -60 dB. FIG. 7(b) is only an example, however: a stepwise decrease makes the speech sound unnatural where the amplification degree changes discontinuously, so a smoother reduction such as the dash-dot line is preferable, and if possible an even smoother reduction with no abrupt changes.
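Putting the Example 2 rules together gives a piecewise gain curve. The Python sketch below uses a straight-line taper between -20 dB and -60 dB, which passes through the stated points (4.5 dB lower per 10 dB, reaching 0 at -60 dB) but is an assumed interpolation rather than the stepped curve of FIG. 7(b). The function name `example2_gain_db` is hypothetical.

```python
def example2_gain_db(l1_db, l2_db, l3_db, c=0.9, noise_threshold_db=5.0):
    """Gain in dB for the amplification target point under Example 2."""
    if l1_db - l2_db > noise_threshold_db:
        return 0.0                        # judged to be noise
    diff = l2_db - l3_db
    if diff >= 0.0:
        return 0.0                        # not a consonant or syllable end
    if diff >= -20.0:
        return c * (-diff)                # lambda = c * (L3 - L2)
    if diff > -60.0:
        peak = c * 20.0                   # 18 dB when c = 0.9
        # Assumed linear taper: peak at -20 dB, zero at -60 dB.
        return peak * (diff + 60.0) / 40.0
    return 0.0                            # too weak to amplify


for diff in (-10.0, -20.0, -30.0, -40.0, -60.0):
    # L1 equals L2 here, so the noise criterion is not triggered.
    print(diff, round(example2_gain_db(diff, diff, 0.0), 2))
```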

The consonant processing device, speech information transmission device, and consonant processing method described above are effective as a consonant enhancement processing device, a speech information transmission device equipped with it, and a consonant enhancement method. As described above, when L_3 - 20 < L_2 < L_3 the amplification degree is made positive so that the speech signal is amplified; conversely, making the amplification degree λ negative suppresses the speech signal. For example, in hearing tests for people with hearing impairment or listening training for foreign-language learners, making the listener hear noise for a long time causes reduced hearing and discomfort. With this means and method, a consonant suppression processing device and a consonant suppression method effective for such hearing test devices and listening training devices are obtained, and speech can be processed accordingly.
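Switching between enhancement and suppression only changes the sign of the gain in decibels. The Python sketch below shows this for a block of samples, assuming the usual amplitude conversion 10^(λ/20); the function name `apply_gain_db` and the `suppress` flag are hypothetical.

```python
import numpy as np


def apply_gain_db(samples, gain_db, suppress=False):
    """Scale samples by gain_db, or by -gain_db when suppress is True."""
    signed_db = -gain_db if suppress else gain_db
    return np.asarray(samples, dtype=float) * 10.0 ** (signed_db / 20.0)


segment = [1.0, -0.5]                                  # hypothetical samples
enhanced = apply_gain_db(segment, 20.0)                # amplitudes x10
suppressed = apply_gain_db(segment, 20.0, suppress=True)  # amplitudes x0.1
print(enhanced, suppressed)
```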

In this way, Example 2 makes consonant enhancement easy with a very simple configuration, not only for CV-type languages such as Japanese, in which important information often lies at the beginning of a syllable, but also for a variety of other languages such as English. Parameters such as the extraction width and offset of the time frames and the maximum gain can be changed according to the acoustic environment and the purpose of use.

The consonant processing device, voice information transmission device, and consonant processing method of Example 2 perform consonant enhancement merely by extracting a frame signal with each of a plurality of time frames and calculating and comparing the average powers of these frame signals. There is therefore no need to run various processes in parallel, voice information can be transmitted in near real time, and the signal processing is simple with few logical decisions. Consonants become easier to hear under noise or when speech competes with other acoustic signals, and also for hearing-impaired and elderly listeners; as a result, the overall level of the speech can be reduced, which keeps the environment from becoming noisier. In addition, the approach can be used generally for consonant enhancement in many languages, and because the amplification degree is easy to adjust, a consonant processing device such as a consonant enhancement processing device, and a voice information transmission device, can be provided with a simple configuration and at low cost.

また、子音抑制処理装置等として利用して増幅度を負にした場合、音声を子音または音節の端点が聞き取り難い音声に加工することができ、聴力検査、聞き取り訓練等に利用できる。 Further, when the amplification degree is set to a negative value by using it as a consonant suppression processing device or the like, the voice can be processed into a voice in which the end point of the consonant or syllable is difficult to hear, and can be used for hearing test, listening training and the like.

(Example 3)
A consonant processing device, a voice information transmission device, and a consonant processing method according to Example 3 of the present invention will be described. The consonant processing device and voice information transmission device of Example 3 are also suited to consonant enhancement for CV-type languages.

FIG. 5 is a configuration diagram of the consonant processing device of Example 3 and a voice information transmission device equipped with it, FIG. 6 is an explanatory diagram of the processing of the consonant processing device of Example 3, and FIG. 7(c) is an explanatory diagram of the amplification degree during amplification in Example 3. The same reference numerals denote the same components in Examples 2 and 3; only the configuration of the time frames differs, and all other points are the same as in Example 2, so the description of the basic configuration of the two devices is left to Example 2 and omitted here.

In FIG. 6, 1 is the frame dividing unit, 1a the first time frame, and 1b the second time frame. Further, 2 is the power calculation unit, 2a the first power calculation unit, 2b the second power calculation unit, 3 the comparison unit, 4 the consonant determination unit, 5 the delay unit, 6 the amplification unit, and 7 the amplification degree determination unit. In addition, 10 is the consonant processing device of Example 2, 11 a microphone, 12 a speaker, and 20 the voice information transmission device of Example 2. These have the same configuration as in Example 2.

The characteristic point of Example 3 is as follows. Whereas Example 2 used the third time frame 1c to extract one to three syllables and decided on amplification from the syllables before and after, or from the preceding or the following part alone, Example 3 provides a fourth time frame 1d immediately following the second time frame 1b and adjusts the amplification degree λ by comparison with the following syllable.

In FIG. 6, 1d is the fourth time frame, which has an extraction width capable of capturing a consonant and is placed immediately after the second time frame 1b. Further, 2d is the fourth power calculation unit, which computes the average power L₄, the decibel expression of the average power P₄ of the frame signal y₄(t) output from the fourth time frame 1d. The average power P₄ is obtained from (Equation 1) with i = 4.

実施例３の第１時間フレーム１ａは窓関数ｗ_１（ｔ）＝１（ここでＴ−τ_１≦ｔ≦Ｔ＋τ_１）、ｗ_１（ｔ）＝０（それ以外のとき）で構成され、第２時間フレーム１ｂは窓関数ｗ_２（ｔ）＝１（ここでＴ−τ_２≦ｔ≦Ｔ＋τ_２）、ｗ_２（ｔ）＝０（それ以外のとき）、第４時間フレーム１ｄは窓関数ｗ_４（ｔ）＝１（ここでＴ＋τ_２≦ｔ≦Ｔ＋τ_２＋２τ_４）、ｗ_４（ｔ）＝０（それ以外のとき）で構成される。単位はｍｓである。τ_２＝τ_４であるが、τ_２とτ_４を異なったパラメータとすることもできる。 The first time frame 1a according to the third embodiment includes a window function w ₁ (t) = 1 (where T−τ ₁ ≦ t ≦ T + τ ₁ ) and w ₁ (t) = 0 (otherwise), The second time frame 1b has a window function w ₂ (t) = 1 (where T−τ ₂ ≦ t ≦ T + τ ₂ ), w ₂ (t) = 0 (otherwise), and the fourth time frame 1d has a window The function w ₄ (t) = 1 (here, T + τ ₂ ≦ t ≦ T + τ ₂ + 2τ ₄ ) and w ₄ (t) = 0 (otherwise). The unit is ms. Although τ ₂ = τ ₄ , τ ₂ and τ ₄ can be different parameters.

このτ_１，τ_２，τ_４は経験的に定められるもので、実施例２においてはτ_１＝７．５ｍｓ程度、τ_２，τ_４＝２５ｍｓ程度に設定される。従って、ｗ_１（ｔ）＝１（ここでＴ−７．５≦ｔ≦Ｔ＋７．５）、ｗ_１（ｔ）＝０（それ以外のとき）で構成され、第２時間フレーム１ｂは窓関数ｗ_２（ｔ）＝１（ここでＴ−２５≦ｔ≦Ｔ＋２５）、ｗ_２（ｔ）＝０（それ以外のとき）、第４時間フレーム１ｄは窓関数ｗ_４（ｔ）＝１（ここでＴ＋２５≦ｔ≦Ｔ＋７５）、ｗ_４（ｔ）＝０（それ以外のとき）となる。 These τ ₁ , τ ₂ , and τ ₄ are determined empirically. In the second embodiment, τ ₁ = 7.5 ms and τ ₂ , τ ₄ = 25 ms are set. Accordingly, w ₁ (t) = 1 (here, T−7.5 ≦ t ≦ T + 7.5), w ₁ (t) = 0 (otherwise), and the second time frame 1b is a window function. w ₂ (t) = 1 (where T−25 ≦ t ≦ T + 25), w ₂ (t) = 0 (otherwise), the fourth time frame 1d is the window function w ₄ (t) = 1 (here T + 25 ≦ t ≦ T + 75) and w ₄ (t) = 0 (otherwise).
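The frame extraction and level calculation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that the average power of (Equation 1), which is not reproduced in this section, is the mean square of the samples inside a rectangular window, and that the decibel level is 10·log10 of that power. The function names `frame_level_db` and `example3_levels`, the small constant `eps`, and the half-open sample ranges (used so that the boundary sample between frames 1b and 1d is not counted twice) are all hypothetical choices made for the sketch.

```python
import math


def frame_level_db(x, fs, start_ms, end_ms, eps=1e-12):
    """Level in dB of x inside the rectangular window [start_ms, end_ms).

    Assumption: average power = mean square of the samples in the window.
    """
    lo = max(0, int(round(start_ms * fs / 1000.0)))
    hi = min(len(x), int(round(end_ms * fs / 1000.0)))
    seg = x[lo:hi]
    if not seg:
        # Empty window (e.g. at the signal edges): return a very low level.
        return 10.0 * math.log10(eps)
    power = sum(s * s for s in seg) / len(seg)
    return 10.0 * math.log10(power + eps)


def example3_levels(x, fs, t_ms, tau1=7.5, tau2=25.0, tau4=25.0):
    """Return (L1, L2, L4) in dB for the three windows centred on t_ms."""
    l1 = frame_level_db(x, fs, t_ms - tau1, t_ms + tau1)               # frame 1a
    l2 = frame_level_db(x, fs, t_ms - tau2, t_ms + tau2)               # frame 1b
    l4 = frame_level_db(x, fs, t_ms + tau2, t_ms + tau2 + 2.0 * tau4)  # frame 1d
    return l1, l2, l4
```

With the values given in the text (7.5 ms and 25 ms), frame 1a spans 15 ms, frame 1b spans 50 ms, and frame 1d covers the 50 ms that immediately follow frame 1b.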

次に、実施例３で行う増幅について説明する。先ず第１に、平均パワーＬ_１が平均パワーＬ_２より所定の閾値（実施例２では５ｄＢ）以上高い場合（すなわちＬ_１＞Ｌ_２＋５）は、１５ｍｓ程度のごく狭い範囲で振幅が増加しているだけであるから、この増加は雑音の増加とみなし、Ｌ_１−Ｌ_２を計算して、閾値より大きいか、以下かを算出する。閾値より大きい場合、子音判定部４は音声信号を雑音と判断する。閾値以下の場合は、次の基準で判定する。 Next, amplification performed in the third embodiment will be described. First of all, when the average power L ₁ is higher than the average power L ₂ by a predetermined threshold (5 dB in the second embodiment) or more (that is, L ₁ > L ₂ +5), the amplitude increases in a very narrow range of about 15 ms. Therefore, this increase is regarded as an increase in noise, and L ₁ -L ₂ is calculated to calculate whether it is greater than or less than the threshold. When it is larger than the threshold, the consonant determination unit 4 determines that the voice signal is noise. If it is less than or equal to the threshold value, it is determined according to the following criteria.

Second, if L₁ − L₂ is at or below the threshold (5 dB) and L₄ − 20 < L₂ < L₄, the consonant determination unit 4 judges the point to be an end point of a consonant or syllable, and the amplification degree is determined as λ = c·(L₄ − L₂), where c = 0.72. Instead of the difference of the decibel-expressed average powers (L₄ − L₂), the same rule can be expressed with the ratio of the average powers P₂ and P₄, K₂₄ = P₂/P₄. In that case λ = (K₂₄^(1/2))^d, where d is likewise a coefficient. The meaning of the coefficient c is explained in Example 4 with reference to FIG. 8. The reasoning is that, in a sequence where consonants and vowels alternate, a consonant or syllable end point has lower average power than a vowel; so if comparing the levels L₂ and L₄ shows that L₂ is the smaller, the start of a consonant or syllable is taken to lie in the second time frame 1b, and the amplification target point or amplification target width is amplified.
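The link between the decibel form and the ratio form of the amplification degree can be worked out as below. This is a sketch under two assumptions that the text does not state explicitly: that the decibel level is L_i = 10 log10 P_i, and that a gain of λ dB is applied to the waveform as an amplitude factor g = 10^(λ/20). Under these assumptions the coefficient d of the ratio form would equal −c.

```latex
\begin{align*}
\lambda_{\mathrm{dB}} &= c\,(L_4 - L_2)
   = 10\,c\,\log_{10}\frac{P_4}{P_2}
   = -10\,c\,\log_{10} K_{24}, \qquad K_{24} = \frac{P_2}{P_4},\\
g &= 10^{\lambda_{\mathrm{dB}}/20}
   = K_{24}^{-c/2}
   = \bigl(K_{24}^{1/2}\bigr)^{-c}
   \quad\Longrightarrow\quad d = -c .
\end{align*}
% Example: L_4 - L_2 = 20 dB gives K_{24} = 10^{-2}; with c = 0.72,
% \lambda_{dB} = 14.4 dB and g = 10^{0.72} \approx 5.25.
```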

なお、図６に示す実施例３の増幅対象点は、第２時間フレーム１ｂ、第４時間フレーム１ｄの境界の点である。実施例１，２と同様に、第１時間フレーム１ａ、第２時間フレーム１ｂの中央位置の音声信号を増幅するのでもよいが、第２時間フレーム１ｂ、第４時間フレーム１ｄを設けた場合、この境界を増幅する方が効果を期待でき、実施例３においてはこの境界を増幅対象点としている。また、第２時間フレーム１ｂ、第４時間フレーム１ｄの双方に跨って第１時間フレーム１ａを配置し、境界または付近の増幅対象点または増幅対象幅を増幅するか否かを決定することもできる。このとき、第１時間フレーム１ａを包含する第５時間フレーム（図示しない）を設けて、音声信号が雑音であるか否かを判断し、雑音でないと判断された場合にのみ増幅対象点または増幅対象幅を増幅するのが好適である。 Note that the amplification target points in Example 3 shown in FIG. 6 are the boundary points between the second time frame 1b and the fourth time frame 1d. As in the first and second embodiments, the audio signal at the center position of the first time frame 1a and the second time frame 1b may be amplified, but when the second time frame 1b and the fourth time frame 1d are provided, Amplifying this boundary can be expected to have an effect. In the third embodiment, this boundary is set as an amplification target point. It is also possible to determine whether to amplify the amplification target point or the amplification target width at the boundary or in the vicinity by arranging the first time frame 1a across both the second time frame 1b and the fourth time frame 1d. . At this time, by providing a fifth time frame (not shown) including the first time frame 1a, it is determined whether or not the audio signal is noise. It is preferable to amplify the target width.

デシベル表示した平均パワーＬ_４とＬ_２のレベル差が大きいときほど大きく増幅し、２０ｄＢの差を５．６ｄＢにまで圧縮することができる。Ｌ_２−Ｌ_４＝−２０ｄＢのときには増幅度が１４．４ｄＢで最大となる。 The larger the level difference between the average powers L ₄ and L ₂ displayed in decibels, the larger the amplification, and the difference of 20 dB can be compressed to 5.6 dB. When L ₂ −L ₄ = −20 dB, the amplification degree is maximum at 14.4 dB.

The consonant processing device, voice information transmission device, and consonant processing method described above are effective as a consonant enhancement processing device, a voice information transmission device equipped with it, and a consonant enhancement method. The amplification degree λ was raised when L₄ − 20 < L₂ < L₄; conversely, λ can be made negative so that the signal is suppressed rather than amplified. For example, in hearing tests and listening training for hearing-impaired persons and others, exposing the listener to noise for a long time causes reduced hearing and discomfort. With this means and method, the device serves as a consonant suppression processing device, and the method as a consonant suppression method, effective for such hearing test and listening training apparatus, and the speech itself can be processed.

Furthermore, when L₁ − L₂ is at or below the threshold (5 dB) and L₂ < L₄ − 20, that is, when L₂ is 20 dB or more below L₄, the power is small compared with the preceding and following audio signal, and forcing amplification would make the signal hard to distinguish from noise, so the amplification degree is reduced gradually. For example, as shown in FIG. 7(c), the amplification degree λ is lowered by 3.6 dB each time L₂ − L₄ falls by 10 dB, and λ is set to 0 when L₂ − L₄ is −60 dB. FIG. 7(c) is only an example, however: a stepwise reduction produces unnatural-sounding speech at the points where the amplification degree changes discontinuously, so a smoother reduction such as the one shown by the dash-dot line is preferable.
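The decision rule of Example 3 can be summarized in a short sketch. It is illustrative only and rests on several assumptions: that a frame judged to be noise receives no gain, that the staircase of FIG. 7(c) drops by one step at each full 10 dB below the −20 dB point, and that the smoother alternative is a straight line from the maximum down to 0 dB at −60 dB. The function name `gain_db` and its parameter names are hypothetical.

```python
def gain_db(l1, l2, l4, c=0.72, noise_threshold=5.0,
            knee=-20.0, floor=-60.0, smooth=True):
    """Amplification degree (dB) for one target point, from levels in dB."""
    # A rise confined to the narrow first frame is treated as noise
    # (assumption: no amplification is applied in that case).
    if l1 - l2 > noise_threshold:
        return 0.0

    d = l2 - l4  # negative when the following frame is louder
    if d >= 0.0:
        return 0.0
    if d >= knee:
        return c * (-d)  # lambda = c * (L4 - L2)
    if d <= floor:
        return 0.0

    peak = c * (-knee)  # 14.4 dB for c = 0.72 and a knee at -20 dB
    if smooth:
        # Straight-line taper from the peak at the knee to 0 dB at the floor.
        return peak * (d - floor) / (knee - floor)

    # Staircase taper: one step per full 10 dB below the knee
    # (3.6 dB per step with the default values).
    step = peak / ((knee - floor) / 10.0)
    steps = int((knee - d) // 10.0)
    return max(0.0, peak - steps * step)
```

With the default values, a 20 dB level difference yields a gain of about 14.4 dB, which leaves a residual difference of about 5.6 dB, in line with the figures quoted in the text.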

このように実施例３は、とくに日本語やイタリア語のように重要な情報が音節の始まりに存在することが多いＣＶ型の言語に対して、きわめて簡単な構成で容易に子音強調が可能になる。なお、音環境、使用目的に応じて、時間フレーム１の抽出幅、偏りや、最大ゲインなどのパラメータを変えることができる。 As described above, the third embodiment makes it possible to easily emphasize consonants with a very simple configuration, especially for a CV type language in which important information is often present at the beginning of syllables, such as Japanese and Italian. Become. Note that parameters such as the extraction width, bias, and maximum gain of the time frame 1 can be changed according to the sound environment and the purpose of use.

さらに、実施例１，２のフレーム分割は日本語でも外国語でも子音強調を行える汎用性のあるものであるが、実施例３のフレーム分割は日本語等のＣＶ型の言語の子音強調を効果的に行えるものである。従って、実施例３のフレーム分割を単独で使用しても、実施例１，２のフレーム分割と組合せて使用することもできる。このとき、２つの処理を並行して行い、増幅度の大きい方を選ぶようにすればよい。 Furthermore, the frame divisions of the first and second embodiments are versatile enough to perform consonant enhancement in both Japanese and foreign languages, but the frame division of the third embodiment is effective for consonant enhancement of CV type languages such as Japanese. It can be done. Therefore, the frame division of the third embodiment can be used alone or in combination with the frame division of the first and second embodiments. At this time, the two processes may be performed in parallel to select the one with the larger amplification degree.

The consonant processing device and voice information transmission device of Example 3 can be applied to VC (Vowel-Consonant) type consonant enhancement simply by reversing the consonant/vowel decision. When the consonant determination unit 4 judges that the point is not an end point of a consonant or syllable, the amplification target point or amplification target width of the audio signal is not amplified; when it judges the point to be an end point, the target point or width is amplified. In this case the end of the syllable is emphasized, and consonant enhancement works effectively for foreign languages other than CV-type languages. In Japanese as well, this is effective for the moraic nasal 「ん」 (n) and for syllables that are devoiced because the vowel is dropped in pronunciation.

また、増幅度を負にした場合、音声を子音または音節の端点が聞き取り難い音声に加工することができ、聴力検査、聞き取り訓練等に利用できる。 Further, when the amplification degree is negative, the voice can be processed into a voice in which the end points of consonants or syllables are difficult to hear, and can be used for hearing test, listening training and the like.

As described above, the consonant processing device, voice information transmission device, and consonant processing method of Example 3 perform consonant enhancement merely by comparing the differences in the average powers of the frame signals. There is therefore no need to run various processes in parallel, voice information can be transmitted in near real time, and the signal processing is simple with few logical decisions. Consonants become easier to hear under noise or when speech competes with other acoustic signals, and also for hearing-impaired and elderly listeners; as a result, the overall level of the speech can be reduced without impairing its clarity, which prevents an increase in environmental noise. The approach is well suited to consonant enhancement for CV-type languages such as Japanese, and because the amplification degree is easy to adjust, a consonant processing device such as a consonant enhancement processing device, and a voice information transmission device, can be provided with a simple configuration and at low manufacturing cost.
When the amplification target point or amplification target width of the audio signal is left unamplified if the point is judged not to be an end point of a consonant or syllable, and amplified if it is judged to be one, the end of the syllable is emphasized and consonant enhancement works effectively for foreign languages other than CV-type languages. When used as a consonant suppression processing device or the like, speech can also be processed for hearing tests, listening training, and similar purposes.

(Example 4)
A consonant processing device, a voice information transmission device, and a consonant processing method according to Example 4 of the present invention will be described. FIG. 8 is an explanatory diagram of the amplification characteristics of the consonant processing device of Example 4, FIG. 9 is an explanatory diagram of the speech stimulus pattern, and FIG. 10 is a comparison of the correct answer rates before and after consonant enhancement processing for each speech stimulus.

The comparison unit 3 of Example 3 calculated the amplification degree from the differences of the decibel-expressed average powers Lᵢ (i = 1, 2, 4), whereas Example 4 calculates it from the ratios of the average powers Pᵢ (i = 1, 2, 4) of the time frames. The same reference numerals therefore denote the same components in Examples 3 and 4; only the calculation method of the comparison unit 3 differs, and all other points are the same as in Example 3. Their detailed description is left to Example 3 and omitted here. FIG. 5 and FIG. 6 are therefore referred to.

In Example 4, the comparison unit 3 calculates the ratios K_ij = P_i / P_j (i, j = 1, 2, 4; i < j) of the average powers P_i (i = 1, 2, 4) of the frame signals, and the amplification degree determination unit 7 calculates the amplification degree. For the relationship between L₁ and L₂, it suffices that noise can be discriminated by the ratio, in the same way as in Example 3. The relationship between L₂ and L₄ is therefore described below.

The comparison unit 3 calculates the difference L₂ − L₄ between the decibel-expressed L₂ output by the second power calculation unit 2b and the decibel-expressed L₄ output by the fourth power calculation unit 2d; if L₂ − L₄ > 0, the amplification degree determination unit 7 performs no amplification. When the decision is made with the ratio K₂₄ = P₂/P₄, this condition is K₂₄ > 1. It corresponds to the case where L is larger than point A in FIG. 8.

In contrast, when L₂ − L₄ ≤ 0, or equivalently when the ratio K₂₄ ≤ 1, the amplification degree determination unit 7 performs amplification. In this case the amplification degree is λ = c·(L₄ − L₂), as in Example 3. In FIG. 8, c is the ratio (segment βγ)/(segment αγ). The larger c is, the greater the amplification, and when c is 0 the audio signal is not amplified. This means that for an input at point γ on the broken line, if L₂ < L₄, the output is raised by the length of segment βγ and amplified up to point β.

図８においてＢ点はニーポイント（増幅度の切り換わり点）であって、これ以下のレベルの入出力信号はノイズと判別が難しくなるので、増幅度を下げている。図８の場合、ニーポイントＢ点を−２０ｄＢとし、ニーポイントＢ点より小さな入出力信号に対しては、増幅度を徐々に下げ、ニーポイントＢ点で増幅度が最大となる。 In FIG. 8, point B is a knee point (amplification degree switching point), and an input / output signal at a level below this becomes difficult to distinguish from noise, so the amplification degree is lowered. In the case of FIG. 8, the knee point B point is set to −20 dB, and for input / output signals smaller than the knee point B point, the amplification degree is gradually decreased, and the amplification degree is maximized at the knee point B point.

The consonant processing device and voice information transmission device of Example 4 can also be applied to VC-type consonant enhancement, for example by reversing the decisions made for L₂ − L₄ > 0 and L₂ − L₄ ≤ 0 on the decibel-expressed average power difference, or for K₂₄ > 1 and K₂₄ ≤ 1 on the ratio of the average powers P₂ and P₄. That is, reversing the decision on the amplification degree for the amplification target point or amplification target width emphasizes the end of the syllable, so that consonant enhancement works effectively for foreign languages other than CV-type languages. This frame division can also be combined with the frame divisions of Examples 1, 2, and 3; when they are combined, the two processes are run in parallel and the larger amplification degree is chosen. In this way, when a point is judged to be the final end of a syllable, the audio signal is amplified if the ratio K₂₄ is 1 or less, so the end of the syllable can be made clear.

To confirm the effectiveness of the consonant processing device of Example 4, its intelligibility was verified. The speech stimuli were taken from the adult consonant test material recorded in "Evaluation test for speech listening by wearing a cochlear implant (CI2004)" (edited by the Japan Cochlear Implant Research Group). This material contains 14 VCV syllables: "aba", "ada", "aga", "aha", "aka", "ama", "ana", "apa", "ara", "asa", "ata", "awa", "aya", and "aza". The material was prepared at 44.1 kHz in two versions, with and without consonant enhancement processing, and, as shown in FIG. 9, background noise with upper and lower limit frequencies of 8000 Hz and 50 Hz was added to form the speech stimuli. The background noise lasted 5000 ms with a 500 ms rise and fall, and the speech stimulus subjected to consonant enhancement processing was placed at the center of the 5000 ms duration. The interval to the next speech stimulus was 2000 ms.
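The timing of one stimulus can be sketched as below. This is a rough illustration only: the text does not give the shape of the rise and fall, so linear ramps are assumed, and the 500 ms ramps are assumed to lie inside the 5000 ms duration. Band-limiting of the noise to 50 Hz to 8000 Hz is not modelled. The function names `noise_envelope` and `speech_onset_ms` are hypothetical.

```python
def noise_envelope(t_ms, duration_ms=5000.0, ramp_ms=500.0):
    """Amplitude envelope (0..1) of the background noise at time t_ms.

    Assumption: linear rise and fall, both inside the total duration.
    """
    if t_ms < 0.0 or t_ms > duration_ms:
        return 0.0
    if t_ms < ramp_ms:
        return t_ms / ramp_ms
    if t_ms > duration_ms - ramp_ms:
        return (duration_ms - t_ms) / ramp_ms
    return 1.0


def speech_onset_ms(speech_ms, duration_ms=5000.0):
    """Onset time that centres a speech token of length speech_ms in the noise."""
    return (duration_ms - speech_ms) / 2.0
```

For instance, a token lasting 600 ms would start 2200 ms after the noise onset under these assumptions, well inside the steady portion of the noise.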

この音声刺激を正常な聴力をもつ１４人の実験参加者に与え、子音強調処理を施したものと処理しないものとで正答率を比較した。図１０は音声刺激ごとの子音強調処理前後の正答率を比較したものである。図１０で両者の全体の平均値を比較すると、子音強調処理を施したものの方が処理しないものより高いことが分かる。実施例１の子音加工装置が有効に機能していることが分かる。 This voice stimulus was given to 14 experimental participants with normal hearing ability, and the correct answer rates were compared between those subjected to consonant enhancement processing and those not processed. FIG. 10 compares the correct answer rate before and after the consonant enhancement processing for each voice stimulus. Comparing the average value of both of them in FIG. 10, it can be seen that the one subjected to the consonant enhancement process is higher than the one not processed. It turns out that the consonant processing apparatus of Example 1 is functioning effectively.

Among these, the correct answer rates for "asa" and "aza" were particularly high. One likely reason is that most of the energy of "s" and "z" lies at or above 8000 Hz and so was not masked by the background noise. In addition, fricatives are characterized by their frication noise and by the transitions to the neighboring vowels (VOT and silent intervals), and "s" and "z" appear to carry more speech information in the noise portion (the consonant portion) than in these transitions; the consonant enhancement by the amplification unit 6 of Example 1 is therefore thought to have worked effectively and increased intelligibility.

By contrast, "aba", "ada", and "aga" contain voiced stop consonants, for which the shape of the second formant transition strongly affects identification. Voiced plosives appear to carry much of their speech information in this second formant transition, and the amplitude of the second formant transition is large relative to the vowel; this is thought to be why intelligibility differed little between the stimuli with and without consonant enhancement by the amplification unit 6 of Example 1.

このように実施例４の子音加工装置、音声情報伝達装置及び子音加工方法は、フレーム信号の平均パワーの比率を比較するだけで子音強調が行えるから、並列的に様々の処理を行う必要がなく、リアルタイムに近い時間内に音声情報伝達が行え、信号処理が簡単で、騒音下、あるいは音声が他の音響信号と競合する状況であっても、また、難聴者、高齢者でも子音が聞き取り易くなり、これにより音声の明瞭さを損なうことなく音声全体の強さを減らすことができ、環境騒音が増加するのを防ぐことができる。また、日本語等のＣＶ型の言語の子音強調に好適で、増幅度を簡単に調整できるため構成が簡単で、安価に製造できる子音強調処理装置等の子音加工装置、音声情報伝達装置を提供することができる。 As described above, the consonant processing device, the speech information transmission device, and the consonant processing method of the fourth embodiment can perform consonant enhancement only by comparing the ratios of the average powers of the frame signals. Therefore, it is not necessary to perform various processes in parallel. Voice information can be transmitted in a time close to real time, signal processing is simple, and consonants are easy to hear even in deaf and elderly people, even under noisy conditions or when the voice competes with other acoustic signals. Thus, the strength of the entire voice can be reduced without impairing the clarity of the voice, and an increase in environmental noise can be prevented. Also suitable for consonant emphasis in CV type languages such as Japanese, and providing a consonant processing device such as a consonant emphasis processing device and a voice information transmission device that can be manufactured at low cost because the amplification level can be easily adjusted. can do.

また、実施例３と同様に増幅度を負にした場合、子音抑制処理装置等として音声を子音または音節の端点が聞き取り難い音声に加工することができ、聴力検査、聞き取り訓練等に利用できる。 Further, when the amplification degree is negative as in the third embodiment, the speech can be processed into speech in which consonant or syllable end points are difficult to hear as a consonant suppression processing device or the like, which can be used for hearing test, listening training, and the like.

(Example 5)
A consonant processing device, a voice information transmission device, and a consonant processing method according to Example 5 of the present invention will be described below. FIG. 11 is a configuration diagram of the consonant processing device of Example 5 and a voice information transmission device equipped with it.

実施例５における子音加工装置は、音声信号の子音あるいは音節の境界をより明瞭に検出するために、予め音声信号を処理して時間フレーム１に入力するものである。 The consonant processing apparatus according to the fifth embodiment processes a speech signal in advance and inputs it into the time frame 1 in order to detect the consonant or syllable boundary of the speech signal more clearly.

Reference numeral 8 in FIG. 11 denotes a filter unit placed immediately before the frame dividing unit 1. The filter unit 8 passes frequency components at or below 3000 Hz and has a peak near 1000 Hz, which allows consonant or syllable boundaries to be detected more appropriately. Example 5 shows the filter unit 8 added to the consonant processing device 10 and voice information transmission device 20 of Example 1, but the filter unit 8 can equally be added to the consonant processing device 10 and voice information transmission device 20 of Examples 2 to 4. These variants are not illustrated.

Thus, the consonant processing device, voice information transmission device, and consonant processing method of Example 5 can detect consonant or syllable boundaries clearly and simply, so that speech becomes easier to hear under noise or when it competes with other acoustic signals, and also for hearing-impaired and elderly listeners.

The present invention can be applied to voice information transmission devices such as announcement broadcasting equipment, mobile phones, and hearing aids.

FIG. 1: Configuration diagram of the consonant processing device of Example 1 of the present invention and a voice information transmission device equipped with it
FIG. 2: Explanatory diagram of the processing of the consonant processing device of Example 1 of the present invention
FIG. 3: Configuration diagram of the consonant processing device of Example 2 of the present invention and a voice information transmission device equipped with it
FIG. 4: (a) Explanatory diagram of the recruitment phenomenon; (b) comparison of sound in a quiet environment and sound in noise
FIG. 5: Configuration diagram of the consonant processing device of Example 3 of the present invention and a voice information transmission device equipped with it
FIG. 6: Explanatory diagram of the processing of the consonant processing device of Example 3 of the present invention
FIG. 7: (a) Explanatory diagram of the amplification degree during amplification in Example 1; (b) the same for Example 2; (c) the same for Example 3
FIG. 8: Explanatory diagram of the amplification characteristics of the consonant processing device of Example 4 of the present invention
FIG. 9: Explanatory diagram of the speech stimulus pattern
FIG. 10: Comparison of the correct answer rates before and after consonant enhancement processing for each speech stimulus
FIG. 11: Configuration diagram of the consonant processing device of Example 5 of the present invention and a voice information transmission device equipped with it

Explanation of symbols

1 Frame dividing unit
1a First time frame
1b Second time frame
1c Third time frame
1d Fourth time frame
2 Power calculation unit
2a First power calculation unit
2b Second power calculation unit
2c Third power calculation unit
2d Fourth power calculation unit
3 Comparison unit
4 Consonant determination unit
5 Delay unit
6 Amplification unit
7 Amplification degree determination unit
8 Filter unit
10 Consonant processing device
11 Microphone
12 Speaker
20 Audio information transmission device

Claims

A consonant processing device comprising:
a frame dividing unit that extracts a frame signal for each of a plurality of time frames from an input audio signal;
a power calculation unit that calculates an average power for each of the frame signals;
a comparison unit that compares the average powers between the frame signals;
a consonant determination unit that determines, based on the comparison result of the comparison unit, whether an amplification target point or an amplification target width of the audio signal is an end point of a consonant or syllable; and
an amplification unit that amplifies the amplification target point or the amplification target width of the audio signal when the consonant determination unit determines that it is an end point of a consonant or syllable, and does not amplify it when the consonant determination unit determines that it is not an end point of a consonant or syllable.
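
The chain of units recited in this claim can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the claimed implementation: the function names, the frame length in samples, the fixed gain, the use of two adjacent non-overlapping frames, and the choice of the second frame as the amplification target width are all hypothetical. The decision rule is passed in as a parameter because the claim only says that it is based on the comparison result.

```python
from typing import Callable, List, Sequence


def average_power(frame: Sequence[float]) -> float:
    """Mean of the squared sample values of one frame signal."""
    return sum(x * x for x in frame) / len(frame)


def process(
    signal: Sequence[float],
    frame_len: int,
    is_end_point: Callable[[float, float], bool],
    gain: float = 2.0,
) -> List[float]:
    """Amplify the frames flagged by the decision rule; leave the rest unchanged.

    frame_len, gain and is_end_point are hypothetical parameters chosen
    for illustration only.
    """
    out = list(signal)
    # Frame division: consecutive, non-overlapping frames of frame_len samples.
    for s in range(0, len(signal) - 2 * frame_len + 1, frame_len):
        first = signal[s : s + frame_len]
        second = signal[s + frame_len : s + 2 * frame_len]
        # Power calculation and comparison between the two frame signals.
        p1, p2 = average_power(first), average_power(second)
        # Determination, then amplification of the target width
        # (assumed here to be the second frame of the pair).
        if is_end_point(p1, p2):
            for i in range(s + frame_len, s + 2 * frame_len):
                out[i] = signal[i] * gain
    return out
```

For example, `process(samples, 4, lambda p1, p2: p1 <= p2)` doubles every four-sample frame whose average power is at least that of the frame before it; the lambda is just one possible rule.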

A consonant processing device comprising:
a frame dividing unit that extracts a frame signal for each of a plurality of time frames from an input audio signal;
a power calculation unit that calculates an average power for each of the frame signals;
a comparison unit that compares the average powers between the frame signals;
a consonant determination unit that determines, based on the comparison result of the comparison unit, whether an amplification target point or an amplification target width of the audio signal is an end point of a consonant or syllable;
an amplification degree determination unit that sets the amplification degree of the amplification target point or the amplification target width of the audio signal in the amplifying direction when the consonant determination unit determines that it is an end point of a consonant or syllable, and determines that it is not to be amplified when the consonant determination unit determines that it is not an end point of a consonant or syllable; and
an amplification unit that amplifies the audio signal in accordance with the amplification degree determined by the amplification degree determination unit.

The consonant processing device according to claim 1, wherein the comparison unit performs the comparison by calculating the difference between the average powers of the frame signals expressed in decibels.

The consonant processing apparatus according to claim 1, wherein the comparison unit compares the frame signals by calculating a ratio of average power of the frame signals.

The consonant processing device according to any one of claims 1 to 4, wherein the time frames include a time frame having an extraction width capable of extracting a consonant, and the amplification target point or the amplification target width is set at the center position of the extraction width of that time frame.

The consonant processing device according to claim 1, wherein, when two consecutive time frames are provided as the time frames, the amplification target point or the amplification target width is set at the boundary between the two time frames.

The consonant processing device according to claim 3, wherein the amplitude of the audio signal at the amplification target point or over the amplification target width is amplified when the difference between the average powers expressed in decibels is 0 or less, and is not amplified when the difference expressed in decibels is greater than 0.
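
As a rough illustration of this decision rule, the sketch below converts two average powers to decibels and applies the threshold at 0 dB. The claim does not state which frame's level is subtracted from which, so the order `power_a` minus `power_b`, the function names, and the small floor `eps` that avoids taking the logarithm of zero are assumptions made here.

```python
import math


def power_db(avg_power: float, eps: float = 1e-12) -> float:
    """Average power in decibels; eps is an assumed floor to avoid log10(0)."""
    return 10.0 * math.log10(max(avg_power, eps))


def should_amplify_db(power_a: float, power_b: float) -> bool:
    """True when the decibel difference (a minus b, an assumed order) is 0 or less."""
    return power_db(power_a) - power_db(power_b) <= 0.0
```

Equal powers give a difference of exactly 0 dB and therefore fall on the "amplify" side of the threshold.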

The consonant processing device according to claim 4, wherein the amplitude of the audio signal at the amplification target point or over the amplification target width is amplified when the ratio between the average powers is 1 or less, and is not amplified when the ratio between the average powers is greater than 1.
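
The ratio threshold of 1 and a decibel-difference threshold of 0 express the same condition, assuming the decibel level is defined as ten times the base-10 logarithm of the average power. Writing the two average powers with the hypothetical symbols P_a and P_b (both positive):

```latex
\[
10\log_{10} P_a - 10\log_{10} P_b
  = 10\log_{10}\frac{P_a}{P_b} \le 0
  \quad\Longleftrightarrow\quad
  \frac{P_a}{P_b} \le 1
\]
```

The ratio form avoids the logarithm, which may matter where signal processing is meant to stay simple.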

The consonant processing device according to claim 1, wherein, instead of amplifying the amplification target point or the amplification target width of the audio signal when the consonant determination unit determines that it is an end point of a consonant or syllable, the amplification unit conversely suppresses the amplification target point or the amplification target width of the audio signal when it is so determined.

The consonant processing device according to any one of claims 2 to 6, wherein, instead of setting the amplification degree of the amplification target point or the amplification target width of the audio signal in the amplifying direction when the consonant determination unit determines that it is an end point of a consonant or syllable, the amplification degree determination unit conversely sets the amplification degree of the amplification target point or the amplification target width of the audio signal in the suppressing direction when it is so determined.

The consonant processing apparatus according to claim 1, further comprising a filter unit that allows a predetermined frequency component to pass before an audio signal is input to the frame dividing unit.
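
One simple way to pass a chosen frequency band ahead of frame division is a first-order filter. The sketch below uses a standard discrete RC-style high-pass as a stand-in. The claim does not say which band is passed or what kind of filter is used, so the filter type, the function name, and the `sample_rate` and `cutoff_hz` parameters are assumptions for illustration.

```python
import math
from typing import List, Sequence


def high_pass(signal: Sequence[float], sample_rate: float, cutoff_hz: float) -> List[float]:
    """First-order RC-style high-pass filter (an assumed filter type)."""
    if not signal:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)  # time constant for the cutoff
    dt = 1.0 / sample_rate                   # sampling interval
    alpha = rc / (rc + dt)
    out = [signal[0]]
    for n in range(1, len(signal)):
        # Each output follows the change in the input and decays otherwise.
        out.append(alpha * (out[-1] + signal[n] - signal[n - 1]))
    return out
```

The filtered samples would then be handed to the frame dividing step in place of the raw input.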

The consonant processing device according to claim 1, wherein the amplification unit amplifies the physical sound pressure in accordance with a recruitment correction characteristic that matches the loudness, which is a sensory quantity, perceived by a person with sensorineural hearing loss to the loudness perceived by a person with normal hearing.

An audio information transmission device comprising: the consonant processing device according to any one of claims 1 to 12; and a speaker that outputs consonant-enhanced sound based on the consonant-processed audio signal from the consonant processing device.

A consonant processing method comprising: extracting a frame signal for each of a plurality of time frames from an input audio signal; calculating an average power for each of the frame signals; comparing the average powers between the frame signals; determining, based on the comparison, whether an amplification target point or an amplification target width of the audio signal is an end point of a consonant or syllable; and amplifying the amplification target point or the amplification target width of the audio signal when it is determined to be an end point of a consonant or syllable, and not amplifying it when it is determined not to be.

The consonant processing method according to claim 14, wherein, instead of amplifying the amplification target point or the amplification target width of the audio signal when it is determined to be an end point of a consonant or syllable, the amplification target point or the amplification target width of the audio signal is conversely suppressed when it is so determined.

The consonant processing method according to claim 14 or 15, wherein the physical sound pressure is amplified in accordance with a recruitment correction characteristic that matches the loudness perceived by a person with sensorineural hearing loss to the loudness perceived by a person with normal hearing.