JP2002366189A

JP2002366189A - System for identifying and detecting music and voice

Info

Publication number: JP2002366189A
Application number: JP2001217355A
Authority: JP
Inventors: Junichi Kakumoto; 純一角元
Original assignee: Individual
Current assignee: Individual
Priority date: 2001-06-12
Filing date: 2001-06-12
Publication date: 2002-12-20

Abstract

PROBLEM TO BE SOLVED: To provide a signal processing technique that automatically improves the clarity of speech transmission, for electronic circuits and digital signal processing in equipment industries such as audio equipment, broadcasting equipment, receivers, and announcement equipment. SOLUTION: A signal is generated by delaying the original acoustic signal by a specified time within a range of several hundred milliseconds. The product of this delayed signal and the original signal is computed and passed through a low-pass filter, which yields a form of correlation strength at that time difference. The correlation is then evaluated statistically to obtain a function that discriminates between music and speech.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The invention relates to a signal processing technique that automatically improves the clarity of speech transmission. It applies to electronic circuits and digital signal processing in equipment industries such as [audio equipment, broadcasting equipment, receivers, and announcement equipment].

[0002]

[Problem to Be Solved by the Invention] In recent years, the sound quality of audio and AV equipment has tended to emphasize the bass range by modifying the transfer characteristics of the electronic circuits. In many cases this approach is designed mainly with the sound quality of music in mind. In such systems it often gets in the way of clarity for signals such as news or weather forecasts, where conveying speech intelligibly matters most. In general, the pass characteristics [needed to express the richness of music] and those [needed to convey the content of speech] are mutually incompatible. Since no single filter satisfies both needs, solving this problem requires a function that distinguishes music from speech.

[0003]

[Means for Solving the Problem] Experiments confirm that, in general, the strength of the autocorrelation function of a speech signal decays more strongly [as the time difference grows] than that of a music signal. Speech varies widely under the influence of [individual differences] and [the content of the speech, such as news, weather forecasts, conversation, film dialogue, and the presence or absence of background music]. Music likewise varies with [tempo], [type of instrument], and [playing style]. Nevertheless, on average, speech and music differ markedly in this respect. The present invention focuses on this point and solves the problem with a small-scale computation, without heavy processing [such as Fourier analysis and pattern analysis of its results].

[0004]

[Prior Art] None in particular. For consumer products, the user normally adjusts the sound quality to taste. In recording and PA work, specialist operators adjust the sound quality case by case.

[0005]

[Definition of Terms] [Speech] is the signal of a person's speaking voice. [Music signal] is a signal such as instrument sounds or singing that make up music. [Addition] generally includes subtraction. [Addition, subtraction, multiplication, division, logarithm] are not meant in the strict mathematical sense; they are functions that may tolerate errors which do not hinder building a practical device. [Effective component] is a [positive or negative] DC component or a [long-period AC component] that is contained in the product of two acoustic signals and is strongly related to the correlation strength. [Ineffective component] is a short-period AC component that is contained in the product of two acoustic signals and is unrelated to the correlation strength. [DSP] is an arithmetic processing integrated circuit specialized for signal processing.

[0006] FIG. 1 is a block diagram showing one embodiment of the present invention. To keep the description simple yet practically sufficient, an embodiment with two delay functions is shown. Because this is an embodiment of practical signal processing intended to yield a working function, it includes many functions unrelated to the essence of the invention, but this does not limit the generality of the invention. In general, the number of delay functions is determined by the [performance and cost] required of the application device. Because the invention is intended for consumer products, this embodiment provides functions that are necessary and sufficient as judged by ordinary human perception.

[0007] f(t) is the input acoustic signal. DLY_1 and DLY_2 are delay functions with specific delay times D1 and D2; these blocks are called the first delay function and the second delay function. MPY_0, MPY_1, and MPY_2 are multiplication functions that each perform four-quadrant multiplication of two inputs; they are called the zeroth, first, and second multiplication functions. LPF_0, LPF_1, and LPF_2 of block LPF are low-pass filter functions with identical characteristics and cutoff frequency FR; they are called the zeroth, first, and second low-pass filter functions. RMS_0, RMS_1, and RMS_2 are functions that each obtain the short-time average value, the effective (RMS) value, or a similar value of their input signal; they are called the zeroth, first, and second averaging functions. The outputs of the averaging functions are P0(t), P1(t), and P2(t).

[0008] SMPL_0, SMPL_1, and SMPL_2 of block SMPL are sampling functions that integrate the output of the averaging function over the period TS and sample the result. Their outputs are PS0(t), PS1(t), and PS2(t). SGT is a selection function that selects the larger of the sampling function outputs PS1(t) and PS2(t). Its output is PS12(t).

[0009] LOG_0 and LOG_12 are logarithm functions that take the logarithm of PS0(t) and PS12(t), respectively. PL0(t) and PL12(t) are the outputs of LOG_0 and LOG_12. NRM is a normalization function that outputs the difference between PL12(t) and PL0(t); its output is G(t). DIFF is a difference function that computes, every period TS, the difference between G(t) and G(t-TS); its output is H(t). AVG is an integration function that integrates over a period of N times TS and outputs the sampled result; its output is J(t). DTCT is a detection function that averages J(t) further, distinguishes speech from music, and converts the result into the signal needed to control the acoustic filter FLT. DTCT has the following averaging parameters: dead zone Zded, threshold Lthd, attack time constant TAavg, and release time constant TRavg.

[0010] CTRL is a control function that uses the detection output M(t) to judge the correlation strength of the input acoustic signal, or to generate an acoustic-characteristic control signal [needed to control the acoustic characteristics] that corresponds to the correlation strength. FLT is a variable-constant filter function that changes the acoustic characteristics.

[0011] In the following description the index n is 0, 1, or 2, and blocks with the same index belong to the same block group. The block consisting of [MPY_0, LPF_0, RMS_0] outputs the short-time average intensity P0(t) of the input acoustic signal. Both inputs of the multiplication function MPY_0 are the input acoustic signal f(t). The output C0(t) of MPY_0 is therefore the square of f(t), so all of its components are effective components and its value is never negative. The output P0(t) of the averaging function RMS_0 is the short-time average intensity of C0(t). P0(t) is used to normalize the correlation strength between f(t) and f(t-Dn). The dimension of P0(t) is that of the square of the acoustic signal.

[0012] The block consisting of [MPY_n, LPF_n, RMS_n] outputs the short-time average correlation strength Pn(t) between the input acoustic signal f(t) and its delayed copy f(t-Dn). One input of the multiplication function MPY_n is the input acoustic signal f(t); the other is the delayed signal f(t-Dn), which is f(t) delayed by time Dn. Experiments for this embodiment confirm that delay times from several tens of msec to several hundred msec work well. The output Cn(t) of MPY_n is the product of f(t) and f(t-Dn). C0(t) is an exact square of f(t), whereas Cn(t) is, across the whole audio band, [the product of two signals whose phase and frequency are not necessarily the same]. It therefore contains many [ineffective components, which include frequency components close to twice the original frequencies and which swing between positive and negative]. The more stable the period of f(t), the larger the [positive or negative] effective component contained in Cn(t).
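
For a single stable tone, the split of Cn(t) into an effective and an ineffective component can be written out explicitly. The identity below is a worked illustration only; the amplitude A and angular frequency ω are symbols introduced here and do not appear in the original text.

```latex
f(t) = A\sin(\omega t)
\quad\Longrightarrow\quad
C_n(t) = f(t)\,f(t - D_n)
       = \underbrace{\tfrac{A^{2}}{2}\cos(\omega D_n)}_{\text{effective (DC) component}}
       \;-\;
       \underbrace{\tfrac{A^{2}}{2}\cos(2\omega t - \omega D_n)}_{\text{ineffective component at } 2\omega}
```

The DC term is positive or negative depending on ωDn, and the second term oscillates at twice the original frequency, which is the part a low-pass filter can remove.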

[0013] In general, [music and speech] signals are generated by [natural vibrations of strings, membranes, or structures], and so they have autocorrelation strength. Cn(t) therefore contains an [effective component, that is, a low-frequency component]. The more stably the vibration of the sound source is sustained, the larger the effective component; the more the vibration changes, the smaller it becomes. There is no sharp boundary between music and speech, but in general the [timbre and pitch] of speech tend to change widely and in complex ways, whereas the [timbre and pitch] of music tend to be stable. Consequently P1(t) and P2(t) are generally smaller than P0(t). A numerical evaluation of how large they are, and of how much they change over time, serves as the basis for deciding whether the signal is speech or music.

[0014] The low-pass filter function LPF_n removes the ineffective components from Cn(t) and extracts the effective component CLn(t). A simple [first-order or second-order low-pass filter] is normally used for LPF_n. In this embodiment it is a second-order low-pass filter with a cutoff frequency of 1 Hz to 20 Hz. The exact characteristics of LPF_n are not essential to the invention, so a detailed description is omitted.
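
The multiply-then-filter step can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the second-order filter is approximated by two cascaded first-order sections, and the function names, the 8 kHz sample rate, the 100 Hz test tone, and the 10 Hz cutoff are all assumptions chosen for the example.

```python
import math

def delayed_product(f, delay_samples):
    """MPY_n: Cn[k] = f[k] * f[k - delay]; samples before the start count as 0."""
    return [f[k] * f[k - delay_samples] if k >= delay_samples else 0.0
            for k in range(len(f))]

def one_pole_lowpass(samples, cutoff_hz, sample_rate_hz):
    """First-order low-pass: y[k] = y[k-1] + a * (x[k] - y[k-1])."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    y, out = 0.0, []
    for x in samples:
        y += a * (x - y)
        out.append(y)
    return out

def lpf_n(samples, cutoff_hz=10.0, sample_rate_hz=8000.0):
    """Second-order low-pass built from two cascaded first-order sections (assumed design)."""
    stage1 = one_pole_lowpass(samples, cutoff_hz, sample_rate_hz)
    return one_pole_lowpass(stage1, cutoff_hz, sample_rate_hz)

# Assumed test signal: 2 s of a 100 Hz sine at 8 kHz, delayed by 100 msec (800 samples).
fs = 8000.0
tone = [math.sin(2.0 * math.pi * 100.0 * k / fs) for k in range(16000)]
cl = lpf_n(delayed_product(tone, 800), cutoff_hz=10.0, sample_rate_hz=fs)
```

With this tone the delay is a whole number of periods, so the product is sin² and the filtered value settles near its mean of 0.5, with only a small residue of the doubled-frequency component.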

[0015] The averaging function RMS_n extracts the short-time average intensity Pn(t) of CLn(t). Because CLn(t) has the dimension of a product of two acoustic signals and must match the dimension of P0(t), a [short-time average of the absolute value] is a simple and effective choice for RMS_n. The method by which RMS_n obtains the short-time average intensity of CLn(t) is not essential to the invention, so a detailed description is omitted.
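
One way to realize a short-time average of the absolute value is a moving average over a fixed window. The sketch below is only an illustration; the function name and the `window` parameter are hypothetical, and the source does not specify the window length or the averaging method.

```python
def short_time_abs_average(samples, window):
    """RMS_n sketch: moving average of |x| over the most recent `window` samples."""
    out, acc = [], 0.0
    for k, x in enumerate(samples):
        acc += abs(x)
        if k >= window:
            acc -= abs(samples[k - window])  # drop the sample that left the window
        out.append(acc / min(k + 1, window))
    return out

pn = short_time_abs_average([1.0, -1.0, 2.0, -2.0], window=2)
```

Because the absolute value is taken before averaging, a negative effective component contributes as much as a positive one of the same size.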

[0016] In most cases a digital signal processor performs this chain of signal processing. Given the nature of the target products, the cost allowed for this kind of signal processing is on the order of several tens of yen (several tens of cents), so the number of computation steps and the amount of memory used must be kept to a minimum. On the other hand, tone control of an acoustic signal does not need to be fast relative to human perception. In this embodiment, processing up to SMPL runs at every sampling period of the audio data, and the subsequent processing is computed on samples taken every 50 msec to 100 msec. SMPL_n integrates Pn(t) over the time TS and outputs the result as a sample value. SGT selects the larger of PS1(t) and PS2(t) to obtain the output PS12(t). Obtaining the correlation strength accurately over an arbitrary frequency range is computationally expensive, so this embodiment computes two correlation strengths and selects the larger one, which secures practicality at low computational cost. How many values n takes is not essential to the invention.
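
The integrate-and-sample step and the larger-of-two selection can be sketched as follows. The function names and the block-sum form of the integration are assumptions made for illustration; `block_len` stands for the number of audio samples in one period TS.

```python
def integrate_and_sample(p, block_len):
    """SMPL_n sketch: sum p over consecutive blocks of block_len samples (one value per TS)."""
    return [sum(p[i:i + block_len])
            for i in range(0, len(p) - block_len + 1, block_len)]

def select_greater(ps1, ps2):
    """SGT sketch: element-wise larger of the two sampled correlation strengths."""
    return [max(a, b) for a, b in zip(ps1, ps2)]

ps1 = integrate_and_sample([1.0, 2.0, 3.0, 4.0, 5.0], block_len=2)
ps12 = select_greater([1.0, 5.0], [2.0, 3.0])
```

After this step the data rate drops from the audio sample rate to one value per TS, which is what keeps the later stages cheap.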

[0017] Since the magnitude of PS12(t) depends on the magnitude of the input signal, its ratio to PS0(t) is needed for normalization. Rather than computing the quotient PS12(t)/PS0(t), it is more convenient in signal processing to compute Log(PS12(t)) - Log(PS0(t)), so PS0(t) and PS12(t) are converted to logarithms by the logarithm functions LOG_0 and LOG_12. NRM is a simple subtraction step. Its output G(t) is PL12(t) - PL0(t), which equals Log{PS12(t)/PS0(t)}: the short-time average of the correlation strength, normalized by the mean-square value of the original signal. G(t) is therefore unaffected by the level of the input signal.
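
The level independence of G(t) is easy to check numerically: scaling the input by a gain g scales both PS12 and PS0 by g squared, and the difference of logarithms cancels that factor. In this sketch the base-10 logarithm and the small `floor` guard against log(0) are assumptions; the source does not state a base or a guard.

```python
import math

def normalize_log(ps12, ps0, floor=1e-12):
    """LOG_12, LOG_0 and NRM sketch: G = log(PS12) - log(PS0), element-wise."""
    return [math.log10(max(a, floor)) - math.log10(max(b, floor))
            for a, b in zip(ps12, ps0)]

g_quiet = normalize_log([0.2], [0.8])
g_loud = normalize_log([0.2 * 9.0], [0.8 * 9.0])  # same signal with input gain 3 (power x9)
```

Both calls give the same G, which is the property the paragraph above relies on.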

[0018] The output H(t) of the difference function DIFF is G(t) - G(t-TS). In general, for music signals, vocals included, the magnitude of G(t) changes little over time, whereas for speech signals such as news and weather forecasts it changes greatly. This embodiment uses [the degree of change over time in the magnitude of G(t)] to distinguish music from speech. Which property of G(t) is used for this decision is not essential to the invention. SMPL_N (the integration function AVG described above) further smooths H(t). In this embodiment J(t) is H(t) + H(t-TS) + H(t-2TS) + H(t-3TS), but whether this smoothing is present, and by what method, is not essential to the invention.
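
The difference and four-term sum can be sketched as below. This is an illustration under assumptions: the sum is computed as a sliding window at every TS step rather than sampled once per N periods, terms before the start of the data are treated as zero, and the function names are hypothetical.

```python
def diff(g):
    """DIFF sketch: H[k] = G[k] - G[k-1]; the first value has no predecessor and is set to 0."""
    return [0.0] + [g[k] - g[k - 1] for k in range(1, len(g))]

def sum_last(h, n=4):
    """Smoothing sketch: J[k] = H[k] + H[k-1] + ... + H[k-n+1] (missing early terms count as 0)."""
    return [sum(h[max(0, k - n + 1):k + 1]) for k in range(len(h))]

g = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]
h = diff(g)
j = sum_last(h, n=4)
```

As a purely arithmetic observation, the four-term sum of signed differences telescopes to G(t) - G(t-4TS); whether the original implementation uses signed or rectified differences is not stated in the text.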

[0019] DTCT takes the averaged output J(t) as input and smooths it further into a signal that is well suited to tone control. The statistical properties of an acoustic signal fluctuate widely all the time, and if such a fluctuating value were used directly as the tone control signal, the controlled sound would feel unnatural to the listener. Various smoothing methods are therefore used. This embodiment sets a center value Lthd and a dead zone Zded for the decision, and then smooths with a time-constant function having an attack time TAavg and a release time TRavg to generate the tone control signal M(t). Whether DTCT smooths, and by what method, is not essential to the invention.
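
The text names the DTCT parameters but does not give the update rule, so the following is only one plausible sketch. The assumptions are: values within Zded of Lthd are treated as zero, the smoother is a one-pole filter, the attack constant applies when the target is larger in magnitude than the current output, and the release constant applies otherwise. All default values are hypothetical.

```python
import math

def dtct(j_values, lthd=0.0, zded=1.0, ta_avg=0.5, tr_avg=2.0, ts=0.1):
    """DTCT sketch: dead zone around Lthd, then attack/release one-pole smoothing."""
    a_attack = 1.0 - math.exp(-ts / ta_avg)    # faster coefficient
    a_release = 1.0 - math.exp(-ts / tr_avg)   # slower coefficient
    m, out = 0.0, []
    for j in j_values:
        x = j - lthd
        if abs(x) <= zded:       # inside the dead zone: no evidence either way
            x = 0.0
        a = a_attack if abs(x) > abs(m) else a_release
        m += a * (x - m)
        out.append(m)
    return out

m_idle = dtct([0.5] * 10)     # stays inside the dead zone
m_step = dtct([5.0] * 100)    # sustained evidence well above the dead zone
```

The dead zone keeps small fluctuations from moving the control signal at all, and the two time constants let the output react quickly to new evidence but relax slowly.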

[0020] In general, to raise the clarity of speech it is desirable to remove the pitch component, a frequency component that is not needed for recognizing speech. Because the pitch component lies at the lowest end of the speech spectrum, suppressing the bass range improves clarity. The pitch component of speech varies greatly between individuals, and also with microphone technique, microphone type, and the sound effects applied in broadcasting. In any case, it is well known that suppressing the bass range is effective for improving intelligibility. In this embodiment, therefore, when the input signal is judged to be speech, CTR operates so that the acoustic filter FLT takes on a bass-suppressing characteristic.
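
A switchable bass cut of the kind described can be sketched with a first-order high-pass filter. The filter order, the 300 Hz corner, the 8 kHz sample rate, and the boolean `is_speech` interface are all assumptions for illustration; the source only says that FLT suppresses the bass range when speech is detected.

```python
import math

def bass_cut(samples, cutoff_hz, sample_rate_hz):
    """First-order high-pass: the input minus its one-pole low-passed copy."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    low, out = 0.0, []
    for x in samples:
        low += a * (x - low)
        out.append(x - low)
    return out

def flt(samples, is_speech, sample_rate_hz=8000.0, speech_cutoff_hz=300.0):
    """FLT sketch: suppress the bass range for speech, pass music unchanged."""
    if is_speech:
        return bass_cut(samples, speech_cutoff_hz, sample_rate_hz)
    return list(samples)

dc = [1.0] * 8000
speech_out = flt(dc, is_speech=True)    # a constant (0 Hz) input is removed
music_out = flt(dc, is_speech=False)    # passes through untouched
```

A practical device would more likely fade between the two characteristics under control of M(t) than switch abruptly, but that choice is outside what the text specifies.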

[0021] How the value of the delay time Dn is chosen is not essential to the invention, but because the value of Dn affects the performance of music/speech discrimination, Dn in the embodiment of FIG. 1 is described in more detail. Experiments show that an average Dn of roughly several tens to several hundreds of milliseconds is appropriate. [The shorter this average Dn, the larger the effective component overall], and [the longer it is, the smaller the effective component overall]. Within speech, too, the size of the effective component differs between [somewhat fast speech such as news] and [somewhat slow speech such as commentary]. Individual differences are also large.

[0022]

[Description of Table 1] To keep the explanation simple, Table 1 takes as an example an embodiment based on the configuration of FIG. 1 with delay time D1 set to 100 msec and D2 set to 82 msec. For a signal that is a pure sine wave, it shows the calculated expected detection rate, that is, the rate at which the signal can be detected as music.

[0023] To simplify the description of Table 1, the columns are labelled A to F horizontally and the rows 1 to 85 vertically. Row 1 indicates that the values shown are detection rates (in %). Row 2 shows the number of samples with a strong correlation, that is, a correlation strength of 0.5 or more. Row 3 shows the sample population. Rows 8 to 85 of column A give the frequencies of the input signals, which lie on the 12-tone equal-tempered scale. Rows 8 to 85 of column B give the correlation strength with the original signal when the delay function D1 is 82 msec. Rows 8 to 85 of column C are marked "1" where the correlation strength in column B is 0.5 or more. Rows 8 to 85 of column D give the correlation strength with the original signal when the delay time is 100 msec. Rows 8 to 85 of column E are marked "1" where the correlation strength in column D is 0.5 or more. Column F is marked "1" where either of the correlation strengths for the 82 msec and 100 msec delays is 0.5 or more. As the table shows, with a delay time of 82 msec, 50 of a population of 71 samples are detected as music signals, a detection rate of 70.4%. With a delay time of 100 msec, 50 of 71 samples are detected as music signals, again a detection rate of 70.4%. In a system that uses both the 82 msec and 100 msec delays, 66 of 71 samples are detected, a detection rate of 93.0%. These figures are for pure sine waves.
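
The kind of calculation behind Table 1 can be sketched as follows. For a pure sine of frequency f, the normalized autocorrelation at lag D is cos(2πfD). Everything else here is an assumption: the note list (71 semitones upward from 27.5 Hz) is hypothetical because the table's frequencies are not reproduced in the text, and the absolute value is taken on the basis that the description speaks of [positive or negative] effective components. The sketch is not claimed to reproduce the table's figures.

```python
import math

def sine_correlation(freq_hz, delay_s):
    """Normalized autocorrelation of a pure sine at lag delay_s: cos(2*pi*f*D)."""
    return math.cos(2.0 * math.pi * freq_hz * delay_s)

def detection_rate(freqs, delays, threshold=0.5):
    """Percentage of tones whose |correlation| reaches the threshold for at least one delay."""
    hits = sum(1 for f in freqs
               if any(abs(sine_correlation(f, d)) >= threshold for d in delays))
    return 100.0 * hits / len(freqs)

# Hypothetical note set on the 12-tone equal-tempered scale.
notes = [27.5 * 2.0 ** (k / 12.0) for k in range(71)]

rate_82 = detection_rate(notes, [0.082])
rate_100 = detection_rate(notes, [0.100])
rate_both = detection_rate(notes, [0.082, 0.100])
```

Using two delays can only raise the rate relative to either delay alone, because a tone missed by one delay may still correlate strongly at the other.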

[0024]

[Description of Table 2] Table 2 confirms the performance of an actual signal processing program, for the embodiment of FIG. 1, on real music signals and speech signals. The dimensions of the individual values are not discussed. The No. column gives the signal number. The M/S column indicates whether the source signal is a music signal or a speech signal. The Source column gives the type of source signal. The suffix _E indicates that the speech is in English, and the suffix _JP that it is in Japanese. The G(t) column gives, for each source, the output G(t) of block LOG_12 in the embodiment of FIG. 1. The H(t) column gives, for each source, the output H(t) of block DIFF in the embodiment of FIG. 1. The M(t) column gives, for each source, the output M(t) of block DTCT in the embodiment of FIG. 1.

[0025] Rows 1 to 9 of Table 2 give the values of G(t), H(t), and M(t) when the signal source is music. Rows 10 to 20 give the values of G(t), H(t), and M(t) when the signal source is speech. G(t) lies in the range 79.8 to 115 for music but in the range 165 to 213 for speech, so the two are clearly separated. H(t) lies in the range -17 to -26.8 for music but in the range 1.0 to 13.1 for speech, so the two are clearly separated. M(t) lies in the range -11.4 to -32.7 for music but in the range 7.0 to 37.8 for speech, so the two are clearly separated. That these results are spread over ranges reflects the characteristics of the individual signals; in reality the boundary is not sharp. At the least, clear speech is clearly judged to be speech and clear music is clearly judged to be music. This does not detract from the essence of the invention.

[0026]

[Embodiments of the Invention] Devices that convey music or speech, such as the following examples.
1) Programs for computers or DSPs
2) DSP chips
3) AV equipment, stereo systems, televisions, radios, PA systems, and so on

[0027]

[Effects of the Invention] 1) By combining known techniques, the invention provides a function that automatically controls the clarity of speech, and it automatically corrects the lack of speech clarity that is common in equipment designed mainly for music. It is convenient where catching the content matters, as with news, weather forecasts, and stock information. It is especially effective for [speech carrying numerical information] and [speech that conveys a lot of content in a short time].

[0028]

[Brief description of the drawings]

[FIG. 1] An embodiment of the present invention.

[Explanation of symbols]

In the following description, the index n is 0, 1, or 2.

[INPUT] Input signal terminal

[f(t)] Input signal

[Output] Output terminal

[F(t)] Output signal

[DLY_n] Delay function

[Dn] Delay time

[f(t-Dn)] Delayed signal

[MPY_n] Multiplier

[Cn(t)] Multiplier output signal

[LPF] Low-pass filter

[LPF_n] Low-pass filter

[FR] Cutoff frequency of the low-pass filter

[CLn(t)] Output of the low-pass filter

[RMS_n] Short-time average intensity calculation function

[Pn(t)] Output of the short-time average intensity function

[SMPL] Integration and sampling function

[SMPL_n] Integration and sampling function

[TS] Integration time or sampling period

[PSn(t)] Sampled signal

[SGT] Function that selects the larger signal

[PS12(t)] The larger signal

[LOG_0] Logarithm function

[LOG_12] Logarithm function

[PL0(t)] Reference correlation strength

[PL12(t)] Detected correlation strength

[NRM] Normalization function

[G(t)] Normalized detected correlation strength

[DIFF] Difference function

[H(t)] Difference output

[N] Integration time as a multiple of TS

[J(t)] Detected value at the N*TS sampling period

[DTCT] Smoothing function

[TAavg] Attack time constant

[TRavg] Release time constant

[Zded] Detection dead zone

[Lthd] Detection level

[M(t)] Output of the smoothing function

[CTR] Control function for the acoustic filter

[FLT] Variable-constant filter that, under control, provides the optimum characteristic for speech or music

[0029]

[Brief explanation of the table]

[Table 1] Response of the detection section of the FIG. 1 embodiment to a sine-wave input

[Table 2] Response of the detection section of the FIG. 1 embodiment to real signal inputs

[0030]

[Explanation of symbols in table]

[Signal frequency] Frequency of the signal when the input is a sine wave on the 12-tone equal-tempered scale

[Delay time] The delay time D1 or D2 in FIG. 1

[Correlation strength] Pure autocorrelation coefficient

[Correlation strength 0.5 or more] The autocorrelation coefficient, whose maximum is 1, is 0.5 or more

[Correlation strength 0.5 or more as a system] The larger of the values for delay times D1 and D2

[Population] Number of samples

[Number with correlation strength 0.5 or more] Of the samples, the number with a strong correlation strength of 0.5 or more

[Detection rate %] Proportion of samples with a correlation strength of 0.5 or more

[No.] Sample number column

[S] Column distinguishing music from speech

[Source] Column indicating the type of signal

[G(t)] Output of the normalization function NRM in FIG. 1

[H(t)] Output of the difference function DIFF in FIG. 1

[M(t)] Output of the smoothing function DTCT in FIG. 1

Claims

[Claims]

1. An arbitrary acoustic signal is called a first acoustic signal. One or more delay functions that take the first acoustic signal as input and have mutually different delay times are called a first delay function group, and the respective output signals of the first delay function group are called a first delay signal group. The functions that multiply [the first acoustic signal] by [each signal of the first delay signal group] are called a first multiplication function group. The functions that apply [integration or a low-pass filter] to each output signal of the first multiplication function group are called a first low-pass filter function group. The functions that obtain an [average value or effective value], by any method, from each output signal of the first low-pass filter function group are called a first averaging function group, and the respective outputs of the first averaging function group are called a first correlation strength group. A function that takes the first acoustic signal as input and changes a filter characteristic is called a first variable filter function. A first feature is that the system has at least the [first delay function group, first multiplication function group, and first averaging function group].
A second feature is that the first correlation strength group, or a signal group that depends on the first correlation strength group, is used as the signal for [discriminating between music and speech].

2. A first feature is that the system has at least the [first delay function group, first multiplication function group, and first averaging function group]. A second feature is that the system has a structure in which the signal controls the filter characteristic of the first variable filter function.