
JP2009069425A - Music detection device, speech detection device and sound field control device

Info

Publication number: JP2009069425A
Application number: JP2007237194A
Authority: JP
Inventors: Osamu Fujii (藤井 修)
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2007-09-12
Filing date: 2007-09-12
Publication date: 2009-04-02
Anticipated expiration: 2027-09-12
Also published as: JP4885812B2

Abstract

PROBLEM TO BE SOLVED: To accurately determine whether a sound signal contains music or speech, and to perform scene determination appropriate for the listener.

SOLUTION: In the music detection device 1 of the invention, a scale spectrum calculation section 11 calculates, for each frame representing a predetermined period of the sound signal, the spectrum at each frequency of an equal-tempered scale, and an autocorrelation coefficient calculation section 12 calculates the autocorrelation values of that spectrum. A variance calculation section 16 quantifies the magnitude of the variation, over a plurality of consecutive frames, of the maximum autocorrelation value detected by a coefficient maximum detection section 13. When that magnitude is smaller than a predetermined threshold, a music/non-music determination section 17 determines that the sound signal is music.

COPYRIGHT: (C) 2009, JPO & INPIT

Description

The present invention relates to a music detection device, a speech detection device, and a sound field control device that are provided in a television receiver or the like and determine whether a scene of a program being broadcast contains music or contains speech.

With advances in audio reproduction technology, users can enjoy at home the natural reverberation and sense of presence of a concert hall or movie theater, for example through high-volume music playback on HiFi (high fidelity) audio equipment in a dedicated listening room, or through surround playback on a multi-channel home theater system.

On the other hand, viewers usually watch content such as television broadcasts at low volume, in a living room, kitchen, or similar space. A sense of presence and easily intelligible speech are now expected even when watching television at such low volume.

Sound field control therefore needs to be applied to the content being broadcast or played back. In digital broadcasting, which is becoming the mainstream of television broadcasting, one conceivable approach is to perform sound field control according to the program genre (that is, one sound field setting common to the whole program), using the SI (Service Information) transmitted with the broadcast wave or the EPG (Electronic Program Guide) information generated from the SI.

However, a single program consists of multiple scenes, such as speech-only scenes, music-only scenes, and scenes containing both speech and music. Genre-based sound field control that relies only on SI or EPG information therefore suits some scenes but not others.

Consequently, to apply appropriate sound field control throughout a program, control must be performed per scene rather than per program. For a music scene, for example, emphasizing the sound pressure in the low and high frequency bands increases the sense of presence. For a speech scene, emphasizing the sound pressure in the middle frequency band makes speech (the human voice) easier to hear.

It is therefore necessary to determine whether the current scene contains music, contains speech, and so on.

Techniques proposed for detecting music scenes or speech scenes include the speech/music discrimination device for audio band signals described in Patent Document 1, the speech/music discrimination device described in Patent Document 2, the music detection device, music detection method, and recording/playback device described in Patent Document 3, and the audio information classification device described in Patent Document 4.

Patent Document 1 discloses a speech/music discrimination device configured to detect the sound pressure in the low and high bands and to determine that the signal is music when the detected sound pressure is strong, as well as a configuration that determines that the signal is speech when the received signal is monaural.

The speech/music discrimination device disclosed in Patent Document 2 calculates the acoustic power of each frame, determines from the calculated power whether each frame is sounded or silent, and, for each group of multiple frames, determines that the signal is music when the number of sounded frames exceeds a predetermined threshold.

The music detection device disclosed in Patent Document 3 calculates the sum of the powers of the two channels of a two-channel audio signal and the difference between the channel powers, calculates the ratio between that sum and that difference, and determines music sections from the result of comparing the ratio with a predetermined threshold.

The audio information classification device disclosed in Patent Document 4 uses frequency data for each unit time to extract only the sounded sections, calculates the energy change rate per second for the extracted sounded sections, and extracts speech sections according to the magnitude of that rate. It further obtains an average band energy ratio from the frequency data for each unit time and extracts music sections from that ratio. The techniques disclosed in Patent Documents 1 to 4 all discriminate music or speech on the basis of the power or energy of the frequency spectrum.

Other techniques for detecting music scenes include the music detection circuit and the audio signal input device using that circuit described in Patent Document 5, and the video classification method and device described in Patent Document 6. These techniques detect music scenes by focusing on the harmonic structure of the frequency spectrum.

The music detection circuit disclosed in Patent Document 5 filters the input signal with a plurality of band-pass filters, determines for each filter's output signal whether that output repeats periodically, and determines that the input/output signal is music when it does.

The video classification device disclosed in Patent Document 6 includes a music detection unit that frequency-analyzes the sound information contained in the input video information and detects music from the strength of time-direction edges at a fixed frequency in a spectrogram formed by arranging the resulting spectra along the time axis.

Another technique for detecting speech scenes is the acoustic section detection method and device disclosed in Patent Document 7. In that method, each frame is frequency-transformed, and correlation values between predetermined frequency bands are calculated within the same frame or between adjacent frames. A band number is then calculated as the difference between the identifier of the frequency band with the maximum value and the identifier of the band with the minimum value, the band number is corrected on the basis of its variance, and a weighted band number, which is the maximum of the corrected band numbers, is calculated. An acoustic feature is obtained by multiplying the correlation value by the weighted band number. Speech sections are then determined from the correlation of the calculated acoustic feature within the same frame or between different frames.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H5-88695 (published April 9, 1993)
Patent Document 2: Japanese Unexamined Patent Application Publication No. H6-4088 (published January 14, 1994)
Patent Document 3: Japanese Unexamined Patent Application Publication No. 2006-301134 (published November 2, 2006)
Patent Document 4: Japanese Unexamined Patent Application Publication No. H10-247093 (published September 14, 1998)
Patent Document 5: Japanese Unexamined Patent Application Publication No. H5-289693 (published November 5, 1993)
Patent Document 6: Japanese Unexamined Patent Application Publication No. H10-187182 (published July 14, 1998)
Patent Document 7: International Publication No. WO 2004/111996 (published internationally December 23, 2004)

However, with the configurations of Patent Documents 1 to 4, which discriminate solely on the basis of the power or energy of the frequency spectrum, it is difficult to distinguish music scenes and speech scenes accurately.

The configuration of Patent Document 5 may also misjudge periodic noise as music. In the configuration of Patent Document 6, the edge strength described above resembles the onset of a spoken sound, so speech onsets are misjudged as music and the music detection accuracy is low.

The present invention has been made in view of the above problems, and its object is to provide a music detection device, a speech detection device, and a sound field control device that accurately discriminate music scenes, speech scenes, and the like on the basis of the characteristics of various kinds of sound.

To solve the above problems, a music detection device according to the present invention comprises: spectrum calculation means for calculating a frequency spectrum from an acoustic signal for each frame representing a predetermined period of the acoustic signal; autocorrelation value calculation means for calculating autocorrelation values of the frequency spectrum; quantification means for quantifying the magnitude of the variation, over a plurality of consecutive frames, of the maximum of the autocorrelation values; and music determination means for determining that the acoustic signal is music when the magnitude of the variation is smaller than a predetermined threshold.

According to the above configuration, the spectrum calculation means of the music detection device calculates a frequency spectrum from the acoustic signal for each frame representing a predetermined period of the acoustic signal. The spectrum calculation means may calculate the spectrum on a linear frequency axis, that is, at fixed frequency intervals, or it may calculate the spectrum for each note of a musical scale (for example, equal temperament or just intonation); there is no particular limitation.

Further, according to the above configuration, the autocorrelation value calculation means calculates the autocorrelation values of the spectrum for each frame. An autocorrelation value here is a value representing the autocorrelation of the spectrum. For example, if the N data points representing the spectrum are sp(i) (i = 0, 1, ..., N−1), the autocorrelation values are given by R1(x) (x = 1, 2, ..., M) in Equation 1. In the example of Equation 1, M autocorrelation values are calculated.

A value obtained by normalizing the autocorrelation function of Equation 1, or by multiplying it by a constant, may also be used as the autocorrelation value; any value that allows the autocorrelation to be evaluated is acceptable.

Here, L is the number of multiply-accumulate operations used to calculate each autocorrelation value with Equation 1. For example, if there are 136 spectral data points (N = 136), an L of about 68 is sufficient. If the upper limit of x (that is, M) is also set to 68, the number of multiply-accumulate operations needed to calculate R1(x) is the same for every x.
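Equation 1 is referenced in the text but is not reproduced here. A form consistent with the surrounding description (a sum of L products of sp(i) and the shifted sequence sp(i+x), evaluated for x = 1 to M) would be the following. This is a reconstruction from the prose, offered as an assumption, and not the equation as published.

```latex
R_1(x) = \sum_{i=0}^{L-1} sp(i)\, sp(i+x), \qquad x = 1, 2, \dots, M
```

Under this form, with N = 136 and L = M = 68, the largest index accessed is i + x = 67 + 68 = 135 = N − 1, so every lag uses exactly L products without running past the end of the data.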

As in the autocorrelation function of Equation 1, the autocorrelation value is the sum of the products of the spectral data sequence sp(i) and the sequence sp(i+x) obtained by shifting it by x. When the peaks of the spectrum sp(i) are periodic and x equals one period (that is, the spacing between the frequencies at which the peaks occur), the values multiplied together are both peak values, so the value of R1(x) in Equation 1 becomes large.
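The behavior described above can be illustrated with a minimal sketch. It assumes the autocorrelation is a plain sum of L products of sp(i) and sp(i+x); the function name, parameter names, and the toy spectrum with a peak every 8 bins are hypothetical choices for illustration, not values from the patent.

```python
def spectral_autocorrelation(sp, num_terms, max_lag):
    """Return [R1(1), ..., R1(max_lag)], where
    R1(x) = sum of sp[i] * sp[i + x] for i in 0..num_terms-1 (assumed form)."""
    if num_terms + max_lag > len(sp):
        raise ValueError("need num_terms + max_lag <= len(sp)")
    return [
        sum(sp[i] * sp[i + x] for i in range(num_terms))
        for x in range(1, max_lag + 1)
    ]


# Toy spectrum: N = 136 bins with a peak every 8 bins (hypothetical spacing).
sp = [1.0 if i % 8 == 0 else 0.0 for i in range(136)]

# L = M = 68, matching the example sizes given in the text.
r1 = spectral_autocorrelation(sp, num_terms=68, max_lag=68)

# r1[0] holds R1(1), so add 1 to convert a list index into a lag.
best_lag = 1 + r1.index(max(r1))  # first lag that attains the maximum
```

In this toy case the maximum is first reached at a lag equal to the peak spacing, because that is the smallest shift at which peaks line up with peaks.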

When the acoustic signal represents a sound containing overtones (for example, the sound of a violin), its frequency spectrum has peaks at integer multiples of the fundamental frequency.

Therefore, when the spectrum calculation means calculates the spectral data at fixed frequency intervals (that is, on a linear frequency axis), peak values appear in the data sequence sp(i) at a constant data spacing.

In this case, when the acoustic signal represents a sound containing overtones (for example, the sound of a violin), the autocorrelation function value R1(x) is largest when x equals the data spacing of the peaks.

Likewise, when the spectrum calculation means calculates the frequency spectrum for each note of a scale (for example, equal temperament or just intonation), peak values appear in the data sequence sp(i) at a constant data spacing.

In this case too, when the acoustic signal represents a sound containing overtones (for example, the sound of a violin), the autocorrelation function value R1(x) is largest when x equals the data spacing of the peaks.

Further, according to the above configuration, the quantification means quantifies the magnitude of the variation, over a plurality of consecutive frames, of the maximum autocorrelation value. Examples of a value that quantifies this variation include the variance, the standard deviation, and the difference between the maximum and minimum values; there is no particular limitation.

According to the above configuration, the music determination means determines that the acoustic signal is music when the magnitude of the variation is smaller than a predetermined threshold. The sound of an instrument that contains overtone components sustains those components for a certain time. In other words, a spectral waveform with peaks at the overtone components persists for a certain time (that is, over a plurality of frames). The maximum autocorrelation value of the spectrum then also stays within a certain range over those frames.

Accordingly, if the variation of the maximum autocorrelation value is sufficiently small, the overtone components of an instrument are being sustained. The music determination means therefore compares the variation of the maximum autocorrelation value with a predetermined threshold to determine whether the variation is sufficiently small.
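The decision rule above can be sketched as follows, using the variance as the measure of variation (one of the options the text lists). The function name, the per-frame values, and the threshold of 0.5 are hypothetical and chosen only to make the example concrete.

```python
from statistics import pvariance


def is_music_by_harmonic_stability(max_autocorr_per_frame, threshold):
    """Return True when the per-frame maximum autocorrelation values vary
    less than the threshold over the window of consecutive frames."""
    return pvariance(max_autocorr_per_frame) < threshold


# Hypothetical per-frame maxima over five consecutive frames.
steady = [9.0, 9.1, 8.9, 9.0, 9.05]   # sustained overtones: nearly constant
erratic = [2.0, 9.0, 0.5, 7.0, 4.0]   # no sustained harmonic structure

steady_is_music = is_music_by_harmonic_stability(steady, threshold=0.5)
erratic_is_music = is_music_by_harmonic_stability(erratic, threshold=0.5)
```

The population variance is used here for simplicity; the standard deviation or the max-min range would fit the same structure with a differently scaled threshold.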

The music detection device according to the present invention can thereby detect the sound of instruments that contain overtone components, that is, the musical tones of string instruments such as the violin and wind instruments such as the trumpet.

In the music detection device according to the present invention, the spectrum calculation means preferably calculates the spectrum at each frequency corresponding to a musical scale.

According to the above configuration, the spectrum calculation means calculates the spectrum at each frequency of a tuning system such as equal temperament or just intonation. For example, it calculates from the acoustic signal, for each frame representing a predetermined period of the signal, the spectrum at each frequency of an equal-tempered scale. An equal-tempered scale is a scale obtained by dividing one octave according to a geometric progression. In a 12-tone equal-tempered scale, for example, one octave (the interval over which the frequency doubles) is divided into 12 steps in geometric progression, so the frequency ratio of adjacent notes is the twelfth root of 2. That is, with $f_0$ as the frequency of the base note, each frequency $f_n$ of the 12-tone equal-tempered scale is given by $f_n = f_0 \times 2^{n/12}$. The equal-tempered scale is not limited to 12 tones, and the base frequency may be chosen freely. The spectrum calculation means calculates the spectrum corresponding to each frequency of the equal-tempered scale, so the same number of spectral values is calculated for every octave: 12 per octave in the case of a 12-tone equal-tempered scale.
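As a small worked example of the relation f_n = f_0 × 2^(n/12), the sketch below lists the analysis frequencies of an equal-tempered scale. The base frequency of 55 Hz and the function name are assumptions made for illustration; the text leaves the base frequency open.

```python
def equal_temperament_frequencies(f0, count, divisions=12):
    """Return the first `count` frequencies f_n = f0 * 2 ** (n / divisions)."""
    return [f0 * 2.0 ** (n / divisions) for n in range(count)]


# Two octaves above a hypothetical base note of 55 Hz (25 notes, n = 0..24).
freqs = equal_temperament_frequencies(f0=55.0, count=25)

# Every 12th entry is one octave (a factor of 2) above the previous one,
# and adjacent entries differ by the twelfth root of 2.
octave_ratio = freqs[12] / freqs[0]
semitone_ratio = freqs[1] / freqs[0]
```

Passing a different `divisions` value covers equal-tempered scales with other step counts, which the text also allows.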

In the music detection device according to the present invention, as described above, the autocorrelation value calculation means calculates the autocorrelation values of the spectrum for each frame. As described above, an autocorrelation value is a value representing the autocorrelation of the spectrum. For example, if the N data points representing the spectrum at each frequency of the equal-tempered scale are sp(i) (i = 0, 1, ..., N−1), the autocorrelation values are given by R1(x) (x = 1, 2, ..., M) in Equation 1. In the example of Equation 1, M autocorrelation values are calculated.

Here, L is the number of multiply-accumulate operations used to calculate each autocorrelation value with Equation 1. For example, if there are 136 spectral data points (N = 136), an L of about 68 is sufficient. If the upper limit of x (that is, M) is also set to 68, the number of multiply-accumulate operations needed to calculate R1(x) is the same for every x.

As described above, when the acoustic signal represents a sound containing overtones (for example, the sound of a violin), its frequency spectrum has peaks at integer multiples of the fundamental frequency. In that case the frequency spectrum also has peaks at octave intervals from the fundamental frequency.

Because an equal-tempered scale allocates the same number of frequencies to every octave, the spectral data sequence sp(i) described above contains data for frequencies one octave apart at a fixed spacing. With a 12-tone equal-tempered scale, for example, sp(0), sp(12), sp(24), ... are one octave apart, a spacing of 12 data points. Since the frequency spectrum of an acoustic signal containing overtones has peaks at octave intervals from the fundamental, peaks also appear in the data sequence sp(i) at octave intervals, that is, at a constant data spacing (12 data points in the case of a 12-tone equal-tempered scale).

Accordingly, when the acoustic signal represents a sound containing overtones (for example, the sound of a violin), the autocorrelation function value R1(x) is largest when x equals the octave spacing (or an integer multiple of it).
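The octave-spacing argument can be checked on an idealized spectrum sampled on a 12-tone equal-tempered axis. The sketch below assumes a sum-of-products autocorrelation and places unit peaks only at a hypothetical fundamental bin and its octaves; real instrument spectra also contain non-octave overtones, which this toy input ignores.

```python
def spectral_autocorrelation(sp, num_terms, max_lag):
    """Return [R1(1), ..., R1(max_lag)] as sums of sp[i] * sp[i + x] (assumed form)."""
    return [
        sum(sp[i] * sp[i + x] for i in range(num_terms))
        for x in range(1, max_lag + 1)
    ]


NOTES_PER_OCTAVE = 12   # 12-tone equal temperament
fundamental_bin = 4     # hypothetical index of the fundamental on the scale axis

# Idealized scale spectrum: N = 136 bins, peaks at the fundamental and its octaves.
sp = [
    1.0 if i >= fundamental_bin and (i - fundamental_bin) % NOTES_PER_OCTAVE == 0 else 0.0
    for i in range(136)
]

r1 = spectral_autocorrelation(sp, num_terms=68, max_lag=68)
best_lag = 1 + r1.index(max(r1))  # r1[0] is R1(1)

# The maximizing lag is a whole number of octaves on the scale axis.
lag_is_octave_multiple = best_lag % NOTES_PER_OCTAVE == 0
```

Lags of 12, 24, 36, and so on all reach the same maximum here, which matches the text's note that integer multiples of the octave spacing also maximize R1(x).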

This reduces the number of data points in the spectral data sequence sp(i) used to calculate the autocorrelation function value R1(x), and therefore the amount of computation. The sound of instruments that contain overtone components, that is, the musical tones of string instruments such as the violin and wind instruments such as the trumpet, can thus be detected at high speed.

In the music detection device according to the present invention, the autocorrelation value calculation means preferably uses the N data points sp(i) (i = 0, 1, ..., N−1) representing the spectrum to calculate the M autocorrelation values as the values of the autocorrelation function R1(x) (x = 1, 2, ..., M) described above.

In the music detection device according to the present invention, the quantification means preferably quantifies the variation by calculating the variance of the maximum values.

To solve the above problems, a music detection device according to the present invention comprises: spectral power calculation means for calculating, from an acoustic signal and for each frame representing a predetermined period of the acoustic signal, the spectral power at each frequency corresponding to a musical scale; maximum scale identification number detection means for detecting, for each frame, the maximum scale identification number, which is the scale identification number at which the spectral power is largest, a scale identification number identifying each frequency being assigned to each frequency of the scale; quantification means for quantifying the magnitude of the variation of the maximum scale identification number over a plurality of consecutive frames; and music determination means for determining that the acoustic signal is music when the magnitude of the variation is larger than a predetermined threshold.

According to the above configuration, the spectral power calculation means calculates, from the acoustic signal and for each frame representing a predetermined period of the signal, the spectral power at each frequency of a scale such as equal temperament or just intonation. For example, it calculates the spectral power at each frequency of an equal-tempered scale. An equal-tempered scale is a scale obtained by dividing one octave according to a geometric progression. The scale is not limited to 12-tone equal temperament, and the base frequency may be chosen freely. In this case the spectral power calculation means calculates the spectral power corresponding to each frequency of the equal-tempered scale, so the same number of values is calculated for every octave: 12 spectral power values per octave in the case of a 12-tone equal-tempered scale.

Further, according to the above configuration, a scale identification number identifying each frequency is assigned to each frequency of the scale, and the maximum scale identification number detection means detects, for each frame, the scale identification number at which the spectral power is largest. The scale identification numbers are consecutive numbers assigned in ascending or descending order of frequency, with equal spacing between adjacent numbers. A scale identification number therefore represents the rank of a note by pitch within the equal-tempered scale. The note at the frequency with the largest spectral power is the strongest sound in a frame, so the maximum scale identification number identifies the strongest sound in that frame.

Further, according to the above configuration, the quantification means quantifies the magnitude of the variation of the maximum scale identification number over a plurality of consecutive frames. Examples of a value that quantifies this variation include the variance, the standard deviation, and the difference between the maximum and minimum values; there is no particular limitation.

According to the above configuration, the music determination means determines that the acoustic signal is music when the magnitude of the variation is larger than a predetermined threshold. Music is expressed by combining the pitch, loudness, duration, and timbre of sounds. A change in pitch means that the strongest sound in a frame changes from frame to frame. In other words, the maximum scale identification number varies across frames.

したがって、上記最大値音階識別番号のばらつきが十分大きければ、音響信号は、音楽を表していることになる。そのため、上記音楽判定手段は、上記最大値音階識別番号のばらつきを、予め定められた閾値と比較することによって、ばらつきが十分大きいか否かを判定している。 Therefore, if the variation of the maximum value scale identification number is sufficiently large, the acoustic signal represents music. Therefore, the music determination means determines whether the variation is sufficiently large by comparing the variation of the maximum value scale identification number with a predetermined threshold value.

これにより、本発明に係る音楽検出装置は、音符の有無、すなわち、音の高低が変化する音楽を検出することが可能となる。 Thereby, the music detection apparatus according to the present invention can detect music in which the presence or absence of a note, that is, the pitch of a sound changes.
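The pitch-variation test described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of the population variance as the variation measure, and the threshold value are all assumptions chosen for the example.

```python
from statistics import pvariance

def strongest_scale_id(scale_power):
    """Return the scale identification number (list index) with the largest power."""
    return max(range(len(scale_power)), key=lambda k: scale_power[k])

def is_melodic_music(frames, threshold):
    """Judge music from the spread of the per-frame strongest scale number.

    frames: one list of per-scale spectrum powers per frame.
    threshold: hypothetical tuning value; the source only says "predetermined".
    """
    ids = [strongest_scale_id(power) for power in frames]
    # Variance is one of the measures the text allows (others: standard
    # deviation, max-min difference).
    return pvariance(ids) > threshold
```

A signal whose strongest note stays fixed yields a variance of zero and is rejected, while a melody that moves between notes yields a larger variance.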

本発明に係る音楽検出装置では、上記数値化手段は、上記最大値の分散を算出して、上記ばらつきを数値化することを特徴とすることが好ましい。 In the music detection apparatus according to the present invention, it is preferable that the numerical means calculates the variance of the maximum value and numerically expresses the variation.

In order to solve the above problem, the music detection device according to the present invention comprises: low-frequency spectrum power calculation means for calculating, for each frame of an acoustic signal, a low-frequency spectrum power by adding the spectrum powers of the frequencies at or below (or strictly below) a predetermined first threshold; frame interval detection means for detecting the frame interval at which the autocorrelation value of the low-frequency spectrum power over a predetermined number of consecutive frames is largest; high-frequency spectrum power calculation means for calculating, for each frame of the acoustic signal, a high-frequency spectrum power by adding the spectrum powers of the frequencies at or above (or strictly above) the first threshold; and music determination means for determining that the acoustic signal is music when the ratio of the low-frequency spectrum power to the high-frequency spectrum power is equal to or greater than a predetermined second threshold and the frame interval is within a predetermined range.

According to the above configuration, in the music detection device according to the present invention, the low-frequency spectrum power calculation means calculates, for each frame of the acoustic signal, the low-frequency spectrum power by adding the spectrum powers of the frequencies at or below (or strictly below) the predetermined first threshold.

Further, according to the above configuration, the frame interval detection means detects the frame interval at which the autocorrelation value of the low-frequency spectrum power over a predetermined number of consecutive frames is largest. Here, the autocorrelation value represents the autocorrelation of the spectrum power over the predetermined number of consecutive frames. For example, if the spectrum power data of N frames are denoted spp(i) (i = 0, 1, ..., N-1), the autocorrelation value is given by R2(x) (x = 1, 2, ..., M) shown in Equation 2. In the example of Equation 2, M autocorrelation values are calculated.

自己相関値としては、数２に示す自己相関関数を正規化した値や定数を乗じた値を用いてもよく、自己相関を評価できる値であれば特に限定はされない。 As the autocorrelation value, a value obtained by normalizing the autocorrelation function shown in Formula 2 or a value multiplied by a constant may be used, and there is no particular limitation as long as the autocorrelation can be evaluated.

Here, L is the number of product-sum terms used when the autocorrelation value is calculated with Equation 2. For example, if the number of spectrum power data points is 128 (that is, N = 128), an L of about 64 is sufficient. If the upper limit of x (that is, M) is also set to 64, the number of product-sum operations needed to calculate R2(x) is the same for every x.

As in the autocorrelation function of Equation 2, the autocorrelation value is obtained by multiplying the spectrum power data sequence spp(i) of a plurality of frames by the same sequence shifted by x, spp(i + x), and summing the products. When the fluctuation of the spectrum power across the frames represented by spp(i) is periodic and x equals one period (that is, the frame interval or time interval between peaks), the values being multiplied are both peak values, so the autocorrelation function R2(x) of Equation 2 becomes large.

Incidentally, when the acoustic signal represents a sound with a low-frequency rhythm (for example, a drum kit or a taiko drum), the low-frequency spectrum power of the signal varies periodically over time and peaks at a constant frame (time) interval.

したがって、上記音響信号が低域においてリズムを有する音を表す信号の場合、上記自己相関関数値Ｒ２（ｘ）は、ｘが低域スペクトルパワーの時間変化の周期に等しいフレーム間隔となるときに最大値となる。 Therefore, when the acoustic signal is a signal representing a sound having a rhythm in a low frequency range, the autocorrelation function value R2 (x) is maximum when x is a frame interval equal to the period of time change of the low frequency spectrum power. Value.
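Equation 2 itself is not reproduced in this text, so the sketch below assumes the product-sum form that the description spells out: R2(x) is the sum of spp(i) * spp(i + x) over L terms, evaluated for x = 1 to M. The function names are hypothetical, and the example uses the N = 128, L = 64, M = 64 values mentioned above.

```python
def autocorrelation(spp, L, M):
    """Return [R2(1), ..., R2(M)] with R2(x) = sum_{i=0}^{L-1} spp[i] * spp[i + x].

    Assumed form of Equation 2; the largest index used is L - 1 + M, so
    L + M must not exceed len(spp).
    """
    if L + M > len(spp):
        raise ValueError("need L + M <= number of frames")
    return [sum(spp[i] * spp[i + x] for i in range(L)) for x in range(1, M + 1)]

def peak_frame_interval(spp, L, M):
    """Return the lag x (in frames) at which R2(x) is largest.

    If several lags tie, the smallest one is returned.
    """
    r = autocorrelation(spp, L, M)
    return 1 + max(range(M), key=lambda k: r[k])
```

For a low-frequency power sequence that peaks every 8 frames, the detected interval is 8 frames, which the later determination step would convert to a time period using the frame duration.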

Further, according to the above configuration, in the music detection device according to the present invention, the high-frequency spectrum power calculation means calculates, for each frame of the acoustic signal, the high-frequency spectrum power by adding the spectrum powers of the frequencies at or above (or strictly above) the first threshold.

According to the above configuration, in the music detection device according to the present invention, the music determination means determines that the acoustic signal is music when the ratio of the low-frequency spectrum power to the high-frequency spectrum power is equal to or greater than the predetermined second threshold and the frame interval is within the predetermined range.

For an instrument with a low-frequency rhythm, the period of the time variation of the low-frequency spectrum power falls within a certain time range. For example, drum kits and taiko drums often form rhythms with a period between 0.2 and 1.5 seconds.

また、低域においてリズムを有する音であっても、人の声の場合には、低域のスペクトルはほとんど含まれず、低域のスペクトルパワーは非常に小さい。一方、ドラムなどの音の場合は、低域のスペクトルパワーが大きい。そのため、低域においてリズムを有する楽器（例えばドラムなど）の音は、人の声に比べて、低域のスペクトルパワーは相対的に大きくなる。つまり、低域においてリズムを有する楽器の音の場合、高域のスペクトルパワーに対する低域スペクトルパワーの比率は大きくなる。換言すれば、全帯域のスペクトルパワーの合計に対する低域スペクトルパワーの割合は大きくなる。 Further, even in the case of a human voice, even a sound having a rhythm in the low frequency range includes almost no low frequency spectrum, and the low frequency spectrum power is very small. On the other hand, in the case of sounds such as drums, the spectrum power in the low band is large. Therefore, the sound of an instrument (such as a drum) having a rhythm in the low range has a relatively high spectrum power in the low range compared to a human voice. That is, in the case of the sound of an instrument having a rhythm in the low frequency range, the ratio of the low frequency spectrum power to the high frequency spectrum power is large. In other words, the ratio of the low frequency spectrum power to the sum of the spectral powers of all the bands becomes large.

したがって、上記高域スペクトルパワーに対する上記低域スペクトルパワーの比率が予め定められた閾値と比較して大きく、かつ、上記フレーム間隔が予め定められた範囲内にある場合、すなわち、低域スペクトルパワーの時間変化の周期は、一定の時間の範囲内である場合、上記音響信号を音楽と判定できる。 Therefore, when the ratio of the low frequency spectrum power to the high frequency spectrum power is large compared to a predetermined threshold and the frame interval is within a predetermined range, that is, the low frequency spectrum power When the period of time change is within a certain time range, the acoustic signal can be determined as music.
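The two-part determination above (power ratio and rhythm period) might be sketched as follows. The function name, the way the low and high powers are aggregated into single numbers, and the ratio threshold are assumptions for illustration; only the 0.2 s to 1.5 s range comes from the description.

```python
def is_rhythmic_music(low_power, high_power, interval_frames, frame_sec,
                      ratio_threshold, min_period_sec=0.2, max_period_sec=1.5):
    """Judge music from the low/high power ratio and the detected rhythm period.

    low_power, high_power: aggregated spectrum powers (aggregation is assumed).
    interval_frames: frame interval with the largest autocorrelation.
    frame_sec: duration of one frame in seconds.
    ratio_threshold: hypothetical second threshold.
    """
    if high_power > 0:
        ratio_ok = low_power / high_power >= ratio_threshold
    else:
        # No high-band energy at all: treat any low-band energy as dominant.
        ratio_ok = low_power > 0
    period_sec = interval_frames * frame_sec
    return ratio_ok and min_period_sec <= period_sec <= max_period_sec
```

With 1024-sample frames at 44.1 kHz (about 23 ms per frame), an interval of 20 frames corresponds to roughly 0.46 s, which lies inside the drum-rhythm range.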

これにより、本発明に係る音楽検出装置は、低周波数領域の音において周期、すなわち、リズムを有する音楽を検出することが可能となる。 Thereby, the music detection device according to the present invention can detect music having a period, that is, a rhythm in the sound in the low frequency region.

本発明に係る音楽検出装置では、上記フレーム間隔検出手段は、上記低域スペクトルパワーを表すＮ個のデータであるｓｐｐ（ｉ）（ｉ＝０，１，・・・，Ｎ−１）を用いて、Ｍ個の上記自己相関値を、上記の自己相関関数Ｒ２（ｘ）（ｘ＝１，２，・・・，Ｍ）の各値として算出することが好ましい。 In the music detection apparatus according to the present invention, the frame interval detection means uses spp (i) (i = 0, 1,..., N−1) that is N pieces of data representing the low frequency spectrum power. Thus, it is preferable to calculate the M autocorrelation values as the values of the autocorrelation function R2 (x) (x = 1, 2,..., M).

本発明に係る音声検出装置では、上記の課題を解決するために、音響信号から、フレームごとに基本周波数を抽出する基本周波数抽出手段と、予め定められた数の連続する複数フレームにおける上記基本周波数の変化を検出する基本周波数変化検出手段と、上記基本周波数変化検出手段によって、上記基本周波数が単調に変化しているか、または、単調変化から一定周波数へ変化しているか、または、一定周波数から単調変化へ変化していることが検出され、かつ、上記基本周波数が予め定められた周波数の範囲内において変化しており、かつ、上記基本周波数の変化の幅が予め定められた周波数の幅より小さいとき、上記音響信号を音声と判定する音声判定手段と、を備えていることを特徴としている。 In the speech detection apparatus according to the present invention, in order to solve the above-mentioned problem, the fundamental frequency extracting means for extracting the fundamental frequency for each frame from the acoustic signal, and the fundamental frequency in a predetermined number of consecutive frames. The fundamental frequency change detecting means for detecting a change in the frequency and the fundamental frequency change detecting means, the basic frequency is changing monotonously, or changing from monotone change to constant frequency, or from constant frequency to monotone It is detected that there is a change to the change, and the fundamental frequency is changed within a predetermined frequency range, and the width of the change in the fundamental frequency is smaller than the predetermined frequency width. And a sound determination means for determining the sound signal as sound.

According to the above configuration, the fundamental frequency extraction means extracts the fundamental frequency from the acoustic signal for each frame. Methods for extracting the fundamental frequency include, for example, the cepstrum method and the instantaneous frequency method, and the method is not particularly limited.

According to the above configuration, the speech determination means determines that the acoustic signal is speech when all of the following hold: the fundamental frequency is detected to be changing monotonically (that is, monotonically increasing or monotonically decreasing), or changing from a monotonic change to a constant frequency (that is, from a monotonic increase or a monotonic decrease to a constant frequency), or changing from a constant frequency to a monotonic change (that is, from a constant frequency to a monotonic increase or a monotonic decrease); the fundamental frequency changes within a predetermined frequency range; and the width of the change in the fundamental frequency is smaller than a predetermined frequency width.

上記基本周波数の変化が単調に変化している場合、人の声のフレーズ成分を表している可能性がある。また、上記基本周波数の変化が単調変化から一定周波数へ変化している場合、あるいは、上記基本周波数の変化が一定周波数から単調変化へ変化している場合、人の声のアクセント成分を表している可能性がある。 When the change of the fundamental frequency is changing monotonously, it may represent a phrase component of a human voice. In addition, when the change in the fundamental frequency changes from a monotone change to a constant frequency, or when the change in the fundamental frequency changes from a constant frequency to a monotone change, it represents an accent component of a human voice. there is a possibility.

人の声の基本周波数の帯域は、一般的に、約１００Ｈｚ〜４００Ｈｚの間である。より詳細には、男性の声の基本周波数の帯域は、約１５０Ｈｚ±５０Ｈｚであり、女性の声の基本周波数の帯域は、約２５０Ｈｚ±５０Ｈｚである。また、子供の基本周波数の帯域は、女性よりも５０Ｈｚさらに高く、約３００Ｈｚ±５０Ｈｚである。さらに、人の声のフレーズ成分、あるいは、アクセント成分の場合、基本周波数の変化の幅は、約１２０Ｈｚである。 The fundamental frequency band of the human voice is generally between about 100 Hz and 400 Hz. More specifically, the band of the fundamental frequency of the male voice is about 150 Hz ± 50 Hz, and the band of the fundamental frequency of the female voice is about 250 Hz ± 50 Hz. Moreover, the band of the fundamental frequency of the child is 50 Hz higher than that of women, and is about 300 Hz ± 50 Hz. Further, in the case of a phrase component or accent component of a human voice, the width of change in the fundamental frequency is about 120 Hz.

つまり、上記基本周波数が単調に変化しているか、または、単調変化から一定周波数へ変化しているか、または、一定周波数から単調変化へ変化している場合、基本周波数の最大値と最小値とが所定の範囲内にない場合、音声ではないと判定できる。また、上記基本周波数が単調に変化しているか、または、単調変化から一定周波数へ変化しているか、または、一定周波数から単調変化へ変化している場合、基本周波数の最大値と最小値との差が所定の値よりも大きい場合にも、音声ではないと判定できる。 In other words, if the basic frequency is changing monotonously, changing from monotonic change to constant frequency, or changing from constant frequency to monotone change, the maximum and minimum values of the basic frequency are If it is not within the predetermined range, it can be determined that it is not voice. In addition, when the basic frequency is changing monotonously, changing from monotonic change to constant frequency, or changing from constant frequency to monotone change, the maximum and minimum values of the basic frequency Even when the difference is larger than a predetermined value, it can be determined that the sound is not voice.

したがって、上記基本周波数が単調に変化しているか、または、単調変化から一定周波数へ変化しているか、または、一定周波数から単調変化へ変化しているときに、基本周波数の変化が予め定められた周波数の範囲内における変化の場合、すなわち、基本周波数の最大値と最小値とが所定の範囲内にある場合であって、かつ、基本周波数の変化の幅が予め定められた周波数の幅より小さい場合、すなわち、基本周波数の最大値と最小値との差が所定の値よりも小さい場合、音声判定手段は、フレーズ成分、あるいは、アクセント成分であると判定できる。しかも、上記の予め定められた周波数の範囲を男性の声、女性の声、子供の声に応じて設定すれば、男性の声、女性の声、子供の声を区別することもできる。 Therefore, when the fundamental frequency is changing monotonously, changing from monotonic change to constant frequency, or changing from constant frequency to monotone change, the change of the basic frequency is predetermined. In the case of a change within the frequency range, that is, when the maximum value and the minimum value of the basic frequency are within a predetermined range, and the width of the change in the basic frequency is smaller than the predetermined frequency width. In other words, if the difference between the maximum value and the minimum value of the fundamental frequency is smaller than a predetermined value, the speech determination means can determine that it is a phrase component or an accent component. Moreover, if the predetermined frequency range is set in accordance with male voice, female voice, and child voice, male voice, female voice, and child voice can be distinguished.
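A minimal sketch of this determination is shown below. The tolerance used to decide that two neighbouring values are "constant", the exclusion of a completely flat track, and all names are assumptions; the default band of 100 Hz to 400 Hz and the 120 Hz width follow the figures given in the description.

```python
def collapse_signs(f0, tol=1.0):
    """Reduce an F0 track to its run of trends: 1 rising, -1 falling, 0 flat."""
    signs = []
    for a, b in zip(f0, f0[1:]):
        d = b - a
        s = 0 if abs(d) <= tol else (1 if d > 0 else -1)
        if not signs or signs[-1] != s:
            signs.append(s)
    return signs

# Monotonic, monotonic then constant, or constant then monotonic.
ALLOWED_SHAPES = {(1,), (-1,), (1, 0), (-1, 0), (0, 1), (0, -1)}

def is_speech(f0, low_hz=100.0, high_hz=400.0, max_width_hz=120.0, tol=1.0):
    """Judge speech from the shape, band, and width of a per-frame F0 track."""
    if len(f0) < 2:
        return False
    shape_ok = tuple(collapse_signs(f0, tol)) in ALLOWED_SHAPES
    in_band = low_hz <= min(f0) and max(f0) <= high_hz
    narrow = max(f0) - min(f0) < max_width_hz
    return shape_ok and in_band and narrow
```

Passing a narrower band, for example one centred on 150 Hz or 250 Hz, would restrict the test to a particular speaker group, as the description suggests.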

これにより、本発明に係る音声検出装置は、精度よく人の声を検出することができ、しかも、男性の声、女性の声の両方を検出することが可能であると共に、女性の声か子供の声かもある程度検出することが可能となる。 As a result, the voice detection device according to the present invention can accurately detect a human voice and can detect both a male voice and a female voice, and can also detect a female voice or a child voice. Can be detected to some extent.

In the speech detection device according to the present invention, the speech determination means preferably determines that the acoustic signal is speech when the frequency changes within a range of approximately 100 Hz to approximately 400 Hz and the width of the change is smaller than approximately 120 Hz.

本発明に係る音場制御装置は、上記音楽検出装置によって予め定められた期間内に上記音響信号が音楽と判定された回数と、上記音声検出装置によって上記予め定められた期間内に上記音響信号が音声と判定された回数とに応じて、音場制御の状態を切り替えることを特徴としている。 The sound field control device according to the present invention includes the number of times that the acoustic signal is determined to be music within a predetermined period by the music detecting device, and the acoustic signal within the predetermined period by the voice detecting device. Is characterized in that the state of sound field control is switched according to the number of times that is determined to be sound.

According to the above configuration, the sound field control device according to the present invention can prevent the sound field control from being changed because of an erroneous determination by the music detection device or by the speech detection device. Here, the sound field control states include, for example, sound field control for music scenes, sound field control for speech scenes, and sound field control for scenes containing both music and speech. As a result, the sound field control state is switched an appropriate number of times, so that it is switched only at the subjective time boundaries that the listener perceives as separating one scene from the next.

本発明に係る音場制御装置では、上記音場制御を切り替える条件を、制御されている状態に応じて変更することを特徴としている。 The sound field control device according to the present invention is characterized in that the condition for switching the sound field control is changed according to the controlled state.

上記の構成によれば、本発明に係る音場制御装置は、現在の音場制御の状態に優位性を持たせるような判定条件を設定することができ、頻繁にシーンが変化するようなコンテンツにおいても、過度なシーン切り替えを防止することができる。 According to the above configuration, the sound field control device according to the present invention can set a determination condition that gives superiority to the current state of sound field control, and content whose scene changes frequently. Even in this case, excessive scene switching can be prevented.
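One way to read the count-based switching with a state-dependent condition is as a simple hysteresis rule, sketched below. The state names, the two count thresholds, and the rule for combining the two detectors are all hypothetical; the description states only that determination counts within a period are used and that the switching condition depends on the current state.

```python
def next_state(current, music_count, speech_count, stay=3, switch=6):
    """Pick the sound field state for the next period.

    current: "music", "speech", or "both".
    music_count, speech_count: detections within the observation period.
    stay: count needed to keep a kind that is already active (assumed).
    switch: higher count needed to activate a kind that is not active (assumed).
    """
    def detected(count, active):
        return count >= (stay if active else switch)

    music = detected(music_count, current in ("music", "both"))
    speech = detected(speech_count, current in ("speech", "both"))
    if music and speech:
        return "both"
    if music:
        return "music"
    if speech:
        return "speech"
    # Nothing detected reliably: keep the current state rather than switch.
    return current
```

Because the active kind needs fewer detections than an inactive one, the same counts leave a music state in music and a speech state in speech, which damps rapid back-and-forth switching.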

本発明に係る音楽検出装置は、音響信号から、該音響信号の所定時間を表すフレームごとに、周波数スペクトルを算出するスペクトル算出手段と、上記周波数スペクトルの自己相関値を算出する自己相関値算出手段と、連続する複数フレームにおける上記自己相関値の最大値のばらつきの大きさを数値化する数値化手段と、上記ばらつきの大きさが予め定められた閾値よりも小さい場合、上記音響信号を音楽と判定する音楽判定手段と、を備えている。 The music detection apparatus according to the present invention includes a spectrum calculation unit that calculates a frequency spectrum for each frame representing a predetermined time of the acoustic signal, and an autocorrelation value calculation unit that calculates an autocorrelation value of the frequency spectrum. And numerical means for digitizing the magnitude of the variation of the maximum value of the autocorrelation value in a plurality of consecutive frames, and if the variation is smaller than a predetermined threshold, the acoustic signal is music Music determination means for determining.

それゆえ、本発明に係る音楽検出装置は、倍音成分を含む楽器の音、すなわち、バイオリンなどの弦楽器やトランペットなどの管楽器の楽音を検出することが可能となる。 Therefore, the music detection device according to the present invention can detect the sound of a musical instrument including a harmonic component, that is, the musical tone of a stringed instrument such as a violin or a wind instrument such as a trumpet.

本発明に係る音楽検出装置では、音響信号から、該音響信号の所定時間を表すフレームごとに、音階に対応する各周波数のスペクトルパワーを算出するスペクトルパワー算出手段と、上記音階の各周波数に該各周波数を識別する音階識別番号が割り当てられており、上記フレームごとに、上記音階識別番号のうち上記スペクトルパワーが最大となる最大音階識別番号を検出する最大音階識別番号検出手段と、連続する複数フレームにおける上記最大音階識別番号のばらつきの大きさを数値化する数値化手段と、上記ばらつきの大きさが予め定められた閾値よりも大きい場合、上記音響信号を音楽と判定する音楽判定手段と、を備えている。 In the music detection device according to the present invention, spectrum power calculating means for calculating the spectrum power of each frequency corresponding to the scale for each frame representing a predetermined time of the sound signal from the sound signal, and for each frequency of the scale, A scale identification number for identifying each frequency is assigned, and for each frame, a maximum scale identification number detecting means for detecting a maximum scale identification number having the maximum spectrum power among the scale identification numbers, and a plurality of continuous scale identification numbers. A digitizing means for digitizing the magnitude of the variation of the maximum scale identification number in the frame; a music judging means for judging that the acoustic signal is music when the magnitude of the variation is larger than a predetermined threshold; It has.

それゆえ、本発明に係る音楽検出装置は、音符の有無、すなわち、音の高低が変化する音楽を検出することが可能となる。 Therefore, the music detection device according to the present invention can detect music in which the presence or absence of a note, that is, the pitch of a sound changes.

The music detection device according to the present invention comprises: low-frequency spectrum power calculation means for calculating, for each frame of an acoustic signal, a low-frequency spectrum power by adding the spectrum powers of the frequencies at or below (or strictly below) a predetermined first threshold; frame interval detection means for detecting the frame interval at which the autocorrelation value of the low-frequency spectrum power over a predetermined number of consecutive frames is largest; high-frequency spectrum power calculation means for calculating, for each frame of the acoustic signal, a high-frequency spectrum power by adding the spectrum powers of the frequencies at or above (or strictly above) the first threshold; and music determination means for determining that the acoustic signal is music when the ratio of the low-frequency spectrum power to the high-frequency spectrum power is equal to or greater than a predetermined second threshold and the frame interval is within a predetermined range.

それゆえ、本発明に係る音楽検出装置は、低周波数領域の音において周期、すなわち、リズムを有する音楽を検出することが可能となる。 Therefore, the music detection apparatus according to the present invention can detect music having a period, that is, a rhythm in a sound in a low frequency region.

本発明に係る音声検出装置では、音響信号から、フレームごとに基本周波数を抽出する基本周波数抽出手段と、予め定められた数の連続する複数フレームにおける上記基本周波数の変化を検出する基本周波数変化検出手段と、上記基本周波数変化検出手段によって、上記基本周波数が単調に変化しているか、または、単調変化から一定周波数へ変化しているか、または、一定周波数から単調変化へ変化していることが検出され、かつ、上記基本周波数が予め定められた周波数の範囲内において変化しており、かつ、上記基本周波数の変化の幅が予め定められた周波数の幅より小さいとき、上記音響信号を音声と判定する音声判定手段と、を備えていることを特徴としている。 In the speech detection device according to the present invention, a fundamental frequency extracting means for extracting a fundamental frequency for each frame from an acoustic signal, and a fundamental frequency change detection for detecting a change in the fundamental frequency in a predetermined number of consecutive frames. And the fundamental frequency change detecting means detect that the fundamental frequency is changing monotonously, changing from monotonic change to constant frequency, or changing from constant frequency to monotone change. And the acoustic signal is determined to be speech when the fundamental frequency changes within a predetermined frequency range and the width of the fundamental frequency change is smaller than the predetermined frequency width. And a voice determination means for performing the above.

それゆえ、本発明に係る音声検出装置は、精度よく人の声を検出することができ、しかも、男性の声、女性の声の両方と子供の声も検出することが可能となる。 Therefore, the voice detection device according to the present invention can accurately detect a human voice, and can also detect both a male voice, a female voice, and a child voice.

(Music detection device 1)
An embodiment of the music detection device 1 according to the present invention is described below with reference to FIGS. 1 to 4.

図１は、本発明に係る音楽検出装置１の構成を示すブロック図である。本発明に係る音楽検出装置１は、フレーム分割部５と窓掛け部６とスペクトル変換部７と音楽検出部１０とを含んで構成される。 FIG. 1 is a block diagram showing a configuration of a music detection apparatus 1 according to the present invention. The music detection apparatus 1 according to the present invention includes a frame dividing unit 5, a windowing unit 6, a spectrum conversion unit 7, and a music detection unit 10.

音楽検出部１０は、音階スペクトル算出部（スペクトル算出手段）１１と自己相関係数算出部（自己相関値算出手段）１２と係数最大値検出部１３と係数最大値保存部１４と係数最大値比較部１５と分散算出部（数値化手段）１６と音楽／非音楽判定部（音楽判定手段）１７とを備えている。 The music detection unit 10 includes a scale spectrum calculation unit (spectrum calculation unit) 11, an autocorrelation coefficient calculation unit (autocorrelation value calculation unit) 12, a coefficient maximum value detection unit 13, a coefficient maximum value storage unit 14, and a coefficient maximum value comparison. A unit 15, a variance calculation unit (numericalization unit) 16, and a music / non-music determination unit (music determination unit) 17 are provided.

The music detection device 1 is mounted in a television receiver or the like and detects music scenes in the program being broadcast, based on the acoustic signal contained in the broadcast signal. Here, a music scene is a scene that contains music. It covers scenes consisting only of music, as in a music program, as well as scenes in which music plays in the background of speech (such as people talking). The music detection device 1 can also detect music scenes in a recorded program while it is being played back by a recording and playback device, based on the acoustic signal; its use is not particularly limited. In the present embodiment, an acoustic signal digitally encoded by PCM (Pulse Code Modulation) is input to the music detection device 1.

以下に、図１に示す音楽検出装置１における音楽検出の処理について説明する。 Hereinafter, a music detection process in the music detection apparatus 1 shown in FIG. 1 will be described.

フレーム分割部５は、入力された音響信号をフレーム分割し、窓かけ部６に出力する。本実施の形態では、フレーム分割部５は、１フレームあたり１０２４サンプルに分割する。音響信号のサンプリング周波数が４４．１ｋＨｚの場合、１フレームあたりの時間は、２３ｍｓ（＝（１÷４４１００）×１０２４）となる。 The frame dividing unit 5 divides the input acoustic signal into frames and outputs the result to the windowing unit 6. In the present embodiment, the frame dividing unit 5 divides the frame into 1024 samples per frame. When the sampling frequency of the acoustic signal is 44.1 kHz, the time per frame is 23 ms (= (1 ÷ 44100) × 1024).

窓掛け部６は、フレーム分割された音響信号に対しハニング窓などの窓関数を掛けて、スペクトル変換部７に出力する。窓掛け部６において窓関数を適用することにより、フレーム分割された音響信号についての周波数解析の誤差を低減できる。 The windowing unit 6 multiplies the acoustic signal divided into frames by a window function such as a Hanning window and outputs the result to the spectrum conversion unit 7. By applying a window function in the windowing unit 6, it is possible to reduce an error in frequency analysis for an acoustic signal divided into frames.

スペクトル変換部７は、窓掛け部６から出力された音響信号に対してＦＦＴ（Fast Fourier Transform）を行い、時間領域の音響信号を周波数領域のデータ、すなわち、スペクトルに変換して、音階スペクトル算出部１１に出力する。スペクトル変換部７では、フレーム単位にＦＦＴが行われることになる。本実施の形態においては、上述したとおり、１フレームには１０２４サンプルが含まれており、スペクトル変換部７は、１０２４ポイントのＦＦＴを行う。 The spectrum conversion unit 7 performs FFT (Fast Fourier Transform) on the acoustic signal output from the windowing unit 6, converts the time domain acoustic signal into frequency domain data, that is, a spectrum, and calculates a scale spectrum. To the unit 11. In the spectrum conversion unit 7, FFT is performed in units of frames. In the present embodiment, as described above, 1024 samples are included in one frame, and the spectrum conversion unit 7 performs 1024-point FFT.
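The front end described above (frame division, windowing, spectrum conversion) can be sketched with the standard library alone. The function names are hypothetical, and a direct DFT stands in for the FFT so that the example needs no external packages; it yields the same magnitudes but is far slower for 1024-point frames.

```python
import cmath
import math

def split_frames(samples, frame_len=1024):
    """Cut the signal into consecutive non-overlapping frames, dropping the tail."""
    count = len(samples) // frame_len
    return [samples[k * frame_len:(k + 1) * frame_len] for k in range(count)]

def hann(frame):
    """Apply a symmetric Hann (Hanning) window; needs at least 2 samples."""
    n = len(frame)
    return [s * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
            for i, s in enumerate(frame)]

def dft_magnitudes(frame):
    """Return |X[k]| for k = 0 .. n/2 using a direct DFT (stand-in for an FFT)."""
    n = len(frame)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(frame)))
            for k in range(n // 2 + 1)]
```

At 44.1 kHz a 1024-sample frame lasts 1024 / 44100 s, about 23 ms, and adjacent bins are 44100 / 1024 Hz apart, about 43 Hz.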

Based on the spectrum output from the spectrum conversion unit 7, the scale spectrum calculation unit 11 calculates the spectrum corresponding to each frequency of the 12-tone equal-tempered scale (hereinafter called the scale spectrum).

Here, an equal-tempered scale is a scale obtained by dividing one octave according to a geometric progression, and the 12-tone equal-tempered scale in particular divides one octave into 12 steps in geometric progression. An octave is the interval between a sound and the sound at twice its pitch; that is, a sound one octave away from a given sound has twice its frequency. In the 12-tone equal-tempered scale, the octave over which the frequency doubles is divided into 12 steps in geometric progression, so the frequency ratio between adjacent notes is the twelfth root of 2. In other words, each frequency $f_n$ of the notes forming the 12-tone equal-tempered scale is given by $f_n = f_0 \times 2^{n/12}$, where $f_0$ is the frequency of the base note.
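As a small worked example of the relation f_n = f_0 × 2^(n/12), the helper below computes the frequency n semitone steps from a base note. The function name and the base frequency used in the comments are chosen for illustration only.

```python
def scale_frequency(f0, n):
    """Return f_n = f0 * 2 ** (n / 12), the note n equal-tempered steps from f0."""
    return f0 * 2.0 ** (n / 12.0)

# Twelve steps up doubles the frequency (one octave); twelve down halves it.
octave_up = scale_frequency(440.0, 12)
octave_down = scale_frequency(440.0, -12)
# Adjacent notes differ by the twelfth root of 2, roughly a 5.9 % step.
semitone_ratio = scale_frequency(440.0, 1) / 440.0
```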

In the present embodiment, the scale spectrum calculation unit 11 calculates, as the scale spectrum, the spectrum at each frequency of the 12-tone equal-tempered scale. FIG. 2 shows the relationship between the 12-tone equal-tempered scale and frequency. The example in FIG. 2 is a table of the frequencies of the 12-tone equal-tempered scale when the A (la) of octave 4 is taken as the reference at 440 Hz. The frequencies of the 12-tone equal-tempered scale in FIG. 2 are assigned scale numbers 0 to 126 in ascending order of frequency, and each frequency can be identified by its scale number. "C, C#, D, D#, E, F, F#, G, G#, A, A#, B" are the note names that distinguish the 12 notes within one octave, and FIG. 2 shows the note name corresponding to each frequency.

The processing of the scale spectrum calculation unit 11 is now described more specifically. The scale spectrum calculation unit 11 calculates the absolute value of the spectrum corresponding to each frequency shown in FIG. 2. It does so by linear interpolation, using the absolute values of the spectrum that the spectrum conversion unit 7 outputs at fixed frequency intervals. For example, according to FIG. 2 the frequency corresponding to scale number 9 is 13.75 Hz. If the spectrum from the spectrum conversion unit 7 contains no component at 13.75 Hz, the scale spectrum calculation unit 11 takes the absolute values of the two components whose frequencies are nearest to 13.75 Hz and linearly interpolates between them to obtain the absolute value of the spectrum at 13.75 Hz. In this way, the scale spectrum calculation unit 11 calculates the scale spectrum for all frequencies of scale numbers 0 to 136, and outputs the calculated scale spectrum to the autocorrelation coefficient calculation unit 12.
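The interpolation step can be sketched as below. This is an illustrative sketch only: the function name and the list-based representation of the spectrum are hypothetical, and clamping at the edges of the frequency range is an assumption, since the text does not say how frequencies outside the available bins are handled.

```python
from bisect import bisect_left


def interpolate_magnitude(bin_freqs, bin_mags, target_hz):
    """Linearly interpolate the spectrum magnitude at target_hz.

    bin_freqs: ascending bin frequencies in Hz (fixed spacing).
    bin_mags:  absolute value of the spectrum at each bin.
    """
    # Assumption: clamp to the first/last bin outside the covered range.
    if target_hz <= bin_freqs[0]:
        return bin_mags[0]
    if target_hz >= bin_freqs[-1]:
        return bin_mags[-1]
    hi = bisect_left(bin_freqs, target_hz)  # first bin at or above target
    lo = hi - 1                             # nearest bin below target
    weight = (target_hz - bin_freqs[lo]) / (bin_freqs[hi] - bin_freqs[lo])
    return bin_mags[lo] + weight * (bin_mags[hi] - bin_mags[lo])
```

For instance, with bins at 10 Hz and 20 Hz, the value at 13.75 Hz is taken 37.5% of the way from the 10 Hz magnitude to the 20 Hz magnitude.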

自己相関係数算出部１２は、音階スペクトル算出部１１から出力された音階スペクトルの自己相関係数Ｒ１（ｘ）を算出する。つまり、自己相関係数算出部１２はフレームごとに音階スペクトルの自己相関係数Ｒ１（ｘ）を算出する。 The autocorrelation coefficient calculation unit 12 calculates the autocorrelation coefficient R1 (x) of the scale spectrum output from the scale spectrum calculation unit 11. That is, the autocorrelation coefficient calculation unit 12 calculates the autocorrelation coefficient R1 (x) of the scale spectrum for each frame.

上述したとおり、本実施の形態では、音階スペクトル算出部１１は、図２に示す各周波数に対応するスペクトル、すなわち、音階番号０〜１３６に対応する音階スペクトルを算出する。そして、上記の自己相関係数Ｒ１（ｘ）を算出する式において、ｓｐ（ｉ）（ｉは音階番号に対応）は、音階スペクトルを表している。ここで、本実施の形態においては、数３において、Ｌ＝６８とし、ｉを０〜６７まで変化させる。また、ｘは自己相関係数を算出する音階の間隔を表しており、ｘを１〜６８まで変化させて、各ｘに対する自己相関係数Ｒ１（ｘ）を算出する。そして、自己相関係数算出部１２は、自己相関係数Ｒ１（ｘ）を、係数最大値検出部１３に出力する。 As described above, in the present embodiment, the scale spectrum calculation unit 11 calculates the spectrum corresponding to each frequency shown in FIG. 2, that is, the scale spectrum corresponding to the scale numbers 0 to 136. In the equation for calculating the autocorrelation coefficient R1 (x), sp (i) (i corresponds to a scale number) represents a scale spectrum. Here, in this embodiment, in Equation 3, L = 68 and i is changed from 0 to 67. X represents the interval of the scale for calculating the autocorrelation coefficient, and x is varied from 1 to 68 to calculate the autocorrelation coefficient R1 (x) for each x. Then, autocorrelation coefficient calculation unit 12 outputs autocorrelation coefficient R1 (x) to coefficient maximum value detection unit 13.

係数最大値検出部１３は、自己相関係数算出部１２から出力される自己相関係数Ｒ１（１）〜Ｒ１（６８）の中から最大値を検出する。すなわち、係数最大値検出部１３は、各フレームにおける音階スペクトルの自己相関係数の最大値（以下では、最大自己相関係数と呼ぶ）を検出する。そして、係数最大値検出部１３は、最大自己相関係数を、係数最大値保存部１４と係数最大値比較部１５とに出力する。 The coefficient maximum value detection unit 13 detects the maximum value from the autocorrelation coefficients R1 (1) to R1 (68) output from the autocorrelation coefficient calculation unit 12. That is, the coefficient maximum value detection unit 13 detects the maximum value of the autocorrelation coefficient of the scale spectrum in each frame (hereinafter referred to as the maximum autocorrelation coefficient). Then, the coefficient maximum value detection unit 13 outputs the maximum autocorrelation coefficient to the coefficient maximum value storage unit 14 and the coefficient maximum value comparison unit 15.
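The calculation and maximum detection described above can be sketched as follows. Equation 3 is not reproduced in this text, so the plain (unnormalized) sum of products used here is an assumption; the function names are hypothetical. The index ranges (L = 68, i from 0 to 67, x from 1 to 68) follow the description.

```python
def scale_autocorrelation(sp, lag, length=68):
    """R1(x): sum of sp(i) * sp(i + x) for i = 0 .. length - 1.

    Assumption: unnormalized sum of products; any normalization used in
    Equation 3 is not known from the text.
    """
    return sum(sp[i] * sp[i + lag] for i in range(length))


def max_autocorrelation(sp, max_lag=68, length=68):
    """Largest of R1(1) .. R1(max_lag) for one frame's scale spectrum."""
    return max(scale_autocorrelation(sp, x, length)
               for x in range(1, max_lag + 1))
```

With 137 scale numbers (0 to 136), the largest index touched is 67 + 68 = 135, so every term stays inside the scale spectrum.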

係数最大値保存部１４は、係数最大値検出部１３から出力される各フレームにおける最大自己相関係数を記憶する。つまり、係数最大値保存部１４は、全てのフレームについて音階スペクトルの自己相関係数の最大値を履歴データとして記憶している。 The coefficient maximum value storage unit 14 stores the maximum autocorrelation coefficient in each frame output from the coefficient maximum value detection unit 13. That is, the coefficient maximum value storage unit 14 stores the maximum value of the autocorrelation coefficient of the scale spectrum for all frames as history data.

The coefficient maximum value comparison unit 15 determines whether the maximum autocorrelation coefficient output from the coefficient maximum value detection unit 13, that is, the maximum autocorrelation coefficient of the current frame, is a minute signal. More specifically, the coefficient maximum value comparison unit 15 compares the maximum autocorrelation coefficient of the current frame with a preset threshold. If the maximum autocorrelation coefficient of the current frame is larger than the threshold, the coefficient maximum value comparison unit 15 determines that it is not a minute signal and outputs the maximum autocorrelation coefficient of the current frame to the variance calculation unit 16.

一方、現在フレームの最大自己相関係数が微小信号であると判定された場合、分散算出部１６に、現在フレームの最大自己相関係数を出力しない。この場合、現在フレームについて音楽シーンであるか否かの判定は行われない。 On the other hand, when it is determined that the maximum autocorrelation coefficient of the current frame is a minute signal, the maximum autocorrelation coefficient of the current frame is not output to the variance calculation unit 16. In this case, it is not determined whether the current frame is a music scene.

The coefficient maximum value comparison unit 15 also retrieves the maximum autocorrelation coefficient of a past frame from the coefficient maximum value storage unit 14 and, as for the current frame, determines whether the retrieved coefficient is a minute signal. If it is not a minute signal, the unit outputs the maximum autocorrelation coefficient of that past frame to the variance calculation unit 16. If it is a minute signal, the maximum autocorrelation coefficient of that past frame is not output to the variance calculation unit 16.

In the present embodiment, the coefficient maximum value comparison unit 15 retrieves the maximum autocorrelation coefficients of past frames from the coefficient maximum value storage unit 14 one at a time, starting with the frame closest in time to the current frame. For each, it determines whether the value is a minute signal and, depending on the result, outputs the maximum autocorrelation coefficient of that past frame to the variance calculation unit 16. This process is repeated until the maximum autocorrelation coefficients of four past frames have been output to the variance calculation unit 16. In the end, the coefficient maximum value comparison unit 15 has output to the variance calculation unit 16 the maximum autocorrelation coefficients of five frames in total: the current frame and four past frames.

分散算出部１６は、係数最大値比較部１５から出力された５つのフレームの最大自己相関係数について、数４に示す式を用いて分散を算出し、音楽／非音楽判定部１７に出力する。 The variance calculation unit 16 calculates the variance for the maximum autocorrelation coefficients of the five frames output from the coefficient maximum value comparison unit 15 using the formula shown in Equation 4, and outputs the variance to the music / non-music determination unit 17. .

Here, Rx_i (i = 1 to 5) are the maximum autocorrelation coefficients of the five frames, and <Rx> is the mean of the maximum autocorrelation coefficients of the five frames. Also, n = 5.

The music/non-music determination unit 17 determines that the scene is a music scene (a scene whose acoustic signal contains music) when the variance output from the variance calculation unit 16 is smaller than a preset threshold. That is, it detects music.
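The frame selection, variance, and decision steps of music detection device 1 can be sketched as follows. Equation 4 is not reproduced in this text, so the population variance (division by n = 5) is an assumption, as are all function and parameter names. Returning `None` stands for "no determination is made for this frame".

```python
from typing import List, Optional


def select_coefficients(current: float, history: List[float],
                        minute_threshold: float,
                        n_past: int = 4) -> Optional[List[float]]:
    """Current value plus the n_past most recent non-minute past values.

    history is ordered oldest first. Returns None when the current frame
    is a minute signal or too few usable past frames exist.
    """
    if current <= minute_threshold:
        return None
    selected = [current]
    for value in reversed(history):       # nearest past frame first
        if value > minute_threshold:      # skip minute signals
            selected.append(value)
            if len(selected) == n_past + 1:
                return selected
    return None


def variance(values: List[float]) -> float:
    """Assumed form of Equation 4: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def is_music(current: float, history: List[float],
             minute_threshold: float,
             variance_threshold: float) -> Optional[bool]:
    """True when the variance over five frames is below the threshold."""
    selected = select_coefficients(current, history, minute_threshold)
    if selected is None:
        return None                       # no determination for this frame
    return variance(selected) < variance_threshold
```

Both thresholds are tuning parameters; the text states only that they are set in advance.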

なお、係数最大値比較部１５において、最大自己相関係数が微小信号であるか否かを判定し、微小信号ではない最大自己相関係数のみを分散算出部１６に出力する構成とすることにより、分散算出部１６において算出される分散は、最大自己相関係数のばらつきを表す指標としての信頼性が高くなる。しかしながら、必ずしも、係数最大値比較部１５における微小信号の判定によって、微小信号ではない最大自己相関係数のみを分散算出部１６に出力する構成とする必要はなく、特に限定はされない。 The coefficient maximum value comparison unit 15 determines whether or not the maximum autocorrelation coefficient is a minute signal and outputs only the maximum autocorrelation coefficient that is not a minute signal to the variance calculation unit 16. The variance calculated by the variance calculation unit 16 is highly reliable as an index representing the variation of the maximum autocorrelation coefficient. However, it is not always necessary to use a configuration in which only the maximum autocorrelation coefficient that is not a minute signal is output to the variance calculation unit 16 by the minute signal determination in the coefficient maximum value comparison unit 15, and there is no particular limitation.

FIG. 3 shows frequency spectra of a trumpet: (a) shows the frequency spectrum at a certain time, and (b) shows the frequency spectrum 23 ms after the time of (a). FIG. 3(a) is an example of the frequency spectrum when a tone of 880 Hz is played on a trumpet; the spectrum peaks near integer multiples of the played frequency (that is, at the overtones). As FIG. 3(b) shows, the overtones persist 23 ms after the time of FIG. 3(a).

FIG. 4 shows frequency spectra of a glockenspiel: (a) shows the frequency spectrum at a certain time, and (b) shows the frequency spectrum 23 ms after the time of (a). FIG. 4(a) is an example of the frequency spectrum when a tone is struck on a glockenspiel; the spectrum peaks near integer multiples of the struck frequency (that is, at the overtones). As FIG. 4(b) shows, the overtones persist 23 ms after the time of FIG. 4(a).

As FIGS. 3 and 4 show, the sounds of the trumpet and the glockenspiel each contain their own characteristic overtone components, and a sound containing overtone components persists for a certain time. Besides the trumpet and glockenspiel of FIGS. 3 and 4, the musical tones of stringed instruments such as the violin also contain overtone components.

In the music detection device 1 according to the present invention, the autocorrelation coefficient calculation unit 12 calculates the autocorrelation coefficients (autocorrelation values) of the frequency spectrum, the coefficient maximum value detection unit 13 finds the maximum autocorrelation coefficient in each frame, and the variance calculation unit 16 calculates the variance (magnitude of variation) of that maximum across a plurality of frames. If the variance is smaller than a predetermined threshold, the signal is determined to be music.

楽音の周波数スペクトルは、上述したとおり倍音成分においてピークを示すため、元の周波数スペクトルのデータ列を倍音成分が現れる周波数間隔だけずらしたときに自己相関係数が最大値となる。 Since the frequency spectrum of the musical tone shows a peak in the harmonic component as described above, the autocorrelation coefficient becomes maximum when the data sequence of the original frequency spectrum is shifted by the frequency interval at which the harmonic component appears.

また、楽器の音であれば、一定時間、同じ周波数スペクトルが継続する。そのため、楽音の場合、周波数スペクトルの自己相関係数の最大値は、数フレームにわたって、ほぼ一定値を示すことになる。すなわち、連続するフレーム間での周波数スペクトルの自己相関係数の最大値のばらつきは小さい。 In the case of a musical instrument sound, the same frequency spectrum continues for a certain period of time. Therefore, in the case of a musical sound, the maximum value of the autocorrelation coefficient of the frequency spectrum shows a substantially constant value over several frames. That is, the variation in the maximum value of the autocorrelation coefficient of the frequency spectrum between consecutive frames is small.

In the music detection device 1 according to the present invention, the scale spectrum calculation unit 11 calculates the scale spectrum corresponding to each frequency of the 12-tone equal temperament scale, and the autocorrelation coefficient calculation unit 12 calculates the autocorrelation coefficients of the scale spectrum. For a sound that contains overtone components, peaks indicating those overtone components also appear at fixed intervals in the scale spectrum.

Therefore, for a musical tone, the maximum autocorrelation coefficient of the scale spectrum stays nearly constant over several frames, and its variation is small. Accordingly, when the variance of the maximum autocorrelation coefficient of the scale spectrum across a plurality of frames is smaller than a predetermined threshold, the signal can be determined to be music.

In the present embodiment, whether the signal is music is determined from the scale spectrum calculated by the scale spectrum calculation unit 11, that is, from the variance across a plurality of frames of the maximum autocorrelation coefficient of the scale spectrum. The determination may instead be based on the frequency spectrum output from the spectrum conversion unit 7. That is, the variance across a plurality of frames of the maximum autocorrelation coefficient of the frequency spectrum output from the spectrum conversion unit 7 may be calculated, and the signal determined to be music when that variance is smaller than a predetermined threshold. The configuration is not particularly limited.

Also, in the present embodiment, whether a scene is a music scene is determined from the variance of the maximum autocorrelation coefficient of the scale spectrum over five frames, but the number of frames used to calculate the variance may be five or more and is not particularly limited.

(Music detection device 2)
An embodiment of the music detection device 2 according to the present invention is described below with reference to FIGS. 5 and 6.

図５は、本発明に係る音楽検出装置２の構成を示すブロック図である。本発明に係る音楽検出装置２は、フレーム分割部５と窓掛け部６とスペクトル変換部７と音楽検出部２０とを含んで構成される。 FIG. 5 is a block diagram showing the configuration of the music detection device 2 according to the present invention. The music detection device 2 according to the present invention includes a frame dividing unit 5, a windowing unit 6, a spectrum conversion unit 7, and a music detection unit 20.

The music detection unit 20 includes a scale spectrum calculation unit 21, a spectrum power calculation unit (spectrum power calculation means) 22, a power maximum value detection unit (maximum scale identification number detection means) 23, a power maximum value storage unit 24, a power maximum value comparison unit 25, a variance calculation unit (quantification means) 26, and a music/non-music determination unit (music determination means) 27.

音楽検出装置２は、音楽検出装置１と同様に、テレビ受信装置などに実装され、放送信号に含まれる音響信号をもとに、放送中の番組に含まれる音楽シーンを検出する。本実施の形態では、音楽検出装置２には、音楽検出装置１と同様に、ＰＣＭ（Pulse Code Modulation）によってデジタル符号化された音響信号が入力される。 Similar to the music detection device 1, the music detection device 2 is mounted on a television receiver or the like, and detects a music scene included in a broadcast program based on an acoustic signal included in the broadcast signal. In the present embodiment, similarly to the music detection device 1, an audio signal digitally encoded by PCM (Pulse Code Modulation) is input to the music detection device 2.

以下に、図５に示す音楽検出装置２における音楽検出の処理について説明する。 Hereinafter, a music detection process in the music detection apparatus 2 shown in FIG. 5 will be described.

音楽検出装置２におけるフレーム分割部５、窓掛け部６、および、スペクトル変換部７の処理内容は、音楽検出装置１と同様であり、説明は省略する。 The processing contents of the frame division unit 5, the windowing unit 6, and the spectrum conversion unit 7 in the music detection device 2 are the same as those of the music detection device 1 and will not be described.

Based on the spectrum of each frame received from the spectrum conversion unit 7 (hereinafter, the input spectrum), the scale spectrum calculation unit 21 generates data representing the spectrum corresponding to each frequency of the 12-tone equal temperament scale shown in FIG. 2 (the scale spectrum). Because the scale spectrum calculation unit 21 performs the same processing as the scale spectrum calculation unit 11 of the music detection device 1, a detailed description is omitted.

スペクトルパワー算出部２２は、音階スペクトルから音階ごとのスペクトルパワー（すなわち、スペクトルの２乗の値；以下では、音階スペクトルパワーと呼ぶ）を算出し、パワー最大値検出部２３に出力する。 The spectrum power calculation unit 22 calculates the spectrum power for each scale from the scale spectrum (that is, the value of the square of the spectrum; hereinafter referred to as the scale spectrum power) and outputs it to the power maximum value detection unit 23.

The power maximum value detection unit 23 detects the maximum value of the scale spectrum power. It then outputs the maximum scale spectrum power (hereinafter, the power maximum value) and the scale number corresponding to the power maximum value (hereinafter, the maximum value scale number) to the power maximum value storage unit 24 and the power maximum value comparison unit 25. The scale number is the scale number shown in FIG. 2, and it corresponds to the scale identification number in the claims.

パワー最大値保存部２４は、パワー最大値検出部２３から出力される各フレームのパワー最大値と最大値音階番号とを記憶する。つまり、パワー最大値保存部２４は、全てのフレームについてパワー最大値と最大値音階番号とを履歴データとして記憶している。 The power maximum value storage unit 24 stores the power maximum value and maximum value scale number of each frame output from the power maximum value detection unit 23. That is, the power maximum value storage unit 24 stores the power maximum value and the maximum value scale number as history data for all frames.

パワー最大値比較部２５は、パワー最大値検出部２３から出力されたパワー最大値、すなわち、現在フレームのパワー最大値について、微小信号であるか否か判定する。より具体的には、パワー最大値比較部２５は、現在フレームのパワー最大値について、あらかじめ設定された閾値と比較する。そして、現在フレームのパワー最大値が閾値よりも大きい場合には、パワー最大値比較部２５は、微小信号ではないと判定し、分散算出部２６に、現在フレームの最大値音階番号を出力する。 The power maximum value comparison unit 25 determines whether or not the power maximum value output from the power maximum value detection unit 23, that is, the power maximum value of the current frame is a minute signal. More specifically, the power maximum value comparison unit 25 compares the power maximum value of the current frame with a preset threshold value. When the power maximum value of the current frame is larger than the threshold value, the power maximum value comparison unit 25 determines that the signal is not a minute signal, and outputs the maximum value scale number of the current frame to the variance calculation unit 26.

On the other hand, if the power maximum value of the current frame is determined to be a minute signal, the maximum value scale number of the current frame is not output to the variance calculation unit 26. In this case, no determination is made as to whether the current frame is a music scene.

The power maximum value comparison unit 25 also retrieves the power maximum value and the maximum value scale number of a past frame from the power maximum value storage unit 24 and, as for the current frame, determines whether the retrieved power maximum value is a minute signal. If it is not a minute signal, the unit outputs the maximum value scale number of that past frame to the variance calculation unit 26. If it is a minute signal, the maximum value scale number of that past frame is not output to the variance calculation unit 26.

In the present embodiment, the power maximum value comparison unit 25 retrieves the power maximum values of past frames from the power maximum value storage unit 24 one at a time, starting with the frame closest in time to the current frame. For each, it determines whether the value is a minute signal and, depending on the result, outputs the maximum value scale number of that past frame to the variance calculation unit 26. This process is repeated until the maximum value scale numbers of four past frames have been output to the variance calculation unit 26. In the end, the power maximum value comparison unit 25 has output to the variance calculation unit 26 the maximum value scale numbers of five frames in total: the current frame and four past frames.

The variance calculation unit 26 calculates the variance of the maximum value scale numbers of the five frames output from the power maximum value comparison unit 25, using the formula shown in Equation 5, and outputs it to the music/non-music determination unit 27.

Here, x_i (i = 1 to 5) are the maximum value scale numbers of the five frames, and <x> is the mean of the maximum value scale numbers of the five frames. Also, n = 5.

音楽／非音楽判定部２７は、分散算出部２６から出力された分散が予め設定された閾値よりも大きい場合、音楽シーンと判定する。 The music / non-music determination unit 27 determines a music scene when the variance output from the variance calculation unit 26 is greater than a preset threshold.
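The processing of music detection device 2, from scale spectrum power to the final decision, can be sketched as follows. Equation 5 is not reproduced in this text, so the population variance (division by n = 5) is an assumption; the function names and the list-of-tuples history are hypothetical. Note that the decision direction is the opposite of music detection device 1: a variance above the threshold indicates music.

```python
def max_power_scale(scale_spectrum):
    """Return (power maximum value, maximum value scale number)."""
    powers = [m * m for m in scale_spectrum]          # square of spectrum
    number = max(range(len(powers)), key=powers.__getitem__)
    return powers[number], number


def scale_number_variance(frames, power_threshold, n_frames=5):
    """Variance of the maximum value scale numbers over n_frames frames.

    frames: list of (power maximum value, maximum value scale number),
    oldest first, current frame last. Returns None when no determination
    is made (current frame is a minute signal, or too little history).
    """
    current_power, _ = frames[-1]
    if current_power <= power_threshold:
        return None
    # Walk back in time, skipping frames judged to be minute signals.
    numbers = [n for p, n in reversed(frames) if p > power_threshold]
    numbers = numbers[:n_frames]
    if len(numbers) < n_frames:
        return None
    mean = sum(numbers) / n_frames
    return sum((n - mean) ** 2 for n in numbers) / n_frames


def is_music_scene(variance, variance_threshold):
    """Music when the scale numbers vary more than the threshold."""
    return variance > variance_threshold
```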

FIG. 6 shows an example of the relationship between frames and the maximum value scale number. It is a graph of the maximum value scale number for each frame of the song "Drive" by the artist "Ketsumeishi". As FIG. 6 shows, the maximum value scale number varies around a center of about 40. As in this example, music is usually made up of many different tones, so the scale position varies. In the music detection device 2, the music/non-music determination unit 27 can evaluate this variation quantitatively using the variance calculated by the variance calculation unit 26. Accordingly, when the variance, as an index of the variation in scale position, is larger than a predetermined threshold, the signal can be determined to be music.

In the present embodiment, whether a scene is a music scene is determined from the variance of the maximum value scale number over five frames, but the number of frames used to calculate the variance may be five or more and is not particularly limited.

In the present embodiment, the scale spectrum calculation unit 21 calculates the spectrum corresponding to each frequency of the 12-tone equal temperament scale shown in FIG. 2. However, the scale spectrum calculation unit 21 may instead calculate, as the scale spectrum, a spectrum corresponding to the scale of an equal temperament other than 12-tone or of just intonation. The configuration is not particularly limited.

(Music detection device 3)
An embodiment of the music detection device 3 according to the present invention is described below with reference to FIGS. 7 to 9.

図７は、本発明に係る音楽検出装置３の構成を示すブロック図である。本発明に係る音楽検出装置３は、フレーム分割部５と窓掛け部６とスペクトル変換部７と音楽検出部３０とを含んで構成される。 FIG. 7 is a block diagram showing the configuration of the music detection device 3 according to the present invention. The music detection device 3 according to the present invention includes a frame division unit 5, a windowing unit 6, a spectrum conversion unit 7, and a music detection unit 30.

The music detection unit 30 includes an ultra-low frequency spectrum power calculation unit (low frequency spectrum power calculation means) 31, an ultra-low frequency spectrum power storage unit 32, an ultra-low frequency spectrum power autocorrelation coefficient calculation unit 33, a coefficient maximum value determination unit (frame interval detection means) 34, a high frequency spectrum power calculation unit (high frequency spectrum power calculation means) 35, an ultra-low frequency/high frequency power ratio calculation unit 36, and a music/non-music determination unit (music determination means) 37.

音楽検出装置３は、音楽検出装置１と同様に、テレビ受信装置などに実装され、放送信号に含まれる音響信号をもとに、放送中の番組に含まれる音楽シーンを検出する。本実施の形態では、音楽検出装置３には、音楽検出装置１と同様に、ＰＣＭ（Pulse Code Modulation）によってデジタル符号化された音響信号が入力される。 Similar to the music detection device 1, the music detection device 3 is mounted on a television receiver or the like, and detects a music scene included in a program being broadcast based on an acoustic signal included in the broadcast signal. In the present embodiment, similarly to the music detection device 1, an acoustic signal digitally encoded by PCM (Pulse Code Modulation) is input to the music detection device 3.

以下に、図７に示す音楽検出装置３における音楽検出の処理について説明する。 Below, the music detection process in the music detection apparatus 3 shown in FIG. 7 will be described.

音楽検出装置３におけるフレーム分割部５、窓掛け部６、および、スペクトル変換部７の処理内容は、音楽検出装置１と同様であり、説明は省略する。 The processing contents of the frame division unit 5, the windowing unit 6, and the spectrum conversion unit 7 in the music detection device 3 are the same as those of the music detection device 1, and the description thereof is omitted.

Based on the spectrum of each frame received from the spectrum conversion unit 7 (hereinafter, the input spectrum), the ultra-low frequency spectrum power calculation unit 31 calculates the sum of the spectrum power at or below 100 Hz (a predetermined first threshold) and outputs it to the ultra-low frequency spectrum power storage unit 32 and the ultra-low frequency spectrum power autocorrelation coefficient calculation unit 33. That is, the ultra-low frequency spectrum power calculation unit 31 extracts the components of the input spectrum at or below 100 Hz and calculates the sum of their squared values (hereinafter, the ultra-low frequency spectrum power total). The ultra-low frequency spectrum power total is thus the total spectrum power of the ultra-low frequency spectrum at or below 100 Hz in each frame. In the present embodiment the total is taken over components at or below 100 Hz, but it may instead be taken over components below 100 Hz. The threshold is also not limited to 100 Hz.
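The band-limited power sum can be sketched as follows. This is an illustrative sketch: it assumes the input spectrum is a list of magnitudes in which entry k sits at frequency k × (sample rate / FFT size), and the function and parameter names are hypothetical. The text does not state the sample rate or transform size.

```python
def ultra_low_band_power(magnitudes, sample_rate_hz, fft_size,
                         cutoff_hz=100.0):
    """Sum of squared spectrum values for bins at or below cutoff_hz.

    Assumption: magnitudes[k] is the spectrum value at frequency
    k * sample_rate_hz / fft_size.
    """
    bin_width_hz = sample_rate_hz / fft_size
    return sum(m * m
               for k, m in enumerate(magnitudes)
               if k * bin_width_hz <= cutoff_hz)
```

Changing `<=` to `<` gives the "below 100 Hz" variant that the text also allows.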

超低域スペクトルパワー保存部３２は、超低域スペクトルパワー算出部３１から出力される上記１００Ｈｚ以下の超低域スペクトルパワー合計を記憶する。つまり、超低域スペクトルパワー保存部３２は、全てのフレームについて超低域スペクトルパワー合計を履歴データとして記憶している。 The ultra low frequency spectrum power storage unit 32 stores the total ultra low frequency spectrum power of 100 Hz or less output from the ultra low frequency spectrum power calculation unit 31. That is, the ultra-low frequency spectrum power storage unit 32 stores the ultra-low frequency spectrum power total as history data for all frames.

The ultra-low frequency spectrum power autocorrelation coefficient calculation unit 33 uses the ultra-low frequency spectrum power total output from the ultra-low frequency spectrum power calculation unit 31, that is, the total for the current frame, together with the totals for past frames retrieved from the ultra-low frequency spectrum power storage unit 32, to calculate the autocorrelation coefficients of the low frequency spectrum power across consecutive frames. In the present embodiment, the autocorrelation coefficient R2(x) shown in Equation 6 is calculated over a total of 128 frames: the current frame and the past 127 frames.

In the formula for the autocorrelation coefficient R2(x), spp(i) is the total ultra-low frequency spectrum power of each frame. Here i is a number identifying the frame (hereinafter, the frame identification number) and is an integer from 1 to 128. Frame identification numbers are assigned in chronological order from 1 to 128, so spp(1) is the spectral power of the oldest frame and spp(128) is that of the current frame. In this embodiment, L = 64 in Equation 6 and i is varied from 0 to 63. The variable x is the frame interval (lag) for which the autocorrelation coefficient is calculated; x is varied from 1 to 64 and R2(x) is calculated for each x. The autocorrelation coefficient calculation unit 33 then outputs the 64 calculated coefficients R2(x) (x an integer from 1 to 64) to the coefficient maximum value detection unit 34.
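Equation 6 itself is not reproduced in this text, so the following sketch assumes a plain lagged-product sum, R2(x) = Σ spp[i]·spp[i + x] for i = 0 to L − 1, over a 0-based history of 128 frame powers. The exact form and any normalization in the patent's Equation 6 may differ; the function name `autocorr_r2` is hypothetical.

```python
def autocorr_r2(spp, L=64, max_lag=64):
    """Return [R2(1), ..., R2(max_lag)] for a history of frame powers.

    spp is a 0-based list, oldest frame first, with at least L + max_lag
    entries. Assumed form (not confirmed by the source):
        R2(x) = sum over i = 0..L-1 of spp[i] * spp[i + x]
    """
    if len(spp) < L + max_lag:
        raise ValueError("need at least L + max_lag frames of history")
    return [sum(spp[i] * spp[i + x] for i in range(L)) for x in range(1, max_lag + 1)]


# 128 frames with a burst of low-band power every 16 frames.
history = [1.0 if i % 16 == 0 else 0.0 for i in range(128)]
r2 = autocorr_r2(history)
print(r2[15], r2[16])  # R2(16) = 4.0, R2(17) = 0.0
```

With L = 64 and lags up to 64, the largest index touched is 63 + 64 = 127, which is why 128 frames of history suffice.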

The coefficient maximum value detection unit 34 finds the maximum among R2(1) to R2(64) output from the autocorrelation coefficient calculation unit 33, and outputs the frame interval x at which R2(x) is largest (hereinafter, the maximum value frame interval) to the high-frequency spectrum power calculation unit 35.
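Picking the maximum value frame interval amounts to an argmax over the 64 coefficients. A minimal sketch follows; the name `max_value_frame_interval` is hypothetical, and the tie-breaking rule (smallest lag wins) is an assumption, since the text does not say how ties are handled.

```python
def max_value_frame_interval(r2):
    """Return the lag x (1-based) at which R2(x) is largest.

    r2[0] holds R2(1), r2[1] holds R2(2), and so on.
    Ties go to the smallest lag (an assumption of this sketch).
    """
    best_index = max(range(len(r2)), key=lambda k: (r2[k], -k))
    return best_index + 1


coeffs = [0.0] * 64
coeffs[15] = 4.0  # R2(16)
coeffs[31] = 4.0  # R2(32), same height as R2(16)
print(max_value_frame_interval(coeffs))  # 16
```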

The high-frequency spectrum power calculation unit 35 receives from the coefficient maximum value detection unit 34 the input spectrum of each frame together with the maximum value frame interval. That is, in this embodiment the input spectrum output from the spectrum conversion unit 7 reaches the high-frequency spectrum power calculation unit 35 by way of the ultra-low frequency spectrum power calculation unit 31, the autocorrelation coefficient calculation unit 33, and the coefficient maximum value detection unit 34.

The high-frequency spectrum power calculation unit 35 then takes the per-frame spectrum received from the coefficient maximum value detection unit 34, that is, the input spectrum, calculates the sum of the spectral power at or above 100 Hz (the predetermined first threshold), and outputs it to the ultra-low/high frequency power ratio calculation unit 36. In other words, the unit 35 extracts the components of the input spectrum at or above 100 Hz and calculates the sum of their squared values (hereinafter, the total high-frequency spectrum power). The total high-frequency spectrum power is thus the per-frame total of the spectral power in the high band at or above 100 Hz. In this embodiment the total covers spectral power at or above 100 Hz, but it may instead cover only spectral power strictly above 100 Hz. The threshold is not limited to 100 Hz.

In this embodiment, the high-frequency spectrum power calculation unit 35 outputs the total ultra-low frequency spectrum power to the ultra-low/high frequency power ratio calculation unit 36 together with the total high-frequency spectrum power. That is, the total ultra-low frequency spectrum power calculated by the ultra-low frequency spectrum power calculation unit 31 reaches the power ratio calculation unit 36 by way of the autocorrelation coefficient calculation unit 33, the coefficient maximum value detection unit 34, and the high-frequency spectrum power calculation unit 35. The unit 35 also outputs the maximum value frame interval to the power ratio calculation unit 36.

The ultra-low/high frequency power ratio calculation unit 36 calculates the ratio of the total ultra-low frequency spectrum power to the total high-frequency spectrum power, both received from the high-frequency spectrum power calculation unit 35 (hereinafter, the ultra-low/high frequency power ratio), and outputs it to the music/non-music determination unit 37. More specifically, the ratio is calculated as total ultra-low frequency spectrum power ÷ total high-frequency spectrum power. Alternatively, it may be calculated as total ultra-low frequency spectrum power ÷ (total ultra-low frequency spectrum power + total high-frequency spectrum power); the definition is not particularly limited. The unit 36 also outputs the maximum value frame interval to the music/non-music determination unit 37.
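Both ratio definitions mentioned for unit 36 fit in one small function. This is a sketch only: the name `power_ratio`, the `normalized` flag, and the zero-denominator guard are assumptions added for illustration.

```python
def power_ratio(ultra_low_total, high_total, normalized=False):
    """Ultra-low/high frequency power ratio.

    normalized=False: ultra_low / high
    normalized=True:  ultra_low / (ultra_low + high)
    Returns 0.0 when the denominator is zero, e.g. for a silent frame;
    that guard is an assumption, not something the patent specifies.
    """
    denom = ultra_low_total + high_total if normalized else high_total
    return ultra_low_total / denom if denom > 0 else 0.0


print(power_ratio(3.0, 12.0))                   # 0.25
print(power_ratio(3.0, 12.0, normalized=True))  # 0.2
```

Note that a threshold tuned for one definition does not carry over unchanged to the other, although for very small ratios the two values are nearly equal.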

The music/non-music determination unit 37 determines whether the ultra-low/high frequency power ratio output from the power ratio calculation unit 36 is at or above a predetermined threshold (for example, 0.0003). It also determines whether the maximum value frame interval is between 10 and 64 frames inclusive (that is, 0.23 s to 1.5 s).

When both determinations hold, that is, when the maximum value frame interval is between 10 and 64 frames inclusive and the ultra-low/high frequency power ratio is 0.0003 or more, the music/non-music determination unit 37 determines that the scene is a music scene.
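The two-condition decision can be written out directly. In this sketch the function and parameter names are hypothetical; the default values are the example figures given in the text (10 to 64 frames, threshold 0.0003).

```python
def is_music_scene(max_lag_frames, ratio,
                   min_lag=10, max_lag=64, ratio_threshold=0.0003):
    """Music if the low-band power is periodic AND sufficiently strong."""
    periodic = min_lag <= max_lag_frames <= max_lag  # beat period in a plausible range
    bass_heavy = ratio >= ratio_threshold            # enough energy at or below 100 Hz
    return periodic and bass_heavy


print(is_music_scene(16, 0.01))     # True: periodic and bass-heavy
print(is_music_scene(16, 0.0001))   # False: low-band share too small
print(is_music_scene(5, 0.01))      # False: period outside 10..64 frames
```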

FIG. 8 shows the frequency spectrum of a drum. Unlike the trumpet spectrum in FIG. 3 and the glockenspiel spectrum in FIG. 4, the drum spectrum in FIG. 8 contains no harmonic overtone components. Consequently, the music detection device 1 may fail to detect music scenes played on instruments such as drums, whose sound contains no overtones and is therefore not a pitched musical tone.

FIG. 9 shows how the total spectral power of a drum at or below 100 Hz changes over time. The vertical axis is the total spectral power at or below 100 Hz, taking the least significant bit of 16-bit PCM as 1. The horizontal axis is the frame number, with some point in time taken as frame No. 1. As FIG. 9 shows, the drum's spectral power at or below 100 Hz is periodic over time: peaks recur at a fixed period. In the music detection device 3, the autocorrelation coefficient calculation unit 33 calculates the autocorrelation coefficient of the spectral power at or below 100 Hz across multiple frames, and the coefficient maximum value detection unit 34 detects the frame interval at which the coefficient is largest (the maximum value frame interval). Because the peaks in FIG. 9 recur at a fixed period, the autocorrelation coefficient is largest when the original series of spectral power values is shifted by exactly the number of frames in that period. The maximum value frame interval detected by the unit 34 is therefore the period at which peaks of spectral power at or below 100 Hz appear. For instruments such as drums, this period falls within a limited time range. It is therefore possible to check whether the period (the maximum value frame interval) lies within a predetermined range (for example, 10 to 64 frames inclusive; this corresponds to the predetermined range in the claims) and to determine that the signal is not music when it does not.

Human speech contains almost no components at or below 100 Hz, but the small amount it does contain shows periodicity in spectral power. A further condition is therefore needed so that speech is not mistaken for music such as drums. Unlike speech, the sound of drums and similar instruments is concentrated in the low range, so a signal in which the proportion of ultra-low components at or below 100 Hz is very small can be judged not to be music. Accordingly, the ultra-low/high frequency power ratio calculated by the power ratio calculation unit 36 is compared with a predetermined threshold (for example, 0.0003; this corresponds to the second threshold in the claims), and if the ratio is below the threshold, that is, if the proportion of ultra-low components is very small, the signal is determined not to be music.

The music/non-music determination unit 37 can thus determine that the signal is music when the ultra-low/high frequency power ratio output from the power ratio calculation unit 36 is at or above the predetermined threshold (for example, 0.0003) and the maximum value frame interval is within the predetermined range (for example, 10 to 64 frames inclusive).

(Voice detection device 4)
An embodiment of the voice detection device 4 according to the present invention is described below with reference to FIGS. 10 to 13.

FIG. 10 is a block diagram showing the configuration of the voice detection device 4 according to the present invention. The voice detection device 4 comprises a frame division unit 5, a windowing unit 6, a spectrum conversion unit 7, and a voice detection unit 40.

The voice detection unit 40 comprises a logarithmic spectrum calculation unit 41, a cepstrum calculation unit 42, a fundamental frequency extraction unit (fundamental frequency extraction means) 43, a fundamental frequency storage unit 44, a low-pass filter unit 45, a phrase component analysis unit 46 (fundamental frequency change detection means), an accent component analysis unit 47 (fundamental frequency change detection means), and a voice/non-voice determination unit (voice determination means) 48.

Like the music detection device 1, the voice detection device 4 is built into a television receiver or similar equipment and detects voice scenes in the program being broadcast from the acoustic signal contained in the broadcast signal. In this embodiment, as with the music detection device 1, the input to the voice detection device 4 is an acoustic signal digitally encoded by PCM (Pulse Code Modulation).

The voice detection processing performed by the voice detection device 4 shown in FIG. 10 is described below.

The processing performed by the frame division unit 5, the windowing unit 6, and the spectrum conversion unit 7 in the voice detection device 4 is the same as in the music detection device 1, and its description is omitted.

The logarithmic spectrum calculation unit 41 converts the spectrum of each frame received from the spectrum conversion unit 7 (hereinafter, the input spectrum) to a base-10 logarithm. That is, with the input spectrum denoted sp, the unit 41 calculates log10|sp|, which is referred to below as the logarithmic spectrum. The unit 41 then outputs the logarithmic spectrum to the cepstrum calculation unit 42.

The cepstrum calculation unit 42 applies a 1024-point IFFT (Inverse Fast Fourier Transform) to the logarithmic spectrum output from the logarithmic spectrum calculation unit 41, converting it to a cepstrum, which is time-domain data. The unit 42 then outputs the calculated cepstrum to the fundamental frequency extraction unit 43.
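The log-spectrum and cepstrum steps can be sketched without external libraries. To stay dependency-free, a naive O(N²) inverse DFT stands in here for the 1024-point IFFT; the function names and the `floor` guard against log10(0) are assumptions of this sketch, not part of the patent.

```python
import cmath
import math


def log_spectrum(spectrum, floor=1e-12):
    """log10 of the magnitude of each bin; floor avoids log10(0) (an assumption)."""
    return [math.log10(max(abs(s), floor)) for s in spectrum]


def inverse_dft(values):
    """Naive inverse DFT, O(N^2); a stand-in for the 1024-point IFFT."""
    n = len(values)
    return [
        sum(v * cmath.exp(2j * cmath.pi * k * t / n) for k, v in enumerate(values)) / n
        for t in range(n)
    ]


def cepstrum(spectrum):
    """Real cepstrum of one frame: inverse DFT of the log-magnitude spectrum."""
    return [c.real for c in inverse_dft(log_spectrum(spectrum))]


# A flat spectrum of magnitude 10 has log10 = 1 everywhere, so its cepstrum
# is 1 at index 0 and (numerically) 0 elsewhere.
flat = cepstrum([10.0] * 8)
print(round(flat[0], 6), round(abs(flat[1]), 6))
```

For real use, a proper FFT routine would replace `inverse_dft`; the structure of the computation stays the same.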

The fundamental frequency extraction unit 43 finds the maximum cepstrum value on the high-order side of the cepstrum output from the cepstrum calculation unit 42 (approximately fs/800 and above) and calculates the fundamental frequency (F0) as the reciprocal of the quefrency at which that maximum occurs. The unit 43 outputs the fundamental frequency (F0) to the fundamental frequency storage unit 44 and the low-pass filter unit 45.
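A minimal sketch of the peak-picking step follows. It reads "fs/800 and above" as a lower bound on the quefrency index in samples (which caps F0 at roughly 800 Hz); that reading, the function name, and the restriction of the search to the first half of the cepstrum are assumptions of this sketch.

```python
def fundamental_frequency(cepstrum, fs):
    """Estimate F0 in Hz from the largest cepstral value at quefrency >= fs/800 samples.

    Only the first half of the cepstrum is searched, because the second half
    mirrors the first for a real cepstrum.
    """
    q_min = max(1, int(fs / 800))
    q_max = len(cepstrum) // 2
    if q_min >= q_max:
        raise ValueError("cepstrum too short for this sampling rate")
    peak_q = max(range(q_min, q_max), key=lambda q: cepstrum[q])
    return fs / peak_q  # reciprocal of the quefrency (peak_q / fs seconds)


# Synthetic cepstrum: a large low-quefrency value (spectral envelope) that must
# be ignored, and a pitch peak at quefrency 80 samples.
cep = [0.0] * 1024
cep[3] = 9.0
cep[80] = 1.0
print(fundamental_frequency(cep, fs=16000))  # 200.0
```

The sampling rate of 16000 Hz in the usage lines is chosen only to make the arithmetic easy; it is not a value given in the patent.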

The fundamental frequency storage unit 44 stores the fundamental frequency (F0) output from the fundamental frequency extraction unit 43. That is, the storage unit 44 holds the fundamental frequency (F0) of every frame as history data.

The low-pass filter unit 45 low-pass filters the fundamental frequency (F0) output from the fundamental frequency extraction unit 43, that is, the F0 of the current frame, and outputs it to the phrase component analysis unit 46. The low-pass filter unit 45 also reads the F0 of past frames from the fundamental frequency storage unit 44, low-pass filters them in the same way as the F0 of the current frame, and outputs them to the phrase component analysis unit 46. F0 values in the low band, that is, F0 information that would act as noise, are removed by the low-pass filter unit 45 and are not passed to the phrase component analysis unit 46 or the accent component analysis unit 47. If the filtering leaves no F0 output for the current frame, no determination is made as to whether the current frame is a voice scene.

In this embodiment, the low-pass filter unit 45 repeatedly reads the fundamental frequency (F0) of past frames from the fundamental frequency storage unit 44, starting with the frame closest in time to the current frame, and low-pass filters and outputs each one. This is repeated until four F0 values have been output to the phrase component analysis unit 46. In the end, the low-pass filter unit 45 outputs to the phrase component analysis unit 46 the F0 of five frames in total: the current frame and four past frames.

The phrase component analysis unit 46 analyzes whether the fundamental frequency (F0) of the five frames output from the low-pass filter unit 45 is monotonically decreasing or monotonically increasing (that is, changing monotonically). It then determines whether that monotonic decrease or increase across the five frames lies within a predetermined frequency range (for example, 100 Hz to 400 Hz). When it detects a monotonic decrease or increase across the five frames, it further determines whether the width of the F0 change is within a predetermined range (for example, within 120 Hz).

If the monotonic decrease or increase of the fundamental frequency (F0) across the five frames lies within the predetermined frequency range (for example, 100 Hz to 400 Hz; the predetermined frequency range in the claims) and the width of the change is within the predetermined range (for example, within 120 Hz; the predetermined frequency width in the claims), the phrase component analysis unit 46 determines that the monotonic decrease or increase is a phrase component, that is, the pitch contour of a phrase spoken by a human voice. The unit 46 then outputs phrase analysis result information, indicating whether a phrase component is present, to the accent component analysis unit 47. In this embodiment, the unit 46 also passes the F0 values of the five frames from the low-pass filter unit 45 to the accent component analysis unit 47 together with the phrase analysis result information.
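The phrase-component test can be sketched as three checks on a short F0 sequence. The function name is hypothetical, and treating "monotonic" as strictly increasing or strictly decreasing is an assumption; the default limits are the example values from the text.

```python
def is_phrase_component(f0s, lo=100.0, hi=400.0, max_span=120.0):
    """True if the F0 sequence moves in one direction only, stays within
    [lo, hi] Hz, and spans at most max_span Hz in total."""
    if len(f0s) < 2:
        return False
    steps = [b - a for a, b in zip(f0s, f0s[1:])]
    # Strict monotonicity is an assumption; the text only says "monotonic".
    monotonic = all(s > 0 for s in steps) or all(s < 0 for s in steps)
    in_band = all(lo <= f <= hi for f in f0s)
    small_span = max(f0s) - min(f0s) <= max_span
    return monotonic and in_band and small_span


print(is_phrase_component([220, 210, 200, 190, 180]))  # True: gentle falling contour
print(is_phrase_component([220, 230, 200, 190, 180]))  # False: not monotonic
print(is_phrase_component([150, 200, 250, 300, 350]))  # False: 200 Hz span exceeds 120 Hz
```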

The accent component analysis unit 47 analyzes the fundamental frequency (F0) of the five frames output from the phrase component analysis unit 46 to see whether it shows a transition from monotonic increase to flat (no change) or from monotonic decrease to flat (that is, a change from monotonic variation to a constant frequency). It likewise analyzes whether F0 shows a transition from flat to monotonic decrease or from flat to monotonic increase (that is, a change from a constant frequency to monotonic variation). The unit 47 then determines whether the detected transition across the five frames, whether from monotonic increase to flat, from monotonic decrease to flat, from flat to monotonic decrease, or from flat to monotonic increase, lies within a predetermined frequency range (for example, 100 Hz to 400 Hz; the predetermined frequency range in the claims). When it detects any of these four transitions, it further determines whether the width of the F0 change is within a predetermined range (for example, within 120 Hz; the predetermined frequency width in the claims).

If the transition of the fundamental frequency (F0) across the five frames (from monotonic increase to flat, from monotonic decrease to flat, from flat to monotonic decrease, or from flat to monotonic increase) lies within the predetermined frequency range (for example, 100 Hz to 400 Hz) and the width of the change is within the predetermined range (for example, within 120 Hz), the accent component analysis unit 47 determines that it is an accent component, that is, the pitch contour of an accent in a human voice. The unit 47 then outputs accent analysis result information, indicating whether an accent component is present, to the voice/non-voice determination unit 48. In this embodiment, the unit 47 also passes the phrase analysis result information from the phrase component analysis unit 46 to the voice/non-voice determination unit 48 together with the accent analysis result information.
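One way to detect the four accent transitions is to classify each frame-to-frame step as rising, falling, or flat and then look for a single switch between a monotonic run and a flat run. This is a sketch under assumptions: the function name, the `flat_tol` tolerance that defines "flat", and the requirement of exactly one switch point are not specified in the patent.

```python
def is_accent_component(f0s, lo=100.0, hi=400.0, max_span=120.0, flat_tol=2.0):
    """True if F0 goes monotonic->flat or flat->monotonic, stays within
    [lo, hi] Hz, and spans at most max_span Hz in total."""
    if len(f0s) < 3:
        return False
    # Step signs: +1 rising, -1 falling, 0 flat (within flat_tol Hz, an assumption).
    signs = []
    for a, b in zip(f0s, f0s[1:]):
        d = b - a
        signs.append(0 if abs(d) <= flat_tol else (1 if d > 0 else -1))

    def uniform(part, allowed):
        """All steps in part share one sign, and that sign is in allowed."""
        return len(part) > 0 and part[0] in allowed and all(s == part[0] for s in part)

    transition = any(
        (uniform(signs[:k], (1, -1)) and uniform(signs[k:], (0,)))
        or (uniform(signs[:k], (0,)) and uniform(signs[k:], (1, -1)))
        for k in range(1, len(signs))
    )
    in_band = all(lo <= f <= hi for f in f0s)
    small_span = max(f0s) - min(f0s) <= max_span
    return transition and in_band and small_span


print(is_accent_component([200, 220, 240, 240, 240]))  # True: rise, then flat
print(is_accent_component([240, 240, 220, 200, 180]))  # True: flat, then fall
print(is_accent_component([200, 210, 220, 230, 240]))  # False: no flat part
```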

The voice/non-voice determination unit 48 determines, from the accent analysis result information and the phrase analysis result information, whether either an accent component or a phrase component is present. If either is present, it determines that the scene is a voice scene (a scene whose acoustic signal contains voice); that is, voice is detected. If neither an accent component nor a phrase component is present, it determines that the scene is a non-voice scene.

FIG. 11 shows characteristics of voice: (a) is the time waveform of Japanese speech by a male speaker, and (b) is the change over time of the fundamental frequency (F0) obtained from the waveform in (a). FIG. 12 shows characteristics of voice: (a) is the time waveform of Japanese speech by a female speaker, and (b) is the change over time of the fundamental frequency (F0) obtained from the waveform in (a). FIG. 13 shows characteristics of music: (a) is the time waveform, and (b) is the change over time of the fundamental frequency (F0) obtained from the waveform in (a).

As FIGS. 11(b) and 12(b) show, human voice contains phrase components and accent components, and in both cases the frequency lies in the range of 100 Hz to 400 Hz. The amount of change in both the phrase components and the accent components is within about 100 Hz. In contrast, as FIG. 13(b) shows, music contains no phrase components or accent components at all.

Therefore, if the monotonic decrease or increase of the fundamental frequency (F0) across the five frames lies within the predetermined frequency range (for example, 100 Hz to 400 Hz) and the width of the change is within the predetermined range (for example, within 120 Hz), a phrase component of a human voice is present, and it can be determined that the signal contains a human voice. Likewise, if a transition of F0 across the five frames from monotonic increase to flat, from monotonic decrease to flat, from flat to monotonic decrease, or from flat to monotonic increase lies within the predetermined frequency range (for example, 100 Hz to 400 Hz) and the width of the change is within the predetermined range (for example, within 120 Hz), an accent component of a human voice is present, and it can be determined that the signal contains a human voice.

When a music/voice detection device equipped with the music detection devices 1, 2, and 3 and the voice detection device 4 applies to an acoustic signal all of the music detection processing of the music detection devices 1, 2, and 3 and the voice detection processing of the voice detection device 4, detection takes only a short time (0.1 seconds at the shortest), and the accuracy is 87% for voice and 94% for music, so false detections can be reduced.

(Sound field control device 50)
An embodiment of the sound field control device 50 according to the present invention is described below with reference to FIG. 14. FIG. 14 is a block diagram showing the configuration of the sound field control device 50 according to this embodiment. The sound field control device 50 controls the sound field by correcting the acoustic signal according to whether the scene is a music scene, a voice scene, and so on. The sound field control device 50 comprises a music determination unit 51, a memory 52, a sound field control determination unit 53, and a sound field control processing unit 54. In this embodiment the music determination unit 51 is part of the sound field control device 50, but it may instead be provided independently of the sound field control device 50; the arrangement is not particularly limited. The sound field control device 50 receives the results of the music detection processing by the music detection devices 1, 2, and 3 and the result of the voice detection processing by the voice detection device 4.

In the sound field control device 50, the music determination unit 51 determines from the input music detection results whether the acoustic signal contains music. More specifically, if at least one of the results from the music detection devices 1, 2, and 3 indicates that music was detected (that is, if at least one of the devices detected music), the music determination unit 51 determines that the acoustic signal contains music. It then outputs the determination result to the memory 52. The memory 52 stores the determination result from the music determination unit 51 (hereinafter, music detection information) and the result of the voice detection processing from the voice detection device 4 (hereinafter, voice detection information).
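The combination rule of the music determination unit 51 is a logical OR over the three detector outputs, with the result stored alongside the voice detection result. A minimal sketch, in which the `DetectionRecord` type, the list standing in for the memory 52, and all names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class DetectionRecord:
    music: bool  # music detection information (OR of detectors 1, 2, 3)
    voice: bool  # voice detection information (detector 4)


def judge_and_store(memory, music_results, voice_result):
    """Combine the three music detector outputs and append a record to memory."""
    record = DetectionRecord(music=any(music_results), voice=voice_result)
    memory.append(record)
    return record


memory = []  # stands in for the memory 52
judge_and_store(memory, music_results=[False, False, True], voice_result=False)
judge_and_store(memory, music_results=[False, False, False], voice_result=True)
print(memory[0].music, memory[1].music)  # True False
```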

In the present embodiment, the music detection devices 1, 2, and 3 and the speech detection device 4 are provided in a music/speech detection device 55. An acoustic signal input to the music/speech detection device 55 is subjected to music detection processing by each of the music detection devices 1, 2, and 3 and to speech detection processing by the speech detection device 4.

In the sound field control device 50, the sound field control determination unit 53 then determines the content of the sound field control based on the multiple pieces of music detection information and speech detection information stored in the memory 52. There are three types of sound field control: sound field control for music scenes, sound field control for speech scenes, and sound field control for scenes containing both music and speech. Accordingly, the sound field control can be in one of four states: (A) sound field control for a music scene is being performed; (B) sound field control for a speech scene is being performed; (C) sound field control for a scene containing both music and speech is being performed; and (D) no sound field control is being performed (hereinafter, the neutral state).

FIG. 15 is a diagram showing the state transitions of the sound field control in the sound field control device 50. FIG. 15 shows the four states (A) to (D) above. As shown in FIG. 15, there are 16 state transition patterns, (1) to (16).

FIG. 16 is a diagram showing the condition for each state transition. The conditions in FIG. 16 correspond to transitions (1) to (16) in FIG. 15. For example, in state (D) (the neutral state), four state transitions can occur, namely (1), (2), (3), and (13), as shown in FIG. 15. In the sound field control device 50, the sound field control determination unit 53 performs sound field control according to the conditions shown in FIG. 16, based on the music detection information and speech detection information stored in the memory 52.

In the present embodiment, the acoustic signal input to the music detection devices 1, 2, and 3 and the speech detection device 4 is digitally encoded by PCM and divided into frames of 1024 samples each. When the sampling frequency of the acoustic signal is 44.1 kHz, the duration of one frame is about 23 ms (= (1 ÷ 44100) × 1024). Because the music detection devices 1 to 3 and the speech detection device 4 each perform their detection process over a plurality of consecutive frames (roughly 5 frames), the music detection information and speech detection information are stored in the memory 52 about every 0.115 seconds (= 23 ms × 5 frames). The sound field control determination unit 53 analyzes the latest 10 consecutive sets (about 1.15 seconds) of music detection information and speech detection information accumulated in the memory 52 and determines the content of the sound field control.
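The timing figures in this paragraph follow directly from the frame length and the sampling frequency. The short calculation below works them out without rounding the frame duration; the constant names are illustrative, and the value of exactly 5 frames per detection result is an assumption based on the "roughly 5 frames" stated in the text.

```python
SAMPLE_RATE_HZ = 44_100    # sampling frequency given in the text
SAMPLES_PER_FRAME = 1024   # frame length given in the text
FRAMES_PER_RESULT = 5      # assumed: "roughly 5 frames" per detection result
RESULTS_PER_DECISION = 10  # latest consecutive results analyzed

frame_s = SAMPLES_PER_FRAME / SAMPLE_RATE_HZ   # duration of one frame
result_s = frame_s * FRAMES_PER_RESULT         # interval between stored results
window_s = result_s * RESULTS_PER_DECISION     # span covered by one decision

print(f"frame:  {frame_s * 1000:.1f} ms")  # frame:  23.2 ms
print(f"result: {result_s:.3f} s")         # result: 0.116 s
print(f"window: {window_s:.2f} s")         # window: 1.16 s
```

A decision therefore reflects a little more than one second of audio, which is the time scale on which the state transitions described next operate.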

More specifically, from the 10 sets of music detection information and speech detection information, the sound field control determination unit 53 counts the number of times music was detected (hereinafter, the music detection count) and the number of times speech was detected (hereinafter, the speech detection count), and switches among the sound field control states (A) to (D) according to the two counts.
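One way to picture the counting step is a fixed-length history that keeps only the latest 10 results. The sketch below is an assumption about how the memory 52 might be modeled in software; the class `DetectionHistory` and its method names are hypothetical.

```python
from collections import deque

HISTORY_LEN = 10  # number of most recent detection results analyzed


class DetectionHistory:
    """Keeps the latest HISTORY_LEN (music, speech) detection flags."""

    def __init__(self):
        # A deque with maxlen discards the oldest record automatically.
        self._records = deque(maxlen=HISTORY_LEN)

    def push(self, music_detected, speech_detected):
        self._records.append((bool(music_detected), bool(speech_detected)))

    def counts(self):
        """Return (music detection count, speech detection count)."""
        music = sum(m for m, _ in self._records)
        speech = sum(s for _, s in self._records)
        return music, speech


history = DetectionHistory()
for _ in range(8):
    history.push(True, False)   # music only
history.push(False, True)       # speech only
history.push(False, False)      # neither detected
print(history.counts())         # (8, 1)
```

Music and speech are counted independently, so a single record in which both are detected contributes to both counts.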

FIG. 17 is a flowchart showing the processing performed by the sound field control determination unit 53. The processing is described below with reference to FIG. 17.

First, in S171, the sound field control determination unit 53 determines whether the current sound field control state is neutral (state (D) above). If it is neutral, the unit determines in S172 whether state transition condition (1) in FIG. 16, "music detection count < 2 and speech detection count > 3", is satisfied.

If condition (1) is determined to be satisfied in S172, sound field control for a speech scene is performed (S173); that is, the state transitions to state (B) in FIG. 15. If condition (1) is determined not to be satisfied in S172, the unit determines in S174 whether state transition condition (2) in FIG. 16, "music detection count > 3 and speech detection count < 2", is satisfied.

If condition (2) is determined to be satisfied in S174, sound field control for a music scene is performed (S175); that is, the state transitions to state (A) in FIG. 15. If condition (2) is determined not to be satisfied in S174, the unit determines in S176 whether state transition condition (3) in FIG. 16, "music detection count > 2 and speech detection count > 2", is satisfied.

If condition (3) is determined to be satisfied in S176, sound field control for a scene containing both music and speech is performed (S177); that is, the state transitions to state (C) in FIG. 15. If condition (3) is determined not to be satisfied in S176, the neutral state is maintained (S178).
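Steps S172 to S178 amount to checking conditions (1), (2), and (3) in order and taking the first one that holds. A minimal sketch follows; the function name and the single-letter state labels (taken from states (A) to (D) in the text) are illustrative choices, not part of the specification.

```python
def next_state_from_neutral(music_count, speech_count):
    """Decide the next state when the current state is neutral (D)."""
    if music_count < 2 and speech_count > 3:   # condition (1), checked in S172
        return "B"                             # speech scene control (S173)
    if music_count > 3 and speech_count < 2:   # condition (2), checked in S174
        return "A"                             # music scene control (S175)
    if music_count > 2 and speech_count > 2:   # condition (3), checked in S176
        return "C"                             # music-and-speech control (S177)
    return "D"                                 # remain neutral (S178)


print(next_state_from_neutral(8, 1))  # A
print(next_state_from_neutral(7, 3))  # C
```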

In the present embodiment, the conditions shown in FIG. 16 are stored in the memory 52 in advance, but the conditions may also be changed and set again; there is no particular limitation.

Even when one of the sound field controls (A) to (C) is already being performed, the sound field control determination unit 53 likewise performs the processing of S179 to S201 in accordance with the flowchart of FIG. 17. In this case too, the unit determines the state to transition to based on the state transition conditions shown in FIG. 16. The state transitions that occur while one of the sound field controls (A) to (C) is being performed are described below, following the processing flow of S179 to S201.

As described above, the sound field control determination unit 53 determines in S171 whether the current sound field control state is neutral (state (D) above). If the state is determined not to be neutral in S171, the unit determines in S179 whether the sound field control state is the control state for a music scene (state (A) above). If it is, the unit determines in S180 whether state transition condition (7) in FIG. 16, "music detection count < 2 and speech detection count > 5 (= 3 + 2)", is satisfied (this is condition (1) with an offset of +2 applied to the speech detection count).

If condition (7) is determined to be satisfied in S180, sound field control for a speech scene is performed (S181); that is, the state transitions to state (B) in FIG. 15. If condition (7) is determined not to be satisfied in S180, the unit determines in S182 whether state transition condition (10) in FIG. 16, "music detection count > 4 (= 2 + 2) and speech detection count > 4 (= 2 + 2)", is satisfied.

If condition (10) is determined to be satisfied in S182, sound field control for a scene containing both music and speech is performed (S183); that is, the state transitions to state (C) in FIG. 15. If condition (10) is determined not to be satisfied in S182, the unit determines in S184 whether state transition condition (5) in FIG. 16, "music detection count < 2", is satisfied.

If condition (5) is determined to be satisfied in S184, the sound field control returns to neutral (S185); that is, the state transitions to state (D) in FIG. 15. If condition (5) is determined not to be satisfied in S184, sound field control for a music scene is continued (S186).

As described above, the sound field control determination unit 53 determines in S179 whether the sound field control state is the control state for a music scene (state (A) above). If it is determined not to be in S179, the unit determines in S187 whether the sound field control state is the control state for a speech scene (state (B) above). If it is, the unit determines in S188 whether state transition condition (8) in FIG. 16, "music detection count > 5 (= 3 + 2) and speech detection count < 2", is satisfied (this is condition (2) with an offset of +2 applied to the music detection count).

If condition (8) is determined to be satisfied in S188, sound field control for a music scene is performed (S189); that is, the state transitions to state (A) in FIG. 15. If condition (8) is determined not to be satisfied in S188, the unit determines in S190 whether state transition condition (12) in FIG. 16, "music detection count > 4 (= 2 + 2) and speech detection count > 4 (= 2 + 2)", is satisfied (this is condition (3) with an offset of +2 applied to both the music detection count and the speech detection count).

If condition (12) is determined to be satisfied in S190, sound field control for a scene containing both music and speech is performed (S191); that is, the state transitions to state (C) in FIG. 15. If condition (12) is determined not to be satisfied in S190, the unit determines in S192 whether state transition condition (4) in FIG. 16, "speech detection count < 2", is satisfied.

If condition (4) is determined to be satisfied in S192, the sound field control returns to neutral (S193); that is, the state transitions to state (D) in FIG. 15. If condition (4) is determined not to be satisfied in S192, sound field control for a speech scene is continued (S194).

As described above, the sound field control determination unit 53 determines in S187 whether the sound field control state is the control state for a speech scene (state (B) above). If it is determined not to be in S187, the sound field control state must be the control state for a scene containing both music and speech (state (C) above). In that case, the unit determines in S195 whether state transition condition (9) in FIG. 16, "music detection count > 5 (= 3 + 2) and speech detection count < 2", is satisfied (this is condition (2) with an offset of +2 applied to the music detection count).

If condition (9) is determined to be satisfied in S195, sound field control for a music scene is performed (S196); that is, the state transitions to state (A) in FIG. 15. If condition (9) is determined not to be satisfied in S195, the unit determines in S197 whether state transition condition (11) in FIG. 16, "music detection count < 2 and speech detection count > 5 (= 3 + 2)", is satisfied (this is condition (1) with an offset of +2 applied to the speech detection count).

If condition (11) is determined to be satisfied in S197, sound field control for a speech scene is performed (S198); that is, the state transitions to state (B) in FIG. 15. If condition (11) is determined not to be satisfied in S197, the unit determines in S199 whether state transition condition (6) in FIG. 16, "music detection count < 2 and speech detection count < 2", is satisfied.

If condition (6) is determined to be satisfied in S199, the sound field control returns to neutral (S200); that is, the state transitions to state (D) in FIG. 15. If condition (6) is determined not to be satisfied in S199, sound field control for a scene containing both music and speech is continued (S201).
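The branches S179 to S201 can be summarized as an ordered rule table per active state, where the first satisfied condition determines the transition and the current state is kept otherwise. The sketch below is one possible arrangement, not the implementation described in the specification: the names `RULES` and `next_state` are hypothetical, and the speech-side comparison of condition (9) is assumed to be "< 2" by analogy with condition (2), because the operator is missing from the source text.

```python
# Each rule: (condition number in FIG. 16, predicate on counts, next state).
# Rules are listed in flowchart order; the first match wins.
RULES = {
    "A": [  # music scene control: S180, S182, S184
        (7, lambda m, s: m < 2 and s > 5, "B"),
        (10, lambda m, s: m > 4 and s > 4, "C"),
        (5, lambda m, s: m < 2, "D"),
    ],
    "B": [  # speech scene control: S188, S190, S192
        (8, lambda m, s: m > 5 and s < 2, "A"),
        (12, lambda m, s: m > 4 and s > 4, "C"),
        (4, lambda m, s: s < 2, "D"),
    ],
    "C": [  # music-and-speech scene control: S195, S197, S199
        (9, lambda m, s: m > 5 and s < 2, "A"),  # "< 2" is an assumption
        (11, lambda m, s: m < 2 and s > 5, "B"),
        (6, lambda m, s: m < 2 and s < 2, "D"),
    ],
}


def next_state(state, music_count, speech_count):
    """Return the next state for an active state "A", "B", or "C"."""
    for _number, predicate, target in RULES[state]:
        if predicate(music_count, speech_count):
            return target
    return state  # no condition met: continue the current control


print(next_state("A", 7, 3))  # A
print(next_state("B", 6, 1))  # A
print(next_state("C", 1, 1))  # D
```

Keeping the conditions in a table also reflects the remark that the conditions stored in the memory 52 may be changed and set again.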

As described above, the sound field control determination unit 53 makes its determination based on the state transition conditions shown in FIG. 16 and switches the sound field control state according to the result; that is, the sound field control state transitions. The sound field control processing unit 54 then corrects the input acoustic signal by applying signal processing corresponding to the determination result of the sound field control determination unit 53, and the output PCM is reproduced through playback equipment (not shown) such as a DA converter, an amplifier, and speakers.

As a result, for example, suppose that in the neutral state the analysis of the 10 sets of music detection information and speech detection information described above yields a music detection count of 8, a speech detection count of 1, and 1 set in which neither speech nor music was detected. In S174 of FIG. 17 it is then determined that state transition condition (2), "music detection count > 3 and speech detection count < 2", is satisfied. In this case, the sound field control determination unit 53 decides to start sound field control for a music scene.

Condition (2) is set in consideration of the fact that the accuracy of the music detection and speech detection processes is about 90%, that is, roughly one result in ten is misjudged. Therefore, even though speech is detected once, the determination is made to perform sound field control for a music scene.

Further, suppose that while sound field control for a music scene is being performed, analysis of the next 10 sets of music detection information and speech detection information yields a music detection count of 7, a speech detection count of 3, and 3 sets in which neither speech nor music was detected (in which case speech and music were detected simultaneously in 3 sets). In S184 of FIG. 17 it is then determined that state transition condition (5), "conditions (7) and (10) are not satisfied and music detection count < 2", is not satisfied. In this case, the sound field control determination unit 53 decides to continue sound field control for a music scene.

This example shows the following. In the neutral state, a music detection count of 7 and a speech detection count of 3 are determined in S176 to satisfy state transition condition (3), so sound field control for a scene containing both music and speech is selected. In the music scene control state, by contrast, the same counts do not lead to that selection. That is, when some sound field control is currently being performed (the state is not neutral), the state transition conditions are set so as to give precedence to the current sound field control state. The reason for setting the conditions in this way is as follows.
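The precedence given to the current state can be seen by comparing condition (3) with conditions (10) and (12), which are the same test with an offset of +2 on both thresholds. The sketch below isolates just that comparison; the function and constant names are hypothetical.

```python
MIXED_BASE_THRESHOLD = 2  # condition (3): both counts must exceed 2
ACTIVE_STATE_OFFSET = 2   # offset applied in conditions (10) and (12)


def mixed_scene_triggered(music_count, speech_count, control_active):
    """Check the music-and-speech condition with or without the offset.

    control_active is False in the neutral state (condition (3)) and
    True in the music or speech scene state (conditions (10), (12)).
    """
    offset = ACTIVE_STATE_OFFSET if control_active else 0
    threshold = MIXED_BASE_THRESHOLD + offset
    return music_count > threshold and speech_count > threshold


# Same counts as the example in the text: music 7, speech 3.
print(mixed_scene_triggered(7, 3, control_active=False))  # True
print(mixed_scene_triggered(7, 3, control_active=True))   # False
```

Raising the thresholds once a control is active acts as hysteresis: the counts that start a control from neutral are not enough to displace a control that is already running.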

As described above, the accuracy of the music detection and speech detection processes is about 90%, and about 10% of results are misjudged, so the sound field control state may not always be switched appropriately. In addition, a single conversation scene may contain unvoiced pauses for breath, or music may be mixed in for only a few seconds as a sound effect. Following scene changes (that is, switching the sound field control state) on the order of seconds (or hundreds of milliseconds) therefore does not necessarily produce switching that is comfortable for the viewer; if anything, it tires the viewer. For this reason, when sound field control is already being performed, the state transition conditions are set so that, compared with the transitions out of the neutral state, the current scene (that is, the current sound field control state) is given precedence. This keeps the number of state switches appropriate. In other words, it realizes a configuration in which the sound field control state is switched only at the subjective time boundaries that the listener perceives as one scene.

The present invention can also be expressed as follows.

(First configuration)
A first configuration comprising: means for dividing an input audio signal into frames of a predetermined duration; means for converting each frame into the frequency domain; means for logarithmically transforming the horizontal (frequency) axis of the resulting spectrum; means for calculating an autocorrelation value of the logarithmically transformed spectrum; means for calculating a correlation value between the calculated autocorrelation value and that of a past frame; means for comparing whether the correlation value stays within a predetermined value over a predetermined number of frames; and determination means for determining that the scene is a music scene when the correlation value is within the predetermined value.

(Second configuration)
A second configuration comprising: means for dividing an input audio signal into frames of a predetermined duration; means for converting each frame into the frequency domain; means for logarithmically transforming the horizontal (frequency) axis of the resulting spectrum; means for detecting the maximum value of the power of the logarithmically transformed spectrum; means for detecting the frequency having the detected maximum value; means for comparing the detected maximum frequency with the maximum frequency of a predetermined past frame; means for comparing whether the compared frequency band is at least a predetermined scale width; and determination means for determining that the scene is a music scene when it is at least the predetermined scale width.

(Third configuration)
A third configuration comprising: means for dividing an input audio signal into frames of a predetermined duration; means for converting each frame into the frequency domain; means for calculating, from the resulting spectrum, the power at or below a predetermined frequency and the power above it; means for cumulatively adding the calculated low-band power and high-band power to those of past frames; means for calculating the ratio of the cumulatively added low-band power to the cumulatively added high-band power; means for calculating an autocorrelation value of the cumulatively added low-band power; and determination means for determining that the scene is a music scene when the calculated ratio is at least a predetermined value and the maximum of the autocorrelation value of the low-band power lies at or above a predetermined value (approximately 0.2 seconds) and at or below a predetermined value (approximately 1.5 seconds).
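The autocorrelation test in the third configuration can be read as a check that the low-band power repeats with a period between roughly 0.2 and 1.5 seconds, as a drum beat would. The sketch below illustrates only that part, and rests on several assumptions that are not in the text: the function names are hypothetical, the mean is removed before correlating, the search is limited to lags of up to half the sequence length, and the low-band to high-band power ratio test is omitted.

```python
def dominant_lag_seconds(power, frame_s):
    """Lag, in seconds, at which the autocorrelation of `power` peaks.

    power: low-band power per frame; frame_s: frame duration in seconds.
    """
    n = len(power)
    mean = sum(power) / n
    centered = [p - mean for p in power]  # assumption: remove the mean first
    best_lag, best_value = 1, float("-inf")
    for lag in range(1, n // 2 + 1):
        value = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag * frame_s


def looks_rhythmic(power, frame_s, low_s=0.2, high_s=1.5):
    """True if the dominant repetition period lies in [low_s, high_s]."""
    return low_s <= dominant_lag_seconds(power, frame_s) <= high_s


# Synthetic example: a low-band pulse every 20 frames of about 23.2 ms each.
frame_s = 1024 / 44100
pulses = [1.0 if i % 20 == 0 else 0.0 for i in range(200)]
print(looks_rhythmic(pulses, frame_s))
```

With a pulse every 20 frames the dominant period is about 0.46 seconds, which falls inside the stated range.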

(Fourth configuration)
A fourth configuration, in a device that divides an input audio signal into frames of a predetermined duration and extracts a fundamental frequency for each frame by the cepstrum method, the instantaneous frequency method, or the like, comprising: means for comparing the extracted fundamental frequency and the fundamental frequencies detected in a plurality of past frames each against a predetermined range (approximately 100 Hz to approximately 400 Hz); means for detecting the amount of change in the fundamental frequency when the predetermined range is satisfied; and determination means for determining that the scene is a speech scene when the detected amount of change is within a predetermined range (approximately 120 Hz) and the fundamental frequency increases or decreases monotonically.
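The decision rule of the fourth configuration can be sketched on a short track of fundamental-frequency values. This is an illustration under stated assumptions, not the specified method: the function name is hypothetical, the fundamental frequencies are taken as already extracted, "monotonic" is interpreted strictly, and the amount of change is measured between the first and last values of the track.

```python
def looks_like_speech(f0_track, f0_min=100.0, f0_max=400.0, max_change=120.0):
    """Apply the range, change, and monotonicity tests to F0 values in Hz."""
    if len(f0_track) < 2:
        return False
    # Every fundamental frequency must lie in the typical speech range.
    if not all(f0_min <= f <= f0_max for f in f0_track):
        return False
    steps = [b - a for a, b in zip(f0_track, f0_track[1:])]
    rising = all(step > 0 for step in steps)
    falling = all(step < 0 for step in steps)
    # Assumption: the change is measured across the whole track.
    total_change = abs(f0_track[-1] - f0_track[0])
    return (rising or falling) and total_change <= max_change


print(looks_like_speech([200.0, 210.0, 225.0, 240.0]))  # True
print(looks_like_speech([200.0, 200.0, 200.0]))         # False
```

The second call returns False because a held pitch, as in a sustained musical note, shows no gradual rise or fall of the fundamental frequency.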

The present invention is not limited to the embodiments described above, and various modifications are possible within the scope of the claims. That is, embodiments obtained by combining technical means modified as appropriate within the scope of the claims are also included in the technical scope of the present invention.

Finally, each block of the music detection devices 1, 2, and 3 and the speech detection device 4 may be implemented in hardware logic, or may be realized in software using a CPU as follows.

That is, the music detection units 10, 20, and 30 and the speech detection unit 40 each include a CPU (central processing unit) that executes the instructions of a control program realizing each function, a ROM (read only memory) that stores the program, a RAM (random access memory) into which the program is loaded, and a storage device (recording medium) such as a memory that stores the program and various data. The object of the present invention can also be achieved by supplying the music detection devices 1, 2, and 3 and the speech detection device 4 with a recording medium on which the program code (executable program, intermediate code program, or source program) of the control programs of the music detection units 10, 20, and 30 and the speech detection unit 40, which are software realizing the functions described above, is recorded in computer-readable form, and by having the computer (or CPU or MPU) read and execute the program code recorded on the recording medium.

Examples of the recording medium include tape media such as magnetic tapes and cassette tapes; disk media including magnetic disks such as floppy (registered trademark) disks and hard disks, and optical disks such as CD-ROM, MO, MD, DVD, and CD-R; card media such as IC cards (including memory cards) and optical cards; and semiconductor memories such as mask ROM, EPROM, EEPROM, and flash ROM.

The music detection devices 1, 2, and 3 and the speech detection device 4 may also be configured to be connectable to a communication network, and the program code may be supplied via the communication network. The communication network is not particularly limited; for example, the Internet, an intranet, an extranet, a LAN, ISDN, a VAN, a CATV communication network, a virtual private network, a telephone network, a mobile communication network, or a satellite communication network can be used. The transmission medium constituting the communication network is also not particularly limited; for example, wired media such as IEEE 1394, USB, power line carrier, cable TV lines, telephone lines, and ADSL lines can be used, as can wireless media such as infrared (IrDA or a remote control), Bluetooth (registered trademark), 802.11 wireless, HDR, mobile phone networks, satellite links, and terrestrial digital networks. The present invention can also be realized in the form of a computer data signal embedded in a carrier wave, in which the program code is embodied by electronic transmission.

The music detection device and speech detection device according to the present invention can detect music scenes and speech scenes in a program being broadcast or played back, and can therefore be suitably used in television receivers and the like that perform optimal sound field control according to the scene.

FIG. 1 is a block diagram showing the configuration of a music detection device according to the present invention.
FIG. 2 is a diagram showing the relationship between the 12-tone equal temperament scale and frequency.
FIG. 3 shows frequency spectra of a trumpet: (a) shows the frequency spectrum at a certain time, and (b) shows the frequency spectrum 23 ms after the time of (a).
FIG. 4 shows frequency spectra of a glockenspiel: (a) shows the frequency spectrum at a certain time, and (b) shows the frequency spectrum 23 ms after the time of (a).
FIG. 5 is a block diagram showing the configuration of a music detection device according to the present invention.
FIG. 6 is a diagram showing an example of the relationship between frames and the scale note with maximum power.
FIG. 7 is a block diagram showing the configuration of a music detection device according to the present invention.
FIG. 8 is a diagram showing the frequency spectrum of a taiko drum.
FIG. 9 is a diagram showing the change over time of the total spectral power at or below 100 Hz for a drum.
FIG. 10 is a block diagram showing the configuration of a speech detection device according to the present invention.
FIG. 11 shows characteristics of speech: (a) shows the time waveform of a speech in Japanese by a male speaker, and (b) shows the change over time of the fundamental frequency (F0) obtained from the time waveform of (a).
FIG. 12 shows characteristics of speech: (a) shows the time waveform of a speech in Japanese by a female speaker, and (b) shows the change over time of the fundamental frequency (F0) obtained from the time waveform of (a).
FIG. 13 shows characteristics of music: (a) shows a time waveform, and (b) shows the change over time of the fundamental frequency (F0) obtained from the time waveform of (a).
FIG. 14 is a block diagram showing the configuration of a sound field control device according to the present invention.
FIG. 15 is a diagram showing the state transitions of the sound field control in the sound field control device.
FIG. 16 is a diagram showing the condition for each state transition of the sound field control.
FIG. 17 is a flowchart showing the processing performed by the sound field control determination unit.

Explanation of symbols

1 Music detection device
2 Music detection device
3 Music detection device
4 Speech detection device
5 Frame division unit
6 Windowing unit
7 Spectrum conversion unit
10 Music detection unit
11 Scale spectrum calculation unit (spectrum calculation means)
12 Autocorrelation coefficient calculation unit (autocorrelation value calculation means)
13 Coefficient maximum value detection unit
14 Coefficient maximum value storage unit
15 Coefficient maximum value comparison unit
16 Variance calculation unit (quantification means)
17 Music/non-music determination unit (music determination means)
20 Music detection unit
21 Scale spectrum calculation unit
22 Spectral power calculation unit (spectral power calculation means)
23 Power maximum value detection unit (maximum scale identification number detection means)
24 Power maximum value storage unit
25 Power maximum value comparison unit
26 Variance calculation unit (quantification means)
27 Music/non-music determination unit (music determination means)
30 Music detection unit
31 Ultra-low-frequency spectral power calculation unit (low-frequency spectral power calculation means)
32 Ultra-low-frequency spectral power storage unit
33 Ultra-low-frequency spectral power autocorrelation coefficient calculation unit
34 Coefficient maximum value determination unit (frame interval detection means)
35 High-frequency spectral power calculation unit (high-frequency spectral power calculation means)
36 Ultra-low-frequency/high-frequency power ratio calculation unit
37 Music/non-music determination unit (music determination means)
40 Speech detection unit
41 Logarithmic spectrum calculation unit
42 Cepstrum calculation unit
43 Fundamental frequency calculation unit (fundamental frequency extraction means)
44 Fundamental frequency storage unit
45 Low-pass filter unit
46 Phrase component analysis unit (fundamental frequency change detection means)
47 Accent component analysis unit (fundamental frequency change detection means)
48 Speech/non-speech determination unit (speech determination means)
50 Sound field control device
51 Music determination unit (music determination device)
52 Memory
53 Sound field control determination unit
54 Sound field control processing unit
55 Music/speech detection device

Claims

A music detection device comprising:
spectrum calculation means for calculating, from an acoustic signal, a frequency spectrum for each frame representing a predetermined time span of the acoustic signal;
autocorrelation value calculation means for calculating autocorrelation values of the frequency spectrum;
quantification means for quantifying the magnitude of variation of the maximum autocorrelation value across a plurality of consecutive frames; and
music determination means for determining that the acoustic signal is music when the magnitude of the variation is smaller than a predetermined threshold.

The music detection device according to claim 1, wherein the spectrum calculation means calculates a spectrum at each frequency corresponding to a musical scale.

The autocorrelation value calculation means, using sp(i) (i = 0, 1, ..., N−1), the N data values representing the spectrum, calculates the M autocorrelation values as the values of the following autocorrelation function R1(x) (x = 1, 2, ..., M).

The music detection device according to claim 2, wherein the quantification means quantifies the variation by calculating the variance of the maximum value.
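The decision described in the claims above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the autocorrelation function R1(x) is not reproduced in this text, so the normalized form used here is an assumption, and the names `autocorr_max`, `is_music`, `max_lag` and `threshold` are hypothetical.

```python
from statistics import pvariance

def autocorr_max(sp, max_lag):
    """Largest autocorrelation of one frame's spectrum over lags 1..max_lag.

    Assumed form (not taken from the patent):
    R(x) = sum_i sp[i] * sp[i + x] / sum_i sp[i] ** 2
    """
    energy = sum(v * v for v in sp)
    if energy == 0:
        return 0.0  # silent frame: no correlation to report
    return max(
        sum(a * b for a, b in zip(sp, sp[x:])) / energy
        for x in range(1, max_lag + 1)
    )

def is_music(frames, max_lag, threshold):
    """Judge music when the per-frame maxima vary little across frames."""
    maxima = [autocorr_max(sp, max_lag) for sp in frames]
    return pvariance(maxima) < threshold
```

A sustained harmonic tone gives nearly the same maximum in every frame, so its variance stays below the threshold, whereas a signal whose spectral structure changes from frame to frame does not.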

A music detection device comprising:
spectral power calculation means for calculating, from an acoustic signal, the spectral power at each frequency corresponding to a musical scale for each frame representing a predetermined time span of the acoustic signal;
maximum scale identification number detection means for detecting, for each frame, the maximum scale identification number, that is, the scale identification number at which the spectral power is largest, a scale identification number being assigned to each frequency of the scale to identify that frequency;
quantification means for quantifying the magnitude of variation of the maximum scale identification number across a plurality of consecutive frames; and
music determination means for determining that the acoustic signal is music when the variation is larger than a predetermined threshold.

The music detection device according to claim 5, wherein the quantification means quantifies the variation by calculating the variance of the maximum scale identification number.
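As a rough illustration of this second detector, the sketch below takes per-frame lists of spectral power, one value per scale note, finds the index of the loudest note in each frame, and compares the variance of those indices with a threshold. The function names and the use of list positions as scale identification numbers are assumptions made for the example.

```python
from statistics import pvariance

def loudest_scale_index(scale_power):
    """Index (used here as the scale identification number) of the largest power."""
    return max(range(len(scale_power)), key=scale_power.__getitem__)

def is_music_by_scale(frames, threshold):
    """Judge music when the loudest note index varies strongly across frames."""
    indices = [loudest_scale_index(p) for p in frames]
    return pvariance(indices) > threshold
```

A melody that moves between notes spreads the indices widely, while a signal whose strongest scale bin never moves yields zero variance.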

A music detection device comprising:
low-frequency spectral power calculation means for calculating, for each frame of an acoustic signal, a low-frequency spectral power by summing the spectral power at frequencies equal to or lower than, or lower than, a predetermined first threshold;
frame interval detection means for detecting the frame interval at which the autocorrelation value of the low-frequency spectral power over a predetermined number of consecutive frames is largest;
high-frequency spectral power calculation means for calculating, for each frame of the acoustic signal, a high-frequency spectral power by summing the spectral power at frequencies equal to or higher than, or higher than, the first threshold; and
music determination means for determining that the acoustic signal is music when the ratio of the low-frequency spectral power to the high-frequency spectral power is equal to or greater than a predetermined second threshold and the frame interval is within a predetermined range.

The music detection device according to claim 7, wherein the frame interval detection means, using spp(i) (i = 0, 1, ..., N−1), the N data values representing the low-frequency spectral power, calculates M autocorrelation values as the values of the following autocorrelation function R2(x) (x = 1, 2, ..., M).
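The rhythm-based detector can be sketched as below. Several details are assumptions made for the example: the autocorrelation function R2(x) is not reproduced in this text, so a plain unnormalized sum of products is used; the power ratio is taken over the whole window of frames rather than per frame; and all names (`band_powers`, `best_interval`, `is_rhythmic_music`, `bin_hz`, `split_hz`) are hypothetical.

```python
def band_powers(power_spectrum, bin_hz, split_hz):
    """Split one frame's power spectrum into (low, high) sums at split_hz."""
    low = sum(p for k, p in enumerate(power_spectrum) if k * bin_hz <= split_hz)
    high = sum(p for k, p in enumerate(power_spectrum) if k * bin_hz > split_hz)
    return low, high

def best_interval(low_powers, max_lag):
    """Lag in frames (1..max_lag) with the largest autocorrelation of low-band power."""
    def r(x):
        return sum(a * b for a, b in zip(low_powers, low_powers[x:]))
    return max(range(1, max_lag + 1), key=r)

def is_rhythmic_music(frames, bin_hz, split_hz, ratio_threshold,
                      interval_range, max_lag):
    """Judge music when low-band power dominates enough and repeats at a plausible interval."""
    bands = [band_powers(f, bin_hz, split_hz) for f in frames]
    low_seq = [low for low, _ in bands]
    high_total = sum(high for _, high in bands)
    if high_total == 0:
        return False  # ratio undefined; treated as non-music in this sketch
    interval = best_interval(low_seq, max_lag)
    shortest, longest = interval_range
    ratio = sum(low_seq) / high_total
    return ratio >= ratio_threshold and shortest <= interval <= longest
```

A bass drum struck every few frames produces a periodic low-band power sequence, so the strongest lag falls inside the accepted range; a steady low-frequency hum has its strongest lag at one frame and is rejected.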

A speech detection device comprising:
fundamental frequency extraction means for extracting, from an acoustic signal, a fundamental frequency for each frame;
fundamental frequency change detection means for detecting a change in the fundamental frequency over a predetermined number of consecutive frames; and
speech determination means for determining that the acoustic signal is speech when the fundamental frequency change detection means detects that the fundamental frequency is changing monotonically, changing from a monotonic change to a constant frequency, or changing from a constant frequency to a monotonic change, and when the fundamental frequency changes within a predetermined frequency range and the width of the change is smaller than a predetermined frequency width.

The speech detection device according to claim 9, wherein the speech determination means determines that the acoustic signal is speech when the frequency changes within a range of approximately 100 Hz to approximately 400 Hz and the width of the change is less than approximately 120 Hz.
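The speech criterion can be illustrated with the following sketch, which operates on a short track of per-frame fundamental frequencies in Hz. The 100 Hz, 400 Hz and 120 Hz defaults follow the approximate figures stated in the claim; the tolerance `tol_hz` for treating a step as "constant", and the function names, are assumptions introduced for the example.

```python
def trend_pattern_ok(f0, tol_hz=1.0):
    """True for monotonic, monotonic-then-flat, or flat-then-monotonic tracks."""
    runs = []  # run-length-compressed step directions: 1 up, -1 down, 0 flat
    for a, b in zip(f0, f0[1:]):
        d = b - a
        s = 0 if abs(d) <= tol_hz else (1 if d > 0 else -1)
        if not runs or runs[-1] != s:
            runs.append(s)
    moving = sum(1 for s in runs if s != 0)
    # one moving segment, optionally joined to a single flat segment
    return moving == 1 and len(runs) <= 2

def is_speech(f0, low_hz=100.0, high_hz=400.0, max_width_hz=120.0, tol_hz=1.0):
    """Judge speech from the range, width and shape of the F0 track."""
    in_range = all(low_hz <= f <= high_hz for f in f0)
    narrow = (max(f0) - min(f0)) < max_width_hz
    return in_range and narrow and trend_pattern_ok(f0, tol_hz)
```

A gently falling pitch that settles on a level passes, whereas a stepwise melody alternating between held notes and jumps fails the shape test.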

A sound field control device that switches the state of sound field control according to the number of times the acoustic signal is determined to be music within a predetermined period by the music detection device according to claim 1, and the number of times the acoustic signal is determined to be speech within a predetermined period by the speech detection device according to claim 9 or 10.

The sound field control device according to claim 11, wherein the condition for switching the sound field control is changed according to the state currently being controlled.
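One way to picture count-based switching with state-dependent conditions is the sketch below. The actual states and transition conditions appear only in the drawings and are not reproduced in this text, so the three state names and the `enter` and `leave` counts here are purely hypothetical; the example shows only the general idea that a mode is harder to enter than to stay in.

```python
def next_state(state, music_count, speech_count, enter=8, leave=3):
    """Pick the next sound field mode from detection counts in the last period.

    Hypothetical rule: stay in the current mode while its count is at least
    `leave`; otherwise enter a mode only when its count reaches `enter`.
    """
    if state == "music" and music_count >= leave:
        return "music"
    if state == "speech" and speech_count >= leave:
        return "speech"
    if music_count >= enter and music_count >= speech_count:
        return "music"
    if speech_count >= enter:
        return "speech"
    return "neutral"
```

With these example values, five music detections in a period are enough to remain in the music mode but not enough to enter it from the neutral state.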