JP2016162411A

JP2016162411A - Sound source search device and sound source search method

Info

Publication number: JP2016162411A
Application number: JP2015043586A
Authority: JP
Inventors: Toru Takahashi (高橋 徹)
Original assignee: Osaka Sangyo University
Current assignee: Osaka Sangyo University
Priority date: 2015-03-05
Filing date: 2015-03-05
Publication date: 2016-09-05
Anticipated expiration: 2035-03-05
Also published as: JP6588212B2

Abstract

PROBLEM TO BE SOLVED: To provide a sound source search device that uses mixed sound as a search key and offers high search speed and improved search accuracy. SOLUTION: A sound source search device 100 includes: characteristic quantity extraction means 10 that extracts a chroma spectrum from mixed sound comprising a music signal and a voice signal as a characteristic quantity; binarization means 20 that binarizes the extracted characteristic quantity by defining characteristic quantity values no less than an average value as "1" and characteristic quantity values less than the average value as "0"; and search means 30 that searches for a sound source in a sound source database 50 using the characteristic quantity of the mixed sound binarized by the binarization means 20 as a search key. SELECTED DRAWING: Figure 1

Description

The present invention relates to a sound source search device and a sound source search method.

Conventionally, sound source search devices are known (see, for example, Non-Patent Document 1).

Non-Patent Document 1 discloses a sound source search device that uses a mixed sound as a search key. This device extracts a feature amount from the mixed sound and searches a sound source database for a sound source using the extracted feature amount as the search key. A chroma spectrum is used as the feature amount. The chroma spectrum is obtained by calculating the Fourier spectrum of each analysis frame of the mixed sound (signal), each frame having a predetermined time length, and then calculating the output energy of each band window. Each element of the chroma spectrum is a scalar quantity (for example, a 32-bit single-precision floating-point number). The sound source is found by pattern matching between the feature vector of the mixed sound and the feature vectors of the sound sources in the sound source database, using the Euclidean distance between the feature vectors.

Non-Patent Document 1: "Evaluation of sound source retrieval system from mixed sound using cumulative distance between features", IEICE Technical Report, vol. 114, no. 191, pp. 19-24.

However, the sound source search device described in Non-Patent Document 1, which uses the chroma spectrum as the feature amount, has the problem that search accuracy is lowered by, for example, changes in the sound pressure of the music signal. In addition, because the chroma spectrum used as the feature amount has scalar-valued elements, pattern matching (search) between the feature vector of the mixed sound and the feature vectors of the sound sources in the sound source database takes time.

The present invention has been made to solve the above problems, and one object of the present invention is to provide a sound source search device and a sound source search method that use a mixed sound as a search key and that can increase search speed and improve search accuracy.

In order to achieve the above object, a sound source search device according to a first aspect of the present invention includes: search-device-side feature amount extraction means that extracts a feature amount from a mixed sound including a music signal and a voice signal; search-device-side binarization means that binarizes the extracted feature amount; and search means that searches a sound source database for a sound source using, as a search key, the feature amount of the mixed sound binarized by the search-device-side binarization means. The feature amount of the mixed sound is a chroma spectrum, which is the output energy of each band window calculated based on the Fourier spectrum of the mixed sound, and is extracted for each analysis frame, or for each plurality of analysis frames, having a predetermined time length. The search-device-side binarization means is configured to set the feature amount to 1 when the feature amount of the mixed sound is equal to or greater than a predetermined reference value of the feature amount for each analysis frame or each plurality of analysis frames, and to set the feature amount to 0 when it is less than the predetermined reference value.

In the sound source search device according to the first aspect of the present invention, providing the search-device-side binarization means that binarizes the extracted feature amount, as described above, reduces each element of the feature amount to 1 bit. Compared with searching the sound source database using a feature amount whose elements are scalar quantities (for example, 32-bit single-precision floating-point numbers) as the search key, the data to be compared is smaller, so the search speed can be increased.

When a voice signal is mixed with a music signal, the envelope (shape) of the music signal changes by the amount of the mixed voice signal. In the present invention, because the search-device-side binarization means binarizes the extracted feature amount, a portion of the binarized feature amount that is "1" remains "1" as long as the voice signal acts additively. A portion that is "0" also remains "0" when the voice signal acts additively, unless the reference value for binarization is exceeded. Near the binarization threshold, mixing in the voice signal may flip a binarized value between "0" and "1". However, the frequencies at which the voice signal has large output energy (the frequencies at which a flip can occur) are confined to a relatively small range near integer multiples of the fundamental frequency, so the influence of such flips is considered small. As a result, the binarized feature amount represents the envelope (shape) of the music signal while being robust against the mixed voice signal.
In addition, even when the sound pressure of the music signal changes, the predetermined reference value can be changed along with the volume of the mixed sound, so a change in the binarized feature amount (that is, in the decision whether a value is "0" or "1") is prevented. This has been confirmed by the inventor's experiments. Consequently, the search speed can be increased and the search accuracy can be improved.
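The volume invariance described above can be illustrated with a minimal sketch. It assumes the reference value is the per-frame mean (as in the embodiment described later); the function name `binarize` and the four-band energy values are hypothetical and are not taken from the patent.

```python
def binarize(chroma):
    """Set each element to 1 if it is >= the frame mean, else 0."""
    mean = sum(chroma) / len(chroma)
    return [1 if value >= mean else 0 for value in chroma]


# Hypothetical band energies for one analysis frame (not from the patent).
quiet = [3.0, 8.0, 1.0, 6.0]
# The same frame with 10 dB more power: every energy is multiplied by 10.
loud = [10.0 * value for value in quiet]

quiet_bits = binarize(quiet)
loud_bits = binarize(loud)
print(quiet_bits, loud_bits)
```

Because the mean scales by the same gain as every element, the comparison against the mean gives the same bits for both frames.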

In the sound source search device according to the first aspect, preferably, the sound source database includes: database-side feature amount extraction means that extracts a feature amount from a database music signal; database-side binarization means that binarizes the extracted feature amount; and construction means that constructs the sound source database from the feature amount of the database music signal binarized by the database-side binarization means. The feature amount of the database music signal is a chroma spectrum, which is the output energy of each band window calculated based on the Fourier spectrum of the database music signal, and is extracted for each analysis frame, or for each plurality of analysis frames, having a predetermined time length. The database-side binarization means is configured to set the feature amount to 1 when the feature amount of the database music signal is equal to or greater than a predetermined reference value of the feature amount for each analysis frame or each plurality of analysis frames, and to set the feature amount to 0 when it is less than the predetermined reference value.
With this configuration, the feature amounts in the sound source database are binarized, so each element is reduced to 1 bit. Compared with constructing the sound source database from feature amounts whose elements are scalar quantities (for example, 32-bit single-precision floating-point numbers), the database size can be reduced. As a result, the search speed can be increased.
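The size reduction can be made concrete with a short sketch. It assumes the 72-dimensional feature vector of the embodiment and, as an illustrative choice not stated in the patent, stores each binarized vector as one packed integer and compares vectors by counting differing bits; for 0/1 vectors this count equals the squared Euclidean distance. The names `pack_bits` and `bit_distance` are hypothetical.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one int, first element as the highest bit."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def bit_distance(a, b):
    """Number of differing bits between two packed vectors."""
    return bin(a ^ b).count("1")


DIMENSIONS = 72                      # 12 band windows x 6 octaves
float_bytes = DIMENSIONS * 32 // 8   # 32-bit elements: 288 bytes per frame
binary_bytes = DIMENSIONS * 1 // 8   # 1-bit elements: 9 bytes per frame

key = pack_bits([1, 0, 1, 1])
entry = pack_bits([0, 0, 1, 0])
distance = bit_distance(key, entry)
print(float_bytes, binary_bytes, distance)
```

Under these assumptions one frame shrinks by a factor of 32, and a whole-vector comparison becomes a single XOR followed by a bit count.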

A sound source search method according to a second aspect of the present invention includes: a step of extracting a feature amount from a mixed sound including a music signal and a voice signal; a step of binarizing the extracted feature amount; and a step of searching a sound source database for a sound source using the binarized feature amount of the mixed sound as a search key. The feature amount of the mixed sound is extracted for each analysis frame, or for each plurality of analysis frames, having a predetermined time length. The step of extracting the feature amount includes a step of extracting, as the feature amount, a chroma spectrum, which is the output energy of each band window calculated based on the Fourier spectrum of the mixed sound. The step of binarizing the extracted feature amount includes a step of setting the feature amount to 1 when the feature amount of the mixed sound is equal to or greater than a predetermined reference value of the feature amount for each analysis frame or each plurality of analysis frames, and setting the feature amount to 0 when it is less than the predetermined reference value.

In the sound source search method according to the second aspect of the present invention, including the step of binarizing the extracted feature amount, as described above, reduces each element of the feature amount to 1 bit, compared with searching the sound source database using a feature amount whose elements are scalar quantities (for example, 32-bit single-precision floating-point numbers) as the search key. In addition, the binarized feature amount represents the envelope (shape) of the music signal while being robust against the mixed voice signal and invariant to changes in the sound pressure of the music signal. A sound source search method that can increase search speed and improve search accuracy can therefore be provided.

According to the present invention, as described above, search speed can be increased and search accuracy can be improved in a sound source search device and a sound source search method that use a mixed sound as a search key.

FIG. 1 is a block diagram of a sound source search device according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of the waveform of a mixed sound.
FIG. 3 is a schematic diagram showing the Fourier spectrum of the mixed sound of FIG. 2.
FIG. 4 is a schematic diagram showing the chroma spectrum obtained from the Fourier spectrum of the mixed sound of FIG. 3.
FIG. 5 is a block diagram of a sound source database according to an embodiment of the present invention.
FIG. 6 is a flowchart of a method of constructing the sound source database according to an embodiment of the present invention.
FIG. 7 is a flowchart of a sound source search method according to an embodiment of the present invention.
FIG. 8 is a diagram showing the frequency of distances between the feature amount of a mixed sound and the feature amounts of the music stored in the sound source database.
FIG. 9 is a diagram showing an unbinarized chroma spectrum for one piece of music.
FIG. 10 is a diagram showing an unbinarized chroma spectrum for one piece of music that is 10 dB larger than in FIG. 9.
FIG. 11 is a diagram showing a binarized chroma spectrum for one piece of music.
FIG. 12 is a diagram showing a binarized chroma spectrum for one piece of music that is 10 dB larger than in FIG. 11.
FIG. 13 is a diagram showing the search results (F value) of a sound source search device according to a comparative example.
FIG. 14 is a diagram showing the search results (F value) of the sound source search device according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

[Configuration of sound source search device]
The configuration of the sound source search device 100 according to the present embodiment is described with reference to FIGS. 1 to 4. The sound source search device 100 is configured to search a sound source database 50, described later, for a sound source that constitutes the mixed sound.

As shown in FIG. 1, the sound source search device 100 according to the present embodiment includes feature amount extraction means 10, binarization means 20, and search means 30. In the present embodiment, the mixed sound is generated by mixing means 40 in order to evaluate the search performance of the sound source search device 100. The feature amount extraction means 10, the binarization means 20, the search means 30, and the mixing means 40 are implemented by, for example, a control unit such as a CPU (Central Processing Unit). The feature amount extraction means 10 and the binarization means 20 are examples of the "search-device-side feature amount extraction means" and the "search-device-side binarization means" of the present invention, respectively. A specific description follows.

(Mixing means)
The mixing means 40 is configured to generate a mixed sound by mixing (editing) a music signal and a voice signal. Here, "music" covers both instrumental performance alone and instrumental performance accompanied by singing. "Voice" means speech that is not noise. For example, a mixed sound consists of the narration in a television program and the BGM played behind it.

The mixed sound is a sound in which a plurality of sound sources are weighted and added at an arbitrary ratio. Let k(t) be the time waveform of the mixed sound and let J sound sources s_j(t) be weighted by w_j. The mixed sound is then expressed by the following equation (1).

$$k(t) = \sum_{j=1}^{J} w_j \, s_j(t) \tag{1}$$

Here, j = 1, ..., J, and t represents time. In its general form, sound source search uses k(t) as the search key to retrieve the J sound sources s_1(t), ..., s_J(t) from the sound source database 50. In the present embodiment, J = 2 and the weights w are arbitrary, so equation (1) becomes the following equation (2).

$$k(t) = w_1 \, s_1(t) + w_2 \, s_2(t) \tag{2}$$

The two sound sources s_1(t) and s_2(t) are the music signal s_1(t) and the voice signal s_2(t). The sound source search device 100 of the present embodiment is thus configured to search the sound source database 50 for the feature amount of s_1(t), using the feature amount of the mixed sound k(t) as the search key.
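The weighted addition that produces the mixed sound can be sketched as follows. This is a minimal illustration that assumes the sources are equal-length lists of samples; the function name `mix`, the sample values, and the weights are hypothetical.

```python
def mix(sources, weights):
    """Weighted sum k(t) = sum_j w_j * s_j(t) over equal-length sample lists."""
    if len(sources) != len(weights):
        raise ValueError("one weight per source is required")
    length = len(sources[0])
    if any(len(source) != length for source in sources):
        raise ValueError("all sources must have the same number of samples")
    return [
        sum(weight * source[t] for weight, source in zip(weights, sources))
        for t in range(length)
    ]


# Hypothetical samples: s_1(t) is the music signal, s_2(t) the voice signal.
music = [0.0, 0.5, 1.0, 0.5]
voice = [0.2, 0.2, -0.2, -0.2]
mixed = mix([music, voice], [1.0, 0.5])   # J = 2, w_1 = 1.0, w_2 = 0.5
print(mixed)
```

With J = 2 the call reduces to the two-term sum of the embodiment.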

(Feature amount extraction means)
As shown in FIG. 1, the feature amount extraction means 10 is configured to receive the mixed sound, including the music signal and the voice signal, generated by the mixing means 40. The feature amount extraction means 10 is configured to extract a feature amount from this mixed sound. Specifically, the feature amount of the mixed sound is a chroma spectrum, which is the output energy of each band window calculated based on the Fourier spectrum of the mixed sound. The extraction of the feature amount of the mixed sound is described in detail below.

Let k(n) be the feature vector of the mixed sound, s_1(n) the feature vector of the music, and s_2(n) the feature vector of the voice, where n is the analysis frame number (analysis frames are described later). Each vector is D-dimensional, and the feature vector of the mixed sound is k(n) = [k(n, 1), k(n, 2), ..., k(n, D)]^T, where ^T denotes the transpose of a vector. s_1(n) and s_2(n) are expressed in the same way.

(Analysis frame)
Next, the analysis frame is described with reference to FIG. 2. In FIG. 2, the horizontal axis represents time (t) and the vertical axis represents the amplitude of the mixed sound. The waveform of the mixed sound in FIG. 2 is cut out in segments of a predetermined time length T (for example, 1 s). Specifically, an analysis frame of length T starting at a time t1 is extracted by applying a window (for example, a Hamming window). Next, an analysis frame of length T starting at a time t2, which is a fixed time (the frame shift length) after time t1, is extracted by applying the Hamming window. The waveform of the mixed sound is extracted frame by frame in the same way over the entire mixed sound.
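The framing step can be sketched as below. This is only an illustration: the frame length and frame shift are given in samples and are hypothetical (the text gives T = 1 s as an example and does not state the frame shift length), and the window uses the common Hamming coefficients 0.54 and 0.46.

```python
import math


def hamming_window(length):
    """Symmetric Hamming window with the common 0.54 / 0.46 coefficients."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
        for i in range(length)
    ]


def split_into_frames(signal, frame_length, frame_shift):
    """Cut the signal into windowed analysis frames, frame_shift samples apart."""
    window = hamming_window(frame_length)
    frames = []
    for start in range(0, len(signal) - frame_length + 1, frame_shift):
        frames.append([signal[start + i] * window[i] for i in range(frame_length)])
    return frames


# Hypothetical constant signal: 10 samples, frames of 4 samples, shift of 2.
signal = [1.0] * 10
frames = split_into_frames(signal, frame_length=4, frame_shift=2)
print(len(frames), frames[0])
```

A trailing segment shorter than one frame is dropped in this sketch; padding it instead would be an equally reasonable choice.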

(Chroma spectrum)
Next, the chroma spectrum is described with reference to FIGS. 3 and 4. In FIG. 3, the horizontal axis represents frequency and the vertical axis represents the amplitude of the Fourier spectrum. A Fourier spectrum (solid line in FIG. 3) is calculated for the mixed sound (see FIG. 2) extracted for each analysis frame of the predetermined time length T. The output energy of each band window (the chroma spectrum) is then calculated from the Fourier spectrum. Specifically, band windows corresponding to the keys of a piano (the regions enclosed by dotted triangles in FIG. 3; a filter bank) are set. The higher the frequency, the wider the triangle of the band window. Integrating the Fourier spectrum contained in each band window gives the output energy (chroma spectrum) shown in FIG. 4. In FIG. 4, the horizontal axis represents frequency and the vertical axis represents the magnitude of the chroma spectrum.

Here, one octave contains twelve band windows (A, A#, B, C, C#, D, D#, E, F, F#, G, G#), corresponding to the keys of a piano. In the present embodiment, the chroma spectrum (feature amount) is calculated for six octaves of band windows (72 = 12 × 6). The feature vector k(n) of the mixed sound therefore has D = 72 dimensions.
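A triangular band-window bank of this kind can be sketched as follows. Several details are assumptions made only for illustration and are not stated in the text: the lowest band is centred on A at 55 Hz, each triangle spans from the neighbouring lower semitone to the neighbouring higher one, and the input is a precomputed power spectrum with evenly spaced bins. The function names are hypothetical.

```python
def band_centers(lowest_hz=55.0, octaves=6, per_octave=12):
    """Centre frequencies one semitone apart: 12 per octave, 72 in total."""
    return [lowest_hz * 2.0 ** (d / per_octave) for d in range(octaves * per_octave)]


def triangular_weight(freq, left, center, right):
    """Weight of a triangular band window that peaks at `center`."""
    if left < freq <= center:
        return (freq - left) / (center - left)
    if center < freq < right:
        return (right - freq) / (right - center)
    return 0.0


def chroma_spectrum(power, bin_hz, lowest_hz=55.0):
    """Output energy of each band window for one frame's power spectrum."""
    semitone = 2.0 ** (1.0 / 12.0)
    energies = []
    for center in band_centers(lowest_hz):
        left, right = center / semitone, center * semitone
        energies.append(sum(
            value * triangular_weight(index * bin_hz, left, center, right)
            for index, value in enumerate(power)
        ))
    return energies


# Hypothetical power spectrum with 1 Hz bins and a single peak at 440 Hz.
power = [0.0] * 1000
power[440] = 1.0
energies = chroma_spectrum(power, bin_hz=1.0)
print(len(energies), energies.index(max(energies)))
```

Because the triangle edges sit at neighbouring semitone frequencies, the windows widen in proportion to frequency, which matches the widening described for FIG. 3.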

(Binarization means)
In the present embodiment, the binarization means 20 is configured to binarize the extracted feature amount. Specifically, as shown in FIG. 4, the binarization means 20 sets the feature amount (chroma spectrum) of the mixed sound to 1 when it is equal to or greater than a predetermined reference value of the feature amount for each analysis frame (specifically, the average value; see the dotted line in FIG. 4), and sets it to 0 when it is less than that reference value. That is, the binarization means 20 is configured to binarize a chroma spectrum whose elements are scalar quantities.

Specifically, let the chroma spectrum at time t be c(t) = [c_1(t), c_2(t), ..., c_D(t)]^T, where D is the number of dimensions of the vector and ^T denotes the transpose of a vector. The binarized chroma spectrum b(t) = [b_1(t), b_2(t), ..., b_D(t)]^T is then expressed by the following equation (3), in which the reference value \bar{c}(t) is the average of the elements of c(t).

$$b_d(t) = \begin{cases} 1 & \text{if } c_d(t) \ge \bar{c}(t) \\ 0 & \text{otherwise} \end{cases}, \qquad \bar{c}(t) = \frac{1}{D} \sum_{d=1}^{D} c_d(t), \qquad d = 1, \dots, D \tag{3}$$
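As a worked instance of the binarization rule, consider a hypothetical frame with D = 4 (the embodiment uses D = 72). The numbers are illustrative only; the reference value is taken to be the frame average, written \bar{c}(t) here.

```latex
\begin{align*}
c(t) &= [1,\ 4,\ 2,\ 9]^{T}, \\
\bar{c}(t) &= \tfrac{1}{4}\,(1 + 4 + 2 + 9) = 4, \\
b(t) &= [0,\ 1,\ 0,\ 1]^{T}.
\end{align*}
```

The second element equals the average and therefore maps to 1, because the rule is "equal to or greater than" the reference value.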

(Search means)
The search means 30 is configured to search the sound source database 50 for a sound source using, as a search key, the feature amount of the mixed sound binarized by the binarization means 20. Specifically, the search means 30 uses the binarized feature amounts of the mixed sound corresponding to a plurality (P) of analysis frames (cumulative analysis frames) as the search key. That is, the sound source search reduces to a similar-pattern search problem whose search key is a sequence of P feature vectors. Specifically, the feature vectors of the n-th to (n+P-1)-th analysis frames are expressed by the following equations (4) to (6).

$$K(n) = [\,k(n),\ k(n+1),\ \dots,\ k(n+P-1)\,] \tag{4}$$

$$S_1(n) = [\,s_1(n),\ s_1(n+1),\ \dots,\ s_1(n+P-1)\,] \tag{5}$$

$$S_2(n) = [\,s_2(n),\ s_2(n+1),\ \dots,\ s_2(n+P-1)\,] \tag{6}$$

Here, assuming that there are V music signals and W voice signals, let S_{1,v}(n) and S_{2,w}(m) be the feature amounts of the v-th music signal and the w-th voice signal. If K_{v,w}(n) is the feature amount of the mixed sound obtained by mixing the v-th music signal and the w-th voice signal, the search is the problem of estimating the music number v^* and the analysis frame number n^* from K_{v,w}(n) under the condition that v, w, n, and m are unknown. Writing the search process as search, the search is expressed by the following equation (7).

$$[\,v^{*},\ n^{*}\,] = \mathrm{search}\bigl(K_{v,w}(n)\bigr) \tag{7}$$

That is, the search process search determines one pair that matches the search (the pair with the smallest distance between feature vectors). In other words, the search result is the case in which the distance between the feature vectors of the search key and the feature vectors of the search target is smallest.
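The minimum-distance search over P consecutive frames can be sketched as an exhaustive scan. The sketch assumes that the distance between binary vectors is the number of differing elements, summed over the P frames; the function names, the dictionary layout of the database, and the three-band toy data are hypothetical.

```python
def frame_distance(a, b):
    """Number of positions at which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def search(key, database):
    """Return (song_id, start_frame, distance) with the smallest summed distance.

    key is a list of P binary vectors; database maps a song id to that
    song's list of binary vectors, one per analysis frame.
    """
    best = None
    frames_in_key = len(key)
    for song_id, frames in database.items():
        for start in range(len(frames) - frames_in_key + 1):
            total = sum(
                frame_distance(key[i], frames[start + i])
                for i in range(frames_in_key)
            )
            if best is None or total < best[2]:
                best = (song_id, start, total)
    return best


# Hypothetical binarized database with three bands per frame.
database = {
    "song_a": [[0, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]],
    "song_b": [[1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0]],
}
# Key of P = 2 frames: frames 1 and 2 of song_a with one bit flipped.
key = [[0, 1, 1], [1, 1, 1]]
result = search(key, database)
print(result)
```

The scan visits every song and every start frame, so its cost grows with the total number of frames in the database; the 1-bit elements keep each comparison cheap.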

(Average false rejection rate and average false detection rate)
Search performance is evaluated by two indices: false rejection (Miss) and false detection (False Alarm). A false rejection is the case in which the search result does not include the [v, n] corresponding to the music signal that constitutes the mixed sound. A false detection is the case in which the search result includes a [v, n] other than that of the music signal constituting the mixed sound (that is, music unrelated to the search key).

The average false rejection rate and the average false detection rate are described for an example of Q searches. Let the q-th search key be K_{v(q), w(q)}(n^(q)), let the resulting set be φ^(q), and let the constituent sound source of K_{v(q), w(q)}(n^(q)) be S_{1, v'(q)}(n'^(q)). The average false rejection rate is then expressed by the following equation (8).

The average false detection rate is expressed by the following equation (9).

Here, I represents the total number of pairs [v, n] that can occur in the sound source database 50. φ^(q) \ [v'^(q), n'^(q)] denotes removing the element [v'^(q), n'^(q)] from the set φ^(q), and | | denotes the number of elements of a set. For both the average false rejection rate and the average false detection rate, a smaller value indicates higher search performance.
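The exact forms of equations (8) and (9) are not reproduced in this text, so the sketch below is only one plausible reading of the definitions above and should be treated as an assumption: a query counts as a miss when the true pair is absent from the result set, and the false detection rate of a query is the number of wrong pairs divided by I. The function name and the data are hypothetical.

```python
def average_error_rates(results, truths, total_items):
    """Return (miss_rate, false_alarm_rate) averaged over all queries.

    results[q] is the set of (v, n) pairs returned for query q, truths[q]
    is the true (v, n) pair, and total_items plays the role of I.
    Normalizing by I is an assumption, not taken from the patent.
    """
    queries = len(results)
    misses = 0
    false_alarm_sum = 0.0
    for found, truth in zip(results, truths):
        if truth not in found:
            misses += 1
        wrong = found - {truth}          # the set phi^(q) without the true pair
        false_alarm_sum += len(wrong) / total_items
    return misses / queries, false_alarm_sum / queries


# Hypothetical outcome of Q = 3 searches over I = 4 possible pairs.
results = [{("a", 1)}, {("a", 1), ("b", 0)}, {("b", 2)}]
truths = [("a", 1), ("a", 1), ("a", 3)]
miss_rate, false_alarm_rate = average_error_rates(results, truths, total_items=4)
print(miss_rate, false_alarm_rate)
```

Both values fall as the search improves, in line with the statement that smaller rates indicate higher search performance.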

(Sound source database)
The sound source database 50 stores a plurality of pieces of music. Specifically, as with the mixed sound described above, the feature amounts of the plurality of pieces of music are stored in the sound source database 50 in binarized form.

In detail, as shown in FIG. 5, in the present embodiment the sound source database 50 includes feature amount extraction means 51 that extracts a feature amount from a database music signal, binarization means 52 that binarizes the extracted feature amount, and construction means 53 that constructs the sound source database 50 from the feature amount of the database music signal binarized by the binarization means 52. The feature amount extraction means 51, the binarization means 52, and the construction means 53 are implemented by, for example, a control unit such as a CPU (Central Processing Unit). The feature amount extraction means 51 and the binarization means 52 are examples of the "database-side feature amount extraction means" and the "database-side binarization means" of the present invention, respectively.

Here, the feature amount of the database music signal is a chroma spectrum, which is the output energy of each band window calculated based on the Fourier spectrum of the database music signal. The feature amount of the database music signal is extracted for each analysis frame having a predetermined time length. The binarization means 52 is configured to set the feature amount of the database music signal to 1 when it is equal to or greater than a predetermined reference value of the feature amount for each analysis frame (specifically, the average value), and to 0 when it is less than the predetermined reference value. The details of extracting the feature amount from the database music signal and binarizing it are the same as for the sound source search device 100 described above.

［音源データベースの構築方法］ [Method of constructing the sound source database]
次に、図６を参照して、本実施形態による音源データベース５０の構築方法を説明する。 Next, the method of constructing the sound source database 50 according to the present embodiment will be described with reference to FIG. 6.

まず、ステップＳ１１において、特徴量抽出手段５１に入力されたデータベース用楽曲信号から特徴量が抽出される。具体的には、１分析フレーム毎に混合音のフーリエスペクトルが算出された後、各帯域窓の出力エネルギであるクロマスペクトルが算出される。 First, in step S11, a feature amount is extracted from the database music signal input to the feature amount extraction means 51. Specifically, after the Fourier spectrum of the mixed sound is calculated for each analysis frame, the chroma spectrum that is the output energy of each band window is calculated.

次に、ステップＳ１２において、２値化手段５２により、抽出した特徴量が、上記式（３）に基づいて２値化される。そして、ステップＳ１３において、構築手段５３により、２値化されたデータベース用楽曲信号の特徴量から音源データベース５０が構築される。 Next, in step S12, the binarizing means 52 binarizes the extracted feature amount based on the above equation (3). In step S13, the construction unit 53 constructs the sound source database 50 from the binarized database music signal feature amount.

［音源検索方法］ [Sound source search method]
次に、図７を参照して、本実施形態による音源検索方法を説明する。 Next, the sound source search method according to the present embodiment will be described with reference to FIG. 7.

まず、ステップＳ１において、特徴量抽出手段１０に入力された楽曲信号と音声信号とを含む混合音から特徴量が抽出される。具体的には、１分析フレーム毎に混合音のフーリエスペクトルが算出された後、各帯域窓の出力エネルギであるクロマスペクトルが算出される。 First, in step S1, a feature value is extracted from a mixed sound including a music signal and a sound signal input to the feature value extraction unit 10. Specifically, after the Fourier spectrum of the mixed sound is calculated for each analysis frame, the chroma spectrum that is the output energy of each band window is calculated.

次に、ステップＳ２において、抽出した特徴量が、上記式（３）に基づいて２値化される。 Next, in step S2, the extracted feature quantity is binarized based on the above equation (3).

次に、ステップＳ３において、２値化された混合音の特徴量を検索キーとして、音源データベース５０から音源が検索される。具体的には、検索キーの２値化された特徴量ベクトルと、音源データベース５０に記憶されている楽曲の２値化された特徴量ベクトルとの間の距離が、最小の場合、混合音に合致する楽曲が検索（検出）されたと判断される。 Next, in step S3, a sound source is searched for in the sound source database 50 using the binarized feature amount of the mixed sound as a search key. Specifically, when the distance between the binarized feature vector of the search key and the binarized feature vector of a piece of music stored in the sound source database 50 is the minimum, it is determined that the music matching the mixed sound has been found (detected).
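The minimum-distance decision in step S3 can be sketched as follows. The text here does not name the distance used for binarized vectors, so the Hamming distance is an assumption; for 0/1 vectors it equals the squared Euclidean distance. The names `hamming` and `nearest` and the toy database are hypothetical.

```python
from typing import Dict, List, Tuple


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions at which two equal-length 0/1 vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest(key: List[int], database: Dict[str, List[int]]) -> Tuple[str, int]:
    """Return (name, distance) of the database entry closest to the key."""
    if not database:
        raise ValueError("database is empty")
    return min(
        ((name, hamming(key, vec)) for name, vec in database.items()),
        key=lambda item: item[1],
    )


# Toy 4-dimensional vectors; the embodiment uses 72 dimensions per frame.
toy_db = {"song_a": [1, 0, 1, 0], "song_b": [0, 0, 1, 1]}
result = nearest([1, 0, 0, 0], toy_db)
```

When several entries share the minimum distance, `min` returns the first one encountered; a real system would need an explicit tie-breaking policy.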

次に、ステップＳ４において、検索結果の評価が行われる。具体的には、上記の式（８）および式（９）により、平均誤棄却率および平均誤検出率が算出される。さらに、算出された平均誤棄却率および平均誤検出率から、Ｆ値（調和平均）が算出される。 Next, in step S4, the search result is evaluated. Specifically, the average false rejection rate and the average false detection rate are calculated by the above equations (8) and (9). Further, an F value (harmonic mean) is calculated from the calculated average false rejection rate and average false detection rate.

（クロマスペクトルが楽曲を表現するのに適しているか否かを確認する実験） (Experiment to confirm whether the chroma spectrum is suitable for representing music)
図８を参照して、クロマスペクトルが楽曲を表現するのに適しているか否かを確認するために行った実験について説明する。 With reference to FIG. 8, an experiment conducted to confirm whether or not the chroma spectrum is suitable for representing music will be described.

まず、７１曲分の楽曲のクロマスペクトルをデータベース化した。そして、７１曲分の楽曲のうち、１０秒間分の長さの楽曲のクロマスペクトルをランダムに５００個選択して、検索キーとした。そして、５００個の検索キーの特徴量ベクトルと、７１曲分の楽曲の特徴量ベクトルとの間の距離分布を作成した。図８では、横軸は距離を表し、縦軸は、頻度を表している。この実験では、距離が０になる箇所が５００箇所あることが確認された。すなわち、１つの検索キーに対して、距離が０になる箇所が１箇所であることが確認された。これにより、楽曲（１０秒間分の長さの楽曲）の構造に数値的な繰り返し（同じ特徴量の繰り返し）が存在しないことが確認された。すなわち、クロマスペクトルが楽曲を表現するのに適している（楽曲の特徴量として適している）ことが確認された。 First, the chroma spectra of 71 songs were compiled into a database. Then, among 71 music pieces, 500 chroma spectra of music pieces having a length of 10 seconds were randomly selected and used as search keys. And the distance distribution between the feature-value vector of 500 search keys and the feature-value vector of 71 music pieces was created. In FIG. 8, the horizontal axis represents distance, and the vertical axis represents frequency. In this experiment, it was confirmed that there are 500 places where the distance becomes zero. That is, it was confirmed that there is one place where the distance becomes 0 for one search key. Thus, it was confirmed that there was no numerical repetition (repetition of the same feature amount) in the structure of the music (music having a length of 10 seconds). That is, it was confirmed that the chroma spectrum is suitable for expressing music (suitable as a feature amount of music).

また、図８に示すように、頻度は、距離０から徐々に増加し、その後、頻度が急激に増加した後、頻度が急激に低下することが判明した。すなわち、頻度は、概ね１つの凸形状に形成されることが判明した。 Further, as shown in FIG. 8, it has been found that the frequency gradually increases from the distance 0, and then the frequency rapidly increases and then decreases rapidly. That is, it has been found that the frequency is generally formed in one convex shape.

（クロマスペクトルの２値化についての実験） (Experiment on binarization of the chroma spectrum)
次に、図９〜図１２を参照して、クロマスペクトルの２値化についての実験について説明する。なお、図９〜図１２では、クロマスペクトルの値が大きい部分ほど、色が濃くなるように表されている。 Next, with reference to FIGS. 9 to 12, an experiment on binarization of the chroma spectrum will be described. In FIGS. 9 to 12, the larger the chroma spectrum value, the darker the color.

図９は、２値化されていない１楽曲分のクロマスペクトルである。図１０は、図９に示された楽曲の信号波形のエネルギを相対的に１０ｄＢ高くした場合の、２値化されていない１楽曲分のクロマスペクトルである。図１０に示すように、楽曲の信号波形のエネルギを相対的に１０ｄＢ高くした場合では、全体的にクロマスペクトルの値が大きくなる（色が濃くなる）ことが判明した。すなわち、２値化されていないクロマスペクトル（特徴量）は、楽曲の信号波形のエネルギの変化に伴って変化することが確認された。 FIG. 9 is a chroma spectrum for one piece of music that is not binarized. FIG. 10 is a chroma spectrum of one piece of music that is not binarized when the energy of the signal waveform of the music shown in FIG. 9 is relatively increased by 10 dB. As shown in FIG. 10, when the energy of the signal waveform of the music is relatively increased by 10 dB, it has been found that the chroma spectrum value increases as a whole (the color becomes darker). That is, it has been confirmed that the chroma spectrum (feature amount) that has not been binarized changes as the energy of the signal waveform of the music changes.

図１１は、２値化された１楽曲分のクロマスペクトルである。図１２は、図１１に示された楽曲の信号波形のエネルギを相対的に１０ｄＢ高くした場合の、２値化された１楽曲分のクロマスペクトルである。図１２に示すように、楽曲の信号波形のエネルギを相対的に１０ｄＢ高くした場合でも、２値化された１楽曲分のクロマスペクトルは、パターンが完全に一致することが判明した。すなわち、２値化されたクロマスペクトル（特徴量）は、楽曲の信号波形のエネルギの変化に対して不変であることが確認された。 FIG. 11 shows a binarized chroma spectrum for one piece of music. FIG. 12 is a binarized chroma spectrum for one music when the energy of the signal waveform of the music shown in FIG. 11 is relatively increased by 10 dB. As shown in FIG. 12, it was found that even when the energy of the signal waveform of the music was relatively increased by 10 dB, the binarized chroma spectra for one music completely matched the pattern. That is, it was confirmed that the binarized chroma spectrum (feature amount) is invariant to the change in energy of the signal waveform of the music.
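The invariance observed in FIGS. 11 and 12 follows from the fact that multiplying every band energy by the same positive factor multiplies the frame mean by that factor too, so each comparison against the mean is unchanged. The sketch below illustrates this, assuming a 10 dB energy increase corresponds to multiplying each chroma element by 10; the helper names and sample values are hypothetical.

```python
from typing import List, Sequence


def binarize_frame(chroma: Sequence[float]) -> List[int]:
    """1 where the element is at or above the frame mean, else 0."""
    mean = sum(chroma) / len(chroma)
    return [1 if value >= mean else 0 for value in chroma]


def apply_gain_db(chroma: Sequence[float], gain_db: float) -> List[float]:
    """Scale band energies by a gain given in dB (energy ratio)."""
    factor = 10.0 ** (gain_db / 10.0)
    return [value * factor for value in chroma]


frame = [0.5, 3.0, 0.25, 1.0]        # illustrative energies only
louder = apply_gain_db(frame, 10.0)  # +10 dB -> every element x10

same_pattern = binarize_frame(frame) == binarize_frame(louder)
```

The raw values in `frame` and `louder` differ by a factor of 10, which is why an unbinarized feature changes with signal level. In floating-point arithmetic an element lying exactly at the mean could in principle flip due to rounding, so the invariance is exact only away from such ties.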

（音源検索の実験） (Sound source search experiment)
次に、図１３および図１４を参照して、本実施形態による音源検索装置１００による音源検索の実験について、比較例による音源検索装置と比較しながら説明する。 Next, with reference to FIGS. 13 and 14, a sound source search experiment using the sound source search device 100 according to the present embodiment will be described in comparison with a sound source search device according to a comparative example.

比較例による音源検索装置では、特徴量抽出手段と検索手段とを備えている一方、本実施形態による音源検索装置１００のように２値化手段２０は備えていない。すなわち、比較例による音源検索装置では、特徴量は、クロマスペクトルの値そのもの（単精度浮動小数点数、３２ｂｉｔ）である。つまり、特徴量の次元は、６オクターブ分の７２次元×３２ｂｉｔである。一方、本実施形態による音源検索装置１００では、特徴量（クロマスペクトル）が２値化されているので、特徴量の次元は、６オクターブ分の７２次元×１ｂｉｔである。 The sound source search device according to the comparative example includes feature amount extraction means and search means but, unlike the sound source search device 100 according to the present embodiment, does not include the binarization means 20. That is, in the sound source search device according to the comparative example, the feature amount is the chroma spectrum value itself (single-precision floating-point number, 32 bits), so the dimension of the feature amount is 72 dimensions × 32 bits for 6 octaves. In the sound source search device 100 according to the present embodiment, on the other hand, the feature amount (chroma spectrum) is binarized, so the dimension of the feature amount is 72 dimensions × 1 bit for 6 octaves.

図１３および図１４に示すように、音源検索の実験では、音声信号に対する楽曲信号の音圧の相対的な大きさ（音圧比）を、５ｄＢ小さくした混合音（混合比−５ｄＢ）と、互いに等しい混合音（混合比０ｄＢ）と、５ｄＢ大きくした混合音（混合比５ｄＢ）と、１０ｄＢ大きくした混合音（混合比１０ｄＢ）と、１５ｄＢ大きくした混合音（混合比１５ｄＢ）と、２０ｄＢ大きくした混合音（混合比２０ｄＢ）とを準備して、各々の混合音について、音源検索を実施するとともに、検索結果のＦ値を算出した。なお、たとえば、ナレーションの背景でＢＧＭが流れる場合の音圧比は、−５ｄＢ〜０ｄＢに相当する。 As shown in FIGS. 13 and 14, in the sound source search experiment, six mixed sounds were prepared that differ in the sound pressure of the music signal relative to the voice signal (sound pressure ratio): music 5 dB lower (mixing ratio -5 dB), music and voice equal (mixing ratio 0 dB), music 5 dB higher (mixing ratio 5 dB), 10 dB higher (mixing ratio 10 dB), 15 dB higher (mixing ratio 15 dB), and 20 dB higher (mixing ratio 20 dB). A sound source search was performed for each mixed sound, and the F value of the search result was calculated. For example, the sound pressure ratio when background music (BGM) plays behind narration corresponds to -5 dB to 0 dB.

また、音源検索の実験では、帯域窓（フィルタバンク）を、５５（Ｈｚ）〜３５２０（Ｈｚ）とする６オクターブ（７２バンク）により構成した。また、信号のサンプリングレートを、１６０００Ｈｚとした。また、分析フレーム長を、１６．３８４（ｓ）とし、フレームシフト長を、１／１６（ｓ）とした。また、音源データベース５０には、市販のＣＤの７２曲（約２００，０００分析フレーム）分の楽曲を記憶した。 In the sound source search experiment, the band window (filter bank) is composed of 6 octaves (72 banks) with 55 (Hz) to 3520 (Hz). The signal sampling rate was 16000 Hz. The analysis frame length was 16.384 (s) and the frame shift length was 1/16 (s). In the sound source database 50, music pieces for 72 music pieces (about 200,000 analysis frames) of a commercially available CD were stored.
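The filter bank layout can be illustrated with a short sketch. The text gives only the range (55 Hz to 3520 Hz), the span (6 octaves), and the number of banks (72), which implies 12 banks per octave. The equal-tempered spacing used below, with one band per semitone starting at 55 Hz, is an assumption; the constant and function names are hypothetical.

```python
from typing import List

BASE_HZ = 55.0          # lower edge stated in the text
BANDS_PER_OCTAVE = 12   # implied by 72 banks over 6 octaves
N_OCTAVES = 6


def band_frequencies() -> List[float]:
    """Lower reference frequency of each band (assumed semitone spacing)."""
    n_bands = BANDS_PER_OCTAVE * N_OCTAVES
    return [BASE_HZ * 2.0 ** (k / BANDS_PER_OCTAVE) for k in range(n_bands)]


freqs = band_frequencies()
upper_edge_hz = BASE_HZ * 2.0 ** N_OCTAVES  # 55 Hz doubled six times

# Frames per second implied by the stated frame shift of 1/16 s.
frames_per_second = round(1.0 / (1.0 / 16.0))
```

Doubling 55 Hz six times gives 3520 Hz, which matches the upper limit stated for the experiment, and a frame shift of 1/16 s gives 16 analysis frames per second, consistent with the 16 × 10 frames used for a 10 s key.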

そして、累積フレーム数として、１０（ｓ）区間（１６×１０分析フレーム）と、２（ｓ）区間（１６×２分析フレーム）とを採用した。そして、これらを特徴量（検索キー）として、音源データベース５０に記憶された約２００，０００通りの候補から、連続する１６×１０分析フレーム（１６×２分析フレーム）がマッチする時刻を検索した。 Then, as the cumulative number of frames, a 10 (s) section (16 × 10 analysis frames) and a 2 (s) section (16 × 2 analysis frames) were adopted. Then, using these as feature quantities (search keys), the time when successive 16 × 10 analysis frames (16 × 2 analysis frames) match was searched from about 200,000 candidates stored in the sound source database 50.
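Searching for the time at which a run of consecutive analysis frames matches can be sketched as a sliding-window comparison. The sketch assumes each binarized frame is packed into one Python integer and that the window distance is the sum of per-frame Hamming distances; both are illustrative choices, and the names `popcount` and `best_offset` are hypothetical.

```python
from typing import List, Optional, Tuple

FRAME_SHIFT_S = 1.0 / 16.0  # frame shift stated in the text


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def best_offset(key: List[int], track: List[int]) -> Tuple[int, int]:
    """Slide the key over the track; return (start_frame, distance) of the best fit."""
    n = len(key)
    if n == 0 or n > len(track):
        raise ValueError("key must be non-empty and no longer than the track")
    best_start = 0
    best_dist: Optional[int] = None
    for start in range(len(track) - n + 1):
        window = track[start:start + n]
        dist = sum(popcount(k ^ f) for k, f in zip(key, window))
        if best_dist is None or dist < best_dist:
            best_start, best_dist = start, dist
    assert best_dist is not None
    return best_start, best_dist


# Toy 4-bit frames; the embodiment uses 72 bits per frame and 160 or 32 frames per key.
track = [0b0000, 0b1010, 0b1100, 0b0110, 0b1111]
key = [0b1100, 0b0111]
start, dist = best_offset(key, track)
match_time_s = start * FRAME_SHIFT_S
```

XOR followed by a bit count compares all bands of a frame at once, which is one plausible reason a binarized key is cheaper to match than a floating-point one.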

楽曲信号に混合する音声信号は、ＪＮＡＳ（“ＪＮＡＳ：Ｊａｐａｎｅｓｅｓｐｅｅｃｈｃｏｒｐｕｓｆｏｒｌａｒｇｅｖｏｃａｂｕｌａｒｙｃｏｎｔｉｎｕｏｕｓｓｐｅｅｃｈｒｅｃｏｇｎｉｔｉｏｎｒｅｓｅａｒｃｈ”，Ｊ．Ａｃｏｕｓｔ．Ｓｏｃ．Ｊｐｎ（Ｅ）２０（３），ｐｐ．１９９−２０６，１９９９．）から、男女の発話を発話間ポーズを開けずに接続し準備した。そして、１つの楽曲に渡って発話がナレーションのようになるように混合した。 The voice signal mixed with the music signal was prepared from JNAS ("JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research", J. Acoust. Soc. Jpn. (E) 20(3), pp. 199-206, 1999) by concatenating male and female utterances without pauses between them. The speech was then mixed so that it runs like narration over an entire piece of music.

さらに、信号の振幅（相対値）を、１倍、０．５倍、２倍、０．１倍、１０倍にそれぞれ変化させた場合において、音源検索を実行した。 Furthermore, the sound source search was executed when the amplitude (relative value) of the signal was changed to 1 times, 0.5 times, 2 times, 0.1 times, and 10 times.

特徴量を２値化しない比較例による音源データベースのデータベースサイズ（３２、図１３参照）は、特徴量を２値化した本実施形態の音源データベース５０のデータベースサイズ（１、図１４参照）に比べて、３２倍の大きさになることが確認された。 It was confirmed that the database size of the sound source database according to the comparative example, in which the feature amount is not binarized (32; see FIG. 13), is 32 times the database size of the sound source database 50 of the present embodiment, in which the feature amount is binarized (1; see FIG. 14).
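The factor of 32 can be checked with simple arithmetic on the raw feature payload. The sketch uses 72 dimensions per frame and roughly 200,000 analysis frames, as stated for the experiment, and ignores any index or container overhead that a real database would add. The bit-packing helper `pack_bits` is hypothetical.

```python
from typing import List

DIMS = 72          # feature dimensions per analysis frame
FRAMES = 200_000   # approximate number of frames stated in the text

float_bytes_per_frame = DIMS * 32 // 8   # 32-bit float per element
binary_bytes_per_frame = DIMS * 1 // 8   # 1 bit per element

float_total = float_bytes_per_frame * FRAMES
binary_total = binary_bytes_per_frame * FRAMES
ratio = float_total // binary_total


def pack_bits(bits: List[int]) -> bytes:
    """Pack a list of 0/1 values into bytes, first element in the most significant bit."""
    value = 0
    for bit in bits:
        value = (value << 1) | (bit & 1)
    return value.to_bytes((len(bits) + 7) // 8, "big")


packed = pack_bits([1] * DIMS)
```

Under these assumptions one frame takes 288 bytes as floats and 9 bytes when binarized, so about 57.6 MB shrinks to about 1.8 MB for the whole database.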

また、比較例による音源検索装置では、相対処理時間（検索時間）が、２５０または１１３０であったのに対して、本実施形態による音源検索装置１００では、相対処理時間（検索時間）が、３４または１７０であった。これにより、特徴量の２値化を行うことにより、検索速度が高速化されることが確認された。 In the sound source search device according to the comparative example, the relative processing time (search time) is 250 or 1130, whereas in the sound source search device 100 according to the present embodiment, the relative processing time (search time) is 34. Or 170. Thus, it was confirmed that the search speed is increased by binarizing the feature amount.

また、比較例による音源検索装置では、信号の振幅（１倍、０．５倍、２倍、０．１倍、１０倍）の変化に対して、Ｆ値の値が著しく変化していることが判明した。一方、本実施形態による音源検索装置１００では、信号の振幅の変化に対して、Ｆ値の値が不変であることが判明した。これは、図１１および図１２に示すように、２値化されたクロマスペクトル（特徴量）は、楽曲の信号波形のエネルギの変化に対して不変であることから、このような結果が得られたと考えられる。 In the sound source search device according to the comparative example, the F value was found to change markedly with changes in the signal amplitude (1×, 0.5×, 2×, 0.1×, 10×). In the sound source search device 100 according to the present embodiment, by contrast, the F value was found to be invariant to changes in the signal amplitude. This result is considered to follow from the fact that, as shown in FIGS. 11 and 12, the binarized chroma spectrum (feature amount) is invariant to changes in the energy of the signal waveform of the music.

また、信号の振幅が１倍の場合には、比較例による音源検索装置によるＦ値の方が高くなる場合がある一方、信号の振幅が１倍以外の０．５倍、２倍、０．１倍、１０倍では、全ての場合において、本実施形態による音源検索装置１００によるＦ値の方が高くなることが判明した。これにより、全体として、特徴量の２値化を行うことにより、検索精度が向上することが確認された。 When the signal amplitude is 1×, the F value of the sound source search device according to the comparative example is higher in some cases. At the other amplitudes (0.5×, 2×, 0.1×, and 10×), however, the F value of the sound source search device 100 according to the present embodiment was found to be higher in all cases. This confirms that, overall, binarizing the feature amount improves the search accuracy.

[本実施形態の効果] [Effects of the present embodiment]
本実施形態では、以下のような効果を得ることができる。 In the present embodiment, the following effects can be obtained.

本実施形態では、上記のように、抽出した特徴量を２値化する２値化手段２０を備えることによって、スカラー量（たとえば、単精度浮動小数点数、３２ｂｉｔ）の要素を有する特徴量を検索キーとして音源データベース５０から音源を検索する場合と比べて、特徴量が２値化される分、次元が小さくなる（１ｂｉｔ）ので、検索速度を高速化させることができる。 In the present embodiment, as described above, the binarization means 20 that binarizes the extracted feature amount is provided. Compared with searching the sound source database 50 using, as the search key, a feature amount whose elements are scalar quantities (for example, single-precision floating-point numbers, 32 bits), the dimension is reduced (to 1 bit) because the feature amount is binarized, so the search speed can be increased.

また、楽曲信号に音声信号を混合した場合、音声信号が混合される分、楽曲信号の包絡（形状）が変化する。そこで、本実施形態では、上記のように、抽出した特徴量を２値化する２値化手段２０を備えることによって、２値化後の特徴量のうち、「１」の部分は、音声信号が加法的に作用している限り、「１」のままである。一方、２値化後の特徴量のうち、「０」の部分に音声信号が加法的に作用しても、２値化するための基準値を超えない限り「０」のままである。なお、２値化するためのしきい値近傍では、音声信号が混合されることにより、２値化後の特徴量の「０」または「１」が反転する場合がある一方、音声信号の出力エネルギが大きい周波数（反転する可能性がある周波数）は基本周波数の整数倍の周波数近傍のみの比較的小さい範囲であるため、反転による影響は小さいと考えられる。その結果、２値化された特徴量は、楽曲信号の包絡（形状）を表しながら、混合される音声信号に対して頑強な特徴量となる。また、楽曲信号の音圧の変化に対しても、混合音の音量の変化に伴って所定の基準値も変化させることが可能であるので、混合音の特徴量の変化（特徴量が「０」であるか、または、「１」であるかの判断の変化）が防止される。これらによって、検索速度を高速化させ、かつ、検索精度を向上させることができる。 When a voice signal is mixed with a music signal, the envelope (shape) of the music signal changes by the amount of the mixed voice signal. In the present embodiment, as described above, the binarization means 20 that binarizes the extracted feature amount is provided. Among the binarized feature amounts, a "1" portion remains "1" as long as the voice signal acts additively. A "0" portion also remains "0" even when the voice signal acts additively on it, unless the reference value for binarization is exceeded. Near the threshold for binarization, mixing in the voice signal may invert a binarized feature amount between "0" and "1"; however, the frequencies at which the output energy of the voice signal is large (the frequencies at which inversion can occur) are confined to a relatively narrow range near integer multiples of the fundamental frequency, so the effect of inversion is considered small. As a result, the binarized feature amount is robust against the mixed voice signal while still representing the envelope (shape) of the music signal. In addition, because the predetermined reference value can be changed along with the volume of the mixed sound, a change in the sound pressure of the music signal is prevented from changing the feature amount of the mixed sound (that is, the decision as to whether a feature amount is "0" or "1"). Together, these increase the search speed and improve the search accuracy.

また、本実施形態では、上記のように、混合音の特徴量を、所定の時間長さＴを有する１分析フレーム毎に抽出して、２値化手段２０を、混合音の特徴量が、１分析フレーム毎における特徴量の平均値以上の場合に特徴量を１とし、１分析フレーム毎における特徴量の平均値未満の場合に特徴量を０とするように構成する。これにより、１分析フレーム毎における特徴量の平均値に基づいて特徴量が２値化されるので、混合音の音量の変化に適切に対応させて、特徴量を２値化することができる。 Further, in the present embodiment, as described above, the feature amount of the mixed sound is extracted for each analysis frame having a predetermined time length T, and the binarizing unit 20 determines that the feature amount of the mixed sound is The feature amount is set to 1 when it is equal to or greater than the average value of the feature values in each analysis frame, and the feature value is set to 0 when it is less than the average value of the feature values in each analysis frame. Thereby, since the feature value is binarized based on the average value of the feature value for each analysis frame, the feature value can be binarized appropriately corresponding to the change in the volume of the mixed sound.

また、本実施形態では、上記のように、混合音の特徴量は、混合音のフーリエスペクトルに基づいて算出される各帯域窓の出力エネルギであるクロマスペクトルであり、２値化手段２０を、クロマスペクトルを２値化するように構成する。これにより、クロマスペクトルを特徴量として、適切に音源を検索することができる。 In the present embodiment, as described above, the feature amount of the mixed sound is a chroma spectrum that is the output energy of each band window calculated based on the Fourier spectrum of the mixed sound, and the binarization means 20 is configured to binarize the chroma spectrum. Thereby, a sound source can be appropriately searched for using the chroma spectrum as the feature amount.

また、本実施形態では、上記のように、混合音の特徴量は、所定の時間長さＴを有する１分析フレーム毎に抽出されており、検索手段３０を、複数の分析フレームに対応する２値化された混合音の特徴量を検索キーとして、音源データベース５０から音源を検索するように構成する。これにより、１つの分析フレームに対応する２値化された混合音の特徴量を検索キーとして検索する場合と比べて、検索キーの特徴量（情報量）が多くなるので、検索の精度を高めることができる。 In the present embodiment, as described above, the feature amount of the mixed sound is extracted for each analysis frame having a predetermined time length T, and the search means 30 is configured to search for a sound source in the sound source database 50 using, as the search key, the binarized feature amounts of the mixed sound corresponding to a plurality of analysis frames. Compared with searching with the binarized feature amount of a single analysis frame as the search key, the search key carries more feature amounts (more information), so the search accuracy can be improved.

また、本実施形態では、上記のように、検索手段３０を、１０×１６分析フレームまたは２×１６分析フレームに対応する２値化された混合音の特徴量を検索キーとして、音源データベース５０から音源を検索するように構成する。これにより、１０×１６分析フレームまたは２×１６分析フレームの比較的短い複数の分析フレームに対応する２値化された混合音の特徴量を検索キーとして検索が行われた場合でも、特徴量の２値化により、高速、かつ、高精度な検索を行うことができる。 Further, in the present embodiment, as described above, the search unit 30 uses the feature value of the binarized mixed sound corresponding to the 10 × 16 analysis frame or the 2 × 16 analysis frame as a search key from the sound source database 50. Configure to search for sound sources. As a result, even when a search is performed using a binarized mixed sound feature amount corresponding to a plurality of analysis frames that are relatively short of a 10 × 16 analysis frame or a 2 × 16 analysis frame, By binarization, high-speed and high-precision search can be performed.

また、本実施形態では、上記のように、音源データベース５０は、データベース用楽曲信号から特徴量を抽出する特徴量抽出手段５１と、抽出した特徴量を２値化する２値化手段５２と、２値化手段５２により２値化されたデータベース用楽曲信号の特徴量から音源データベース５０を構築する構築手段５３とを含む。これにより、音源データベース５０の特徴量が２値化されるので、スカラー量（たとえば、単精度浮動小数点数、３２ｂｉｔ）の要素を有する特徴量から音源データベース５０が構築される場合と比べて、特徴量が２値化される分、次元が小さくなる（１ｂｉｔ）ので、音源データベース５０のデータベースサイズを小さくすることができる。その結果、検索速度を高速化させることができる。 In the present embodiment, as described above, the sound source database 50 includes the feature amount extraction unit 51 that extracts the feature amount from the database music signal, the binarization unit 52 that binarizes the extracted feature amount, And a construction means 53 for constructing the sound source database 50 from the feature quantity of the database music signal binarized by the binarization means 52. As a result, the feature amount of the sound source database 50 is binarized, so that the feature amount is compared with the case where the sound source database 50 is constructed from feature amounts having elements of scalar amounts (for example, single precision floating point numbers, 32 bits). As the quantity is binarized, the dimension is reduced (1 bit), so the database size of the sound source database 50 can be reduced. As a result, the search speed can be increased.

[変形例] [Modifications]
なお、今回開示された実施形態は、すべての点で例示であって制限的なものではないと考えられるべきである。本発明の範囲は、上記した実施形態の説明ではなく特許請求の範囲によって示され、さらに特許請求の範囲と均等の意味および範囲内でのすべての変更（変形例）が含まれる。 The embodiment disclosed here should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated not by the above description of the embodiment but by the claims, and further includes all changes (modifications) within the meaning and scope equivalent to the claims.

たとえば、上記実施形態では、混合音およびデータベース用楽曲信号の特徴量が、特徴量の平均値以上の場合に特徴量を１とし、特徴量の平均値未満の場合に特徴量を０とすることにより、特徴量を２値化する例を示したが、本発明はこれに限られない。本発明では、特徴量の平均値以外の値を基準として、特徴量を１または０にしてもよい。たとえば、特徴量の中央値などを基準として、特徴量を１または０にしてもよい。 For example, in the above embodiment, the feature amount is set to 1 when the feature amount of the mixed sound and the music signal for the database is equal to or larger than the average value of the feature amount, and is set to 0 when the feature amount is less than the average value of the feature amount. However, the present invention is not limited to this. In the present invention, the feature value may be set to 1 or 0 with reference to a value other than the average value of the feature values. For example, the feature value may be set to 1 or 0 with reference to the median value of the feature values.
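The effect of swapping the reference value can be sketched by making the reference a parameter. This illustration uses `statistics.mean` and `statistics.median` from the Python standard library; the function name `binarize` and the sample frame are hypothetical.

```python
from statistics import mean, median
from typing import Callable, List, Sequence


def binarize(
    frame: Sequence[float],
    reference: Callable[[Sequence[float]], float] = mean,
) -> List[int]:
    """Binarize a frame against a reference value computed from the frame itself."""
    ref = reference(frame)
    return [1 if value >= ref else 0 for value in frame]


# One dominant band pulls the mean up, so the two references disagree.
frame = [0.1, 0.2, 0.3, 10.0]
by_mean = binarize(frame, mean)
by_median = binarize(frame, median)
```

With a single dominant band, the mean marks only that band as 1, while the median marks the upper half of the bands as 1. Which behaviour suits retrieval better is not stated in the text and would need to be evaluated.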

また、上記実施形態では、混合音およびデータベース用楽曲信号の特徴量は、所定の時間長さを有する１分析フレーム毎に抽出される例を示したが、本発明はこれに限られない。たとえば、混合音およびデータベース用楽曲信号の特徴を、複数の分析フレーム毎に抽出してもよい。 In the above embodiment, the feature amounts of the mixed sound and the database music signal are extracted for each analysis frame having a predetermined time length. However, the present invention is not limited to this. For example, the characteristics of the mixed sound and database music signal may be extracted for each of a plurality of analysis frames.

また、上記実施形態（実験）では、１０×１６分析フレームまたは２×１６分析フレームに対応する２値化された混合音の特徴量を検索キーとして、音源データベースから音源を検索するように構成されている例を示したが、本発明はこれに限られない。たとえば、１０×１６分析フレームまたは２×１６分析フレーム以外の数の分析フレームに対応する２値化された混合音の特徴量を検索キーとして用いてもよい。 In the above embodiment (experiment), the sound source is searched from the sound source database using the binarized mixed sound feature amount corresponding to the 10 × 16 analysis frame or the 2 × 16 analysis frame as a search key. However, the present invention is not limited to this. For example, a binarized mixed sound feature amount corresponding to a number of analysis frames other than a 10 × 16 analysis frame or a 2 × 16 analysis frame may be used as a search key.

また、上記実施形態では、特徴量ベクトルが、６オクターブ分の次元（７２次元）を有する例を示したが、本発明はこれに限られない。たとえば、特徴量ベクトルが、６オクターブ以外の数のオクターブ分の次元を有するように構成されていてもよい。 In the above-described embodiment, an example in which the feature vector has a dimension corresponding to 6 octaves (72 dimensions) is shown, but the present invention is not limited to this. For example, the feature vector may be configured to have a number of octave dimensions other than six octaves.

また、上記実施形態では、検索キーの特徴量ベクトルと検索対象の特徴量ベクトルとの間の距離が最小になる場合を検索結果とする例を示したが、本発明はこれに限られない。たとえば、２値化手段により２値化された混合音の特徴量と、音源データベースの音源の特徴量との差が、所定のしきい値未満の場合に、検索結果とするように構成してもよい。これにより、混合音を検索キーとする場合において、検索キーの特徴量ベクトルと検索対象（検索したい正解の音源）の特徴量ベクトルとの間の距離が最小にならない場合でも、検索したい音源が検索できなくなるのを防止することができる。 In the above embodiment, an example is shown in which the search result is the case where the distance between the feature vector of the search key and the feature vector of the search target is the minimum, but the present invention is not limited to this. For example, the device may be configured to return a search result when the difference between the feature amount of the mixed sound binarized by the binarization means and the feature amount of a sound source in the sound source database is less than a predetermined threshold. With this configuration, when a mixed sound is used as the search key, the desired sound source can still be found even if the distance between the feature vector of the search key and the feature vector of the search target (the correct sound source) is not the minimum.
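The threshold-based variant can be sketched as returning every candidate whose distance falls below a limit, instead of only the single nearest one. The Hamming distance, the function names, and the toy data below are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions at which two equal-length 0/1 vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def matches_below(
    key: List[int], database: Dict[str, List[int]], threshold: int
) -> List[Tuple[str, int]]:
    """All (name, distance) pairs with distance strictly below the threshold."""
    hits = [(name, hamming(key, vec)) for name, vec in database.items()]
    hits = [hit for hit in hits if hit[1] < threshold]
    # Closest first; ties are ordered by name for reproducibility.
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


toy_db = {
    "song_a": [1, 0, 1, 0],
    "song_b": [0, 0, 1, 1],
    "song_c": [0, 1, 0, 1],
}
candidates = matches_below([1, 0, 1, 1], toy_db, threshold=2)
```

In this toy case two songs tie at distance 1 and both are returned, so the correct source is still among the results even though neither is a unique minimum. How to choose the threshold is left open in the text.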

１０特徴量抽出手段（検索装置側特徴量抽出手段）
２０２値化手段（検索装置側２値化手段）
３０検索手段
５０音源データベース
５１特徴量抽出手段（データベース側特徴量抽出手段）
５２２値化手段（データベース側２値化手段）
１００音源検索装置
Ｔ（所定の）時間長さ
10 Feature amount extraction means (search device-side feature amount extraction means)
20 Binarization means (search device-side binarization means)
30 Search means
50 Sound source database
51 Feature amount extraction means (database-side feature amount extraction means)
52 Binarization means (database-side binarization means)
100 Sound source search device
T (Predetermined) time length

Claims

A sound source search device comprising:
search device-side feature amount extraction means for extracting a feature amount from a mixed sound including a music signal and a voice signal;
search device-side binarization means for binarizing the extracted feature amount; and
search means for searching for a sound source from a sound source database, using the feature amount of the mixed sound binarized by the search device-side binarization means as a search key, wherein
the feature amount of the mixed sound is a chroma spectrum that is the output energy of each band window calculated based on a Fourier spectrum of the mixed sound,
the feature amount of the mixed sound is extracted for each analysis frame, or for each plurality of analysis frames, having a predetermined time length, and
the search device-side binarization means is configured to set the feature amount to 1 when the feature amount of the mixed sound is equal to or greater than a predetermined reference value of the feature amount for each analysis frame or for each plurality of analysis frames, and to set the feature amount to 0 when the feature amount is less than the predetermined reference value.

The sound source search device according to claim 1, wherein
the sound source database includes:
database-side feature amount extraction means for extracting a feature amount from a database music signal;
database-side binarization means for binarizing the extracted feature amount; and
construction means for constructing the sound source database from the feature amount of the database music signal binarized by the database-side binarization means,
the feature amount of the database music signal is a chroma spectrum that is the output energy of each band window calculated based on a Fourier spectrum of the database music signal,
the feature amount of the database music signal is extracted for each analysis frame, or for each plurality of analysis frames, having the predetermined time length, and
the database-side binarization means is configured to set the feature amount to 1 when the feature amount of the database music signal is equal to or greater than the predetermined reference value of the feature amount for each analysis frame or for each plurality of analysis frames, and to set the feature amount to 0 when the feature amount is less than the predetermined reference value.

A sound source search method comprising the steps of:
extracting a feature amount from a mixed sound including a music signal and a voice signal;
binarizing the extracted feature amount; and
searching for a sound source from a sound source database, using the binarized feature amount of the mixed sound as a search key, wherein
the feature amount of the mixed sound is extracted for each analysis frame, or for each plurality of analysis frames, having a predetermined time length,
the step of extracting the feature amount from the mixed sound includes a step of extracting, as the feature amount, a chroma spectrum that is the output energy of each band window calculated based on a Fourier spectrum of the mixed sound, and
the step of binarizing the extracted feature amount includes a step of setting the feature amount to 1 when the feature amount of the mixed sound is equal to or greater than a predetermined reference value of the feature amount for each analysis frame or for each plurality of analysis frames, and setting the feature amount to 0 when the feature amount is less than the predetermined reference value.