JP2017161572A

JP2017161572A - Sound signal processing method and sound signal processing device

Info

Publication number: JP2017161572A
Application number: JP2016043217A
Authority: JP
Inventors: 暖篠井; Dan Shinoi
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2016-03-07
Filing date: 2016-03-07
Publication date: 2017-09-14

Abstract

PROBLEM TO BE SOLVED: To determine a similarity between an input sound and a reference sound in consideration of a tone color.

SOLUTION: A sound signal processing method according to one embodiment comprises the steps of: obtaining an input sound signal; computing, with NMF, a first matrix that corresponds to an amplitude spectrogram of the input sound signal and includes a first component related to a frequency and a second component related to time; obtaining a second matrix that corresponds to an amplitude spectrogram of a reference sound signal, includes the first component and the second component, and is computed with the NMF; computing, for each of the second components, a similarity of a combination of the first components in the first matrix and the second matrix; and integrating, regarding the second components, similarities of combinations of the first components and obtaining a tone color similarity regarding the input sound signal and the reference sound signal.

SELECTED DRAWING: Figure 14

Description

The present invention relates to a technique for analyzing the sound signal of a piece of music.

Techniques for analyzing the sound signal of a piece of music are known. For example, Patent Document 1 describes a technique that uses nonnegative matrix factorization (NMF) to analyze the genre and style of a piece of music.

Patent Document 1: JP-A-2015-79110

In Patent Document 1, the similarity between an input sound and a reference sound is determined based on a rhythm pattern. In contrast, the present invention provides a technique for determining the similarity between an input sound and a reference sound in consideration of timbre.

The present invention provides a sound signal processing method comprising: a step of obtaining an input sound signal; a step of calculating, by a predetermined algorithm, a first matrix that corresponds to an amplitude spectrogram of the input sound signal and includes a first component related to frequency and a second component related to time; a step of obtaining a second matrix that corresponds to an amplitude spectrogram of a reference sound signal, includes the first component and the second component, and has been calculated by the predetermined algorithm; a step of calculating, for each second component, the similarity of combinations of the first components in the first matrix and the second matrix; and a step of accumulating the similarities of the combinations of the first components over the second components to obtain a first similarity relating to the timbre of the input sound signal and the reference sound signal.

This sound signal processing method may further include a step of calculating the similarity of temporal change for a specific first component of the first matrix and the second matrix to obtain a second similarity relating to the rhythm of the input sound signal and the reference sound signal, and a step of calculating a third similarity that integrates the first similarity and the second similarity.

This sound signal processing method may further include a step of calculating the similarity between a beat spectrum of the input sound signal and a beat spectrum of the reference sound signal to obtain a fourth similarity relating to rhythm. In that case, the first similarity, the second similarity, and the fourth similarity may be integrated in the calculation of the third similarity.

In the calculation of the third similarity, the first similarity, the second similarity, and the fourth similarity may be integrated by a weighted operation adjusted so that the ratio between the timbre-related similarity and the rhythm-related similarity is 1:1.
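One possible reading of the 1:1 weighting is sketched below in Python. The function name `integrate_similarities`, the assumption that all three similarities are already scaled to the range [0, 1], and the even split of the rhythm weight between the second and fourth similarities are illustrative assumptions and are not taken from the specification.

```python
def integrate_similarities(timbre_sim, rhythm_sim_nmf, rhythm_sim_bs):
    """Combine the first (timbre), second (NMF rhythm), and fourth
    (beat-spectrum rhythm) similarities into a third similarity.

    Assumption: each input is already scaled to the range [0, 1].
    The weights give timbre and rhythm equal total influence (1:1);
    splitting the rhythm weight evenly is a hypothetical choice.
    """
    w_timbre = 0.5
    w_rhythm_nmf = 0.25
    w_rhythm_bs = 0.25
    return (w_timbre * timbre_sim
            + w_rhythm_nmf * rhythm_sim_nmf
            + w_rhythm_bs * rhythm_sim_bs)
```

With these weights, a perfect timbre match with no rhythm match scores the same as a perfect rhythm match with no timbre match, which is one way to realize the 1:1 ratio.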

The present invention also provides a sound signal processing device comprising: a first acquisition means that acquires an input sound signal; an observation matrix calculation means that calculates, by a predetermined algorithm, a first matrix that corresponds to an amplitude spectrogram of the input sound signal and includes a first component related to frequency and a second component related to time; a reference matrix acquisition means that acquires a second matrix that corresponds to an amplitude spectrogram of a reference sound signal, includes the first component and the second component, and has been calculated by the predetermined algorithm; a combination similarity calculation means that calculates, for each second component, the similarity of combinations of the first components in the first matrix and the second matrix; and a timbre similarity calculation means that accumulates the similarities of the combinations of the first components over the second components to calculate a first similarity relating to the timbre of the input sound signal and the reference sound signal.

According to the present invention, the similarity between an input sound and a reference sound can be determined in consideration of timbre.

FIG. 1 is a diagram illustrating the functional configuration of a music search system 1 according to an embodiment.
FIG. 2 is a diagram illustrating the functional configuration of the music search system 1.
FIG. 3 is a diagram illustrating the detailed functional configuration of the specifying means 12.
FIG. 4 is a diagram illustrating the detailed functional configuration of the similarity calculation means 13.
FIG. 5 is a diagram illustrating the detailed functional configuration of the similarity calculation means 15.
FIG. 6 is a diagram illustrating the hardware configuration of the electronic musical instrument 10.
FIG. 7 is a diagram illustrating the hardware configuration of the information processing apparatus 20.
FIG. 8 is a flowchart showing an outline of the operation of the music search system 1.
FIG. 9 is a flowchart showing details of the target section specifying process.
FIG. 10 is a flowchart illustrating details of the music structure analysis.
FIG. 11 is a diagram illustrating a music structure specified for an input sound signal.
FIG. 12 is a flowchart illustrating details of the target section selection process.
FIG. 13 is a diagram showing an outline of NMF applied to an amplitude spectrogram.
FIG. 14 is a flowchart illustrating details of the similarity calculation by NMF.
FIG. 15 is a diagram illustrating combinations of bases.
FIG. 16 is a flowchart illustrating details of the similarity calculation by beat spectrum.
FIG. 17 is a diagram illustrating a beat spectrum.

1. Configuration
FIG. 1 is a diagram illustrating the functional configuration of a music search system 1 according to an embodiment. The music search system 1 stores a plurality of pieces of music data in advance. When it accepts the input of the sound of a piece of music to be processed (the piece that serves as the search key; hereinafter, this sound is referred to as the “input sound”, and a signal representing the input sound as the “input sound signal”), the music search system 1 searches the stored music for pieces similar to the input sound.

In this example, the music search system 1 includes an electronic musical instrument 10 and an information processing apparatus 20. The electronic musical instrument 10 is an example of a music storage device that stores the music data to be searched. The information processing apparatus 20 is an example of a user terminal that provides a user interface. The music data stored in the electronic musical instrument 10 is data of music for accompaniment (hereinafter, this data is referred to as “accompaniment data”, and the sound of the music for accompaniment as “accompaniment sound”). The user inputs, for example, information on a piece of music that he or she is about to play into the information processing apparatus 20. The music information is, for example, a sound signal of the music based on sound data in an uncompressed or compressed format (wav, mp3, or the like), but is not limited to this. The music information may be stored in advance in the storage 203 of the information processing apparatus 20, described later, or may be input from outside the information processing apparatus 20. The information processing apparatus 20 searches the accompaniment data stored in the electronic musical instrument 10 for data similar to the input music. When it finds an accompaniment sound similar to the input music, the information processing apparatus 20 instructs the electronic musical instrument 10 to reproduce that accompaniment sound. The electronic musical instrument 10 reproduces the instructed accompaniment sound. The user plays the electronic musical instrument 10 along with the reproduced accompaniment.

FIG. 2 is a diagram illustrating the functional configuration of the music search system 1. When a sound signal of a piece of music (input sound signal) is input, the music search system 1 outputs a piece of music similar to that piece. The music search system 1 includes an acquisition means 11, a specifying means 12, a similarity calculation means 13, a database 14, a similarity calculation means 15, an integration means 16, a selection means 17, and an output means 18.

The acquisition means 11 acquires the input sound signal. The specifying means 12 specifies, within the input sound signal, the target section that is subject to the subsequent processing. The database 14 stores information on a plurality of pieces of accompaniment data. The similarity calculation means 13 calculates the similarity between the input sound and an accompaniment sound over the target section of the input sound signal using nonnegative matrix factorization (NMF). The similarity calculation means 15 calculates the similarity between the input sound and an accompaniment sound over the target section of the input sound signal using a beat spectrum. The integration means 16 integrates the similarity calculated by the similarity calculation means 13 and the similarity calculated by the similarity calculation means 15. The selection means 17 selects, from the database 14, a piece of music similar to the input sound based on the integrated similarity. The output means 18 outputs the selected piece of music.

FIG. 3 is a diagram illustrating the detailed functional configuration of the specifying means 12. For an input sound signal, the specifying means 12 outputs a sound signal from which the portions other than the target section (hereinafter, “non-target sections”) have been removed. The specifying means 12 includes a structure analysis means 121, a dividing means 122, a selection means 123, and a signal generation means 124. The structure analysis means 121 analyzes the musical structure of the piece of music represented by the input sound signal (hereinafter, “music structure analysis”). The dividing means 122 divides the input sound signal into a plurality of sections in the time domain according to the result of the music structure analysis. The selection means 123 selects, from the plurality of sections, the section to be used as the target section. The signal generation means 124 generates a sound signal obtained by removing the non-target sections from the input sound signal, that is, a sound signal consisting only of the target section.
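The role of the signal generation means 124 can be illustrated with a minimal Python sketch. The function name `extract_target_section` and the representation of sections as `(start_sec, end_sec)` pairs are hypothetical; the specification does not prescribe a data format.

```python
def extract_target_section(samples, sample_rate, sections, target_index):
    """Return only the samples that belong to the chosen target section.

    samples      -- sequence of mono audio samples
    sample_rate  -- samples per second
    sections     -- list of (start_sec, end_sec) pairs produced by the
                    dividing step (hypothetical representation)
    target_index -- index of the section chosen by the selection step
    """
    start_sec, end_sec = sections[target_index]
    # Convert section boundaries from seconds to sample indices.
    start = int(round(start_sec * sample_rate))
    end = int(round(end_sec * sample_rate))
    return samples[start:end]
```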

FIG. 4 is a diagram illustrating the detailed functional configuration of the similarity calculation means 13. For an input sound signal, the similarity calculation means 13 outputs a similarity relating to timbre (hereinafter, “timbre similarity”) and a similarity relating to rhythm (hereinafter, “rhythm similarity”). The similarity calculation means 13 includes an observation matrix calculation means 131, a reference matrix acquisition means 132, a combination similarity calculation means 133, a timbre similarity calculation means 134, and a rhythm similarity calculation means 135. The observation matrix calculation means 131 uses a predetermined algorithm (NMF in this example; details of NMF are described later) to decompose a matrix corresponding to the amplitude spectrogram of the input sound signal (hereinafter, “observation matrix”) into the product of a basis matrix and an activation matrix (coefficient matrix). Hereinafter, the basis matrix and the activation matrix obtained from the input sound signal are referred to as the “observation basis matrix” and the “observation activation matrix”, respectively. The observation basis matrix is an example of a first matrix that corresponds to the amplitude spectrogram of the input sound signal and includes a first component related to frequency and a second component related to time. The reference matrix acquisition means 132 acquires a basis matrix and an activation matrix obtained by NMF from a reference sound signal. Hereinafter, the basis matrix and the activation matrix obtained from the reference sound signal are referred to as the “reference basis matrix” and the “reference activation matrix”, respectively. The reference sound signal is a sound signal representing a reference piece of music. The reference piece of music is the piece represented by one item of accompaniment data selected in turn from the accompaniment data recorded in the database 14. The reference basis matrix is an example of a second matrix that corresponds to the amplitude spectrogram of the reference sound signal, includes the first component and the second component, and has been calculated by the predetermined algorithm mentioned above. The combination similarity calculation means 133 calculates, for each unit time, the similarity of combinations of the bases contained in the observation basis matrix and the reference basis matrix. The timbre similarity calculation means 134 accumulates the similarities calculated by the combination similarity calculation means 133 over the time domain to calculate the timbre similarity (an example of the first similarity) between the input sound and the reference sound. The rhythm similarity calculation means 135 calculates the similarity between the observation activation matrix and the reference activation matrix. This similarity represents the rhythm similarity (an example of the second similarity) between the input sound and the reference sound.
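The decomposition into a basis matrix and an activation matrix can be sketched as follows. This is a minimal plain-Python illustration that assumes the well-known multiplicative update rules for the Euclidean cost; this part of the specification does not state which cost function or update rule is used, and the names `nmf`, `matmul`, and `transpose` are introduced here only for illustration. A practical implementation would more likely use a numerical library.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def nmf(v, n_bases, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix v (F x T) as w (F x K) times h (K x T).

    v is the observation matrix (amplitude spectrogram), w the basis
    matrix, and h the activation matrix. Assumed update rules
    (Euclidean cost):
        h <- h * (w^T v) / (w^T w h)
        w <- w * (v h^T) / (w h h^T)
    """
    rng = random.Random(seed)
    n_freq, n_time = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_bases)] for _ in range(n_freq)]
    h = [[rng.random() + eps for _ in range(n_time)] for _ in range(n_bases)]
    for _ in range(n_iter):
        # Update the activation matrix with the basis matrix fixed.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][t] * num[k][t] / (den[k][t] + eps)
              for t in range(n_time)] for k in range(n_bases)]
        # Update the basis matrix with the activation matrix fixed.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[f][k] * num[f][k] / (den[f][k] + eps)
              for k in range(n_bases)] for f in range(n_freq)]
    return w, h
```

In this sketch each column of `w` plays the role of one basis (a spectral shape), and the matching row of `h` describes how strongly that basis is active over time, which is why the text associates bases with timbre and activations with rhythm.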

FIG. 5 is a diagram illustrating the detailed functional configuration of the similarity calculation means 15. For an input sound signal, the similarity calculation means 15 outputs a rhythm similarity calculated by an algorithm different from that of the similarity calculation means 13. The similarity calculation means 15 includes a BPM acquisition means 151, a normalization means 152, a BS calculation means 153, a reference BS acquisition means 154, and a similarity calculation means 155. The BPM acquisition means 151 acquires the BPM (Beats Per Minute) of the input sound signal, that is, the number of beats per unit time. The normalization means 152 normalizes the input sound signal by the BPM. Here, “normalizing the input sound signal by the BPM” covers not only normalizing the input sound signal directly by the BPM but also normalizing by the BPM a signal obtained by applying some signal processing to the input sound signal. The BS calculation means 153 (an example of a first calculation means) calculates the beat spectrum of the normalized input sound signal. The reference BS acquisition means 154 acquires a normalized beat spectrum obtained from the reference sound signal. The similarity calculation means 155 (an example of a second calculation means) compares the normalized beat spectrum of the input sound signal with the normalized beat spectrum of the reference sound signal and calculates the rhythm similarity between the input sound and the reference sound.

FIG. 6 is a diagram illustrating the hardware configuration of the electronic musical instrument 10. The electronic musical instrument 10 includes a performance operator 101, a sound source 102, a sound generation control unit 103, an output unit 104, a storage 105, a CPU 106, and a communication IF 107. The performance operator 101 is an operator with which a user (performer) carries out a performance operation, for example, a keyboard in a keyboard instrument, a string in a stringed instrument, or a key in a wind instrument. The sound source 102 stores sound data corresponding to each performance operator. For example, in a keyboard instrument, the sound data corresponding to a given key is data representing the sound waveform, from onset to decay, of the sound generated when that key is pressed. The sound generation control unit 103 reads sound data from the sound source 102 in response to operation of the performance operator 101. The output unit 104 outputs a sound signal corresponding to the read data (hereinafter, “performance sound signal”). The storage 105 is a non-volatile storage device that stores data. The data stored in the storage 105 includes a database in which a plurality of pieces of accompaniment data are recorded. The CPU 106 is a control device that controls each unit of the electronic musical instrument 10. The CPU 106 supplies accompaniment data read from the storage 105 to the output unit 104. The output unit 104 is an output device that outputs, in addition to the performance sound signal, a sound signal corresponding to the accompaniment data (hereinafter, “accompaniment sound signal”), and includes, for example, a speaker. The communication IF 107 is an interface for communicating with other devices, in this example the information processing apparatus 20 in particular. The communication IF 107 communicates with the information processing apparatus 20, for example, by wireless communication conforming to a predetermined standard.

FIG. 7 is a diagram illustrating the hardware configuration of the information processing apparatus 20. The information processing apparatus 20 is a computer device that functions as a user terminal, for example, a smartphone. The information processing apparatus 20 includes a CPU 201, a memory 202, a storage 203, an input unit 204, an output unit 205, and a communication IF 206. The CPU 201 is a control device that controls the other components of the information processing apparatus 20. The memory 202 is a volatile storage device that functions as a workspace when the CPU 201 executes a program. The storage 203 is a non-volatile storage device that stores various data and programs. The input unit 204 is an input device that accepts commands or information from the user, and includes, for example, at least one of a touch sensor, a button, and a microphone. The output unit 205 is an output device that outputs information to the outside, and includes, for example, at least one of a display and a speaker. The communication IF 206 is an interface for communicating with other devices, for example, the electronic musical instrument 10 or a server device on a network (not shown).

In this example, of the functions of the music search system 1 shown in FIG. 2, the acquisition means 11, the specifying means 12, the similarity calculation means 13, the database 14, the similarity calculation means 15, the integration means 16, and the selection means 17 are implemented in the information processing apparatus 20. The output means 18 is implemented in the electronic musical instrument 10.

In the information processing apparatus 20, a program for causing a computer device to function as the user terminal in the music search system 1 is stored in the storage 203. When the CPU 201 executes this program, the functions of the acquisition means 11, the specifying means 12, the similarity calculation means 13, the database 14, the similarity calculation means 15, the integration means 16, and the selection means 17 are implemented in the information processing apparatus 20. The CPU 201 executing this program is an example of the acquisition means 11, the specifying means 12, the similarity calculation means 13, the similarity calculation means 15, the integration means 16, and the selection means 17. The storage 203 is an example of the database 14. In the electronic musical instrument 10, the output unit 104 is an example of the output means 18.

2. Operation
2-1. Outline
FIG. 8 is a flowchart showing an outline of the operation of the music search system 1. The flow in FIG. 8 starts, for example, when the user inputs an instruction to start a music search. In step S1, the acquisition means 11 acquires an input sound signal. In step S2, the specifying means 12 performs the target section specifying process. In step S3, the similarity calculation means 13 calculates a similarity by NMF. In step S4, the similarity calculation means 15 calculates a similarity by beat spectrum. In step S5, the integration means 16 integrates the similarity by NMF and the similarity by beat spectrum. In step S6, the selection means 17 selects a piece of music based on the integrated similarity. In step S7, the output means 18 outputs the selected piece of music. That is, the output means 18 outputs an accompaniment sound similar to the input sound. Each process is described in detail below.
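The flow of steps S2 to S6 can be sketched in Python as a loop over the database entries. Everything other than the step order is an assumption made for illustration: the function name `search_similar_music`, the callables passed in as stand-ins for the respective means, the dictionary-shaped database, and the choice of the single highest integrated similarity as the selection rule.

```python
def search_similar_music(input_signal, database, identify_target,
                         nmf_similarity, beat_spectrum_similarity, integrate):
    """Sketch of steps S2 to S6: return (entry id, score) of the best match.

    All arguments other than input_signal are hypothetical callables or
    containers standing in for the means 12 to 17 described in the text.
    database maps an entry id to precomputed reference data.
    """
    target = identify_target(input_signal)                          # S2
    best_id, best_score = None, float("-inf")
    for entry_id, reference in database.items():
        timbre, rhythm_nmf = nmf_similarity(target, reference)      # S3
        rhythm_bs = beat_spectrum_similarity(target, reference)     # S4
        score = integrate(timbre, rhythm_nmf, rhythm_bs)            # S5
        if score > best_score:                                      # S6
            best_id, best_score = entry_id, score
    return best_id, best_score
```

Passing the individual stages in as callables keeps the sketch independent of how each similarity is actually computed.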

2-2. Target Section Specifying Process
The similarity calculations in steps S3 and S4 may be performed on the entire input sound signal. However, processing the entire input sound signal raises the following problems. First, the calculation takes correspondingly longer. Second, the so-called intro and outro (ending) of the input sound signal may contain passages without rhythm, and calculating the similarity with such passages included lowers the reliability of the similarity. To address this, the present embodiment limits the portion of the input sound signal subject to the similarity calculation to a part of the signal.

FIG. 9 is a flowchart showing details of the target section specifying process in step S2. In step S21, the specifying means 12 performs music structure analysis on the input sound signal. Music structure analysis is a process of analyzing the musical structure of a piece, that is, its division into parts such as the so-called intro, A-melody, B-melody, chorus, and outro (ending).

FIG. 10 is a flowchart illustrating details of the music structure analysis. In step S211, the specifying means 12 divides the input sound signal into a plurality of unit sections. A unit section is, for example, a section corresponding to one measure of the music. The division into unit sections is performed, for example, as follows. First, the specifying means 12 detects beat points in the input sound signal. Next, the specifying means 12 defines, as a unit section, a section made up of the number of beat points that corresponds to one measure. For the detection of beat points and the definition of the sections corresponding to one measure, the technique described in, for example, JP-A-2015-114361 is used.
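Once beat points are available, grouping them into one-measure unit sections can be sketched as below. The beat detection itself is outside this sketch; the function name `beats_to_unit_sections`, the default of four beats per measure, and the assumption that the first detected beat is a downbeat are all illustrative and not taken from the specification.

```python
def beats_to_unit_sections(beat_times, beats_per_measure=4):
    """Group consecutive beat times (seconds) into one-measure unit sections.

    Returns a list of (start_sec, end_sec) pairs. A section ends where the
    next measure begins, so trailing beats that do not reach the start of
    another measure are dropped.
    Assumptions: the first beat is a downbeat and the meter is constant.
    """
    # Every beats_per_measure-th beat is treated as the start of a measure.
    downbeats = beat_times[::beats_per_measure]
    return list(zip(downbeats[:-1], downbeats[1:]))
```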

In step S212, the specifying means 12 calculates a timbre feature (hereinafter, “timbre feature amount”) from the input sound signal. As the timbre feature amount, for example, a predetermined number (for example, 12) of MFCCs (Mel-Frequency Cepstrum Coefficients) are used. The MFCCs are calculated for each unit section defined in step S211.

In step S213, the specifying means 12 calculates a chord feature (hereinafter, “chord feature amount”) from the input sound signal. The chord feature amount is calculated for each frame obtained by further subdividing a unit section based on the beat points (for example, a period corresponding to an eighth note or a sixteenth note). As the chord feature amount, for example, a so-called chroma vector is used. A chroma vector is obtained by dividing the frequency-domain energy obtained by spectrum analysis into bands of, for example, one semitone each, and summing these across octaves so that they fold into a single octave. Dividing an octave into semitones gives 12 pitch classes in total, so the chroma vector is a 12-dimensional vector. The chroma vectors calculated for each frame represent the temporal change of the chords, that is, the chord progression.
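The folding of spectral energy into 12 pitch classes can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `chroma_vector`, equal-temperament tuning with A4 = 440 Hz, and assignment of each bin to its nearest semitone are choices made here, not details given in the specification.

```python
import math


def chroma_vector(frequencies_hz, energies, ref_hz=440.0):
    """Fold spectral energy into a 12-dimensional chroma vector.

    frequencies_hz -- center frequency of each spectral bin
    energies       -- energy of each bin (same length)
    ref_hz         -- tuning reference for A4 (assumed 440 Hz)
    Index 0 corresponds to C and index 9 to A.
    """
    chroma = [0.0] * 12
    for freq, energy in zip(frequencies_hz, energies):
        if freq <= 0.0:
            continue  # skip the DC bin; it has no pitch
        # MIDI note number of the nearest semitone (A4 = 69).
        midi = int(round(69 + 12 * math.log2(freq / ref_hz)))
        # Notes one or more octaves apart share the same pitch class.
        chroma[midi % 12] += energy
    return chroma
```

Calling this function once per frame yields the sequence of chroma vectors that the text interprets as the chord progression.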

In step S214, the specifying unit 12 estimates the music structure of the input sound by posterior distribution estimation using probability models. Specifically, for probability models that describe the probability of observing a time series of features under a given music structure, the specifying unit 12 estimates the probability distribution of the posterior probability (the posterior distribution) given that the time series of timbre features and chord features has been observed.

As the probability models, for example, a music structure model, a timbre observation model, and a chord observation model are used. The music structure model describes the music structure probabilistically. The timbre observation model describes the generation process of the timbre features probabilistically. The chord observation model describes the generation process of the chord features probabilistically. In these probability models, unit sections whose musical structures are similar or common are grouped into the same structure section. The groups are distinguished by section codes (for example, A, B, C, ...).

The music structure model is, for example, a state transition model in which a plurality of mutually linked states are arranged in a state space, more specifically a hidden Markov model. The timbre observation model follows, for example, an infinite Gaussian mixture distribution whose component distributions are normal distributions, and is a probability model that depends on the section code but not on the dwell time within the structure section. The chord observation model likewise follows, for example, an infinite Gaussian mixture distribution whose component distributions are normal distributions, and is a probability model that depends on both the section code and the dwell time within the structure section. The posterior distribution in each probability model is estimated by an iterative estimation algorithm such as the variational Bayes method. The specifying unit 12 estimates the music structure that maximizes this posterior distribution.

In step S215, the specifying unit 12 specifies the music structure based on the estimation result of step S214.

FIG. 11 is a diagram illustrating a music structure specified for an input sound signal. In this example, the input sound signal is divided into nine unit sections (τ1 to τ9). The section codes A, B, C, C, C, D, B, E, and F are assigned to these unit sections in order from the beginning.

Refer to FIG. 9 again. In step S22, the specifying unit 12 divides the input sound signal. Specifically, the specifying unit 12 divides the input sound signal into the unit sections according to the result of the music structure analysis. In step S23, the specifying unit 12 selects, from the divided parts of the input sound signal, the sections to be used in the subsequent processing (hereinafter referred to as "target sections").

FIG. 12 is a flowchart illustrating the details of the target section selection process. In step S231, the specifying unit 12 calculates the priority of each unit section. In this example, a unit section is given a higher priority the more unit sections share its section code, and a lower priority the fewer share it. In the example of FIG. 11, three sections are assigned the section code C, so each of them is given priority 3; two sections are assigned the section code B, so each of them is given priority 2; and every other section is given priority 1. That is, step S23 selects, from the plurality of unit sections, the sections to be subjected to the rhythm similarity calculation in descending order of the number of sections classified into the same group in the music structure analysis.

The criterion for assigning priorities is not limited to the above example. Other criteria may be used instead of, or in addition to, it. As one example, a criterion may be used that gives a high priority to unit sections with a long time length and a low priority to unit sections with a short time length. In that case, step S23 selects the sections to be subjected to the rhythm similarity calculation in descending order of time length among the plurality of unit sections. In the example of FIG. 11 all unit sections have the same time length, but a length-based criterion becomes meaningful, for example, when the tempo changes partway through the piece or when the music structure analysis uses an algorithm that merges a plurality of consecutive unit sections. As another example, a criterion based on the position on the time axis of the input sound signal may be used, for example one that gives a low priority to the section from the start to a predetermined time and to the section from a predetermined time before the end to the end, and a high priority to the other sections. These criteria may be combined by weighted addition.

In step S232, the specifying unit 12 adds to the target sections the section with the highest priority among the sections not yet selected as target sections (hereinafter referred to as "non-selected sections"). When a plurality of sections share the highest priority, the specifying unit 12 adds one of them, chosen according to another criterion, for example the section with the smallest number.

In step S233, the specifying unit 12 determines whether the cumulative time length of the target sections has exceeded a threshold. As the threshold, for example, a predetermined proportion of the total time length of the input sound signal, for example 50%, is used. When it is determined that the cumulative time length of the target sections has not exceeded the threshold (S233: NO), the specifying unit 12 returns the process to step S232. When it is determined that the cumulative time length has exceeded the threshold (S233: YES), the specifying unit 12 ends the flow of FIG. 12.

In the example of FIG. 11, the section τ3 is added to the target sections first, and each time the process is repeated thereafter, the sections τ4, τ5, τ2, and τ7 are added in this order. Since the sections τ1 to τ9 have equal time lengths in this example, the target sections number five once τ7 has been added, and their cumulative time length (5/9, about 56%) exceeds 50% of the total time length of the input sound signal.
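The selection loop of steps S231 to S233 can be sketched as follows for the count-based priority criterion. This is a minimal illustration under stated assumptions: the function name `select_target_sections` and the parameter `ratio` are hypothetical, and ties are broken by the earliest section, as in the example in the text.

```python
from collections import Counter

def select_target_sections(labels, durations, ratio=0.5):
    """Greedy selection of target sections.

    Sections are added in descending order of priority until their
    cumulative length exceeds ratio * (total length).
    Returns 0-based section indices in the order they were selected.
    """
    counts = Counter(labels)
    # S231: priority = number of unit sections sharing the same section code.
    priority = [counts[lab] for lab in labels]
    threshold = ratio * sum(durations)
    remaining = list(range(len(labels)))
    selected, cumulative = [], 0.0
    while remaining and cumulative <= threshold:
        # S232: highest priority first; ties go to the earliest section.
        best = max(remaining, key=lambda i: (priority[i], -i))
        remaining.remove(best)
        selected.append(best)
        # S233: the loop condition checks the cumulative length.
        cumulative += durations[best]
    return selected

# Section codes of FIG. 11, with equal section lengths.
labels = ["A", "B", "C", "C", "C", "D", "B", "E", "F"]
selected = select_target_sections(labels, [1.0] * 9)
print([i + 1 for i in selected])  # 1-based section numbers
```

With the section codes of FIG. 11 this reproduces the order τ3, τ4, τ5, τ2, τ7 given in the text. Other criteria (section length, position on the time axis) could be folded into the sort key as a weighted sum.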

Refer to FIG. 9 again. In step S24, the specifying unit 12 specifies the target sections based on the result of step S23. In the example of FIG. 11, the sections τ3, τ4, τ5, τ2, and τ7 are specified as the target sections. The specifying unit 12 generates a signal by joining together only the target sections of the divided input sound signal. In the subsequent processing, this signal is processed as the input sound signal.

According to this example, the subsequent processing can be limited to a part of the input sound signal selected on the basis of its musical structure, for example the sections that appear repeatedly. Such sections are often the musically prominent parts, such as the chorus (sabi) or the verse (A-melo). By excluding from processing those parts whose rhythm or timbre may differ from the rest, such as the intro and the outro, the processing load can be reduced while the search accuracy is maintained.

2-3. Similarity Calculation by NMF
Next, the similarity calculation by NMF in step S3 will be described. Before the details of the similarity calculation, an overview of NMF is given. NMF is a low-rank approximation algorithm that decomposes a non-negative matrix into the product of two non-negative matrices. A non-negative matrix is a matrix all of whose elements are non-negative (that is, zero or positive). In general, NMF is expressed by the following equation (1).

$$Y \approx HU \tag{1}$$

Here, Y denotes the given matrix, that is, the observation matrix (m rows, n columns). H is called the basis matrix (m rows, k columns), and U is called the activation (or coefficient) matrix (k rows, n columns). That is, NMF is a process of approximating the observation matrix Y by the product of the basis matrix H and the activation matrix U.

To apply NMF to the similarity calculation of music pieces, consider using a matrix that represents the amplitude spectrogram of a sound signal as the observation matrix Y. An amplitude spectrogram represents the temporal change of the frequency spectrum of a sound signal, and is three-dimensional information consisting of time, frequency, and amplitude. The amplitude spectrogram is obtained, for example, by sampling the sound signal in the time domain, applying a short-time Fourier transform to obtain a complex spectrogram, and taking its absolute value. If the horizontal axis is divided into n parts and the vertical axis into m parts, and the amplitude in each resulting cell is expressed as a number, the amplitude spectrogram can be represented as a matrix. This matrix carries temporal information along the row direction and frequency information along the column direction, and the value of each element carries amplitude information. Since amplitude values are non-negative, this matrix is a non-negative matrix.
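The construction of the observation matrix described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the function name `amplitude_spectrogram`, the Hann window, and the frame length and hop size are choices made here, not values taken from the embodiment.

```python
import numpy as np

def amplitude_spectrogram(signal, frame_len=1024, hop=256):
    """Return an m x n non-negative matrix.

    Rows are the m frequency bins and columns are the n time frames.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping, windowed frames (one row per frame).
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Short-time Fourier transform: complex spectrum of every frame.
    spectrum = np.fft.rfft(frames, axis=1)
    # Absolute value gives the amplitude; transpose to frequency x time.
    return np.abs(spectrum).T

# One second of a 1 kHz sine sampled at 8 kHz (assumed test signal).
sr = 8000
t = np.arange(sr) / sr
Y = amplitude_spectrogram(np.sin(2 * np.pi * 1000 * t))
```

The resulting `Y` has one column per frame and one row per frequency bin, which matches the m-by-n observation matrix that NMF factorizes.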

FIG. 13 is a diagram showing an overview of NMF applied to an amplitude spectrogram. It shows an example in which NMF is applied to an observation matrix Y obtained from an amplitude spectrogram. The basis matrix H includes a component related to frequency (an example of the first component) and a component related to time (an example of the second component), and represents the set of representative spectral patterns contained in the amplitude spectrogram. The activation matrix U can be regarded as representing at what timing and with what strength each representative spectral pattern appears. More specifically, the basis matrix H contains a plurality of basis vectors h (two in the example of FIG. 13), each corresponding to a different sound source. Each basis vector represents the representative frequency spectrum of one sound source. For example, the basis vector h(1) represents the representative spectral pattern of a flute, and the basis vector h(2) represents that of a clarinet. The activation matrix U contains a plurality of activation vectors u (two in the example of FIG. 13), one for each sound source. For example, the activation vector u(1) represents the timing and strength with which the flute's spectral pattern appears, and the activation vector u(2) represents the timing and strength with which the clarinet's spectral pattern appears (in the example of FIG. 13, to keep the drawing simple, the elements of the activation vectors u take only the two values on and off).

NMF calculates the basis matrix H and the activation matrix U when the observation matrix Y is known. Specifically, NMF is defined as the problem of minimizing the distance D between the matrix Y and the matrix product HU, as in the following equation (2). As the distance D, for example, the Euclidean distance, the generalized KL divergence, the Itakura-Saito distance, or the β-divergence is used. The solution of equation (2) cannot be obtained in closed form, but several efficient iterative solution methods are known (for example, Lee, D. D., & Seung, H. S. (2001), Algorithms for non-negative matrix factorization. Advances in neural information processing systems, 13(1) V621-V624).

$$\{H, U\} = \underset{H \ge 0,\; U \ge 0}{\arg\min}\; D(Y \mid HU) \tag{2}$$

The above equation means that the matrices H and U that minimize the distance D are calculated. The same applies to the equations that follow.
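One of the iterative methods referred to above is the multiplicative update rule of Lee and Seung for the Euclidean distance. The following is a minimal NumPy sketch of it, not a statement of the algorithm the embodiment uses. The function name `nmf`, the iteration count, the random initialization, and the small constant `eps` that guards against division by zero are all assumptions.

```python
import numpy as np

def nmf(Y, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate Y (m x n) by H (m x k) @ U (k x n), all non-negative.

    Uses multiplicative updates for the Euclidean distance.
    """
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    H = rng.random((m, k)) + eps
    U = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Update the activations with the bases held fixed.
        U *= (H.T @ Y) / (H.T @ H @ U + eps)
        # Update the bases with the activations held fixed.
        H *= (Y @ U.T) / (H @ U @ U.T + eps)
    return H, U

# Synthetic observation matrix of exact rank 3 (assumed test data).
rng = np.random.default_rng(1)
Y = rng.random((20, 3)) @ rng.random((3, 30))
H, U = nmf(Y, k=3)
rel_err = np.linalg.norm(Y - H @ U) / np.linalg.norm(Y)
```

Because every update multiplies by a non-negative ratio, H and U remain non-negative throughout. Other distances, such as the generalized KL divergence, have analogous update rules.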

When the instruments contained in the input sound and the accompaniment sound are known to some extent in advance, that is, when the candidate instruments contained in the input sound and the accompaniment sound are limited to some extent in advance, semi-supervised NMF may be applied. Semi-supervised NMF is described, for example, in Smaragdis P, Raj B, Shashanka MV. Supervised and Semi-supervised Separation of Sounds from Single-Channel Mixtures, In: ICA. 2007. p. 414-421.

FIG. 14 is a flowchart illustrating the details of the similarity calculation by NMF. In step S31, the similarity calculation unit 13 calculates the amplitude spectrogram of the input sound signal. In step S32, the similarity calculation unit 13 applies NMF to the amplitude spectrogram of the input sound signal. Specifically, the similarity calculation unit 13 first converts the amplitude spectrogram of the input sound signal into a matrix to obtain the observation matrix Yo. Next, the similarity calculation unit 13 applies NMF to the observation matrix Yo to calculate the observation basis matrix Ho (an example of the first matrix) and the observation activation matrix Uo. That is, step S32 calculates the first matrix by a predetermined algorithm.

In step S33, the similarity calculation unit 13 acquires the reference basis matrix Hr (an example of the second matrix) and the reference activation matrix Ur of the reference sound signal. In this example, NMF has been applied in advance to each of a plurality of accompaniment data, and a reference basis matrix and a reference activation matrix have been calculated for each. The calculated reference basis matrices and reference activation matrices are recorded in the database 14 as information on the accompaniment data. The similarity calculation unit 13 selects, one after another, an accompaniment sound to serve as the reference sound from the plurality of accompaniment data recorded in the database, and acquires the reference basis matrix and the reference activation matrix corresponding to that accompaniment sound from the database 14.

The reference basis matrices and reference activation matrices recorded in the database 14 need not have been calculated from the whole of each reference sound. NMF may be applied only to some sections specified by a process similar to the target section specifying process for the input sound, and the reference basis matrix and the reference activation matrix may be calculated from those sections.

In step S34, the similarity calculation unit 13 calculates the basis combination similarity in each frame. A basis combination is the combination of basis vectors, among the plurality of basis vectors contained in the basis matrix, that are activated during a given period.

FIG. 15 is a diagram illustrating basis combinations. FIG. 15(A) schematically shows the NMF result for the input sound, and FIG. 15(B) schematically shows the NMF result for the reference sound. In this example, the basis matrices for the input sound and the reference sound both contain basis vectors corresponding to guitar, bass, hi-hat, snare, and bass drum. The figure schematically shows the activation vector corresponding to each basis vector. The horizontal axis indicates time, and the vertical axis indicates the activation strength. Looking at the basis combinations, in frame F1 for example, the guitar, bass, hi-hat, and bass drum are activated in the input sound, whereas the hi-hat and bass drum are activated in the reference sound.

The basis combination similarity is obtained, for example, by extracting from the activation matrix of each of the input sound and the reference sound the column vector corresponding to a given frame, and calculating the inner product of the two column vectors. This inner product indicates the basis combination similarity in one frame. That is, step S34 calculates, for each second component, the similarity of the combinations of the first components in the first matrix and the second matrix.

Refer to FIG. 14 again. In step S35, the similarity calculation unit 13 calculates the timbre similarity between the input sound and the reference sound by accumulating the combination similarities of the frames. That is, step S35 accumulates the similarities of the combinations of the first components over the second components, and obtains a first similarity relating to the timbre of the input sound signal and the reference sound signal.
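Steps S34 and S35 can be sketched as follows. This is a minimal illustration that assumes the two activation matrices use the same basis order and the same number of frames. The function name `timbre_similarity`, the binary activation values, and the second frame F2 are made up for the example; only frame F1 follows FIG. 15.

```python
import numpy as np

def timbre_similarity(U_in, U_ref):
    """Per-frame basis combination similarity and its sum over all frames.

    U_in and U_ref are activation matrices (k bases x n frames).
    """
    # S34: inner product of the matching column vectors, frame by frame.
    per_frame = np.sum(U_in * U_ref, axis=0)
    # S35: accumulate over the frames to obtain the timbre similarity.
    return per_frame, float(per_frame.sum())

# Rows: guitar, bass, hi-hat, snare, bass drum. Columns: frames F1, F2.
U_in = np.array([[1, 0], [1, 1], [1, 1], [0, 1], [1, 0]], dtype=float)
U_ref = np.array([[0, 0], [0, 1], [1, 1], [0, 1], [1, 0]], dtype=float)
per_frame, total = timbre_similarity(U_in, U_ref)
```

In frame F1 only the hi-hat and the bass drum are active in both sounds, so the inner product for that frame is 2. In practice the sum would still need to be normalized to the common scale mentioned in section 2-5.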

In step S36, the similarity calculation unit 13 calculates the rhythm similarity. In this example, the similarity of the activation vectors corresponding to a specific basis vector is used as the rhythm similarity. The specific basis vector is a basis vector corresponding to an instrument related to the rhythm. That is, step S36 calculates the similarity of the temporal change in a specific first component of the first matrix and the second matrix, and obtains a second similarity relating to the rhythm of the input sound signal and the reference sound signal. Step S36 is also an example of a step of calculating the rhythm similarity with the reference sound signal for at least some of the plurality of sections contained in the input sound signal. In the example of FIG. 15, the similarity of the activation vectors corresponding to the bass drum is calculated. The processes of steps S33 to S36 are repeated, with the reference sound updated each time, until the timbre similarity and the rhythm similarity have been calculated for all the accompaniment data.
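The text does not specify which similarity measure is applied to the activation vectors in step S36. As one possibility, the following sketch compares the bass drum activation vectors by cosine similarity. The measure, the function name `cosine_similarity`, and the example patterns are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Treat an all-zero vector (instrument never active) as dissimilar.
    return dot / norm if norm > 0 else 0.0

# Bass drum activations over eight frames (hypothetical patterns).
kick_in = [1, 0, 0, 0, 1, 0, 0, 0]
kick_same = [1, 0, 0, 0, 1, 0, 0, 0]
kick_offbeat = [0, 0, 1, 0, 0, 0, 1, 0]
same = cosine_similarity(kick_in, kick_same)
different = cosine_similarity(kick_in, kick_offbeat)
```

For non-negative activations the cosine similarity lies between 0 and 1, which fits the common scale described in section 2-5.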

According to this example, not only the rhythm similarity but also the timbre similarity is calculated. Music pieces can therefore be searched for with higher accuracy than when only the rhythm similarity is used.

2-4. Similarity Calculation by Beat Spectrum
FIG. 16 is a flowchart illustrating the details of the similarity calculation by beat spectrum in step S4. A beat spectrum is a feature that captures repetitive patterns in the spectrum, and is calculated as the time-domain autocorrelation of some spectrogram-like feature. In this example, it is calculated as the autocorrelation of the spectral difference.

In step S41, the similarity calculation unit 15 acquires the BPM of the input sound signal. In this example, the similarity calculation unit 15 calculates the BPM by analyzing the input sound signal. A known method is used to calculate the BPM. In step S42, the similarity calculation unit 15 calculates the amplitude spectrogram of the input sound signal. In step S43, the similarity calculation unit 15 obtains a feature, in this example the spectral difference, from the amplitude spectrogram. The spectral difference is obtained by taking, from the amplitude spectrogram, the difference in amplitude between frames adjacent on the time axis. That is, the spectral difference is data whose horizontal axis is time and whose vertical axis is the amplitude difference from the previous frame. In step S44, the similarity calculation unit 15 normalizes the input sound signal by the number of beats per unit time. Specifically, the similarity calculation unit 15 normalizes the time axis of the spectral difference by the BPM. More specifically, by dividing the time axis of the spectral difference by n times the BPM, the similarity calculation unit 15 can normalize the time axis to units of 1/n beat.

In step S45, the similarity calculation unit 15 calculates the beat spectrum of the normalized input sound signal. Specifically, the similarity calculation unit 15 calculates the beat spectrum from the autocorrelation of the normalized spectral difference. In step S46, the similarity calculation unit 15 acquires the normalized beat spectrum of the reference sound signal. In this example, a beat spectrum has been calculated in advance for each of the plurality of accompaniment data. The calculated beat spectra are recorded in the database 14 as information on the accompaniment data. The similarity calculation unit 15 selects, one after another, an accompaniment sound to serve as the reference sound from the plurality of accompaniment data recorded in the database, and acquires the beat spectrum corresponding to that accompaniment sound from the database 14. In step S47, the similarity calculation unit 15 compares the normalized beat spectrum of the input sound signal with the normalized beat spectrum calculated from the reference sound signal, and calculates the rhythm similarity. Specifically, the similarity calculation unit 15 calculates the similarity between the beat spectra of the input sound and the accompaniment sound. Step S47 is another example of a step of calculating the rhythm similarity with the reference sound signal for at least some of the plurality of sections contained in the input sound signal.
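Steps S43 to S45 can be sketched as follows. This is a rough illustration under several assumptions that the text does not fix: the time axis is normalized by linear resampling to `steps_per_beat` samples per beat, the autocorrelation is summed over the frequency bins, and the result is returned per lag rather than per beat frequency. The function name `beat_spectrum` and all parameters are hypothetical.

```python
import numpy as np

def beat_spectrum(Y, frame_rate, bpm, steps_per_beat=4, max_lag_beats=4):
    """Autocorrelation of the spectral difference on a BPM-normalized axis.

    Y is an amplitude spectrogram (frequency bins x time frames).
    Element k of the result is the lag of k / steps_per_beat beats.
    """
    # S43: spectral difference between adjacent frames.
    diff = np.diff(Y, axis=1)
    t_sec = np.arange(diff.shape[1]) / frame_rate
    # S44: resample so that one step equals 1 / steps_per_beat of a beat.
    step_sec = 60.0 / (bpm * steps_per_beat)
    t_norm = np.arange(0.0, t_sec[-1], step_sec)
    resampled = np.stack([np.interp(t_norm, t_sec, row) for row in diff])
    # S45: autocorrelation over time, summed over the frequency bins.
    n_steps = resampled.shape[1]
    n_lags = steps_per_beat * max_lag_beats
    ac = np.array(
        [np.sum(resampled[:, : n_steps - lag] * resampled[:, lag:])
         for lag in range(n_lags + 1)]
    )
    return ac / ac[0] if ac[0] > 0 else ac

# One frequency bin with an impulse on every beat: 120 BPM at 100 frames/s.
Y = np.zeros((1, 400))
Y[0, ::50] = 1.0
bs = beat_spectrum(Y, frame_rate=100.0, bpm=120.0)
```

Because the lag axis is expressed in fractions of a beat, two signals with the same rhythm pattern at different tempos yield comparable curves. Linear resampling of a very sparse difference signal can miss narrow peaks, so a practical implementation would likely smooth or aggregate before resampling.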

FIG. 17 is a diagram illustrating beat spectra. FIG. 17(A) shows the beat spectrum of the input sound, and FIG. 17(B) shows the beat spectrum of the reference sound. In the figure, the horizontal axis indicates the normalized beat frequency, and the vertical axis indicates the spectral intensity. The similarity calculation unit 15 calculates the similarity between the two spectra by pattern matching. Specifically, a beat spectrum is characterized by the frequencies at which peaks appear and by the intensities of those peaks. The similarity calculation unit 15 quantifies a beat spectrum by, for example, extracting as features the frequency and intensity of each peak whose intensity is equal to or greater than a threshold. The similarity calculation unit 15 uses these features to calculate the similarity between the two. This similarity is a rhythm similarity (an example of the fourth similarity). That is, step S47 calculates the similarity between the beat spectrum of the input sound signal and the beat spectrum of the reference sound signal, and obtains a fourth similarity relating to the rhythm.
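The peak-based quantification described above can be sketched as follows. This is a minimal illustration: the function name `extract_peaks`, the local-maximum test, and the sample values are assumptions, and the text does not specify how the extracted features are then compared.

```python
def extract_peaks(spectrum, threshold):
    """Return (index, intensity) for each local maximum at or above threshold."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        is_local_max = spectrum[i] > spectrum[i - 1] and spectrum[i] >= spectrum[i + 1]
        if is_local_max and spectrum[i] >= threshold:
            peaks.append((i, spectrum[i]))
    return peaks

# Hypothetical beat spectrum sampled at nine normalized beat frequencies.
peaks = extract_peaks([0.0, 0.2, 0.9, 0.3, 0.1, 0.6, 0.2, 0.4, 0.1], threshold=0.5)
```

The local maximum at index 7 (intensity 0.4) falls below the threshold and is discarded, leaving the peaks at indices 2 and 5 as the features of this spectrum.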

In the similarity calculation using NMF, the rhythm similarity is calculated from the activation matrix. In general, however, NMF lacks sufficient time resolution to distinguish fine differences in rhythm structure, such as between an even feel and a shuffle feel. NMF can be computed with a finer division of time, but the amount of calculation then increases markedly. Moreover, although FIG. 15 shows an example in which the bases of the instruments are cleanly separated, a general problem of NMF is that the decomposition into instrument sounds does not always succeed. When the instrument sounds cannot be separated well, NMF cannot capture the rhythm structure accurately.

In contrast, in this example the rhythm similarity is calculated using the beat spectrum, so fine rhythm structures can be captured more accurately. In a beat spectrum, a difference in BPM generally affects the feature, so simply comparing beat spectra with each other makes it difficult to evaluate the rhythm structure as a rhythm similarity. In this example, however, the spectral difference is normalized by the BPM before the beat spectrum is calculated, so the difference in BPM between the input sound and the reference sound is absorbed.

2-5. Integration of Similarities and Selection of a Music Piece
The integration of the similarities in step S5 is performed as follows. In this example, two similarities (the timbre similarity and the rhythm similarity) have been obtained by NMF, and one similarity (the rhythm similarity) has been obtained by the beat spectrum. These similarities have been normalized to a common scale (for example, the lowest similarity is zero and the highest similarity is 1).

The integration means 16 integrates the plurality of similarities by a weighted calculation adjusted so that the similarity by NMF and the similarity by the beat spectrum have a predetermined weight ratio, 1:1 in this example. Specifically, the integration means 16 calculates the integrated similarity Di (an example of the third similarity) by the following equation (3).
Di = 2·DtN + DrN + Drb … (3)
Here, DtN and DrN denote the timbre similarity and the rhythm similarity obtained by NMF, respectively, and Drb denotes the rhythm similarity obtained by the beat spectrum. According to this example, the similarity by NMF and the similarity by the beat spectrum are evaluated with the same weight. The integrated similarity is calculated for each of the plurality of accompaniment data.

The selection means 17 selects, from among the plurality of accompaniment data, the accompaniment data having the highest similarity to the input sound. In this example, the selection means 17 resides in the information processing device 20 and the output means 18 resides in the electronic musical instrument 10, so the information processing device 20 notifies the electronic musical instrument 10 of the identifier of the accompaniment data selected by the selection means 17. In the electronic musical instrument 10, the output means 18 reads the accompaniment data corresponding to the notified identifier and outputs the accompaniment sound, that is, the music piece.
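Equation (3) and the selection step can be sketched together as follows. With the coefficients of equation (3), the timbre term has weight 2 and the two rhythm terms have weight 1 each, which corresponds to the 1:1 timbre-to-rhythm ratio stated in claim 4. The function names, the identifier strings, and the similarity values are hypothetical and serve only as an illustration; the inputs are assumed to be already normalized to the range 0 to 1.

```python
from typing import Dict, Tuple


def integrate(d_tn: float, d_rn: float, d_rb: float) -> float:
    """Integrated similarity Di = 2*DtN + DrN + Drb, as in equation (3)."""
    for value in (d_tn, d_rn, d_rb):
        if not 0.0 <= value <= 1.0:
            raise ValueError("similarities are assumed to be normalized to [0, 1]")
    return 2.0 * d_tn + d_rn + d_rb


def select_accompaniment(scores: Dict[str, Tuple[float, float, float]]) -> str:
    """Return the identifier whose (DtN, DrN, Drb) gives the highest Di."""
    if not scores:
        raise ValueError("no accompaniment data to choose from")
    return max(scores, key=lambda ident: integrate(*scores[ident]))


# Hypothetical similarities (DtN, DrN, Drb) for three accompaniment data.
candidates = {
    "acc_01": (0.9, 0.2, 0.3),
    "acc_02": (0.5, 0.9, 0.9),
    "acc_03": (0.1, 0.1, 0.1),
}
best_id = select_accompaniment(candidates)
```

In the configuration described above, only the selected identifier (here `best_id`) would need to be sent to the electronic musical instrument 10, which then reads the corresponding accompaniment data itself.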

3. Modifications
The present invention is not limited to the embodiment described above, and various modifications are possible. Some modifications are described below. Two or more of the following modifications may be used in combination.

The correspondence between the functional configuration and the hardware configuration in the music search system 1 is not limited to the example described in the embodiment. For example, the music search system 1 may consolidate all functions in the information processing device 20. In this case, the music to be searched is not limited to accompaniment sounds of an electronic musical instrument. For example, the music search system 1 may be applied to searching for general music content played on a music player, or to searching for music in a karaoke device. In addition, some of the functions of the information processing device 20 may be implemented on a server device on a network. For example, among the functions of the music search system 1, the specifying means 12, the similarity calculation means 13, the database 14, the similarity calculation means 15, the integration means 16, and the selection means 17 may be implemented on the server device. In this case, upon acquiring an input sound signal, the information processing device 20 transmits to the server device a search request containing the input sound signal converted into data. The server device searches for music similar to the input sound signal contained in the received search request and returns the result to the information processing device 20.

The method by which the specifying means 12 specifies the target section from the input sound signal is not limited to the example described in the embodiment. The specifying means 12 may specify, as the target section, a section selected from among the plurality of sections obtained by music structure analysis, for example at random or in response to a user instruction. Furthermore, the specifying means 12 is not limited to selecting target sections until the cumulative time length of the target sections exceeds a threshold. For example, the specifying means 12 may select target sections until the number of sections selected as target sections exceeds a threshold. Alternatively, the specifying means 12 may select target sections until no section with a priority higher than a threshold remains.
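The three stopping conditions mentioned above (cumulative time length, number of sections, and priority threshold) can be sketched in one routine. This is a minimal illustration under stated assumptions: each section is assumed to carry a priority, and sections are assumed to be visited in descending order of priority. The `Section` type and all parameter names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Section:
    start: float      # start time in seconds
    duration: float   # length in seconds
    priority: float   # assumed: higher-priority sections are selected first


def select_target_sections(
    sections: List[Section],
    max_total: Optional[float] = None,
    max_count: Optional[int] = None,
    min_priority: Optional[float] = None,
) -> List[Section]:
    """Select sections by descending priority until a stopping rule fires."""
    selected: List[Section] = []
    total = 0.0
    for sec in sorted(sections, key=lambda s: s.priority, reverse=True):
        if min_priority is not None and sec.priority <= min_priority:
            break  # no remaining section has a priority above the threshold
        selected.append(sec)
        total += sec.duration
        if max_total is not None and total > max_total:
            break  # cumulative time length has exceeded the threshold
        if max_count is not None and len(selected) > max_count:
            break  # number of selected sections has exceeded the threshold
    # Return the selected sections in chronological order.
    return sorted(selected, key=lambda s: s.start)


sections = [
    Section(start=0.0, duration=10.0, priority=0.2),
    Section(start=10.0, duration=20.0, priority=0.9),
    Section(start=30.0, duration=15.0, priority=0.5),
    Section(start=45.0, duration=5.0, priority=0.7),
]
by_time = select_target_sections(sections, max_total=24.0)
by_count = select_target_sections(sections, max_count=2)
by_priority = select_target_sections(sections, min_priority=0.4)
```

Because each rule stops only once its threshold has been exceeded, the section that crosses the threshold is still included in the result.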

The signal processing performed on the target section specified by the specifying means 12 is not limited to that performed by the similarity calculation means 13 and the similarity calculation means 15. Processing other than similarity calculation may be performed on the target section specified by the specifying means 12.

The similarity calculation means 13 is not limited to one that calculates both the rhythm similarity and the timbre similarity. The similarity calculation means 13 may calculate only one of the rhythm similarity and the timbre similarity. Furthermore, in the similarity calculation means 13, the reference matrix acquisition means 132 may, instead of acquiring the basis matrix and the activation matrix corresponding to the reference sound signal from the database 14, acquire the reference sound signal itself from the database 14 and calculate the basis matrix and the activation matrix by NMF.
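Calculating the basis matrix and the activation matrix from a reference sound signal by NMF can be sketched as follows. This passage does not state which NMF update rule is used, so the sketch assumes the widely used multiplicative updates for the Euclidean cost. The function name `nmf`, its parameters, and the toy input are hypothetical; `V` stands for an amplitude spectrogram (frequency by time), `W` for the basis matrix, and `H` for the activation matrix.

```python
import numpy as np


def nmf(V, n_bases, n_iter=200, eps=1e-9, seed=0):
    """Factorize a nonnegative matrix V (freq x time) as V ~= W @ H.

    Assumption: multiplicative updates for the Euclidean cost; the source
    does not specify the update rule.
    """
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, n_bases)) + eps   # basis matrix (frequency)
    H = rng.random((n_bases, n_time)) + eps   # activation matrix (time)
    for _ in range(n_iter):
        # Each update keeps the factors nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy nonnegative "spectrogram" of rank 2 (6 frequency bins, 8 time frames).
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 8))
W, H = nmf(V, n_bases=2)
```

Precomputing `W` and `H` and storing them in the database, as in the embodiment, avoids running this iteration for every reference sound at search time; the variation described above trades that saving for a smaller database entry.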

Either the similarity calculation means 13 or the similarity calculation means 15 may be omitted. In this case, the integration means 16 is unnecessary, and the selection means 17 selects music based only on the similarity from whichever of the similarity calculation means 13 and the similarity calculation means 15 remains.

The acquisition means 11, the specifying means 12, the similarity calculation means 13, the similarity calculation means 15, the integration means 16, and the selection means 17 are not limited to those implemented on a computer device by software. At least some of them may be implemented as hardware, for example by a dedicated integrated circuit.

The program executed by the CPU 201 or the like of the information processing device 20 may be provided on a storage medium such as an optical disk, a magnetic disk, or a semiconductor memory, or may be downloaded via a communication line such as the Internet. The program need not include all the steps of FIG. 8. For example, the program may include only step S1, step S2, and step S3. The program may also include only step S1, step S2, and step S4. Furthermore, the program may include only step S1 and step S4.

DESCRIPTION OF SYMBOLS: 1…music search system, 10…electronic musical instrument, 11…acquisition means, 12…specifying means, 13…similarity calculation means, 14…database, 15…similarity calculation means, 16…integration means, 17…selection means, 18…output means, 20…information processing device, 101…performance operator, 102…sound source, 103…sound generation control section, 104…output section, 105…storage, 106…CPU, 107…communication IF, 121…structure analysis means, 122…dividing means, 123…selection means, 124…signal generation means, 131…observation matrix calculation means, 132…reference matrix acquisition means, 133…similarity calculation means, 134…similarity calculation means, 135…similarity calculation means, 151…BPM acquisition means, 152…normalization means, 153…BS calculation means, 154…reference BS acquisition means, 155…similarity calculation means, 201…CPU, 202…storage, 203…communication IF, 204…input section, 105…output section

Claims

A sound signal processing method comprising:
obtaining an input sound signal;
calculating, by a predetermined algorithm, a first matrix that corresponds to an amplitude spectrogram of the input sound signal and includes a first component related to frequency and a second component related to time;
obtaining a second matrix that corresponds to an amplitude spectrogram of a reference sound signal, includes the first component and the second component, and is calculated by the predetermined algorithm;
calculating, for each second component, a similarity of a combination of the first components in the first matrix and the second matrix; and
integrating the similarities of the combinations of the first components with respect to the second components to obtain a first similarity related to the timbre of the input sound signal and the reference sound signal.

The sound signal processing method according to claim 1, further comprising:
calculating a similarity of temporal change in a specific first component of the first matrix and the second matrix to obtain a second similarity related to the rhythm of the input sound signal and the reference sound signal; and
calculating a third similarity obtained by integrating the first similarity and the second similarity.

The sound signal processing method according to claim 2, further comprising calculating a similarity between the beat spectrum of the input sound signal and the beat spectrum of the reference sound signal to obtain a fourth similarity related to the rhythm,
wherein, in the calculation of the third similarity, the first similarity, the second similarity, and the fourth similarity are integrated.

The sound signal processing method according to claim 3, wherein, in the calculation of the third similarity, the first similarity, the second similarity, and the fourth similarity are integrated by a weighted calculation adjusted so that the ratio between the similarity related to the timbre and the similarity related to the rhythm is 1:1.

A sound signal processing device comprising:
first acquisition means for acquiring an input sound signal;
observation matrix calculation means for calculating, by a predetermined algorithm, a first matrix that corresponds to an amplitude spectrogram of the input sound signal and includes a first component related to frequency and a second component related to time;
reference matrix acquisition means for acquiring a second matrix that corresponds to an amplitude spectrogram of a reference sound signal, includes the first component and the second component, and is calculated by the predetermined algorithm;
combination similarity calculation means for calculating, for each second component, a similarity of a combination of the first components in the first matrix and the second matrix; and
timbre similarity calculation means for integrating the similarities of the combinations of the first components with respect to the second components and calculating a first similarity related to the timbre of the input sound signal and the reference sound signal.