JP7375302B2

JP7375302B2 - Acoustic analysis method, acoustic analysis device and program

Info

Publication number: JP7375302B2
Application number: JP2019003324A
Authority: JP
Inventors: 康平須見
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2019-01-11
Filing date: 2019-01-11
Publication date: 2023-11-08
Anticipated expiration: 2039-01-11
Also published as: CN113196381B; WO2020145326A1; US20210287641A1; JP2020112683A; CN113196381A

Description

The present invention relates to a technique for analyzing music.

Background Art

Techniques for estimating various kinds of information from sound, such as the singing or performance of a piece of music, have been proposed. For example, Patent Document 1 discloses a configuration for estimating chords from an acoustic signal representing a piece of music. Specifically, the key of the piece is estimated from the acoustic signal, and the chords are estimated taking the estimated key into account. Patent Document 2 discloses a configuration for identifying the type of tonality from the shape of the power spectrum of a piece of music. The type of tonality is identified according to the power of each pitch name, calculated from time-series data of the power spectrum.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2015-31738
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2007-248610

In the technique of Patent Document 1, the key of a piece of music is estimated from the most frequently occurring note. However, in some pieces the notes corresponding to the key occur infrequently. In the technique of Patent Document 2, the type of tonality is identified using the correlation between the power of each pitch name and the type of tonality. However, in some pieces the power of each pitch name is not correlated with the type of tonality. In other words, with the techniques of Patent Documents 1 and 2 it is in practice difficult to estimate the key with high accuracy across a wide variety of music. In view of the above circumstances, an object of the present invention is to estimate the key with high accuracy.

To solve the above problem, an acoustic analysis method according to a preferred aspect of the present invention generates key information representing a key by inputting a time series of feature amounts of an acoustic signal into a trained model that has learned the relationship between time series of feature amounts of acoustic signals and keys.

An acoustic analysis device according to a preferred aspect of the present invention includes a key estimation model, which is a trained model that has learned the relationship between time series of feature amounts of acoustic signals and keys, and which generates key information representing a key from an input time series of feature amounts of an acoustic signal.

FIG. 1 is a block diagram illustrating the configuration of an acoustic analysis device according to a first embodiment.
FIG. 2 is a block diagram illustrating the functional configuration of the acoustic analysis device.
FIG. 3 is a schematic explanatory diagram of feature amounts and key information.
FIG. 4 is an explanatory diagram of the feature amounts.
FIG. 5 is an explanatory diagram of the key information.
FIG. 6 is a flowchart illustrating a specific procedure of key estimation processing.
FIG. 7 is an explanatory diagram of the operation of a learning processing unit.
FIG. 8 is a block diagram illustrating the functional configuration of an acoustic analysis device according to a second embodiment.
FIG. 9 is an explanatory diagram of a time series of keys corrected by post-processing according to the second embodiment.
FIG. 10 is a flowchart illustrating a specific procedure of the post-processing according to the second embodiment.
FIG. 11 is a block diagram illustrating the functional configuration of an acoustic analysis device according to a third embodiment.
FIG. 12 is an explanatory diagram of a time series of keys corrected by post-processing according to the third embodiment.
FIG. 13 is a flowchart illustrating a specific procedure of the post-processing according to the third embodiment.

<First embodiment>
FIG. 1 is a block diagram illustrating the configuration of an acoustic analysis device 100 according to a first embodiment of the present invention. The acoustic analysis device 100 is an information processing device that estimates the key of a piece of music by analyzing an acoustic signal V representing sound such as the singing or performance of the piece. The acoustic analysis device 100 estimates one key from the acoustic signal V, taking as candidates 24 keys corresponding to the combinations of a plurality of tonics (specifically, the 12 semitones of equal temperament) and key names (major and minor). Note that the number of candidate keys is not limited to 24.

The acoustic analysis device 100 includes a control device 11, a storage device 12, and a display device 13. An information terminal such as a mobile phone, a smartphone, or a personal computer is suitable for use as the acoustic analysis device 100. The display device 13 displays the key estimated from the acoustic signal V. The display device 13 is an example of a reproduction device that reproduces the result of analyzing the acoustic signal V. For example, a sound emitting device that emits sound according to the result of analyzing the acoustic signal V may instead be used as the reproduction device.

The control device 11 is composed of one or more processing circuits such as a CPU (Central Processing Unit) and controls each element of the acoustic analysis device 100. The storage device 12 is one or more memories composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium, and stores the program executed by the control device 11 and the various data used by the control device 11. For example, the storage device 12 stores the acoustic signal V. The storage device 12 may be configured as a combination of multiple types of recording media. A portable recording medium that is detachable from the acoustic analysis device 100, or an external recording medium (for example, online storage) with which the acoustic analysis device 100 can communicate via a communication network, may also be used as the storage device 12.

FIG. 2 is a block diagram illustrating the functions realized when the control device 11 executes the program stored in the storage device 12. The control device 11 implements a feature extraction unit 21, a key estimation model 22, and a learning processing unit 23. The functions of the control device 11 may be realized by a plurality of devices configured separately from one another. Some or all of the functions of the control device 11 may be realized by dedicated electronic circuitry.

The feature extraction unit 21 extracts the feature amount Y of the acoustic signal V from the acoustic signal V stored in the storage device 12. The feature extraction unit 21 of the first embodiment includes a first processing unit 211, a second processing unit 212, and a third processing unit 213.

The first processing unit 211 extracts the feature amount X of the acoustic signal V from the acoustic signal V. The second processing unit 212 estimates a chord O from the feature amount X extracted by the first processing unit 211. The third processing unit 213 extracts the feature amount Y of the acoustic signal V. The feature amount Y is an index that represents acoustic features while taking the temporal variation of the acoustic signal V into account. As an example, the third processing unit 213 extracts the feature amount Y from the feature amount X extracted by the first processing unit 211 and the chord O estimated by the second processing unit 212. The time series of the feature amount Y is input to the key estimation model 22.

The key estimation model 22 is a trained model that has learned the relationship between the time series of the feature amount Y and the key. Specifically, the key estimation model 22 generates information representing a key (hereinafter "key information H") from an input time series of the feature amount Y.

FIG. 3 is an explanatory diagram of the feature amount X, the feature amount Y, and the key information H. The feature amount X is extracted for each unit period T (T1, T2, T3, ...). The unit period T is, for example, a period corresponding to one beat of the piece. That is, a time series of the feature amount X is generated from the acoustic signal V. Note that unit periods T of fixed or variable length may be defined independently of the beat points of the piece.

The feature amount X is an index representing the acoustic features of the portion of the acoustic signal V corresponding to each unit period T. The chord O is estimated for each feature amount X (that is, for each unit period T). That is, a time series of chords O is generated. For example, among a plurality of reference feature amounts X associated with different chords, the chord associated with the reference feature amount X most similar to the feature amount X extracted by the first processing unit 211 is estimated as the chord O. Note that a statistical estimation model (for example, a hidden Markov model or a neural network) that generates the chord O from the input acoustic signal V may be used to estimate the chord O.
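The template-matching approach to estimating the chord O can be sketched as follows. This is a minimal illustration, not the patented implementation: the template table `TEMPLATES`, the use of 12-element chroma vectors in place of the full feature amount X, and cosine similarity as the similarity measure are all assumptions made for brevity.

```python
import numpy as np

# Hypothetical reference table: one template vector per chord label.
# Elements are pitch classes C, C#, D, ..., B (an assumed ordering).
TEMPLATES = {
    "C": np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0], dtype=float),  # C E G
    "F": np.array([1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0], dtype=float),  # F A C
    "G": np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1], dtype=float),  # G B D
}


def estimate_chord(x, templates=TEMPLATES):
    """Return the label of the template most similar to x (cosine similarity)."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(templates, key=lambda label: cosine(x, templates[label]))


# One chord per unit period T gives a time series of chords.
frames = [TEMPLATES["F"] + 0.1, TEMPLATES["F"] + 0.2, TEMPLATES["C"] + 0.1]
chords = [estimate_chord(x) for x in frames]
```

Any other similarity measure (for example Euclidean distance) would fit the description equally well; the text only requires that the most similar reference be chosen.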

The feature amount Y is extracted for each continuous section U, that is, each run of unit periods over which the same chord O estimated by the second processing unit 212 continues. A plurality of continuous sections U (U1, U2, U3, ...) are identified within the piece. For example, one feature amount Y is extracted for the continuous section U1 (the section corresponding to unit periods T1 to T4) for which "F" was estimated as the chord O.
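Delimiting the continuous sections U amounts to run-length grouping of the per-unit-period chord sequence. The sketch below assumes the chords are given as a list of labels, one per unit period T; the function name and the (chord, start, end) tuple layout are illustrative choices.

```python
from itertools import groupby


def continuous_sections(chords):
    """Group a per-unit-period chord sequence into (chord, start, end) runs.

    start is inclusive and end is exclusive, both in unit-period indices.
    """
    sections = []
    start = 0
    for chord, run in groupby(chords):
        length = len(list(run))
        sections.append((chord, start, start + length))
        start += length
    return sections


# Unit periods T1-T4 carry "F", so they form the first continuous section.
chords = ["F", "F", "F", "F", "C", "C", "G"]
sections = continuous_sections(chords)
```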

FIG. 4 is a diagram schematically showing the feature amount X and the feature amount Y. The feature amount X includes a chroma vector (PCP: Pitch Class Profile), whose elements correspond respectively to a plurality of scale notes (specifically, the 12 semitones of equal temperament), and the intensity Pv of the acoustic signal V. A scale note is a pitch name (pitch class) with octave differences ignored. The element of the chroma vector corresponding to a given scale note is set to the intensity Pq (hereinafter "component intensity") obtained by summing, over a plurality of octaves, the intensity of the components of the acoustic signal V corresponding to that scale note. The feature amount X includes a chroma vector and an intensity Pv for each of two bands: the band below a predetermined frequency and the band above it. In other words, the feature amount X contains the chroma vector for the lower band of the acoustic signal V, the intensity Pv of the acoustic signal V within that band, the chroma vector for the higher band of the acoustic signal V, and the intensity Pv of the acoustic signal V within that band. The feature amount X is therefore expressed as a 26-dimensional vector as a whole (12 chroma elements plus one intensity for each of the two bands).
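The layout of the 26-dimensional feature amount X can be illustrated with a short sketch. The element ordering (low band first, chroma before intensity) and the input format of per-octave semitone intensities are assumptions; the text fixes only the contents and the total dimension.

```python
import numpy as np


def chroma_from_semitones(intensities):
    """Sum semitone intensities over octaves.

    intensities: array of shape (n_octaves, 12); returns the 12 component
    intensities Pq, one per pitch class.
    """
    return np.asarray(intensities, dtype=float).sum(axis=0)


def feature_x(chroma_low, pv_low, chroma_high, pv_high):
    """Concatenate chroma vector and intensity Pv for the low and high bands."""
    chroma_low = np.asarray(chroma_low, dtype=float)
    chroma_high = np.asarray(chroma_high, dtype=float)
    if chroma_low.shape != (12,) or chroma_high.shape != (12,):
        raise ValueError("each chroma vector must have 12 elements")
    return np.concatenate([chroma_low, [pv_low], chroma_high, [pv_high]])


low = chroma_from_semitones(np.ones((3, 12)))   # three octaves below the split
high = chroma_from_semitones(np.ones((5, 12)))  # five octaves above the split
x = feature_x(low, 0.3, high, 0.7)              # 12 + 1 + 12 + 1 = 26 elements
```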

The feature amount Y includes, for each of the lower band and the higher band, the variance σq and mean μq of the time series of the component intensity Pq of each scale note, and the variance σv and mean μv of the time series of the intensity Pv of the acoustic signal V. The third processing unit 213 extracts the feature amount Y by calculating the variance σq and mean μq of the component intensities Pq contained in the plurality of feature amounts X within a continuous section U (that is, the time series of the component intensity Pq within the continuous section U), and the variance σv and mean μv of the intensities Pv contained in those feature amounts X (that is, the time series of the intensity Pv within the continuous section U). The feature amount Y is expressed as a 52-dimensional vector as a whole. As understood from the above, the feature amount Y of each continuous section U includes an index of the temporal variation, within that continuous section U, of the component intensity Pq corresponding to each scale note in the acoustic signal V (typically a measure of dispersion such as the variance σq).
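Because each of the 26 elements of X contributes one variance and one mean, Y has 52 elements. A minimal sketch, assuming the feature amounts X of one continuous section U are stacked row-wise, that the variances precede the means, and that the population variance is used (none of which the text specifies):

```python
import numpy as np


def feature_y(xs):
    """Compute the feature amount Y of one continuous section U.

    xs: array of shape (n_unit_periods, 26), the feature amounts X in the section.
    Returns a 52-element vector: 26 variances followed by 26 means.
    """
    xs = np.asarray(xs, dtype=float)
    if xs.ndim != 2 or xs.shape[1] != 26:
        raise ValueError("expected an array of shape (n_unit_periods, 26)")
    return np.concatenate([xs.var(axis=0), xs.mean(axis=0)])


# Two unit periods whose elements are all 0 and all 2: variance 1, mean 1.
xs = np.vstack([np.zeros(26), np.full(26, 2.0)])
y = feature_y(xs)
```

With this choice a section consisting of a single unit period yields zero variance for every element.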

FIG. 5 is an explanatory diagram of the key information H. The key information H includes, for each of the 24 candidate keys, an index Q (Q1 to Q24) indicating whether that key is the key of the continuous section U. For example, the index Q corresponding to any one key indicates in binary form whether that key is the key of the continuous section U. That is, the key information H is information indicating one of the plurality of keys. By inputting the feature amount Y of each continuous section U into the key estimation model 22, key information H is generated for each continuous section U. That is, the key estimation model 22 generates a time series of key information H. As understood from the above, the key estimation model 22 is a statistical estimation model that estimates the key of each continuous section U from the time series of the feature amount Y. The time series of keys in the piece is thus estimated.
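The key information H can be pictured as a 24-element binary vector with exactly one element set. The sketch below assumes a particular ordering of the indices Q1 to Q24 (12 major keys followed by 12 minor keys) and ASCII key spellings; both are illustrative, not taken from the source.

```python
import numpy as np

TONICS = ["C", "Db", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]
# Assumed ordering: Q1-Q12 are major keys, Q13-Q24 are minor keys.
KEYS = [f"{tonic} {name}" for name in ("major", "minor") for tonic in TONICS]


def encode_key(key):
    """Return key information H as a 24-element one-hot vector."""
    h = np.zeros(len(KEYS), dtype=int)
    h[KEYS.index(key)] = 1
    return h


def decode_key(h):
    """Return the key whose index Q is largest (works for one-hot or scores)."""
    return KEYS[int(np.argmax(h))]


h = encode_key("E major")
```

Decoding with `argmax` also covers the case where a model outputs a score per key rather than strictly binary values.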

The key estimation model 22 is realized as a combination of a program (for example, a program module constituting artificial intelligence software) that causes the control device 11 to execute an operation for generating the key information H from the time series of the feature amount Y, and a plurality of coefficients K applied to that operation. The coefficients K are set by machine learning (in particular, deep learning) using a plurality of items of training data and are stored in the storage device 12. For example, a recurrent neural network (RNN) such as a long short-term memory (LSTM), which is well suited to processing time-series data, is used as the key estimation model 22.
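To make the "program plus coefficients K" structure concrete, the following sketch runs a single-layer LSTM forward pass in plain NumPy, mapping a sequence of 52-dimensional feature amounts Y to a probability over 24 keys for each continuous section. The layer sizes, gate ordering, softmax output, and randomly initialised parameters (standing in for trained coefficients K) are all assumptions; the source states only that an RNN such as an LSTM may be used.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def random_params(n_in=52, hidden=16, n_keys=24, seed=0):
    """Untrained stand-in for the coefficients K."""
    rng = np.random.default_rng(seed)
    return (
        rng.normal(0, 0.1, (4 * hidden, n_in)),    # input weights W
        rng.normal(0, 0.1, (4 * hidden, hidden)),  # recurrent weights U
        np.zeros(4 * hidden),                      # gate biases b
        rng.normal(0, 0.1, (n_keys, hidden)),      # output weights
        np.zeros(n_keys),                          # output biases
    )


def lstm_key_estimator(ys, params):
    """Return per-section key probabilities, shape (n_sections, n_keys)."""
    W, U, b, W_out, b_out = params
    hidden = U.shape[1]
    h = np.zeros(hidden)  # hidden state
    c = np.zeros(hidden)  # cell state
    outputs = []
    for y in ys:
        z = W @ y + U @ h + b       # stacked gate pre-activations
        i, f, o, g = np.split(z, 4)  # input, forget, output gates; candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        logits = W_out @ h + b_out
        e = np.exp(logits - logits.max())
        outputs.append(e / e.sum())  # softmax over the candidate keys
    return np.array(outputs)


ys = np.ones((5, 52))  # five continuous sections
probs = lstm_key_estimator(ys, random_params())
```

Taking the most probable key in each row would give one key per continuous section; a deep-learning framework would normally replace this hand-written loop.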

FIG. 6 is a flowchart illustrating a specific procedure of the process for estimating the key information H of each continuous section U from the acoustic signal V (hereinafter "key estimation processing"). The key estimation processing is started, for example, in response to an instruction from the user. When the key estimation processing starts, the feature extraction unit 21 extracts the feature amount Y for each continuous section U from the acoustic signal V stored in the storage device 12 (Sa1). The key estimation model 22 generates the key information H from the time series of the feature amount Y extracted by the feature extraction unit 21 (Sa2). The control device 11 causes the display device 13 to display, for each continuous section U, the key represented by the key information H output from the key estimation model 22 (Sa3). The content of the display screen showing the estimation result of the key estimation model 22 is arbitrary. For example, a display screen showing the time series of chords estimated by the second processing unit 212 together with the keys estimated by the key estimation model 22 is displayed. The notation of chords that share the same constituent notes may also be chosen according to the key estimated by the key estimation model 22. For example, the chord "G♭" is displayed for a continuous section U for which the key "D♭ major" was estimated, and the chord "F♯" is displayed for a continuous section U for which the key "B major" was estimated.

The learning processing unit 23 sets the plurality of coefficients K of the key estimation model 22 by machine learning (in particular, deep learning), using a plurality of items of training data L. FIG. 7 is an explanatory diagram of the operation of the learning processing unit 23. Each item of training data L consists of a combination of a time series of the feature amount Y and a time series of key information Hx. The time series of key information Hx in each item of training data L corresponds to the correct answer for the time series of the feature amount Y in that item. The training data L contains a time series of the feature amount Y extracted from the performance sound of an existing piece of music and a time series of key information H representing the key of that piece.

The learning processing unit 23 updates the plurality of coefficients K of the key estimation model 22 so as to reduce the difference between the time series of key information H, output from the provisional key estimation model 22 when the time series of the feature amount Y of the training data L is input, and the key information Hx of that training data L. Specifically, the learning processing unit 23 iteratively updates the coefficients K, for example by error backpropagation, so that an evaluation function representing the difference between the key information H and the key information Hx is minimized. The coefficients K set by the learning processing unit 23 through the above procedure are stored in the storage device 12. Consequently, the key estimation model 22 outputs statistically valid key information H for an unknown time series of the feature amount Y, based on the tendency latent between the time series of the feature amount Y and the key information Hx in the plurality of items of training data L.
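The iterative update of the coefficients K can be illustrated with a deliberately simplified stand-in. Instead of an LSTM trained by backpropagation through time, the sketch below trains a single softmax layer by gradient descent, with cross-entropy as the evaluation function. The model, the loss, the learning rate, and the random training data are all assumptions chosen to keep the example short; only the idea of repeatedly adjusting K to reduce the difference between H and Hx comes from the text.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, targets):
    """Evaluation function: mean cross-entropy between output H and target Hx."""
    return float(-np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1)))


def train_step(K, ys, hx, lr=0.05):
    """One gradient-descent update of K, shape (52, 24).

    Returns the updated K and the loss measured before the update.
    """
    probs = softmax(ys @ K)
    loss = cross_entropy(probs, hx)
    grad = ys.T @ (probs - hx) / len(ys)  # gradient of the loss w.r.t. K
    return K - lr * grad, loss


rng = np.random.default_rng(0)
ys = rng.normal(size=(8, 52))                     # synthetic feature amounts Y
hx = np.eye(24)[rng.integers(0, 24, size=8)]      # synthetic one-hot targets Hx
K = np.zeros((52, 24))
losses = []
for _ in range(20):
    K, loss = train_step(K, ys, hx)
    losses.append(loss)
```

With K initialised to zero the model outputs a uniform distribution, so the first loss equals log 24; subsequent losses fall as K is updated.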

As described above, the key information H is generated by inputting the feature amount Y of the acoustic signal V into the key estimation model 22, which has learned the relationship between the feature amount Y of acoustic signals and the key. The key can therefore be estimated with higher accuracy than in a configuration that generates the key information H of a piece according to a predetermined rule.

<Second embodiment>
A second embodiment of the present invention will now be described. In each of the examples below, elements whose functions are the same as in the first embodiment are given the reference numerals used in the description of the first embodiment, and their detailed description is omitted as appropriate.

The key of a piece of music tends not to change within a short period. In the second embodiment, the key estimated in the first embodiment is corrected on the basis of this tendency.

FIG. 8 is a block diagram illustrating the functional configuration of the acoustic analysis device 100 according to the second embodiment. The acoustic analysis device 100 of the second embodiment is the acoustic analysis device 100 of the first embodiment with a post-processing unit 24 added. The feature extraction unit 21, the key estimation model 22, and the learning processing unit 23 are the same as in the first embodiment. FIG. 9 schematically shows the key time series Wa estimated by the key estimation model 22. Each key interval I (I1, I2, I3, ...) in FIG. 9 is an interval over which the key represented by the key information H generated by the key estimation model 22 continues. As illustrated in FIG. 9, one key interval I contains one or more consecutive continuous sections U for which the same key was estimated.

The post-processing unit 24 corrects the key time series Wa estimated by the key estimation model 22. Specifically, the post-processing unit 24 generates a time series Wb by correcting the key of each key interval I whose time length is less than a predetermined value, among the plurality of key intervals I corresponding to the key time series Wa. The predetermined value is, for example, a time length corresponding to three beats of the piece. FIG. 9 illustrates a case in which the key interval I2, for which the key "E major" was estimated, is shorter than the predetermined value. When the time length of a key interval I is less than the predetermined value, the post-processing unit 24 of the second embodiment replaces the key of that key interval I with the key represented by the key information H immediately preceding it. Specifically, the key "E major" of the key interval I2 is replaced with the key "F major" represented by the key information H of the continuous section U immediately preceding the key interval I2 (that is, the last continuous section U of the key interval I1).

FIG. 10 is a flowchart illustrating a specific procedure of the process for correcting the key estimated by the key estimation model 22 (hereinafter "post-processing 1"). After the key estimation model 22 has estimated the key time series Wa, post-processing 1 is started, for example, in response to an instruction from the user. When post-processing 1 starts, the post-processing unit 24 delimits a plurality of key intervals I (I1, I2, I3, ...) on the time axis from the key time series Wa (Sb1). That is, the time series of key intervals I is identified. The post-processing unit 24 selects one of the key intervals I (Sb2). Specifically, the key intervals I are selected sequentially from the first to the last. The post-processing unit 24 determines whether the time length of the key interval I selected in step Sb2 is less than the predetermined value (Sb3). If the time length is less than the predetermined value (Sb3: YES), the post-processing unit 24 replaces the key of that key interval I with the key represented by the key information H immediately preceding it (Sb4). If the time length exceeds the predetermined value (Sb3: NO), the key is not corrected, and the key interval I immediately following is selected (Sb2). Once the time-length determination (Sb3) and the correction of the key of any key interval I shorter than the predetermined value (Sb4) have been carried out for every key interval I (Sb5: YES), the control device 11 causes the display device 13 to display the key time series Wb generated by the post-processing unit 24 (Sb6). That is, the display device 13 displays the time series Wb in which the key has been replaced for every key interval I whose time length is less than the predetermined value. If a key interval I remains unselected in step Sb2 (Sb5: NO), the post-processing unit 24 repeats steps Sb2 to Sb4 for that key interval I. When the key of the first key interval I in the time series is to be replaced, it is replaced with the key represented by the key information H immediately following that key interval I.
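Steps Sb1 to Sb5 of post-processing 1 can be sketched as follows. The input format (one `(key, beats)` pair per continuous section U), the default threshold of three beats, and the treatment of an interval exactly equal to the threshold as "not short" are assumptions. Key intervals are delimited once from the original time series Wa and are not re-merged after a replacement, which is one reading of step Sb1.

```python
def smooth_keys(sections, min_beats=3):
    """Replace the key of short key intervals with the preceding key.

    sections: list of (key, beats), one entry per continuous section U.
    Returns the corrected list of keys, one per continuous section.
    """
    # Sb1: delimit key intervals I as runs of sections sharing the same key.
    intervals = []  # each entry: [first_index, last_index_exclusive, beats]
    previous_key = None
    for idx, (key, beats) in enumerate(sections):
        if intervals and key == previous_key:
            intervals[-1][1] = idx + 1
            intervals[-1][2] += beats
        else:
            intervals.append([idx, idx + 1, beats])
        previous_key = key

    keys = [key for key, _ in sections]
    # Sb2-Sb5: visit the key intervals from first to last.
    for first, last, beats in intervals:
        if beats >= min_beats:        # Sb3: NO, leave the key as estimated
            continue
        if first > 0:                 # Sb4: use the immediately preceding key
            new_key = keys[first - 1]
        elif last < len(keys):        # first interval: use the following key
            new_key = keys[last]
        else:                         # single interval, nothing to copy from
            continue
        keys[first:last] = [new_key] * (last - first)
    return keys


wa = [("F major", 4), ("F major", 4), ("E major", 2), ("F major", 8)]
wb = smooth_keys(wa)
```

Because intervals are visited in order, a run of several short intervals inherits the key of the last sufficiently long interval before it.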

The second embodiment achieves the same effects as the first embodiment. In addition, in the second embodiment, when the time length of a key interval I over which the key represented by the key information H continues is less than the predetermined value, the key of that key interval I is replaced with the key represented by the key information H immediately preceding it. The key estimated by the key estimation model 22 can therefore be corrected appropriately, taking into account the tendency for the key not to change within a short period. In the second embodiment the replacement key is the key represented by the key information H of the key interval I immediately preceding the interval being replaced, but the key represented by the key information H of the key interval I immediately following it may be used instead. In that configuration, for example, the key intervals I are selected sequentially from the last to the first in step Sb2 of FIG. 10.

<Third embodiment>
The chords and the key of a piece of music are correlated. For example, chords whose constituent notes belong to the scale of the key of the piece are played within the piece. In particular, the first chord of a section of a piece in which a specific key is set tends to match the chord whose root is the tonic of that key. In the third embodiment, the key estimated in the first embodiment is corrected on the basis of this tendency.

FIG. 11 is a block diagram illustrating the functional configuration of the acoustic analysis device 100 according to the third embodiment. The acoustic analysis device 100 of the third embodiment includes a post-processing unit 24, as in the second embodiment. The post-processing unit 24 generates a time series Wb by correcting the time series Wa of keys estimated by the key estimation model 22. The post-processing unit 24 of the third embodiment generates the time series Wb using the time series of chords O in the acoustic signal V (for example, the time series of chords O estimated by the second processing unit 212). FIG. 12 is an explanatory diagram of the time series Wb generated by the post-processing unit 24 of the third embodiment. Specifically, the post-processing unit 24 changes an end point (specifically, the start point S) of a key section I according to the time series of chords O in the acoustic signal V. When the time series of chords O of the acoustic signal V in a section R that includes the start point S of the key section I (hereinafter "search section") contains a chord corresponding to the key represented by the key information H of that key section I (hereinafter "key-corresponding chord"), the post-processing unit 24 changes the start point S of the key section I to the start point of the section (typically a continuous section U) corresponding to the key-corresponding chord. The search section R is, for example, a plurality of continuous sections U (six in FIG. 12) centered on the start point S of the key section I. The key-corresponding chord is, for example, a chord whose root is the tonic of the key (typically the tonic chord). FIG. 12 illustrates a case in which the start point S of the key section I2, for which "Emajor" was estimated, is changed to the start point of the continuous section U for which the chord "E" was estimated among the chords O estimated for the search section R.

FIG. 13 is a flowchart illustrating a specific procedure of the process (hereinafter "post-processing 2") by which the post-processing unit 24 of the third embodiment corrects the key estimated by the key estimation model. After the key estimation model 22 has estimated the time series Wa of keys, the post-processing unit 24 starts post-processing 2, for example in response to an instruction from the user. On starting post-processing 2, the post-processing unit 24 demarcates a plurality of key sections I on the time axis from the time series Wa of keys (Sc1). That is, the time series of key sections I is identified. The post-processing unit 24 selects one of the key sections I (Sc2). Specifically, the key sections I are selected sequentially from the first to the last.

Next, for each of the plurality of chords (hereinafter "candidate chords") estimated for the search section R of the key section I selected in step Sc2, the post-processing unit 24 calculates the similarity between that candidate chord and the key-corresponding chord (Sc3). The similarity is an index representing the distance or correlation between a vector representing the key-corresponding chord and a vector representing the candidate chord. A suitable vector representation of the key-corresponding chord or a candidate chord is, for example, the basic space function described in Naohiko Yamaguchi and Noboru Kanmura, "Extension of TPS (Tonal Pitch Space) to handle chords that include non-key tones: aiming at application to jazz music theory," Information Processing Society of Japan Research Report, February 11, 2011. Next, the post-processing unit 24 searches the candidate chords for the one with the highest similarity to the key-corresponding chord (Sc4). Steps Sc3 and Sc4 together search the candidate chords in the search section R of the key section I for the single candidate chord most similar to (typically identical to) the key-corresponding chord of the key of that key section I.

The post-processing unit 24 changes the start point S of the key section I to the start point of the section corresponding to the candidate chord found in step Sc4 (Sc5). Specifically, the start point S of the key section I is changed to the start point of the continuous section U for which that candidate chord was estimated. If the candidate chord at the head of the key section I has the highest similarity, the start point S of the key section I is left unchanged. When steps Sc2 to Sc6 have been executed for every key section I (Sc6: YES), the control device 11 causes the display device 13 to display the time series Wb generated by the post-processing unit 24 (Sc7). That is, the display device 13 displays the time series Wb in which the start point S of each of the key sections I demarcated from the time series Wa has been either changed or maintained. If, on the other hand, a key section I remains unselected in step Sc2 (Sc6: NO), the post-processing unit 24 executes steps Sc2 to Sc6 for that unselected key section I.
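Steps Sc3 to Sc5 for a single key section can be sketched as below. This is only an illustration under stated assumptions: the text suggests the TPS basic space as the chord vector, whereas this sketch substitutes a much simpler 12-dimensional binary pitch-class vector with cosine similarity. The names `chord_vector`, `cosine` and `adjust_start`, the restriction to major and minor triads, and the choice of the earliest candidate when similarities tie are all hypothetical.

```python
import math

# Pitch classes for chord roots (sharps only, for brevity).
NOTE_TO_PC = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
              "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}


def chord_vector(chord):
    """Binary pitch-class vector of a major ('E') or minor ('Am') triad."""
    minor = chord.endswith("m")
    root = NOTE_TO_PC[chord[:-1] if minor else chord]
    vec = [0.0] * 12
    for interval in (0, 3 if minor else 4, 7):
        vec[(root + interval) % 12] = 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def adjust_start(candidates, key_chord):
    """Return the new start point S of a key section.

    candidates: (start_time, chord) pairs for the continuous sections
                inside the search section R, in time order.
    key_chord:  the key-corresponding chord, e.g. 'E' for E major.
    """
    target = chord_vector(key_chord)
    # Sc3: similarity of every candidate chord to the key-corresponding chord.
    sims = [cosine(chord_vector(chord), target) for _, chord in candidates]
    # Sc4: candidate with the highest similarity (earliest one on ties).
    best = max(range(len(candidates)), key=lambda i: sims[i])
    # Sc5: its start time becomes the start point S.
    return candidates[best][0]
```

If the best candidate is the continuous section that already begins at S, the returned value equals the current start point, which corresponds to the case where S is left unchanged.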

The third embodiment also achieves the same effects as the first embodiment. In the third embodiment, the end point of a key section I is changed according to the time series of chords O in the acoustic signal V, so the key estimated by the key estimation model 22 can be corrected appropriately with the temporal change of the chords taken into account. Furthermore, in the third embodiment, when the time series of chords of the acoustic signal V in the search section R that includes the start point S of a key section I (that is, the plurality of candidate chords) contains a key-corresponding chord whose root is the tonic of the key represented by the key information H of that key section I, the start point S of the key section I is changed to the start point of the section corresponding to the key-corresponding chord. The key information H can therefore be corrected appropriately, taking into account the tendency that the first chord in the time series of chords O of the acoustic signal V corresponding to a key section I is likely to be a chord whose root is the tonic of that key. Post-processing 1 of the second embodiment and post-processing 2 of the third embodiment may be combined.

<Modified examples>
Specific modifications that may be added to each of the aspects illustrated above are described below. Two or more aspects freely selected from the following examples may be combined as appropriate, provided they do not contradict one another.

(1) Each of the above embodiments illustrates an acoustic analysis device 100 that includes both the key estimation model 22 and the learning processing unit 23, but the learning processing unit 23 may instead be mounted on an information processing device separate from the acoustic analysis device 100 (hereinafter "machine learning device"). The key estimation model 22, to which the plurality of coefficients K set by the learning processing unit 23 of the machine learning device are applied, is transferred to the acoustic analysis device 100 and used to generate the key information H. As this description makes clear, the learning processing unit 23 is omitted from the acoustic analysis device 100 in this configuration.

(2) The acoustic analysis device 100 may be realized by a server device that communicates with an information terminal such as a mobile phone or a smartphone. For example, the acoustic analysis device 100 generates key information H by analyzing an acoustic signal V received from the information terminal and transmits the key information H to the information terminal. In a configuration in which the feature extraction unit 21, which extracts the feature amount Y from the acoustic signal V, is mounted on the information terminal, the acoustic analysis device 100 generates the key information H by inputting the time series of feature amounts Y received from the information terminal into the key estimation model 22, and transmits that key information H to the information terminal. As this description makes clear, the feature extraction unit 21 may be omitted from the acoustic analysis device 100.

(3) In each of the above embodiments, a time series of known chords O, for example chords specified in the musical score of the piece, may be used to extract the feature amount Y. The third processing unit 213 generates the feature amount Y from the feature amount X extracted by the first processing unit 211 and the time series of known chords O. That is, the second processing unit 212 may be omitted. In this configuration, the display device 13 displays a screen that shows the time series of known chords O alongside the time series of keys estimated by the key estimation model 22. The time series of known chords O is stored in the storage device 12 in advance.

(4) In each of the above embodiments, the key information H is generated for each continuous section U by inputting the feature amount Y of each continuous section U into the key estimation model 22, but the input to and output from the key estimation model 22 are not limited to this example. For example, any of the following configurations [A] to [D] may be adopted.

[A] The time series of feature amounts X for each unit period T, generated by the first processing unit 211, is input into the key estimation model 22 to generate key information H for each unit period T. That is, the key estimation model 22 learns the relationship between the time series of feature amounts X and the key information H. The second processing unit 212 and the third processing unit 213 are omitted.

[B] The time series of feature amounts X for each unit period T, generated by the first processing unit 211, and the time series of chords O for each unit period T, generated by the second processing unit 212, are input into the key estimation model 22 to generate key information H for each unit period T. That is, the key estimation model 22 learns the relationship between the time series of feature amounts X together with the time series of chords O, and the key information H. The third processing unit 213 is omitted.

[C] The time series of chords O for each unit period T, generated by the second processing unit 212, may be input into the key estimation model 22 to generate key information H for each unit period T. That is, the key estimation model 22 learns the relationship between the time series of chords O and the key information H. The third processing unit 213 is omitted. In [B] and [C], a time series of chords O generated from, for example, a known musical score of the piece may be used as the input to the key estimation model 22.

[D] The time series of feature amounts X for each unit period T, generated by the first processing unit 211, and data representing the time series of continuous sections U (hereinafter "section data") are input into the key estimation model 22 to generate key information H for each continuous section U. That is, the key estimation model 22 learns the relationship between the time series of feature amounts X together with the section data, and the key information H. The third processing unit 213 is omitted. The section data is, for example, data indicating the boundaries of the continuous sections U; it may be generated from the time series of chords O generated by the second processing unit 212, or section data generated from a known musical score of the piece may be used.

As the above description makes clear, the input to and output from the key estimation model 22 are arbitrary. The unit of input and output may be changed as appropriate according to the type of input; for example, input and output may be per continuous section U or per unit period T. The configuration of the feature extraction unit 21 may likewise be changed as appropriate according to the input to the key estimation model 22.

(5) Each of the above embodiments illustrates key information H that includes indices Q indicating one of a plurality of keys in binary form, but the content of the key information H is not limited to this example. For example, key information H may be used in which the index Q corresponding to each key represents the likelihood that the key of the piece is that key. Each index Q representing a likelihood is set to a value in the range from 0 to 1 inclusive, and the sum of the indices Q over all the different keys is a predetermined value (for example, 1). Alternatively, the key estimation model 22 may generate, as the key information H, identification information that identifies one of the plurality of keys.
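The two forms of the index Q described in (5) can be illustrated with a small sketch. The text does not say how likelihoods that sum to 1 are obtained; a softmax over raw per-key scores is one common choice and is assumed here purely for illustration. The function names `to_likelihoods` and `to_binary` and the idea of raw scores are hypothetical.

```python
import math


def to_likelihoods(scores):
    """Softmax: map raw per-key scores to values in [0, 1] that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def to_binary(likelihoods):
    """Binary index Q: 1 for the most likely key, 0 for every other key."""
    best = max(range(len(likelihoods)), key=likelihoods.__getitem__)
    return [1 if i == best else 0 for i in range(len(likelihoods))]
```

The position of the single 1 in the binary form can also serve as identification information for the estimated key.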

(6) Each of the above embodiments illustrates a feature amount X that includes a chroma vector and an intensity Pv, but the content of the feature amount X is not limited to this example. For example, the chroma vector alone may be used as the feature amount Y. In addition, although the illustrated feature amount X includes a chroma vector and an intensity Pv for each of the band component below a predetermined frequency and the band component above it, a feature amount X that includes a chroma vector and an intensity Pv for the entire frequency band of the acoustic signal V may be generated instead. Similarly, the feature amount Y may include, for the entire frequency band of the acoustic signal V, the variance σq and mean μq of the time series of the component intensity Pq of each scale note and the variance σv and mean μv of the time series of the intensity Pv of the acoustic signal V.

(7) Post-processing 2 of the third embodiment may take into account the structural sections in the musical structure of the piece corresponding to the acoustic signal V (for example, phrases such as the A melody, the chorus, and the B melody). For example, the key tends to change from one structural section to the next. Using this tendency, when the start point of a structural section lies within the search section R of a key section I, the start point S of that key section I may be changed to the start point of the structural section.

(8) In each of the above embodiments, the key represented by the key information H generated by the key estimation model 22 is displayed, but the use of the key information H is not limited to this example. When a chord displayed by the display device 13 (a chord estimated by the second processing unit 212) is, for example, difficult for the user to play, the user may wish to change it to a simpler chord. With this in mind, the key estimated by the key estimation model 22 may be used to identify a plurality of chords that are candidates for the user's change. Taking the estimated key into account, a plurality of acoustically similar chords are identified as candidates for the change.

(9) In the above embodiments, the key estimation model estimates keys in equal temperament, but the temperament underlying the keys estimated by the key estimation model is not limited to equal temperament. For example, the key estimation model may estimate keys in the temperament of ethnic music such as Indian music.

(10) The acoustic analysis device 100 according to each of the above embodiments is realized, as illustrated in each embodiment, by cooperation between a computer (specifically, the control device 11) and a program. The program according to each of the above embodiments may be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium, a good example of which is an optical recording medium (optical disc) such as a CD-ROM, but it may be any known type of recording medium, such as a semiconductor recording medium or a magnetic recording medium. A non-transitory recording medium here means any recording medium other than a transitory, propagating signal, and does not exclude volatile recording media. The program may also be provided to a computer by distribution over a communication network.

The entity that executes the program realizing the key estimation model 22 is not limited to a general-purpose processing circuit such as a CPU. For example, the program may be executed by a processing circuit specialized for artificial intelligence, such as a Tensor Processing Unit or a Neural Engine, or by an electronic circuit for signal processing (DSP: Digital Signal Processor). Two or more types of entities selected from these examples may also cooperate to execute the program.

(11) The trained model (key estimation model 22) is a statistical estimation model (for example, a neural network) realized by a control device (an example of a computer), and generates an output B corresponding to an input A. Specifically, the trained model is realized by a combination of a program that causes the control device to execute an operation identifying the output B from the input A (for example, a program module constituting artificial intelligence software) and a plurality of coefficients applied to that operation. The coefficients of the trained model are optimized by prior machine learning (deep learning) using a plurality of pieces of teacher data, each of which associates an input A with an output B. That is, the trained model is a statistical estimation model that has learned the relationship between input A and output B. By executing, on an unknown input A, an operation that applies the learned coefficients and a predetermined response function, the control device generates an output B that is statistically valid for the input A under the tendency latent in the teacher data (the relationship between input A and output B).

(12) From the embodiments illustrated above, the following configurations, for example, can be derived.

An acoustic analysis method according to an aspect of the present invention (first aspect) generates key information representing a key by inputting a time series of feature amounts of an acoustic signal into a trained model that has learned the relationship between the time series of feature amounts of acoustic signals and keys. According to this aspect, the key information is generated by inputting the feature amounts of the acoustic signal into a trained model that has learned the relationship between feature amounts and keys, so the key can be estimated with higher accuracy than in a configuration that generates the key information of a piece according to predetermined rules.

In an example of the first aspect (second aspect), the feature amount of the acoustic signal is input into the trained model for each continuous section in which the same chord continues, so that the key information is generated for each continuous section. According to this aspect, because the key information is generated for each continuous section in which the same chord continues, the key information can be estimated with high accuracy, taking into account the tendency that the key does not change within such a continuous section.

In an example of the second aspect (third aspect), the feature amount of each continuous section includes, for each scale note, an index relating to the temporal change, within that continuous section, of the intensity of the component of the acoustic signal corresponding to that scale note. According to this aspect, a feature amount that includes such an index for each scale note is input into the trained model for each continuous section, so the key information can be estimated with high accuracy, taking the temporal change of the acoustic signal into account.
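One way to picture such per-section indices is the variance and mean of each scale note's component intensity over the unit periods of a continuous section, as mentioned for the feature amount Y in modification (6). The sketch below assumes the section is given as a list of 12-element chroma frames; the function name `section_features` and the use of population variance (rather than sample variance) are assumptions for illustration.

```python
from statistics import fmean, pvariance


def section_features(chroma_frames):
    """Per-note mean and variance over one continuous section.

    chroma_frames: list of 12-element sequences, one per unit period
                   (hypothetical input format).
    Returns (means, variances), each a list of 12 floats.
    """
    # Transpose so that each entry is the time series of one scale note.
    per_note = list(zip(*chroma_frames))
    means = [fmean(series) for series in per_note]
    variances = [pvariance(series) for series in per_note]
    return means, variances
```

A note whose intensity stays constant through the section yields zero variance, so the variance captures how much each scale note fluctuates within the section.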

In an example of the first aspect (fourth aspect), the feature amount includes, for each scale note, a component intensity obtained by summing, over a plurality of octaves, the intensities of the components of the acoustic signal corresponding to that scale note. According to this aspect, a feature amount that includes such a summed component intensity for each scale note is input into the trained model, which has the advantage that the key information can be estimated with high accuracy using a feature amount that appropriately reflects the chords of the piece represented by the acoustic signal.
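The octave summation in the fourth aspect corresponds to what is commonly called a chroma vector. A minimal sketch follows, assuming the component intensities are already available as one value per semitone in ascending order; the function name `chroma_from_semitone_bins` and the `lowest_pitch_class` parameter are hypothetical, and how the per-semitone intensities are extracted from the acoustic signal is outside the sketch.

```python
def chroma_from_semitone_bins(intensities, lowest_pitch_class=0):
    """Sum per-semitone intensities across octaves into 12 pitch classes.

    intensities: component intensities, one per semitone, ascending.
    lowest_pitch_class: pitch class (0 = C, ..., 11 = B) of the first bin.
    """
    chroma = [0.0] * 12
    for index, value in enumerate(intensities):
        # Bins 12 semitones apart belong to the same scale note.
        chroma[(lowest_pitch_class + index) % 12] += value
    return chroma
```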

In an example of any of the first to fourth aspects (fifth aspect), when the time length of a key section in which the key represented by the key information continues falls below a predetermined value, the key of that key section is replaced with the key represented by the key information immediately before or immediately after that key section. The key estimated by the trained model can therefore be corrected appropriately, taking into account the tendency that a key is unlikely to change within a short period.

In an example of any of the first to fifth aspects (sixth aspect), an end point of a key section in which the key represented by the key information continues is changed according to the time series of chords in the acoustic signal. According to this aspect, because the end point of the key section is changed according to the time series of chords in the acoustic signal, the key estimated by the trained model can be corrected appropriately with the temporal change of the chords taken into account.

In an example of the sixth aspect (seventh aspect), when the time series of chords of the acoustic signal in a search section that includes the start point of the key section contains a chord whose root is the tonic of the key represented by the key information of that key section, the start point of the key section is changed to the start point of the section corresponding to that chord. According to this aspect, when the time series of chords of the acoustic signal in the search section contains a chord that acoustically approximates (ideally matches) a chord whose root is the tonic of the key represented by the key information of the key section, the start point of the key section is changed to the start point of the section corresponding to that chord. The key information can therefore be corrected appropriately, taking into account the tendency that the first chord in the time series of chords of the acoustic signal corresponding to a key section is likely to be a chord that acoustically approximates (ideally matches) a chord whose root is the tonic of that key.

An acoustic analysis device according to a preferred aspect of the present invention (eighth aspect) includes a key estimation model that is a trained model having learned the relationship between the time series of feature amounts of acoustic signals and keys, and that generates key information representing a key from an input of a time series of feature amounts of an acoustic signal. According to this aspect, the key information is generated by inputting the feature amounts of the acoustic signal into a trained model that has learned the relationship between feature amounts and keys, so the key can be estimated with higher accuracy than in a configuration that generates the key information of a piece according to predetermined rules.

DESCRIPTION OF SYMBOLS 100... acoustic analysis device, 11... control device, 12... storage device, 13... display device, 21... feature extraction unit, 211... processing unit, 212... processing unit, 213... processing unit, 22... key estimation model, 23... learning processing unit, 24... post-processing unit.

Claims

1. A program that causes a computer to function as:
a first processing unit that extracts a first feature amount from an acoustic signal;
a third processing unit that generates a second feature amount from the first feature amount and a chord corresponding to the acoustic signal; and
a key estimation model, which is a trained model that has learned a relationship between the second feature amount and a key, and which generates key information representing a key from an input time series of the second feature amount generated by the third processing unit.

2. The program according to claim 1, further causing the computer to function as a second processing unit that estimates the chord from the acoustic signal,
wherein the third processing unit generates the second feature amount from the first feature amount extracted by the first processing unit and the chord estimated by the second processing unit.

3. A computer-implemented acoustic analysis method comprising generating key information representing a key for each continuous section in which the same chord continues, by inputting a time series of feature amounts of an acoustic signal for each continuous section to a trained model that has learned a relationship between feature amounts and keys.

4. A computer-implemented acoustic analysis method comprising generating key information representing a key by inputting a time series of feature amounts of an acoustic signal to a trained model that has learned a relationship between feature amounts and keys,
wherein the feature amount includes, for each scale note, an index relating to a temporal change in the intensity of the component of the acoustic signal corresponding to that scale note.

5. A computer-implemented acoustic analysis method comprising:
generating key information representing a key by inputting a time series of feature amounts of an acoustic signal to a trained model that has learned a relationship between feature amounts and keys; and
when the time length of a key section in which the key represented by the key information continues is less than a predetermined value, replacing the key of that key section with the key represented by the key information immediately before or immediately after that key section.

6. A computer-implemented acoustic analysis method comprising:
generating key information representing a key by inputting a time series of feature amounts of an acoustic signal to a trained model that has learned a relationship between feature amounts and keys; and
changing an endpoint of a key section in which the key represented by the key information continues, according to a time series of chords in the acoustic signal,
wherein, in changing the endpoint of the key section, when the time series of chords of the acoustic signal in a search section including the start point of the key section contains a key-corresponding chord corresponding to the key represented by the key information of the key section, the start point of the key section is changed to the start point of the section corresponding to the key-corresponding chord.

7. The acoustic analysis method according to claim 6, wherein the key-corresponding chord is a chord whose root note is the tonic of the key represented by the key information of the key section.

8. An acoustic analysis device that generates key information representing a key for each continuous section in which the same chord continues, by inputting a time series of feature amounts of an acoustic signal for each continuous section to a trained model that has learned a relationship between feature amounts and keys.