JP2011095509A

JP2011095509A - Acoustic signal analysis device, acoustic signal analysis method and acoustic signal analysis program

Info

Publication number: JP2011095509A
Application number: JP2009249520A
Authority: JP
Inventors: Ichiro Shishido; 一郎宍戸
Original assignee: JVCKenwood Holdings Inc
Current assignee: JVCKenwood Holdings Inc
Priority date: 2009-10-29
Filing date: 2009-10-29
Publication date: 2011-05-12
Anticipated expiration: 2029-10-29
Also published as: JP5359786B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique that addresses the failure of conventional techniques to detect characteristic portions of music accurately.
SOLUTION: The acoustic signal analysis device 1 includes: an acquisition part 12 for acquiring an acoustic signal 2; a first feature quantity-calculating part 13 for calculating a first value relating to the sound volume of each of a plurality of sections of the acoustic signal 2, each having a first duration; a second feature quantity-calculating part 14 for calculating a second value relating to the sound volume of each of a plurality of sections of the acoustic signal 2, each having a second duration longer than the first duration; an evaluation value-calculating part 15 for calculating, from the first values and the second values, a time series of evaluation values that become larger as the first value is larger and as the second value temporally corresponding to that first value is larger; and a feature position-detecting part 16 for detecting a section where the evaluation value calculated by the evaluation value-calculating part 15 is a maximum or a local maximum.
COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a technique for analyzing acoustic signals.

In recent years, it has become common to store large amounts of music data on computer storage media and the like for later use. This has increased the need for techniques that let a user quickly and easily grasp the content of each of the many stored pieces of music. One such technique that has been proposed detects the chorus or the climax of a piece, that is, the part most worth listening to.

For example, Patent Document 1 discloses a technique that detects the position of maximum volume in music data and plays back a specific portion of the music data that includes that position. Patent Document 2 discloses a technique that detects the degree of exhilaration or excitement of a piece of music using the ratios of the output values of high-band, mid-band, and low-band filters.

Patent Document 1: JP 2007-80304 A
Patent Document 2: JP 2003-228387 A

The conventional techniques described above can detect characteristic portions of a piece of music. However, because music is extremely varied, these techniques sometimes misdetect the characteristic portions, and an acoustic signal analysis device capable of detecting them with higher accuracy has been desired.

The present invention has been made in view of this problem, and its object is to provide an acoustic signal analysis device and the like capable of detecting characteristic portions of a piece of music with high accuracy.

To solve the above problem and achieve the above object, the acoustic signal analysis device of the present invention includes: an acquisition unit that acquires an acoustic signal; a first volume information calculation unit that calculates a first value relating to the volume of each of a plurality of sections, each having a first duration, of the acoustic signal acquired by the acquisition unit; a second volume information calculation unit that calculates a second value relating to the volume of each of a plurality of sections, each having a second duration longer than the first duration, of the acoustic signal acquired by the acquisition unit; an evaluation value calculation unit that uses the first values and the second values to calculate a time series of evaluation values, each of which becomes larger as the first value is larger and as the second value temporally corresponding to that first value is larger; and a feature position detection unit that detects a position at which the evaluation value calculated by the evaluation value calculation unit is a maximum or a local maximum.

The acoustic signal analysis method of the present invention includes the steps of: acquiring an acoustic signal; calculating a first value relating to the volume of each of a plurality of sections, each having a first duration, of the acquired acoustic signal; calculating a second value relating to the volume of each of a plurality of sections, each having a second duration longer than the first duration, of the acquired acoustic signal; using the first values and the second values to calculate a time series of evaluation values, each of which becomes larger as the first value is larger and as the second value temporally corresponding to that first value is larger; and detecting a section in which the calculated evaluation value is a maximum or a local maximum.

Furthermore, a program that causes a computer to implement the functions of the constituent elements of the acoustic signal analysis device of the present invention is also an aspect of the present invention.

The present invention can provide an acoustic signal analysis device and the like that detect characteristic portions of a piece of music with high accuracy.

A diagram showing the configuration of the acoustic signal analysis device of Embodiment 1.
A diagram showing the relationship between the frame duration Tf1 and the frame shift duration Tg1.
A flowchart showing the steps of the operation of the first feature quantity calculation unit of the acoustic signal analysis device of Embodiment 1.
A flowchart showing the steps of the operation of the second feature quantity calculation unit of the acoustic signal analysis device of Embodiment 1.
A flowchart showing the steps of the operation of the evaluation value calculation unit of the acoustic signal analysis device of Embodiment 1.
A flowchart showing the steps of the operation of the feature position detection unit of the acoustic signal analysis device of Embodiment 1.
A diagram showing how the evaluation value changes over time.
A schematic diagram showing how the first feature quantity E1, calculated using a relatively short section length, changes.
A schematic diagram showing how the second feature quantity E2, calculated using a relatively long section length, changes.
A schematic diagram for the case where the sum of the first and second feature quantities (E1 + E2) is used as the evaluation value.
A schematic diagram for the case where the product of the first and second feature quantities (E1 × E2) is used as the evaluation value.
A diagram showing the configuration of the acoustic signal analysis device of Embodiment 2.
A flowchart showing the steps of the operation of the beat time detection unit of the acoustic signal analysis device of Embodiment 2.
A diagram showing an example of autocorrelation.
A diagram showing the distribution of the existence probability of beat durations.
A configuration diagram of the acoustic signal analysis device of Embodiment 3.
A flowchart showing the steps of the operation of the frequency band data calculation unit of the acoustic signal analysis device of Embodiment 3.
A diagram showing a frequency spectrum.
A flowchart showing the steps of the operation of the feature position detection unit of the acoustic signal analysis device of Embodiment 3.
A diagram showing how the width of the frequency band changes over time.
A configuration diagram of the acoustic signal analysis device of Embodiment 4.
A flowchart showing the steps of the operation of the evaluation value calculation unit of the acoustic signal analysis device of Embodiment 4.
A configuration diagram of the acoustic signal analysis device of Embodiment 5.
A flowchart showing the steps of the operation of the volume data calculation unit of the acoustic signal analysis device of Embodiment 5.
A configuration diagram of the acoustic signal analysis device of Embodiment 6.
A flowchart showing the steps of the operation of the beat time detection unit of the acoustic signal analysis device of Embodiment 6.

Embodiments of the present invention are described below with reference to the drawings.

(Embodiment 1)
First, the acoustic signal analysis device 1 of Embodiment 1 is described with reference to FIG. 1, which is a configuration diagram of the device. As shown in FIG. 1, the acoustic signal analysis device 1 of Embodiment 1 includes a control unit 11, an acquisition unit 12, a first feature quantity calculation unit 13, a second feature quantity calculation unit 14, an evaluation value calculation unit 15, and a feature position detection unit 16.

The acoustic signal analysis device 1 acquires an acoustic signal 2 and outputs feature position information 3.

The acoustic signal 2 is an acoustic signal relating to music. It may be a digital signal or an analog signal. It need not consist solely of music: like the audio of a radio or television music program, it may contain sounds other than music, such as a DJ's talk. The acoustic signal 2 exists outside the acoustic signal analysis device 1. However, if the acoustic signal analysis device 1 is provided with a storage unit, the acoustic signal 2 may be stored in that storage unit and thus exist inside the device.

The feature position information 3 identifies a location where the "overall volume" of the acoustic signal 2 is high. That location often coincides with a characteristic portion of the piece, namely the position of the chorus or a point where the structure of the piece or the instrumentation changes significantly.

The control unit 11 of the acoustic signal analysis device 1 exchanges information with the other units of the device and controls them.

The acquisition unit 12 acquires the acoustic signal 2 and generates from it PCM (Pulse Code Modulation) data sampled at a sampling period Ts (sampling frequency Fs = 1/Ts). If the acoustic signal 2 is an analog signal, the acquisition unit 12 converts it to a digital signal to generate the PCM data. If the acoustic signal 2 is a compressed digital signal in a format other than PCM, the acquisition unit 12 decodes it to generate the PCM data. If the acoustic signal 2 is a digital signal whose sampling period differs from Ts, the acquisition unit 12 converts the sampling rate to generate PCM data with sampling period Ts.

In the following description, the PCM data generated by the acquisition unit 12 is referred to as acoustic data x[m] (m = 0 to M-1, where M is the total number of samples of the acoustic data), or simply as acoustic data. When it finishes generating the acoustic data, the acquisition unit 12 notifies the control unit 11.

The first feature quantity calculation unit 13 calculates a first feature quantity relating to volume from the acoustic data generated by the acquisition unit 12. This feature quantity relates to the volume over relatively short time sections. The first feature quantity calculation unit 13 processes the data frame by frame, although the unit of processing is not limited to frames.

In the following, Tf1 denotes the duration of each frame processed by the first feature quantity calculation unit 13, and Tg1 denotes the duration of the frame shift. The number of samples per frame is then N1 = Tf1/Ts, and the number of samples per frame shift is G1 = Tg1/Ts. The frame shift is the time difference between the starts of adjacent frames. Adjacent frames may or may not partially overlap.

The frame duration and the frame shift duration are described with reference to FIG. 2, which shows the relationship between the frame duration Tf1 and the frame shift duration Tg1. FIG. 2(a) shows the case where adjacent frames neither overlap nor leave a gap between them. FIG. 2(b) shows the case where adjacent frames partially overlap; in this case, Tf1 > Tg1. FIG. 2(c) shows the case where there is a gap between adjacent frames; in this case, Tf1 < Tg1.
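The relationships N1 = Tf1/Ts and G1 = Tg1/Ts, together with the three cases of FIG. 2, can be illustrated with a minimal Python sketch. The function name frame_layout, the rounding to the nearest integer, and the 8 kHz sampling rate in the usage lines are assumptions made for illustration; the source does not specify them.

```python
def frame_layout(tf1, tg1, ts):
    """Return (N1, G1, relation) for frame duration tf1, frame shift tg1
    and sampling period ts, all in seconds.

    Rounding to the nearest integer is an assumption; the source only
    states N1 = Tf1/Ts and G1 = Tg1/Ts.
    """
    n1 = round(tf1 / ts)  # samples per frame
    g1 = round(tg1 / ts)  # samples per frame shift
    if g1 == n1:
        relation = "contiguous"   # FIG. 2(a): no overlap, no gap
    elif g1 < n1:
        relation = "overlapping"  # FIG. 2(b): Tf1 > Tg1
    else:
        relation = "gapped"       # FIG. 2(c): Tf1 < Tg1
    return n1, g1, relation


# Hypothetical example: 500 ms frames at an assumed 8 kHz sampling rate.
ts = 1 / 8000
print(frame_layout(0.5, 0.5, ts))
print(frame_layout(0.5, 0.25, ts))
print(frame_layout(0.5, 1.0, ts))
```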

In accordance with an instruction from the control unit 11, the first feature quantity calculation unit 13 starts the operation shown in the flowchart of FIG. 3, which shows the steps of its operation.

The first feature quantity calculation unit 13 first calculates the total number of frames H1 according to equation (1) (S100).

Here, floor() is a function that returns its argument as an integer with the fractional part truncated. M and N1 satisfy M > N1.
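Equation (1) itself is not reproduced in this text. One form that is consistent with the frame indexing used in this description (frame i covers x[i×G1] to x[i×G1+N1-1], and M > N1) is H1 = floor((M - N1)/G1) + 1, which is the largest number of frames that fit entirely within the M samples. The following Python sketch uses that form as an assumption; the function name frame_count is hypothetical.

```python
def frame_count(m, n1, g1):
    """Total number of frames H1 for m samples, frame length n1 and
    frame shift g1 (all in samples).

    Assumed form: H1 = floor((M - N1) / G1) + 1. Equation (1) is not
    shown in the text, so this is a reconstruction, not a quotation.
    """
    if m <= n1:
        raise ValueError("the description requires M > N1")
    # Integer floor division plays the role of floor() here.
    return (m - n1) // g1 + 1


h1 = frame_count(10, 4, 3)
# The last frame starts at (h1 - 1) * 3 and must end before sample 10.
print(h1, (h1 - 1) * 3 + 4 - 1)
```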

Next, the first feature quantity calculation unit 13 sets the control variable i to "0" (S110).

Next, the first feature quantity calculation unit 13 generates the i-th frame data (S120). The i-th frame data consists of the acoustic data from x[i×G1] to x[i×G1+N1-1]. Alternatively, the first feature quantity calculation unit 13 may generate the i-th frame data by multiplying the data from x[i×G1] to x[i×G1+N1-1] by a window function, such as a Hamming window, a Hann (Hanning) window, a Blackman window, or a Gaussian window. Not using a window function is equivalent to generating the i-th frame data by multiplying the acoustic data by a rectangular window.

When a window function is used, its coefficient is normally largest at the center of the frame and smallest at the start and end of the frame, but other arrangements may be used. For example, the coefficient may be largest at the start of the frame (x[i×G1]), decrease progressively from there, and be smallest at the end of the frame (x[i×G1+N1-1]). The i-th frame data is denoted D1[i][j] (j = 0 to ND1, where ND1 = N1-1).
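Step S120 can be sketched in Python as follows. The Hamming coefficients 0.54 and 0.46 are the commonly used textbook values, not values taken from this document, and the function names hamming and make_frame are hypothetical.

```python
import math


def hamming(n):
    """Hamming window of length n using the common 0.54/0.46 coefficients
    (an assumption; the source only names the window type)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * j / (n - 1)) for j in range(n)]


def make_frame(x, i, n1, g1, window=None):
    """Return D1[i][0..N1-1]: samples x[i*G1] .. x[i*G1 + N1 - 1],
    optionally multiplied by a window. window=None corresponds to the
    rectangular-window case."""
    start = i * g1
    segment = x[start:start + n1]
    if len(segment) < n1:
        raise IndexError("frame extends past the end of the acoustic data")
    if window is None:
        return list(segment)
    return [s * w for s, w in zip(segment, window)]


x = [float(v) for v in range(10)]
print(make_frame(x, 1, 4, 2))              # rectangular window
print(make_frame(x, 1, 4, 2, hamming(4)))  # Hamming-weighted frame
```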

Next, the first feature quantity calculation unit 13 calculates the first feature quantity of the i-th frame using one of the methods described below (S130).

Next, the first feature quantity calculation unit 13 increments the value of the control variable i by "1" (S140).

Next, the first feature quantity calculation unit 13 determines whether the value of the control variable i is less than H1 (S150). If the value of i is less than H1 (Yes in S150), it returns to step S120 and repeats the processing through step S140. If the value of i equals H1 (No in S150), it ends the processing.

In this way, the first feature quantity calculation unit 13 calculates H1 items of time-series data E1[i] (i = 0 to H1-1) as the first feature quantity relating to volume, and notifies the control unit 11 that the processing has finished.
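Steps S100 to S150 can be sketched as a single loop. This is an illustrative Python sketch, not the implementation in the patent: the function name first_feature_series and the feature callback are hypothetical, no window is applied, and the frame count uses the assumed form H1 = floor((M - N1)/G1) + 1 because equation (1) is not reproduced in this text.

```python
def first_feature_series(x, n1, g1, feature):
    """Compute the time series E1[0..H1-1] by following steps S100-S150.

    x       : acoustic data x[0..M-1]
    n1, g1  : frame length and frame shift in samples
    feature : callable mapping one frame to a number (step S130)
    """
    m = len(x)
    if m <= n1:
        raise ValueError("the description requires M > N1")
    h1 = (m - n1) // g1 + 1              # S100 (assumed form of equation (1))
    e1 = []
    i = 0                                # S110
    # The flowchart tests i < H1 after the first pass; because H1 >= 1,
    # a while loop that tests first produces the same result.
    while i < h1:                        # S150
        frame = x[i * g1:i * g1 + n1]    # S120 (rectangular window)
        e1.append(feature(frame))        # S130
        i += 1                           # S140
    return e1


def sum_abs(frame):
    return sum(abs(v) for v in frame)


print(first_feature_series([1, -1, 2, -2, 3, -3], 2, 2, sum_abs))
```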

The methods by which the first feature quantity calculation unit 13 calculates the first feature quantity of the i-th frame in step S130 are described next.

(1) The first feature quantity calculation method uses the absolute value of the amplitude of the acoustic data. Specifically, as given by equation (2), the sum of the absolute values of the amplitude over the samples of the frame is used as the feature quantity E1[i] corresponding to the i-th frame.

As given by equation (3), the average may be used instead of the sum.
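Equations (2) and (3) are not reproduced in this text. From the description, equation (2) sums |D1[i][j]| over j = 0 to ND1, and equation (3) replaces the sum with an average. The Python sketch below follows that reading; dividing by the frame length N1 for the average is an assumption, and the function names are hypothetical.

```python
def abs_sum(frame):
    """Sum of absolute amplitudes over the frame (reading of equation (2))."""
    return sum(abs(v) for v in frame)


def abs_mean(frame):
    """Average absolute amplitude (reading of equation (3)); the divisor
    N1 = len(frame) is an assumption."""
    return abs_sum(frame) / len(frame)


frame = [1, -2, 3, -4]
print(abs_sum(frame), abs_mean(frame))
```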

(2) The second feature quantity calculation method uses the square of the amplitude of the acoustic data. Specifically, as given by equation (4), the sum of the squared amplitudes over the samples of the frame is used as the feature quantity E1[i] corresponding to the i-th frame.

As given by equation (5), the average may be used instead of the sum. The square root of the right-hand side of equation (4) or equation (5) may also be used as the feature quantity E1[i] corresponding to the i-th frame. The first and second calculation methods have the advantage of low computational cost.
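Equations (4) and (5) are likewise not reproduced here. Read from the description, equation (4) sums D1[i][j] squared over the frame and equation (5) averages it; taking the square root of the average gives the familiar root-mean-square level. The sketch below is illustrative, with hypothetical function names and an assumed divisor of N1 for the average.

```python
import math


def square_sum(frame):
    """Sum of squared amplitudes (reading of equation (4))."""
    return sum(v * v for v in frame)


def square_mean(frame):
    """Average squared amplitude (reading of equation (5));
    the divisor N1 = len(frame) is an assumption."""
    return square_sum(frame) / len(frame)


def rms(frame):
    """Square root of the average squared amplitude, one of the
    square-root variants the description permits."""
    return math.sqrt(square_mean(frame))


frame = [3, -4]
print(square_sum(frame), square_mean(frame), rms(frame))
```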

(3) The third feature quantity calculation method uses specific frequency components. A discrete Fourier transform (DFT) is applied to the i-th frame data D1[i][j], and the feature quantity E1[i] is calculated from the real part Re[k] and the imaginary part Im[k] (k = 0 to N1/2) of the output according to equation (6) or equation (7).

Equation (6) calculates the feature quantity E1[i] from specific frequency components of the amplitude spectrum of the acoustic data, and equation (7) calculates it from specific frequency components of the power spectrum. In these equations, FL is a predetermined constant giving the lower limit of the frequency components used and FH is a predetermined constant giving the upper limit, with 0 ≤ FL ≤ FH ≤ N1/2. The sum of the frequency components between FL and FH is calculated and used as the feature quantity E1[i]. FL and FH are set, for example, so that high frequency components (for example, 8 kHz and above) are excluded. As given by equation (8), the feature quantity E1[i] may also be calculated by multiplying each frequency component of the spectrum by a weighting coefficient w[k] defined for that frequency component.

The third method selects only specific frequency components. Compared with using all frequency components, this improves the correspondence between the feature quantity and the loudness perceived by a human listener. In particular, setting the weighting coefficient w[k] for each frequency component according to auditory characteristics yields a feature quantity close to perceived loudness.
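Equations (6) to (8) are not reproduced in this text. Read from the description, equation (6) sums the amplitude sqrt(Re[k]^2 + Im[k]^2) for k = FL to FH, equation (7) sums the power Re[k]^2 + Im[k]^2 over the same range, and equation (8) multiplies each term by w[k]. The Python sketch below follows that reading with a naive DFT from the standard library; the function names are hypothetical, and applying the weights to either the amplitude or the power term is an assumption.

```python
import cmath
import math


def dft_half(frame):
    """Naive DFT of a real frame, returning bins k = 0 .. N1/2."""
    n = len(frame)
    return [
        sum(frame[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n // 2 + 1)
    ]


def band_feature(frame, fl, fh, power=False, weights=None):
    """Sum spectral components for bins FL..FH.

    power=False : amplitude spectrum (reading of equation (6))
    power=True  : power spectrum (reading of equation (7))
    weights     : optional per-bin coefficients w[k] (reading of equation (8))
    """
    if not 0 <= fl <= fh <= len(frame) // 2:
        raise ValueError("require 0 <= FL <= FH <= N1/2")
    spectrum = dft_half(frame)
    total = 0.0
    for k in range(fl, fh + 1):
        mag2 = spectrum[k].real ** 2 + spectrum[k].imag ** 2
        term = mag2 if power else math.sqrt(mag2)
        total += term if weights is None else weights[k] * term
    return total


# A cosine that completes one cycle in 8 samples has all its energy in bin 1.
tone = [math.cos(2 * math.pi * j / 8) for j in range(8)]
print(band_feature(tone, 1, 1), band_feature(tone, 2, 4))
```

Limiting the band with FL and FH, or down-weighting bins with w[k], is how this method excludes components that contribute little to perceived loudness.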

The third calculation method described above uses a discrete Fourier transform (DFT), but it is not limited to this. Instead of a DFT, the specific frequency components may be extracted using, for example, a digital filter or an analog filter.

(4) The fourth feature quantity calculation method divides the i-th frame data into two subsections (groups), one earlier and one later in time, and uses the difference between volume-related values calculated for each subsection. The volume-related value of each subsection is calculated using any of the first to third calculation methods described above.

As an example, the case of using the first feature quantity calculation method is described. First, the i-th frame data D1[i][j] (j = 0 to ND1) is divided into two subsections, one earlier and one later in time. The earlier subsection 1 is denoted Da[i][j] (j = 0 to N1/2-1), and the later subsection 2 is denoted Db[i][j] (j = N1/2 to ND1). Next, the data of subsection 1 and of subsection 2 are each substituted into equation (2), with the summation range j = 0 to ND1 in equation (2) replaced by the start and end points of the respective subsection. Let Ea[i] be the result of substituting the earlier subsection 1 into equation (2) and Eb[i] the result of substituting the later subsection 2. Their difference is used as the feature quantity E1[i]; that is, E1[i] = Eb[i] - Ea[i] is calculated as the feature quantity.

If E1[i] is negative, the feature quantity may be set to "0". In the above example there is no gap between subsection 1 and subsection 2, but there may be a gap between them. Part of subsection 1 and part of subsection 2 may also overlap.
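The fourth method, in the variant that uses the first calculation method for each subsection, can be sketched as follows. The function name half_difference and the clamp parameter are hypothetical; the split at N1/2 with no gap or overlap matches the example in the text.

```python
def half_difference(frame, clamp=False):
    """E1[i] = Eb[i] - Ea[i], where Ea and Eb are the sums of absolute
    amplitudes over the earlier and later halves of the frame.

    clamp=True applies the optional rule that negative values become 0.
    """
    half = len(frame) // 2                   # boundary at j = N1/2
    ea = sum(abs(v) for v in frame[:half])   # subsection 1: j = 0 .. N1/2-1
    eb = sum(abs(v) for v in frame[half:])   # subsection 2: j = N1/2 .. ND1
    diff = eb - ea
    return max(diff, 0) if clamp else diff


print(half_difference([1, -1, 3, -3]))              # volume rises within the frame
print(half_difference([3, -3, 1, -1]))              # volume falls within the frame
print(half_difference([3, -3, 1, -1], clamp=True))  # negative value clamped to 0
```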

As described above, the frame data may also be generated using a Hamming window, a Gaussian window, or the like. In that case, the difference may be calculated after aligning the boundary point that separates the two subsections with the center point of the window (the point where the coefficient is largest). Acoustic data closer to the boundary between the two subsections is then weighted with a larger coefficient. That is, when the subsections are divided before and after the (N1/2)-th sample as in the above example, the difference is calculated with the largest coefficients applied to the acoustic data at (N1/2-1) and N1/2, which are closest to the boundary, and the smallest coefficients applied to the acoustic data at 0 and ND1, which are farthest from the boundary.

(5) The fifth feature quantity calculation method uses the difference between values indicating the volume of two adjacent frames. The value indicating the volume of a frame is a feature quantity obtained by any of the first to third calculation methods above. For example, when the feature quantity obtained by the first calculation method is used, the result of substituting the acoustic data of the (i-1)-th frame into equation (2) is held as E1'[i-1], and the result of substituting the acoustic data of the i-th frame into equation (2) is held as E1'[i]. The difference between E1'[i] and E1'[i-1] is then calculated; that is, E1[i] = E1'[i] - E1'[i-1] is calculated as the feature quantity. The fourth and fifth calculation methods make it easier to detect locations where the volume changes abruptly.
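The fifth method reduces to a first difference over the per-frame volume values E1'[i]. In the Python sketch below, the function name adjacent_difference is hypothetical, and setting E1[0] to 0 is an assumption, because the text does not say how the first frame, which has no predecessor, is handled.

```python
def adjacent_difference(levels):
    """Return E1[i] = E1'[i] - E1'[i-1] for a list of per-frame volume
    values E1'. E1[0] is set to 0.0 as an assumption (the source does
    not define it)."""
    if not levels:
        return []
    return [0.0] + [levels[i] - levels[i - 1] for i in range(1, len(levels))]


# Hypothetical per-frame volumes: a jump between frame 0 and frame 1.
print(adjacent_difference([2, 5, 4]))
```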

In the first to fifth calculation methods above, the obtained data may be normalized so that, for example, the maximum value of the feature quantity is 1 and the minimum value is 0.
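One way to realize this normalization is min-max scaling, sketched below in Python. The function name normalize is hypothetical, and returning all zeros for a constant series is an assumption, since the text does not cover that case.

```python
def normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the
    maximum maps to 1. A constant series maps to all zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(normalize([2, 4, 6]))
```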

The characteristics of the volume of an acoustic signal relating to music are described here. The volume of such a signal is closely related to the multi-layered structure of music: individual notes, ornaments on notes such as tremolo and vibrato, beats, measures, phrases, and large sections such as the intro and the chorus. Within this multi-layered structure, ornaments such as tremolo and vibrato and individual notes produce volume changes on very short time scales, whereas large structural sections such as the intro and the chorus produce volume changes on very long time scales. Embodiment 1 focuses on this characteristic, which has not been taken into account in the past.

For example, for a volume-related feature quantity of an acoustic signal relating to music, the position of maximum volume can be completely different depending on whether the volume is calculated for every 1-second section or for every 10-second section. Detecting the position of maximum volume using sections of only one length, as in the conventional techniques, often leads to misdetection of characteristic locations such as the chorus. In contrast, the acoustic signal analysis device 1 of Embodiment 1 calculates volume-related feature quantities for sections of two different lengths, as described below.

The first feature calculation unit 13 sets the frame length Tf1 so as to capture the volume over a relatively short time within the multi-layered structure described above. For example, it sets Tf1 to a duration roughly equal to the length of one note.

For example, when the music is in 4/4 time, the tempo of typical music usually falls between about 60 and 240 beats per minute, and notes ranging from a sixteenth note (1/4 beat) to a whole note (4 beats) are used frequently. The durations of these notes range from 62.5 msec (a sixteenth note at tempo 240) to 4 sec (a whole note at tempo 60), so the first feature calculation unit 13 sets the frame length Tf1 within that range. For example, it sets Tf1 to 500 msec, which corresponds to a quarter note at tempo 120.
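The note durations quoted above follow from simple arithmetic: one beat lasts 60 / tempo seconds. The following is a minimal sketch of that calculation; the function name `note_duration_sec` is hypothetical and introduced only for illustration.

```python
def note_duration_sec(tempo_bpm, beats):
    """Duration in seconds of a note lasting `beats` beats at `tempo_bpm` beats per minute."""
    return 60.0 / tempo_bpm * beats

shortest = note_duration_sec(240, 0.25)  # sixteenth note (1/4 beat) at tempo 240
longest = note_duration_sec(60, 4)       # whole note (4 beats) at tempo 60
example_tf1 = note_duration_sec(120, 1)  # quarter note (1 beat) at tempo 120
print(shortest, longest, example_tf1)    # 0.0625 4.0 0.5
```

The three values correspond to the 62.5 msec lower bound, the 4 sec upper bound, and the 500 msec example for Tf1.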

Next, the second feature calculation unit 14 is described. The second feature calculation unit 14 calculates a second feature related to volume.

The second feature calculation unit 14 calculates the second volume-related feature from the acoustic data generated by the acquisition unit 12. This feature describes the volume over a relatively long time section. The frame length Tf2 processed by the second feature calculation unit 14 is longer than the frame length Tf1 processed by the first feature calculation unit 13. The number of samples per frame processed by the second feature calculation unit 14 is N2 = Tf2 / Ts, which is larger than the number of samples per frame N1 processed by the first feature calculation unit 13.

The frame shift Tg2 used by the second feature calculation unit 14 may be the same as, or different from, the frame shift Tg1 used by the first feature calculation unit 13. In the following description, Tg2 = Q × Tg1, and the frame shift in samples used by the second feature calculation unit 14 is G2 = Q × G1, where Q is an integer of 1 or more. However, Tg2 and G2 are not limited to these values.

In accordance with an instruction from the control unit 11, the second feature calculation unit 14 starts the operation shown in the flowchart of FIG. 4. FIG. 4 is a flowchart showing the steps of the operation of the second feature calculation unit 14. As a comparison of FIG. 4 with FIG. 3 makes clear, the second feature calculation unit 14 operates in the same way as the first feature calculation unit 13.

First, the second feature calculation unit 14 calculates the total number of frames H2 using equation (1) above, with N1 replaced by N2, G1 replaced by G2, and H1 replaced by H2 (S200). M and N2 satisfy M > N2. The total number of frames H2 processed by the second feature calculation unit 14 is less than or equal to the total number of frames H1 processed by the first feature calculation unit 13.

Next, the second feature calculation unit 14 sets the control variable i to 0 (S210).

Next, the second feature calculation unit 14 generates the i-th frame data (S220). The i-th frame data consists of the acoustic data from x[i×G2] to x[i×G2 + N2 - 1]. The second feature calculation unit 14 may instead generate the i-th frame data by multiplying these samples by a window function, such as a Hamming, Hanning, Blackman, or Gaussian window. Using no window function is equivalent to multiplying the acoustic data by a rectangular window.

When a window function is used, its coefficient is normally largest at the center of the frame and smallest at the head and tail of the frame, but other shapes may be used. For example, the coefficient may be largest at the head of the frame (x[i×G2]), decrease progressively from there, and be smallest at the tail of the frame (x[i×G2 + N2 - 1]). The i-th frame data is denoted D2[i][j] (j = 0 to ND2, where ND2 = N2 - 1).
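The framing and windowing step (S220) can be sketched as follows. This is an illustration under stated assumptions rather than the device's actual implementation: the function names are hypothetical, the Hann window stands in for any of the center-weighted windows named in the text, and the linearly decaying window is just one possible head-weighted shape.

```python
import math

def hann_window(n):
    """Symmetric Hann window: largest at the center, zero at both ends."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def decaying_window(n):
    """Hypothetical head-weighted window: 1.0 at the head, falling linearly to 1/n at the tail."""
    return [(n - k) / n for k in range(n)]

def make_frame(x, i, shift, length, window=None):
    """Return frame i, i.e. samples x[i*shift] .. x[i*shift + length - 1], optionally windowed."""
    start = i * shift
    frame = x[start:start + length]
    if len(frame) < length:
        raise ValueError("frame runs past the end of the signal")
    if window is None:
        # No window function: equivalent to a rectangular window.
        return list(frame)
    return [s * w for s, w in zip(frame, window)]
```

For example, `make_frame(x, i, G2, N2, hann_window(N2))` would produce a center-weighted D2[i], while passing `decaying_window(N2)` emphasizes the head of the frame.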

Here, the head D1[i×Q][0] of the (i×Q)-th frame data processed by the first feature calculation unit 13 and the head D2[i][0] of the i-th frame data processed by the second feature calculation unit 14 are both x[i×G2] and therefore coincide. However, the heads of the frames need not be aligned in this way. For example, the centers of the frames or the ends of the frames may be aligned instead.

Next, the second feature calculation unit 14 calculates the feature of the i-th frame data in the same way that the first feature calculation unit 13 calculates the feature of its i-th frame data (S230). In doing so, it replaces ND1 used by the first feature calculation unit 13 with ND2, and D1 with D2.

Next, the second feature calculation unit 14 increments the control variable i by 1 (S240).

Next, the second feature calculation unit 14 determines whether the value of the control variable i is less than H2 (S250). If i is less than H2 (Yes in S250), the unit returns to step S220 and repeats the processing up to step S240. If i has reached H2 (No in S250), the unit ends the processing.

Through the processing described above, the second feature calculation unit 14 calculates H2 items of time-series data E2[i] (i = 0 to H2 - 1), which are the volume-related features, and notifies the control unit 11 that the processing has finished.
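Steps S200 to S250 can be sketched as a single loop. Two points in this sketch are assumptions, because the relevant formulas do not appear in this passage: the frame count stands in for equation (1) and uses the usual floor((M - N) / G) + 1 with M taken as the total number of samples, and the per-frame feature is taken to be the mean squared amplitude. The function name `frame_features` is hypothetical.

```python
def frame_features(x, frame_len, frame_shift):
    """Compute one volume-related feature per frame (steps S200 to S250)."""
    m = len(x)
    if m < frame_len:
        return []
    # S200: total number of frames.
    # ASSUMPTION: stands in for equation (1), which is not shown in this passage.
    total = (m - frame_len) // frame_shift + 1
    features = []
    i = 0                                    # S210
    while i < total:                         # S250
        start = i * frame_shift
        frame = x[start:start + frame_len]   # S220 (rectangular window)
        # S230. ASSUMPTION: mean squared amplitude as the volume feature.
        features.append(sum(s * s for s in frame) / frame_len)
        i += 1                               # S240
    return features

# Toy signal: a quiet half followed by a loud half.
x = [1.0] * 8 + [2.0] * 8
e1 = frame_features(x, frame_len=2, frame_shift=2)  # short frames (N1 = 2, G1 = 2)
e2 = frame_features(x, frame_len=8, frame_shift=4)  # long frames (N2 = 8, G2 = Q * G1, Q = 2)
print(len(e1), len(e2))  # 8 3
```

With these toy values H2 = 3 is no larger than H1 = 8, and the long-frame series changes more gradually than the short-frame series.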

Next, the frame length Tf2 processed by the second feature calculation unit 14 is described. As noted above, in typical music the volume varies because of a multi-layered structure with various time scales. The second feature calculation unit 14 sets the frame length Tf2 so as to capture the volume over a relatively long time. For example, it sets Tf2 to a length of one bar or more.

In particular, the chorus, which is the highlight of a piece of music, is often repeated in units of about 4 to 8 bars, and the volume is likely to be high for about 4 to 8 bars from the start of the chorus. Because the tempo of typical music usually falls between 60 and 240 beats per minute, the second feature calculation unit 14 sets the frame length Tf2 in the range of 4 to 32 seconds, which in 4/4 time spans 4 bars at tempo 240 to 8 bars at tempo 60. For example, it sets Tf2 to 8 seconds, which corresponds to 4 bars at tempo 120.
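The range for Tf2 can be checked with the same beat arithmetic used for note lengths. The sketch below assumes 4/4 time (four beats per bar); the function name `bars_duration_sec` is hypothetical.

```python
def bars_duration_sec(tempo_bpm, bars, beats_per_bar=4):
    """Duration in seconds of `bars` bars at `tempo_bpm`, assuming `beats_per_bar` beats per bar."""
    return 60.0 / tempo_bpm * beats_per_bar * bars

print(bars_duration_sec(240, 4))  # 4.0  (4 bars at the fastest tempo)
print(bars_duration_sec(60, 8))   # 32.0 (8 bars at the slowest tempo)
print(bars_duration_sec(120, 4))  # 8.0  (the example value for Tf2)
```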

The evaluation value calculation unit 15 calculates an evaluation value using the first feature calculated by the first feature calculation unit 13 and the second feature calculated by the second feature calculation unit 14. It calculates the evaluation value so that the value becomes larger as the first feature is larger and as the second feature temporally corresponding to that first feature is larger.

When the control unit 11 detects that the first feature calculation unit 13 and the second feature calculation unit 14 have finished processing, it instructs the evaluation value calculation unit 15 to start operating. The evaluation value calculation unit 15 then starts the operation shown in the flowchart of FIG. 5. FIG. 5 is a flowchart showing the steps of the operation of the evaluation value calculation unit 15.

First, the evaluation value calculation unit 15 sets the control variable i to 0 (S300).

Next, the evaluation value calculation unit 15 calculates the value to be set in the control variable j according to equation (9) (S310).

Here, floor() is a function that returns the integer obtained by discarding the fractional part. Q is the ratio of the frame shift used by the second feature calculation unit 14 to the frame shift used by the first feature calculation unit 13, and is an integer of 1 or more.
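Equation (9) itself is not reproduced in this text. Given that it uses floor() and Q, and that the (i×Q)-th short frame is aligned with the i-th long frame, a plausible reconstruction (an inference from the surrounding definitions, not a copy of the original equation) is:

```latex
j = \left\lfloor \frac{i}{Q} \right\rfloor \tag{9}
```

Under this reading, each long-frame index j is shared by Q consecutive short-frame indices i.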

Next, the evaluation value calculation unit 15 calculates the evaluation value α[i] corresponding to the control variable i by one of the methods described later (S320).

Next, the evaluation value calculation unit 15 increments the control variable i by 1 (S330).

Next, the evaluation value calculation unit 15 determines whether the control variable i is less than Q × H2, the product of Q and H2 (the number of time-series feature values calculated by the second feature calculation unit 14) (S340). If i is less than Q × H2 (Yes in S340), the unit returns to step S310 and repeats the processing up to step S330. If i has reached Q × H2 (No in S340), the unit ends the processing.

Through the processing described above, the evaluation value calculation unit 15 calculates the evaluation values α[i] (i = 0 to Q × H2 - 1), which form Q × H2 items of time-series data, and notifies the control unit 11 that the processing has finished.
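The loop of steps S300 to S340 can be sketched as below. The index mapping j = i // Q is an assumption about equation (9), which is not reproduced in this text, and the function name `evaluation_series` and the `combine` callback are hypothetical; `combine` stands for whichever calculation method is chosen for α[i].

```python
def evaluation_series(e1, e2, q, combine):
    """Compute alpha[i] for i = 0 .. q*len(e2) - 1 (steps S300 to S340)."""
    n = q * len(e2)
    if len(e1) < n:
        raise ValueError("e1 must contain at least q * len(e2) values")
    alpha = []
    i = 0                                     # S300
    while i < n:                              # S340
        j = i // q                            # S310 (ASSUMED form of equation (9))
        alpha.append(combine(e1[i], e2[j]))   # S320
        i += 1                                # S330
    return alpha

# Toy example with Q = 2, combining the two features by simple addition.
alpha = evaluation_series([1, 2, 3, 4, 5, 6], [10, 20, 30], 2, lambda a, b: a + b)
print(alpha)  # [11, 12, 23, 24, 35, 36]
```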

The evaluation value calculation unit 15 calculates the evaluation value α[i] by any one of the following methods.

(1) The first method of calculating the evaluation value adds the feature E1[i] calculated by the first feature calculation unit 13 to the temporally corresponding feature E2[j] calculated by the second feature calculation unit 14, as expressed by equation (10).

The sum of the feature E1[i] and the temporally corresponding feature E2[j] may also be multiplied by a predetermined value, and the result used as the evaluation value.

(2) The second method of calculating the evaluation value uses the sum of the feature E1[i] multiplied by a coefficient β1 and the temporally corresponding feature E2[j] multiplied by a coefficient β2, as expressed by equation (11), where β1 > 0 and β2 > 0. In other words, the second method weights the features E1 and E2 individually before adding them.

(3) The third method of calculating the evaluation value uses the sum of the logarithm of the feature E1[i] multiplied by the coefficient β1 and the logarithm of the temporally corresponding feature E2[j] multiplied by the coefficient β2, as expressed by equation (12). The first to third methods are used when the evaluation value should not become very small at locations where only one of E1 and E2 is small. The third method is additionally suited to cases where the value ranges of E1 and E2 differ greatly.

(4) The fourth method of calculating the evaluation value uses the product of the feature E1[i] and the temporally corresponding feature E2[j], as expressed by equation (13). The right-hand side of equation (13) may be further multiplied by a predetermined value, and the result used as the evaluation value.

(5) The fifth method of calculating the evaluation value uses the product of the feature E1[i] raised to the power γ1 and the temporally corresponding feature E2[j] raised to the power γ2, as expressed by equation (14). The fourth and fifth methods are used when the evaluation value should also be small whenever either E1 or E2 is small. The fifth method is additionally suited to cases where the influence of E1 and E2 on the evaluation value should be weighted. The right-hand side of equation (14) may be further multiplied by a predetermined value, and the result used as the evaluation value.

(6) The sixth method of calculating the evaluation value uses the sum of two terms, as expressed by equation (15): the feature E1[i] raised to the power γ1 and multiplied by the coefficient β1, and the temporally corresponding feature E2[j] raised to the power γ2 and multiplied by the coefficient β2. The right-hand side of equation (15) may be further multiplied by a predetermined value, and the result used as the evaluation value.
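Equations (10) to (15) are not reproduced in this text. Reconstructed from the verbal descriptions of the six methods (so the notation, and the unspecified base of the logarithm in (12), are assumptions), they would read:

```latex
\begin{align}
\alpha[i] &= E_1[i] + E_2[j] \tag{10}\\
\alpha[i] &= \beta_1 E_1[i] + \beta_2 E_2[j] \tag{11}\\
\alpha[i] &= \beta_1 \log E_1[i] + \beta_2 \log E_2[j] \tag{12}\\
\alpha[i] &= E_1[i]\, E_2[j] \tag{13}\\
\alpha[i] &= E_1[i]^{\gamma_1}\, E_2[j]^{\gamma_2} \tag{14}\\
\alpha[i] &= \beta_1 E_1[i]^{\gamma_1} + \beta_2 E_2[j]^{\gamma_2} \tag{15}
\end{align}
```

Methods (10) to (12) and (15) are additive, so one large feature can offset a small one; methods (13) and (14) are multiplicative, so a small value in either feature pulls the evaluation value down.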

The evaluation value calculation unit 15 may calculate the evaluation value by one of the first to sixth methods described above only when the condition E1[i] ≥ θ1 and E2[j] ≥ θ2 is satisfied (θ1 and θ2 are predetermined values), and set the evaluation value to 0 when the condition is not satisfied. In addition, after calculating the evaluation value α[i], the evaluation value calculation unit 15 may set α[i] to 0 when α[i] < θ3 (θ3 is a predetermined value).
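The two optional thresholding steps can be sketched as a wrapper around any of the calculation methods. The text presents them as independent options; combining both in one function, the default thresholds of 0, and the names `gated_evaluation`, `add`, and `mul` are choices made for this illustration only.

```python
def gated_evaluation(e1, e2, combine, theta1=0.0, theta2=0.0, theta3=0.0):
    """Apply the theta1/theta2 input thresholds, then the theta3 output threshold."""
    if e1 < theta1 or e2 < theta2:
        return 0.0                 # condition E1 >= theta1 and E2 >= theta2 not met
    alpha = combine(e1, e2)
    return 0.0 if alpha < theta3 else alpha

def add(e1, e2):
    """Additive combination (first method)."""
    return e1 + e2

def mul(e1, e2):
    """Multiplicative combination (fourth method)."""
    return e1 * e2

# A loud instant inside a quiet passage: E1 is large, E2 is small.
print(gated_evaluation(8.0, 0.5, add))  # 8.5
print(gated_evaluation(8.0, 0.5, mul))  # 4.0
# A location where both features are moderately large.
print(gated_evaluation(4.0, 4.0, add))  # 8.0
print(gated_evaluation(4.0, 4.0, mul))  # 16.0
```

The additive method ranks the isolated loud instant above the sustained loud location, while the multiplicative method ranks them the other way round, which illustrates why the product-based methods suit cases where a small value in either feature should lower the evaluation value.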

The evaluation value α[i] calculated by the methods described above becomes larger as the first feature E1[i] is larger and as the second feature E2[j] temporally corresponding to E1[i] is larger. Within the multi-layered structure of music, which has various time scales, the time section of the first feature corresponds to a time scale such as one note or one beat, and the time section of the second feature corresponds to a time scale of one bar or more. At a "feature position" of a piece of music, such as the start of the chorus, a point where the mood of the piece changes substantially, a position suitable for previewing, or a position that leaves a strong impression on the listener, the volume at the very beginning is high and the average volume over the following 4 to 8 bars is often high as well, so the evaluation value at such a location is large. Therefore, by detecting the maximum or a local maximum of the evaluation value, feature positions such as the start of the chorus can be detected accurately.

When the control unit 11 detects that the evaluation value calculation unit 15 has finished processing, it instructs the feature position detection unit 16 to start operating.

The feature position detection unit 16 uses the evaluation values calculated by the evaluation value calculation unit 15 to detect a characteristic position in the acoustic signal 2, such as the start of the chorus.

The feature position detection unit 16 detects the feature position by either of the following methods.

(1) The first method of detecting the feature position detects the frame (position) at which the evaluation value is largest. The unit searches the evaluation values α[i] (i = 0 to Q × H2 - 1) for the largest value and obtains the corresponding index Imax. The time corresponding to Imax (Tg1 × Imax) is taken as the feature position.

Instead of searching all of the calculated evaluation values for the maximum, the search range may be limited. That is, the position at which the evaluation value calculated by the evaluation value calculation unit 15 is largest may be detected within a continuous portion of the acoustic signal 2. Specifically, the maximum may be searched for among α[i] (i = H3 to H4, where H3 and H4 are integers satisfying 0 ≤ H3 < H4 < Q × H2 - 1). For example, H3 = 0 and H4 is set to a value corresponding to about 70% of the length of the piece. Alternatively, the first feature E1 and the second feature E2 may be calculated only from the acoustic signal 2 corresponding to a continuous portion of the piece, for example about 70% of the piece, and the position at which the evaluation value α calculated from them is largest may be detected. Searching for the maximum only among the evaluation values corresponding to a continuous portion of the acoustic signal 2 reduces the amount of processing and, for the following reason, improves the accuracy of feature position detection.

The chorus is often repeated several times within a single piece, but the nuances of the performance and singing are usually not identical each time and tend to differ subtly. An earlier chorus is often not yet at full intensity compared with a later chorus and still leaves some room to build. When part of a piece is played back as a preview, that part should ideally make the listener want to hear the whole piece. In that sense, an earlier chorus, which raises expectations for the climax still to come, is better suited to previewing than a later chorus that is already at full intensity. Limiting the range in which the maximum evaluation value is searched for to roughly the first 70% of the piece makes an earlier chorus more likely to be detected and improves the accuracy of detecting a feature position for previewing.

H3 may also be set to a suitable positive value so that the intro of the piece is excluded from the feature position search. Furthermore, instead of using the position of the maximum evaluation value itself as the feature position, the feature position may be a position a predetermined time before the position of the maximum, or a position before the maximum at which the evaluation value is smaller than the maximum by a predetermined amount. This prevents the very beginning of the chorus from being missed. The first detection method is suitable when a single feature position is to be detected in a piece, and has the advantage of requiring little processing.
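The first detection method, including the restricted search range and the option of backing off by a fixed time, can be sketched as follows. The function name `find_max_position` and the parameter `lead_sec` are hypothetical, and only the "predetermined time before the maximum" variant of the back-off is shown.

```python
def find_max_position(alpha, tg1, h3=0, h4=None, lead_sec=0.0):
    """Return (Imax, feature time in seconds) for the largest alpha[i] with h3 <= i <= h4."""
    if h4 is None:
        h4 = len(alpha) - 1
    if not (0 <= h3 <= h4 < len(alpha)):
        raise ValueError("search range must lie inside alpha")
    # On ties, max() keeps the earliest index, which favors an earlier chorus.
    imax = max(range(h3, h4 + 1), key=lambda i: alpha[i])
    # Optionally place the feature position a fixed time before the maximum.
    return imax, max(0.0, tg1 * imax - lead_sec)

alpha = [1, 5, 2, 9, 3, 9, 10, 4, 2, 1]
print(find_max_position(alpha, tg1=0.5))                      # (6, 3.0)
print(find_max_position(alpha, tg1=0.5, h4=5))                # (3, 1.5)
print(find_max_position(alpha, tg1=0.5, h4=5, lead_sec=1.0))  # (3, 0.5)
```

Restricting the range with `h4` changes the result from the global maximum near the end to the earlier peak, mirroring the rationale for limiting the search to the earlier part of the piece.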

(2) The second method of detecting the feature position detects positions at which the evaluation value is a local maximum, following the flowchart shown in FIG. 6. FIG. 6 is a flowchart showing the steps of the operation in which the feature position detection unit 16 executes the second detection method.

First, the feature position detection unit 16 sets the control variable i to the initial value H5 (S400). H5 is a predetermined integer satisfying 1 ≤ H5 < Q × H2 - 2. When local maxima are to be searched for among all of the evaluation values calculated by the evaluation value calculation unit 15, H5 = 1. When the intro of the piece or the like is to be excluded from the search, H5 > 1.

Next, the feature position detection unit 16 determines whether α[i] is a local maximum (S410). For example, α[i] is judged to be a local maximum if α[i] > α[i-1] and α[i] > α[i+1]. If the unit determines that α[i] is a local maximum (Yes in S410), it stores the evaluation value α[i] at that position and the value i of the control variable at that position (the index, that is, the time information) in a working memory inside the feature position detection unit 16 (S420).

Next, the feature position detection unit 16 increments the control variable i by 1 (S430). If the unit determined in step S410 that α[i] is not a local maximum (No in S410), it proceeds directly to this step.

Next, the feature position detection unit 16 determines whether the control variable i is less than or equal to a predetermined value H6 (S440). H6 is a predetermined integer satisfying H5 < H6 < Q × H2 - 1. When local maxima are to be searched for among all of the evaluation values calculated by the evaluation value calculation unit 15, H6 = Q × H2 - 2. When the later part of the piece is to be excluded from the search, for the reasons given above or otherwise, H6 < Q × H2 - 2, for example a value corresponding to 70% of the length of the piece. If the unit determines that i is less than or equal to H6 (Yes in S440), it returns to step S410 and repeats the processing up to step S430.

If the feature position detection unit 16 determines that the control variable i exceeds the predetermined value H6 (No in S440), it selects a predetermined number of local maximum positions from the local maximum information stored in the working memory (S450). For example, it selects the predetermined number of local maximum positions in descending order of evaluation value. The P local maximum positions (times) selected in this order are denoted Ip[v] (v = 0 to P - 1), so that α[Ip[0]] ≥ α[Ip[1]] ≥ α[Ip[2]] ≥ ... ≥ α[Ip[P-1]]. For example, when the evaluation value changes over time as shown in FIG. 7, the feature position detection unit 16 selects local maximum position A, which has the largest value, local maximum position B, which has the second largest value, and local maximum position C, which has the third largest value.

When selecting the predetermined number of local maximum positions in descending order of value, the feature position detection unit 16 may exclude those that are close in time to a position already selected. For example, it selects only local maxima that are at least a predetermined time away from every position already selected. The feature position detection unit 16 may also detect the positions at which the evaluation value calculated by the evaluation value calculation unit 15 is a local maximum within a continuous portion of the acoustic signal 2. This concludes the description of the second detection method, which is suitable when multiple feature positions are to be detected in a piece.
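The second detection method (steps S400 to S450), together with the optional exclusion of peaks that lie close to one already selected, can be sketched as follows. The function names and the `min_gap` parameter (a minimum separation expressed in frames rather than seconds) are hypothetical choices for this illustration.

```python
def find_local_maxima(alpha, h5=1, h6=None):
    """Collect (value, index) pairs for local maxima of alpha with h5 <= i <= h6 (S400 to S440)."""
    if h6 is None:
        h6 = len(alpha) - 2
    if h5 < 1 or h6 > len(alpha) - 2:
        raise ValueError("both neighbours of every tested index must exist")
    peaks = []
    i = h5                                                     # S400
    while i <= h6:                                             # S440
        if alpha[i] > alpha[i - 1] and alpha[i] > alpha[i + 1]:  # S410
            peaks.append((alpha[i], i))                        # S420
        i += 1                                                 # S430
    return peaks

def select_peaks(peaks, count, min_gap=0):
    """Pick up to `count` peak indices in descending order of value (S450),
    skipping any peak closer than `min_gap` frames to one already chosen."""
    chosen = []
    if count <= 0:
        return chosen
    for _value, idx in sorted(peaks, key=lambda p: p[0], reverse=True):
        if all(abs(idx - c) >= min_gap for c in chosen):
            chosen.append(idx)
        if len(chosen) == count:
            break
    return chosen

alpha = [0, 3, 1, 8, 2, 7, 6, 9, 0]
peaks = find_local_maxima(alpha)
print(select_peaks(peaks, 3))             # [7, 3, 5]
print(select_peaks(peaks, 3, min_gap=3))  # [7, 3]
```

Multiplying each returned index by Tg1 would convert it to a time, in the same way as for Imax in the first detection method.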

The feature position detection unit 16 outputs the position of the maximum evaluation value Imax, or the local maximum positions Ip[v] (v = 0 to P - 1), detected in this way to the outside of the acoustic signal analysis device 1 as feature position information 3. Playing back the acoustic signal 2 using the feature position information 3 makes it possible to play back a characteristic part of the piece, such as the chorus.

上述した実施の形態１の音響信号分析装置１は、二つの異なる区間長を用いて特徴的な箇所を検出する。以下に、その効果を図８から図１１を用いて説明する。 The acoustic signal analysis device 1 according to Embodiment 1 described above detects a characteristic location using two different section lengths. Below, the effect is demonstrated using FIGS. 8-11.

図８は、比較的短い区間長を用いて算出された第１の特徴量Ｅ１の変化の様子を示す模式図である。図８の横軸は、フレーム番号（時間）を示す。図８において、フレーム番号８からフレーム番号１６の区間がサビの区間である。一般的には、サビの区間の音量は、他の箇所に比べて大きい傾向にある。ただし、サビの区間であっても、図８のフレーム番号１０のＸ点のように、ボーカルの切れ目等で、音量が若干下がる場合がある。また、図８のフレーム番号２のＳ点のように、打楽器が強く演奏されたり、ボーカルのシャウト等が入るような箇所では、サビの区間以外で音量が瞬間的に大きな値となる場合がある。このような場合、特徴量の最大位置を特徴位置として検出すると、実際にはサビの区間ではないＳ点をサビの区間と検出する。それは、誤検出である。 FIG. 8 is a schematic diagram showing how the first feature amount E1, calculated using a relatively short section length, changes. The horizontal axis of FIG. 8 indicates the frame number (time). In FIG. 8, the section from frame number 8 to frame number 16 is the chorus section. In general, the volume in the chorus section tends to be higher than elsewhere. Even within the chorus section, however, the volume may drop slightly at a break in the vocals or the like, as at point X at frame number 10 in FIG. 8. Conversely, outside the chorus section the volume may momentarily become large where a percussion instrument is struck hard or a vocal shout occurs, as at point S at frame number 2 in FIG. 8. In such a case, if the position of the maximum feature amount is detected as the feature position, point S, which is not actually in the chorus section, is detected as the chorus section. That is a false detection.

図９は、比較的長い区間長を用いて算出された第２の特徴量Ｅ２の変化の様子を示す模式図である。図９のフレーム番号と図８のフレーム番号とは対応しており、同じフレーム番号の箇所は、同じ箇所を示している。図９においても、フレーム番号８からフレーム番号１６の区間がサビの区間である。図８と図９とを比較すると明らかなように、図９に示す第２の特徴量Ｅ２は、第1の特徴量Ｅ１よりなだらかに変化する。 FIG. 9 is a schematic diagram illustrating a change state of the second feature amount E2 calculated using a relatively long section length. The frame number in FIG. 9 and the frame number in FIG. 8 correspond to each other, and the same frame number indicates the same position. Also in FIG. 9, the section from frame number 8 to frame number 16 is the chorus section. As is apparent from a comparison between FIG. 8 and FIG. 9, the second feature quantity E2 shown in FIG. 9 changes more gently than the first feature quantity E1.

図８において値が最大であるＳ点は、図９ではあまり大きな値ではない。サビの区間において、第２の特徴量Ｅ２は大きな値をとることが多い。第２の特徴量Ｅ２は、サビの区間の先頭ではなく、サビの区間の途中で最大となることがある。図９の例では、第２の特徴量Ｅ２が最大となるのは、フレーム番号１２のＹ点である。その箇所はサビの区間に含まれているが、サビの区間の先頭（フレーム番号８）ではない。 Point S, which has the maximum value in FIG. 8, does not have a particularly large value in FIG. 9. In the chorus section, the second feature amount E2 often takes a large value. The second feature amount E2 may reach its maximum in the middle of the chorus section rather than at its head. In the example of FIG. 9, the second feature amount E2 is at its maximum at point Y at frame number 12. That location is within the chorus section, but it is not the head of the chorus section (frame number 8).

楽曲の試聴開始位置としては、サビ区間の先頭（Ｔ点）が検出されることが最も望ましいが、１種類の区間長を用いると、図８のように区間長が短くても、図９のように区間長が長くても、サビ区間の先頭を検出することができない場合がある。 As the trial listening start position of the music, it is most desirable to detect the head (T point) of the chorus section. However, if one section length is used, even if the section length is short as shown in FIG. Thus, even if the section length is long, the head of the chorus section may not be detected.

図１０は、第１の特徴量と第２の特徴量の和（Ｅ１＋Ｅ２）を評価値とした場合の模式図である。図１０は、図８及び図９と同じ範囲を示している。図１０では、特徴量の和（Ｅ１＋Ｅ２）は、サビの区間以外のＳ点と、サビ区間の途中のＺ点（フレーム番号１３）で比較的大きくなるものの、サビの区間の先頭のＴ点で最大となる。 FIG. 10 is a schematic diagram when the sum (E1 + E2) of the first feature value and the second feature value is used as the evaluation value. FIG. 10 shows the same range as FIG. 8 and FIG. In FIG. 10, the sum of the feature values (E1 + E2) is relatively large at the S point other than the chorus section and the Z point (frame number 13) in the middle of the chorus section, but at the leading T point of the chorus section. Maximum.

図１１は、第１の特徴量と第２の特徴量の積（Ｅ１×Ｅ２）を評価値とした場合の模式図である。図１１は、図８から図１０と同じ範囲を示している。図１１では、特徴量の積（Ｅ１×Ｅ２）は、サビの区間以外のＳ点と、サビの区間の途中のＹ点（フレーム番号１２）で比較的大きくなるものの、サビの区間の先頭のＴ点で最大となる。 FIG. 11 is a schematic diagram when the product (E1 × E2) of the first feature value and the second feature value is used as the evaluation value. FIG. 11 shows the same range as FIG. 8 to FIG. In FIG. 11, the product (E1 × E2) of the feature quantity is relatively large at the S point other than the chorus section and the Y point (frame number 12) in the middle of the chorus section, but at the beginning of the chorus section. Maximum at point T.

図１０及び図１１から明らかなように、区間長の異なる特徴量を組合せて評価値を算出することにより、サビの区間（サビの区間の先頭）の検出精度は向上する。そのため、実施の形態１の音響信号分析装置１は、特徴位置を精度よく検出するために、区間長の異なる特徴量を組合せて評価値を算出して特徴位置を検出する。 As is apparent from FIGS. 10 and 11, by calculating the evaluation value by combining the feature amounts having different section lengths, the detection accuracy of the chorus section (the head of the chorus section) is improved. Therefore, in order to detect the feature position with high accuracy, the acoustic signal analysis device 1 according to Embodiment 1 detects the feature position by calculating an evaluation value by combining feature amounts having different section lengths.
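The effect illustrated by FIGS. 8 to 11 can be reproduced with a small numerical sketch. The two curves below are made-up values shaped loosely like the figures (a spike at frame 2, a chorus from frame 8 to frame 16); they are not data from the patent, and the sketch assumes that E1 and E2 are already aligned on the same frame grid.

```python
# Hypothetical short-section feature E1: spike at frame 2 (point S),
# dip at frame 10 (point X) inside the chorus (frames 8 to 16).
E1 = [2, 2, 10, 2, 2, 2, 2, 3, 9.5, 7, 5, 7, 7, 7, 7, 7, 6, 3]
# Hypothetical long-section feature E2: smooth, peaking at frame 12 (point Y).
E2 = [2, 3, 4, 3, 2, 2, 3, 5, 8, 8.5, 9, 9.5, 10, 9.5, 9, 8, 7, 4]


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def evaluation_sum(e1, e2):
    """Evaluation value as the sum E1 + E2, as in FIG. 10."""
    return [a + b for a, b in zip(e1, e2)]


def evaluation_product(e1, e2):
    """Evaluation value as the product E1 x E2, as in FIG. 11."""
    return [a * b for a, b in zip(e1, e2)]


peak_short = argmax(E1)                            # falls on the spike S
peak_long = argmax(E2)                             # falls mid-chorus at Y
peak_sum = argmax(evaluation_sum(E1, E2))          # head of the chorus
peak_product = argmax(evaluation_product(E1, E2))  # head of the chorus
```

In this toy data, either feature alone misses the head of the chorus, while both combined evaluation values peak at frame 8.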

なお、実施の形態１では、２種類の時間長の区間を用いて、２種類の特徴量を算出し、それらを用いて評価値を算出したが、これに限定される訳ではない。例えば、３種類以上の時間長の区間を用いて、３種類以上の特徴量を算出し、それらを用いて評価値を算出してもよい。 In the first embodiment, two types of feature amounts are calculated using two types of time length sections, and an evaluation value is calculated using them. However, the present invention is not limited to this. For example, three or more types of feature amounts may be calculated using three or more types of time length sections, and an evaluation value may be calculated using them.

（実施の形態２） (Embodiment 2)
次に、実施の形態２の音響信号分析装置１を図１２を用いて説明する。図１２は、実施の形態２の音響信号分析装置１の構成図である。実施の形態２の音響信号分析装置１は、図１２に示すように、制御部１１と、取得部１２と、第１の特徴量算出部１３と、第２の特徴量算出部１４と、評価値算出部１５と、特徴位置検出部１６と、拍時間検出部１７とを有する。 Next, the acoustic signal analysis device 1 according to Embodiment 2 will be described with reference to FIG. 12. FIG. 12 is a configuration diagram of the acoustic signal analysis device 1 according to Embodiment 2. As shown in FIG. 12, the acoustic signal analysis device 1 according to Embodiment 2 includes a control unit 11, an acquisition unit 12, a first feature amount calculation unit 13, a second feature amount calculation unit 14, an evaluation value calculation unit 15, a feature position detection unit 16, and a beat time detection unit 17.

実施の形態２の音響信号分析装置１は、実施の形態１の音響信号分析装置１が有する構成部に加えて拍時間検出部１７を有する。その点が、実施の形態１と実施の形態２との相違点である。 The acoustic signal analysis device 1 according to the second embodiment includes a beat time detection unit 17 in addition to the components included in the acoustic signal analysis device 1 according to the first embodiment. This is the difference between the first embodiment and the second embodiment.

制御部１１は、取得部１２によって音響データが生成されたことを検知すると、第１の特徴量算出部１３及び第２の特徴量算出部１４に動作を開始するように指示する前に、拍時間検出部１７に動作を開始するように指示する。 When the control unit 11 detects that the acquisition unit 12 has generated the acoustic data, it instructs the beat time detection unit 17 to start operating before it instructs the first feature amount calculation unit 13 and the second feature amount calculation unit 14 to start operating.

拍時間検出部１７は、フレーム単位で処理を行う。拍時間検出部１７によって処理されるフレームの時間長をＴｆ３とし、拍時間検出部１７が動作する際のフレームシフトの時間長をＴｇ３とする。拍時間検出部１７によって処理されるフレームのサンプル数Ｎ３は、Ｎ３＝Ｔｆ３／Ｔｓであり、フレームシフトのサンプル数Ｇ３は、Ｇ３＝Ｔｇ３／Ｔｓである。拍時間を精度良く算出するために、Ｔｆ３及びＴｇ３は１拍の長さよりもかなり短い時間に設定される。一般的な音楽では、テンポが６０から２４０であり、１拍の時間長が２５０ｍｓｅｃから１ｓｅｃの範囲であることが多いので、Ｔｆ３及びＴｇ３は、５ｍｓｅｃから５０ｍｓｅｃ程度の範囲の適切な値に設定される。 The beat time detection unit 17 performs processing in units of frames. The time length of a frame processed by the beat time detection unit 17 is Tf3, and the time length of the frame shift used when the beat time detection unit 17 operates is Tg3. The number of samples N3 in a frame processed by the beat time detection unit 17 is N3 = Tf3/Ts, and the number of samples G3 in a frame shift is G3 = Tg3/Ts. To calculate the beat time accurately, Tf3 and Tg3 are set to times considerably shorter than the length of one beat. In typical music the tempo is 60 to 240 and the time length of one beat is often in the range of 250 msec to 1 sec, so Tf3 and Tg3 are set to appropriate values in the range of about 5 msec to 50 msec.

拍時間検出部１７は、図１３に示すフローチャートに従って処理を行う。図１３は、拍時間検出部１７の動作の各ステップを示すフローチャートである。 The beat time detector 17 performs processing according to the flowchart shown in FIG. FIG. 13 is a flowchart showing each step of the operation of the beat time detection unit 17.

拍時間検出部１７は、先ず、式（１）を用いてフレームの総数Ｈ７を算出する（Ｓ５００）。具体的には、拍時間検出部１７は、式（１）のＮ１をＮ３に置き換え、Ｇ１をＧ３に置き換え、Ｈ１をＨ７に置き換えて、フレームの総数Ｈ７を算出する。 First, the beat time detection unit 17 calculates the total number H7 of frames using Equation (1) (S500). Specifically, the beat time detection unit 17 replaces N1 in Equation (1) with N3, replaces G1 with G3, replaces H1 with H7, and calculates the total number H7 of frames.

次に、拍時間検出部１７は、制御変数ｉに「０」をセットする（Ｓ５１０）。 Next, the beat time detection unit 17 sets “0” in the control variable i (S510).

次に、拍時間検出部１７は、ｉ番目のフレームデータを生成する（Ｓ５２０）。具体的には、拍時間検出部１７は、音響データｘ［ｉ×Ｇ３］から音響データｘ［ｉ×Ｇ３＋Ｎ３−１］をｉ番目のフレームデータとして生成する。なお、拍時間検出部１７は、音響データｘ［ｉ×Ｇ３］から音響データｘ［ｉ×Ｇ３＋Ｎ３−１］までのデータに窓関数を掛け合わせた値をｉ番目のフレームデータとして生成してもよい。窓関数は、ハミング窓関数、ハニング窓関数、ブラックマン窓関数、又は、ガウス窓関数等である。最初に述べた方法は、音響データに矩形窓を掛け合わせることによりｉ番目のフレームデータを生成する方法と同じ方法であると言える。ｉ番目のフレームデータを「Ｄ３［ｉ］［ｊ］（ｊ＝０〜ＮＤ３、ただしＮＤ３＝Ｎ３−１）」と記載する。 Next, the beat time detection unit 17 generates the i-th frame data (S520). Specifically, the beat time detection unit 17 takes the acoustic data from x[i×G3] to x[i×G3+N3−1] as the i-th frame data. Alternatively, the beat time detection unit 17 may generate, as the i-th frame data, the values obtained by multiplying the data from x[i×G3] to x[i×G3+N3−1] by a window function. The window function is, for example, a Hamming window, a Hanning window, a Blackman window, or a Gaussian window. The first method can be regarded as equivalent to generating the i-th frame data by multiplying the acoustic data by a rectangular window. The i-th frame data is denoted "D3[i][j] (j = 0 to ND3, where ND3 = N3−1)".
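The frame generation of step S520 can be sketched as below. This is an illustrative example only: the function names are hypothetical, and the Hamming window uses the common 0.54 − 0.46·cos definition, which the patent does not spell out.

```python
import math


def hamming(n):
    """Hamming window of length n (common textbook definition, assumed here)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * j / (n - 1)) for j in range(n)]


def make_frame(x, i, frame_len, hop, window=None):
    """Return the i-th frame x[i*hop] .. x[i*hop + frame_len - 1].

    frame_len plays the role of N3 and hop the role of G3. With window=None
    the samples are returned unchanged, which corresponds to a rectangular
    window; otherwise each sample is multiplied by the window value.
    """
    start = i * hop
    segment = x[start:start + frame_len]
    if len(segment) < frame_len:
        raise IndexError("frame extends past the end of the acoustic data")
    if window is None:
        return list(segment)
    return [s * w for s, w in zip(segment, window)]


x = list(range(10))                            # stand-in for acoustic data x[m]
plain = make_frame(x, 1, 4, 3)                 # samples x[3] .. x[6]
windowed = make_frame(x, 1, 4, 3, hamming(4))  # same samples, tapered
```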

次に、拍時間検出部１７は、ｉ番目のフレームの特徴量を算出する（Ｓ５３０）。具体的には、拍時間検出部１７は、第１の特徴量算出部１３が特徴量を算出する際に用いる第４又は第５の算出方法を用いて、特徴量を算出する。すなわち、拍時間検出部１７は、音響データの振幅又は音響データの特定の周波数成分を用いて、フレーム内又はフレーム間の差を算出し、特徴量Ｅ３［ｉ］を算出する。 Next, the beat time detection unit 17 calculates the feature amount of the i-th frame (S530). Specifically, the beat time detection unit 17 calculates the feature amount using the fourth or fifth calculation method that the first feature amount calculation unit 13 uses to calculate its feature amount. That is, the beat time detection unit 17 calculates a difference within a frame or between frames using the amplitude of the acoustic data or a specific frequency component of the acoustic data, and from it calculates the feature amount E3[i].

次に、拍時間検出部１７は、制御変数ｉの値を「１」増やす（Ｓ５４０）。 Next, the beat time detecting unit 17 increases the value of the control variable i by “1” (S540).

次に、拍時間検出部１７は、制御変数ｉの値がＨ７未満であるか否かを判定する（Ｓ５５０）。拍時間検出部１７は、制御変数ｉの値がＨ７未満であると判定すると（Ｓ５５０でＹｅｓ）、ステップＳ５２０に戻ってステップＳ５４０までの処理を繰り返す。 Next, the beat time detector 17 determines whether or not the value of the control variable i is less than H7 (S550). When determining that the value of the control variable i is less than H7 (Yes in S550), the beat time detection unit 17 returns to step S520 and repeats the processing up to step S540.

拍時間検出部１７は、制御変数ｉの値がＨ７であると判定すると（Ｓ５５０でＮｏ）、特徴量Ｅ３［ｉ］（ｉ＝０〜Ｈ７−１）の自己相関を算出する（Ｓ５６０）。拍時間検出部１７は、自己相関のインデックスの差Δを所定のテンポの範囲で順次変えながら、下記の式（１６）に従って自己相関Ｙ（Δ）を算出する。 When determining that the value of the control variable i is H7 (No in S550), the beat time detection unit 17 calculates the autocorrelation of the feature amount E3 [i] (i = 0 to H7-1) (S560). The beat time detector 17 calculates the autocorrelation Y (Δ) according to the following equation (16) while sequentially changing the autocorrelation index difference Δ within a predetermined tempo range.
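Equation (16) is not reproduced in this text. A form consistent with the index bounds that the description gives for H8 and H9 is the plain autocorrelation sum below; whether the original equation also includes a normalization factor is not recoverable from the text, so this should be read as an assumed reconstruction.

```latex
Y(\Delta) = \sum_{i=H8}^{H9} E3[i] \, E3[i+\Delta]
```

The condition H9 ≦ H7−1−Δ guarantees that the index i+Δ never exceeds H7−1, the last available element of E3.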

Ｈ８は、０≦Ｈ８＜Ｈ９を満たす所定値であり、Ｈ９は、Ｈ８＜Ｈ９≦Ｈ７−１−Δを満たす所定値である。例えば、テンポの検出範囲が６０から２４０である場合、Ｅ３はＴｇ３の時間間隔で生成されているので、Δ＝（２５０／Ｔｇ３）から（１０００／Ｔｇ３）の範囲でΔは変えられる。Ｔｇ３は、ｍｓｅｃ単位の値である。 H8 is a predetermined value that satisfies 0 ≦ H8 < H9, and H9 is a predetermined value that satisfies H8 < H9 ≦ H7−1−Δ. For example, when the tempo detection range is 60 to 240, E3 is generated at time intervals of Tg3, so Δ is varied over the range from (250/Tg3) to (1000/Tg3). Tg3 is a value in msec.

次に、拍時間検出部１７は、自己相関Ｙ（Δ）のピーク位置を検出して、拍の時間長τを算出する（Ｓ５７０）。ステップＳ５６０において算出された自己相関Ｙ（Δ）は、図１４に示すように、いくつかのピークを持っている。拍時間検出部１７は、検出対象の最短の拍から検出対象の最長の拍の間で最大値の位置Δｍａｘを検出し、τ＝Ｔｇ３×Δｍａｘを１拍の時間長とする。なお、図１４において、「Ｐ」は検出対象の最短の拍に相当するΔであり、「Ｒ」は検出対象の最長の拍に相当するΔである。 Next, the beat time detector 17 detects the peak position of the autocorrelation Y (Δ) and calculates the beat time length τ (S570). The autocorrelation Y (Δ) calculated in step S560 has several peaks as shown in FIG. The beat time detection unit 17 detects the maximum value position Δmax between the shortest beat of the detection target and the longest beat of the detection target, and sets τ = Tg3 × Δmax as the time length of one beat. In FIG. 14, “P” is Δ corresponding to the shortest beat to be detected, and “R” is Δ corresponding to the longest beat to be detected.

また、図１５に示すように、拍の時間長の存在確率を示す分布Ω（Δ）が用意されており、拍時間検出部１７は、自己相関Ｙ（Δ）と分布Ω（Δ）との積（Ω（Δ）Ｙ（Δ））を算出した後に、その最大値の位置を検出し、それにより一拍の時間長を検出してもよい。拍時間検出部１７は、Ω（Δ）を用いることにより、更に精度良く拍の時間長を算出することができる。なお、図１５において、「Ｐ」は検出対象の最短の拍に相当するΔであり、「Ｕ」は拍の存在確率が最大となるΔであり、「Ｒ」は検出対象の最長の拍に相当するΔである。 Alternatively, as shown in FIG. 15, a distribution Ω(Δ) indicating the existence probability of the beat time length may be prepared, and the beat time detection unit 17 may calculate the product of the autocorrelation Y(Δ) and the distribution Ω(Δ), that is, Ω(Δ)Y(Δ), and then detect the position of its maximum value to obtain the time length of one beat. By using Ω(Δ), the beat time detection unit 17 can calculate the beat time length more accurately. In FIG. 15, "P" is the Δ corresponding to the shortest beat to be detected, "U" is the Δ at which the existence probability of the beat is greatest, and "R" is the Δ corresponding to the longest beat to be detected.
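Steps S560 and S570, including the optional weighting by Ω(Δ), can be sketched as follows. This is a minimal illustration under stated assumptions: the autocorrelation is taken as an unnormalized sum of products, H8 is fixed at 0, H9 is chosen so that every tested lag stays in range, and the `prior` callback is a hypothetical stand-in for the distribution Ω(Δ).

```python
def autocorrelation(e3, lag, h8, h9):
    """Sum of E3[i] * E3[i + lag] for i = h8 .. h9 (assumed unnormalized form)."""
    return sum(e3[i] * e3[i + lag] for i in range(h8, h9 + 1))


def beat_length_ms(e3, tg3_ms, min_beat_ms=250.0, max_beat_ms=1000.0, prior=None):
    """Estimate the time length of one beat, tau = Tg3 * lag_at_maximum.

    e3      : feature sequence E3[i], one value per frame shift Tg3
    tg3_ms  : frame shift Tg3 in msec
    prior   : optional function lag -> weight, playing the role of Omega
    """
    lag_min = max(1, round(min_beat_ms / tg3_ms))
    lag_max = round(max_beat_ms / tg3_ms)
    h8 = 0
    h9 = len(e3) - 1 - lag_max   # keeps i + lag inside e3 for every lag tested
    if h9 <= h8:
        raise ValueError("feature sequence too short for the lag range")
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = autocorrelation(e3, lag, h8, h9)
        if prior is not None:
            score *= prior(lag)
        if score > best_score:   # strict test: the shortest lag wins ties
            best_lag, best_score = lag, score
    return tg3_ms * best_lag


# Synthetic onset-like feature: one pulse every 50 frames, Tg3 = 10 msec.
pulses = [1.0 if i % 50 == 0 else 0.0 for i in range(1000)]
tau = beat_length_ms(pulses, 10.0)
```

A strictly periodic input produces equal autocorrelation peaks at multiples of the beat period, which is one reason a weighting such as Ω(Δ) is useful for choosing among them.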

拍時間検出部１７は、このようにして検出した拍の時間長τを制御部１１に通知する。 The beat time detection unit 17 notifies the control unit 11 of the beat time length τ thus detected.

制御部１１は、τ１＝λ１×τ、τ２＝λ２×τの２つの数値を計算する。λ１及びλ２は、λ１＜λ２を満たす、所定の係数である。 The control unit 11 calculates two numerical values of τ1 = λ1 × τ and τ2 = λ2 × τ. λ1 and λ2 are predetermined coefficients that satisfy λ1 <λ2.

そして、制御部１１は、Ｔｆ１＝τ１とするように、第１の特徴量算出部１３に指示するとともに、Ｔｆ２＝τ２とするように、第２の特徴量算出部１４に指示する。その後、制御部１１は、第１の特徴量算出部１３及び第２の特徴量算出部１４に対して動作を開始するように指示する。第１の特徴量算出部１３は、拍時間検出部１７によって検出された一拍の時間長に基づくτ１をフレームの時間長Ｔｆ１に設定し、第２の特徴量算出部１４は、拍時間検出部１７によって検出された一拍の時間長に基づくτ２をフレームの時間長Ｔｆ２に設定する。それ以降の各部の動作は、実施の形態１において説明した動作と同じである。 The control unit 11 then instructs the first feature amount calculation unit 13 to set Tf1 = τ1 and instructs the second feature amount calculation unit 14 to set Tf2 = τ2. After that, the control unit 11 instructs the first feature amount calculation unit 13 and the second feature amount calculation unit 14 to start operating. The first feature amount calculation unit 13 sets τ1, which is based on the time length of one beat detected by the beat time detection unit 17, as the frame time length Tf1, and the second feature amount calculation unit 14 sets τ2, which is likewise based on the detected time length of one beat, as the frame time length Tf2. The subsequent operation of each unit is the same as that described in Embodiment 1.

実施の形態２の音響信号分析装置１は、音量に関する特徴量を算出する際の区間長を、その音楽の拍の時間長に基づいて設定するので、様々なジャンルやタイプの音楽に対しても、精度良く特徴位置を検出することができる。 Since the acoustic signal analysis apparatus 1 according to the second embodiment sets the section length when calculating the feature quantity related to the volume based on the time length of the beat of the music, it can be applied to music of various genres and types. The feature position can be detected with high accuracy.

なお、上述した各実施の形態の音響信号分析装置１の各構成部の機能は、例えばコンピュータのＣＰＵ（プロセッサ）及びメモリ等のハードウェアと、その機能を実現するためのコンピュータプログラムとが協働することによって実現される。しかしながら、上記各機能は、専用の回路により実現される等、どのような形態により実現されてもよい。また、音響信号分析装置１の各構成部の機能を実現するためのコンピュータプログラムは、記録媒体に格納されてもよい。 The functions of the components of the acoustic signal analysis device 1 in each of the embodiments described above are realized, for example, by hardware such as a computer's CPU (processor) and memory working together with a computer program that implements those functions. However, each function may be realized in any form, for example by a dedicated circuit. A computer program for realizing the functions of the components of the acoustic signal analysis device 1 may also be stored on a recording medium.

（実施の形態３） (Embodiment 3)
一般的に、楽曲のサビや盛り上がる箇所といった楽曲の特徴的な箇所では、複数の楽器や歌唱が同時に演奏されることが多く、その特徴的な箇所の音響信号は、周波数帯域の幅が広いことが多い。言い換えると、特徴的な箇所の音響信号は、低域から高域までの幅広い周波数成分が含まれることが多い。実施の形態３の音響信号分析装置は、従来は考慮されていなかった上記の特徴的な箇所の音響信号の周波数帯域の性質に着目し、楽曲の特徴的な箇所を精度良く検出する。 In general, at characteristic parts of a piece of music, such as the chorus or a climactic passage, multiple instruments and vocals are often performed at the same time, and the acoustic signal at such a part often has a wide frequency band. In other words, the acoustic signal at a characteristic part often contains a wide range of frequency components from low to high. The acoustic signal analysis device of Embodiment 3 focuses on this frequency-band property of the acoustic signal at characteristic parts, which has not been considered conventionally, and accurately detects the characteristic parts of the music.

先ず、実施の形態３の音響信号分析装置１０１を図１６を用いて説明する。図１６は、実施の形態３の音響信号分析装置１０１の構成図である。実施の形態３の音響信号分析装置１０１は、図１６に示すように、制御部１１１と、取得部１１２と、周波数帯域データ算出部１１３と、平滑化部１１４と、特徴位置検出部１１５とを有する。 First, the acoustic signal analysis device 101 of Embodiment 3 will be described with reference to FIG. 16. FIG. 16 is a configuration diagram of the acoustic signal analysis device 101 of Embodiment 3. As shown in FIG. 16, the acoustic signal analysis device 101 of Embodiment 3 includes a control unit 111, an acquisition unit 112, a frequency band data calculation unit 113, a smoothing unit 114, and a feature position detection unit 115.

音響信号分析装置１０１は、音響信号１０２を取得し、特徴位置情報１０３を出力する。 The acoustic signal analyzer 101 acquires the acoustic signal 102 and outputs the characteristic position information 103.

音響信号１０２は、音楽に係る音響信号である。音響信号１０２はデジタル信号であってもよいし、アナログ信号であってもよい。音響信号１０２は、楽曲だけの信号ではなく、ラジオ又はテレビ等の音楽番組の音響信号のように、楽曲の他にＤＪ等の楽曲以外の音を含む信号であってもよい。音響信号１０２は音響信号分析装置１０１の外部に存在する。しかしながら、音響信号分析装置１０１に記憶部が設けられていれば、音響信号１０２はその記憶部に格納されて音響信号分析装置１０１の内部に存在していてもよい。 The acoustic signal 102 is an acoustic signal related to music. The acoustic signal 102 may be a digital signal or an analog signal. The sound signal 102 may be a signal including sound other than music such as DJ in addition to music, such as an audio signal of a music program such as radio or television, instead of a signal only of music. The acoustic signal 102 exists outside the acoustic signal analyzer 101. However, if the acoustic signal analyzer 101 is provided with a storage unit, the acoustic signal 102 may be stored in the storage unit and exist inside the acoustic signal analyzer 101.

特徴位置情報１０３は、音響信号１０２の周波数帯域の幅が広い箇所を特定する情報である。その箇所は、楽曲のサビの位置又は楽曲の構成もしくは楽器の編成が大きく変化する箇所、すなわち楽曲の特徴的な箇所と一致する場合が多い。 The feature position information 103 is information that identifies a location where the frequency band of the acoustic signal 102 is wide. That location often coincides with the chorus of the music or with a location where the structure of the music or the instrumentation changes greatly, that is, with a characteristic part of the music.

音響信号分析装置１０１の制御部１１１は、音響信号分析装置１０１を構成する他の各部と情報を交換して各部を制御する。 The control unit 111 of the acoustic signal analysis device 101 exchanges information with other units constituting the acoustic signal analysis device 101 to control each unit.

取得部１１２は、音響信号１０２を取得し、取得した音響信号１０２から、サンプリング周期Ｔｓ（サンプリング周波数Ｆｓ＝１／Ｔｓ）でサンプリングしたＰＣＭ（ＰｕｌｓｅＣｏｄｅＭｏｄｕｌａｔｉｏｎ）データを生成する。取得部１１２は、音響信号１０２がアナログ信号である場合、アナログ信号をデジタル信号に変換してＰＣＭデータを生成し、音響信号１０２がＰＣＭ以外のデジタル圧縮信号である場合、デジタル圧縮信号をデコードしてＰＣＭデータを生成する。また、音響信号１０２がデジタル信号であって、そのサンプリング周期が上記のサンプリング周期Ｔｓと異なる場合、取得部１１２は、サンプリングレートを変換してサンプリング周期ＴｓのＰＣＭデータを生成する。 The acquisition unit 112 acquires the acoustic signal 102 and generates from it PCM (Pulse Code Modulation) data sampled at the sampling period Ts (sampling frequency Fs = 1/Ts). When the acoustic signal 102 is an analog signal, the acquisition unit 112 converts the analog signal into a digital signal to generate the PCM data. When the acoustic signal 102 is a compressed digital signal other than PCM, the acquisition unit 112 decodes the compressed digital signal to generate the PCM data. When the acoustic signal 102 is a digital signal whose sampling period differs from the sampling period Ts, the acquisition unit 112 converts the sampling rate to generate PCM data with the sampling period Ts.

以下の説明では、取得部１１２によって生成されるＰＣＭデータを、音響データｘ［ｍ］（ｍ＝０〜Ｍ−１、Ｍは音響データのサンプル総数）、又は音響データと記載する。取得部１１２は、音響データの生成を終了すると、その旨を制御部１１１に通知する。実施の形態３では、周波数帯域データ算出部１１３、平滑化部１１４、及び、特徴位置検出部１１５は、取得部１１２が音響データの全部を生成した後に、動作を開始する。しかしながら、周波数帯域データ算出部１１３、平滑化部１１４、及び、特徴位置検出部１１５は、取得部１１２が音響データの一部を生成した後に、動作を開始してもよい。 In the following description, PCM data generated by the acquisition unit 112 is described as acoustic data x [m] (m = 0 to M−1, where M is the total number of samples of acoustic data) or acoustic data. When the acquisition unit 112 finishes generating the acoustic data, the acquisition unit 112 notifies the control unit 111 accordingly. In Embodiment 3, the frequency band data calculation unit 113, the smoothing unit 114, and the feature position detection unit 115 start operating after the acquisition unit 112 generates all of the acoustic data. However, the frequency band data calculation unit 113, the smoothing unit 114, and the feature position detection unit 115 may start operating after the acquisition unit 112 generates part of the acoustic data.

周波数帯域データ算出部１１３は、取得部１１２によって生成された音響データから、周波数帯域の幅に関する時系列データを算出する。周波数帯域データ算出部１１３は、フレーム単位で処理を行う。しかしながら、処理の単位はそれに限定されない。 The frequency band data calculation unit 113 calculates time-series data related to the width of the frequency band from the acoustic data generated by the acquisition unit 112. The frequency band data calculation unit 113 performs processing in units of frames. However, the unit of processing is not limited thereto.

以下では、周波数帯域データ算出部１１３によって処理される各フレームの時間長をＴｆ１１とし、フレームシフトの時間長をＴｇ１１とする。このとき、フレームのサンプル数Ｎ１１＝Ｔｆ１１／Ｔｓとなり、フレームシフトのサンプル数Ｇ１１＝Ｔｇ１１／Ｔｓとなる。フレームシフトは、隣り合ったフレームの先頭の時間差である。隣り合ったフレームは、一部が重なっていてもよいし、重なっていなくてもよい。 In the following, it is assumed that the time length of each frame processed by the frequency band data calculation unit 113 is Tf11 and the time length of the frame shift is Tg11. At this time, the frame sample number N11 = Tf11 / Ts, and the frame shift sample number G11 = Tg11 / Ts. Frame shift is the time difference between the heads of adjacent frames. Adjacent frames may partially overlap or may not overlap.

周波数帯域データ算出部１１３は、制御部１１１の指示に従って、図１７のフローチャートに示す動作を開始する。図１７は、周波数帯域データ算出部１１３の動作の各ステップを示すフローチャートである。 The frequency band data calculation unit 113 starts the operation shown in the flowchart of FIG. 17 in accordance with an instruction from the control unit 111. FIG. 17 is a flowchart showing each step of the operation of the frequency band data calculation unit 113.

周波数帯域データ算出部１１３は、先ず、下記の式（１７）に従って、フレームの総数Ｈ１１を算出する（Ｓ６００）。 First, the frequency band data calculation unit 113 calculates the total number H11 of frames according to the following equation (17) (S600).
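Equation (17) is not reproduced in this text. Given that the i-th frame spans x[i×G11] to x[i×G11+N11−1] and that M is the total number of samples, the usual frame count takes the form below; this is offered as an assumed reconstruction rather than the original equation.

```latex
H11 = \mathrm{floor}\!\left( \frac{M - N11}{G11} \right) + 1
```

With this count, the last frame (i = H11−1) ends at or before sample M−1, so no frame reads past the end of the acoustic data.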

ｆｌｏｏｒ（）は、小数点以下を切り捨てた整数を返す関数である。ＭとＮ１１との関係は、Ｍ＞Ｎ１１である。 floor() is a function that returns the integer obtained by discarding the fractional part. M and N11 satisfy M > N11.

次に、周波数帯域データ算出部１１３は、制御変数ｉに「０」をセットする（Ｓ６１０）。 Next, the frequency band data calculation unit 113 sets “0” to the control variable i (S610).

次に、周波数帯域データ算出部１１３は、ｉ番目のフレームデータを生成する（Ｓ６２０）。ｉ番目のフレームデータは、音響データｘ［ｉ×Ｇ１１］から音響データｘ［ｉ×Ｇ１１＋Ｎ１１−１］までのデータである。なお、周波数帯域データ算出部１１３は、音響データｘ［ｉ×Ｇ１１］から音響データｘ［ｉ×Ｇ１１＋Ｎ１１−１］までのデータに窓関数を掛け合わせた値をｉ番目のフレームデータとして生成してもよい。窓関数は、ハミング窓関数、ハニング窓関数、ブラックマン窓関数、又は、ガウス窓関数等である。最初に述べた方法は、音響データに矩形窓を掛け合わせることによりｉ番目のフレームデータを生成する方法と同じ方法であると言える。ｉ番目のフレームデータを「Ｄ１１［ｉ］［ｊ］（ｊ＝０〜ＮＤ１１、ただしＮＤ１１＝Ｎ１１−１）」と記載する。 Next, the frequency band data calculation unit 113 generates the i-th frame data (S620). The i-th frame data is the data from x[i×G11] to x[i×G11+N11−1]. Alternatively, the frequency band data calculation unit 113 may generate, as the i-th frame data, the values obtained by multiplying the data from x[i×G11] to x[i×G11+N11−1] by a window function. The window function is, for example, a Hamming window, a Hanning window, a Blackman window, or a Gaussian window. The first method can be regarded as equivalent to generating the i-th frame data by multiplying the acoustic data by a rectangular window. The i-th frame data is denoted "D11[i][j] (j = 0 to ND11, where ND11 = N11−1)".

次に、周波数帯域データ算出部１１３は、公知の離散フーリエ変換（ＤＦＴ）を用いて、ｉ番目のフレームデータの周波数を分析して周波数スペクトルを算出する（Ｓ６３０）。周波数スペクトルは、振幅スペクトルとパワースペクトルのいずれであってもよい。周波数スペクトルの強度は、リニア（線形）スケールにより表現されてもよいし、対数スケールにより表現されてもよい。 Next, the frequency band data calculating unit 113 calculates the frequency spectrum by analyzing the frequency of the i-th frame data using a known discrete Fourier transform (DFT) (S630). The frequency spectrum may be either an amplitude spectrum or a power spectrum. The intensity of the frequency spectrum may be expressed by a linear (linear) scale or a logarithmic scale.

周波数帯域データ算出部１１３は、離散フーリエ変換の代わりにウェーブレット変換、又はフィルタバンク等の方法を用いてもよい。ｉ番目のフレームの周波数スペクトルを「Ｓ［ｉ］［ｋ］（ｋ＝０〜Ｎ１１／２）」と記載する。ｋ＝０は最も周波数の低い直流成分に対応し、ｋ＝Ｎ１１／２は、最も高い周波数であるサンプリング周波数Ｆｓの半分の周波数に対応し、その間の成分は、ｋ×（Ｆｓ／Ｎ１１）周波数に対応する。なお、周波数を示す軸は、リニアスケールではなく、対数スケールであってもよい。 The frequency band data calculation unit 113 may use a method such as a wavelet transform or a filter bank instead of the discrete Fourier transform. The frequency spectrum of the i-th frame is denoted "S[i][k] (k = 0 to N11/2)". k = 0 corresponds to the DC component, which is the lowest frequency; k = N11/2 corresponds to the highest frequency, which is half the sampling frequency Fs; and each component in between corresponds to the frequency k × (Fs/N11). The frequency axis may use a logarithmic scale instead of a linear scale.
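The spectrum calculation of step S630 and the bin-to-frequency mapping can be sketched with a direct (non-fast) DFT. This is only an illustration: the function names are hypothetical, an amplitude spectrum on a linear scale is chosen from the options the text allows, and a practical implementation would normally use an FFT.

```python
import cmath
import math


def amplitude_spectrum(frame):
    """Amplitude spectrum S[k] for k = 0 .. N/2 of one frame, via a direct DFT."""
    n = len(frame)
    return [abs(sum(frame[j] * cmath.exp(-2j * cmath.pi * k * j / n)
                    for j in range(n)))
            for k in range(n // 2 + 1)]


def bin_frequency(k, fs, n):
    """Frequency in Hz of component k: k * (Fs / N11)."""
    return k * fs / n


# A cosine with exactly 4 cycles in a 32-sample frame lands on bin k = 4.
frame = [math.cos(2 * math.pi * 4 * j / 32) for j in range(32)]
spectrum = amplitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
peak_hz = bin_frequency(peak_bin, 8000, 32)  # assuming Fs = 8000 Hz
```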

次に、周波数帯域データ算出部１１３は、ｉ番目のフレームの周波数スペクトルの帯域幅に関する指標（周波数帯域の幅広さを示す指標）Ｅ１１［ｉ］を後述する方法を用いて算出する（Ｓ６４０）。 Next, the frequency band data calculation unit 113 calculates an index (an index indicating the width of the frequency band) E11 [i] related to the frequency spectrum bandwidth of the i-th frame using a method described later (S640).

次に、周波数帯域データ算出部１１３は、制御変数ｉの値を「１」増やす（Ｓ６５０）。 Next, the frequency band data calculation unit 113 increases the value of the control variable i by “1” (S650).

次に、周波数帯域データ算出部１１３は、制御変数ｉの値がＨ１１未満であるか否かを判定する（Ｓ６６０）。周波数帯域データ算出部１１３は、制御変数ｉの値がＨ１１未満であれば（Ｓ６６０でＹｅｓ）、ステップＳ６２０に戻ってステップＳ６５０までの処理を繰り返し、制御変数ｉの値がＨ１１であれば（Ｓ６６０でＮｏ）、処理を終了する。 Next, the frequency band data calculation unit 113 determines whether or not the value of the control variable i is less than H11 (S660). If the value of the control variable i is less than H11 (Yes in S660), the frequency band data calculation unit 113 returns to step S620 and repeats the processing up to step S650, and if the value of the control variable i is H11 (S660). No), the process is terminated.

周波数帯域データ算出部１１３は、このようにして、周波数スペクトルの帯域幅に関する、Ｈ１１個の時系列の周波数帯域データＥ１１［ｉ］（ｉ＝０〜Ｈ１１−１）を算出し、処理が終了したことを制御部１１１に通知する。 In this way, the frequency band data calculation unit 113 calculates H11 time-series frequency band data E11 [i] (i = 0 to H11-1) related to the bandwidth of the frequency spectrum, and the processing is completed. This is notified to the control unit 111.

次に、周波数帯域データ算出部１１３がステップＳ６４０においてｉ番目のフレームの周波数スペクトルの帯域幅に関する指標Ｅ１１［ｉ］を算出する方法を説明する。 Next, a method in which the frequency band data calculation unit 113 calculates the index E11 [i] related to the bandwidth of the frequency spectrum of the i-th frame in step S640 will be described.

（１）帯域幅に関する指標の第１の算出方法は、周波数スペクトルにおいて、所定値以上のスペクトル強度を有する最小周波数及び最大周波数を検出し、それらの周波数の差を算出する方法である。一般的に周波数スペクトルは、図１８に示すように表現される。低域用の閾値λａと高域用の閾値λｂとを用意し、Ｓ［ｉ］［ｋ］≧λａを満たす最小のｋである周波数Ｋａと、Ｓ［ｉ］［ｋ］≧λｂを満たす最大のｋである周波数Ｋｂとを検出する。そして、周波数Ｋａと周波数Ｋｂとの差、すなわち（Ｋｂ−Ｋａ）を帯域幅とし、それを帯域幅に関する指標とする。なお、帯域幅に関する指標は、（Ｆｓ／Ｎ１１）×（Ｋｂ−Ｋａ）であってもよい。なお、第１の算出方法では、数十Ｈｚから数ＫＨｚ又は１０ＫＨｚ程度の周波数範囲で、条件を満たす最小周波数及び最大周波数を求める。最小周波数は数十Ｈｚ程度の刻み（精度）で検出し、最大周波数は数百Ｈｚ程度の刻み（精度）で検出する。 (1) The first calculation method of the index related to the bandwidth is a method of detecting a minimum frequency and a maximum frequency having a spectrum intensity equal to or higher than a predetermined value in a frequency spectrum, and calculating a difference between those frequencies. In general, the frequency spectrum is expressed as shown in FIG. A threshold λa for a low frequency and a threshold λb for a high frequency are prepared, and a frequency Ka that is the minimum k that satisfies S [i] [k] ≧ λa and a maximum that satisfies S [i] [k] ≧ λb The frequency Kb, which is k, is detected. Then, the difference between the frequency Ka and the frequency Kb, that is, (Kb−Ka) is defined as a bandwidth, and this is used as an index related to the bandwidth. The index related to the bandwidth may be (Fs / N11) × (Kb−Ka). In the first calculation method, the minimum frequency and the maximum frequency that satisfy the condition are obtained in a frequency range of about several tens Hz to several KHz or 10 KHz. The minimum frequency is detected in steps of several tens Hz (accuracy), and the maximum frequency is detected in steps of several hundred Hz (accuracy).
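The first calculation method can be sketched as below. This is a minimal example under stated assumptions: the function name is hypothetical, the search covers the whole spectrum rather than a restricted frequency range, and the frame length N11 is inferred from the spectrum length as 2 × (len − 1), which holds only for an even N11.

```python
def bandwidth_by_threshold(spectrum, low_threshold, high_threshold, fs=None):
    """Bandwidth index Kb - Ka for one frame.

    Ka is the smallest k with spectrum[k] >= low_threshold (lambda_a) and
    Kb is the largest k with spectrum[k] >= high_threshold (lambda_b).
    If fs is given, the result is scaled to Hz as (Fs / N11) * (Kb - Ka).
    Returns 0 when no component reaches a threshold (assumed convention).
    """
    ka = next((k for k, s in enumerate(spectrum) if s >= low_threshold), None)
    kb = next((k for k in range(len(spectrum) - 1, -1, -1)
               if spectrum[k] >= high_threshold), None)
    if ka is None or kb is None or kb < ka:
        return 0
    width = kb - ka
    if fs is None:
        return width
    n = 2 * (len(spectrum) - 1)  # assumed frame length N11
    return width * fs / n


spectrum = [0, 0, 3, 5, 4, 2, 1, 0, 0]   # made-up S[i][k] for k = 0 .. 8
width_bins = bandwidth_by_threshold(spectrum, 2, 2)
width_hz = bandwidth_by_threshold(spectrum, 2, 2, fs=1600)
```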

（２）帯域幅に関する指標の第２の算出方法は、周波数スペクトルの形状を用いる方法である。具体的には、第２の算出方法は、周波数スペクトルの各周波数の値と所定値との差に関係する値と、その周波数の強度との積の総和（積和演算の値）を用いる方法である。より具体的には、下記の式（１８）又は式（１９）を用いて周波数スペクトルの帯域幅に関する指標Ｅ１１［ｉ］を算出する。 (2) The second method for calculating the bandwidth index uses the shape of the frequency spectrum. Specifically, the second calculation method uses the sum of products (the result of a multiply-accumulate operation) of a value related to the difference between each frequency of the frequency spectrum and a predetermined value, and the intensity at that frequency. More specifically, the index E11[i] related to the bandwidth of the frequency spectrum is calculated using the following equation (18) or equation (19).
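Equations (18) and (19) are not reproduced in this text. The forms below are an assumed reconstruction based only on the verbal description: a sum over k = K1 to K2 of the squared (or absolute) difference from ω weighted by the intensity, divided by the total intensity raised to the power η. In particular, applying the exponent μ to the whole quotient in equation (18) is a guess; the text states only that μ is a positive constant such as 1 or 0.5.

```latex
E11[i] = \left( \frac{\sum_{k=K1}^{K2} (k-\omega)^{2} \, S[i][k]}
                     {\left( \sum_{k=K1}^{K2} S[i][k] \right)^{\eta}} \right)^{\mu}
\qquad (18)

E11[i] = \frac{\sum_{k=K1}^{K2} \lvert k-\omega \rvert \, S[i][k]}
              {\left( \sum_{k=K1}^{K2} S[i][k] \right)^{\eta}}
\qquad (19)
```

With η = 1 the index is normalized by the total intensity, and with η = 0 the denominator equals 1, which matches the behavior the description attributes to η.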

式（１８）は、周波数スペクトルの各周波数の値と所定値との差の２乗値とその周波数の強度との積の総和を用いて帯域幅に関する指標Ｅ１１［ｉ］を算出するための式であり、式（１９）は、周波数スペクトルの各周波数の値と所定値との差の絶対値とその周波数の強度との積の総和を用いて帯域幅に関する指標Ｅ１１［ｉ］を算出するための式である。 Equation (18) calculates the bandwidth index E11[i] using the sum of products of the squared difference between each frequency of the frequency spectrum and the predetermined value, and the intensity at that frequency. Equation (19) calculates the bandwidth index E11[i] using the sum of products of the absolute value of that difference and the intensity at that frequency.

式（１８）及び式（１９）において、Ｋ１は処理対象の周波数の下限を示す整数であり、Ｋ２は処理対象の周波数の上限を示す整数であって、０≦Ｋ１＜Ｋ２≦（Ｎ１１／２）の関係が満たされる。ωは所定値であり、Ｋ１≦ω≦Ｋ２の関係が満たされる。ηは、０≦η≦１の範囲で設定される値である。η＝１の場合、Ｅ１１［ｉ］に周波数スペクトルの強度の情報が入らないので、帯域幅に関する純粋な指標が得られる。η＝０の場合、分母が１となりＥ１１［ｉ］は分子だけで表現されるので、周波数スペクトルの強度が加味された指標が得られる。ηの値は、０と１の中間の値、例えば０．５等であってもよい。 In equations (18) and (19), K1 is an integer indicating the lower limit of the frequencies to be processed and K2 is an integer indicating the upper limit, and they satisfy 0 ≦ K1 < K2 ≦ (N11/2). ω is a predetermined value that satisfies K1 ≦ ω ≦ K2. η is a value set in the range 0 ≦ η ≦ 1. When η = 1, no information on the intensity of the frequency spectrum enters E11[i], so a pure index of bandwidth is obtained. When η = 0, the denominator becomes 1 and E11[i] is expressed by the numerator alone, so an index that also reflects the intensity of the frequency spectrum is obtained. η may also take a value between 0 and 1, for example 0.5.

また、式（１８）において、μは０より大きい所定値である。例えば、μ＝１、又はμ＝０．５である。また、式（１８）及び式（１９）において、ｋは周波数そのものではなく、周波数成分を識別する番号であるが、（ｋ−ω）の代わりに、周波数そのものの（ｋ×Ｆｓ／Ｎ１１−ω）が用いられてもよい。 In Equation (18), μ is a predetermined value greater than zero, for example μ = 1 or μ = 0.5. In Equations (18) and (19), k is not a frequency itself but a number identifying a frequency component; however, the frequency itself, (k × Fs / N11 − ω), may be used instead of (k − ω).
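The second calculation method can be illustrated with a minimal Python sketch. Equations (18) and (19) are not reproduced in this text, so the form below is an assumption built only from the description: the numerator is the sum of (k − ω)² (or |k − ω|) weighted by the spectral intensity, and the denominator is the total intensity raised to the power η. The exponent μ of Equation (18) is left out because its placement cannot be recovered from the text. The function name `bandwidth_index` and its parameters are hypothetical.

```python
import numpy as np

def bandwidth_index(spectrum, omega, k1, k2, eta=1.0, squared=True):
    """Hypothetical bandwidth index for one frame (sketch of Eq. (18)/(19)).

    spectrum : intensity of each frequency component k of the frame
    omega    : predetermined reference value, K1 <= omega <= K2
    k1, k2   : lower and upper component numbers to process
    eta      : 1 -> intensity-independent index, 0 -> denominator is 1
    squared  : True uses (k - omega)^2, False uses |k - omega|
    """
    k = np.arange(k1, k2 + 1)
    s = np.asarray(spectrum, dtype=float)[k1:k2 + 1]
    distance = (k - omega) ** 2 if squared else np.abs(k - omega)
    numerator = float(np.sum(distance * s))
    denominator = float(np.sum(s)) ** eta  # equals 1.0 when eta == 0
    return numerator / denominator if denominator > 0 else 0.0
```

Under these assumptions, a spectrum concentrated at ω yields 0, and energy spread far from ω yields a larger value, which matches the intent of a bandwidth-related index.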

（３）帯域幅に関する指標の第３の算出方法は、周波数スペクトルの各成分の周波数と平均周波数との差に関係する値と、その成分との積の総和を用いる方法である。具体的には、先ず、下記の式（２０）に従って、平均周波数ωａを算出する。 (3) A third method for calculating the bandwidth-related index is a method using the sum of products of values related to the difference between the frequency of each component of the frequency spectrum and the average frequency and the component. Specifically, first, the average frequency ωa is calculated according to the following equation (20).

次に、式（１８）又は式（１９）のωに、ωａを代入してＥ１１［ｉ］を算出する。第３の算出方法を用いる場合、事前にωを決定しておく必要がないので、第２の算出方法を用いる場合よりも、多様なジャンルや音楽スタイルの楽曲に対応して帯域幅に関する指標Ｅ１１を算出することができる。

Next, E11 [i] is calculated by substituting ωa for ω in Equation (18) or Equation (19). With the third calculation method, ω need not be determined in advance, so the bandwidth index E11 can be calculated for music of a wider variety of genres and styles than with the second calculation method.
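As a rough illustration of the third calculation method, the sketch below computes an average frequency for one frame. Equation (20) is not reproduced in this text; the intensity-weighted mean of the component numbers used here is an assumption, and the fallback for a silent frame is an arbitrary choice made only so the function always returns a value. The name `mean_frequency` is hypothetical.

```python
import numpy as np

def mean_frequency(spectrum, k1, k2):
    """Assumed form of the average frequency: intensity-weighted mean of k."""
    k = np.arange(k1, k2 + 1)
    s = np.asarray(spectrum, dtype=float)[k1:k2 + 1]
    total = float(np.sum(s))
    if total == 0.0:
        # Silent frame: fall back to the middle of the range (assumption).
        return (k1 + k2) / 2.0
    return float(np.sum(k * s)) / total
```

The returned value would then take the place of ω when evaluating Equation (18) or (19) for the same frame.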

なお、第２及び第３の算出方法では、数十Ｈｚから数ＫＨｚ又は１０ＫＨｚ程度の周波数範囲の周波数スペクトルを算出する。周波数スペクトルの分解能は、数十Ｈｚから数百Ｈｚとする。また、周波数が等間隔（リニア）のスペクトルではなく、低域の周波数分解能が細かく、高域になるに従って分解能が粗くなるような（周波数軸上で対数的な）スペクトルを算出してもよい。また、音楽で用いられている音律（平均律等）に対応する周波数スペクトルを算出してもよい。平均律では各音階、ド、ド＃、レ、レ＃、の周波数は対数的に等間隔で並んでいる。また、第２の算出方法において、所定値ωとして、例えば、１Ｋから２ＫＨｚ程度の値を設定する。また、所定値ωとして一般的な音楽における平均的な周波数を設定してもよい。 In the second and third calculation methods, a frequency spectrum is calculated over a frequency range from several tens of Hz to several kHz or about 10 kHz. The resolution of the frequency spectrum is several tens of Hz to several hundred Hz. Instead of a spectrum with equally spaced (linear) frequencies, a spectrum may be calculated whose frequency resolution is fine in the low range and becomes coarser toward the high range (logarithmic on the frequency axis). A frequency spectrum corresponding to a temperament used in music (such as equal temperament) may also be calculated. In equal temperament, the frequencies of the scale notes (C, C#, D, D#, and so on) are spaced at equal logarithmic intervals. In the second calculation method, the predetermined value ω is set to, for example, about 1 kHz to 2 kHz. The average frequency of typical music may also be set as the predetermined value ω.

（４）帯域幅に関する指標の第４の算出方法は、隣接する２つのフレームの周波数スペクトルの帯域幅に関する数値の差を用いる方法である。帯域幅に関する数値は、上記の第１から第３の算出方法のいずれかで得られる値である。 (4) The fourth calculation method of the index related to the bandwidth is a method of using a numerical difference regarding the bandwidth of the frequency spectrum of two adjacent frames. The numerical value related to the bandwidth is a value obtained by any one of the first to third calculation methods.

例えば、第２の算出方法によって得られた値を用いる場合、ｉ−１番目のフレームに対応する音響データを式（１８）又は式（１９）に代入した結果をＥ１１’［ｉ−１］として保持するとともに、ｉ番目のフレームに対応する音響データを式（１８）又は式（１９）に代入した結果をＥ１１’［ｉ］として保持する。そして、Ｅ１１’［ｉ］とＥ１１’［ｉ−１］との差Ｅ１１［ｉ］＝Ｅ１１’［ｉ］−Ｅ１１’［ｉ−１］を算出し、これをフレームｉの帯域幅に関する指標とする。この指標は、帯域幅そのものではなく、帯域幅の変化量である。楽曲のサビの開始位置において、帯域幅が急激に広がることが多いので、このような箇所ではこの指標の値は大きくなる。 For example, when the value obtained by the second calculation method is used, the result of substituting the acoustic data of the (i−1)-th frame into Equation (18) or Equation (19) is held as E11′ [i−1], and the result of substituting the acoustic data of the i-th frame into Equation (18) or Equation (19) is held as E11′ [i]. The difference E11 [i] = E11′ [i] − E11′ [i−1] is then calculated and used as the bandwidth-related index of frame i. This index is not the bandwidth itself but the amount of change in the bandwidth. Because the bandwidth often widens sharply at the start of the chorus of a song, the value of this index becomes large at such a location.
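The frame-to-frame difference of the fourth calculation method can be sketched as follows. The input is assumed to be the per-frame values E11′ already obtained by one of the first to third methods. The text does not say how frame 0, which has no preceding frame, is handled; setting it to 0 is an assumption, and the name `bandwidth_change` is hypothetical.

```python
import numpy as np

def bandwidth_change(e11_prime):
    """E11[i] = E11'[i] - E11'[i-1]; frame 0 is set to 0 (assumption)."""
    e = np.asarray(e11_prime, dtype=float)
    change = np.zeros_like(e)
    change[1:] = e[1:] - e[:-1]
    return change
```

A sudden widening of the bandwidth shows up as a large positive value, while a steady wide bandwidth yields values near zero.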

上記の帯域幅に関する指標の第１から第４の算出方法において、例えば、最大値が１になり、最小値が０になるように、得られたデータを正規化してもよい。 In the first to fourth calculation methods of the bandwidth-related index, for example, the obtained data may be normalized so that the maximum value is 1 and the minimum value is 0.
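The optional normalization to the range 0 to 1 could look like the sketch below. Min-max scaling over the whole series is assumed, since the text only states the target maximum and minimum; returning all zeros for a constant series is also an assumption. The name `normalize_unit_range` is hypothetical.

```python
import numpy as np

def normalize_unit_range(values):
    """Scale a series so that its minimum maps to 0 and its maximum to 1."""
    v = np.asarray(values, dtype=float)
    span = float(v.max() - v.min())
    if span == 0.0:
        # Constant series: no meaningful scaling, return zeros (assumption).
        return np.zeros_like(v)
    return (v - v.min()) / span
```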

制御部１１１は、周波数帯域データ算出部１１３の処理の終了を検知すると、平滑化部１１４に対して動作を開始するように指示する。 When detecting the end of the processing of the frequency band data calculation unit 113, the control unit 111 instructs the smoothing unit 114 to start the operation.

次に、平滑化部１１４について説明する。周波数帯域データ算出部１１３によって生成された周波数帯域データＥ１１［ｉ］（ｉ＝０〜Ｈ１１−１）には、微小な変動（ノイズ）が含まれていることが多いので、平滑化部１１４は、ローパスフィルタによるフィルタリングを行うことにより、ノイズを除去する。例えば、平滑化部１１４は、下記の式（２１）を用いて、隣接する３つのフレームの周波数帯域データＥ１１に（１，２，１）の係数を掛け合わせて平滑化出力Ｅ［ｉ］（ｉ＝０〜Ｈ１１−１）を算出する。もちろんこの他の係数のローパスフィルタを用いてもよい。 Next, the smoothing unit 114 will be described. The frequency band data E11 [i] (i = 0 to H11−1) generated by the frequency band data calculation unit 113 often contains minute fluctuations (noise), so the smoothing unit 114 removes the noise by filtering with a low-pass filter. For example, using Equation (21) below, the smoothing unit 114 multiplies the frequency band data E11 of three adjacent frames by the coefficients (1, 2, 1) to calculate the smoothed output E [i] (i = 0 to H11−1). A low-pass filter with other coefficients may of course be used.

なお、両側の隣接フレームデータが揃わないＥ［０］及びＥ［Ｈ１１−１］については、揃っていないデータに対する係数を「０」に設定する。このように、周波数帯域データを平滑化することにより、特徴位置検出部１１５による特徴位置の検出精度が向上する。なお、平滑化部１１４は、省略されてもよい。

For E [0] and E [H11−1], for which adjacent frame data is not available on both sides, the coefficient for the missing data is set to "0". Smoothing the frequency band data in this way improves the accuracy with which the feature position detection unit 115 detects feature positions. Note that the smoothing unit 114 may be omitted.
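A minimal sketch of the (1, 2, 1) smoothing follows. Equation (21) is not reproduced in this text, so dividing by 4 (the sum of the coefficients) is an assumption; the zero padding performed by `np.convolve` with `mode="same"` corresponds to setting the coefficient of the missing neighbour to 0 at both ends. The sketch assumes at least three frames, and the name `smooth_121` is hypothetical.

```python
import numpy as np

def smooth_121(e11):
    """Low-pass filter E11 with coefficients (1, 2, 1); needs >= 3 frames."""
    e = np.asarray(e11, dtype=float)
    # mode="same" pads with zeros, i.e. missing neighbours contribute 0.
    return np.convolve(e, [1.0, 2.0, 1.0], mode="same") / 4.0
```

For example, an isolated spike `[0, 4, 0, 0]` is spread over its neighbours as `[1, 2, 1, 0]` under this assumed scaling.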

制御部１１１は、平滑化部１１４の処理の終了を検知すると、特徴位置検出部１１５に対して動作を開始するように指示する。 When detecting the end of the process of the smoothing unit 114, the control unit 111 instructs the feature position detection unit 115 to start the operation.

特徴位置検出部１１５は、平滑化部１１４によって得られた値を用いて、音響信号１０２におけるサビの開始位置等の特徴的な位置を検出する。特徴位置を検出する方法として、以下のいずれかの方法を用いる。ただし、平滑化部１１４が省略されている場合、特徴位置検出部１１５は、周波数帯域データ算出部１１３によって算出された周波数帯域データＥ１１［ｉ］を処理する。また、以下の説明のＥ［ｉ］をＥ１１［ｉ］に置き換える。 The feature position detection unit 115 uses the values obtained by the smoothing unit 114 to detect a characteristic position in the acoustic signal 102, such as the start position of the chorus. One of the following methods is used to detect the feature position. If the smoothing unit 114 is omitted, the feature position detection unit 115 processes the frequency band data E11 [i] calculated by the frequency band data calculation unit 113, and E [i] in the following description is to be read as E11 [i].

（１）特徴位置の第１の検出方法は、平滑化出力が最大となるフレーム（位置）を検出する方法である。平滑化出力Ｅ［ｉ］（ｉ＝０〜Ｈ１１−１）が最大となる位置のインデックスｉ（以下、「Ｉｍａｘ」と記載する。）を検出し、Ｉｍａｘに対応する楽曲の先頭からの時間（Ｔｇ１１×Ｉｍａｘ）を特徴位置とする。 (1) The first feature position detection method is a method for detecting a frame (position) at which the smoothed output is maximized. The index i (hereinafter referred to as “Imax”) at the position where the smoothed output E [i] (i = 0 to H11-1) is maximum is detected, and the time from the beginning of the music corresponding to Imax ( Tg11 × Imax) is defined as the feature position.

なお、平滑化出力の全部から最大値を探すのではなく、最大値を探す範囲を限定してもよい。つまり、音響信号１０２の連続する一部分について、平滑化部１１４によって得られた値が最大となる位置を検出してもよい。具体的には、Ｅ［ｉ］（ｉ＝Ｈａ〜Ｈｂ、Ｈａ及びＨｂは、０≦Ｈａ＜Ｈｂ＜Ｈ１１−１、を満たす整数）を対象に最大値を探してもよい。例えば、Ｈａ＝０とし、Ｈｂを楽曲の長さの７０％程度に相当する値に設定する。また、楽曲の連続する一部分、例えば楽曲の７０％程度に相当する音響信号１０２から周波数帯域データＥ１１を算出し、これに基づき算出される平滑化出力Ｅが最大となる位置を検出してもよい。このように音響信号１０２の連続する一部分に相当する平滑化出力を対象に最大値を探す方法を用いると、処理量を削減することができるとともに、以下の理由により特徴位置の検出精度を改善することができる。 Instead of searching all of the smoothed output for the maximum value, the search range may be limited. That is, the position at which the value obtained by the smoothing unit 114 is largest may be detected within a continuous portion of the acoustic signal 102. Specifically, the maximum may be searched for over E [i] (i = Ha to Hb, where Ha and Hb are integers satisfying 0 ≦ Ha < Hb < H11−1). For example, Ha = 0 and Hb is set to a value corresponding to about 70% of the length of the song. Alternatively, the frequency band data E11 may be calculated only from the acoustic signal 102 of a continuous portion of the song, for example about 70% of the song, and the position at which the resulting smoothed output E is largest may be detected. Searching for the maximum over the smoothed output of a continuous portion of the acoustic signal 102 in this way reduces the amount of processing and, for the reason given below, improves the detection accuracy of the feature position.

楽曲のサビは、１つの楽曲で複数回繰り返されることが多いが、演奏や歌唱のニュアンスは、毎回同じではなく、微妙に異なることが多い。すなわち、時間的に前の位置にあるサビは、後ろの位置にあるサビに比べて、完全には盛り上がっていない場合が多い。試聴用に楽曲のある一部を再生する場合を考えると、その箇所は「この曲全体を聴いてみたい」とリスナに思わせる箇所であることが望ましい。したがって、完全に盛り上がった状態の楽曲の後ろの位置のサビよりも、今後の盛り上がりに期待感を抱かせる楽曲の前の位置のサビの方が、試聴用に適している。平滑化出力の最大を検出する範囲を楽曲の前半の７０％程度に制限することにより、楽曲の前の位置にあるサビが検出され易くなり、試聴用の特徴位置の検出精度が向上する。 The chorus of a song is often repeated several times within the song, but the nuances of the performance and singing are not identical each time and often differ subtly. That is, an earlier chorus is often not as fully built up as a later one. When part of a song is played back for preview listening, that part should ideally make the listener want to hear the whole song. An earlier chorus, which raises expectations for the build-up still to come, is therefore better suited to preview listening than a later chorus in which the song is already at full intensity. Limiting the range searched for the maximum smoothed output to roughly the first 70% of the song makes an earlier chorus more likely to be detected and improves the detection accuracy of the feature position for preview listening.

また、楽曲のイントロ部分を特徴位置の検出対象に含めないように、Ｈａに適切な値を設定してもよい。 Also, an appropriate value may be set for Ha so that the intro part of the music is not included in the feature position detection target.

なお、平滑化出力が最大となる位置そのものを特徴位置とするのではなく、平滑化出力が最大となる位置から所定時間だけ前の位置、又は平滑化出力が最大となる位置より前で、平滑化出力が最大値より所定値だけ小さくなる位置を特徴位置としてもよい。これにより、サビの出だしの検出漏れを防止することができる。 Instead of using the position of the maximum smoothed output itself as the feature position, the feature position may be set to a position a predetermined time before the position of the maximum, or to a position before the maximum at which the smoothed output is smaller than the maximum value by a predetermined amount. This prevents the very beginning of the chorus from being missed.
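The first detection method, including the restricted search range Ha to Hb and the conversion of the frame index to a time from the beginning of the song (Tg11 × Imax), can be sketched as follows. The optional `lead_time` argument models the variant that moves the feature position a predetermined time earlier; clamping the result at 0 is an assumption. The name `detect_max_position` and its parameters are hypothetical.

```python
import numpy as np

def detect_max_position(e, tg11, ha=0, hb=None, lead_time=0.0):
    """Return (Imax, time in seconds) of the largest E[i] for i = ha..hb.

    tg11      : frame shift time length in seconds
    lead_time : seconds to move the feature position earlier (optional)
    """
    e = np.asarray(e, dtype=float)
    if hb is None:
        hb = len(e) - 1
    imax = ha + int(np.argmax(e[ha:hb + 1]))
    time_sec = max(0.0, tg11 * imax - lead_time)  # clamp is an assumption
    return imax, time_sec
```

With `hb` set to the frame at about 70% of the song length, a later and louder chorus outside the range is ignored in favour of an earlier one.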

（２）特徴位置の第２の検出方法は、図１９に示すフローチャートに従って、平滑化出力が極大となる位置を検出する方法である。図１９は、特徴位置検出部１１５が特徴位置の第２の検出方法を実行する際の動作の各ステップを示すフローチャートである。 (2) The second feature position detection method detects positions at which the smoothed output has a local maximum, following the flowchart shown in FIG. 19. FIG. 19 is a flowchart showing the steps performed by the feature position detection unit 115 when it executes the second feature position detection method.

特徴位置検出部１１５は、先ず、制御変数ｉに初期値「Ｈｃ」をセットする（Ｓ７００）。Ｈｃは、１≦Ｈｃ＜Ｈ１１−１を満たす所定の整数である。平滑化出力の全部から極大位置を探す場合、Ｈｃ＝１である。楽曲のイントロ等を極大位置の検出の対象に含めない場合、Ｈｃ＞１である。 The feature position detection unit 115 first sets an initial value “Hc” in the control variable i (S700). Hc is a predetermined integer that satisfies 1 ≦ Hc <H11-1. When searching for the maximum position from all the smoothed outputs, Hc = 1. When the intro of the music is not included in the detection target of the maximum position, Hc> 1.

次に、特徴位置検出部１１５は、Ｅ［ｉ］が極大値であるか否かを判定する（Ｓ７１０）。この判定方法は、例えば、Ｅ［ｉ］＞Ｅ［ｉ−１］かつＥ［ｉ］＞Ｅ［ｉ＋１］であれば、Ｅ［ｉ］を極大値と判定する方法である。特徴位置検出部１１５は、Ｅ［ｉ］が極大値であると判定すると（Ｓ７１０でＹｅｓ）、極大位置における平滑化出力Ｅ［ｉ］と、極大位置における制御変数の値（インデックス、時間情報）ｉとを特徴位置検出部１１５内部の作業用メモリに格納する（Ｓ７２０）。 Next, the feature position detection unit 115 determines whether E [i] is a local maximum (S710). For example, E [i] is judged to be a local maximum if E [i] > E [i−1] and E [i] > E [i+1]. When the feature position detection unit 115 determines that E [i] is a local maximum (Yes in S710), it stores the smoothed output E [i] at the local maximum position and the value i of the control variable at that position (index, time information) in the working memory inside the feature position detection unit 115 (S720).

次に、特徴位置検出部１１５は、制御変数ｉの値を「１」増やす（Ｓ７３０）。なお、ステップＳ７１０において、Ｅ［ｉ］が極大値ではないと判定した場合（Ｓ７１０でＮｏ）、特徴位置検出部１１５は、ステップＳ７３０の処理を行う。 Next, the feature position detection unit 115 increases the value of the control variable i by “1” (S730). When it is determined in step S710 that E [i] is not the maximum value (No in S710), the feature position detection unit 115 performs the process of step S730.

次に、特徴位置検出部１１５は、制御変数ｉが所定値Ｈｄ以下であるか否かを判定する（Ｓ７４０）。Ｈｄは、Ｈｃ＜Ｈｄ＜Ｈ１１−１を満たす所定の整数である。平滑化出力の全部を対象に極大位置を探す場合、Ｈｄ＝Ｈ１１−２である。上述した理由等により、楽曲の後ろの部分を極大位置の検出の対象から除外する場合、Ｈｄ＜Ｈ１１−２とし、平滑化出力の極大値の検出範囲を、例えば、楽曲の長さの７０％に限定する。 Next, the feature position detection unit 115 determines whether the control variable i is equal to or less than the predetermined value Hd (S740). Hd is a predetermined integer satisfying Hc < Hd < H11−1. When local maximum positions are searched for over all of the smoothed output, Hd = H11−2. When the latter part of the song is excluded from the search, for the reason described above or otherwise, Hd < H11−2 is used and the range in which local maxima of the smoothed output are detected is limited to, for example, 70% of the length of the song.

特徴位置検出部１１５は、制御変数ｉが所定値Ｈｄ以下であると判定すると（Ｓ７４０でＹｅｓ）、ステップＳ７１０に戻ってステップＳ７３０までの処理を繰り返す。 If the characteristic position detection unit 115 determines that the control variable i is equal to or less than the predetermined value Hd (Yes in S740), the feature position detection unit 115 returns to step S710 and repeats the processing up to step S730.

特徴位置検出部１１５は、制御変数ｉが所定値Ｈｄより大きいと判定すると（Ｓ７４０でＮｏ）、作業用メモリに格納した極大値の情報の内から、所定個数の極大位置を選択する（Ｓ７５０）。具体的には、特徴位置検出部１１５は、極大値が大きい順に所定個数の極大位置を選択する。大きい順に選択されたＰ個の極大位置（時間）をＩｐ［ｖ］（ｖ＝０〜Ｐ−１）と記載する。このとき、Ｅ［Ｉｐ［０］］≧Ｅ［Ｉｐ［１］］≧Ｅ［Ｉｐ［２］］≧．．．≧Ｅ［Ｉｐ［Ｐ−１］］である。例えば、周波数帯域の幅が図２０に示すように時間の経過とともに変化する場合、特徴位置検出部１１５は、値が最大である極大位置Ａ’と、値が２番目である極大位置Ｂ’と、値が３番目である極大位置Ｃ’とを選択する。 When the feature position detection unit 115 determines that the control variable i is greater than the predetermined value Hd (No in S740), it selects a predetermined number of local maximum positions from the local maximum information stored in the working memory (S750). Specifically, it selects the predetermined number of local maximum positions in descending order of their values. The P local maximum positions (times) selected in descending order are denoted Ip [v] (v = 0 to P−1), so that E [Ip [0]] ≧ E [Ip [1]] ≧ E [Ip [2]] ≧ ... ≧ E [Ip [P−1]]. For example, when the width of the frequency band changes over time as shown in FIG. 20, the feature position detection unit 115 selects the local maximum position A′ with the largest value, the local maximum position B′ with the second largest value, and the local maximum position C′ with the third largest value.

なお、特徴位置検出部１１５は、極大値が大きい順に所定個数の極大位置を選択する際、既に選択している極大位置と時間的に近いものを除外してもよい。例えば、特徴位置検出部１１５は、既に選択した極大位置と所定の時間以上離れている極大値のみを選択してもよい。また、特徴位置検出部１１５は、音響信号１０２の連続する一部分について、平滑化部１１４によって得られた値が極大となる位置を検出してもよい。以上が特徴位置の第２の検出方法の説明である。 Note that the feature position detection unit 115 may exclude a position close in time to the already selected maximum position when selecting a predetermined number of maximum positions in descending order of the maximum value. For example, the feature position detection unit 115 may select only the maximum value that is separated from the already selected maximum position by a predetermined time or more. In addition, the feature position detection unit 115 may detect a position where the value obtained by the smoothing unit 114 is maximum for a continuous portion of the acoustic signal 102. The above is the description of the second feature position detection method.
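The flow of steps S700 to S750 can be sketched as below: scan i from Hc to Hd, record every local maximum, then keep the P largest. The optional `min_gap` argument models the variant that skips a local maximum lying too close to one already selected; expressing the gap in frames rather than seconds, and breaking ties by the earlier index, are assumptions. The name `detect_local_maxima` is hypothetical.

```python
def detect_local_maxima(e, p, hc=1, hd=None, min_gap=0):
    """Return up to p indices Ip[0..] of local maxima, largest value first."""
    e = list(e)
    if hd is None:
        hd = len(e) - 2  # search the whole series (Hd = H11 - 2)
    # S710/S720: collect (value, index) for every local maximum in hc..hd.
    candidates = [(e[i], i) for i in range(hc, hd + 1)
                  if e[i] > e[i - 1] and e[i] > e[i + 1]]
    # S750: sort by descending value (earlier index first on ties).
    candidates.sort(key=lambda c: (-c[0], c[1]))
    selected = []
    for _value, i in candidates:
        # Skip candidates closer than min_gap frames to a selected one.
        if all(abs(i - s) >= min_gap for s in selected):
            selected.append(i)
        if len(selected) == p:
            break
    return selected
```

Multiplying each returned index by the frame shift time length Tg11 gives the corresponding time from the beginning of the song.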

特徴位置検出部１１５は、このようにして検出した、最大位置Ｉｍａｘ又は極大位置Ｉｐ［ｖ］（ｖ＝０〜Ｐ−１）を特徴位置情報１０３として音響信号分析装置１０１の外部に出力する。特徴位置情報１０３を用いて音響信号１０２を再生することにより、サビ等の楽曲の特徴的な箇所を再生することが可能になる。 The feature position detection unit 115 outputs the maximum position Imax or the local maximum positions Ip [v] (v = 0 to P−1) detected in this way to the outside of the acoustic signal analysis device 101 as feature position information 103. Playing back the acoustic signal 102 using the feature position information 103 makes it possible to play back a characteristic part of the song, such as the chorus.

上述したように、実施の形態３の音響信号分析装置１０１は、音響信号１０２を構成する各区分の周波数帯域の幅又はそれに直接関係するデータを算出し、それが最大又は極大となる区間を検出する。これにより、楽曲のサビや盛り上がる箇所等の特徴位置を精度良く検出することができる。 As described above, the acoustic signal analysis device 101 of Embodiment 3 calculates the width of the frequency band of each section of the acoustic signal 102, or data directly related to it, and detects the section in which that value is a maximum or a local maximum. This makes it possible to accurately detect feature positions such as the chorus or a climactic part of a song.

（実施の形態４）
次に、実施の形態４の音響信号分析装置１０１を図２１を用いて説明する。図２１は、実施の形態４の音響信号分析装置１０１の構成図である。実施の形態４の音響信号分析装置１０１は、図２１に示すように、制御部１１１と、取得部１１２と、周波数帯域データ算出部１１３と、平滑化部１１４ａと、特徴位置検出部１１５と、第２の周波数帯域データ算出部１１６と、評価値算出部１１７とを有する。 (Embodiment 4)
Next, the acoustic signal analysis apparatus 101 of Embodiment 4 is demonstrated using FIG. FIG. 21 is a configuration diagram of the acoustic signal analysis device 101 according to the fourth embodiment. As shown in FIG. 21, the acoustic signal analysis apparatus 101 according to the fourth embodiment includes a control unit 111, an acquisition unit 112, a frequency band data calculation unit 113, a smoothing unit 114a, a feature position detection unit 115, A second frequency band data calculation unit 116 and an evaluation value calculation unit 117 are included.

実施の形態４の音響信号分析装置１０１は、実施の形態３の音響信号分析装置１０１が有する各構成部に加えて、第２の周波数帯域データ算出部１１６と、評価値算出部１１７とを有する。また、実施の形態４の音響信号分析装置１０１は、実施の形態３の音響信号分析装置１０１が有する平滑化部１１４に代えて平滑化部１１４ａを有する。その点が実施の形態４と実施の形態３との相違点である。 The acoustic signal analysis device 101 according to the fourth embodiment includes a second frequency band data calculation unit 116 and an evaluation value calculation unit 117 in addition to the components included in the acoustic signal analysis device 101 according to the third embodiment. . The acoustic signal analysis apparatus 101 according to the fourth embodiment includes a smoothing unit 114a instead of the smoothing unit 114 included in the acoustic signal analysis apparatus 101 according to the third embodiment. This is the difference between the fourth embodiment and the third embodiment.

取得部１１２及び周波数帯域データ算出部１１３の動作は、実施の形態３において説明した動作と同じである。 The operations of the acquisition unit 112 and the frequency band data calculation unit 113 are the same as those described in the third embodiment.

第２の周波数帯域データ算出部１１６の動作は、周波数帯域データ算出部１１３の動作とほぼ同じである。ただし、第２の周波数帯域データ算出部１１６は、周波数帯域データ算出部１１３が処理するフレームの時間長Ｔｆ１１とは異なる時間長Ｔｆ１２のフレームを処理する。以下にその理由を説明する。 The operation of the second frequency band data calculation unit 116 is substantially the same as the operation of the frequency band data calculation unit 113. However, the second frequency band data calculation unit 116 processes a frame having a time length Tf12 different from the time length Tf11 of the frame processed by the frequency band data calculation unit 113. The reason will be described below.

音楽に係る音響信号の周波数成分は、音楽を構成する個々の音符、ビブラート等の音符の装飾音、拍、小節、フレーズ、及び、イントロやサビ等の大局的な構成等の時間スケールの異なる様々な要因（音楽の重層的な構造）により変化する。このような音楽の重層的な構造において、１つの音符の装飾音は、相対的に短い時間スケールで周波数を変化させるのに対し、イントロやサビ等の大局的な構成は相対的に長い時間スケールで周波数を変化させる。 The frequency components of an acoustic signal of music change because of various factors with different time scales (the multi-layered structure of music): the individual notes that make up the music, ornaments on notes such as vibrato, beats, measures, phrases, and the large-scale structure such as the intro and the chorus. Within this multi-layered structure, an ornament on a single note changes the frequency on a relatively short time scale, whereas the large-scale structure such as the intro and chorus changes the frequency on a relatively long time scale.

例えば、サビの開始点においては、音域の異なる複数の楽器や歌唱が同時に演奏されることが多く、特に周波数帯域が広く、減衰時間の短い打楽器が演奏されることが多いため、１６分音符から２分音符に相当する比較的短い時間で周波数帯域が広がる傾向が強い。また、通常のサビは数小節以上の長さを持ち、低域パートと高域パートが両方演奏され続けることが多いため、サビの開始点から数小節に相当する比較的長い時間で周波数帯域が広い傾向がある。サビにはこのような特性があるので、時間スケールの異なる複数の周波数帯域データを算出することで、サビの検出精度を向上させることができる。 For example, at the start of a chorus, several instruments and vocals with different registers are often performed simultaneously, and in particular percussion instruments with a wide frequency band and a short decay time are often played, so the frequency band tends strongly to widen over a relatively short time corresponding to a sixteenth note to a half note. In addition, a typical chorus is several measures or longer, and both low-register and high-register parts usually continue to be played, so the frequency band tends to remain wide over a relatively long time corresponding to several measures from the start of the chorus. Because the chorus has these characteristics, calculating multiple sets of frequency band data on different time scales can improve chorus detection accuracy.

第２の周波数帯域データ算出部１１６が処理するフレームの時間長Ｔｆ１２は、周波数帯域データ算出部１１３が処理するフレームの時間長Ｔｆ１１より長い。具体的には、周波数帯域データ算出部１１３は、楽曲の１音符又は１拍以下の時間長に相当するＴｆ１１のフレームを処理し、第２の周波数帯域データ算出部１１６は、１拍より長い、１小節から８小節程度の時間長に相当するＴｆ１２のフレームを処理する。例えば、Ｔｆ１１を４／４拍子でテンポが１２０の楽曲の１６分音符に相当する１２５ｍｓｅｃとし、Ｔｆ１２を１小節に相当する２ｓｅｃとする。 The time length Tf12 of the frame processed by the second frequency band data calculation unit 116 is longer than the time length Tf11 of the frame processed by the frequency band data calculation unit 113. Specifically, the frequency band data calculation unit 113 processes a frame of Tf11 corresponding to one musical note or a time length of one beat or less of the music, and the second frequency band data calculation unit 116 is longer than one beat. A frame of Tf12 corresponding to a time length of about 1 bar to 8 bars is processed. For example, Tf11 is set to 125 msec corresponding to a sixteenth note of a music piece having a 4/4 time signature and a tempo of 120, and Tf12 is set to 2 sec corresponding to one measure.
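The example frame lengths follow directly from the tempo, as the short sketch below shows. It assumes that the tempo counts quarter-note beats per minute, so a sixteenth note is a quarter of a beat and one measure in 4/4 time is four beats. The function name `duration_sec` is hypothetical.

```python
def duration_sec(tempo_bpm, beats):
    """Duration in seconds of `beats` quarter-note beats at the given tempo."""
    return 60.0 / tempo_bpm * beats

tf11 = duration_sec(120, 0.25)  # sixteenth note at tempo 120 -> 0.125 s
tf12 = duration_sec(120, 4)     # one 4/4 measure at tempo 120 -> 2.0 s
```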

第２の周波数帯域データ算出部１１６が動作する際のフレームシフトの時間長Ｔｇ１２と、周波数帯域データ算出部１１３が動作する際のフレームシフトの時間長Ｔｇ１１とは、同じであってもよいし、異なっていてもよい。実施の形態４では、Ｔｇ１２＝Ｑ１×Ｔｇ１１であり、第２の周波数帯域データ算出部１１６が動作する際のフレームシフトのサンプル数Ｇ１２＝Ｑ１×Ｇ１１である（Ｑ１は１以上の整数）。しかしながら、Ｔｇ１２及びＧ１２はこれに限定されない。 The frame shift time length Tg12 when the second frequency band data calculation unit 116 operates and the frame shift time length Tg11 when the frequency band data calculation unit 113 operates may be the same, May be different. In the fourth embodiment, Tg12 = Q1 × Tg11, and the number of frame shift samples G12 = Q1 × G11 when the second frequency band data calculation unit 116 operates (Q1 is an integer equal to or greater than 1). However, Tg12 and G12 are not limited to this.

また、第２の周波数帯域データ算出部１１６が処理するフレームの総数をＨ１２とする。 In addition, the total number of frames processed by the second frequency band data calculation unit 116 is H12.

このような条件の下、第２の周波数帯域データ算出部１１６は、実施の形態３の周波数帯域データ算出部１１３と同様な動作を行って、第２の周波数帯域データＥ１２［ｊ］（ｊ＝０〜Ｈ１２−１）を算出する。 Under such conditions, the second frequency band data calculation unit 116 performs the same operation as the frequency band data calculation unit 113 of the third embodiment, and the second frequency band data E12 [j] (j = 0 to H12-1) is calculated.

次に、評価値算出部１１７について説明する。評価値算出部１１７は、周波数帯域データ算出部１１３によって算出された周波数帯域データＥ１１［ｉ］と、第２の周波数帯域データ算出部１１６によって算出された第２の周波数帯域データＥ１２［ｊ］とを用いて、評価値を算出する。評価値算出部１１７は、周波数帯域データＥ１１［ｉ］と、第２の周波数帯域データＥ１２［ｊ］とを用い、Ｅ１１［ｉ］が大きく、かつＥ１１［ｉ］に時間的に対応するＥ１２［ｊ］が大きいほど大きな値になるように、評価値を算出する。 Next, the evaluation value calculation unit 117 will be described. The evaluation value calculation unit 117 includes frequency band data E11 [i] calculated by the frequency band data calculation unit 113, and second frequency band data E12 [j] calculated by the second frequency band data calculation unit 116. Is used to calculate an evaluation value. The evaluation value calculation unit 117 uses the frequency band data E11 [i] and the second frequency band data E12 [j], E11 [i] is large, and E12 [i] is temporally corresponding to E11 [i]. The evaluation value is calculated such that the larger j] is, the larger the value is.

制御部１１１は、周波数帯域データ算出部１１３及び第２の周波数帯域データ算出部１１６の処理の終了を検知すると、評価値算出部１１７に対して動作を開始するように指示し、評価値算出部１１７は、図２２のフローチャートに示す動作を開始する。図２２は、評価値算出部１１７の動作の各ステップを示すフローチャートである。 Upon detecting the end of the processing of the frequency band data calculation unit 113 and the second frequency band data calculation unit 116, the control unit 111 instructs the evaluation value calculation unit 117 to start operation, and the evaluation value calculation unit 117 starts the operation shown in the flowchart of FIG. FIG. 22 is a flowchart showing the steps of the operation of the evaluation value calculation unit 117.

先ず、評価値算出部１１７は、制御変数ｉに「０」をセットする（Ｓ８００）。 First, the evaluation value calculation unit 117 sets “0” to the control variable i (S800).

次に、評価値算出部１１７は、下記の式（２２）に従って、制御変数ｊにセットする値を算出する（Ｓ８１０）。 Next, the evaluation value calculation unit 117 calculates a value to be set in the control variable j according to the following equation (22) (S810).

ｆｌｏｏｒ（）は、小数点以下を切り捨てた整数を返す関数である。Ｑ１は、周波数帯域データ算出部１１３が動作する際のフレームシフト時間長を基準とした、第２の周波数帯域データ算出部１１６が動作する際のフレームシフト時間長の倍率であり、１以上の整数である。

floor() is a function that returns the integer obtained by discarding the fractional part. Q1 is the ratio of the frame shift time length used by the second frequency band data calculation unit 116 to the frame shift time length used by the frequency band data calculation unit 113, and is an integer of 1 or more.

次に、評価値算出部１１７は、後述する方法に従って、制御変数ｉに対応する評価値α［ｉ］を算出する（Ｓ８２０）。 Next, the evaluation value calculation unit 117 calculates an evaluation value α [i] corresponding to the control variable i according to a method described later (S820).

次に、評価値算出部１１７は、制御変数ｉの値を「１」増やす（Ｓ８３０）。 Next, the evaluation value calculation unit 117 increases the value of the control variable i by “1” (S830).

次に、評価値算出部１１７は、制御変数ｉが、Ｈ１２（第２の周波数帯域データ算出部１１６が処理するフレームの総数）と、Ｑ１との積の値（Ｑ１×Ｈ１２）未満であるか否かを判定する（Ｓ８４０）。評価値算出部１１７は、制御変数ｉが（Ｑ１×Ｈ１２）未満であると判定すると（Ｓ８４０でＹｅｓ）、ステップＳ８１０に戻ってステップＳ８３０までの処理を繰り返し、制御変数ｉが（Ｑ１×Ｈ１２）であると判定すると（Ｓ８４０でＮｏ）、処理を終了する。 Next, the evaluation value calculation unit 117 determines whether the control variable i is less than (Q1 × H12), the product of Q1 and H12 (the total number of frames processed by the second frequency band data calculation unit 116) (S840). When the evaluation value calculation unit 117 determines that the control variable i is less than (Q1 × H12) (Yes in S840), it returns to step S810 and repeats the processing up to step S830; when it determines that the control variable i has reached (Q1 × H12) (No in S840), it ends the processing.

評価値算出部１１７は、上述した処理により、（Ｑ１×Ｈ１２）個の時系列データである評価値α［ｉ］（ｉ＝０〜Ｑ１×Ｈ１２−１）を算出する。評価値算出部１１７は、処理を終了したことを制御部１１１に通知する。 The evaluation value calculation unit 117 calculates the evaluation value α [i] (i = 0 to Q1 × H12-1), which is (Q1 × H12) pieces of time series data, by the above-described processing. The evaluation value calculation unit 117 notifies the control unit 111 that the processing has been completed.
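The loop of steps S800 to S840 can be sketched as follows. Equation (22) is not reproduced in this text; `j = floor(i / Q1)` is assumed here because Q1 is described as the ratio of the two frame shift lengths. The `combine` argument stands in for whichever evaluation formula is chosen and defaults to simple addition. The sketch also assumes that E11 holds at least Q1 × H12 values. The name `evaluation_series` is hypothetical.

```python
def evaluation_series(e11, e12, q1, combine=lambda a, b: a + b):
    """Compute alpha[i] for i = 0 .. Q1*H12 - 1 (sketch of S800 to S840)."""
    h12 = len(e12)
    alpha = []
    i = 0                       # S800
    while i < q1 * h12:         # S840
        j = i // q1             # S810: assumed form of Eq. (22)
        alpha.append(combine(e11[i], e12[j]))  # S820
        i += 1                  # S830
    return alpha
```

With Q1 = 2, for instance, frames 0 and 1 of E11 are both paired with frame 0 of E12, frames 2 and 3 with frame 1, and so on.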

評価値算出部１１７は、以下に示すいずれかの方法により評価値α［ｉ］を算出する。 The evaluation value calculation unit 117 calculates the evaluation value α [i] by any of the following methods.

（１）評価値の第１の算出方法は、下記の式（２３）に示すように、周波数帯域データＥ１１［ｉ］と、周波数帯域データＥ１１［ｉ］に時間的に対応する第２の周波数帯域データＥ１２［ｊ］とを加算する方法である。 (1) The first evaluation value calculation method includes frequency band data E11 [i] and a second frequency corresponding in time to frequency band data E11 [i] as shown in the following equation (23). This is a method of adding the band data E12 [j].

なお、周波数帯域データＥ１１［ｉ］と、Ｅ１１［ｉ］に時間的に対応する第２の周波数帯域データＥ１２［ｊ］とを加算した値に所定値を乗算した値を評価値としてもよい。

Note that a value obtained by multiplying a value obtained by adding the frequency band data E11 [i] and the second frequency band data E12 [j] temporally corresponding to E11 [i] by a predetermined value may be used as the evaluation value.

（２）評価値の第２の算出方法は、下記の式（２４）に示すように、Ｅ１１［ｉ］に係数β３を乗じた値と、Ｅ１１［ｉ］に時間的に対応するＥ１２［ｊ］に係数β４を乗じた値との加算値を用いる方法である。ただし、β３＞０、β４＞０である。第２の算出方法では、Ｅ１１［ｉ］とＥ１２［ｊ］に対して、各々重み付けを行なって加算していることになる。 (2) As shown in the following equation (24), the second method of calculating the evaluation value is a value obtained by multiplying E11 [i] by a coefficient β3, and E12 [j] temporally corresponding to E11 [i]. ] Is added to a value obtained by multiplying the coefficient β4 by a coefficient β4. However, β3> 0 and β4> 0. In the second calculation method, E11 [i] and E12 [j] are respectively weighted and added.

（３）評価値の第３の算出方法は、下記の式（２５）に示すように、Ｅ１１［ｉ］の対数値に係数β３を乗じた値と、Ｅ１１［ｉ］に時間的に対応するＥ１２［ｊ］の対数値に係数β４を乗じた値との加算値を用いる方法である。なお、第１から第３の算出方法は、Ｅ１１とＥ１２のどちらかが小さい箇所で、評価値をあまり小さくしたくない場合に用いる。第３の算出方法は、それに加えて、Ｅ１１とＥ１２のそれぞれの値の範囲が大きく異なる場合に適している。

(3) The third method of calculating the evaluation value, shown in Equation (25) below, uses the sum of the logarithm of E11 [i] multiplied by the coefficient β3 and the logarithm of the temporally corresponding E12 [j] multiplied by the coefficient β4. The first to third calculation methods are used when the evaluation value should not become very small at locations where only one of E11 and E12 is small. The third calculation method is additionally suited to cases where the value ranges of E11 and E12 differ greatly.

（４）評価値の第４の算出方法は、下記の式（２６）に示すように、Ｅ１１［ｉ］とＥ１１［ｉ］に時間的に対応するＥ１２［ｊ］との積を用いる方法である。なお、式（２６）の右辺にさらに所定値を乗算した値を評価値としてもよい。

(4) The fourth calculation method of the evaluation value is a method using a product of E11 [i] and E12 [j] temporally corresponding to E11 [i] as shown in the following equation (26). is there. Note that a value obtained by further multiplying the right side of Expression (26) by a predetermined value may be used as the evaluation value.

（５）評価値の第５の算出方法は、下記の式（２７）に示すように、Ｅ１１［ｉ］を基数としγ３を指数とした累乗値と、Ｅ１１［ｉ］に時間的に対応するＥ１２［ｊ］を基数としγ４を指数とした累乗値との積を用いる方法である。第４及び第５の算出方法は、Ｅ１１とＥ１２のどちらかが小さければ、評価値も小さくしたい場合に用いる。第５の算出方法は、それに加えて、Ｅ１１とＥ１２の評価値への影響力に重みを付けたい場合に適している。なお、式（２７）の右辺にさらに所定値を乗算した値を評価値としてもよい。

(5) The fifth method of calculating the evaluation value temporally corresponds to the power value with E11 [i] as the base and γ3 as the exponent, and E11 [i] as shown in the following equation (27). This is a method of using a product of a power value with E12 [j] as a radix and γ4 as an exponent. The fourth and fifth calculation methods are used when it is desired to reduce the evaluation value if either E11 or E12 is small. In addition, the fifth calculation method is suitable when it is desired to weight the influence on the evaluation values of E11 and E12. Note that a value obtained by further multiplying the right side of Expression (27) by a predetermined value may be used as the evaluation value.

（６）評価値の第６の算出方法は、下記の式（２８）に示すように、Ｅ１１［ｉ］を基数としγ３を指数とした累乗値と係数β３の積と、Ｅ１１［ｉ］に時間的に対応するＥ１２［ｊ］を基数としγ４を指数とした累乗値と係数β４の積との和を用いる方法である。なお、式（２８）の右辺にさらに所定値を乗算した値を評価値としてもよい。

(6) The sixth method of calculating the evaluation value, shown in Equation (28) below, uses the sum of two terms: the coefficient β3 multiplied by E11 [i] raised to the power γ3, and the coefficient β4 multiplied by the temporally corresponding E12 [j] raised to the power γ4. Note that a value obtained by further multiplying the right side of Equation (28) by a predetermined value may be used as the evaluation value.

Note that the evaluation value calculation unit 117 may calculate the evaluation value by one of the first to sixth methods described above only when the condition E11[i] ≧ θ1 and E12[j] ≧ θ2 (θ1 and θ2 are predetermined values) is satisfied, and may set the evaluation value to “0” when the condition is not satisfied. In addition, after calculating the evaluation value α[i], the evaluation value calculation unit 117 may set α[i] to “0” if α[i] < θ3 (θ3 is a predetermined value).
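The fourth to sixth methods and the optional thresholds can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name, the `method` strings, and the mapping `j = i // q1` (one E12 frame per Q1 consecutive E11 frames) are assumptions made for the example, and the inputs are assumed to be non-negative.

```python
from typing import List, Sequence


def evaluation_values(
    e11: Sequence[float],
    e12: Sequence[float],
    q1: int,
    method: str = "product",
    gamma3: float = 1.0,
    gamma4: float = 1.0,
    beta3: float = 1.0,
    beta4: float = 1.0,
    theta1: float = 0.0,
    theta2: float = 0.0,
    theta3: float = 0.0,
) -> List[float]:
    """Combine short-scale data e11 with long-scale data e12 (inputs assumed >= 0)."""
    alpha: List[float] = []
    for i in range(min(len(e11), q1 * len(e12))):
        # ASSUMPTION: E12[j] "temporally corresponds" to E11[i] when j = i // q1.
        j = i // q1
        a, b = e11[i], e12[j]
        # Optional input thresholds: below theta1/theta2 the value is forced to 0.
        if a < theta1 or b < theta2:
            alpha.append(0.0)
            continue
        if method == "product":          # fourth method
            value = a * b
        elif method == "power_product":  # fifth method
            value = (a ** gamma3) * (b ** gamma4)
        elif method == "power_sum":      # sixth method
            value = beta3 * a ** gamma3 + beta4 * b ** gamma4
        else:
            raise ValueError(f"unknown method: {method}")
        # Optional output threshold: values below theta3 are forced to 0.
        alpha.append(value if value >= theta3 else 0.0)
    return alpha
```

With the product form, a frame scores high only when both the short-scale and the long-scale values are high; the weighted sum lets one of them compensate for the other.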

The evaluation value α[i] calculated by the methods described above becomes larger as E11[i] becomes larger and as the E12[j] that temporally corresponds to E11[i] becomes larger. Within the multi-layered structure of music, which spans various time scales, E11[i] represents a relatively short-term change, such as one note or one beat, and E12[j] represents a longer-term change.

At a “feature position” of a piece of music, such as the start of the chorus, a point where the tune changes significantly, a position suitable for previewing, or a position that gives the listener a strong impression, the frequency band at the very beginning is often very wide, and the average frequency band over the following 1 to 8 bars or so is also often wide, so the evaluation value at such a position is large. Therefore, by detecting the global maximum or a local maximum of the evaluation values, a feature position such as the start of the chorus can be detected accurately.
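The distinction between the global maximum and local maxima of the evaluation values can be illustrated with a short Python sketch. The actual detection procedure is the one described for the feature position detection unit 115 in Embodiment 3, which is not reproduced in this section; the two helper functions below are hypothetical and only show the basic idea.

```python
from typing import List, Sequence


def global_maximum_index(values: Sequence[float]) -> int:
    """Index of the largest value (the first one if there are ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def local_maximum_indices(values: Sequence[float]) -> List[int]:
    """Interior indices that rise from the left and do not rise to the right."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] >= values[i + 1]
    ]
```

Multiplying a returned frame index by the frame shift time would give the candidate position in seconds.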

制御部１１１は、評価値算出部１１７の処理の終了を検知すると、平滑化部１１４ａに対して動作を開始するように指示する。平滑化部１１４ａは、実施の形態３の平滑化部１１４と同様な動作を行う。ただし、平滑化部１１４ａは、周波数帯域データＥ１１［ｉ］（ｉ＝０〜Ｈ１１−１）の代わりに、評価値α［ｉ］（ｉ＝０〜Ｑ１×Ｈ１２−１）を対象として処理を行い、平滑化出力Ｅ［ｉ］（ｉ＝０〜Ｑ１×Ｈ１２−１）を算出する。なお、平滑化部１１４ａは省略されてもよい。また、実施の形態３と同様に、周波数帯域データ算出部１１３の後に平滑化部１１４を設けて周波数帯域データを平滑化してもよい。更に、第２の周波数帯域データを平滑化してもよい。 When detecting the end of the processing of the evaluation value calculation unit 117, the control unit 111 instructs the smoothing unit 114a to start the operation. The smoothing unit 114a performs the same operation as the smoothing unit 114 of the third embodiment. However, the smoothing unit 114a processes the evaluation value α [i] (i = 0 to Q1 × H12-1) instead of the frequency band data E11 [i] (i = 0 to H11-1). The smoothed output E [i] (i = 0 to Q1 × H12-1) is calculated. Note that the smoothing unit 114a may be omitted. Similarly to the third embodiment, the frequency band data may be smoothed by providing a smoothing unit 114 after the frequency band data calculation unit 113. Further, the second frequency band data may be smoothed.

制御部１１１は、平滑化部１１４ａの処理の終了を検知すると、特徴位置検出部１１５に対して動作を開始するように指示する。特徴位置検出部１１５は、実施の形態３において説明した処理と同様な処理を行って、特徴位置情報１０３を音響信号分析装置１０１の外部に出力する。 When detecting the end of the process of the smoothing unit 114a, the control unit 111 instructs the feature position detection unit 115 to start the operation. The feature position detection unit 115 performs processing similar to the processing described in the third embodiment, and outputs the feature position information 103 to the outside of the acoustic signal analyzer 101.

As described above, in order to accurately detect changes in the frequency band at different time scales that arise from the multi-layered structure of music, the acoustic signal analysis device 101 of Embodiment 4 calculates two types of frequency band data using two time sections of different lengths and combines them to calculate the evaluation value. As a result, even when the frequency band changes on different time scales, a feature position such as the position of the chorus can be detected accurately.

(Embodiment 5)
Next, the acoustic signal analysis device 101 of Embodiment 5 will be described with reference to FIG. 23. FIG. 23 is a configuration diagram of the acoustic signal analysis device 101 of Embodiment 5. As shown in FIG. 23, the acoustic signal analysis device 101 of Embodiment 5 includes a control unit 111, an acquisition unit 112, a frequency band data calculation unit 113, a smoothing unit 114a, a feature position detection unit 115, an evaluation value calculation unit 117a, and a volume data calculation unit 118.

実施の形態５の音響信号分析装置１０１は、実施の形態４の音響信号分析装置１０１が有する第２の周波数帯域データ算出部１１６の代わりに音量データ算出部１１８を有する。また、実施の形態５の音響信号分析装置１０１は、実施の形態４の音響信号分析装置１０１が有する評価値算出部１１７に代えて評価値算出部１１７ａを有する。その点が、実施の形態５と実施の形態４の相違点である。 The acoustic signal analysis device 101 according to the fifth embodiment includes a volume data calculation unit 118 instead of the second frequency band data calculation unit 116 included in the acoustic signal analysis device 101 according to the fourth embodiment. The acoustic signal analysis apparatus 101 according to the fifth embodiment includes an evaluation value calculation unit 117a instead of the evaluation value calculation unit 117 included in the acoustic signal analysis apparatus 101 according to the fourth embodiment. This is the difference between the fifth embodiment and the fourth embodiment.

The volume data calculation unit 118 calculates data relating to volume for each predetermined time section. The time length Tf13 of the frames processed by the volume data calculation unit 118 and the time length Tf11 of the frames processed by the frequency band data calculation unit 113 may be the same or different. In Embodiment 5, Tf13 > Tf11, although this is not a limitation. In this case, the number of samples N13 per frame processed by the volume data calculation unit 118 is N13 = Tf13/Ts, and is therefore larger than the number of samples N11 per frame processed by the frequency band data calculation unit 113.

実施の形態４において説明したように、音楽に係る音響信号の周波数成分は、時間スケールの異なる様々な要因（音楽の重層的な構造）により変化するが、音量についても同様なことが言える。 As described in the fourth embodiment, the frequency component of an acoustic signal related to music varies depending on various factors (multi-layered structure of music) having different time scales, but the same can be said for sound volume.

For example, at the start of the chorus of a piece of music, multiple instruments and vocals are typically performed simultaneously and, in addition, individual instruments are often played strongly (forte). There is therefore a strong tendency for the frequency band to widen and the volume to increase within a relatively short time section corresponding to a sixteenth note to a half note. Furthermore, a typical chorus is several bars or more in length, and both low-range and high-range parts often continue to be played, so the frequency band tends to be wide and the volume high over a relatively long time corresponding to several bars from the start of the chorus. Because the chorus has these characteristics, combining frequency band data and volume data that have different time scales can improve the accuracy of detecting feature positions such as the chorus.

Tf11 is set to a length of roughly a sixteenth note to a half note or shorter, and Tf13 is set to a time length of roughly 1 to 8 bars. For example, Tf11 is set to 125 msec, which corresponds to a sixteenth note in a piece in 4/4 time at a tempo of 120, and Tf13 is set to 8 sec, which corresponds to four bars.
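The example figures can be checked with a little arithmetic. The sketch below assumes that the tempo counts quarter-note beats per minute; the function names are illustrative only.

```python
def note_duration_sec(tempo_bpm: float, note_value: int) -> float:
    """Duration of a 1/note_value note (4 = quarter, 16 = sixteenth).

    ASSUMPTION: the tempo is given in quarter-note beats per minute.
    """
    quarter_sec = 60.0 / tempo_bpm
    return quarter_sec * 4.0 / note_value


def bars_duration_sec(tempo_bpm: float, bars: int, beats_per_bar: int = 4) -> float:
    """Duration of a number of bars, with quarter-note beats (e.g. 4/4 time)."""
    return bars * beats_per_bar * 60.0 / tempo_bpm
```

At a tempo of 120, `note_duration_sec(120, 16)` gives 0.125 s (125 msec) and `bars_duration_sec(120, 4)` gives 8.0 s, matching the values of Tf11 and Tf13 in the example.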

音量データ算出部１１８が動作する際のフレームシフトの時間長Ｔｇ１３と、周波数帯域データ算出部１１３が動作する際のフレームシフトの時間長Ｔｇ１１とは、同じであってもよいし、異なっていてもよい。実施の形態５では、Ｔｇ１３は、Ｔｇ１３＝Ｒ１×Ｔｇ１１であり、音量データ算出部１１８が動作する際のフレームシフトのサンプル数Ｇ１３は、Ｇ１３＝Ｒ１×Ｇ１１（Ｒ１は１以上の整数）である。しかしながら、Ｔｇ１３及びＧ１３はこれに限定されない。 The frame shift time length Tg13 when the volume data calculation unit 118 operates and the frame shift time length Tg11 when the frequency band data calculation unit 113 operates may be the same or different. Good. In the fifth embodiment, Tg13 is Tg13 = R1 × Tg11, and the frame shift sample number G13 when the volume data calculation unit 118 operates is G13 = R1 × G11 (R1 is an integer of 1 or more). . However, Tg13 and G13 are not limited to this.

音量データ算出部１１８は、制御部１１１の指示に従って、図２４のフローチャートに示す動作を開始する。図２４は、音量データ算出部１１８の動作の各ステップを示すフローチャートである。 The sound volume data calculation unit 118 starts the operation shown in the flowchart of FIG. 24 in accordance with an instruction from the control unit 111. FIG. 24 is a flowchart showing each step of the operation of the sound volume data calculation unit 118.

先ず、音量データ算出部１１８は、式（１７）を用いてフレームの総数Ｈ１３を算出する（Ｓ９００）。すなわち、音量データ算出部１１８は、式（１７）のＮ１１をＮ１３に置き換え、Ｇ１１をＧ１３に置き換え、Ｈ１１をＨ１３に置き換えて、フレームの総数Ｈ１３を算出する。実施の形態５では、Ｍ＞Ｎ１３である。音量データ算出部１１８が処理するフレーム総数Ｈ１３は、周波数帯域データ算出部１１３が処理するフレーム総数Ｈ１１以下である。 First, the volume data calculation unit 118 calculates the total number H13 of frames using the equation (17) (S900). That is, the sound volume data calculation unit 118 calculates the total number of frames H13 by replacing N11 in Equation (17) with N13, replacing G11 with G13, and replacing H11 with H13. In the fifth embodiment, M> N13. The total number of frames H13 processed by the volume data calculation unit 118 is equal to or less than the total number of frames H11 processed by the frequency band data calculation unit 113.

次に、音量データ算出部１１８は、制御変数ｉに「０」をセットする（Ｓ９１０）。 Next, the volume data calculation unit 118 sets “0” to the control variable i (S910).

Next, the volume data calculation unit 118 generates the i-th frame data (S920). Specifically, the volume data calculation unit 118 takes the acoustic data from x[i×G13] through x[i×G13+N13-1] as the i-th frame data. Alternatively, the volume data calculation unit 118 may generate, as the i-th frame data, the values obtained by multiplying the data from x[i×G13] through x[i×G13+N13-1] by a window function. The window function is, for example, a Hamming window, a Hann (Hanning) window, a Blackman window, or a Gaussian window. The first method is equivalent to generating the i-th frame data by multiplying the acoustic data by a rectangular window.

ところで、窓関数を用いる場合、通常はフレームの中央で窓関数の係数を最大とし、フレームの先頭と末尾で窓関数の係数を最小とするが、この他の方法を用いてもよい。例えば、音量データ算出部１１８は、フレームの先頭（ｘ［ｉ×Ｇ１３］）で窓関数の係数を最大とし、その後窓関数の係数を順次減少させ、フレームの末尾（ｘ［ｉ×Ｇ１３＋Ｎ１３−１］）で窓関数の係数が最小となるようにしてもよい。ｉ番目のフレームデータを「Ｄ１３［ｉ］［ｊ］（ｊ＝０〜ＮＤ１３、ただしＮＤ１３＝Ｎ１３−１）」と記載する。 By the way, when the window function is used, the window function coefficient is usually maximized at the center of the frame and the window function coefficient is minimized at the beginning and end of the frame, but other methods may be used. For example, the volume data calculation unit 118 maximizes the window function coefficient at the beginning of the frame (x [i × G13]), then sequentially decreases the coefficient of the window function, and ends the frame (x [i × G13 + N13-1). ]), The window function coefficient may be minimized. The i-th frame data is described as “D13 [i] [j] (j = 0 to ND13, where ND13 = N13-1)”.

The head D11[i×R1][0] of the (i×R1)-th frame data processed by the frequency band data calculation unit 113 and the head D13[i][0] of the i-th frame data processed by the volume data calculation unit 118 are both x[i×G13] and therefore coincide, but the heads of the frames need not be aligned in this way. For example, the centers of the frames or the ends of the frames may be aligned instead.

次に、音量データ算出部１１８は、ｉ番目のフレームデータを使って、後述する方法に従って音量データを算出する（Ｓ９３０）。 Next, the volume data calculation unit 118 calculates volume data using the i-th frame data according to a method described later (S930).

次に、音量データ算出部１１８は、制御変数ｉの値を「１」増やす（Ｓ９４０）。 Next, the volume data calculation unit 118 increases the value of the control variable i by “1” (S940).

次に、音量データ算出部１１８は、制御変数ｉの値がＨ１３未満であるか否かを判定する（Ｓ９５０）。音量データ算出部１１８は、制御変数ｉの値がＨ１３未満であると判定すると（Ｓ９５０でＹｅｓ）、ステップＳ９２０に戻ってステップＳ９４０までの処理を繰り返し、制御変数ｉの値がＨ１３であると判定すると（Ｓ９５０でＮｏ）、処理を終了する。 Next, the volume data calculation unit 118 determines whether or not the value of the control variable i is less than H13 (S950). When determining that the value of the control variable i is less than H13 (Yes in S950), the sound volume data calculation unit 118 returns to step S920 and repeats the processing up to step S940, and determines that the value of the control variable i is H13. Then (No in S950), the process ends.

音量データ算出部１１８は、上述した処理により、Ｈ１３個の音量データＥ１３[ｉ］（ｉ＝０〜Ｈ１３−１）を算出し、処理が終了したことを制御部１１１に通知する。 The volume data calculation unit 118 calculates H13 volume data E13 [i] (i = 0 to H13-1) by the above-described process, and notifies the control unit 111 that the process is completed.
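The framing loop of steps S900 to S950 can be sketched in Python as follows. Equation (17) is not reproduced in this section, so the frame count used here, floor((M - N) / G) + 1 for M samples, is an assumption (it is the usual count of full frames). The function names are illustrative, and the Hamming window uses the common 0.54/0.46 coefficients.

```python
import math
from typing import List, Optional, Sequence


def frame_count(m: int, n: int, g: int) -> int:
    """Number of full frames of n samples with shift g in m samples.

    ASSUMPTION: stands in for equation (17), which is not shown in this section.
    """
    if m < n:
        return 0
    return (m - n) // g + 1


def hamming_window(n: int) -> List[float]:
    """Symmetric Hamming window: largest at the center, smallest at both ends."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def make_frames(
    x: Sequence[float],
    n: int,
    g: int,
    window: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Frame i covers x[i*g] .. x[i*g + n - 1]; window=None is a rectangular window."""
    frames: List[List[float]] = []
    for i in range(frame_count(len(x), n, g)):
        segment = list(x[i * g : i * g + n])
        if window is not None:
            segment = [s * w for s, w in zip(segment, window)]
        frames.append(segment)
    return frames
```

Each frame returned by `make_frames` would then be reduced to one volume value E13[i] in step S930.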

次に、音量データ算出部１１８がステップＳ９３０において行う処理の詳細を説明する。 Next, details of the processing performed by the volume data calculation unit 118 in step S930 will be described.

（１）音量データの第１の算出方法は、音響データの振幅の絶対値を用いる方法である。具体的には、下記の式（２９）に示すように、振幅の絶対値をフレームのサンプル数だけ加算した値（総和）をｉ番目のフレームに対応する音量データとする。 (1) The first calculation method of the volume data is a method using the absolute value of the amplitude of the acoustic data. Specifically, as shown in the following equation (29), a value (sum) obtained by adding the absolute value of the amplitude by the number of samples of the frame is set as volume data corresponding to the i-th frame.

なお、下記の式（３０）に示すように、総和の代わりに平均値を用いてｉ番目のフレームに対応する音量データを算出してもよい。

Note that, as shown in the following equation (30), volume data corresponding to the i-th frame may be calculated using an average value instead of the sum.

（２）音量データの第２の算出方法は、音響データの振幅の２乗を用いる方法である。具体的には、下記の式（３１）に示すように、振幅の２乗の値をフレームのサンプル数だけ加算した値（総和）をｉ番目のフレームに対応する音量データとする。

(2) The second calculation method of volume data is a method using the square of the amplitude of acoustic data. Specifically, as shown in the following equation (31), a value (sum) obtained by adding the square value of the amplitude by the number of samples of the frame is set as volume data corresponding to the i-th frame.

なお、下記の式（３２）に示すように、総和の代わりに平均値を用いてｉ番目のフレームに対応する音量データを算出してもよい。また、式（３１）又は式（３２）の右辺の平方根をとった値をｉ番目のフレームに対応する音量データＥ１３［ｉ］としてもよい。

As shown in the following equation (32), the volume data corresponding to the i-th frame may be calculated using an average value instead of the sum. Further, a value obtained by taking the square root of the right side of Expression (31) or Expression (32) may be used as the volume data E13 [i] corresponding to the i-th frame.

（３）音量データの第３の算出方法は、所定の範囲の周波数成分を用いる方法である。ｉ番目のフレームデータＤ１３［ｉ］［ｊ］に対して、離散フーリエ変換（ＤＦＴ)を行い、周波数スペクトルＳ１３［ｉ］［ｋ］（ｋ＝０〜Ｎ１３／２）を算出する。周波数スペクトルは、振幅スペクトルとパワースペクトルのいずれでもよい。そして、所定の範囲の各周波数の強度の総和をＥ１３［ｉ］とする。

(3) The third calculation method of volume data is a method using frequency components in a predetermined range. A discrete Fourier transform (DFT) is performed on the i-th frame data D13 [i] [j], and a frequency spectrum S13 [i] [k] (k = 0 to N13 / 2) is calculated. The frequency spectrum may be either an amplitude spectrum or a power spectrum. Then, the sum of the intensities of the respective frequencies within a predetermined range is set to E13 [i].

（４）音量データの第４の算出方法は、隣接する２つのフレームの音量を示す数値の差を用いる方法である。フレームの音量を示す数値は、上述した第１から第３のいずれかの算出方法により得られる値である。例えば、第１の算出方法によって得られた値を用いる場合、ｉ−１番目のフレームに対応する音響データを式（２９）に代入したときの演算結果をＥ１３’［ｉ−１］として保持するとともに、ｉ番目のフレームに対応する音響データを式（２９）に代入したときの演算結果をＥ１３’［ｉ］として保持する。そして、Ｅ１３’［ｉ］とＥ１３’［ｉ−１］との差Ｅ１３［ｉ］＝Ｅ１３’［ｉ］−Ｅ１３’［ｉ−１］を、音量データとして算出する。この方法は、音量の変化量を算出する方法である。 (4) A fourth calculation method of volume data is a method using a difference in numerical values indicating the volume of two adjacent frames. The numerical value indicating the volume of the frame is a value obtained by any one of the first to third calculation methods described above. For example, when the value obtained by the first calculation method is used, the calculation result when the acoustic data corresponding to the (i−1) th frame is substituted into the equation (29) is held as E13 ′ [i−1]. At the same time, the calculation result when the acoustic data corresponding to the i-th frame is substituted into Expression (29) is held as E13 ′ [i]. Then, the difference E13 [i] = E13 ′ [i] −E13 ′ [i−1] between E13 ′ [i] and E13 ′ [i−1] is calculated as volume data. This method is a method for calculating the amount of change in volume.
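The images for equations (29) to (32) are not reproduced in this text. Going only by the descriptions of the first and second methods, with D13[i][j] the j-th sample of the i-th frame and N13 the number of samples per frame, they presumably read as follows; the notation in the original drawings may differ.

```latex
\begin{align}
E_{13}[i] &= \sum_{j=0}^{N_{13}-1} \bigl| D_{13}[i][j] \bigr| \tag{29} \\
E_{13}[i] &= \frac{1}{N_{13}} \sum_{j=0}^{N_{13}-1} \bigl| D_{13}[i][j] \bigr| \tag{30} \\
E_{13}[i] &= \sum_{j=0}^{N_{13}-1} D_{13}[i][j]^{2} \tag{31} \\
E_{13}[i] &= \frac{1}{N_{13}} \sum_{j=0}^{N_{13}-1} D_{13}[i][j]^{2} \tag{32}
\end{align}
```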

上述した第１から第４の算出方法において、例えば、音量データの最大値が１となり、最小値が０になるように、得られたデータを正規化してもよい。 In the first to fourth calculation methods described above, for example, the obtained data may be normalized so that the maximum value of the volume data is 1 and the minimum value is 0.
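The first, second, and fourth volume calculation methods and the optional normalization can be sketched in Python as follows. The function names are illustrative, the spectral (third) method is left out for brevity, and setting the change value of the first frame to 0 is an assumption, since the text does not say how a frame without a predecessor is handled.

```python
import math
from typing import List, Sequence


def volume_abs(frame: Sequence[float], mean: bool = False) -> float:
    """First method: sum (or mean) of absolute amplitudes."""
    total = sum(abs(s) for s in frame)
    return total / len(frame) if mean else total


def volume_square(frame: Sequence[float], mean: bool = False, root: bool = False) -> float:
    """Second method: sum (or mean) of squared amplitudes, optionally square-rooted."""
    total = sum(s * s for s in frame)
    value = total / len(frame) if mean else total
    return math.sqrt(value) if root else value


def volume_change(levels: Sequence[float]) -> List[float]:
    """Fourth method: difference between the levels of adjacent frames."""
    if not levels:
        return []
    # ASSUMPTION: the first frame has no predecessor, so its change is set to 0.
    return [0.0] + [levels[i] - levels[i - 1] for i in range(1, len(levels))]


def normalize(values: Sequence[float]) -> List[float]:
    """Rescale so that the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Calling `volume_square(frame, mean=True, root=True)` yields the root-mean-square level of the frame.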

When the control unit 111 detects that the frequency band data calculation unit 113 and the volume data calculation unit 118 have finished processing, it instructs the evaluation value calculation unit 117a to start operating. The evaluation value calculation unit 117a operates in the same way as the evaluation value calculation unit 117 of Embodiment 4. However, whereas in Embodiment 4 the evaluation value calculation unit 117 calculates the evaluation value using the frequency band data E11 and the second frequency band data E12, in Embodiment 5 the evaluation value calculation unit 117a calculates the evaluation value α using the frequency band data E11 and the volume data E13.

制御部１１１は、評価値算出部１１７ａの処理の終了を検知すると、平滑化部１１４ａに対して動作を開始するように指示する。平滑化部１１４ａは実施の形態４と同じ動作を行う。 When detecting the end of the process of the evaluation value calculation unit 117a, the control unit 111 instructs the smoothing unit 114a to start the operation. The smoothing unit 114a performs the same operation as in the fourth embodiment.

制御部１１１は、平滑化部１１４ａの処理の終了を検知すると、特徴位置検出部１１５に対して動作を開始するように指示する。特徴位置検出部１１５は、実施の形態３において説明した動作と同じ動作を行って、特徴位置情報１０３を音響信号分析装置１０１の外部に出力する。 When detecting the end of the process of the smoothing unit 114a, the control unit 111 instructs the feature position detection unit 115 to start the operation. The feature position detection unit 115 performs the same operation as that described in the third embodiment, and outputs the feature position information 103 to the outside of the acoustic signal analyzer 101.

上述したように、実施の形態５の音響信号分析装置１０１は、音楽の重層的な構造に起因する異なる時間スケールでの周波数帯域の変化と音量変化とを精度良く検出するために、時間長の異なる２つの時間区間を用いて周波数帯域データと音量データとを算出し、それらを組合せて評価値を算出する。このため、更に精度良く特徴位置を検出することができる。 As described above, the acoustic signal analysis apparatus 101 according to the fifth embodiment has a time length in order to accurately detect a change in frequency band and a change in volume at different time scales due to the multi-layered structure of music. Frequency band data and volume data are calculated using two different time intervals, and an evaluation value is calculated by combining them. For this reason, the feature position can be detected with higher accuracy.

(Embodiment 6)
Next, the acoustic signal analysis device 101 of Embodiment 6 will be described with reference to FIG. 25. FIG. 25 is a configuration diagram of the acoustic signal analysis device 101 of Embodiment 6. As shown in FIG. 25, the acoustic signal analysis device 101 of Embodiment 6 includes a control unit 111, an acquisition unit 112, a frequency band data calculation unit 113, a smoothing unit 114a, a feature position detection unit 115, a second frequency band data calculation unit 116, an evaluation value calculation unit 117, and a beat time detection unit 119.

The acoustic signal analysis device 101 of Embodiment 6 has a beat time detection unit 119 in addition to the components of the acoustic signal analysis device 101 of Embodiment 4. This is the difference between Embodiment 6 and Embodiment 4.

When the control unit 111 detects that the acquisition unit 112 has generated the acoustic data, it instructs the beat time detection unit 119 to start operating before instructing the frequency band data calculation unit 113 and the second frequency band data calculation unit 116 to start operating.

拍時間検出部１１９は、フレーム単位で処理を行う。拍時間検出部１１９が処理するフレームの時間長をＴｆ１４とし、拍時間検出部１１９が動作する際のフレームシフトの時間長をＴｇ１４とする。拍時間検出部１１９が処理するフレームのサンプル数Ｎ１４は、Ｎ１４＝Ｔｆ１４／Ｔｓであり、フレームシフトのサンプル数Ｇ１４は、Ｇ１４＝Ｔｇ１４／Ｔｓである。拍時間を精度良く算出するために、Ｔｆ１４及びＴｇ１４は１拍の長さよりもかなり短い時間に設定される。一般的な音楽では、テンポが６０から２４０であり、１拍の時間長が２５０ｍｓｅｃから１ｓｅｃの範囲であることが多いので、Ｔｆ１４及びＴｇ１４は、５ｍｓｅｃから５０ｍｓｅｃ程度の適切な値に設定される。 The beat time detector 119 performs processing in units of frames. The time length of the frame processed by the beat time detecting unit 119 is Tf14, and the time length of the frame shift when the beat time detecting unit 119 is operated is Tg14. The frame sample number N14 processed by the beat time detection unit 119 is N14 = Tf14 / Ts, and the frame shift sample number G14 is G14 = Tg14 / Ts. In order to accurately calculate the beat time, Tf14 and Tg14 are set to a time considerably shorter than the length of one beat. In general music, since the tempo is 60 to 240 and the time length of one beat is often in the range of 250 msec to 1 sec, Tf14 and Tg14 are set to appropriate values of about 5 msec to 50 msec.

The beat time detection unit 119 performs processing according to the flowchart shown in FIG. 26. FIG. 26 is a flowchart showing the steps of the operation of the beat time detection unit 119.

拍時間検出部１１９は、先ず、式（１７）を用いてフレームの総数Ｈ１４を算出する（Ｓ１０００）。具体的には、拍時間検出部１１９は、式（１７）のＮ１１をＮ１４に置き換え、Ｇ１１をＧ１４に置き換え、Ｈ１１をＨ１４に置き換えて、フレームの総数Ｈ１４を算出する。 First, the beat time detection unit 119 calculates the total number H14 of frames using Expression (17) (S1000). Specifically, the beat time detection unit 119 calculates the total number H14 of frames by replacing N11 in Equation (17) with N14, replacing G11 with G14, and replacing H11 with H14.

次に、拍時間検出部１１９は、制御変数ｉに「０」をセットする（Ｓ１０１０）。 Next, the beat time detection unit 119 sets “0” to the control variable i (S1010).

Next, the beat time detection unit 119 generates the i-th frame data (S1020). Specifically, the beat time detection unit 119 takes the acoustic data from x[i×G14] through x[i×G14+N14-1] as the i-th frame data. Alternatively, the beat time detection unit 119 may generate, as the i-th frame data, the values obtained by multiplying the data from x[i×G14] through x[i×G14+N14-1] by a window function. The window function is, for example, a Hamming window, a Hann (Hanning) window, a Blackman window, or a Gaussian window. The first method is equivalent to generating the i-th frame data by multiplying the acoustic data by a rectangular window. The i-th frame data is written as “D14[i][j] (j = 0 to ND14, where ND14 = N14-1)”.

次に、拍時間検出部１１９は、ｉ番目のフレームに対応する音量の変化量を算出する（Ｓ１０３０）。具体的には、拍時間検出部１１９は、実施の形態５の音量データ算出部１１８が用いる音量データの第４の算出方法を用いて、音量の変化量Ｅ１４［ｉ］を算出する。 Next, the beat time detection unit 119 calculates the amount of change in volume corresponding to the i-th frame (S1030). Specifically, the beat time detection unit 119 calculates the volume change amount E14 [i] by using the fourth volume data calculation method used by the volume data calculation unit 118 of the fifth embodiment.

次に、拍時間検出部１１９は、制御変数ｉの値を「１」増やす（Ｓ１０４０）。 Next, the beat time detection unit 119 increases the value of the control variable i by “1” (S1040).

次に、拍時間検出部１１９は、制御変数ｉの値がＨ１４未満であるか否かを判定する（Ｓ１０５０）。拍時間検出部１１９は、制御変数ｉの値がＨ１４未満であると判定すると（Ｓ１０５０でＹｅｓ）、ステップＳ１０２０に戻ってステップＳ１０４０までの処理を繰り返す。 Next, the beat time detection unit 119 determines whether or not the value of the control variable i is less than H14 (S1050). When determining that the value of the control variable i is less than H14 (Yes in S1050), the beat time detecting unit 119 returns to Step S1020 and repeats the process up to Step S1040.

拍時間検出部１１９は、制御変数ｉの値がＨ１４であると判定すると（Ｓ１０５０でＮｏ）、音量の変化量Ｅ１４［ｉ］（ｉ＝０〜Ｈ１４−１）の自己相関を算出する（Ｓ１０６０）。拍時間検出部１１９は、自己相関のインデックスの差Δを所定のテンポの範囲で順次変えながら、下記の式（３３）に従って自己相関Ｙ（Δ）を算出する。 When determining that the value of the control variable i is H14 (No in S1050), the beat time detecting unit 119 calculates the autocorrelation of the volume change amount E14 [i] (i = 0 to H14-1) (S1060). ). The beat time detector 119 calculates the autocorrelation Y (Δ) according to the following equation (33) while sequentially changing the autocorrelation index difference Δ within a predetermined tempo range.

式（３３）において、Ｈｅ及びＨｆは、０≦Ｈｅ＜Ｈｆ≦Ｈ１４−１−Δ、を満たす所定の整数である。例えば、テンポの検出範囲が６０から２４０（１拍の時間２５０ｍｓｅｃから１０００ｍｓｅｃ）である場合、Ｅ１４はＴｇ１４の時間間隔で生成されているので、Δ＝（２５０／Ｔｇ１４）から（１０００／Ｔｇ１４）の範囲でΔは変えられる。Ｔｇ１４は、ｍｓｅｃ単位の値である。

In the formula (33), He and Hf are predetermined integers that satisfy 0 ≦ He <Hf ≦ H14-1-Δ. For example, if the tempo detection range is 60 to 240 (one beat time is 250 msec to 1000 msec), E14 is generated at a time interval of Tg14, so Δ = (250 / Tg14) to (1000 / Tg14) Δ can be changed in the range. Tg14 is a value in units of msec.

Next, the beat time detection unit 119 detects the peak position of the autocorrelation Y(Δ) and calculates the beat time length τ (S1070). As shown in FIG. 14, the autocorrelation Y(Δ) calculated in step S1060 has several peaks. The beat time detection unit 119 detects the position Δmax of the maximum value in the range between the shortest and the longest beats to be detected, and takes τ = Tg14 × Δmax as the time length of one beat. In FIG. 14, “P” is the Δ corresponding to the shortest beat to be detected, and “R” is the Δ corresponding to the longest beat to be detected.

Alternatively, as shown in FIG. 15, a distribution Ω(Δ) indicating the probability of occurrence of each beat time length may be prepared, and the beat time detection unit 119 may calculate the product Ω(Δ)Y(Δ) of this distribution and the autocorrelation, detect the position of its maximum value, and thereby detect the time length of one beat. By using Ω(Δ), the beat time detection unit 119 can calculate the beat time length more accurately. In FIG. 15, “P” is the Δ corresponding to the shortest beat to be detected, “U” is the Δ at which the probability of occurrence is highest, and “R” is the Δ corresponding to the longest beat to be detected.
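Steps S1060 and S1070 can be sketched in Python as follows. Equation (33) is not reproduced in this text, so the autocorrelation is assumed here to be an unnormalized sum of products E14[i] × E14[i+Δ] over a fixed index range; the original may include a normalization term. The function names and the optional `prior` callable, which stands in for Ω(Δ), are illustrative.

```python
import math
from typing import Callable, Optional, Sequence


def autocorrelation(e: Sequence[float], lag: int, start: int, stop: int) -> float:
    """Sum of e[i] * e[i + lag] for i = start .. stop (inclusive)."""
    return sum(e[i] * e[i + lag] for i in range(start, stop + 1))


def beat_length_ms(
    e14: Sequence[float],
    tg14_ms: float,
    min_beat_ms: float = 250.0,
    max_beat_ms: float = 1000.0,
    prior: Optional[Callable[[int], float]] = None,
) -> float:
    """Estimate the length of one beat from the volume-change sequence e14."""
    lo = max(1, math.ceil(min_beat_ms / tg14_ms))   # lag of the shortest beat
    hi = math.floor(max_beat_ms / tg14_ms)          # lag of the longest beat
    # Use the same summation range for every lag so that i + lag stays in bounds.
    stop = len(e14) - 1 - hi
    if lo > hi or stop < 0:
        raise ValueError("signal too short or empty lag range")

    def score(lag: int) -> float:
        weight = prior(lag) if prior is not None else 1.0
        return weight * autocorrelation(e14, lag, 0, stop)

    best_lag = max(range(lo, hi + 1), key=score)
    return tg14_ms * best_lag  # tau = Tg14 * delta_max
```

Because a periodic signal also correlates with itself at multiples of the beat length, a prior over plausible beat lengths helps to choose among peaks of similar height.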

拍時間検出部１１９は、このようにして検出した拍の時間長τを制御部１１１に通知する。 The beat time detection unit 119 notifies the control unit 111 of the beat duration τ detected in this way.

制御部１１１は、τ１１＝λ１１×τ、τ１２＝λ１２×τの２つの数値を算出する。λ１１及びλ１２は、λ１１＜λ１２を満たす、所定の係数である。例えば、λ１１は「０．２５」から「１」の値であり、λ１２は「４」から「８」程度の値である。 The control unit 111 calculates two numerical values of τ11 = λ11 × τ and τ12 = λ12 × τ. λ11 and λ12 are predetermined coefficients that satisfy λ11 <λ12. For example, λ11 is a value from “0.25” to “1”, and λ12 is a value from about “4” to “8”.

The control unit 111 then instructs the frequency band data calculation unit 113 to set Tf11 = τ11 and instructs the second frequency band data calculation unit 116 to set Tf12 = τ12. After that, the control unit 111 instructs the frequency band data calculation unit 113 and the second frequency band data calculation unit 116 to start operating. The frequency band data calculation unit 113 sets the frame time length Tf11 to τ11, which is based on the one-beat time length detected by the beat time detection unit 119, and the second frequency band data calculation unit 116 sets the frame time length Tf12 to τ12, which is likewise based on the detected one-beat time length. The subsequent operation of each unit is the same as described in Embodiment 4.
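Deriving the two frame lengths from the detected beat length is a small calculation, sketched below in Python. The default coefficients are example values from the ranges given in the text. Converting to sample counts as N = Tf / Ts (equivalently Tf × sampling rate) follows the relation stated for N13 and N14 and is assumed to hold for N11 and N12 as well; the function name and the 48 kHz default are illustrative.

```python
from typing import Tuple


def frame_lengths_from_beat(
    tau_sec: float,
    lambda11: float = 0.5,
    lambda12: float = 4.0,
    sample_rate_hz: float = 48000.0,
) -> Tuple[float, float, int, int]:
    """Return (Tf11, Tf12, N11, N12) for a detected beat length tau_sec."""
    if not lambda11 < lambda12:
        raise ValueError("lambda11 must be smaller than lambda12")
    tf11 = lambda11 * tau_sec  # short frame: a fraction of a beat up to one beat
    tf12 = lambda12 * tau_sec  # long frame: several beats
    # ASSUMPTION: samples per frame N = Tf / Ts, where Ts = 1 / sample_rate_hz.
    n11 = int(round(tf11 * sample_rate_hz))
    n12 = int(round(tf12 * sample_rate_hz))
    return tf11, tf12, n11, n12
```

For a beat of 0.5 s (tempo 120) with λ11 = 0.25 and λ12 = 8, this gives Tf11 = 0.125 s and Tf12 = 4.0 s.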

楽曲の１拍の時間長は、音楽のジャンルやスタイルによって異なるため、周波数帯域データや音量データを算出する際の最適な区間長も、音楽のジャンルやスタイルによって異なり、全てのジャンルの音楽に対して最適な区間長を予め決めておくことは難しい。実施の形態６の音響信号分析装置１０１は、１拍の時間長を検出し、それに基づいて周波数帯域データ及び第２の周波数帯域データを算出する際の区間長を設定する。これにより、様々なジャンルやタイプの音楽に対して、精度良く特徴位置を検出することができる。 Since the time length of one beat of the music differs depending on the music genre and style, the optimum section length for calculating the frequency band data and volume data also differs depending on the music genre and style. It is difficult to determine the optimal section length in advance. The acoustic signal analysis apparatus 101 according to the sixth embodiment detects a time length of one beat, and sets a section length when calculating frequency band data and second frequency band data based on the time length. Thereby, it is possible to accurately detect the feature position for various genres and types of music.

なお、実施の形態５において説明した音量データ算出部１１８も、拍の時間長に基づいてＴｆ１３を設定してもよい。 Note that the volume data calculation unit 118 described in the fifth embodiment may also set Tf13 based on the time length of the beat.

The methods of Embodiments 1 to 6 described above may also be combined. For example, the method of Embodiment 4 and the method of Embodiment 5 may be combined to calculate the frequency band data, the second frequency band data, and the volume data, and the evaluation value may be calculated using all three. Likewise, the method of Embodiment 1 and the method of Embodiment 4 may be combined to calculate a first feature quantity relating to volume, a second feature quantity relating to volume, the frequency band data, and the second frequency band data, and the evaluation value may be calculated using all four. Furthermore, three or more types of frequency band data may be calculated. By combining different types of feature quantities in this way (including both volume-related feature quantities and frequency-band-related data) to calculate the evaluation value, the characteristic parts of a wide variety of music can be detected even more accurately.

Furthermore, the functions of the components of the acoustic signal analysis device 101 of each embodiment described above are realized, for example, through cooperation between hardware, such as a computer's CPU (processor) and memory, and a computer program that implements those functions. However, these functions may be realized in any form, for example by dedicated circuits. The computer program that implements the functions of the components of the acoustic signal analysis device 101 may also be stored on a recording medium.

Description of reference numerals: 1 acoustic signal analysis device; 2 acoustic signal; 3 feature position information; 11 control unit; 12 acquisition unit; 13 first feature quantity calculation unit; 14 second feature quantity calculation unit; 15 evaluation value calculation unit; 16 feature position detection unit; 17 beat time detection unit; 101 acoustic signal analysis device; 102 acoustic signal; 103 feature position information; 111 control unit; 112 acquisition unit; 113 frequency band data calculation unit; 114 smoothing unit; 114a smoothing unit; 115 feature position detection unit; 116 second frequency band data calculation unit; 117 evaluation value calculation unit; 117a evaluation value calculation unit; 118 volume data calculation unit; 119 beat time detection unit.

Claims

An acoustic signal analysis device comprising:
an acquisition unit that acquires an acoustic signal;
a first volume information calculation unit that calculates a first value related to the volume of each of a plurality of sections, each having a first period, of the acoustic signal acquired by the acquisition unit;
a second volume information calculation unit that calculates a second value related to the volume of each of a plurality of sections, each having a second period longer than the first period, of the acoustic signal acquired by the acquisition unit;
an evaluation value calculation unit that uses the first value and the second value to calculate a time series of evaluation values, each evaluation value being larger as the first value is larger and as the second value corresponding in time to that first value is larger; and
a feature position detection unit that detects a position where the evaluation value calculated by the evaluation value calculation unit is a maximum or a local maximum.

The acoustic signal analysis device according to claim 1, further comprising a beat time detection unit that detects the time length of one beat of the acoustic signal acquired by the acquisition unit, wherein
the first volume information calculation unit sets the first period based on the time length of one beat detected by the beat time detection unit, and
the second volume information calculation unit sets the second period based on the time length of one beat detected by the beat time detection unit.

The acoustic signal analysis device according to claim 1 or 2, wherein the evaluation value calculation unit calculates the evaluation value using any one of:
a sum of the first value and the second value;
a sum of the first value multiplied by a first coefficient and the second value multiplied by a second coefficient;
a sum of the logarithm of the first value multiplied by a third coefficient and the logarithm of the second value multiplied by a fourth coefficient;
a product of the first value and the second value;
a product of a first power value, having the first value as its base and a fifth coefficient as its exponent, and a second power value, having the second value as its base and a sixth coefficient as its exponent; and
a sum of the first power value multiplied by a seventh coefficient and the second power value multiplied by an eighth coefficient.

The acoustic signal analysis device according to any one of claims 1 to 3, wherein the first volume information calculation unit takes, as the first value of each section having the first period, any one of:
a sum of absolute values of the amplitude of the acoustic signal in the section;
a sum of squared values of the amplitude of the acoustic signal in the section;
a sum of predetermined frequency components of the acoustic signal in the section;
a difference between the sums of absolute values of the amplitude of the acoustic signal in the first half and in the second half of the section;
a difference between the sums of squared values of the amplitude of the acoustic signal in the first half and in the second half of the section; and
a difference between the sums of the predetermined frequency components of the acoustic signal in the first half and in the second half of the section.

The acoustic signal analysis device according to claim 1, wherein the second volume information calculation unit takes, as the second value of each section having the second period, any one of:
a sum of absolute values of the amplitude of the acoustic signal in the section;
a sum of squared values of the amplitude of the acoustic signal in the section;
a sum of predetermined frequency components of the acoustic signal in the section;
a difference between the sums of absolute values of the amplitude of the acoustic signal in the first half and in the second half of the section;
a difference between the sums of squared values of the amplitude of the acoustic signal in the first half and in the second half of the section; and
a difference between the sums of the predetermined frequency components of the acoustic signal in the first half and in the second half of the section.

The acoustic signal analysis device according to any one of claims 1 to 5, wherein the feature position detection unit detects, for a continuous portion of the acoustic signal, a position where the evaluation value calculated by the evaluation value calculation unit is a maximum or a local maximum.

An acoustic signal analysis method comprising:
acquiring an acoustic signal;
calculating a first value related to the volume of each of a plurality of sections, each having a first period, of the acquired acoustic signal;
calculating a second value related to the volume of each of a plurality of sections, each having a second period longer than the first period, of the acquired acoustic signal;
calculating, using the first value and the second value, a time series of evaluation values, each evaluation value being larger as the first value is larger and as the second value corresponding in time to that first value is larger; and
detecting a section where the calculated evaluation value is a maximum or a local maximum.

An acoustic signal analysis program that causes a computer to realize:
a function of acquiring an acoustic signal;
a function of calculating a first value related to the volume of each of a plurality of sections, each having a first period, of the acquired acoustic signal;
a function of calculating a second value related to the volume of each of a plurality of sections, each having a second period longer than the first period, of the acquired acoustic signal;
a function of calculating, using the first value and the second value, a time series of evaluation values, each evaluation value being larger as the first value is larger and as the second value corresponding in time to that first value is larger; and
a function of detecting a section where the calculated evaluation value is a maximum or a local maximum.
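
For illustration, the steps recited for the method and the program can be sketched in Python as follows. This is one possible reading and not the claimed implementation. It assumes the sum of absolute amplitudes as both volume values and the product of the two values as the evaluation value, both of which appear among the listed alternatives. It further assumes non-overlapping sections, a long section that is an integer multiple (`ratio`) of the short one, and that the second value "corresponding in time" is the one for the long section containing the short section. All function and parameter names are hypothetical.

```python
def section_volumes(samples, section_len):
    """Sum of absolute amplitudes over consecutive, non-overlapping sections."""
    return [sum(abs(s) for s in samples[i:i + section_len])
            for i in range(0, len(samples) - section_len + 1, section_len)]


def evaluation_series(first, second, ratio):
    """Product of each first value and the second value of the long
    section that contains it (assumed meaning of 'corresponding in time')."""
    out = []
    for i, a in enumerate(first):
        j = min(i // ratio, len(second) - 1)  # clamp a trailing partial section
        out.append(a * second[j])
    return out


def local_maxima(values):
    """Indices of interior points that rise from the left and do not rise to the right."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] >= values[i + 1]]


def detect_feature_section(samples, short_len, ratio):
    """Return (index of the short section with the largest evaluation value,
    full evaluation series)."""
    first = section_volumes(samples, short_len)
    second = section_volumes(samples, short_len * ratio)
    if not first or not second:
        raise ValueError("signal is shorter than one long section")
    ev = evaluation_series(first, second, ratio)
    best = max(range(len(ev)), key=ev.__getitem__)
    return best, ev


# Toy signal: 16 samples, short sections of 2 samples, long sections of 4.
signal = [0, 0, 1, 1, 0, 0, 0, 0, 3, -3, 1, 1, 2, 2, 0, 0]
best, ev = detect_feature_section(signal, short_len=2, ratio=2)
```

In the toy signal, the short section at index 4 is both loud itself and inside the loudest long section, so it receives the largest evaluation value; a section that is loud only briefly, without loud surroundings, scores lower.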