JP4926044B2

JP4926044B2 - Apparatus and method for describing characteristics of sound signals

Info

Publication number: JP4926044B2
Application number: JP2007511960A
Authority: JP
Inventors: Markus Cremer; Christian Uhle
Original assignee: Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Priority date: 2004-05-07
Filing date: 2005-04-27
Publication date: 2012-05-09
Anticipated expiration: 2025-04-27
Also published as: WO2005114650A1; JP2007536586A; DE102004022659B3; EP1671315B1; DE502005000658D1; EP1671315A1

Description

The present invention relates to the analysis of sound signals, and in particular to the analysis of sound signals for the purpose of classifying and identifying them, in order to describe the characteristics of a sound signal.

The continuous development of digital distribution media for multimedia content makes a large variety of data available. For the human user, the overall volume has long since become unmanageable. Describing data textually by means of metadata is therefore becoming increasingly important. Fundamentally, the goal is to make not only text files but also, for example, music files, video files and other information signal files searchable, with the same convenience as common text databases. One approach to this is known as the MPEG-7 standard.

In particular, when analyzing audio signals, i.e. signals containing music and/or speech, the extraction of distinctive features is of great importance.

It is further desirable to "enrich" audio data with metadata so that metadata can be retrieved, for example, on the basis of a fingerprint of a piece of music. The "fingerprint" should be expressive on the one hand and as short and concise as possible on the other. A "fingerprint" thus denotes a compressed information signal generated from a music signal; it does not contain the metadata itself but serves to reference it, for example by searching a database, e.g. in a system for identifying audio material ("AudioID").

Music data usually consists of a superposition of partial signals from individual sound sources. While a piece of popular music has relatively few individual sources, namely the singer, guitar, bass guitar, drums and keyboard, the number of sources in an orchestral piece can be very large. Both an orchestral piece and a piece of popular music consist of a superposition of the tones emitted by the individual instruments. An orchestral piece, or any piece of music, thus represents a superposition of partial signals from individual sound sources, where the partial signals are the tones generated by the individual instruments of the orchestra or popular music ensemble, and the individual instruments are the individual sound sources.

Alternatively, groups of the original sound sources can themselves be regarded as individual sound sources, so that at least two individual sound sources can be assigned to one signal.

The analysis of a general information signal is described below, merely by way of example, with reference to an orchestra signal. An orchestra signal can be analyzed in many ways. For instance, it may be desired to recognize the individual instruments, to extract the individual signals of the instruments from the overall signal and, where applicable, to convert them into musical notation, the notation then serving as "metadata". A further possibility is to extract the dominant rhythm. Rhythm extraction is preferably based on the percussive instruments rather than on the tone-producing instruments, which are also called harmonic-sustained instruments. Percussive instruments typically include kettledrums, drums, rattles and other percussion instruments, whereas the harmonic-sustained instruments are all other instruments, such as violins, wind instruments and the like.

Furthermore, all acoustic or synthetic sound generators that contribute to the rhythm section by virtue of their sound characteristics (e.g. a rhythm guitar) are counted among the percussive instruments.

For rhythm extraction from a piece of music it would therefore be desirable, for example, to extract only the percussive parts from the whole piece and then to perform rhythm recognition on the basis of these percussive parts, without the rhythm recognition being "disturbed" by the signals of the harmonic-sustained instruments.

The art offers various approaches for automatically extracting patterns from pieces of music or for detecting the presence of patterns. In Coyle, E.J., Shmulevich, I., "A System for Machine Recognition of Music Patterns", IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998 (http://www2.mdanderson.org/app/ilya/-Publications/icassp98mpr.pdf), melodic themes are searched for. To this end, a theme is given, and the places where this theme occurs are then searched for.

In Schroeter, T., Doraisamy, S., Rueger, S., "From Raw Polyphonic Audio to Locating Recurring Themes", ISMIR 2000 (http:/ismir2000.ismir.net/posters/shoreterruger.pdf), melodic themes are searched for in a transcribed representation of the music signal. Here too, a theme is given, and the places where it occurs are then searched for.

In the conventional structure of Western music, melodic fragments, in contrast to the rhythmic structure, generally do not occur periodically. For this reason, many methods for searching for melodic fragments are limited to detecting their individual occurrences. Interest in the field of rhythm analysis, by contrast, is directed mainly at finding periodic structures.

In Meudic, B., "Musical Pattern Extraction: from Repetition to Musical Structure", Proc. CMMR 2003 (http://www.ircam.fr/equipes/repmus/RMPapers/CMMR-meudic-2003.pdf), melodic patterns are identified with the aid of a self-similarity matrix.

In Meek, Colin, Birmingham, W.P., "Thematic Extractor", ISMIR 2001 (http://ismir2001.ismir.net/pdf/meek.pdf), melodic themes are searched for. In particular, sequences are searched for, where the length of a sequence may range from two notes up to a predetermined number of notes.

In Smith, L., Medina, R., "Discovering Themes by Exact Pattern Matching", 2001 (http://citeseer.ist.psu.edu/-498226.html), melodic themes are likewise searched for with a self-similarity matrix.

In Lartillot, O., "Perception-Based Musical Pattern Discovery", Proc. IFMC 2003 (http://www.ircam.fr/-equipes/repmus/lartillot/cmmr.pdf), melodic themes are also searched for.

In Brown, J.C., "Determination of the Meter of Musical Scores by Autocorrelation", J. of Acoust. Soc. of America, vol. 94, no. 4, 1993, the type of meter underlying a piece of music is determined from a symbolic representation of the music signal, i.e. on the basis of a MIDI representation, by means of a periodicity function.

A similar approach is taken in Meudic, B., "Automatic Meter Extraction from MIDI files", Proc. JIM 2002 (http://www.ircam.fr/equipes/repmus/RMPapers/JIM-benoit.2002.pdf), where the tempo and the meter of the audio signal are estimated once the periodicities have been estimated.

Methods for identifying melodic themes are of only limited use for identifying the periodicities present in a sound signal: musical themes do recur but, as explained above, they do not describe a basic periodicity in a piece of music and carry, if anything, only higher-level periodicity information. In any case, methods for identifying melodic themes are very costly, because the search has to take the different variations of a theme into account. It is well known in music that themes are usually varied, for example by transposition, mirroring and the like.

Coyle, E.J., Shmulevich, I., "A System for Machine Recognition of Music Patterns", IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998 (http://www2.mdanderson.org/app/ilya/-Publications/icassp98mpr.pdf)
Schroeter, T., Doraisamy, S., Rueger, S., "From Raw Polyphonic Audio to Locating Recurring Themes", ISMIR 2000 (http:/ismir2000.ismir.net/posters/shoreterruger.pdf)
Meudic, B., "Musical Pattern Extraction: from Repetition to Musical Structure", Proc. CMMR 2003 (http://www.ircam.fr/equipes/repmus/RMPapers/CMMR-meudic-2003.pdf)
Meek, Colin, Birmingham, W.P., "Thematic Extractor", ISMIR 2001 (http://ismir2001.ismir.net/pdf/meek.pdf)
Smith, L., Medina, R., "Discovering Themes by Exact Pattern Matching", 2001 (http://citeseer.ist.psu.edu/-498226.html)
Lartillot, O., "Perception-Based Musical Pattern Discovery", Proc. IFMC 2003 (http://www.ircam.fr/-equipes/repmus/lartillot/cmmr.pdf)
Brown, J.C., "Determination of the Meter of Musical Scores by Autocorrelation", J. of Acoust. Soc. of America, vol. 94, no. 4, 1993
Meudic, B., "Automatic Meter Extraction from MIDI files", Proc. JIM 2002 (http://www.ircam.fr/equipes/repmus/RMPapers/JIM-benoit.2002.pdf)

It is the object of the present invention to provide an efficient and reliable concept for describing the characteristics of a sound signal.

This object is achieved by an apparatus for describing the characteristics of a sound signal according to claim 1, a method for describing the characteristics of a sound signal according to claim 13, or a computer program according to claim 14.

The present invention is based on the finding that a characteristic of a sound signal that can be computed efficiently, and that is expressive with respect to many kinds of information, can be determined from a sequence of entry times by determining a period length, dividing the sequence into subsequences, and combining the subsequences into a combined subsequence that serves as the characteristic.

Furthermore, it is preferred to consider not only a single sequence of entry times of a single instrument, i.e. of one individual sound source over time, but at least two sequences of entry times of two different sound sources that occur in parallel in the piece of music. It can typically be assumed that all sound sources, or at least a subset of them such as the percussive sources of a piece, share the same underlying period length. The sequences of entry times of the two sound sources are therefore used to determine a common period length underlying the at least two sound sources. According to the invention, each sequence of entry times is then divided into subsequences, the length of a subsequence being equal to the common period length.

The characteristic is then extracted by combining the subsequences of the first sound source into a first combined subsequence and the subsequences of the second sound source into a second combined subsequence. The combined subsequences serve as the characteristic of the sound signal and can be used for further processing, for example to extract semantically meaningful information about the whole piece of music, such as genre, tempo, type of meter, or similarity to other pieces.

The combined subsequence of the first sound source and the combined subsequence of the second sound source thus form a drum pattern of the sound signal when the two sound sources considered through their sequences of entry times are percussive sources, such as drums, other drum instruments or any other percussive instruments. Such instruments are distinguished by the fact that it is not the pitch of the tone, but the characteristic spectrum or the rise and decay of the output tone, that carries the higher musical significance.

Preferably, the invention thus provides automatic extraction of drum patterns from a preferably transcribed music signal, i.e., for example, from a note representation of the music signal. This representation may be given in MIDI format, or it may be determined automatically from an audio signal by methods of digital signal processing, for example using independent component analysis (ICA) or variants of it such as non-negative independent component analysis, or, more generally, concepts known under the keyword "blind source separation" (BSS).

In a preferred embodiment of the invention, the extraction of a drum pattern begins with the recognition of note entries, i.e. of the onset times for each instrument and, for tonal instruments, for each pitch. Alternatively, a note representation may be read in. Reading in may consist of reading a MIDI file, scanning and image-processing a score, or receiving manually entered notes.

In a preferred embodiment of the invention, a raster is then determined according to which the note entry times are to be quantized, whereupon the note entry times are quantized.
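The quantization step can be sketched as follows. This is only an illustration under stated assumptions: the raster spacing is taken as already known (the text does not say how it is determined), onset times are non-negative and given in seconds, and the names `quantize_onsets` and `raster_sec` are hypothetical.

```python
def quantize_onsets(onsets_sec, raster_sec):
    """Snap each onset time (in seconds) to the index of the nearest raster point.

    Assumes non-negative onset times and a known, constant raster spacing.
    """
    if raster_sec <= 0:
        raise ValueError("raster_sec must be positive")
    # int(x + 0.5) rounds to the nearest index for non-negative x;
    # the built-in round() would round exact halves to the even neighbour.
    return [int(t / raster_sec + 0.5) for t in onsets_sec]


onsets = [0.02, 0.49, 1.01, 1.26]            # slightly imprecise note entries
slots = quantize_onsets(onsets, raster_sec=0.25)
print(slots)
```

Each returned integer is a raster index, so two instruments that play "together" but with small timing deviations end up in the same raster interval.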

Next, the length of the drum pattern is determined as the length of a musical bar, as an integer multiple of the length of a musical bar, or as an integer multiple of the length of a musical beat.

A pattern histogram is then used to determine how often a particular instrument occurs at each metrical position.

Next, the relevant entries are selected in order to obtain a form of the drum pattern as the preferred characteristic of the sound signal. Alternatively, the pattern histogram can be processed as such. The pattern histogram is likewise a compressed representation of the musical events, i.e. of the notes played, and contains information about the degree of variation and about the preferred beat positions. A flat histogram indicates strong variation, whereas a very "mountainous" histogram indicates a rather stationary signal in the sense of self-similarity.
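The text does not define how the flatness of a histogram is measured. As one possible illustration, the sketch below uses the normalized entropy of a histogram row, which is an assumption of this example and not taken from the patent: a value near 1 means a flat row (strong variation), a value near 0 a single sharp peak. The function name `histogram_flatness` is hypothetical.

```python
import math


def histogram_flatness(counts):
    """Normalized entropy of one histogram row: 1.0 = flat, 0.0 = single peak.

    Assumed measure for illustration only; the source names no formula.
    """
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log(p)
    # Divide by the maximum possible entropy so the result lies in [0, 1].
    return entropy / math.log(len(counts))


flat_row = [3, 3, 3, 3]      # every position hit equally often
peaked_row = [4, 0, 0, 0]    # the same position hit in every period
print(histogram_flatness(flat_row), histogram_flatness(peaked_row))
```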

To improve the expressiveness of the histogram, it is preferred to carry out a preprocessing step first, which divides the signal into regions of similar signal characteristics. A drum pattern is then extracted only for regions that are similar to one another, and a separate drum pattern is determined for the other characteristic regions of the signal.

The invention is advantageous in that it provides a robust and efficient way of computing a characteristic of a sound signal, in particular because the division is based on a period length that can be determined robustly by statistical methods and that is the same for all signals. Moreover, the concept is scalable: its expressiveness and accuracy can readily be increased, at the price of a longer computation time, by including more and more sequences of occurrence times, and hence more and more different sound sources, i.e. instruments, in the determination of the common period length and of the drum pattern, so that the computation of the combined subsequences becomes correspondingly more costly.

A further form of scalability is to compute a certain number of combined subsequences for a certain number of sound sources and then, depending on what further processing is intended, to post-process the combined subsequences obtained and reduce their expressiveness as required. For example, histogram entries below a certain threshold may be ignored. Histogram entries may also be quantized as such, or binarized by a threshold decision, so that the histogram only states whether or not there is an entry in the combined subsequence at a particular point in time.

Because many subsequences are "merged" into one combined subsequence, the concept of the invention is a robust method. It can nevertheless be carried out efficiently, since it requires no numerically intensive processing steps.

Percussive instruments without pitch, referred to below as drums, play a fundamental role particularly in popular music. Much of the information about rhythm and musical genre is contained in the "notes" played by the drums. This information could be used, for example, for intelligent and intuitive searching in music archives, in order to perform classifications or at least pre-classifications.

The notes played by drums frequently form repetitive patterns, also called drum patterns. A drum pattern can serve as a compressed representation of the notes played, by extracting from a longer note image a note image of the length of the drum pattern. Semantically meaningful information about the whole piece of music, such as genre, tempo, type of meter, or similarity to other pieces, can then be extracted from the drum patterns.

Preferred embodiments of the present invention are described in more detail below with reference to the accompanying drawings.
FIG. 1 shows a block diagram of an apparatus according to the invention for describing the characteristics of a sound signal.
FIG. 2 shows a schematic diagram illustrating the determination of note entry points.
FIG. 3 shows a schematic diagram of a quantization raster and of the quantization of the notes using the raster.
FIG. 4 illustrates an example of a common period length obtained by a statistical period length determination using the entry times of all instruments.
FIG. 5 shows an exemplary pattern histogram as an example of the combined subsequences of the individual sound sources (instruments).
FIG. 6 shows a post-processed pattern histogram as an example of an alternative characteristic of the sound signal.

FIG. 1 shows an apparatus according to the invention for describing the characteristics of a sound signal. First, the apparatus of FIG. 1 includes means 10 for providing a sequence of entry times for each of at least two sound sources over time. Preferably, the entry times are already quantized entry times lying on a quantization raster. While FIG. 2 shows the sequences of note entry times of different sound sources, i.e. instrument 1, instrument 2, ..., instrument n, each entry being marked by an "x", FIG. 3 shows the sequences of quantized entry times of each sound source, i.e. of each instrument 1, 2, ..., n, quantized onto the raster shown in FIG. 3.

FIG. 3 can also be read as a matrix or list of entry times. A column in FIG. 3 corresponds to the distance between two raster points or raster lines and thus represents a time interval in which, depending on the sequence of entry times, a note entry may or may not be present. In the embodiment shown in FIG. 3, for example, the column denoted by reference numeral 30 contains a note entry for instrument 1. The same holds for instrument 2, as indicated by the "x" in the two rows associated with instruments 1 and 2 in FIG. 3. Instrument n, in contrast, has no note entry in the time interval denoted by reference numeral 30.
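A matrix of this kind can be sketched as one row of 0/1 flags per instrument, with one column per raster interval. The dictionary layout, the function name `entry_matrix` and the example values are assumptions made for illustration; they do not reproduce the actual content of FIG. 3.

```python
def entry_matrix(quantized, num_slots):
    """Return one row of 0/1 flags per instrument; column k is raster interval k."""
    matrix = {}
    for instrument, slots in quantized.items():
        row = [0] * num_slots
        for k in slots:
            if 0 <= k < num_slots:   # ignore entries outside the analysed range
                row[k] = 1
        matrix[instrument] = row
    return matrix


grid = entry_matrix(
    {"instrument 1": [0, 4], "instrument 2": [0, 2, 4, 6], "instrument n": [3]},
    num_slots=8,
)
# Column 0 plays the role of the column marked 30 in the text: instruments 1
# and 2 have a note entry there, instrument n does not.
for name, row in grid.items():
    print(name, row)
```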

The several sequences of preferably quantized entry times are supplied from means 10 to means 12 for determining a common period length. Means 12 is designed not to determine an individual period length for each sequence of entry times, but to find the common period length that best underlies the at least two sound sources. This rests on the fact that, even when several percussive instruments play in a piece, for example, they all play more or less the same rhythm, so that there must be a common period length to which virtually all instruments contributing to the sound signal, i.e. all sound sources, adhere.
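The text leaves open which statistical method means 12 uses. One conceivable approach, shown here purely as an assumption, is to sum a normalized autocorrelation over all instrument rows and take the smallest lag with the highest total; `common_period` and `max_lag` are hypothetical names. Exact fractions are used so that ties between a period and its multiples are resolved reliably.

```python
from fractions import Fraction


def common_period(rows, max_lag):
    """Return the smallest lag (in raster slots) with the highest summed autocorrelation."""
    best_lag, best_score = None, Fraction(-1)
    for lag in range(1, max_lag + 1):
        score = Fraction(0)
        for row in rows:
            overlap = len(row) - lag
            if overlap <= 0:
                continue
            hits = sum(row[t] * row[t + lag] for t in range(overlap))
            score += Fraction(hits, overlap)   # normalise by the overlap length
        if score > best_score:                 # strict '>' keeps the smallest lag on ties
            best_lag, best_score = lag, score
    return best_lag


hihat = [1, 0, 1, 0] * 4   # on its own this row repeats every 2 slots
kick = [1, 0, 0, 0] * 4    # this row repeats every 4 slots
period = common_period([hihat, kick], max_lag=8)
print(period)
```

In this example the hi-hat alone would suggest a period of 2 slots, but only a lag of 4 fits both rows, which illustrates why a common rather than an individual period length is sought.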

The common period length is then supplied to means 14 for dividing each sequence of entry times, in order to obtain a set of subsequences for each sound source on the output side.

Considering FIG. 4, for example, it can be seen that a common period length 40 has been found for all instruments 1, 2, ..., n. Means 14 for dividing into subsequences is designed to divide all sequences of entry times into subsequences whose length equals the common period length 40. The sequence of entry times of instrument 1 is thus divided, as shown in FIG. 4, into a first subsequence 41, a following second subsequence 42 and a further following subsequence 43, so that three subsequences are obtained for the sequence of instrument 1 in the example of FIG. 4. In the same way, the sequences of the other instruments 2, ..., n are divided into corresponding adjacent subsequences, as described for the sequence of entry times of instrument 1.
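Dividing one row of the entry matrix into adjacent subsequences can be sketched as follows. The handling of an incomplete trailing chunk (it is dropped) is an assumption of this example, since the text does not address it; the function name and the data are hypothetical.

```python
def split_into_subsequences(row, period):
    """Cut one entry-time row into adjacent chunks of `period` raster slots."""
    if period <= 0:
        raise ValueError("period must be positive")
    usable = len(row) - len(row) % period   # drop an incomplete trailing chunk
    return [row[i:i + period] for i in range(0, usable, period)]


instrument_1 = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
subsequences = split_into_subsequences(instrument_1, period=4)
print(subsequences)   # three subsequences, analogous to 41, 42 and 43
```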

The sets of subsequences of the sound sources are then supplied to means 16 for combining, per sound source, in order to obtain a combined subsequence for the first sound source and a combined subsequence for the second sound source as the characteristic of the sound signal. The combination preferably takes the form of a pattern histogram. The subsequences of the first instrument are aligned one above the other, so that the first interval of each subsequence lies, so to speak, "above" the first interval of every other subsequence. Then, as shown in FIG. 5, the entries in each slot of the combined subsequence, i.e. in each histogram bin of the pattern histogram, are counted. The combined subsequence of the first sound source is thus the first row 50 of the pattern histogram in the example of FIG. 5. For the second sound source, e.g. instrument 2, the combined subsequence is the second row 52 of the pattern histogram, and so on. Overall, the pattern histogram of FIG. 5 thus represents the characteristic of the sound signal, which can then be used for various further purposes.
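Stacking the subsequences and counting the entries per slot amounts to a column-wise sum. The sketch below illustrates this for two sources; the input layout, the function name `pattern_histogram` and the example data are assumptions and do not reproduce the values of FIG. 5.

```python
def pattern_histogram(subsequences_by_source):
    """Sum aligned subsequences slot by slot, giving one histogram row per source."""
    return {
        source: [sum(column) for column in zip(*subsequences)]
        for source, subsequences in subsequences_by_source.items()
    }


histogram = pattern_histogram({
    "instrument 1": [[1, 0, 0, 1], [1, 0, 0, 0], [1, 0, 1, 1]],
    "instrument 2": [[0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 0, 0]],
})
for source, row in histogram.items():
    print(source, row)
```

Each row corresponds to one combined subsequence; a count equal to the number of subsequences means the instrument played at that position in every period.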

The following describes different embodiments of determining the common period length in step 12. The pattern length can be found in different ways: from an a priori criterion that yields a periodicity/pattern-length estimate directly from the available note information, or, preferably, by an iterative search algorithm that tries a number of pattern-length hypotheses and checks their plausibility against the results obtained. This check can again be made by evaluating a pattern histogram, as produced by the means 16 for combining, or by using other self-similarity measures.

The pattern histogram, as shown in FIG. 5, can be generated by the means 16 for combining. The pattern histogram may also take the intensity of the individual notes into account, so that notes are weighted according to their relevance. Alternatively, as in FIG. 5, the histogram may merely record whether a sound is present in a given bin or time slot of a subsequence; in that case no relevance weighting of individual notes enters the histogram.

In a preferred embodiment of the invention, the characteristic shown in FIG. 5, here preferably a pattern histogram, is processed further. Notes can be selected using a criterion such as comparing the count or the combined intensity value with a threshold. This threshold may depend, among other things, on the instrument type or on the flatness of the histogram. The entries of the drum pattern may be Boolean, where “1” means that a note occurred and “0” means that no note occurred. Alternatively, an entry may be a measure of how large the intensity (loudness) or relevance of the notes occurring in this time slot is within the music signal. In FIG. 6, the threshold was chosen so that every time slot or bin with at least three entries is marked with an “x” in the pattern histogram of each instrument, whereas bins with fewer than three entries, e.g. two or one, are discarded.
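
The threshold selection illustrated by FIG. 6 can be sketched in a few lines. The fixed threshold of 3 follows the example in the text; the function name and the sample counts are hypothetical, and an instrument-dependent threshold would simply pass a different value per row.

```python
def select_pattern(histogram, threshold=3):
    """Mark bins whose count reaches the threshold (True), discard the rest."""
    return [[count >= threshold for count in row] for row in histogram]


# Hypothetical pattern histogram with two instruments and four bins.
histogram = [[3, 0, 1, 0],
             [0, 2, 3, 4]]
pattern = select_pattern(histogram)
print(pattern)
```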

According to the invention, a musical “score” is generated from percussive instruments that are characterized by pitch not at all or only weakly. A musical event is defined as the occurrence of a sound of a musical instrument. Preferably, only percussive instruments without substantial pitch are considered. Events are detected in the audio signal and classified by instrument type. The temporal positions of the events are quantized to a quantization raster, also called the tatum grid. In addition, the musical bar, i.e. the bar length in milliseconds or as a number of quantization intervals, is computed, and upbeats are preferably identified as well. Identifying the rhythmic structure from the frequency with which musical events occur at particular positions in the drum pattern allows a robust tempo determination and, when musical background knowledge is also used, gives useful hints for positioning the bar lines.

Note that the score or characteristic preferably contains rhythmic information such as onset time and duration. Estimating the metrical information, i.e. the time signature, is not strictly necessary for automatic analysis of the transcribed music, but it is needed to produce a valid score that human performers can play back. The automatic transcription process can therefore be divided into two tasks: detecting and classifying the musical events, i.e. the notes, as described above, and generating a musical score, i.e. the drum pattern, from the detected notes. For the latter, the metrical structure of the music is preferably estimated, which may include quantizing the temporal positions of the detected notes, detecting upbeats and determining the positions of the bar lines. In particular, the extraction of a musical score for percussive instruments without significant pitch information from polyphonic music audio signals is described. Event detection and classification is preferably performed using independent subspace analysis.

An extension of ICA is given by independent subspace analysis (ISA). Here the components are divided into independent subspaces whose components need not be statistically independent of one another. A transformation of the music signal yields a multidimensional representation of the mixture signal, which is made consistent with the last-mentioned assumption of ICA. Various methods for computing the independent components have been developed in the past. Relevant literature, some of which also deals with audio signal analysis, is:
1. J. Karhunen, “Neural approaches to independent component analysis and source separation”, Proceedings of the European Symposium on Artificial Neural Networks, pp. 249-266, Bruges, 1996.
2. M. A. Casey and A. Westner, “Separation of Mixed Audio Sources by Independent Subspace Analysis”, Proceedings of the International Computer Music Conference, Berlin, 2000.
3. J.-F. Cardoso, “Multidimensional independent component analysis”, Proceedings of ICASSP 1998, Seattle, 1998.
4. A. Hyvärinen, P. O. Hoyer and M. Inki, “Topographic Independent Component Analysis”, Neural Computation 13 (7), pp. 1525-1558, 2001.
5. S. Dubnov, “Extracting Sound Objects by Independent Subspace Analysis”, Proceedings of the AES 22nd International Conference on Virtual, Synthetic and Entertainment Audio, Helsinki, 2002.
6. J.-F. Cardoso and A. Souloumiac, “Blind beamforming for non Gaussian signals”, IEE Proceedings, Vol. 140, No. 6, pp. 362-370, 1993.

An event is defined as the occurrence of a note of a musical instrument, and the onset time of a note is the point in time at which the note occurs in the piece of music. The audio signal is segmented into parts that have similar rhythmic properties. This is done using a distance measure between short frames of the audio signal, each represented by a vector of low-level audio features. The tatum grid and the higher metrical levels are determined separately for each segment, on the assumption that the metrical structure does not change within a segment of the audio signal. The detected events are preferably aligned with the estimated tatum grid. This process corresponds roughly to the quantize function known from conventional MIDI sequencer software for music production. The bar length is estimated from the quantized event list, and repeating rhythmic structures are identified. Knowledge of the rhythmic structures is used to correct the estimated tempo and, together with musical background knowledge, to locate the bar lines.

Preferred implementations of the different components of the invention are described below. The means 10 preferably performs a quantization in order to provide the sequences of entry times for several sound sources. The detected events are preferably quantized to the tatum grid. The tatum grid is estimated from the note entry times of the detected events together with note entry times obtained by conventional note onset detection methods. Generating the tatum grid from the detected percussive events works reliably and robustly. Note that the distance between two raster points usually represents the fastest note played in a piece of music. If, for example, a piece contains sixteenth notes and no notes faster than sixteenth notes, the distance between two raster points of the tatum grid equals the duration of a sixteenth note of the sound signal.

In the general case, the distance between two raster points corresponds to the largest note value from which all occurring note values or temporal period lengths can be represented as integer multiples. The raster distance is thus the greatest common divisor of all occurring note durations, period lengths and so on.
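
The greatest-common-divisor relationship stated above can be illustrated with idealized durations. The sketch assumes exact integer durations in milliseconds, which real onset data will not provide (the estimation procedures described next deal with noisy timing); the function name and the sample values are hypothetical.

```python
from functools import reduce
from math import gcd


def raster_distance(durations_ms):
    """Greatest common divisor of all occurring note durations (in ms)."""
    return reduce(gcd, durations_ms)


# Idealized durations at 120 bpm: quarter, eighth, dotted quarter, sixteenth.
durations = [500, 250, 750, 125]
print(raster_distance(durations))  # every duration is an integer multiple of this
```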

Two alternative approaches for determining the tatum grid are described below. In the first approach, the tatum grid is determined using a two-way mismatch procedure (TWM). A series of trial values for the tatum period, i.e. the distance between two raster points, is derived from a histogram of inter-onset intervals (IOI). The IOI computation is not limited to successive onsets but covers virtually all pairs of onsets within a time frame. Tatum candidates are computed as integer fractions of the most frequent IOI. The candidate that best predicts the harmonic structure of the IOIs according to the two-way mismatch error function is selected. The estimated tatum period is subsequently refined by computing the error function between the comb grid derived from the tatum period and the onset times of the signal. In detail, the IOI histogram is generated and smoothed with an FIR low-pass filter, and tatum candidates are obtained by dividing the IOIs at the histogram peaks by a set of values, e.g. between 1 and 4. After applying the TWM, a rough estimate of the tatum period is derived from the IOI histogram. The phase of the tatum grid and an exact estimate of the tatum period are then computed by applying the TWM between the note entry times and several tatum grids whose periods lie close to the previously estimated tatum period.
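
The first stage of this approach, the IOI histogram over all onset pairs in a time frame and the derivation of tatum candidates, can be sketched as follows. The sketch deliberately omits the FIR smoothing and the two-way mismatch error function itself; the function names, the bin resolution and the sample onsets are assumptions made for illustration.

```python
from collections import Counter


def ioi_histogram(onsets, frame, resolution):
    """Histogram of inter-onset intervals over all onset pairs within `frame`.

    `onsets` must be sorted; intervals are rounded to multiples of `resolution`.
    """
    counts = Counter()
    for i, first in enumerate(onsets):
        for second in onsets[i + 1:]:
            ioi = second - first
            if ioi > frame:
                break  # later onsets are even farther away
            counts[round(ioi / resolution) * resolution] += 1
    return counts


def tatum_candidates(histogram, divisors=(1, 2, 3, 4)):
    """Integer fractions of the most frequent IOI."""
    peak, _ = histogram.most_common(1)[0]
    return [peak / d for d in divisors]


onsets = [0, 250, 500, 750, 1000, 1250]  # hypothetical onset times in ms
hist = ioi_histogram(onsets, frame=500, resolution=10)
candidates = tatum_candidates(hist)
print(hist, candidates)
```

Each candidate would then be scored against the IOIs with the error function, and the best-scoring candidate kept as the rough tatum period.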

In the second approach, the tatum grid is refined by computing the best match between the note entry vector and the tatum grid, using the correlation coefficient R_xy between the note entry vector x and the tatum grid y.
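
One way to read this second approach is as a search over candidate grids for the one with the highest correlation coefficient. The sketch below assumes that the note entry vector and the grid are both sampled as 0/1 vectors on a common time axis and that the grid is a simple comb with a period and a phase; these representations and all names are assumptions, not details given in the text.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation coefficient between two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant vectors
    return cov / sqrt(var_x * var_y)


def comb_grid(length, period, phase):
    """0/1 vector with a 1 at every grid point."""
    return [1 if (i - phase) % period == 0 else 0 for i in range(length)]


def best_grid(onset_vector, periods):
    """Return (R_xy, period, phase) of the best-matching comb grid."""
    best = None
    for period in periods:
        for phase in range(period):
            r = correlation(onset_vector,
                            comb_grid(len(onset_vector), period, phase))
            if best is None or r > best[0]:
                best = (r, period, phase)
    return best


onset_vector = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # hypothetical entries
r, period, phase = best_grid(onset_vector, periods=[3, 4, 5])
print(r, period, phase)
```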

To follow slight tempo changes, the tatum grid is estimated for adjacent frames with a length of, e.g., 2.5 seconds. The transitions between the tatum grids of adjacent frames are smoothed by low-pass filtering the IOI vector of the tatum grid points, and the tatum grid is recovered from the smoothed IOI vector. Each event is then assigned to its nearest grid position, which amounts to a quantization.
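
The final assignment of each event to its nearest grid position can be sketched as below. The grid is assumed to be given as a list of time points in milliseconds; the function name and the sample values are hypothetical.

```python
def quantize(onsets, grid):
    """Map each onset time to the index of its nearest grid point."""
    return [min(range(len(grid)), key=lambda k: abs(grid[k] - t))
            for t in onsets]


grid = [0, 125, 250, 375, 500]   # hypothetical tatum grid points in ms
onsets = [4, 130, 240, 510]      # slightly mistimed events
indices = quantize(onsets, grid)
print(indices)
```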

The score can then be written as a matrix T_ij with i = 1, ..., n and j = 1, ..., m, where n is the number of detected instruments and m is the number of tatum grid elements, i.e. the number of columns of the matrix. The intensity of the detected events can either be discarded or retained, which results in a Boolean matrix or a matrix of intensity values, respectively.
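
A score matrix of this kind can be built from a list of quantized events as sketched below. The event format (instrument name, grid index, intensity), the instrument names and the function name are illustrative assumptions.

```python
def score_matrix(events, instruments, num_slots, use_intensity=False):
    """Build the n x m score matrix from (instrument, slot, intensity) events."""
    row_of = {name: i for i, name in enumerate(instruments)}
    T = [[0] * num_slots for _ in instruments]
    for name, slot, intensity in events:
        # Keep the intensity or reduce the entry to a Boolean 1.
        T[row_of[name]][slot] = intensity if use_intensity else 1
    return T


# Hypothetical quantized events.
events = [("kick", 0, 0.9), ("snare", 2, 0.7), ("kick", 4, 0.8), ("hihat", 1, 0.3)]
T = score_matrix(events, ["kick", "snare", "hihat"], num_slots=6)
print(T)
```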

A specific embodiment of the means 12 for determining the common period length is described below. The quantized representation of the percussive events provides useful information for estimating the musical bar, i.e. the periodicity underlying the playing of the sound sources. The periodicity at the metrical rhythm level is determined, for example, in two stages. First, a periodicity is computed in order to estimate the bar length.

Preferably, the autocorrelation function (ACF) or the average magnitude difference function (AMDF) is used as the periodicity function. These are given by the following equations.
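
For reference, the textbook forms of these two functions for a sequence x_1, ..., x_N and a lag τ are commonly written as below. The symbols x, N and τ and the normalization are assumptions, and the exact form used in the patent may differ.

```latex
\mathrm{ACF}(\tau) = \sum_{i=1}^{N-\tau} x_i \, x_{i+\tau},
\qquad
\mathrm{AMDF}(\tau) = \frac{1}{N-\tau} \sum_{i=1}^{N-\tau} \left| x_i - x_{i+\tau} \right|
```

A periodic sequence produces maxima of the ACF and minima of the AMDF at lags equal to multiples of its period.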

The AMDF is also used for estimating the fundamental frequency of music and speech signals and for estimating the musical bar.

In the general case, a periodicity function measures the similarity or dissimilarity between a signal and a time-shifted version of itself. Various similarity measures are known. One example is the Hamming distance (HD), which computes the dissimilarity between two Boolean vectors B1 and B2 according to the following equation.
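
In its usual definition, the Hamming distance counts the positions at which the two Boolean vectors differ. The element index k is an assumption introduced for this sketch.

```latex
\mathrm{HD}(B_1, B_2) = \sum_{k} \left( B_{1,k} \oplus B_{2,k} \right)
```

Here ⊕ denotes the exclusive OR, so each differing position contributes 1 to the sum.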

A suitable extension for comparing rhythmic structures results from weighting coincident notes and coincident rests differently. The similarity B between two sections T1 and T2 of the score is then computed as a weighted sum of Boolean operations, as given below.
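
One form that is consistent with the description of the weights a, b and c in the following paragraph is sketched below, with all operations applied element by element. It is a reconstruction under that assumption and may differ from the patent's own formula.

```latex
B = a \,(T_1 \wedge T_2) + b \,(\neg T_1 \wedge \neg T_2) + c \,(T_1 \oplus T_2)
```

The first term marks common notes, the second common rests, and the third positions where exactly one of the two sections has a note.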

In the above equation, the weights a, b and c are initially set to a = 1, b = 0.5 and c = 0. Here a weights the occurrence of common notes, b weights the occurrence of common rests, and c weights the occurrence of a difference, i.e. a note occurs in one score section but not in the other. The similarity measure M is obtained by summing the elements of B, as given below.
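
The weighted comparison and the summation into M can be sketched together. The default weights are the initial values stated above; the element-wise reading of B (common note, common rest, difference) is an interpretation of the text, and the function name and the sample sections are hypothetical.

```python
def similarity(T1, T2, a=1.0, b=0.5, c=0.0):
    """Similarity M: sum over all elements of the weighted matrix B."""
    total = 0.0
    for row1, row2 in zip(T1, T2):
        for x, y in zip(row1, row2):
            if x and y:
                total += a      # common note
            elif not x and not y:
                total += b      # common rest
            else:
                total += c      # note in only one of the two sections
    return total


# Two hypothetical Boolean score sections (2 instruments, 4 grid positions).
T1 = [[1, 0, 1, 0], [0, 0, 1, 0]]
T2 = [[1, 0, 0, 0], [0, 1, 1, 0]]
M = similarity(T1, T2)
print(M)
```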

This similarity measure resembles the Hamming distance in that differences between matrix elements are taken into account in a similar way. It is referred to below as the modified Hamming distance (MHD) and is used as the distance measure. In addition, the influence of particular instruments can be controlled with a weighting vector v_i, i = 1, ..., n. The weights can be set using musical background knowledge, e.g. by placing more emphasis on the snare drum or on low-pitched instruments, or according to the frequency and regularity with which the instruments occur.

The similarity measure for Boolean matrices can also be extended to take intensity values into account by weighting B with the mean of T1 and T2. Distances or dissimilarities are regarded as negative similarities. The periodicity function P = f(M, l) is computed by evaluating the similarity measure M between the score T and a version of T shifted by the lag l. The time signature is determined by comparing P with a number of metrical models. The implemented metrical models Q consist of a train of spikes at the typical accent positions for different time signatures and micro times. The micro time is the integer ratio between the duration of the musical counting time, i.e. the note value that determines the musical tempo (e.g. the quarter note), and the duration of the tatum period.
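
The periodicity function built from the similarity measure can be sketched as follows. The sketch compares the score with shifted copies of itself over the overlapping columns and divides by the size of the overlap so that different lags are comparable; this normalization, the function names and the sample score are assumptions, and the comparison with the metrical models Q is not shown.

```python
def similarity(T1, T2, a=1.0, b=0.5, c=0.0):
    """Weighted similarity M between two Boolean score sections."""
    total = 0.0
    for row1, row2 in zip(T1, T2):
        for x, y in zip(row1, row2):
            total += a if (x and y) else b if (not x and not y) else c
    return total


def periodicity(T, max_lag):
    """P(l): similarity between T and T shifted by l grid positions."""
    P = {}
    for lag in range(1, max_lag + 1):
        original = [row[:-lag] for row in T]
        shifted = [row[lag:] for row in T]
        overlap = len(T[0]) - lag
        P[lag] = similarity(original, shifted) / (overlap * len(T))
    return P


# Hypothetical score: two instruments repeating a 4-slot pattern four times.
T = [[1, 0, 0, 0] * 4,
     [0, 0, 1, 0] * 4]
P = periodicity(T, max_lag=7)
best_lag = max(P, key=P.get)
print(best_lag, P[best_lag])
```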

The best match between P and Q is obtained where their correlation coefficient is at its maximum. In the current system, 13 metrical models are implemented for seven different time signatures.

Repeating structures are detected in order, for example, to detect upbeats and to obtain a robust tempo estimate. To detect drum patterns, a score histogram T' with the length of one bar b is obtained from the score T by summing the matrix elements of T that have the same metrical position, according to the following equation.

In the above equation, b denotes the estimated bar length and p the number of bars in T. In the following, T' is referred to as the score histogram or pattern histogram. A drum pattern is obtained from the score histogram T' by searching for the score elements T'_ij with large histogram values. Patterns longer than one bar are obtained by repeating this procedure for integer multiples of the bar length. To obtain a maximally representative pattern as a further characteristic of the sound signal, the pattern length that contains the most notes relative to the pattern length itself is selected.
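
With b as the bar length in grid positions and p as the number of bars, the summation over equal metrical positions referred to above can be written as follows. The index k over the bars is an assumption introduced for this reconstruction, which may differ in notation from the patent's own formula.

```latex
T'_{i,j} = \sum_{k=1}^{p} T_{i,\, j + (k-1)\,b},
\qquad i = 1, \dots, n, \quad j = 1, \dots, b
```

Each element of T' thus counts, for instrument i, how often an event falls on position j of the bar.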

The identified rhythmic patterns are preferably interpreted using a set of rules derived from musical knowledge. Equidistant occurrences of individual instruments are preferably identified and evaluated with reference to the instrument class. This makes it possible to identify playing styles that occur frequently in popular music. One example is the very frequent use of the snare drum, tambourine or hand claps on the second and fourth beat of a 4/4 bar. This concept, called the backbeat, serves as an indicator of the position of the bar lines: if a backbeat pattern is present, a bar starts between two snare drum beats.

A further cue for positioning the bar lines is the occurrence of kick drum events, i.e. events of the large drum that is usually played with the foot.

It is assumed that the beginning of a musical bar is marked by the metrical position at which most kick drum notes occur.
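
Under this assumption, the bar start can be read directly from the kick drum row of the pattern histogram, as in the short sketch below. The function names and the sample counts are hypothetical, and ties between positions are resolved in favor of the earliest one.

```python
def bar_start_position(kick_row):
    """Index of the metrical position with the most kick drum notes."""
    return max(range(len(kick_row)), key=lambda j: kick_row[j])


def rotate_to_bar_start(row, start):
    """Rotate a histogram row so that the bar begins at index 0."""
    return row[start:] + row[:start]


kick_row = [1, 0, 0, 0, 7, 0, 2, 0]  # hypothetical kick drum histogram row
start = bar_start_position(kick_row)
aligned = rotate_to_bar_start(kick_row, start)
print(start, aligned)
```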

A preferred application of the characteristic obtained by the means 16 for combining shown in FIG. 1, as illustrated in FIG. 5 or FIG. 6, is the genre classification of popular music. Various high-level features can be derived from the obtained drum patterns in order to identify typical playing styles. A classification procedure evaluates these features together with information about the musical bar, the tempo (e.g. in beats per minute) and the percussive instruments used. The concept rests on the fact that percussive instruments carry rhythmic information and are often played repetitively. Drum patterns have genre-specific properties and can therefore be used to classify the musical genre.

For this purpose, different playing styles are classified, each of which is associated with individual instruments. One playing style, for example, consists in events occurring only on each quarter note. The instrument associated with this playing style is the kick drum, i.e. the large drum of the drum kit that is played with the foot. This playing style is abbreviated FS.

Another playing style consists in events occurring on the second and fourth quarter note of a 4/4 bar. It is played mainly by the snare drum, tambourine and hand claps, and is abbreviated BS. A further exemplary playing style consists in notes often occurring on the first and third note of a triplet. It is abbreviated SP and is often observed for hi-hat or cymbals.

Playing styles are thus specific to different musical instruments. The first feature FS, for example, is Boolean and is true only if kick drum events occur on every quarter note. For certain features, no Boolean variable is computed; instead a number is determined, such as the ratio between the number of off-beat events and the number of on-beat events played, for example, by a hi-hat, shaker or tambourine.
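
The two kinds of features mentioned above, a Boolean playing-style flag and a numeric ratio, can be sketched on one bar of a Boolean drum pattern. The sketch assumes that quarter notes fall on every grid position whose index is a multiple of the micro time, and it reads FS as "events on every quarter note and nowhere else"; both readings, the function names and the sample rows are assumptions.

```python
def is_fs(kick_row, microtime):
    """True if kick events occur on every quarter note and only there."""
    on_every_quarter = all(kick_row[j]
                           for j in range(0, len(kick_row), microtime))
    none_elsewhere = not any(v for j, v in enumerate(kick_row)
                             if j % microtime != 0)
    return on_every_quarter and none_elsewhere


def offbeat_ratio(row, microtime):
    """Number of off-beat events divided by the number of on-beat events."""
    on = sum(1 for j, v in enumerate(row) if v and j % microtime == 0)
    off = sum(1 for j, v in enumerate(row) if v and j % microtime != 0)
    return off / on if on else float("inf")


# Hypothetical one-bar patterns in 4/4 with a micro time of 2 (8 grid positions).
kick = [1, 0, 1, 0, 1, 0, 1, 0]
hihat = [1, 1, 1, 1, 1, 1, 1, 1]
print(is_fs(kick, 2), offbeat_ratio(hihat, 2))
```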

To obtain further features for genre classification, typical combinations of drum instruments are assigned to one of several drum kit classes, such as rock, jazz, Latin, disco and techno. The drum kit class is not derived from the instrument sounds but from the general occurrence of drum instruments in different pieces belonging to the individual genres. The drum kit class rock, for example, is characterized by the use of kick drum, snare drum, hi-hat and cymbals, whereas the class “Latin” uses bongos, congas, claves and shakers.

A further set of features is derived from the rhythmic properties of the drum score or drum pattern. These features include the musical tempo, the time signature, the micro time and so on. In addition, a measure of the variation in the occurrence of kick drum notes is obtained by counting the number of different IOIs that occur in the drum pattern.

The musical genre is classified from the drum pattern using a rule-based decision network. A possible genre candidate receives a vote when a currently tested hypothesis is satisfied and is “penalized” when aspects of the currently tested hypothesis are not satisfied. This process results in the selection of a favorable combination of features for each genre. The rules for making a sensible decision are derived from observations of representative pieces and from musical knowledge. The values for voting and penalizing are set empirically, taking the robustness of the extraction concept into account. The resulting decision for a particular musical genre is the genre candidate that has collected the maximum number of votes.

For example, the genre disco is recognized if the drum kit class is disco, the tempo lies in the range of 115 to 132 bpm, the time signature is 4/4 and the micro time equals 2. Further features of the genre disco are, for example, the presence of the playing style FS and of one further playing style, namely events occurring at every off-beat position. Similar criteria can be set up for other genres, such as hip-hop, soul/funk, drum and bass, jazz/swing, rock/pop, heavy metal, Latin, waltz, polka/punk or techno.
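
The voting scheme can be sketched as a small rule table. The disco rules follow the example in the text; the single waltz rule, the feature names and the vote and penalty values of 1.0 are hypothetical placeholders, since the text only states that such values are set empirically.

```python
# Each genre has a list of hypotheses that are tested against the features.
RULES = {
    "disco": [
        lambda f: f["drum_kit"] == "disco",
        lambda f: 115 <= f["tempo_bpm"] <= 132,
        lambda f: f["time_signature"] == "4/4",
        lambda f: f["microtime"] == 2,
        lambda f: f["FS"],
    ],
    "waltz": [  # hypothetical rule, for illustration only
        lambda f: f["time_signature"] == "3/4",
    ],
}


def classify(features, vote=1.0, penalty=1.0):
    """Add a vote per satisfied hypothesis, subtract a penalty otherwise."""
    scores = {
        genre: sum(vote if rule(features) else -penalty for rule in rules)
        for genre, rules in RULES.items()
    }
    return max(scores, key=scores.get), scores


features = {"drum_kit": "disco", "tempo_bpm": 124,
            "time_signature": "4/4", "microtime": 2, "FS": True}
genre, scores = classify(features)
print(genre, scores)
```

Because genres with more rules can collect more votes, a practical rule set would need comparable rule counts or normalized scores per genre; the sketch leaves this open.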

Depending on the circumstances, the inventive method for characterizing a sound signal can be implemented in hardware or in software. The implementation can be on a digital storage medium, in particular a floppy disk or CD, with electronically readable control signals that can cooperate with a programmable computer system such that the method is performed. In general, the invention therefore also consists in a computer program product with program code, stored on a machine-readable carrier, for performing the method when the computer program product runs on a computer. In other words, the invention can be realized as a computer program with program code for performing the method when the computer program runs on a computer.

FIG. 1 shows a block diagram of the inventive apparatus for characterizing a sound signal.
FIG. 2 shows a schematic diagram illustrating the determination of note entry points.
FIG. 3 shows a schematic diagram of a quantization raster and of the quantization of notes using the raster.
FIG. 4 shows an example of a common period length obtained by statistically determining the period length using the individual instruments.
FIG. 5 shows an exemplary pattern histogram as an example of combined subsequences for the individual sound sources (instruments).
FIG. 6 shows a post-processed pattern histogram as an example of a further characteristic of the sound signal.

Claims

An apparatus for characterizing a sound signal, comprising:
means (10) for providing a sequence of entry times of sounds for at least one sound source;
means (12) for determining a common period length underlying the at least one sound source, using the at least one sequence of entry times;
means (14) for dividing the at least one sequence of entry times into respective subsequences, wherein the length of a subsequence is equal to the common period length or is derived from the common period length; and
means (16) for combining the subsequences for the at least one sound source into one combined subsequence, wherein the combined subsequence is a characteristic of the sound signal, and wherein the means (16) for combining is configured to generate a histogram as the combined subsequence.

The apparatus according to claim 1, wherein the generating means (10) is configured to generate at least two sequences of entry times for at least two sound sources;
the calculating means (12) is configured to calculate the common period length for the at least two sound sources;
the dividing means (14) is configured to divide the at least two sequences of entry times according to the common period length; and
the combining means (16) is configured to combine the subsequences of the second sound source into a second combined subsequence, wherein the first combined subsequence and the second combined subsequence represent the feature of the sound signal.

The apparatus, wherein the generating means (10) is configured to generate one sequence of quantized entry times for each of the at least two sound sources, wherein the entry times are quantized onto a quantization raster, and wherein the distance between two raster points is equal to the shortest time distance between two sounds in the sound signal or equal to the greatest common divisor of the different time durations of the sounds in the sound signal.
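
The greatest-common-divisor variant of the raster spacing can be sketched as below. The function names `raster_spacing` and `quantize` are hypothetical, and durations are assumed to be available as whole milliseconds so that an integer GCD applies; real durations would first need rounding to such a grid.

```python
from functools import reduce
from math import gcd
from typing import List


def raster_spacing(durations_ms: List[int]) -> int:
    """Return the greatest common divisor of the sound durations (in ms)."""
    if not durations_ms:
        raise ValueError("at least one duration is required")
    return reduce(gcd, durations_ms)


def quantize(entry_times_ms: List[float], spacing_ms: int) -> List[int]:
    """Map each entry time to the index of the nearest raster point."""
    return [round(t / spacing_ms) for t in entry_times_ms]
```

With durations of 500, 250 and 750 ms the spacing comes out as 250 ms, and slightly inexact entry times are snapped to the nearest multiple of that spacing.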

The apparatus according to any one of the preceding claims, wherein the generating means (10) is configured to generate the entry times of percussion instruments but not the entry times of harmonic instruments.

The apparatus according to any one of claims 1 to 4, wherein the generating means (10) is configured to generate a list for each sound source, the list containing, for each raster point of a raster, associated information indicating whether a sound has its entry time at that raster point.
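
One possible shape for such a list is a sequence of flags, one per raster point. The sketch below assumes the entry times have already been quantized to raster indices; the name `onset_list` and the 0/1 encoding are illustrative choices, not taken from the claims.

```python
from typing import List


def onset_list(raster_indices: List[int], length: int) -> List[int]:
    """Build a per-raster-point list: 1 where a sound starts, 0 elsewhere."""
    flags = [0] * length
    for index in raster_indices:
        # Indices outside the raster are ignored rather than raising an error.
        if 0 <= index < length:
            flags[index] = 1
    return flags
```

One such list would be produced per sound source, so that each list can later be divided into subsequences independently.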

The apparatus according to claim 1, wherein the combining means (16) is configured to generate the histogram such that each raster point of the sound raster of the combined subsequence represents one histogram component.

The apparatus according to claim 1 or claim 6, wherein the combining means (16) is configured, when an entry is found in a subsequence of a sound source, to increment the count value of the corresponding component of the histogram, or to increase that count value by adding a measure obtained from the entry, wherein the entry is a measure of the intensity of the sound that has its entry time at that entry.
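
The two accumulation variants named in this claim, plain counting and adding an intensity-derived measure, can be contrasted in a short sketch. The function name `accumulate`, the `weighted` switch, and the convention that an entry of 0 means "no sound starts here" are assumptions made for illustration.

```python
from typing import List


def accumulate(entries: List[float], period: int, weighted: bool = False) -> List[float]:
    """Combine subsequences of length `period` into a histogram.

    entries[i] is the intensity of the sound starting at raster point i,
    or 0 if no sound starts there (an assumed encoding).
    """
    if period <= 0:
        raise ValueError("period must be positive")
    histogram = [0.0] * period
    for index, intensity in enumerate(entries):
        if intensity > 0:
            # Either count the entry once or add its intensity measure.
            histogram[index % period] += intensity if weighted else 1.0
    return histogram
```

Counting treats every entry alike, whereas the weighted variant lets loud sounds contribute more to their histogram component than quiet ones.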

The apparatus according to any one of the preceding claims up to claim 7, wherein the combining means (16) is configured to output as the feature only those values of the subsequences in the first combined subsequence and the second combined subsequence that exceed a threshold.
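
The thresholding described here amounts to a simple filter over the histogram values. In the sketch below, the name `above_threshold` and the choice to return a mapping from component index to value are hypothetical; the claims do not specify an output format or how the threshold is chosen.

```python
from typing import Dict, List


def above_threshold(histogram: List[float], threshold: float) -> Dict[int, float]:
    """Keep only the histogram components whose value exceeds the threshold."""
    return {index: value for index, value in enumerate(histogram) if value > threshold}
```

Applied to each combined subsequence in turn, this leaves only the positions within the period where entries occur often (or strongly) enough to be regarded as part of the pattern.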

The apparatus according to any one of claims 1 to 8, further comprising:
means for extracting a characteristic from the feature of the sound signal; and
means for determining, using the characteristic, a musical genre to which the sound signal belongs.

The apparatus according to claim 9, wherein the determining means is configured to use a rule-based decision network, a pattern recognition means, or a classifier.

The apparatus according to any one of claims 1 to 10, further comprising means for extracting a tempo from the characteristic.

The apparatus according to claim 11, wherein the extracting means is configured to determine the tempo based on the common period length.
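
How a tempo might follow from the common period length can be shown with a small calculation. This sketch rests on an assumption not stated in the claims, namely that the number of beats spanned by one period is known (for instance four beats if the period is one bar in 4/4 time); the function name `tempo_bpm` and its parameters are hypothetical.

```python
def tempo_bpm(period_points: int, raster_seconds: float, beats_per_period: int = 4) -> float:
    """Estimate tempo in beats per minute from the common period length.

    period_points: common period length in raster points.
    raster_seconds: distance between two raster points in seconds.
    beats_per_period: assumed number of beats in one period.
    """
    period_seconds = period_points * raster_seconds
    if period_seconds <= 0:
        raise ValueError("period duration must be positive")
    return 60.0 * beats_per_period / period_seconds
```

For a period of 16 raster points spaced 0.125 s apart, the period lasts 2 s; with four beats assumed per period this corresponds to 120 beats per minute.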

A method for describing the characteristics of a sound signal, comprising:
generating (10) a sequence of entry times of sounds for at least one sound source;
calculating (12) a common period length underlying the at least one sound source, using the at least one sequence of entry times;
dividing (14) the at least one sequence of entry times into respective subsequences, wherein the length of a subsequence is equal to the common period length or is derived from the common period length; and
combining (16) the subsequences of the at least one sound source into one combined subsequence, wherein a histogram is generated as the combined subsequence and the combined subsequence represents a feature of the sound signal.

A computer program having program code for performing the method according to claim 13 when the computer program runs on a computer.