JP2007193222A - Melody input device and musical piece retrieval device

Info

Publication number: JP2007193222A
Application number: JP2006012926A
Authority: JP
Inventors: Shigeru Kafuku; 滋加福
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2006-01-20
Filing date: 2006-01-20
Publication date: 2007-08-02

Abstract

PROBLEM TO BE SOLVED: To provide a melody input device that improves retrieval accuracy by performing musical piece retrieval in consideration of the chain probability of notes or pitches.
SOLUTION: A melody input device (1), which has an input means (4) for inputting a melody and an extraction means (6) for extracting pitch information from the input melody, further includes a database (10) containing note chain information on chains of notes, and a correction means (11) that corrects the pitch information extracted by the extraction means based on the note chain information in the database. Even when a low-probability chain of notes is observed in the input melody, the chain is corrected according to the chain probabilities of correct notes, so that when the device is applied to, for example, a musical piece retrieval device, erroneous retrieval is avoided.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a melody input device and a music search device, for example, a melody input device for inputting a hummed melody and a music search device that searches a database for the musical piece corresponding to a hummed melody.

For example, a conventional technique (hereinafter, Prior Art 1) is known in which a user lets a mobile phone listen to music playing on television or in the street in order to look up information such as the name of that music (see Non-Patent Document 1). However, because the music itself must be played to the phone, Prior Art 1 cannot work without a sound source and cannot be used to search for a musical piece of which the user remembers only part of the melody.

As a prior art that overcomes this drawback (hereinafter, Prior Art 2), a humming input device (for melodies that are hummed or sung under the breath) is known (see Non-Patent Document 2).
In Prior Art 2, when a musical piece is registered in the database, it is divided into many "music fragments", and the user's humming input is matched against the musical pieces in the database one music fragment at a time. The search therefore succeeds whichever part of the song the user hums.

One known method of generating these "music fragments" cuts out a group of a predetermined number n of notes starting at every note in the musical piece (hereinafter, Prior Art 3; see Patent Document 1).
In Prior Art 3, when a musical piece consists of the notes a₁, a₂, a₃, ..., a_p and n = 4, the piece is divided into a first music fragment a₁ to a₄, a second music fragment a₂ to a₅, a third music fragment a₃ to a₆, and so on, up to the final music fragment a_{p-3} to a_p. In other words, the first n notes of the piece are cut out as the first music fragment, and the cut-out position is then shifted one note later each time to generate the second, third, and subsequent music fragments.
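The sliding cut-out described for Prior Art 3 can be sketched as follows. This is only an illustrative sketch: the function name `cut_music_fragments` and the seven-note melody are hypothetical and do not come from Patent Document 1.

```python
def cut_music_fragments(notes, n=4):
    """Return every run of n consecutive notes, moving the start one note at a time."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A melody of p notes yields p - n + 1 fragments (none if p < n).
    return [notes[i:i + n] for i in range(len(notes) - n + 1)]


# Hypothetical melody a1 .. a7 (p = 7), cut with n = 4.
melody = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]
fragments = cut_music_fragments(melody, n=4)
for index, fragment in enumerate(fragments, start=1):
    print(index, fragment)
```

With p = 7 and n = 4 the last fragment runs from a4 to a7, which corresponds to a_{p-3} to a_p in the description.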

As an application of humming input other than a score database, a technique relating to an automatic music transcription device is also known (hereinafter, Prior Art 4; see Patent Document 2).
Prior Art 4 is an automatic transcription device that extracts pitch information and power information from an acoustic signal input by humming or the like at every analysis period, divides the acoustic signal, based on the extracted pitch and power information, into sections that can each be regarded as one note (also called segments), and identifies the pitch of each section (written as "interval" in the cited document) to transcribe the music automatically. Its characteristic feature is that, when identifying the pitch of a section, it first obtains, for each item of pitch information in the section, the distance to a candidate pitch and a weighting coefficient determined by the position within the section, and then identifies the section as the pitch whose sum of products is smallest.
If the weighting coefficients are set small near the start and end of a section, the pitch there has little influence on the sum of products. The unstable pitch in those parts then less often causes the whole section to be identified as an unintended pitch, and pitch identification becomes more accurate.
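The position-weighted identification attributed to Prior Art 4 can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: pitches are expressed as (possibly fractional) MIDI note numbers, the distance is the absolute difference, and the weights and sample values are invented; Patent Document 2 may define these differently.

```python
def identify_pitch(frame_pitches, weights, candidates):
    """Pick the candidate pitch with the smallest weighted sum of distances."""
    def weighted_cost(candidate):
        # Sum of (weight x distance) over all analysis frames in the section.
        return sum(w * abs(p - candidate) for p, w in zip(frame_pitches, weights))
    return min(candidates, key=weighted_cost)


# Hypothetical section: unstable at both ends, steady near MIDI 60 (C4) in between.
frame_pitches = [61.2, 60.3, 60.0, 59.9, 60.1, 58.8]
# Hypothetical weights: small at the start and end of the section.
weights = [0.2, 1.0, 1.0, 1.0, 1.0, 0.2]

identified = identify_pitch(frame_pitches, weights, candidates=range(55, 74))
print(identified)
```

Because the end frames carry a weight of only 0.2, their deviation contributes little to the cost and the section is identified from its stable middle part.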

Non-Patent Document 1: "au's 'Kikasete Kensaku' (Listen and Search)", [online], [retrieved January 12, 2006], Internet <URL: http://plusd.itmedia.co.jp/mobile/articles/0506/22/news009.html>
Non-Patent Document 2: "Music retrieval system by humming", [online], Nippon Telegraph and Telephone Corporation, [retrieved January 12, 2006], Internet <URL: http://www.ntt.co.jp/saiyo/rd/review/2001/pf/10.html>
Patent Document 1: JP 2000-172693 A
Patent Document 2: JP H07-44163 A

As described above, Prior Art 1 requires the music itself to be played, so a sound source is indispensable and it cannot be used to search for music of which only part of the melody is remembered. Prior Arts 2 to 4 accept humming input and therefore do not have this drawback.

However, Prior Arts 2 to 4 all merely search in partial units such as music fragments or segments (sections) of a musical piece, and therefore cannot achieve sufficient search accuracy.

That is, suppose a first musical piece consists of the parts A, B, C, and D (music fragments or segments), a second musical piece consists of the parts a, B, C, and d, and the humming input is A, B. If A is misrecognized as a because of a pitch deviation, noise, or the like, the second piece is hit by mistake where the first piece should have been. The cause of this problem is that matching is performed only in partial units such as music fragments or segments.

Thus the humming input devices of Prior Arts 2 to 4 search using only the acoustic features of the input partial units. They are therefore easily affected by subtle pitch deviations at input time, by ambient noise, or by the low frequency resolution in fast passages of a song, and cannot achieve sufficient search accuracy.
Furthermore, although Prior Art 4 removes the instability at both ends of a segment by weighting according to position in the segment, that alone cannot be expected to be sufficiently effective when the pitch becomes unstable in the middle of a long note (for example, when a sustained note is followed by a grace note).

Accordingly, an object of the present invention is to provide a melody input device with improved recognition accuracy and a music search device with improved search accuracy, by performing music search in consideration of the chain probability of notes or pitches.

The invention according to claim 1 is a melody input device having input means for inputting a melody and extraction means for extracting pitch information from the input melody, characterized by comprising a database containing note chain information on chains of notes, and correction means for correcting the pitch information extracted by the extraction means based on the note chain information in the database.
The invention according to claim 2 is the melody input device according to claim 1, characterized in that the note chain information is created for each musical instrument.
The invention according to claim 3 is the melody input device according to claim 1, characterized in that the note chain information is created for each genre of music.
The invention according to claim 4 is the melody input device according to claim 1, characterized in that the note chain information is created for each composer.
The invention according to claim 5 is the melody input device according to claim 1, characterized in that the note chain information does not take information on note length into account.
The invention according to claim 6 is a music search device characterized by comprising input means for inputting a melody, extraction means for extracting pitch information from the input melody, a database containing note chain information on chains of notes, correction means for correcting the pitch information extracted by the extraction means based on the note chain information in the database, and search means for searching a plurality of musical pieces for a musical piece with similar pitch information using the pitch information corrected by the correction means.
The invention according to claim 7 is the music search device according to claim 6, characterized in that the note chain information is created for each musical instrument.
The invention according to claim 8 is the music search device according to claim 6, characterized in that the note chain information is created for each genre of music.
The invention according to claim 9 is the music search device according to claim 6, characterized in that the note chain information is created for each composer.
The invention according to claim 10 is the music search device according to claim 6, characterized in that the note chain information does not take information on note length into account.

In the present invention, pitch information is extracted from the input melody, and the extracted pitch information is corrected based on the note chain information in the database.
Here, note chain information means data that statistically models the probability with which notes follow one another across many musical pieces. These chain probabilities show certain tendencies depending on the genre of music, the type of instrument, and the composer, but at the least there can be chains of notes that, empirically, never occur or occur only very rarely. When such a chain, that is, a low-probability chain of notes, appears in the input melody, the cause is, in the case of humming for example, fluctuation of the voice, pitch deviation, noise, or the like.
Therefore, if a music search, for example, is performed based on a melody made up of such low-probability note chains, wrong search results may be produced.
In contrast, if pitch information is extracted from the input melody and corrected based on the note chain information in the database, as in the present invention, then even when a low-probability chain of notes appears in the input melody, it is corrected according to the chain probabilities of correct notes, and such wrong search results are not produced.

Embodiments of the present invention are described below with reference to the drawings. The specific details, examples, numerical values, character strings, and other symbols given in the following description are for reference only, to make the idea of the invention clear, and the idea of the invention is plainly not limited by all or any of them. Detailed descriptions of well-known techniques, procedures, architectures, circuit configurations, and the like (hereinafter "well-known matters") are omitted, but this is only to keep the description concise and is not intended to exclude all or any of those well-known matters. Such well-known matters were available to those skilled in the art at the time this application was filed and are therefore naturally included in the following description.

[First Embodiment]
FIG. 1 is a functional block diagram of a humming input device according to the first embodiment. In this figure, the humming input device 1 includes an interface unit 2, a control unit 3, an acoustic signal input unit 4, an acoustic signal storage unit 5, a feature extraction unit 6, a feature storage unit 7, a hypothetical note string output unit 8, a note chain model generation unit 9, a note chain model storage unit 10, an input correction unit 11, and a note string output unit 12.

Before the individual units are described, the note chain model is outlined. The note chain model has the same sense as the acoustic model in speech recognition technology: it is a statistical database that expresses, as N-grams, how notes follow one another.

FIG. 2 is a conceptual diagram showing the relationship between the acoustic model and the language model in speech recognition. In this figure, suppose that the utterance "Nihon keizai" (Japanese economy) has been made, and that the pronunciation of "za" is somewhat ambiguous and might also be heard as "ta". If the acoustic probability of "Nihon keizai" (Japanese economy) is 0.158 and that of "Nihon keitai" (Japanese mobile phone) is 0.165, then "Nihon keitai", which has the higher acoustic probability, is output, and the result is wrong.

The language model, on the other hand, also stores the probabilities with which all words follow one another. For example, 0.065 is stored as the probability that "Nihon" is followed by "keizai", and 0.005 as the probability that "Nihon" is followed by "keitai". The chain probability of 0.005 from "Nihon" to "keitai" is considerably lower than the chain probability of 0.065 from "Nihon" to "keizai", which means that "Nihon" is in general only very rarely followed by "keitai".

In humming input too, a correct recognition result can be obtained by taking such chain probabilities into account. That is, by treating the acoustic probability as the acoustic likelihood and the language probability (chain probability) as the language likelihood and evaluating equation (1) below, a correct recognition result can be obtained for humming input as well.

Corrected likelihood = acoustic likelihood × language likelihood ... (1)

The corrected likelihood for "Nihon keizai" is 0.158 × 0.065 = 0.01027, and the corrected likelihood for "Nihon keitai" is 0.165 × 0.005 = 0.000825. Since 0.01027 > 0.000825, the correct result ("Nihon keizai") is obtained. In other words, "Nihon keizai", which was inferior in acoustic likelihood, is corrected by the probability with which Japanese words follow one another and is output as the correct answer.
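The arithmetic of equation (1) for this speech example can be checked with a few lines of Python. The dictionary layout and variable names are illustrative assumptions; only the four probability values come from the description.

```python
# Acoustic likelihoods and chain (language) likelihoods from the FIG. 2 example.
acoustic = {"Nihon keizai": 0.158, "Nihon keitai": 0.165}
language = {"Nihon keizai": 0.065, "Nihon keitai": 0.005}

# Equation (1): corrected likelihood = acoustic likelihood x language likelihood.
corrected = {phrase: acoustic[phrase] * language[phrase] for phrase in acoustic}

best_by_acoustic = max(acoustic, key=acoustic.get)
best_by_corrected = max(corrected, key=corrected.get)

for phrase, value in corrected.items():
    print(f"{phrase}: {value:.6f}")
print("acoustic only:", best_by_acoustic)
print("after correction:", best_by_corrected)
```

The acoustic likelihood alone selects "Nihon keitai", while the corrected likelihood selects "Nihon keizai".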

Such a language model is obtained by morphologically analyzing a huge volume of Japanese text, such as (electronic data of) newspapers, and dividing the frequency with which each chain of words occurs by the total count. The note chain model used in this embodiment can be regarded as this language model with "words" replaced by "notes".

FIG. 3 is a diagram showing the algorithm for generating (building) the note chain model. As with a language model, generating a note chain model requires a huge amount of score data (score data in MIDI or a similar format), and existing score data such as a J-POP song collection can be used for this purpose.

Suppose that score data for 100,000 songs is at hand. First, the score data is read (step S1), and the melody part of the score data is cut into units (not particularly limited; for example, one bar) (step S3). Next, looping over the number of units (in this example, the number of bars making up the song) (step S4), each cut-out unit is normalized so that its first note becomes a reference note (for example, C4, "do") (step S5). A cut-out unit after normalization is hereinafter called a pattern.

Next, a search is made to see whether the pattern has appeared before (step S6). If it has not (which is always the case at first), it is registered as a new pattern with an appearance count of 1 (step S7). If it has, its appearance count is increased by 1 (step S8). This processing is repeated until no score data remains (in this example, the score data for 100,000 songs) (step S2).
When no score data remains, the appearance count of each pattern is divided by the total count (step S9), and the result is output (step S10). Patterns whose appearance count is at or below a frequency N (for example, N = 1) are regarded as rare patterns, and no result is output for them.
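The flow of steps S1 to S10 can be sketched as below. This is a simplified illustration under several assumptions that the description does not fix: each score is already given as a list of bars, each bar is a list of MIDI note numbers (note lengths are ignored), the reference note C4 is MIDI 60, and the "total count" is taken to be the total number of pattern occurrences. The function names and the tiny data set are hypothetical.

```python
from collections import Counter

REFERENCE_NOTE = 60  # MIDI number of C4, used as the normalization target


def normalize(unit):
    """Transpose a unit so that its first note becomes the reference note (step S5)."""
    shift = REFERENCE_NOTE - unit[0]
    return tuple(note + shift for note in unit)


def build_note_chain_model(scores, rare_count=1):
    """Count normalized patterns over all scores and convert counts to probabilities."""
    counts = Counter()
    for score in scores:                      # loop until no score data remains (S2)
        for unit in score:                    # loop over the cut-out units (S4)
            if unit:
                counts[normalize(unit)] += 1  # register or increment (S6 to S8)
    total = sum(counts.values())
    # Divide by the total (S9) and skip patterns seen rare_count times or fewer (S10).
    return {p: c / total for p, c in counts.items() if c > rare_count}


# Hypothetical two-song corpus, one inner list per bar.
scores = [
    [[60, 62, 64], [67, 69, 71], [60, 60, 67]],
    [[62, 64, 66], [65, 65, 72], [60, 59, 57]],
]
model = build_note_chain_model(scores)
for pattern, probability in model.items():
    print(pattern, round(probability, 3))
```

In this toy corpus the rising three-note pattern occurs three times out of six units, and the pattern that occurs only once is dropped as rare.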

The note chain model is generated in advance in this way.

The note chain model generation unit 9 executes the algorithm of FIG. 3 to generate the note chain model in advance, and the note chain model storage unit 10 stores and holds that note chain model.

The interface unit 2 is an operation input unit with which the user instructs the start, end, and so on of humming input, and the control unit 3 performs overall control of the operation of the humming input device 1.
The acoustic signal input unit 4 includes an acoustic microphone for humming input, an amplifier, an A/D converter, and the like, and the acoustic signal storage unit 5 stores and holds the acoustic signal captured by the acoustic signal input unit 4.

The feature extraction unit 6 extracts the features of the acoustic signal held in the acoustic signal storage unit 5, and the feature storage unit 7 stores and holds the extraction result. The features of the acoustic signal are described later.
The hypothetical note string output unit 8 estimates pitches based on the feature values held in the feature storage unit 7 and outputs hypothetical note strings made up of the estimated pitches.
The input correction unit 11 corrects the hypothetical note strings using the note chain model, and the note string output unit 12 outputs the note string with the largest corrected output value as the correct answer.

FIG. 4 is an operation flowchart of the first embodiment. In this flowchart, the note chain model generation unit 9 first generates a note chain model and stores it in the note chain model storage unit 10 (step S21). As noted above, the note chain model may instead be generated by the note chain model generation unit 9 and stored in the note chain model storage unit 10 beforehand.

Next, humming is input (step S22), converted into an acoustic signal, and stored in the acoustic signal storage unit 5. The humming input is made by the user humming part of the melody of the desired musical piece into the acoustic microphone of the acoustic signal input unit 4. The user may, for example, sing "la la la".

Once the humming has been input, features are extracted from the acoustic signal (step S23). There are various feature extraction methods, which are not described in detail here; in this example, a chroma vector is extracted as the feature.

FIG. 5 is a schematic diagram of a chroma vector. In this figure, the chroma vector is a feature that indicates how much sound is present at the frequency corresponding to each pitch (G3 to B8 in this figure). Specifically, a short-time window is set on the input sound, a spectrogram is obtained by FFT (fast Fourier transform), and the feature is extracted by multiplying the spectrogram by filters that each peak at the center of one band (for example, 262 Hz for middle C (C4)).

FIG. 6 is a schematic diagram of the filters. The vertical axis is the signal level and the horizontal axis is the frequency. The illustrated example shows a first filter marked with squares, a second filter marked with circles, a third filter marked with triangles, and a fourth filter marked with hatching. The desired chroma vector (feature) can be extracted by using a filter with suitable characteristics. For example, to extract the chroma vector for middle C (C4), the third filter, marked with triangles and corresponding to the frequency of C4 (262 Hz), is used.
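The idea of measuring how much sound is present at each pitch frequency can be illustrated with the sketch below. Note that it swaps in a simpler technique than the one described: instead of an FFT followed by band filters, it correlates one short window directly with a sinusoid at each pitch frequency (a single-frequency DFT probe). The sample rate, window length, synthetic test tone, and the assumption of equal temperament with A4 = 440 Hz are all illustrative choices, not values from the description.

```python
import math

SAMPLE_RATE = 8000  # Hz, hypothetical
WINDOW = 800        # samples in one short-time window (0.1 s), hypothetical
NOTE_NAMES = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]


def pitch_hz(midi):
    """Equal-tempered frequency of a MIDI note number, assuming A4 = 440 Hz."""
    return 440.0 * 2 ** ((midi - 69) / 12)


def pitch_name(midi):
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def band_level(samples, freq):
    """Magnitude of the window's correlation with a sinusoid at freq."""
    step = 2 * math.pi * freq / SAMPLE_RATE
    re = sum(x * math.cos(step * n) for n, x in enumerate(samples))
    im = sum(x * math.sin(step * n) for n, x in enumerate(samples))
    return math.hypot(re, im)


# Synthetic stand-in for one window of hummed input: a pure C#4 tone (MIDI 61).
tone = [math.sin(2 * math.pi * pitch_hz(61) * n / SAMPLE_RATE) for n in range(WINDOW)]

# One level per pitch from G3 (MIDI 55) to C#5 (MIDI 73).
levels = {pitch_name(m): band_level(tone, pitch_hz(m)) for m in range(55, 74)}
strongest = max(levels, key=levels.get)
print(strongest)
```

Each entry of `levels` plays the role of one component of the per-window feature; a real implementation along the lines of the description would derive these values from an FFT spectrogram and the band filters of FIG. 6.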

FIG. 7 shows an example of humming input, and FIG. 8 is a frequency distribution diagram of that humming input. The humming input in FIG. 7 is C4 D4 E4 C4 D4, but it is assumed to have been sung slightly off pitch.

In FIG. 8, the vertical axis is the pitch, which becomes lower toward the top and higher toward the bottom. The number written to the right of each pitch is the peak frequency for that pitch. Specifically, from the top, the frequency peak of pitch G3 is 196 Hz, that of pitch Ab3 is 208 Hz, that of pitch A3 is 220 Hz, and so on, down to the frequency peak of pitch C#5 at 554 Hz.
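The frequency values quoted for FIG. 8 agree with twelve-tone equal temperament tuned to A4 = 440 Hz, rounded to the nearest hertz. That tuning is an assumption made here, since the description does not name it; under that assumption the full column from G3 to C#5 can be regenerated as follows (MIDI note numbers are used only as a convenient index).

```python
NOTE_NAMES = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]


def midi_to_name(midi):
    """Name of a MIDI note number, e.g. 60 -> 'C4'."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def midi_to_hz(midi, a4=440.0):
    """Equal-tempered frequency, assuming A4 (MIDI 69) is tuned to a4 Hz."""
    return a4 * 2 ** ((midi - 69) / 12)


# Pitches from G3 (MIDI 55) to C#5 (MIDI 73), in the top-to-bottom order of FIG. 8.
table = [(midi_to_name(m), round(midi_to_hz(m))) for m in range(55, 74)]
for name, hz in table:
    print(f"{name:>4} {hz} Hz")
```

The list contains 19 pitches, which matches the 19 level digits given per time scale value in the description of FIG. 8.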

The horizontal axis is time, and the top row of the figure shows the time scale values (0, 1, 2, 3, ..., 53) that index this time axis. The maximum time scale value is shown as 53, but this is only for convenience; the actual maximum corresponds to the length of the humming input.

Many numbers appear within the area enclosed by broken line 13 in the figure. These numbers represent the level of each frequency component of the hummed acoustic signal (the spectrogram). For example, at time scale value 0 the values "0000001000000000000" are arranged from top to bottom. This shows that, for the first sound of the humming input (the sound at time scale value 0), the levels at 196 Hz to 262 Hz are 0, the level at 277 Hz is 1, and the levels at 294 Hz to 554 Hz are 0. In other words, only the level at 277 Hz is 1 and the levels at all other frequencies are 0. Since the pitch at 277 Hz is C#4, the sound at time scale value 0 had the pitch C#4.

From the frequency distribution of the humming input in FIG. 8, the feature extraction unit 6 extracts, for each time scale value, the component with the highest level as the feature of the acoustic signal. For example:
  time scale value 0: level 1 at 277 Hz (pitch C#4)
  time scale value 1: level 1 at 262 Hz and 277 Hz (pitches C4 and C#4)
  time scale value 2: level 3 at 262 Hz (pitch C4)
  time scale value 3: level 5 at 277 Hz (pitch C#4)
  time scale value 4: level 6 at 262 Hz (pitch C4)
  time scale value 5: level 7 at 262 Hz (pitch C4)
  time scale value 6: level 7 at 262 Hz (pitch C4)
  time scale value 7: level 6 at 262 Hz (pitch C4)
  time scale value 8: level 7 at 262 Hz (pitch C4)
  time scale value 9: level 6 at 262 Hz and 277 Hz (pitches C4 and C#4)
  ...
  time scale value 53: level 1 at 294 Hz (pitch D4)
Feature extraction (feature vector extraction) is thus performed for each time scale value, as shown by the cells with a black background in the figure.
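The per-time-step selection of the strongest component can be sketched as follows. The level values for time scale values 0 to 9 are taken from the per-pitch sums given later in the description of FIG. 10, on the assumption that those summands are listed in time order; only the six pitches quoted there are included, and the remaining rows of FIG. 8 are assumed to be no larger. The function name is hypothetical.

```python
# Levels per pitch for time scale values 0..9 (assumed reading of FIG. 8 and FIG. 10).
levels = {
    "Bb3": [0, 0, 0, 1, 0, 0, 1, 1, 0, 0],
    "B3":  [0, 0, 1, 4, 1, 2, 1, 2, 1, 4],
    "C4":  [0, 1, 3, 4, 6, 7, 7, 6, 7, 6],
    "C#4": [1, 1, 2, 5, 1, 1, 1, 2, 1, 6],
    "D4":  [0, 0, 1, 2, 0, 1, 1, 0, 0, 2],
    "Eb4": [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
}


def strongest_pitches(levels, t):
    """Return the highest level at time step t and every pitch that reaches it."""
    peak = max(row[t] for row in levels.values())
    return peak, [name for name, row in levels.items() if row[t] == peak]


features = [strongest_pitches(levels, t) for t in range(10)]
for t, (peak, names) in enumerate(features):
    print(t, peak, names)
```

Ties are kept rather than broken, which reproduces the two-pitch entries at time scale values 1 and 9.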

Once the features of the acoustic signal have been extracted, hypothetical note strings are generated from those features and output (step S24). To generate the hypothetical note strings, points where the feature value changes greatly are extracted from the feature vector sequence obtained in FIG. 8, for example from local maxima of the amount of change in the feature vector, and those points are taken as note change candidates.

FIG. 9 is a schematic diagram of the search for note change candidates. It resembles FIG. 8 above, but differs in that vertical dividing lines L1 to L6 are drawn at several places along the time axis. The positions of these dividing lines L1 to L6 are the points where the feature value changes sharply (the note change candidates).
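One way to picture the search for note change candidates is the sketch below. It assumes details the description leaves open: the amount of change is measured as the sum of absolute level differences between consecutive feature vectors, and `min_change` is a hypothetical threshold; the function name is also illustrative.

```python
def note_change_candidates(frames, min_change=1):
    """Return indices of time steps that begin a new note candidate.

    frames is a list of feature vectors (one list of per-pitch levels per time step).
    """
    # Amount of change between consecutive time steps (L1 distance, an assumption).
    change = [
        sum(abs(a - b) for a, b in zip(prev, cur))
        for prev, cur in zip(frames, frames[1:])
    ]
    candidates = []
    for i, c in enumerate(change):
        left = change[i - 1] if i > 0 else 0
        right = change[i + 1] if i + 1 < len(change) else 0
        # A local maximum of the change amount marks a candidate boundary.
        if c >= min_change and c > left and c >= right:
            candidates.append(i + 1)  # the boundary sits just before frame i + 1
    return candidates


# Two pitches; the dominant pitch switches at steps 2 and 5.
frames = [[0, 5], [0, 5], [5, 0], [5, 0], [5, 0], [0, 5]]
boundaries = note_change_candidates(frames)
print(boundaries)
```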

Next, each time interval (the spacing between dividing lines; hereinafter, the frame length) is compared against the average frame length defined in advance for each note type (quarter note, eighth note, and so on) to estimate the note type.
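The comparison against predefined average frame lengths could look like the following minimal sketch. The nearest-length rule, the function name, and the average lengths in `AVERAGE_FRAME_LENGTH` are all assumptions made for illustration; the description does not state them.

```python
# Hypothetical average frame lengths, in time scale units, per note type.
AVERAGE_FRAME_LENGTH = {
    "sixteenth": 5,
    "eighth": 10,
    "quarter": 20,
}


def estimate_note_type(frame_length, average_lengths=AVERAGE_FRAME_LENGTH):
    """Pick the note type whose average frame length is closest to frame_length."""
    return min(average_lengths, key=lambda name: abs(average_lengths[name] - frame_length))


print(estimate_note_type(11))  # closest to the assumed eighth-note length
print(estimate_note_type(18))  # closest to the assumed quarter-note length
```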

If the interval is estimated to be a quarter note or an eighth note, the candidate positions are refined down to the sixteenth-note level (see dotted lines L7 to L13 in FIG. 9) before the pitch is estimated.

FIG. 10 is a conceptual diagram of hypothetical note string output for a single frame. As shown in the figure, the chroma vector values are summed within the range, and every candidate within the pruning threshold (for example, 10) of the largest value (the acoustic accuracy; here, accuracy 47 for a C4 eighth note) is output. The right-hand side of FIG. 10 shows several hypothetical note strings output in this way, here "C4", "C#4 and C4", and "B3 and C4".

Specifically, the frame shown in FIG. 10(a) is divided at the sixteenth-note-level dividing line L7 into a front frame F and a rear frame B, and, as shown in FIG. 10(b), the levels are summed for each pitch (frequency) in each of the two frames. For the front frame F this gives 0+0+0+1+0 = 1 for pitch Bb3 (156 Hz), 0+0+1+4+1 = 6 for pitch B3 (165 Hz), 0+1+3+4+6 = 14 for pitch C4 (262 Hz), 1+1+2+5+1 = 10 for pitch C#4 (277 Hz), 0+0+1+2+0 = 3 for pitch D4 (294 Hz), and 0+0+0+1+0 = 1 for pitch Eb4 (311 Hz). For the rear frame B it gives 0+1+1+0+0 = 2 for pitch Bb3 (156 Hz), 2+1+2+1+4 = 10 for pitch B3 (165 Hz), 7+7+6+7+6 = 33 for pitch C4 (262 Hz), 1+1+2+1+6 = 11 for pitch C#4 (277 Hz), 1+1+0+0+2 = 4 for pitch D4 (294 Hz), and 0+0+0+0+1 = 1 for pitch Eb4 (311 Hz).

Then the maximum sum 33 of the rear frame B is added to the maximum sum 14 of the front frame F to give 47; it is added to the second-largest sum 10 of the front frame F to give 43; and it is added to the third-largest sum 6 of the front frame F to give 39.
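The summation and pruning for this one frame can be reproduced with a short sketch. The per-pitch levels are the ones listed above for the front frame F and rear frame B; the function name `frame_hypotheses` is hypothetical, and scoring every front/rear pitch pair (rather than only pairs that include the rear-frame maximum) is an assumption that happens to give the same survivors here.

```python
from itertools import product

front = {
    "Bb3": [0, 0, 0, 1, 0], "B3": [0, 0, 1, 4, 1], "C4": [0, 1, 3, 4, 6],
    "C#4": [1, 1, 2, 5, 1], "D4": [0, 0, 1, 2, 0], "Eb4": [0, 0, 0, 1, 0],
}
rear = {
    "Bb3": [0, 1, 1, 0, 0], "B3": [2, 1, 2, 1, 4], "C4": [7, 7, 6, 7, 6],
    "C#4": [1, 1, 2, 1, 6], "D4": [1, 1, 0, 0, 2], "Eb4": [0, 0, 0, 0, 1],
}


def frame_hypotheses(front, rear, prune=10):
    """Score each (front pitch, rear pitch) pair and keep those within `prune` of the best."""
    front_sum = {p: sum(v) for p, v in front.items()}
    rear_sum = {p: sum(v) for p, v in rear.items()}
    scored = {(f, r): front_sum[f] + rear_sum[r] for f, r in product(front_sum, rear_sum)}
    best = max(scored.values())
    kept = [(pair, score) for pair, score in scored.items() if best - score <= prune]
    return sorted(kept, key=lambda item: item[1], reverse=True)


hypotheses = frame_hypotheses(front, rear)
for pair, score in hypotheses:
    print(pair, score)
# ('C4', 'C4') 47
# ('C#4', 'C4') 43
# ('B3', 'C4') 39
```

The pair D4 followed by C4 scores 3 + 33 = 36, which is 11 below the best score and is therefore pruned with a threshold of 10.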

The same processing is applied to every section, so that hypothetical note strings are output for the entire humming input.

FIG. 11 shows the hypothetical note strings output from the entire humming input. In this figure, "C4C4", "C#4C4", and "B3C4" are output as the hypothetical note strings for the frame between L1 and L7; "D4D4" for the frame between L2 and L8; "Eb4Eb4", "Eb4E4", "Eb4D4", "E4Eb4", "E4E4", and "E4D4" for the frame between L3 and L9; "C4C4", "B3C4", and "B3B3" for the frame between L4 and L10; and "D4D4D4Eb4", "D4D4D4D4", "D4D4D4C#4", "D4D4Eb4Eb4", "D4D4Eb4D4", "D4D4E#4Eb4", "D4D4C#4D4", "D4D4D4E4", "D4D4Eb4C#4", and "D4D4C#4C#4" for the frame between L5 and L11.

Note that when notes of the same pitch continue, it is difficult to tell whether they form one eighth note or two sixteenth notes, so all output is expressed in sixteenth-note units.

Once the hypothetical note strings have been output in this way, correction processing using the note chain model is executed (step S25), and the note string with the largest corrected output value is then output as the correct answer (step S26).

FIG. 12 shows the correction processing using the note chain model. In this processing, a pattern search is performed first (step S25a). If a matching pattern exists ("YES" in step S25b), the probability given by the note chain model is set to the chain probability of the matched pattern (step S25c). If no pattern exists ("NO" in step S25b), the probability given by the note chain model is set to α, a predetermined value (step S25d). In either case, "output value = acoustic accuracy × probability given by the note chain model" is computed (step S25e), and processing then returns to the flow of FIG. 4.

FIG. 13 shows an example of the results of the computation in step S25e. The left side of the figure shows the outputs obtained acoustically in the hypothetical note string output (step S24 in FIG. 4). Here they are labeled ア through カ, redrawn in staff notation for readability, and sorted in descending order of accuracy.

Each of these candidates (ア through カ) is multiplied by its probability value from the note chain model. Suppose, for example, that the probability value is 0.0045 for ア, 0.001 for イ through エ, 0.00637 for オ, ..., and 0.00701 for カ.

The probability value of the note chain model expresses how notes chain together; in this case it is the probability of a chain of notes spanning the length of three quarter notes. Naturally, a melody progression that commonly occurs has a high probability, whereas a melody progression whose frequency did not exceed the threshold is not registered in the model. In that case the probability value is floored in step S25d of FIG. 12 and is therefore fixed at α (here, α = 0.001).

Accordingly, in the illustrated example, the corrected output value is 272 × 0.0045 = 1.224 for ア, 272 × 0.001 = 0.272 for イ, 271 × 0.001 = 0.271 for ウ and エ, 271 × 0.00637 = 1.72627 for オ, ..., and 266 × 0.00701 = 1.86466 for カ. The note string with the largest corrected output value, namely カ, "C4C4D4D4E4E4C4C4D4D4D4", is therefore output as the correct answer.
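The flow of steps S25a to S25e and the selection in step S26 can be sketched with the figures quoted above. This is an illustration, not the implementation: the model is keyed by the candidate labels of FIG. 13 as stand-ins for the actual note patterns, candidates イ through エ are treated as unregistered patterns because their probability equals α, and the names `corrected_output` and `chain_model` are hypothetical.

```python
ALPHA = 0.001  # floor value applied when no pattern is registered (step S25d)

# Chain probabilities of registered patterns, keyed by candidate label (an assumption).
chain_model = {"ア": 0.0045, "オ": 0.00637, "カ": 0.00701}

# (label, acoustic accuracy), already sorted by accuracy as in FIG. 13.
candidates = [("ア", 272), ("イ", 272), ("ウ", 271), ("エ", 271), ("オ", 271), ("カ", 266)]


def corrected_output(label, accuracy, model, alpha=ALPHA):
    """Steps S25a to S25e: look up the pattern, fall back to alpha, then multiply."""
    probability = model.get(label, alpha)
    return accuracy * probability


scores = {label: corrected_output(label, acc, chain_model) for label, acc in candidates}
best_label = max(scores, key=scores.get)  # step S26: largest corrected output value
best_value = scores[best_label]
print(best_label, round(best_value, 5))
```

Although ア has the highest acoustic accuracy, the larger chain probability of カ gives it the largest corrected output value, so the acoustic ranking is overturned.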

As described above, the first embodiment uses a note chain model to correct an unstable melody input by humming, and can thereby avoid erroneous searches caused by melodies that do not generally occur, that is, melodies with a low note chain probability. By contrast, Conventional Techniques 2 to 4 described above all search in partial units such as musical fragments or segments (sections) of a piece. Consequently, if a fragment or segment of the hummed acoustic signal suffers from problems such as pitch deviation, noise, or insufficient frequency resolution in fast passages, the music search proceeds with those problems uncorrected and may return the wrong piece. This is because matching is performed only in partial units such as fragments or segments.

In the present embodiment, by contrast, the hummed melody is corrected using the note chain model, so that even if an unstable melody is hummed, it is always corrected to a melody with a high chain probability, that is, a melody that can generally occur, and the accuracy of the subsequent music search is improved.

For simplicity, the first embodiment has been described for the case where the starting note of the hypothetical note string matches the starting note of the note chain model. If the two do not match, the pitch of either the starting note of the hypothetical note string or the starting note of the note chain model is shifted so that the starting notes coincide, and the correction processing is then performed. In addition, although the product of the acoustic likelihood and the language likelihood was used to obtain the corrected likelihood, a weighting coefficient may be introduced so that equation (1) above is modified into the following equation (2).
Corrected likelihood = acoustic likelihood + weighting coefficient × language likelihood ... (2)

When building the note chain model, a separate model may be built, for example, for each instrument (piano, guitar, bass, trumpet, and so on), for each musical genre (rock, waltz, Japanese folk song, and so on), or for each composer or singer. This is preferable because it reduces the size of each note chain model, which shortens model construction time and reduces storage requirements. Moreover, building a note chain model per instrument, per genre, or per composer makes it possible to model the note chain characteristics specific to each, which improves recognition accuracy.

Further, in the first embodiment described above, the model takes note length into account and includes the note type (quarter note, eighth note, and so on), but the model may instead be built more simply by ignoring note length.
Further, although the first embodiment is applied to humming input, the input is not limited to the human voice and may come from a musical instrument or another sound source.

Further, although the first embodiment takes single-note input as its example, chord input from a musical instrument or the like is also possible, and the note chain model may likewise be built to take chords into account.
In the first embodiment, both the input and the note chain model span three quarter notes, but this is only a representative example and need not be followed. For example, when the input is longer, the correction processing using the note chain model of the first embodiment (see FIG. 12) may be repeated while shifting the input by n notes at a time, and the candidate with the largest total may be taken as the correct answer.

[Second Embodiment]
FIG. 14 is a functional block diagram of the music search device according to the second embodiment. In this figure, the music search device 1a has in common with the humming input device 1 of the first embodiment an interface unit 2, a control unit 3, an acoustic signal input unit 4, an acoustic signal storage unit 5, a feature extraction unit 6, a feature storage unit 7, a hypothetical note string output unit 8, a note chain model generation unit 9, a note chain model storage unit 10, an input correction unit 11, and a note string output unit 12. It differs from the humming input device 1 of the first embodiment in having a music search unit 14 made up of a music database (abbreviated DB in the figure) 15 and a comparison unit 16.

For the components shared with the humming input device 1 of the first embodiment, namely the interface unit 2, control unit 3, acoustic signal input unit 4, acoustic signal storage unit 5, feature extraction unit 6, feature storage unit 7, hypothetical note string output unit 8, note chain model generation unit 9, note chain model storage unit 10, input correction unit 11, and note string output unit 12, refer to the description of the first embodiment.

The music database 15 stores, in advance, music information for the pieces to be searched. This music information includes note string information for each piece.

FIG. 15 shows the operation flowchart of the second embodiment. In this flowchart, steps S21 to S25 and step S26 are the same as in the operation of the humming input device 1 of the first embodiment (see FIG. 4). The flowchart differs in that, between step S25 and step S26, it performs sort processing (step S31), a loop over the number of pieces in the music database 15 (step S32), reading of the reference data for one piece from the music database 15 (step S33), a loop over the number of corrected candidates (step S34), and similarity calculation (step S35).

That is, the comparison unit 16 first takes in the corrected outputs from the input correction unit 11 (those corrected using the note chain probabilities), sorts them, and holds as candidates those within a threshold β of the largest value (M note strings and their corrected output values).

Next, the note string for one piece is read from the music database 15, and each of the M held note strings is compared with it to calculate a similarity; this is repeated for each corrected output. When the similarity calculation against that piece's note string is complete, the next piece is read from the music database 15 and the same operation is repeated. Once the similarity has been calculated for every piece, the piece with the highest similarity is output as the final result.

FIG. 16 is a conceptual diagram of the similarity calculation. As the similarity evaluation value, for each sixteenth-note length, suppose for example that 3 points are awarded when the pitches match exactly, 2 points for a deviation of a semitone, 1 point for a deviation of a whole tone, and 0 points for any larger deviation. The similarity is obtained by the following equation (3).

Similarity = corrected output value × degree of match ... (3)

For example, let the hummed input pattern be "C4C4D4D4E4E4C4C4D4D4D4D4" and the note string pattern of the piece being compared be C4 "D4" D4 D4 E4 E4 "C#4" C4 D4 D4 "F4" D4. The three notes in quotation marks ("D4", "C#4", and "F4") are those that do not match exactly. Here, "D4" deviates from C4 by a whole tone, "C#4" deviates from C4 by a semitone, and "F4" deviates from D4 by more than a whole tone, so the degree of match used in equation (3) is 30.
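The point scheme and the worked example above can be checked with a short sketch. The note-name parser and the function names `to_semitones` and `match_degree` are hypothetical helpers introduced for illustration; a whole tone is taken as two semitones.

```python
SEMITONE_OF_LETTER = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
POINTS = {0: 3, 1: 2, 2: 1}  # exact match, semitone off, whole tone off


def to_semitones(note):
    """Convert a name such as 'C4', 'C#4' or 'Eb4' to an absolute semitone number."""
    letter, rest = note[0], note[1:]
    offset = 0
    if rest[0] == "#":
        offset, rest = 1, rest[1:]
    elif rest[0] == "b":
        offset, rest = -1, rest[1:]
    return 12 * int(rest) + SEMITONE_OF_LETTER[letter] + offset


def match_degree(hummed, reference):
    """Sum the points over aligned sixteenth-note positions."""
    return sum(
        POINTS.get(abs(to_semitones(h) - to_semitones(r)), 0)
        for h, r in zip(hummed, reference)
    )


hummed = "C4 C4 D4 D4 E4 E4 C4 C4 D4 D4 D4 D4".split()
reference = "C4 D4 D4 D4 E4 E4 C#4 C4 D4 D4 F4 D4".split()
degree = match_degree(hummed, reference)
print(degree)  # 9 exact matches (27) + whole tone (1) + semitone (2) + larger (0) = 30
```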

Once the similarity to every piece has been calculated in this way, a list of one or more pieces is output as the search result, in descending order of similarity.
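The overall search loop of the comparison unit (sorting, keeping candidates within β of the best, looping over pieces, and ranking by equation (3)) might be sketched as below. Several choices here are assumptions rather than statements from the description: a piece is scored by its best candidate, the degree-of-match function is passed in as a parameter, and the example uses a simplified exact-match scorer and made-up pieces.

```python
def search_pieces(candidates, pieces, beta, match_degree):
    """Rank pieces by similarity = corrected output value x degree of match.

    candidates: list of (note_string, corrected_output_value)
    pieces: dict mapping a title to its reference note string
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    top_value = ranked[0][1]
    kept = [c for c in ranked if top_value - c[1] <= beta]  # the M held candidates

    results = []
    for title, reference in pieces.items():      # loop over pieces in the database
        best = max(output * match_degree(notes, reference) for notes, output in kept)
        results.append((title, best))             # piece score: best candidate (assumption)
    return sorted(results, key=lambda r: r[1], reverse=True)


def exact_match_points(hummed, reference):
    """Simplified scorer for the example: 3 points per exactly matching position."""
    return sum(3 for h, r in zip(hummed, reference) if h == r)


# Hypothetical data for illustration only.
candidates = [(("C4", "D4", "E4"), 1.8), (("C4", "D4", "F4"), 1.2), (("B3", "B3", "B3"), 0.2)]
pieces = {"piece A": ("C4", "D4", "E4"), "piece B": ("G4", "G4", "G4")}

ranking = search_pieces(candidates, pieces, beta=1.0, match_degree=exact_match_points)
print(ranking)
```

Returning the full ranking rather than only the top entry matches the option of outputting a list of several pieces.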

As described above, the second embodiment performs the music search using only note strings that have been corrected with the note chain probabilities, so the correct piece can be found even from humming whose pitch is somewhat off or that has been affected by noise.

Although the second embodiment performs the music search using only the note strings corrected with the note chain probabilities, the similarity may also be obtained for uncorrected note strings and used in the similarity calculation.

Also, although the music database 15 is part of the music search device 1a in the second embodiment, the invention is not limited to this arrangement. For example, the device may be configured to refer to one or more music databases located remotely and reached via a telephone line, the Internet, or the like. In that case, the comparison unit 16 may also be installed remotely, in the same way as the music database.

In each of the embodiments above, correction is based on the chain probability of "notes", but this is not a limitation; "note" may be read as "pitch".

FIG. 1 is a functional block diagram of the humming input device in the first embodiment.
FIG. 2 is a conceptual diagram showing the relationship between the acoustic model and the language model in speech recognition.
FIG. 3 is a diagram showing the algorithm for generating (building) the note chain model.
FIG. 4 is a diagram showing the operation flowchart of the first embodiment.
FIG. 5 is a schematic diagram showing a chroma vector.
FIG. 6 is a schematic diagram of the filter.
FIG. 7 is a diagram showing an example of humming input.
FIG. 8 is a frequency distribution diagram of the humming input.
FIG. 9 is a schematic diagram of the search for note change candidates.
FIG. 10 is a conceptual diagram of hypothetical note string output for a single frame.
FIG. 11 is a diagram showing the hypothetical note strings output from the entire humming input.
FIG. 12 is a diagram showing the correction processing using the note chain model.
FIG. 13 is a diagram showing an example of the results of the computation in step S25e.
FIG. 14 is a functional block diagram of the music search device in the second embodiment.
FIG. 15 is a diagram showing the operation flowchart of the second embodiment.
FIG. 16 is a conceptual diagram of the similarity calculation.

Explanation of symbols

1 Humming input device (melody input device)
1a Music search device
4 Acoustic signal input unit (input means)
6 Feature extraction unit (extraction means)
10 Note chain model storage unit (database)
11 Input correction unit (correction means)
14 Music search unit (search means)

Claims

A melody input device comprising input means for inputting a melody and extraction means for extracting pitch information from the input melody, the melody input device further comprising:
a database containing note chain information relating to chains of notes; and
correction means for correcting the pitch information extracted by the extraction means on the basis of the note chain information in the database.

The melody input device according to claim 1, wherein the note chain information is created for each musical instrument.

The melody input device according to claim 1, wherein the note chain information is created for each genre of music.

The melody input device according to claim 1, wherein the note chain information is created for each composer.

The melody input device according to claim 1, wherein the note chain information does not take into account information related to a sound length.

A music search apparatus comprising:
input means for inputting a melody;
extraction means for extracting pitch information from the input melody;
a database containing note chain information relating to chains of notes;
correction means for correcting the pitch information extracted by the extraction means on the basis of the note chain information in the database; and
search means for searching a plurality of pieces of music for music having similar pitch information, using the pitch information corrected by the correction means.

The music search apparatus according to claim 6, wherein the note chain information is created for each musical instrument.

The music search apparatus according to claim 6, wherein the note chain information is created for each genre of music.

The music search apparatus according to claim 6, wherein the note chain information is created for each composer.

The music search apparatus according to claim 6, wherein the note chain information does not take into account information related to note length.