JP2017129787A

JP2017129787A - Scoring device

Info

Publication number: JP2017129787A
Application number: JP2016009952A
Authority: JP
Inventors: 隆一成山; Ryuichi Nariyama; 信敏上杉; Nobutoshi Uesugi
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2016-01-21
Filing date: 2016-01-21
Publication date: 2017-07-27

Abstract

PROBLEM TO BE SOLVED: To provide a scoring device capable of improving the accuracy in scoring.

SOLUTION: A scoring function 100 includes: an input sound acquisition part 101 that acquires an input sound; a feature quantity detection part 103 that sequentially detects feature quantities, including pitches, based on the input sound acquired by the input sound acquisition part; a period identification part 107 that identifies the period corresponding to each sound constituting the input sound based on the feature quantities detected by the feature quantity detection part; and a pitch sorting part 105 that sorts the respective sounds into any one of plural reference pitches based on the pitch.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for transcribing sound.

A transcription device that transcribes sung notes based on an input voice is known. For example, Patent Document 1 discloses a voice processing device that obtains time-series changes in sound pressure and in pitch from an input voice and specifies note periods based on the obtained changes in sound pressure and pitch.

Patent Document 1: JP 2011-65044 A

Singing contains various forms of expression (singing techniques and fluctuations), and the technique described in Patent Document 1 has the problem that these expressions degrade the accuracy of transcription.

One object of the present invention is to provide a technique for transcribing sound with improved accuracy.

A sound transcription device according to an embodiment of the present invention includes: an input sound acquisition unit that acquires an input sound; a feature amount detection unit that detects, in time series, feature amounts including a pitch based on the input sound acquired by the input sound acquisition unit; a period specifying unit that specifies a period corresponding to each sound constituting the input sound based on the feature amounts detected by the feature amount detection unit; and a pitch classification unit that classifies each sound into one of a plurality of reference pitches based on the pitch.

The period specifying unit may specify the period based on the pitch.

The period specifying unit may specify, as the period corresponding to each sound constituting the input sound, a period over which the pitch is determined to be the same.

When the pitch determined to be the same continues for a predetermined time, the period specifying unit may specify, as the period corresponding to each sound constituting the input sound, at least the span until a predetermined lock time elapses.

The period specifying unit may specify the period based on a change in volume included in the feature amounts.

The period specifying unit may specify the period based on an SN ratio of a frequency included in the feature amounts.

The period specifying unit may specify the period based on both a change in volume included in the feature amounts and an SN ratio of a frequency included in the feature amounts.

A program according to an embodiment of the present invention causes a computer to: acquire an input sound; detect, in time series, feature amounts including a pitch based on the acquired input sound; and specify a period corresponding to each sound constituting the input sound based on the feature amounts.

According to the present invention, a technique for transcribing sound with improved accuracy can be provided.

A block diagram showing the configuration of a sound transcription device according to an embodiment of the present invention.
A block diagram showing the configuration of a transcription function according to an embodiment of the present invention.
A diagram for explaining a method of specifying the note period corresponding to each sound constituting a singing voice, according to an embodiment of the present invention.
A diagram for explaining the concept of specifying, based on the temporal variation of the pitch, the note period corresponding to each sound constituting a singing voice, according to an embodiment of the present invention.
A block diagram showing the configuration of a transcription function according to a modification of an embodiment of the present invention.
A diagram for explaining a method of specifying the start point of a note period, according to an embodiment of the present invention.
A diagram explaining an example in which lyrics containing no consonant are included, using the method of detecting the start point of the note period corresponding to the start point of each sound contained in a singing voice, according to an embodiment of the present invention.
A diagram explaining the SN ratio of the frequency distribution of a singing voice used in an embodiment of the present invention.
A diagram for explaining a method of specifying note periods based on the SN ratio of the frequency distribution and the temporal variation of the volume included in the feature amounts, according to an embodiment of the present invention.
A block diagram showing the configuration of an evaluation function according to an embodiment of the present invention.
A block diagram showing the configuration of a data processing system according to an embodiment of the present invention.

Hereinafter, a sound transcription device according to an embodiment of the present invention will be described in detail with reference to the drawings. The embodiments described below are examples of embodiments of the present invention, and the present invention is not limited to them. In the drawings referred to in the embodiments, identical parts or parts having similar functions are given identical or similar reference signs, and repeated description of them may be omitted.

<First Embodiment>
[Configuration of the sound transcription device]
The sound transcription device 10 according to the first embodiment of the present invention will be described. FIG. 1 is a block diagram showing the configuration of the sound transcription device 10 according to the first embodiment of the present invention. The sound transcription device 10 is, for example, a karaoke device having a transcription function. The sound transcription device 10 includes a control unit 11, a storage unit 13, an operation unit 15, a display unit 17, a communication unit 19, and a signal processing unit 21. A sound input unit (for example, a microphone) 23 and a sound output unit (for example, a speaker) 25 are connected to the signal processing unit 21. These components are connected to one another via a bus.

The control unit 11 includes an arithmetic processing circuit such as a CPU. The control unit 11 executes, on the CPU, the control program 13a stored in the storage unit 13 to realize various functions in the sound transcription device 10. These functions include a function for transcribing a singing voice. The transcription function is described later.

The storage unit 13 is a storage device such as a nonvolatile memory or a hard disk. The storage unit 13 stores the control program 13a for realizing the transcription function. The control program 13a may be provided stored on a computer-readable recording medium such as a magnetic recording medium, an optical recording medium, a magneto-optical recording medium, or a semiconductor memory. In that case, the sound transcription device 10 need only include a device that reads the recording medium. The control program 13a may also be downloaded via a network such as the Internet.

The storage unit 13 also stores music data 13b and singing voice data 13c as data relating to singing. The music data 13b includes data related to a karaoke song, for example guide melody data, accompaniment data, and lyrics data. The guide melody data indicates the melody of the song. The accompaniment data indicates the accompaniment of the song. The guide melody data and the accompaniment data may be expressed in MIDI format. The lyrics data is data for displaying the lyrics of the song and data indicating the timing at which the color of the displayed lyrics telop is changed. The singing voice data 13c is data corresponding to the singing voice that the singer inputs through the sound input unit 23. In the present embodiment, the singing voice data 13c is stored in the storage unit 13 until the singing voice is transcribed by the transcription function.

The operation unit 15 is a device such as operation buttons provided on an operation panel or a remote controller, a keyboard, or a mouse, and outputs a signal corresponding to an input operation to the control unit 11. The display unit 17 is a display device such as a liquid crystal display or an organic EL display, and displays a screen under the control of the control unit 11. The operation unit 15 and the display unit 17 may be integrated to form a touch panel. Under the control of the control unit 11, the communication unit 19 connects to a communication line such as the Internet or a LAN (Local Area Network) and exchanges information with an external device such as a server. The function of the storage unit 13 may be realized by an external device with which the communication unit 19 can communicate.

The signal processing unit 21 includes a sound source that generates an audio signal from a MIDI-format signal, an A/D converter, a D/A converter, and the like. The singing voice is converted into an electric signal by the sound input unit 23, such as a microphone, and input to the signal processing unit 21, where it is A/D converted and output to the control unit 11. As described above, the singing voice is stored in the storage unit 13 as singing voice data. The accompaniment data is read out by the control unit 11, D/A converted by the signal processing unit 21, and output from the sound output unit 25, such as a speaker, as the accompaniment sound of the song. At this time, the guide melody may also be output from the sound output unit 25.

[Transcription function]
The transcription function realized when the control unit 11 of the sound transcription device 10 executes the control program 13a stored in the storage unit 13 will be described. Part or all of the configuration that realizes the transcription function described below may be implemented in hardware.

FIG. 2 is a block diagram showing the configuration of the transcription function 100 according to the first embodiment of the present invention. The transcription function 100 includes an input sound acquisition unit 101, a feature amount detection unit 103, a pitch classification unit 105, and a period specifying unit 107.

The input sound acquisition unit 101 acquires singing voice data corresponding to the input singing voice. In the present embodiment, the sound input to the sound input unit 23 during the period in which the accompaniment sound is output from the accompaniment output unit 111 is recognized as the target singing voice. In the present embodiment, the input sound acquisition unit 101 acquires the singing voice data 13c stored in the storage unit 13, but it may be configured to acquire the data directly from the signal processing unit 21. The input sound acquisition unit 101 is not limited to acquiring singing voice data indicating the sound input to the sound input unit 23; it may acquire, through the communication unit 19 via a network, singing voice data indicating sound input to an external device. The input sound acquisition unit 101 transmits the acquired singing voice data to the feature amount detection unit 103.

The feature amount detection unit 103 detects, in time series, feature amounts including the pitch of the singing voice from the singing voice data acquired by the input sound acquisition unit 101. The feature amount detection unit 103 can calculate the pitch using a known method, such as a method based on zero crossings of the singing voice waveform or Fourier analysis. The feature amount detection unit 103 may also detect volume, frequency, and the like in time series as feature amounts of the singing voice. The feature amount detection unit 103 transmits the feature amounts, including the pitches detected in time series, to the pitch classification unit 105.
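As a minimal illustrative sketch of zero-crossing pitch detection, one of the known methods named above, the following Python code estimates a frequency from the spacing of upward zero crossings and converts it to cents. The function names, the linear interpolation of crossing positions, and the 440 Hz reference for the cent scale are assumptions made for illustration and are not taken from the disclosure.

```python
import math


def zero_cross_pitch_hz(samples, sample_rate):
    """Estimate frequency (Hz) from the mean spacing of upward zero crossings.

    Hypothetical helper: returns None when fewer than two crossings exist.
    """
    crossings = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev < 0.0 <= cur:
            # Linearly interpolate the sub-sample position of the crossing.
            frac = -prev / (cur - prev)
            crossings.append(i - 1 + frac)
    if len(crossings) < 2:
        return None
    period = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return sample_rate / period


def hz_to_cents(freq_hz, ref_hz=440.0):
    """Convert a frequency to cents relative to ref_hz (assumed reference)."""
    return 1200.0 * math.log2(freq_hz / ref_hz)
```

A real singing voice contains harmonics and noise, so a practical detector would likely filter the signal first or use an autocorrelation or Fourier-based method; the sketch only shows the principle on a clean waveform.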

The pitch classification unit 105 classifies each sound constituting the singing voice into one of a plurality of reference pitches based on the pitch included in the acquired feature amounts. FIG. 3 is a diagram for explaining an example of the concept by which the pitch classification unit 105 classifies each sound constituting the singing voice into one of a plurality of reference pitches based on the pitch included in the feature amounts. FIG. 3 shows an example of a pitch waveform indicating the pitch of the singing voice in time series; the vertical axis indicates pitch (cents), and the horizontal axis indicates time (T). FIG. 3 shows the pitch waveform of frames f_{n-1} to f_{n+25} at times t_0 to t_27.

The pitch classification unit 105 calculates the average pitch for each frame (data samples delimited by a predetermined period) based on the pitches detected in time series. In FIG. 3, the average pitch of each frame is indicated by a black circle (●). Next, the pitch classification unit 105 rounds the calculated average pitch of each frame at the tens digit (rounding half up), thereby fitting it to a grid with a spacing of 100 cents. For example, in FIG. 3, the grid pitches to which the rounded average pitches are fitted are as follows:

Frames f_{n-1} to f_{n+2}: 100*(m+3) cents
Frames f_{n+3} to f_{n+4}: 100*(m+4) cents
Frames f_{n+5} to f_{n+7}: 100*(m+3) cents
Frames f_{n+8} to f_{n+10}: 100*(m+5) cents
Frame f_{n+11}: 100*(m+7) cents
Frames f_{n+12} to f_{n+15}: 100*(m+8) cents
Frames f_{n+16} to f_{n+17}: 100*(m+6) cents
Frames f_{n+18} to f_{n+22}: 100*(m+5) cents
Frames f_{n+23} to f_{n+24}: 100*(m+6) cents
Frame f_{n+25}: 100*(m+5) cents

The pitch classification unit 105 transmits time-series data indicating the grid pitch to which the average pitch of each frame is fitted to the period specifying unit 107. The time-series data generated by the pitch classification unit 105, indicating the grid pitch to which the average pitch of each frame of the singing voice is fitted, is also stored in the storage unit 13 of the sound transcription device 10.
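The per-frame averaging and 100-cent grid fitting described above can be sketched as follows. This is an illustrative example only: the function names and the assumption that each frame holds a fixed number of pitch samples are hypothetical, and rounding half up is implemented with floor(x + 0.5).

```python
import math


def frame_averages(pitches_cents, frame_len):
    """Average the pitch samples (in cents) of consecutive frames.

    frame_len is the assumed number of pitch samples per frame; a shorter
    final frame is averaged over the samples it actually contains.
    """
    averages = []
    for start in range(0, len(pitches_cents), frame_len):
        frame = pitches_cents[start:start + frame_len]
        averages.append(sum(frame) / len(frame))
    return averages


def snap_to_grid(avg_cents, step=100):
    """Round half up to the nearest multiple of step cents."""
    return int(math.floor(avg_cents / step + 0.5)) * step


def classify_frames(pitches_cents, frame_len, step=100):
    """Return the grid pitch (reference pitch) assigned to each frame."""
    return [snap_to_grid(a, step) for a in frame_averages(pitches_cents, frame_len)]
```

For instance, pitch samples of 300, 320, and 340 cents in one frame average to 320 cents and are fitted to the 300-cent grid, while an average of exactly 350 cents is fitted to 400 cents.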

An example of a method of classifying each sound constituting the singing voice into one of a plurality of reference pitches has been described with reference to FIG. 3. However, the method is not limited to the one described above, as long as the reference pitch to which the pitch of each frame belongs can be determined.

The period specifying unit 107 specifies the period (note period) corresponding to each sound constituting the singing voice based on the feature amounts, including the pitch, detected in time series by the feature amount detection unit 103. Here, the period specifying unit 107 specifies the note period corresponding to each sound constituting the singing voice based on the temporal variation of the pitch obtained by the pitch classification unit 105.

When frames whose average pitches are fitted to the same grid are detected consecutively for a predetermined number of frames or more (for example, two or more frames), the period specifying unit 107 specifies that frame period as a note period corresponding to a sound constituting the input sound. Here, when the same average pitch continues for a predetermined time, the period specifying unit 107 specifies at least the span until a predetermined lock time elapses as the note period. For example, suppose that the predetermined time is the time from the start to the end of two consecutive frames, and that the lock time is the time from the start to the end of three consecutive frames, namely those two frames plus the frame immediately following them. In that case, the period specifying unit 107 may set, as the lock time, the time from the start of the two consecutive frames having the same pitch to the end of the one frame that immediately follows them. That is, when two consecutive frames having the same pitch (frame A and frame B) are detected, the period specifying unit 107 may set the lock time as running from the start of frame A to the end of the frame (frame C) immediately following frames A and B, and may treat frames A to C as the note period of the pitch corresponding to frames A and B. In this case, the grid corresponding to the average pitch of frames A and B may differ from the grid corresponding to the average pitch of frame C. Within the lock time, no pitch other than the pitch that continued for the predetermined time is recognized as a note. When the number of consecutive frames whose average pitches are fitted to the same grid is less than the predetermined number, the period specifying unit 107 may exclude that frame period from the note periods.
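The lock-time rule above can be sketched in Python as follows. This is one possible reading, not the disclosed implementation: the function name and parameters are hypothetical, a note is opened when `min_run` consecutive frames share a grid pitch, it is held for at least `lock_frames` frames, and it is then extended (an assumption) while later frames keep the same grid pitch. Frame i is taken to span times t_i to t_{i+1}.

```python
def find_note_periods(grid, min_run=2, lock_frames=3):
    """Return (pitch, start_frame, end_frame) tuples; end_frame is exclusive.

    grid is the per-frame grid pitch. Frames that never form a run of
    min_run equal pitches outside a lock window are left out of all notes.
    """
    notes = []
    i, n = 0, len(grid)
    while i < n:
        run = grid[i:i + min_run]
        if len(run) == min_run and all(p == grid[i] for p in run):
            # Lock window: the note lasts at least lock_frames frames,
            # even if a different pitch appears inside the window.
            end = min(i + lock_frames, n)
            # Assumed extension: keep the note while the pitch is unchanged.
            while end < n and grid[end] == grid[i]:
                end += 1
            notes.append((grid[i], i, end))
            i = end
        else:
            i += 1
    return notes


# Grid pitches of frames f_{n-1} to f_{n+25}, written as k for 100*(m+k) cents.
example = [3] * 4 + [4] * 2 + [3] * 3 + [5] * 3 + [7] + [8] * 4 + [6] * 2 + [5] * 5 + [6] * 2 + [5]
periods = find_note_periods(example)
```

With a two-frame run and a three-frame lock, `periods` contains eight notes whose frame boundaries are 0, 4, 7, 10, 13, 17, 20, 24, and 27, and the isolated frames with k = 7 and the final k = 5 are absorbed into the lock windows of the preceding notes rather than becoming notes themselves.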

FIG. 4 is a diagram for explaining the concept by which the period specifying unit 107 specifies, based on the temporal variation of the pitch, the note period corresponding to each sound constituting the singing voice. In FIG. 4, the note periods corresponding to the sounds constituting the input sound in frames f_{n-1} to f_{n+25} (times t_0 to t_27) are indicated by hatching. In FIG. 4, as an example, the lock time is assumed to begin when pitches fitted to the same grid continue for two consecutive frames. Also as an example, the lock time runs from the start of the two consecutive frames having the same pitch to the end of the one frame that immediately follows them. That is, in the example shown in FIG. 4, each note period spans three or more consecutive frames.

In FIG. 4, the note periods are as follows:

100*(m+3) cents, corresponding to frames f_{n-1} to f_{n+2}: times t_0 to t_4
100*(m+4) cents, corresponding to frames f_{n+3} to f_{n+4}: times t_4 to t_7
100*(m+3) cents, corresponding to frames f_{n+5} to f_{n+7}: times t_7 to t_10
100*(m+5) cents, corresponding to frames f_{n+8} to f_{n+10}: times t_10 to t_13
100*(m+8) cents, corresponding to frames f_{n+12} to f_{n+15}: times t_13 to t_17
100*(m+6) cents, corresponding to frames f_{n+16} to f_{n+17}: times t_17 to t_20
100*(m+5) cents, corresponding to frames f_{n+19} to f_{n+22}: times t_20 to t_24
100*(m+6) cents, corresponding to frames f_{n+23} to f_{n+24}: times t_24 to t_27

The note periods corresponding to the sounds constituting the singing voice, as specified by the transcription function 100 according to an embodiment of the present invention, can be used to generate an evaluation criterion (reference) for singing. The time-series data indicating the note periods of the singing voice specified by the period specifying unit 107 is stored in the storage unit 13 of the sound transcription device 10. This time-series data is stored in association with the time-series data generated by the pitch classification unit 105, which indicates the grid pitch to which the average pitch of each frame of the singing voice is fitted.

(Modification)
The feature amount that the period specifying unit 107 uses to specify the note period corresponding to each sound constituting the singing voice is not limited to the pitch. The period specifying unit 107 may specify the note period corresponding to each sound constituting the singing voice based on a feature amount other than the pitch detected by the feature amount detection unit 103.

(Modification 1)
FIG. 5 is a block diagram showing the configuration of a transcription function 100a according to a modification of the first embodiment of the present invention. In the transcription function 100a, components that are the same as or similar to those of the transcription function 100 shown in FIG. 2 are given the same reference numerals, and redundant description is omitted. The transcription function 100a includes an input sound acquisition unit 101, a feature amount detection unit 103, a pitch classification unit 105, and a period specifying unit 107a. The period specifying unit 107a of the transcription function 100a according to an embodiment of the present invention may specify the period (note period) corresponding to each sound constituting the singing voice based on the temporal variation of the volume included in the feature amounts detected in time series by the feature amount detection unit 103. That is, the period specifying unit 107a specifies the note period corresponding to each sound constituting the singing voice based on a feature amount other than the pitch detected by the feature amount detection unit 103. Here, an example is described in which the period specifying unit 107a detects the break position corresponding to the start point of each sound included in the singing voice based on the time-series volume level detected by the feature amount detection unit 103. Each sound here corresponds, for example, to the pronunciation of one syllable of the lyrics. In this example, the start point of each sound corresponds to the timing at which a consonant switches to a vowel, that is, the point where the vowel begins. The break position corresponding to the start point of each sound does not necessarily coincide with that start point; it is a position determined from the start point by predetermined processing.

In Japanese, when a sound is pronounced as a combination of a consonant and a vowel, the volume level during the consonant tends to be lower than the volume level during the vowel. This tendency is seen not only when a single syllable is pronounced alone but also when multiple syllables are pronounced in succession. The period specifying unit 107a uses this characteristic to detect the break position corresponding to the start point of each sound included in the singing voice.

In the transcription function 100a, the feature amount detection unit 103 detects, in time series, feature amounts including the pitch and volume of the singing voice from the singing voice data acquired by the input sound acquisition unit 101. The feature amount detection unit 103 transmits the pitch of the singing voice detected in time series to the pitch classification unit 105. The pitch classification unit 105 classifies each sound constituting the singing voice into one of a plurality of reference pitches based on the acquired pitch. The time-series data generated by the pitch classification unit 105, which indicates the pitch corresponding to the grid that fits the average pitch of each frame, is stored in the storage unit 13 of the sound transcription device 10. The feature amount detection unit 103 also transmits the feature amounts of the singing voice detected in time series, including the volume, to the period specifying unit 107a. The period specifying unit 107a specifies the period (note period) corresponding to each sound constituting the singing voice based on the temporal variation of the volume included in the feature amounts of the singing voice detected in time series by the feature amount detection unit 103.

FIG. 6 is a diagram for explaining the concept by which the period specifying unit 107a specifies the start point of a note period based on the temporal variation of the volume level. FIG. 6 illustrates the change in volume level over time when the syllables "mo" (consonant "m" + vowel "o"), "ri" (consonant "r" + vowel "i"), and "no" (consonant "n" + vowel "o") are sung. This volume change is the spectrum VS shown in FIG. 6. The time axis indicates the time elapsed since the singing voice was input (from the evaluation start timing). For every syllable, the volume level tends to drop during the consonant. For languages other than Japanese, for example Chinese singing voices, applying a low-pass filter at about 1 kHz to the singing voice data makes the difference between vowels and consonants easier to see. The cutoff frequency may be a fixed value such as 1 kHz or may vary based on the pitch.

Even when a plurality of syllables are pronounced in succession, the period specifying unit 107a detects break positions by using the drop in volume level at the consonant portions. In the example shown in FIG. 6, the period specifying unit 107a determines a volume level threshold Vth and detects, as a break position, each point at which the volume level transitions from below Vth to above Vth. In FIG. 6, the break positions are detected as times ts1, ts2, ts3, and so on. Vth may be any predetermined volume level; in this example it is determined based on the background level Vb and the maximum level Vp of the volume level. For example, with levels expressed in dB, Vth may be determined by a predetermined arithmetic expression such as Vth = Vp × 0.9(Vp − Vb). In FIG. 6, Vp indicates the maximum volume level over the entire song. Alternatively, a song may be divided into a plurality of sections and a Vth set for each section according to a predetermined rule. In this case, Vth may be determined using the Vb and Vp of each section.
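
The threshold-crossing detection described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `detect_break_positions`, the list-based inputs, and the choice to pass Vth in as an already computed value are all assumptions made for the example.

```python
def detect_break_positions(times, levels, vth):
    """Return the times at which the volume level rises through vth.

    times  -- frame times (any unit), same length as levels (assumed input format)
    levels -- volume level of each frame, e.g. in dB
    vth    -- volume threshold Vth, supplied by the caller
    """
    breaks = []
    for i in range(1, len(levels)):
        # A break position is a transition from below Vth to Vth or above.
        if levels[i - 1] < vth <= levels[i]:
            breaks.append(times[i])
    return breaks


# Hypothetical frame data: three syllables separated by quieter consonants.
frame_times_ms = [0, 100, 200, 300, 400, 500, 600, 700]
frame_levels_db = [-60, -20, -18, -50, -15, -17, -55, -12]
break_positions = detect_break_positions(frame_times_ms, frame_levels_db, vth=-30)
```

With the hypothetical data above, a break position is reported at each frame where the level climbs back above the threshold after a consonant dip.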

The period specifying unit 107a specifies each detected break position as the start point of the note period corresponding to a sound constituting the singing voice. As shown in FIG. 6, the period specifying unit 107a may specify, as the end point of a note period, the point at which the volume level is lowest between the start point of that note period and the start point of the immediately following note period. Although not illustrated, the period specifying unit 107a may instead detect the point at which, after the start of the note period, the volume level transitions from above the threshold Vth to below it, and specify that point as the end point of the note period. The period specifying unit 107a may also specify the span from the start point of one note period to the start point of the next as a single note period. That is, the period specifying unit 107a may specify the start point of a note period as the end point of the immediately preceding note period.
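
As a rough sketch of the first end-point rule above (the lowest volume level between one start point and the next), the following assumes the start points are already available as frame indices. The function name `note_periods_min_level` and the handling of the final note, which simply searches to the end of the data, are assumptions for illustration and are not taken from the description.

```python
def note_periods_min_level(times, levels, start_indices):
    """Pair each note start with an end at the quietest following frame.

    start_indices -- frame indices of detected break positions, ascending
    Returns a list of (start_time, end_time) tuples.
    """
    periods = []
    for k, start in enumerate(start_indices):
        # Search up to, but not including, the next start point.
        # For the last note there is no next start, so search to the end
        # of the data (an assumption made for this sketch).
        if k + 1 < len(start_indices):
            stop = start_indices[k + 1]
        else:
            stop = len(levels)
        candidates = range(start + 1, stop)
        if candidates:
            end = min(candidates, key=lambda i: levels[i])
        else:
            end = start  # no frames after the start point
        periods.append((times[start], times[end]))
    return periods


frame_times_ms = [0, 100, 200, 300, 400, 500, 600, 700]
frame_levels_db = [-60, -20, -18, -50, -15, -17, -55, -12]
periods = note_periods_min_level(frame_times_ms, frame_levels_db, [1, 4, 7])
```

The alternative rule, in which each note ends where the next begins, would need no search at all: the end time of note k is simply the start time of note k + 1.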

(Modification 2)
In Modification 1 described above, the period specifying unit 107a uses the tendency of the volume level during a consonant to be lower than during a vowel to detect the break position (the start point of the note period) corresponding to the start point of each sound included in the singing voice. However, when the lyrics include a vowel-only syllable with no consonant, the start point of the note corresponding to that vowel-only sound cannot be detected. FIG. 7 is a diagram illustrating an example in which the method of Modification 1, which detects the start point of each note period from the temporal change in volume, is applied to lyrics that include a syllable without a consonant. Here, the lyrics are "de" (consonant "d" + vowel "e"), "a" (vowel "a"), and "ta" (consonant "t" + vowel "a"). As shown in FIG. 7, there is no consonant between "de" and "a", so no drop in volume level appears there. Consequently, break positions corresponding to note-period start points are detected at ts5, ts7, and ts8, but the break position corresponding to the start of the "a" sound is not detected. In such a case, the period specifying unit 107a may specify note periods using the frequency of the singing voice together with the temporal change in its volume as the feature amounts.

In this case, the feature amount detection unit 103 transmits feature amounts including the volume and frequency of the singing voice detected in time series to the period specifying unit 107a. The period specifying unit 107a specifies the period (note period) corresponding to each sound constituting the singing voice based on the temporal variation of the volume and frequency included in those feature amounts. Here, the period specifying unit 107a analyzes the frequency of the singing voice included in the feature amounts detected by the feature amount detection unit 103, calculates the temporal change of the frequency distribution using Fourier analysis, and further calculates the SN ratio of the frequency distribution. FIG. 8 is a diagram illustrating the SN ratio of the frequency distribution of the singing voice. The spectrum FS of the frequency distribution includes, as peaks, the fundamental f0 and its integer-multiple harmonics f1, f2, and so on. For each peak, the integral of the region within the half-value width Wf0, Wf1, Wf2, ... (the hatched portion) is taken as the signal S (Sf0, Sf1, Sf2, ...), the remainder is taken as the noise N, and S/N is calculated as the SN ratio. The SN ratio is calculated over a predetermined frequency range, for example up to the peak of a predetermined harmonic (for example, the third harmonic).
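
The SN-ratio calculation for a single frame might look roughly like the sketch below. Several simplifications are assumed here that the description does not state: the magnitude spectrum is given as a list of equally spaced bins, the fundamental frequency is already known, each peak is taken to sit at the bin nearest a multiple of the fundamental, integrals are approximated by sums over bins, and the frequency range ends at the upper half-width edge of the last peak considered. The name `sn_ratio` and its parameters are hypothetical.

```python
def sn_ratio(spectrum, bin_hz, f0_hz, n_peaks=3):
    """Estimate S/N for one frame from a magnitude spectrum.

    spectrum -- magnitude per frequency bin (bin i covers i * bin_hz)
    bin_hz   -- spacing between bins in Hz
    f0_hz    -- fundamental frequency of the frame (assumed known)
    n_peaks  -- number of peaks to use: f0, 2*f0, ..., n_peaks*f0
    """
    signal_bins = set()
    last_bin = 0
    for k in range(1, n_peaks + 1):
        center = round(k * f0_hz / bin_hz)
        if center >= len(spectrum):
            break
        half = spectrum[center] / 2.0
        # Grow the region outward while the magnitude stays at or above
        # half of the peak height (the half-value width).
        lo = center
        while lo > 0 and spectrum[lo - 1] >= half:
            lo -= 1
        hi = center
        while hi + 1 < len(spectrum) and spectrum[hi + 1] >= half:
            hi += 1
        signal_bins.update(range(lo, hi + 1))
        last_bin = hi
    # Signal: bins inside the half-value widths. Noise: every other bin
    # within the evaluated range.
    signal = sum(spectrum[i] for i in signal_bins)
    noise = sum(spectrum[i] for i in range(last_bin + 1) if i not in signal_bins)
    return signal / noise if noise > 0 else float("inf")


# Hypothetical spectra with peaks at 200, 400, and 600 Hz (100 Hz bins).
clean_frame = [1, 1, 10, 1, 8, 1, 6, 1]
noisy_frame = [2, 2, 10, 2, 8, 2, 6, 2]
clean_sn = sn_ratio(clean_frame, bin_hz=100, f0_hz=200)
noisy_sn = sn_ratio(noisy_frame, bin_hz=100, f0_hz=200)
```

Evaluating this frame by frame would give the time series of SN ratios on which a threshold can then be applied.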

The period specifying unit 107a specifies note periods using the calculated SN ratio of the frequency distribution and the temporal variation of the volume included in the feature amounts detected in time series by the feature amount detection unit 103. FIG. 9 is a diagram for explaining the concept of specifying note periods based on these two quantities. As shown by the spectrum DS in FIG. 9, the SN ratio calculated as described above tends to be low at the start point of each sound (where the noise N component is large) and then to increase sharply. Vowel-only sounds show the same tendency. The period specifying unit 107a exploits this tendency when detecting the start point of a note period using the SN ratio.

The period specifying unit 107a first detects break positions based on the volume level, as in Modification 1. It then uses the SN ratio to detect further break positions that could not be detected from the volume level, that is, break positions lying between adjacent volume-based break positions. For example, the period specifying unit 107a determines a predetermined threshold Vthf for the SN ratio. It then detects, as a break position, a position at which the SN ratio decreases by at least a predetermined proportion and then turns to increase; in this example, a position at which the SN ratio transitions from below Vthf to above Vthf. In doing so, the period specifying unit 107a detects break positions only at positions separated by at least a predetermined time from the break positions detected based on the volume level. In the example of FIG. 9, "tsp6" is detected as a break position based on the SN ratio. In this example, "ts5" and "ts7" are close to break positions detected from the volume level, so they are not detected as break positions based on the SN ratio.
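
One way to picture the combination of the two detectors is the sketch below. It assumes the volume-based break positions and a per-frame SN-ratio series are already available; the function name `add_sn_breaks`, the parameter `min_gap` standing in for the "predetermined time", and the sample values are all hypothetical.

```python
def add_sn_breaks(volume_breaks, sn_times, sn_values, vthf, min_gap):
    """Merge SN-ratio break positions into the volume-based ones.

    An SN-ratio crossing (below vthf to vthf or above) is accepted only if
    it lies at least min_gap away from every volume-based break position.
    """
    merged = list(volume_breaks)
    for i in range(1, len(sn_values)):
        if sn_values[i - 1] < vthf <= sn_values[i]:
            t = sn_times[i]
            if all(abs(t - b) >= min_gap for b in volume_breaks):
                merged.append(t)
    return sorted(merged)


# Hypothetical data: the SN ratio dips at 0, 300, and 600 ms, but only the
# crossing at 400 ms is far enough from the volume-based breaks.
sn_times_ms = [0, 100, 200, 300, 400, 500, 600, 700, 800]
sn_series = [1, 5, 5, 1, 5, 5, 1, 5, 5]
all_breaks = add_sn_breaks([100, 700], sn_times_ms, sn_series, vthf=3, min_gap=150)
```

In this sketch the crossings at 100 ms and 700 ms coincide with volume-based break positions and are discarded, which mirrors how only the intermediate position is added in the example of FIG. 9.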

The threshold Vthf for the SN ratio may be determined based on the minimum and maximum values of the SN ratio. For example, with the minimum value denoted SNm and the maximum value SNp, Vthf may be determined by a predetermined arithmetic expression such as Vthf = SNp × 0.9(SNp − SNm). Alternatively, Vthf may be determined as follows. The SN ratio level at any one of the break positions determined from the volume level ("ts5" and "ts7" in the example of FIG. 9) may be used as the threshold Vthf; for the break position "ts5" in FIG. 9, this level is CP5. The threshold Vthf may also be updated each time a break position determined from the volume level is reached. For example, in the interval after "ts5" and until "ts7", the SN ratio CP5 at "ts5" may be used as the threshold Vthf5, and in the interval after "ts7", the SN ratio CP7 at "ts7" may be used as the threshold Vthf7. This indirectly correlates the volume level threshold Vth with the SN ratio threshold Vthf. As a result, even when break positions are detected by different methods, the regularity of the break positions can be evaluated without correcting for the difference between the methods.

Thus, even if a vowel-only sound in a continuous singing voice produces a break position (note-period start point) that the period specifying unit 107a cannot detect from the volume level alone, that break position can be detected by using the SN ratio. Detection of break positions using the SN ratio need not be combined with detection using the volume level. The period specifying unit 107a can specify the start point of a note period based only on the frequency of the singing voice included in the feature amounts detected by the feature amount detection unit 103. In this case, the period specifying unit 107a may specify the start point of the note period immediately following a given note period as the end point of that given note period. The period specifying unit 107a may also specify the point at which the minimum value SNm is detected after the start of a note period as the end point of that note period.

When a vowel-only sound in a continuous singing voice produces a break position (note-period start point) that the period specifying unit 107a cannot detect from the volume level alone, the break position may instead be detected using the harmonic ratio of the vowel. The harmonic ratio of a vowel changes over time in the same manner as the SN ratio shown in FIG. 9. That is, it tends to be low at the start point of each sound (the ratio of harmonics to the fundamental is low) and then to increase sharply. The period specifying unit 107a can therefore also detect the start point of a note period using the harmonic ratio of the vowel.

<Second Embodiment>
The note period corresponding to each sound constituting the singing voice, as specified by the transcription function 100 described above, can be used to generate an evaluation criterion (reference) for singing. In addition to the transcription functions 100 and 100a described above, the functions realized in the sound transcription device 10 shown in FIG. 1 may include a singing evaluation function that uses an evaluation criterion based on the note periods specified by the transcription functions 100 and 100a. The evaluation function 200, which is realized when the control unit 11 of the sound transcription device 10 executes the control program 13a stored in the storage unit 13, is described below. Part or all of the configuration that realizes the evaluation function 200 may be implemented in hardware.

FIG. 10 is a block diagram showing the configuration of the evaluation function 200 according to the second embodiment of the present invention. The control program 13a of the sound transcription device 10 also includes the transcription functions 100 and 100a described in the first embodiment. FIG. 10 shows an example that includes the transcription function 100, but the transcription function 100a may be included. In FIG. 10, configurations that are the same as or similar to those in FIG. 2 are given the same reference numerals, and redundant description is omitted.

Referring to FIG. 10, the evaluation function 200 includes an input sound acquisition unit 201, a feature amount acquisition unit 203, a reference data acquisition unit 205, a comparison unit 207, and an evaluation unit 209. The evaluation function 200 evaluates a singing voice different from the one used by the transcription function 100 to specify note periods. As the evaluation criterion it uses the time-series data indicating the note periods of the singing voice generated by the transcription function 100, together with the associated time-series data indicating the pitch corresponding to the grid that fits the average pitch of each frame of that singing voice.

The input sound acquisition unit 201 of the evaluation function 200 acquires singing voice data corresponding to the input singing voice. In this embodiment, the sound input to the sound input unit 23 while the accompaniment sound is being output is recognized as the singing voice to be evaluated. In this embodiment the input sound acquisition unit 201 acquires the singing voice data 13c stored in the storage unit 13, but it may be configured to acquire the data directly from the signal processing unit 21. The input sound acquisition unit 201 is not limited to acquiring singing voice data representing sound input to the sound input unit 23; it may acquire singing voice data representing sound input to an external device via the network through the communication unit 19. The input sound acquisition unit 201 transmits the acquired singing voice data to the feature amount acquisition unit 203.

The feature amount acquisition unit 203 detects, in time series, feature amounts including the pitch and volume of the singing voice from the singing voice data acquired by the input sound acquisition unit 201. The feature amount acquisition unit 203 can calculate the pitch using a known method, such as a method based on zero crossings of the singing voice waveform or Fourier analysis. The feature amount acquisition unit 203 transmits the feature amounts detected in time series, including the pitch and volume, to the comparison unit 207.

The reference data acquisition unit 205 acquires from the storage unit 13 the time-series data, generated by the pitch classification unit 105 of the transcription function 100, indicating the pitch corresponding to the grid that fits the average pitch of each frame of the singing voice, and the time-series data indicating the note periods specified by the period specifying unit 107. The reference data acquisition unit 205 outputs to the comparison unit 207, as the evaluation criterion, the time-series data indicating the note periods of the singing voice and the associated time-series pitch data.

The comparison unit 207 compares the data indicating the feature amounts acquired from the feature amount acquisition unit 203 with the evaluation criterion acquired from the reference data acquisition unit 205, and outputs the comparison result to the evaluation unit 209. Based on the comparison result output from the comparison unit 207, the evaluation unit 209 calculates an evaluation value that serves as an index for evaluating the singing. The evaluation unit 209 calculates a higher evaluation value the more closely the data indicating the feature amounts of the singer's voice matches the corresponding evaluation criterion, and a lower evaluation value the greater the mismatch.
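
The description does not give a formula for the evaluation value, so the following is only one possible sketch of a score that rises with agreement and falls with mismatch. It assumes frame-aligned pitch sequences expressed as semitone numbers, with `None` marking frames outside any note period (in the reference) or unvoiced frames (in the sung data). The function name `evaluation_value`, the 0 to 100 scale, and the `tolerance` parameter are assumptions.

```python
def evaluation_value(sung_pitches, reference_pitches, tolerance=0.5):
    """Return the percentage of reference frames matched by the singer.

    Frames where the reference is None (no note period) are skipped.
    A frame counts as a match when the sung pitch is within `tolerance`
    semitones of the reference pitch.
    """
    pairs = [(s, r) for s, r in zip(sung_pitches, reference_pitches)
             if r is not None]
    if not pairs:
        return 0.0
    hits = sum(1 for s, r in pairs
               if s is not None and abs(s - r) <= tolerance)
    return 100.0 * hits / len(pairs)


# Hypothetical frame data: two of the four reference frames are matched.
reference = [60, 60, None, 62, 62]
sung = [60.2, 61, 50, 62, None]
score = evaluation_value(sung, reference)
```

A fuller comparison could also score how closely the sung note start and end times match the reference note periods.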

<Third Embodiment>
In the second embodiment described above, the evaluation criterion generated by the transcription functions 100 and 100a (the time-series data indicating the note periods of the singing voice and the associated time-series data indicating the pitch corresponding to the grid that fits the average pitch of each frame of the singing voice) was used by the evaluation function 200 to evaluate a singing voice. However, the evaluation function 200 of the sound transcription device 10 may instead evaluate singing using a statistical evaluation criterion generated by statistically processing data indicating note periods generated by a plurality of sound transcription devices 10.

[Data processing system configuration]
FIG. 11 is a block diagram showing the configuration of a data processing system according to the third embodiment of the present invention. The data processing system 1000 includes a sound transcription device 10, a data processing device 20, and a database 30. These components are connected via a network 40 such as the Internet. The sound transcription device 10 is the same as that of the first embodiment, so detailed description is omitted. In the data processing system 1000, a plurality of sound transcription devices 10 are connected to the network 40. Each sound transcription device 10 is, for example, a karaoke device; in this example, a karaoke device capable of singing evaluation. The sound transcription device 10 may also be a terminal device such as a smartphone.

In this embodiment, a singing voice is input to each of the plurality of sound transcription devices 10, feature amounts are detected from the singing voice, and the note period of each sound constituting the input singing voice is specified based on the detected feature amounts. Likewise, each of the plurality of sound transcription devices 10 generates, based on the pitch of the singing voice, time-series data indicating the pitch corresponding to the grid that fits the average pitch of each frame of the singing voice. The data indicating the note periods and the time-series data indicating the pitch are first stored in association with each other in the storage unit 13 of each sound transcription device 10. They are then transmitted via the network 40 to the database 30 together with an identifier identifying the song, and registered as note period data 30a and pitch distribution data 30b, respectively. This transmission to the database 30 may pass through the data processing device 20. The identifier identifying the song may be acquired from the music data 13b stored in the storage unit 13 of the sound transcription device 10. The note period data 30a and the pitch distribution data 30b are transmitted to the data processing device 20 through the network 40 and statistically processed there to calculate statistical note periods and statistical pitches. The statistical note periods and statistical pitches calculated by the data processing device 20 are registered in the database 30 as note data 30c and pitch data 30d, respectively.

The note data 30c is data indicating the result of statistical processing of data indicating the note periods of a plurality of singing voices. For example, the note data 30c may be data indicating the distribution of note-period timings for each sound constituting the singing voice, obtained by statistically processing the note period data 30a of a plurality of singing voices sung in the past. The note data 30c can also include various statistics that can be calculated from the distribution, such as measures of dispersion (standard deviation, variance) and representative values (mode, median, mean). The note data 30c serves as the evaluation criterion for the note period of each sound constituting the singing voice when a singing voice is evaluated.
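
The kind of per-note statistics mentioned above could be computed along the following lines using Python's standard `statistics` module. The sketch makes a strong simplifying assumption that the description does not state: the k-th note start in every performance is taken to correspond to the same sound of the song. The function name `summarize_note_starts` and the dictionary layout are hypothetical.

```python
import statistics


def summarize_note_starts(performances):
    """Summarize note start times across several performances of one song.

    performances -- list of lists of note start times, one list per
                    performance, assumed to be aligned note by note
    Returns one dict of statistics per note.
    """
    note_count = min(len(p) for p in performances)
    summary = []
    for k in range(note_count):
        samples = [p[k] for p in performances]
        summary.append({
            "mean": statistics.mean(samples),
            "median": statistics.median(samples),
            # Population standard deviation as a measure of dispersion.
            "stdev": statistics.pstdev(samples),
        })
    return summary


# Hypothetical start times (ms) of two notes from three performances.
note_stats = summarize_note_starts([[100, 400], [110, 410], [90, 420]])
```

A representative value such as the median could then serve as the reference timing, with the dispersion indicating how much tolerance to allow.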

The pitch data 30d is data indicating the result of statistical processing of time-series data indicating the pitches of a plurality of singing voices. For example, the pitch data 30d may be time-series data indicating the distribution of the pitch of each sound constituting the singing voice, obtained by statistically processing the pitch distribution data 30b of a plurality of singing voices sung in the past. The pitch data 30d can also include various statistics that can be calculated from the distribution, such as a representative value (mode). The pitch data 30d is time-series data and serves as the evaluation criterion for the pitch of each sound constituting the singing voice when a singing voice is evaluated. If the timing of the pitch of a sound in the pitch data 30d deviates from the note period of the corresponding sound in the note data 30c, the deviation can be ignored as long as it is within a predetermined time. The predetermined time may be, for example, a length corresponding to the lock time described earlier.

This embodiment shows an example in which the data processing device 20 statistically processes the plurality of note period data 30a and pitch distribution data 30b transmitted from the plurality of sound transcription devices 10. However, the embodiment is not limited to this form, and part of the processing performed by the sound transcription device 10 may be performed by the data processing device 20. For example, the data processing device 20 may acquire from the plurality of sound transcription devices 10 a plurality of feature amount data indicating feature amounts, including pitch, in time series, statistically process the acquired data, and perform the same processing as the period specifying units 107 and 107a of the sound transcription device 10 to calculate the note data 30c and the pitch data 30d.

As described above, the note data 30c and the pitch data 30d generated from singing voices by each sound transcription device 10 or by the data processing device 20 are registered in the database 30 in association with each song (for example, with each identifier identifying the song related to the singing voice). Statistically processing the note periods specified from a plurality of singing voices in this way makes it possible to specify note periods with higher accuracy. In addition, by newly acquiring data indicating note periods and time-series data indicating pitch from each sound transcription device 10, and having the data processing device 20 perform the statistical processing again together with the data stored in the database 30, the note data 30c and the pitch data 30d of a song registered in the database 30 can be updated. The data processing device 20 can also perform the statistical processing after assigning a weight to the data indicating note periods specified from singing voice data that was rated highly by the evaluation function 200 of each sound transcription device 10. In this case, the data processing device 20 may select only the data indicating note periods that correspond to singing voices with particularly high scores and statistically process that data.
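The weighting and selection described above can be sketched as follows. The specification only says that a weight is assigned and that statistical processing is performed; using the evaluation score itself as the weight and a weighted mean as the statistic are assumptions made for illustration, as are the function name `aggregate_note_periods`, the `min_score` parameter, and the data layout.

```python
def aggregate_note_periods(voices, min_score=None):
    """Combine note periods from several singing voices by weighted mean.

    voices: list of dicts {"periods": [(start, end), ...], "score": float},
            where score is the evaluation result for that singing voice.
    min_score: if given, only voices scoring at least this value are used
               (the "select only high-scoring data" variant).
    Returns a list of (start, end) tuples, one per note.
    """
    selected = [
        v for v in voices if min_score is None or v["score"] >= min_score
    ]
    if not selected:
        return []
    total_weight = sum(v["score"] for v in selected)
    if total_weight <= 0:
        raise ValueError("the sum of the weights must be positive")
    # Only notes present in every selected voice are combined (an assumption).
    note_count = min(len(v["periods"]) for v in selected)
    combined = []
    for i in range(note_count):
        start = sum(v["score"] * v["periods"][i][0] for v in selected)
        end = sum(v["score"] * v["periods"][i][1] for v in selected)
        combined.append((start / total_weight, end / total_weight))
    return combined
```

Re-running the same function on the stored data plus newly acquired data corresponds to the update of the note data 30c described above.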

Although FIG. 11 shows a configuration in which the data processing device 20 and the database 30 are connected via the network 40, the configuration is not limited to this, and the database 30 may be connected directly to the data processing device 20. The database 30 may also hold not only the note period data 30a but also the singing voice data from which it was derived.

[Configuration of the Data Processing Device]
As shown in FIG. 11, the data processing device 20 includes a control unit 21, a storage unit 23, and a communication unit 25. The control unit 21 includes an arithmetic processing circuit such as a CPU. The control unit 21 causes the CPU to execute a control program 23a stored in the storage unit 23, thereby realizing various functions in the data processing device 20. These functions include a function (evaluation reference generation function) that performs statistical processing on the note periods of singing voices and generates note period data serving as the evaluation reference for singing voices.

The storage unit 23 is a storage device such as a nonvolatile memory or a hard disk. The storage unit 23 stores the control program 23a for realizing the evaluation reference generation function. The control program 23a need only be executable by a computer, and may be provided stored on a computer-readable recording medium such as a magnetic recording medium, an optical recording medium, a magneto-optical recording medium, or a semiconductor memory. In this case, the data processing device 20 need only include a device that reads the recording medium. The control program 23a may also be downloaded from an external server or the like via the network 40. Under the control of the control unit 21, the communication unit 25 connects to the network 40 and exchanges information with external devices connected to the network 40.

When singing is evaluated by the evaluation function 200 described in the second embodiment of the present invention, the reference data acquisition unit 205 of the evaluation function 200 may use the note data 30c and the pitch data 30d specified by the data processing device 20. In this case, the mutually associated note data 30c and pitch data 30d are downloaded from the database 30 via the network 40 and received by the communication unit 19 of the sound transcription device 10. The note data 30c and the pitch data 30d received by the communication unit 19 are passed to the reference data acquisition unit 205. The received note data 30c and pitch data 30d may also be stored in the storage unit 13. The note data 30c and the pitch data 30d that are acquired are those associated with the singing evaluated by the evaluation function 200 of the sound transcription device 10, that is, those associated with the song related to the singing voice acquired by the input sound acquisition unit 201. This association can be made using, for example, an identifier that identifies the song. In that case, the identifier may be acquired by the input sound acquisition unit 201.
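The acquisition of reference data by song identifier, with optional local storage, might look like the following minimal sketch. The class name `ReferenceDataCache`, the `download` callable standing in for the transfer from the database 30 over the network 40, and the dictionary layout of the reference data are all hypothetical; the in-memory dictionary only stands in for the storage unit 13.

```python
class ReferenceDataCache:
    """Fetches reference data by song identifier and keeps a local copy."""

    def __init__(self, download):
        # download: callable taking a song identifier and returning the
        # associated note data and pitch data (assumed interface).
        self._download = download
        self._local = {}  # stands in for the storage unit 13

    def get(self, song_id):
        """Return the reference data for song_id, downloading it only once."""
        if song_id not in self._local:
            self._local[song_id] = self._download(song_id)
        return self._local[song_id]


# Illustrative stand-in for the database 30, keyed by song identifier.
database = {
    "song-001": {"note_data": [(0.0, 0.5, 60)], "pitch_data": [60]},
}
cache = ReferenceDataCache(database.__getitem__)
reference = cache.get("song-001")
```

Keying everything on the song identifier keeps the note data and pitch data for one song together, which mirrors the association described above.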

In the transcription function 100 described above, the sound indicated by the singing voice data acquired by the input sound acquisition unit 101 is not limited to the voice of a singer, and may be a voice produced by singing synthesis or an instrument sound. In the case of an instrument sound, a single-note (monophonic) performance is desirable. Instrument sounds have no concept of consonants and vowels, but depending on the playing method, the onset of each sound shows a tendency similar to that of singing. The same determination can therefore be made for instrument sounds in some cases.

Configurations obtained by a person skilled in the art appropriately adding, deleting, or redesigning components, or adding, omitting, or changing the conditions of steps, on the basis of the configurations described as embodiments of the present invention, are also included in the scope of the present invention as long as they retain the gist of the present invention.

Furthermore, other operational effects that differ from those brought about by the aspects of the embodiments described above are naturally understood to be brought about by the present invention, provided that they are apparent from the description in this specification or could easily be predicted by a person skilled in the art.

10…sound transcription device, 11…control unit, 13…storage unit, 15…operation unit, 17…display unit, 19…communication unit, 21…signal processing unit, 23…sound input unit, 25…sound output unit, 100…transcription function, 101…input sound acquisition unit, 103…feature amount detection unit, 105…pitch classification unit, 107, 107a…period specifying unit, 111…accompaniment output unit, 200…evaluation function, 201…input sound acquisition unit, 203…feature amount acquisition unit, 205…reference data acquisition unit, 207…comparison unit, 209…evaluation unit, 1000…data processing system, 20…data processing device, 21…control unit, 23…storage unit, 23a…control program, 25…communication unit, 30…database, 30a…note period data, 30b…pitch distribution data, 30c…note data, 30d…pitch data, 40…network

Claims

A sound transcription device comprising:
an input sound acquisition unit that acquires an input sound;
a feature amount detection unit that detects, in time series, a feature amount including a pitch based on the input sound acquired by the input sound acquisition unit;
a period specifying unit that specifies a period corresponding to each sound constituting the input sound based on the feature amount detected by the feature amount detection unit; and
a pitch classification unit that classifies each sound into one of a plurality of reference pitches based on the pitch.

The sound transcription device according to claim 1, wherein the period specifying unit specifies the period based on the pitch.

The sound transcription device according to claim 2, wherein the period specifying unit specifies a period in which the pitches are determined to be the same as a period corresponding to each sound constituting the input sound.

The sound transcription device according to claim 3, wherein, when pitches determined to be the same continue for a predetermined time, the period specifying unit specifies, as the period corresponding to each sound constituting the input sound, at least the period until a predetermined lock time elapses.

The sound transcription device according to claim 1, wherein the period specifying unit specifies the period based on a change in volume included in the feature amount.

The sound transcription device according to claim 1, wherein the period specifying unit specifies the period based on an SN ratio of a frequency included in the feature amount.

The sound transcription device according to claim 5, wherein the period specifying unit specifies the period based on a change in volume included in the feature amount and an SN ratio of a frequency included in the feature amount.