JP2017111268A

JP2017111268A - Technique judgement device

Info

Publication number: JP2017111268A
Application number: JP2015244827A
Authority: JP
Inventors: Ryuichi Nariyama; Tatsuya Terajima; Shuichi Matsumoto
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2015-12-16
Filing date: 2015-12-16
Publication date: 2017-06-22
Anticipated expiration: 2035-12-16
Also published as: JP6627482B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique judgement device capable of judging the technique used in an input sound.
SOLUTION: A technique judgement device according to an embodiment includes: an input sound acquisition unit that acquires an input sound; a feature amount detection unit that detects a feature amount of the acquired input sound in time series; a flat part detection unit that detects a flat part of the feature amount based on the detected feature amount; and a technique determination unit that judges the technique in the input sound based on the variation of the feature amount in a predetermined period before or after the flat part of the feature amount.
SELECTED DRAWING: Figure 2

Description

The present invention relates to technology for determining the technique used in an input sound.

Karaoke apparatuses have a function of analyzing and evaluating a singing voice. Various methods are used to evaluate singing. As one such method, Patent Document 1, for example, discloses a karaoke scoring device that detects and evaluates the scoop-up (shakuri-age), which is one of the singing techniques.

Patent Document 1: JP 2004-102149 A

However, the technology disclosed in Patent Document 1 produces many false detections, and therefore cannot accurately determine and evaluate the singing technique.

One object of the present invention is to determine the technique used in an input sound.

According to an embodiment of the present invention, there is provided a technique determination apparatus including: an input sound acquisition unit that acquires an input sound; a feature amount detection unit that detects a feature amount of the input sound acquired by the input sound acquisition unit in time series; a flat part detection unit that detects a flat part of the feature amount based on the feature amount detected by the feature amount detection unit; and a technique determination unit that determines the technique of the input sound based on the variation of the feature amount in a predetermined period before or after the flat part of the feature amount.

The flat part detection unit may detect a period in which the time-series variation of the feature amount is equal to or less than a predetermined variation and, when that period lasts for a predetermined time or longer, detect the period as the flat part.

The feature amount may be pitch or volume.

The technique determination apparatus may include a tempo estimation unit that estimates the tempo of the input sound based on the feature amount, and the predetermined time may be determined according to the tempo.

When the flat part detection unit detects a plurality of flat parts, the predetermined period may be the period between two flat parts that are adjacent to each other in time series.

According to an embodiment of the present invention, there is provided a program that causes a computer to: acquire an input sound; detect a feature amount of the input sound in time series; detect a flat part of the feature amount based on the feature amount; and determine the technique of the input sound based on the variation of the feature amount in a predetermined period before or after the flat part of the feature amount.

According to an embodiment of the present invention, the technique used in an input sound can be determined.

A block diagram showing the configuration of the technique determination apparatus in an embodiment of the present invention.
A block diagram showing the configuration of the technique determination function and the evaluation function in an embodiment of the present invention.
A diagram for explaining the concept of volume flat part detection by the flat part detection unit of the technique determination function in an embodiment of the present invention.
A diagram for explaining the concept of pitch flat part detection by the flat part detection unit of the technique determination function in an embodiment of the present invention.
A diagram for explaining the concept of scoop-up (shakuri-age) determination by the technique determination unit of the technique determination function in an embodiment of the present invention.
A diagram for explaining the concept of scoop-down (shakuri-sage) determination by the technique determination unit of the technique determination function in an embodiment of the present invention.
A diagram for explaining the concept of flip-up (hane-age) determination by the technique determination unit of the technique determination function in an embodiment of the present invention.
A diagram for explaining the concept of fall determination by the technique determination unit of the technique determination function in an embodiment of the present invention.
A diagram for explaining the concept of crescendo determination by the technique determination unit of the technique determination function in an embodiment of the present invention.
A diagram for explaining the concept of decrescendo determination by the technique determination unit of the technique determination function in an embodiment of the present invention.
A diagram for explaining the concept of kobushi determination by the technique determination unit of the technique determination function in an embodiment of the present invention.
A diagram for explaining the concept of fortepiano determination by the technique determination unit of the technique determination function in an embodiment of the present invention.
A block diagram showing the configuration of the technique determination function and the evaluation function in a modification of an embodiment of the present invention.

Hereinafter, a technique determination apparatus according to an embodiment of the present invention will be described in detail with reference to the drawings. The embodiments described below are examples of embodiments of the present invention, and the present invention is not limited to these embodiments.

<First Embodiment>
A technique determination apparatus according to a first embodiment of the present invention will be described in detail with reference to the drawings. The technique determination apparatus according to the first embodiment has a function of determining the singing sound of a singing user (hereinafter sometimes referred to as a singer). This technique determination apparatus detects the pitch and volume of the singing sound in time series, and determines a specific technique based on changes in volume and variations in pitch.

[Hardware]
FIG. 1 is a block diagram showing the configuration of a technique determination apparatus 10 according to the first embodiment of the present invention. The technique determination apparatus 10 is, for example, a karaoke apparatus with a singing scoring function. The technique determination apparatus 10 includes a control unit 11, a storage unit 13, an operation unit 15, a display unit 17, a communication unit 19, and a signal processing unit 21. A sound input unit (for example, a microphone) 23 and a sound output unit (for example, a speaker) 25 are connected to the signal processing unit 21. These components are connected to one another via a bus.

The control unit 11 includes an arithmetic processing circuit such as a CPU. The control unit 11 executes, on the CPU, a control program 13a stored in the storage unit 13, thereby realizing various functions in the technique determination apparatus 10. The realized functions include a singing technique determination function. The realized functions may also include a singing evaluation function based on the technique determined by the technique determination.

The storage unit 13 is a storage device such as a nonvolatile memory or a hard disk. The storage unit 13 stores the control program 13a for realizing the technique determination function. The control program 13a may include the singing evaluation function. The control program 13a may be provided stored on a computer-readable recording medium such as a magnetic recording medium, an optical recording medium, a magneto-optical recording medium, or a semiconductor memory. In that case, the technique determination apparatus 10 only needs to include a device that reads the recording medium. The control program 13a may also be downloaded via a network such as the Internet.

The storage unit 13 also stores music data 13b and singing voice data 13c as data relating to singing. The storage unit 13 may further store evaluation reference data 13d. The music data 13b includes data related to a karaoke song, for example guide melody data, accompaniment data, and lyrics data. The guide melody data indicates the melody of the song. The accompaniment data indicates the accompaniment of the song. The guide melody data and the accompaniment data may be expressed in MIDI format. The lyrics data consists of data for displaying the lyrics of the song and data indicating the timing at which the displayed lyrics telop changes color. The singing voice data 13c corresponds to the singing voice that the singer inputs through the sound input unit 23. In the present embodiment, the singing voice data 13c is held in the storage unit 13 until the singing voice has been determined by the technique determination function. The evaluation reference data 13d is information that the evaluation function uses as a reference for evaluating the singing voice, and may be reference sound data associated in advance with the music data indicating the song to be evaluated (the song being output when the singing voice is input).

The operation unit 15 is a device such as operation buttons provided on an operation panel or a remote controller, a keyboard, or a mouse, and outputs a signal corresponding to the input operation to the control unit 11. The display unit 17 is a display device such as a liquid crystal display or an organic EL display, and displays a screen under the control of the control unit 11. The operation unit 15 and the display unit 17 may be integrated into a touch panel device. Under the control of the control unit 11, the communication unit 19 connects to a communication line such as the Internet or a LAN and exchanges information with an external device such as a server. The function of the storage unit 13 may be realized by an external device with which the communication unit 19 can communicate.

The signal processing unit 21 includes a sound source that generates an audio signal from a MIDI-format signal, an A/D converter, a D/A converter, and the like. The singing voice is converted into an electrical signal by the sound input unit 23 and input to the signal processing unit 21, where it is A/D converted and output to the control unit 11. The singing voice is stored in the storage unit 13 as the singing voice data 13c. The accompaniment data is read by the control unit 11, D/A converted by the signal processing unit 21, and output from the sound output unit 25 as the accompaniment of the song. At this time, the guide melody may also be output from the sound output unit 25.

[Technique determination function]
The technique determination function, which is realized when the control unit 11 of the technique determination apparatus 10 executes the control program 13a stored in the storage unit 13, will now be described. Part or all of the configuration that realizes the technique determination function described below may be implemented in hardware.

FIG. 2 is a block diagram showing the configuration of a technique determination function 100 according to the first embodiment of the present invention. Referring to FIG. 2, the technique determination function 100 includes an input sound acquisition unit 103, a feature amount detection unit 105, a flat part detection unit 111, and a technique determination unit 113.

The input sound acquisition unit 103 acquires singing voice data (the input sound) corresponding to the singing voice input from the sound input unit 23. The input sound acquisition unit 103 acquires the singing voice data directly from the signal processing unit 21, but may instead acquire singing voice data that has first been stored in the storage unit 13. The input sound acquisition unit 103 is not limited to acquiring singing voice data indicating sound input to the sound input unit 23; it may acquire, through the communication unit 19 and via a network, singing voice data indicating sound input to an external device. In the present embodiment, the input sound acquisition unit 103 sequentially outputs the singing voice data, which is input sequentially during playback of the music data, to the feature amount detection unit 105.

The feature amount detection unit 105 acquires the singing voice data from the input sound acquisition unit 103. For the acquired singing voice data, the feature amount detection unit 105 detects feature amounts of the singing sound, including volume and pitch. The feature amount detection unit 105 includes a volume detection unit 107 and a pitch detection unit 109.

The volume detection unit 107 detects the volume of the singing sound in time series from the singing voice data acquired by the input sound acquisition unit 103. That is, the volume detection unit 107 detects the temporal change in the volume of the singing sound based on the singing voice data. In the present embodiment, the volume detection unit 107 detects the volume based on the amplitude of the audio signal indicated by the singing voice data. The volume detection unit 107 outputs data indicating the detected volume in time series (a volume waveform) to the flat part detection unit 111 and the technique determination unit 113.
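As a non-limiting illustration, a per-frame volume could be derived from the signal amplitude as sketched below. The use of the root-mean-square (RMS) value, the non-overlapping framing, and the names `frame_volumes` and `frame_len` are assumptions made for this sketch; the description only states that the volume is detected based on the amplitude.

```python
import math


def frame_volumes(samples, frame_len):
    """Return one RMS volume value per non-overlapping frame.

    Assumption: volume is taken as the RMS amplitude of each frame.
    A trailing partial frame is ignored.
    """
    volumes = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        volumes.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return volumes


if __name__ == "__main__":
    # A loud frame followed by a silent frame.
    print(frame_volumes([1, -1, 1, -1, 0, 0, 0, 0], frame_len=4))
```

The resulting list plays the role of the volume waveform that is passed on to the flat part detection.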

The pitch detection unit 109 detects the pitch of the singing sound in time series from the singing voice data acquired by the input sound acquisition unit 103. Specifically, for each frame (a data sample delimited by a predetermined period), the pitch detection unit 109 detects the zero crossings at which the waveform of the audio signal indicated by the singing voice data changes from negative to positive, and determines the pitch (frequency) of the singing sound by measuring the time interval between those zero crossings. Before this step, a low-pass filter may be applied to the audio signal to cut high-frequency components that act as noise, and a high-pass filter may be applied to cut the DC component. Alternatively, the pitch detection unit 109 may determine the pitch from the spectrum obtained by applying an FFT (Fast Fourier Transform) to the singing voice data. The pitch detection unit 109 outputs data indicating the detected pitch in time series (a pitch waveform) to the flat part detection unit 111 and the technique determination unit 113.
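The zero-cross approach described above can be sketched as follows. This is a minimal sketch under stated assumptions: the frame is assumed to be already filtered, the period is taken as the mean interval between successive negative-to-positive crossings, and the names `zero_cross_pitch` and `sample_rate` are hypothetical.

```python
def zero_cross_pitch(frame, sample_rate):
    """Estimate the frequency (Hz) of one frame from rising zero crossings.

    Returns None when fewer than two rising crossings are found,
    since no interval can then be measured.
    """
    # Indices where the signal changes from negative to non-negative.
    crossings = [
        i for i in range(1, len(frame))
        if frame[i - 1] < 0 <= frame[i]
    ]
    if len(crossings) < 2:
        return None
    # Mean spacing between rising crossings equals one period in samples.
    period = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return sample_rate / period


if __name__ == "__main__":
    # Square-like wave with a period of 80 samples, sampled at 8000 Hz.
    wave = ([-1] * 40 + [1] * 40) * 5
    print(zero_cross_pitch(wave, 8000))
```

Averaging over several crossings, rather than using a single interval, is a design choice of this sketch intended to reduce the effect of jitter within a frame.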

The flat part detection unit 111 detects a flat part of the feature amount based on the time-series variation of the feature amount detected by the feature amount detection unit 105. The flat part detection unit 111 outputs data indicating the detected flat part of the feature amount to the technique determination unit 113.

Specifically, the flat part detection unit 111 may detect, as a flat part of the volume, a flat part in the data indicating the volume detected by the volume detection unit 107, that is, a period during which the volume is substantially constant. For example, for the data indicating the volume detected by the volume detection unit 107, the flat part detection unit 111 determines, for each frame (a data sample delimited at predetermined time intervals), whether the variation in volume is equal to or less than a predetermined threshold ΔVth. When frames whose volume variation is equal to or less than ΔVth are detected consecutively for a predetermined number of frames or more (for example, two or more frames), that is, when the period in which the volume variation is equal to or less than ΔVth lasts for a predetermined time or longer, the flat part detection unit 111 may detect those frames as a flat part of the volume.

FIG. 3 is a diagram for explaining an example of the concept of volume flat part detection in the flat part detection unit 111. FIG. 3 shows a volume waveform representing the volume of the singing sound in time series; the vertical axis indicates volume (V) and the horizontal axis indicates time (T). FIG. 3 shows frames f_{n-1} to f_{n+6}. The length of a frame f is arbitrary. The flat part detection unit 111 determines whether the variation in volume in each of the frames f_{n-1} to f_{n+6} is equal to or less than the predetermined threshold ΔVth. For example, when the volume variation in frames f_{n-1}, f_n, f_{n+1}, f_{n+4}, f_{n+5}, and f_{n+6} exceeds ΔVth (ΔV_{n-1} > ΔVth, ΔV_n > ΔVth, ΔV_{n+1} > ΔVth, ΔV_{n+4} > ΔVth, ΔV_{n+5} > ΔVth, ΔV_{n+6} > ΔVth) and the volume variation in frames f_{n+2} and f_{n+3} is equal to or less than ΔVth (ΔV_{n+2} ≦ ΔVth, ΔV_{n+3} ≦ ΔVth), the flat part detection unit 111 may detect frames f_{n+2} and f_{n+3} as a flat part of the volume.
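The threshold test and the consecutive-frame condition can be sketched as follows, purely as an illustration. The per-frame variation values, the threshold, and the names `find_flat_runs` and `flat_volume_frames` are hypothetical; how the variation within a frame is measured is left open by the description and is assumed to be computed beforehand.

```python
def find_flat_runs(is_flat, min_frames=2):
    """Return (first, last) index pairs of runs of True lasting min_frames or more."""
    runs = []
    start = None
    for i, flag in enumerate(is_flat):
        if flag and start is None:
            start = i  # a candidate flat run begins
        elif not flag and start is not None:
            if i - start >= min_frames:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the final frame.
    if start is not None and len(is_flat) - start >= min_frames:
        runs.append((start, len(is_flat) - 1))
    return runs


def flat_volume_frames(frame_variations, dv_th, min_frames=2):
    """Frames whose volume variation is <= dv_th, kept only in long enough runs."""
    return find_flat_runs([dv <= dv_th for dv in frame_variations], min_frames)


if __name__ == "__main__":
    # Index 0 stands for f_{n-1}, index 7 for f_{n+6} (values are made up).
    variations = [5, 4, 6, 1, 1, 7, 8, 9]
    print(flat_volume_frames(variations, dv_th=2))
```

With these made-up values, only indices 3 and 4 (corresponding to f_{n+2} and f_{n+3}) satisfy both conditions, mirroring the example of FIG. 3.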

The flat part detection unit 111 may also calculate the average volume over each group of a predetermined number of frames and detect a flat part of the volume based on the variation of the calculated averages. For example, the flat part detection unit 111 calculates the average volume over frames f_{n-1}, f_n, and f_{n+1}, then over frames f_n, f_{n+1}, and f_{n+2}, and then over frames f_{n+1}, f_{n+2}, and f_{n+3}. In this way, the flat part detection unit 111 calculates the average volume while shifting the averaging period by a predetermined time (here, the time corresponding to one frame). When the difference between the average volumes of consecutive groups of frames (for example, frames f_{n-1} to f_{n+1} and frames f_n to f_{n+2}) is equal to or less than a predetermined threshold, the flat part detection unit 111 may detect the period corresponding to those consecutive groups of frames (for example, frames f_{n-1} to f_{n+2}) as a flat part of the volume.
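A sliding-average variant along these lines might look like the sketch below. The window of three frames follows the example in the text; the threshold value, the merging of overlapping spans, and the name `flat_by_moving_average` are assumptions of this sketch.

```python
def flat_by_moving_average(volumes, window=3, avg_th=1.0):
    """Detect flat spans by comparing averages of windows shifted by one frame.

    Returns (first, last) frame index pairs. Overlapping spans are merged,
    which is an assumption of this sketch rather than a stated requirement.
    """
    averages = [
        sum(volumes[i:i + window]) / window
        for i in range(len(volumes) - window + 1)
    ]
    spans = []
    for i in range(len(averages) - 1):
        if abs(averages[i + 1] - averages[i]) <= avg_th:
            # Windows i and i+1 together cover frames i .. i+window.
            start, end = i, i + window
            if spans and start <= spans[-1][1]:
                spans[-1] = (spans[-1][0], end)
            else:
                spans.append((start, end))
    return spans


if __name__ == "__main__":
    print(flat_by_moving_average([10, 10, 10, 10, 50, 90]))
```

Averaging before comparing makes the test less sensitive to a single noisy frame than the per-frame threshold test.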

As another example, for the data indicating the volume detected by the volume detection unit 107, the flat part detection unit 111 may calculate, for each frame, the absolute value of the slope of the volume data (volume waveform) and determine whether the calculated absolute value is equal to or less than a predetermined value. When frames in which the absolute value of the slope is equal to or less than the predetermined value are detected consecutively for a predetermined number of frames or more (for example, two or more frames), the flat part detection unit 111 may detect those frames as a flat part of the volume.
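The slope-based test could be sketched as follows. The slope of a frame is approximated here by the end-to-end difference divided by the frame duration, which is an assumption; the description does not specify how the slope is obtained. The names `flat_by_slope`, `frame_duration`, and `slope_th` are likewise hypothetical.

```python
from itertools import groupby


def flat_by_slope(frames, frame_duration, slope_th, min_frames=2):
    """Return (first, last) index pairs of consecutive low-slope frames.

    Each element of `frames` is a sequence of feature values (e.g. volume)
    sampled within that frame.
    """
    flags = [
        abs(frame[-1] - frame[0]) / frame_duration <= slope_th
        for frame in frames
    ]
    runs, index = [], 0
    # groupby collapses consecutive equal flags into runs.
    for flag, group in groupby(flags):
        length = len(list(group))
        if flag and length >= min_frames:
            runs.append((index, index + length - 1))
        index += length
    return runs


if __name__ == "__main__":
    frames = [[0, 10], [10, 10.5], [10.5, 10.4], [10.4, 20]]
    print(flat_by_slope(frames, frame_duration=1.0, slope_th=1.0))
```

The same function applies unchanged to pitch data, since it only inspects the values within each frame.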

Furthermore, the flat part detection unit 111 may detect, as a flat part of the pitch, a flat part in the data indicating the pitch detected by the pitch detection unit 109, that is, a period during which the pitch is substantially constant. For example, for the data indicating the pitch detected by the pitch detection unit 109, the flat part detection unit 111 calculates the average pitch for each frame (a data sample delimited at predetermined time intervals) and rounds the calculated average at the tens digit (rounding half up) so that it snaps to a grid with 100-cent spacing. When frames whose rounded pitch values lie on the same grid line are detected consecutively for a predetermined number of frames or more (for example, two or more frames), that is, when the period in which the rounded pitch values lie on the same grid line lasts for a predetermined time or longer, the flat part detection unit 111 may detect those frames as a flat part of the pitch.

FIG. 4 is a diagram for explaining an example of the concept of pitch flat part detection in the flat part detection unit 111. FIG. 4 shows a pitch waveform representing the pitch of the singing sound in time series; the vertical axis indicates pitch (cents) and the horizontal axis indicates time (T). FIG. 4 shows frames f_{n-1} to f_{n+6}. The length of a frame f is arbitrary. The flat part detection unit 111 calculates the average pitch in each of the frames f_{n-1} to f_{n+6}. In FIG. 4, the average pitch in each of the frames f_{n-1} to f_{n+6} is indicated by a black circle (●). The flat part detection unit 111 rounds each calculated average at the tens digit so that it snaps to the 100-cent grid, and takes the pitch value of the matching grid line as the pitch of the corresponding frame f_{n-1} to f_{n+6}.

For example, in FIG. 4, the grid pitches obtained by rounding the average pitch of frames f_{n-1}, f_n, f_{n+1}, f_{n+2}, f_{n+3}, f_{n+4}, f_{n+5}, and f_{n+6} at the tens digit are 100*(m+3), 100*(m+4), 100*(m+4), 100*(m+4), 100*(m+3), 100*(m+2), 100*(m+1), and 100*(m+2) cents, respectively. When frames with the same pitch are detected consecutively for a predetermined number of frames or more (for example, two or more frames), the flat part detection unit 111 may detect those frames as a flat part of the pitch. In FIG. 4, the grid pitch corresponding to the rounded average pitch of frames f_n, f_{n+1}, and f_{n+2} is 100*(m+4) cents in each case. The flat part detection unit 111 therefore detects frames f_n to f_{n+2} as a flat part of the pitch.
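The grid snapping and run detection in the FIG. 4 example can be sketched as follows. The concrete cent values and the choice m = 60 are made up for illustration, and the names `snap_to_grid` and `flat_pitch_frames` are hypothetical. Rounding half up is implemented with `math.floor` because Python's built-in `round` rounds halves to the nearest even value.

```python
import math
from itertools import groupby


def snap_to_grid(cents, step=100):
    """Round a pitch in cents to the nearest grid line, rounding half up."""
    return int(math.floor(cents / step + 0.5)) * step


def flat_pitch_frames(frame_avg_cents, min_frames=2):
    """Snap per-frame average pitches to the grid and find runs on one grid line.

    Returns the snapped values and a list of (first, last, grid_pitch) runs.
    """
    snapped = [snap_to_grid(c) for c in frame_avg_cents]
    runs, index = [], 0
    for value, group in groupby(snapped):
        length = len(list(group))
        if length >= min_frames:
            runs.append((index, index + length - 1, value))
        index += length
    return snapped, runs


if __name__ == "__main__":
    # Index 0 stands for f_{n-1}, index 7 for f_{n+6}; values assume m = 60.
    averages = [6290, 6380, 6410, 6440, 6330, 6210, 6120, 6170]
    snapped, runs = flat_pitch_frames(averages)
    print(snapped)
    print(runs)
```

With these values the snapped sequence follows the pattern m+3, m+4, m+4, m+4, m+3, m+2, m+1, m+2, and the only run of two or more frames is indices 1 to 3, corresponding to f_n through f_{n+2}.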

To detect the flat part of the pitch more accurately, the flat part detection unit 111 may add a further condition. When frames whose average pitch, rounded at the tens digit, lies on the same grid line are detected consecutively for a predetermined number of frames or more (for example, two or more frames), the flat part detection unit 111 may detect those frames as a flat part of the pitch only if the temporal variation of the pitch within the detected frames stays within a predetermined width.

As another example, the flat part detection unit 111 may calculate, for each frame of the pitch data detected by the pitch detection unit 109, the absolute value of the slope of the pitch data (pitch waveform), and determine whether the calculated absolute value is at or below a predetermined value. When frames in which the absolute value of the slope is at or below the predetermined value are detected consecutively for at least a predetermined number of frames (for example, two or more frames), the flat part detection unit 111 may detect those frames as a flat part of the pitch. Alternatively, when such frames are detected consecutively for at least the predetermined number of frames, the flat part detection unit 111 may detect them as a flat part of the pitch if the temporal variation of the pitch within the detected frames stays within a predetermined width.
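The slope-based variant can be sketched in the same way. The sketch below assumes that the slope of a frame is approximated by the difference from the previous frame's pitch, so the first frame has no slope; the name `flat_runs_by_slope` and its parameters are hypothetical.

```python
def flat_runs_by_slope(pitch, max_abs_slope, min_frames=2, max_range=None):
    """Return inclusive (start, end) runs of frames whose |slope| stays small.

    Assumption: the slope of frame k is pitch[k] - pitch[k - 1]; frame 0 has
    no slope and is never counted. If max_range is given, a run is kept only
    when max(pitch) - min(pitch) inside the run is within max_range.
    """
    quiet = [False] + [abs(b - a) <= max_abs_slope
                       for a, b in zip(pitch, pitch[1:])]
    runs = []
    start = None
    for i, is_quiet in enumerate(quiet + [False]):  # sentinel closes a trailing run
        if is_quiet and start is None:
            start = i
        elif not is_quiet and start is not None:
            end = i - 1
            segment = pitch[start:end + 1]
            long_enough = end - start + 1 >= min_frames
            narrow = max_range is None or max(segment) - min(segment) <= max_range
            if long_enough and narrow:
                runs.append((start, end))
            start = None
    return runs
```

The optional `max_range` check corresponds to the additional width condition: a slow drift such as 0, 8, 16, 24, 32 cents passes the per-frame slope test with a threshold of 10 but is rejected once the overall range is limited to 20.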

As a further example, the flat part detection unit 111 may detect a predetermined plurality of frames as a flat part of the pitch if the difference between the maximum and minimum pitch values in those frames is within a predetermined value.
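The max-min criterion reduces to a single range check over a window of frames. In this minimal sketch the function name `is_flat_window`, the window parameters, and the sample values are assumptions.

```python
def is_flat_window(pitch, start, length, max_spread):
    """True if the `length` frames starting at `start` exist and the spread
    (maximum minus minimum pitch) inside them is within max_spread."""
    window = pitch[start:start + length]
    if length <= 0 or len(window) < length:
        return False  # window is empty or runs past the end of the data
    return max(window) - min(window) <= max_spread

pitch = [6000, 6010, 5995, 6200]           # hypothetical pitch values in cents
first_window_flat = is_flat_window(pitch, 0, 3, 20)   # spread is 15 cents
second_window_flat = is_flat_window(pitch, 1, 3, 20)  # spread is 205 cents
```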

The technique determination unit 113 determines the technique of the singing voice based on the change in volume and the variation in pitch before and after the flat part of the feature amount of the singing sound (the flat part of the volume or the flat part of the pitch) detected by the flat part detection unit 111. For example, the technique determination unit 113 may determine, as singing techniques, scoop-up (shakuri-age), scoop-down (shakuri-sage), bounce-up (hane-age), fall, and kobushi, as well as dynamic inflections such as crescendo, decrescendo, and fortepiano.

FIG. 5 is a diagram for explaining the concept of scoop-up determination in the technique determination unit 113. A scoop-up is a technique in which the pitch is raised from below, mainly before the pitch stabilizes. FIG. 5 shows the pitch waveform of a singing sound. Assume that the flat part detection unit 111 has detected frames f_{n+1} to f_{n+3} in FIG. 5 as a flat part. As shown in FIG. 5, the pitch rises in the frames before the flat part (frames f_{n-1} and f_n). In this case, the technique determination unit 113 determines that a scoop-up is included before the flat part. Here, the technique determination unit 113 may determine that a scoop-up is included before the flat part when, before the flat part, frames in which the slope of the pitch data (pitch waveform) is at or above a predetermined value continue for at least a predetermined number of frames (for example, two or more frames).

FIG. 6 is a diagram for explaining the concept of scoop-down determination in the technique determination unit 113. A scoop-down is a technique in which the pitch is lowered from above, mainly before the pitch stabilizes. FIG. 6 shows the pitch waveform of a singing sound. Assume that the flat part detection unit 111 has detected frames f_{n+1} to f_{n+2} in FIG. 6 as a flat part. As shown in FIG. 6, the pitch falls in the frames before the flat part (frames f_{n-1} and f_n). In this case, the technique determination unit 113 determines that a scoop-down is included before the flat part. Here, the technique determination unit 113 may determine that a scoop-down is included before the flat part when, before the flat part, frames in which the slope of the pitch data (pitch waveform) is at or below a predetermined value continue for at least a predetermined number of frames (for example, two or more frames).

FIG. 7 is a diagram for explaining the concept of bounce-up determination in the technique determination unit 113. A bounce-up is a technique in which the pitch is raised from below, mainly after the pitch has stabilized. FIG. 7 shows the pitch waveform of a singing sound. Assume that the flat part detection unit 111 has detected frames f_{n-1} to f_{n+1} in FIG. 7 as a flat part. As shown in FIG. 7, the pitch rises in the frames after the flat part (frames f_{n+2}, f_{n+3}, and f_{n+4}). In this case, the technique determination unit 113 determines that a bounce-up is included after the flat part. Here, the technique determination unit 113 may determine that a bounce-up is included after the flat part when, after the flat part, frames in which the slope of the pitch data (pitch waveform) is at or above a predetermined value continue for at least a predetermined number of frames (for example, two or more frames).

FIG. 8 is a diagram for explaining the concept of fall determination in the technique determination unit 113. A fall is a technique in which the pitch is lowered from above, mainly after the pitch has stabilized. FIG. 8 shows the pitch waveform of a singing sound. Assume that the flat part detection unit 111 has detected frames f_{n+1} to f_{n+3} in FIG. 8 as a flat part. As shown in FIG. 8, the pitch falls in the frames after the flat part (frames f_{n+4} and f_{n+5}). In this case, the technique determination unit 113 determines that a fall is included after the flat part. Here, the technique determination unit 113 may determine that a fall is included after the flat part when, after the flat part, frames in which the slope of the pitch data (pitch waveform) is at or below a predetermined value continue for at least a predetermined number of frames (for example, two or more frames).
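The four pitch-based determinations of FIGS. 5 to 8 share one pattern: look at the slope in the frames adjacent to the flat part, either before or after it, and check its sign and persistence. The sketch below illustrates that pattern under several assumptions: the slope of frame k is taken as pitch[k] - pitch[k - 1], a single positive threshold `min_slope` is used for rising slopes and its negative for falling slopes, and the function names and label strings are hypothetical.

```python
def _run_length(slopes, predicate):
    """Count how many leading slopes satisfy the predicate."""
    count = 0
    for s in slopes:
        if not predicate(s):
            break
        count += 1
    return count

def classify_pitch_techniques(pitch, flat, min_slope, min_frames=2):
    """Return technique labels found immediately before and after a flat part.

    flat is an inclusive (start, end) index pair. Assumption: the slope of
    frame k is pitch[k] - pitch[k - 1], so frame 0 has no slope.
    """
    start, end = flat
    # Slopes of the frames before the flat part, walking backwards from it.
    before = [pitch[k] - pitch[k - 1] for k in range(start - 1, 0, -1)]
    # Slopes of the frames after the flat part, walking forwards from it.
    after = [pitch[k] - pitch[k - 1] for k in range(end + 1, len(pitch))]

    def rising(s):
        return s >= min_slope

    def falling(s):
        return s <= -min_slope

    found = []
    if _run_length(before, rising) >= min_frames:
        found.append("scoop-up")
    if _run_length(before, falling) >= min_frames:
        found.append("scoop-down")
    if _run_length(after, rising) >= min_frames:
        found.append("bounce-up")
    if _run_length(after, falling) >= min_frames:
        found.append("fall")
    return found
```

For instance, with the assumed series 5700, 5800, 5900, 6000, 6000, 6000, 5900, 5800 cents and a flat part at indices 3 to 5, the two rising frames before the flat part and the two falling frames after it yield a scoop-up and a fall.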

FIG. 9 is a diagram for explaining the concept of crescendo determination in the technique determination unit 113. FIG. 9 shows the volume waveform of a singing sound. Assume that the flat part detection unit 111 has detected frames f_{n-1} to f_{n+1} in FIG. 9 as a flat part. As shown in FIG. 9, the volume increases in the frames after the flat part (frames f_{n+2}, f_{n+3}, and f_{n+4}). In this case, the technique determination unit 113 determines that a crescendo is included after the flat part. Here, the technique determination unit 113 may determine that a crescendo is included after the flat part when, after the flat part, frames in which the slope of the volume data (volume waveform) is at or above a predetermined value continue for at least a predetermined number of frames (for example, two or more frames).

FIG. 10 is a diagram for explaining the concept of decrescendo determination in the technique determination unit 113. FIG. 10 shows the volume waveform of a singing sound. Assume that the flat part detection unit 111 has detected frames f_n to f_{n+2} in FIG. 10 as a flat part. As shown in FIG. 10, the volume decreases in the frames after the flat part (frames f_{n+3} and f_{n+4}). In this case, the technique determination unit 113 determines that a decrescendo is included after the flat part. Here, the technique determination unit 113 may determine that a decrescendo is included after the flat part when, after the flat part, frames in which the slope of the volume data (volume waveform) is at or below a predetermined value continue for at least a predetermined number of frames (for example, two or more frames).

FIG. 11 is a diagram for explaining the concept of kobushi determination in the technique determination unit 113. Kobushi is a technique in which the pitch is moved up or down from a reference by a predetermined amount (for example, about 100 cents or more) within a predetermined time range and then quickly returned to the reference pitch. FIG. 11 shows the pitch waveform of a singing sound. Assume that the flat part detection unit 111 has detected frames f_{n-1} to f_{n+1} in FIG. 11 as a first flat part and frames f_{n+4} to f_{n+6} as a second flat part. The mean pitch of the first flat part and the mean pitch of the second flat part are assumed to be substantially equal. Here, the mean pitches may be regarded as substantially equal when the mean pitch of each flat part is calculated, rounded, and assigned to a grid of 100-cent steps, and the rounded mean pitch of the first flat part and that of the second flat part lie on the same grid line. As shown in FIG. 11, the pitch oscillates up and down in the frames between the first flat part and the second flat part (frames f_{n+2} and f_{n+3}). In this case, the technique determination unit 113 determines that kobushi is included between the first flat part and the second flat part.
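A minimal sketch of the two-flat-part check for FIG. 11 follows. It assumes that "substantially equal" is tested by snapping both mean pitches to the same 100-cent grid line, and that the movement between the flat parts is judged by the largest deviation from the first flat part's mean pitch; the deviation test, the `max_gap` limit on the number of frames between the flat parts, and all names are assumptions rather than the criteria of the embodiment.

```python
import math

def _grid(mean_cents, step=100):
    """Round half-up to the nearest grid line (in cents)."""
    return int(math.floor(mean_cents / step + 0.5)) * step

def _mean(values):
    return sum(values) / len(values)

def has_kobushi(pitch, flat1, flat2, min_excursion=100, max_gap=4):
    """Check the frames between two flat parts for a kobushi-like excursion.

    flat1 and flat2 are inclusive (start, end) index pairs, with flat1 first.
    """
    (s1, e1), (s2, e2) = flat1, flat2
    between = pitch[e1 + 1:s2]
    if not between or len(between) > max_gap:
        return False  # nothing between the flat parts, or the gap is too long
    base1 = _mean(pitch[s1:e1 + 1])
    base2 = _mean(pitch[s2:e2 + 1])
    if _grid(base1) != _grid(base2):
        return False  # the pitch did not return to the reference
    # Largest deviation from the reference pitch inside the gap.
    return max(abs(p - base1) for p in between) >= min_excursion
```

With the assumed series 6000, 6000, 6000, 6120, 5990, 6000, 6000, 6000 cents and flat parts at indices 0 to 2 and 5 to 7, the 120-cent excursion in the gap satisfies the check.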

FIG. 12 is a diagram for explaining the concept of fortepiano determination in the technique determination unit 113. Here, fortepiano is taken to be a technique in which the volume is made loud and then immediately reduced, mainly before the volume stabilizes. FIG. 12 shows the volume waveform of a singing sound. Assume that the flat part detection unit 111 has detected frames f_{n+5} to f_{n+6} in FIG. 12 as a flat part. As shown in FIG. 12, the volume becomes loud near frame f_{n+1}, before the flat part, and decreases in the frames immediately afterward (frames f_{n+2}, f_{n+3}, and f_{n+4}). In this case, the technique determination unit 113 determines that a fortepiano is included before the flat part. Here, the technique determination unit 113 may determine that a fortepiano is included before the flat part when, before the flat part, frames in which the slope of the volume data (volume waveform) is at or below a predetermined value continue for at least a predetermined number of frames (for example, two or more frames).
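The fortepiano check of FIG. 12 can be sketched as a backward walk from the flat part over the falling frames, followed by a test that the fall was preceded by a rise. The rise test is an assumption added here to distinguish a swell-then-drop from a plain decay, and the name `has_forte_piano` and its parameters are hypothetical.

```python
def has_forte_piano(volume, flat_start, min_drop, min_frames=2):
    """True if a rise followed by a sustained fall precedes the flat part.

    Assumption: the slope of frame k is volume[k] - volume[k - 1], and a frame
    counts as falling when its slope is at or below -min_drop.
    """
    k = flat_start - 1
    falling = 0
    # Walk backwards over the falling frames that lead into the flat part.
    while k >= 1 and volume[k] - volume[k - 1] <= -min_drop:
        falling += 1
        k -= 1
    if falling < min_frames:
        return False
    # Frame k is now the presumed peak; require that the volume rose into it.
    return k >= 1 and volume[k] - volume[k - 1] > 0
```

For the assumed volume series 40, 80, 70, 60, 50, 50, 50 with the flat part starting at index 5, three falling frames follow the peak at index 1, so the check succeeds. A series that only decays, such as 80, 70, 60, 50, 50, 50, is rejected because no rise is observed.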

The technique determination function 100 may include an accompaniment output unit 101 that reads accompaniment data corresponding to the song designated by the singer and causes the sound output unit 25 to output the accompaniment sound via the signal processing unit 21. In this case, the sound input to the sound input unit 23 while the accompaniment sound is being output is recognized as the singing voice to be judged.

As described above, the technique determination apparatus 10 according to the first embodiment detects feature amounts (pitch and volume) in time series from the input singing voice data, detects flat parts of the feature amounts (flat parts of the pitch and flat parts of the volume), and determines a specific technique based on the variation in volume and the variation in pitch before and after a flat part of a feature amount (a flat part of the pitch or a flat part of the volume). The series of processes from pitch and volume detection to technique determination can be executed frame by frame with a small amount of computation, so accumulation of singing voice data, machine learning, and reference data are unnecessary. This makes it possible to determine a specific technique accurately in real time while keeping the amount of computation low.

<Modifications>
Although embodiments of the present invention have been described above, the present invention is not limited to those embodiments and can be implemented in various other forms. Examples of such other forms are given below.

(Modification 1)
In addition to the singing technique determination function 100 described above, the functions realized in the technique determination apparatus 10 may include a singing evaluation function based on the technique determined by the technique determination. The evaluation function 200, which is realized when the control unit 11 of the technique determination apparatus 10 executes the control program 13a stored in the storage unit 13, is described below. Part or all of the configuration that realizes the evaluation function 200 may be implemented in hardware.

FIG. 2 shows, together with the technique determination function 100, the evaluation function 200 that evaluates the singing based on the technique determined by the technique determination function 100. Referring to FIG. 2, the evaluation function 200 includes a technique acquisition unit 201, a pitch acquisition unit 203, a volume acquisition unit 205, a reference data acquisition unit 207, a comparison unit 209, and an evaluation unit 211.

The technique acquisition unit 201 acquires data indicating the technique of the singing sound determined by the technique determination unit 113 in the technique determination function 100 and outputs the data to the comparison unit 209. The pitch acquisition unit 203 acquires, in time series, data indicating the pitch detected by the pitch detection unit 109 in the technique determination function 100 and outputs the data to the comparison unit 209. The volume acquisition unit 205 acquires, in time series, data indicating the volume of the singing sound detected by the volume detection unit 107 in the technique determination function 100 and outputs the data to the comparison unit 209. The reference data acquisition unit 207 reads the evaluation reference data 13d for the corresponding singing sound from the storage unit 13 and outputs it to the comparison unit 209. The evaluation reference data 13d only needs to indicate a sound that serves as a reference for evaluation, so it does not necessarily have to indicate a voice that serves as a model for singing.

The comparison unit 209 compares the acquired data indicating the pitch of the singing sound, the data indicating the volume of the singing sound, and the data indicating the technique of the singing sound with the evaluation reference data 13d for the corresponding singing sound. The comparison unit 209 may compare the acquired pitch data with the reference pitch data included in the evaluation reference data 13d in time series, may compare the acquired volume data with the reference volume data included in the evaluation reference data 13d in time series, and may compare the acquired technique data with the reference singing technique data included in the evaluation reference data 13d. For example, for techniques such as nuki (a tapering release) and vibrato, the comparison unit 209 may compare the acquired singing technique with the reference singing technique included in the evaluation reference data 13d in terms of the standard deviation of the frequency, the mean of the frequency, the mean of the pitch amplitude, the standard deviation of the pitch amplitude, the slope of a linear approximation of the pitch amplitude, and the like. The comparison unit 209 outputs the comparison result to the evaluation unit 211.

Based on the comparison result output from the comparison unit 209, the evaluation unit 211 calculates an evaluation value that serves as an index for evaluating the singing sound. The evaluation unit 211 calculates a higher evaluation value the more closely the singer's pitch data, volume data, and technique data match the evaluation reference data 13d for the corresponding singing sound, and a lower evaluation value the more they differ. For difficult techniques such as nuki and vibrato, the evaluation unit 211 may also apply a weight when the singer's singing sound closely matches the evaluation reference data 13d. The evaluation result of the evaluation unit 211 may be displayed on the display unit 17.
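One way to picture the weighted evaluation is a score in which each matched reference technique contributes its weight. The text does not specify a scoring formula, so everything in the sketch below is an assumption: the per-note label lists, the weight table, the 0 to 100 scale, and the name `technique_score`.

```python
def technique_score(detected, reference, weights=None):
    """Return a 0-100 score for how well detected labels match reference labels.

    detected and reference are per-note technique labels (None for no
    technique). weights maps a reference label to its weight; labels that are
    not listed weigh 1.0. All of this is a hypothetical scoring scheme.
    """
    weights = weights or {}
    total = 0.0
    earned = 0.0
    for got, want in zip(detected, reference):
        w = weights.get(want, 1.0)
        total += w
        if got == want:
            earned += w  # a match earns the full weight of that note
    return 100.0 * earned / total if total else 0.0

# Hypothetical example: vibrato is treated as a difficult technique.
score = technique_score(
    detected=["vibrato", "fall", None],
    reference=["vibrato", "scoop-up", None],
    weights={"vibrato": 2.0},
)
```

Giving vibrato twice the weight means that matching it moves the score more than matching an unweighted note, which mirrors the idea of rewarding difficult techniques.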

(Modification 2)
In the embodiment described above, the flat part detection unit 111 of the technique determination function 100 detects the flat part of a feature amount based on the time-series variation of the feature amounts (pitch and volume) detected by the feature amount detection unit 105. In the first embodiment, the flat part detection unit 111 detects frames as a flat part of the feature amount when frames whose feature amount variation is within a predetermined threshold or a predetermined width continue for at least a predetermined number of frames, that is, for at least a predetermined time. This predetermined time required for a period to be judged a flat part may instead be determined from the singing voice data (input sound).

FIG. 13 is a block diagram showing the configuration of a technique determination function 100a according to a modification of the first embodiment of the present invention. The technique determination function 100a is the same as the technique determination function 100 of the first embodiment except for two points: it includes a tempo estimation unit 110 that detects the tempo of the singing voice data (input sound) based on the time-series volume data (volume waveform) detected by the volume detection unit 107, and its flat part detection unit 111a determines the predetermined time required for judging a flat part of the feature amount based on the tempo detected by the tempo estimation unit 110.

The tempo estimation unit 110 acquires, from the volume detection unit 107, data indicating the volume of the singing voice data in time series (volume waveform). The tempo estimation unit 110 detects the tempo of the singing voice data (input sound) based on the variation in volume (changes in loudness) and passes the detected tempo to the flat part detection unit 111a. The flat part detection unit 111a determines, based on the tempo detected by the tempo estimation unit 110, the length of the predetermined time required for a period to be judged a flat part of the feature amount.
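The text does not state how the tempo maps to the required duration. As one hedged illustration, the minimum flat-part length could be tied to a fraction of a beat; the fraction, the frame length, the lower bound of two frames, and the name `min_flat_frames` are all assumptions.

```python
import math

def min_flat_frames(tempo_bpm, frame_ms, beat_fraction=0.25, floor_frames=2):
    """Number of consecutive frames required to call a period flat.

    Assumption: a flat part must last at least beat_fraction of one beat,
    and never fewer than floor_frames frames.
    """
    beat_ms = 60000.0 / tempo_bpm           # duration of one beat
    required_ms = beat_ms * beat_fraction   # required flat duration
    return max(floor_frames, math.ceil(required_ms / frame_ms))

frames_at_120 = min_flat_frames(120, frame_ms=10)  # 125 ms of a 500 ms beat
frames_at_60 = min_flat_frames(60, frame_ms=10)    # 250 ms of a 1000 ms beat
```

Under these assumptions a slower song demands a longer stable period before it counts as a flat part, while a fast song accepts a shorter one.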

By determining the length of the predetermined time required for a period to be judged a flat part of the feature amount according to the input singing sound in this way, the accuracy of flat part detection can be improved.

When the technique determination function 100a includes the accompaniment output unit 101, the tempo estimation unit 110 may acquire data indicating the volume of the accompaniment sound in time series (volume waveform) from the accompaniment output unit 101 and detect the tempo of the corresponding song based on the volume of the accompaniment sound.

In the technique determination functions 100 and 100a described above, the sound indicated by the singing voice data acquired by the input sound acquisition unit 103 is not limited to the voice of a singer; it may be a voice produced by singing synthesis or an instrument sound. An instrument sound is preferably a monophonic performance. Instrument sounds have no concept of consonants and vowels, but depending on the playing method, the onset of each note shows tendencies similar to singing. The same determination may therefore be possible for instrument sounds as well.

Configurations obtained when a person skilled in the art, starting from the configurations described as embodiments of the present invention, adds or deletes components or changes their design as appropriate, or adds or omits steps or changes their conditions, are also included in the scope of the present invention as long as they retain the gist of the present invention.

In addition, operational effects other than those provided by the embodiments described above are naturally understood to be provided by the present invention if they are apparent from the description in this specification or could be easily predicted by a person skilled in the art.

DESCRIPTION OF SYMBOLS 10 ... technique determination apparatus, 11 ... control unit, 13 ... storage unit, 15 ... operation unit, 17 ... display unit, 19 ... communication unit, 21 ... signal processing unit, 23 ... sound input unit, 25 ... sound output unit, 100, 100a ... technique determination function, 101 ... accompaniment output unit, 103 ... input sound acquisition unit, 105 ... feature amount detection unit, 107 ... volume detection unit, 109 ... pitch detection unit, 110 ... tempo estimation unit, 111, 111a ... flat part detection unit, 113 ... technique determination unit, 200 ... evaluation function, 201 ... technique acquisition unit, 203 ... pitch acquisition unit, 205 ... volume acquisition unit, 207 ... reference data acquisition unit, 209 ... comparison unit, 211 ... evaluation unit

Claims

A technique determination apparatus comprising:
an input sound acquisition unit that acquires an input sound;
a feature amount detection unit that detects a feature amount of the input sound acquired by the input sound acquisition unit in time series;
a flat part detection unit that detects a flat part of the feature amount based on the feature amount detected by the feature amount detection unit; and
a technique determination unit that determines a technique of the input sound based on a variation in the feature amount in a predetermined period before or after the flat part of the feature amount.

The technique determination apparatus according to claim 1, wherein the flat part detection unit
detects a period in which the time-series variation of the feature amount is equal to or less than a predetermined variation, and
detects the period as the flat part when the period is equal to or longer than a predetermined time.

The technique determination apparatus according to claim 1, wherein the feature amount is pitch or volume.

The technique determination apparatus according to claim 2, further comprising a tempo estimation unit that estimates the tempo of the input sound based on the feature amount,
wherein the predetermined time is determined according to the tempo.

The technique determination apparatus according to claim 1, wherein, when a plurality of flat parts are detected by the flat part detection unit,
the predetermined period is a period between two flat parts adjacent to each other in time series.