JP2007004153A

JP2007004153A - Device, method, and program for processing sound signal

Info

Publication number: JP2007004153A
Application number: JP2006146868A
Authority: JP
Inventors: Takuya Fujishima; 琢哉藤島; Bonada Jordi; ボナダジョルディ; Rosukosu Alex; ロスコスアレックス; Oscar Mayor; メイヤーオスカー
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2005-05-26
Filing date: 2006-05-26
Publication date: 2007-01-11
Anticipated expiration: 2026-05-26
Also published as: JP4367437B2

Abstract

PROBLEM TO BE SOLVED: To provide a sound signal processing device capable of determining the expression of a performance with high consistency and with a guarantee that the determination is the best one.
SOLUTION: A plurality of expression modes of a song are each modeled as a state. Based on the characteristic parameters obtained by dividing the sound signal into frames at intervals of 25 ms, the probability that a section consisting of one frame or a plurality of consecutive frames lies in a specific state is calculated over a prescribed observation section, and the optimum route of state transition in that observation section is determined from the calculated probabilities, thereby deciding the expression modes of the sound signal and their sections.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to an audio signal processing device, an audio signal processing method, and an audio signal processing program for detecting the expression mode of a performance or singing input as an audio signal.

Various music evaluation devices have been proposed that compare musical sound data, such as singing information or performance data, with a reference melody to determine the start, sustain, and end of a sound (for example, Patent Document 1).
Patent Document 1: JP-A-3-242700

However, the conventional device described above assumes no model. It judges moment by moment that a sound has a given expression because the sound shows given features at that instant, so it may make determinations that are inconsistent with the surrounding context. There is also the problem that nothing guarantees that the determination result is the best one according to any criterion.

An object of the present invention is to provide an audio signal processing device, an audio signal processing method, and an audio signal processing program that, by adopting a technique commonly used in speech recognition, can make determinations that are more consistent and are guaranteed to be the best.

The invention of claim 1 comprises: an audio signal input unit that inputs an audio signal of a performed or sung musical sound; a feature parameter detection unit that divides the input audio signal into frames of a predetermined duration and detects feature parameters of the audio signal for each frame; a buffer that stores the detected feature parameters; and an expression determination unit that performs an expression determination process in which a plurality of expression modes of performance or singing are each modeled as one state, the probability that one frame or a plurality of consecutive frames form a section of a specific state is calculated over a predetermined observation section based on the feature parameters of each frame, and the optimum path of state transitions in the predetermined observation section is determined based on this probability, thereby determining the expression modes of the audio signal and their sections.

In the invention of claim 2, the expression determination unit, as part of the expression determination process, further performs a determination based on the feature parameters for a section whose expression mode has been established, and thereby determines the detailed content of that expression mode.

In the invention of claim 3, the expression determination unit performs the expression determination process on a partial section of a piece of music partway through the piece, and the buffer has a storage capacity corresponding to that partial section.

In the invention of claim 4, the expression determination unit detects a transition point of the musical sound in the input audio signal and executes the expression determination process at that transition point.

In the invention of claim 5, the expression determination unit determines the optimum path using the Viterbi algorithm.

The invention of claim 6 is an audio signal processing method comprising: an audio signal input procedure of inputting an audio signal of a performed or sung musical sound; a feature parameter detection procedure of dividing the input audio signal into frames of a predetermined duration and detecting feature parameters of the audio signal for each frame; and an expression determination procedure in which a plurality of expression modes of performance or singing are each modeled as one state, the probability that one frame or a plurality of consecutive frames form a section of a specific state is calculated over a predetermined observation section based on the feature parameters of each frame, and the optimum path of state transitions in the predetermined observation section is determined based on this probability, thereby determining the expression modes of the audio signal and their sections.

The invention of claim 7 is an audio signal processing program that causes a signal processing device to execute: an audio signal input procedure of inputting an audio signal of a performed or sung musical sound; a feature parameter detection procedure of dividing the input audio signal into frames of a predetermined duration and detecting feature parameters of the audio signal for each frame; and an expression determination procedure in which a plurality of expression modes of performance or singing are each modeled as one state, the probability that one frame or a plurality of consecutive frames form a section of a specific state is calculated over a predetermined observation section based on the feature parameters of each frame, and the optimum path of state transitions in the predetermined observation section is determined based on this probability, thereby determining the expression modes of the audio signal and their sections.

In the invention of the present application, an expression transition model is first generated from the reference melody information. Features such as pitch, volume, and degree of spectral change are obtained successively from the input musical sound. From these values, the probability that the expression changes and the probability that it stays unchanged are calculated according to predetermined rules. Based on these probabilities, the expression transition path with the highest probability is selected in the expression transition model. The positions of the expression transitions are then fixed, and the characteristic tendency within each expression is labeled. In particular, stream processing is realized by using a data structure that discards the probability calculation results of sections as they are fixed.

Thus, according to the present invention, the most probable state can be determined for each section while taking the preceding and following states into account.

≪Overview of the Embodiment≫
A karaoke apparatus to which an audio signal processing device according to an embodiment of the present invention is applied will be described with reference to the drawings. This karaoke apparatus receives a singing voice signal (musical sound signal) of a sung karaoke song and detects expression modes (expression types) such as vibrato and scoop (shakuri) from this musical sound signal.

In this embodiment, the following procedure is used to detect a plurality of expression modes accurately. Each expression mode is modeled as one state, and an expression transition model that is a hidden Markov model (HMM) is constructed. FIG. 1 shows this expression transition model. As shown in the figure, when the note F given as reference melody data is taken as one observation section, seven expression modes (states) are detected for that observation section (i.e., silence, attack, sustain, vibrato, release, silence, transition). In the figure, however, expression modes are detected continuously from note to note, so twelve expression modes (states) are detected in total. That is, the silence and transition states between notes F and E are shared by notes F and E.

FIG. 2 shows an example of state transitions in the expression transition model, with the states arranged vertically and the frames arranged in time series horizontally. In this figure, the persistence probability of each state (each expression mode) is expressed as a cost probability Pcost(n), and the path that maximizes the product (sum of logarithms) of the cost probabilities of the states is determined.

The cost probability is obtained by applying the expression mode determination rules to the feature parameters contained in the frames from a specific frame to a specific subsequent frame. In principle, the cost probability is calculated in each state for every combination of frames from the start frame to the end frame, and the path with the largest product is selected from this group of cost probabilities.

In this embodiment, however, the optimum path is searched for with the Viterbi algorithm, so the cost probability need not be obtained for every combination of frames from the start frame to the end frame.
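The path search can be illustrated with a minimal Viterbi sketch. It is a simplification under stated assumptions: the description scores whole sections with a cost probability, whereas this sketch scores single frames, and the names `viterbi`, `log_emit`, `allowed`, and `start_state` are hypothetical and do not come from the patent.

```python
import math

def viterbi(log_emit, allowed, start_state):
    """Return the most probable state sequence.

    log_emit[t][s]: log-probability that frame t belongs to state s (assumed input)
    allowed[s]:     states reachable from state s, including s itself
    start_state:    state the path must begin in (e.g. silence)
    """
    n_states = len(log_emit[0])
    neg_inf = -math.inf
    # best[s] = best log-score of any path that ends in state s at the current frame
    best = [neg_inf] * n_states
    best[start_state] = log_emit[0][start_state]
    back = []  # back[t - 1][s] = predecessor of state s at frame t
    for t in range(1, len(log_emit)):
        new_best = [neg_inf] * n_states
        pointers = [0] * n_states
        for prev in range(n_states):
            if best[prev] == neg_inf:
                continue  # state not reachable yet
            for nxt in allowed[prev]:
                score = best[prev] + log_emit[t][nxt]  # product of probabilities as a log sum
                if score > new_best[nxt]:
                    new_best[nxt] = score
                    pointers[nxt] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Because only the best score per state is kept at each frame, the work grows linearly with the number of frames instead of with the number of frame combinations.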

≪Configuration of the Karaoke Apparatus to Which the Present Invention Is Applied≫
FIG. 3 is a block diagram schematically showing the configuration of the karaoke apparatus 1. A microphone 2 for picking up the singer's singing voice and a speaker 3 for emitting the karaoke performance are connected to the karaoke apparatus 1.

The karaoke apparatus 1 includes an automatic performance unit 11 for playing back karaoke songs, an AD (analog/digital) converter 12 that digitizes the singing voice signal input from the microphone 2, and the following functional units for detecting various expression modes from the digitized singing voice signal (singing voice data): an FFT processing unit 13, a feature parameter acquisition unit 14, a feature parameter buffer 15, a rule storage unit 16, a reference buffer 17, and an expression determination unit 18.

The automatic performance unit 11 includes a storage unit that stores karaoke song data, a sequencer and a tone generator that play the karaoke song data, and an operation unit that accepts user operations. The automatic performance unit 11 mixes the singer's singing voice signal, input from the microphone 2 via the AD converter 12, with the automatically performed karaoke sound and supplies the result to the speaker 3.

The AD converter 12 converts the analog singing voice signal input from the microphone 2, which is connected to the connection terminal 12a, into digital data and supplies it to the FFT processing unit 13 and the feature parameter acquisition unit 14. The FFT processing unit 13 divides the singing voice data, which is the input sequence of samples, into 25 msec segments and applies a fast Fourier transform (FFT) to each. In the FFT, the sample sequence is multiplied by a window function to suppress the spurious spectrum caused by the finite time window. The frequency spectrum obtained by the FFT is passed from the FFT processing unit 13 to the feature parameter acquisition unit 14.
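The framing and windowing step can be sketched as follows. The description does not name the window function or say whether frames overlap, so the Hann window and the non-overlapping frames here are assumptions, as is the function name `split_into_windowed_frames`. The FFT itself is left to whatever routine is available.

```python
import math

def split_into_windowed_frames(samples, sample_rate, frame_ms=25):
    """Cut samples into consecutive 25 ms frames and apply a window to each."""
    frame_len = int(sample_rate * frame_ms / 1000)
    # Hann window (assumed choice; the source only says "a window function")
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    # Non-overlapping frames (assumption); a trailing partial frame is dropped
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        frames.append([x * w for x, w in zip(frame, window)])
    return frames
```

Each returned frame tapers to zero at both ends, which is what reduces the spectral leakage that a rectangular cut would introduce.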

The feature parameter acquisition unit 14 is implemented by, for example, a CPU. It receives the singing voice data, a time-domain signal waveform, directly from the AD converter 12, and receives the frequency spectrum, which is frequency-domain information, from the FFT processing unit 13. From the singing voice data and its frequency spectrum, the feature parameter acquisition unit 14 acquires a plurality of feature parameters representing various features of the singing voice. The feature parameters are acquired for each of the 25 ms frames.

FIG. 4 is a block diagram showing the configuration of the feature parameter acquisition unit 14 of FIG. 3 in more detail. The feature parameter acquisition unit 14 includes a time domain information acquisition unit 141, which derives time-domain feature parameters from the singing voice data input from the AD converter 12, and a frequency domain information acquisition unit 142, which derives frequency-domain feature parameters from the frequency spectrum input from the FFT processing unit 13.

The time domain information acquisition unit 141 divides the input singing voice data into frames at 25 msec intervals, synchronized with the FFT processing unit 13, and acquires time-domain feature parameters for each frame. The feature parameters acquired by the time domain information acquisition unit 141 are as follows.

Zero-crossing timing: Zero crossing
Energy: Energy
Energy change: Delta energy
Duration: Duration
Pitch interval: Pitch interval
Pitch slope: Pitch slope
Pitch range: Pitch range
Pitch stability: Pitch stability
and so on. The time domain information acquisition unit 141 also acquires the mean and deviation of these parameters as needed. The label after each colon is the notation used for that feature parameter in FIG. 4.
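Three of the listed time-domain parameters can be sketched per frame as follows. The description does not give formulas, so counting sign changes for "Zero crossing", using the mean squared amplitude for "Energy", and using a frame-to-frame difference for "Delta energy" are assumptions, and all function names are hypothetical.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples (assumed definition)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))

def energy(frame):
    """Mean squared amplitude of the frame (assumed definition)."""
    return sum(x * x for x in frame) / len(frame)

def time_domain_features(frames):
    """Return one feature dictionary per frame."""
    features = []
    previous_energy = None
    for frame in frames:
        e = energy(frame)
        # Delta energy: change relative to the previous frame, 0.0 for the first frame
        delta = 0.0 if previous_energy is None else e - previous_energy
        features.append({
            "zero_crossing": zero_crossings(frame),
            "energy": e,
            "delta_energy": delta,
        })
        previous_energy = e
    return features
```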

The frequency domain information acquisition unit 142 acquires frequency-domain feature parameters from the frequency spectrum of each 25 ms waveform input from the FFT processing unit 13. The feature parameters acquired by the frequency domain information acquisition unit 142 are as follows.

Low-frequency energy: LF energy
High-frequency energy: HF energy
Filter bank (40 elements): Filter bank
Cepstrum (24 elements): Cepstrum
Spectral flatness: Spectral flatness
Filter bank change: Delta filter bank
Cepstrum change: Delta cepstrum
Timbre change: Delta timbre
Pitch: Pitch
Pitch change: Delta pitch
Vibrato depth: Vibrato depth
Vibrato rate: Vibrato rate
Harmonic frequency: Harmonic frequency
Harmonic level: Harmonic amplitude
Harmonic phase: Harmonic phase
Harmonic stability: Harmonic stability
Sinusoidality: Sinusoidality
and so on. The label after each colon is the notation used for that feature parameter in FIG. 4.

The pitch is obtained from the fundamental frequency of the audio signal, and the energy is obtained from the instantaneous value of its volume. For vibrato, the temporal variation of the energy and pitch is approximated by a sine function; the frequency of the approximating sine wave is taken as the vibrato rate, and its maximum amplitude is taken as the vibrato depth.
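One way to approximate a pitch contour by a sine wave is sketched below. The description does not specify the fitting method, so this sketch substitutes a grid search that projects the mean-removed contour onto sine and cosine at each candidate rate. It uses only the pitch contour, not the energy, and the 3 to 9 Hz search range, the 0.1 Hz step, and the name `estimate_vibrato` are assumptions.

```python
import math

def estimate_vibrato(pitch_cents, frame_sec=0.025, min_hz=3.0, max_hz=9.0, step_hz=0.1):
    """Return (vibrato_rate_hz, vibrato_depth) for a per-frame pitch contour."""
    n = len(pitch_cents)
    mean = sum(pitch_cents) / n
    centered = [p - mean for p in pitch_cents]  # remove the note's base pitch
    best_rate, best_depth = 0.0, 0.0
    steps = int(round((max_hz - min_hz) / step_hz))
    for k in range(steps + 1):
        rate = min_hz + k * step_hz
        w = 2 * math.pi * rate * frame_sec  # phase advance per frame
        # Amplitudes of the cosine and sine components at this rate
        a = sum(x * math.cos(w * i) for i, x in enumerate(centered)) * 2 / n
        b = sum(x * math.sin(w * i) for i, x in enumerate(centered)) * 2 / n
        depth = math.hypot(a, b)
        if depth > best_depth:
            best_rate, best_depth = rate, depth
    return best_rate, best_depth
```

For example, a contour that oscillates by 50 cents at 6 Hz over 2 seconds of 25 ms frames yields a rate close to 6 Hz and a depth close to 50.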

The timbre change is a value indicating the amount of change between frames in the cepstrum, that is, the inverse Fourier transform of the logarithm of the amplitude spectrum, and it represents changes in the frequency spectrum well. Using the timbre change as a feature parameter in the determinations described later makes it possible to detect changes in sound caused by state transitions more reliably. In particular, the timbre change makes it possible to detect changes from one vowel to another vowel, which are difficult to detect with the other feature parameters.

The feature parameters acquired by the time domain information acquisition unit 141 and the frequency domain information acquisition unit 142 are input to the feature parameter buffer 15.

The feature parameter buffer 15 stores the input feature parameters together with time information. The time information indicates the position on the time axis of the frame from which the feature parameters were derived. The feature parameter buffer 15 stores only the feature parameters for the most recent few seconds and discards older ones. The storage time need only be about the time required for one run of the feature determination process that the expression determination unit 18, described later, executes repeatedly. The feature parameter buffer 15 therefore does not need to store the feature parameters for the whole song, which effectively reduces the required memory capacity.
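A buffer with this behavior can be sketched with a bounded deque. The class name `FeatureBuffer`, the 2 second default capacity, and the `window` helper are illustrative assumptions; the description only says that a few seconds of time-stamped feature parameters are kept and older ones are discarded.

```python
from collections import deque

class FeatureBuffer:
    """Keeps only the most recent feature frames, each tagged with its time."""

    def __init__(self, max_seconds=2.0, frame_sec=0.025):
        # maxlen makes the deque drop the oldest entry automatically when full
        self.frames = deque(maxlen=int(round(max_seconds / frame_sec)))

    def push(self, time_sec, features):
        """Store the features of one frame with its position on the time axis."""
        self.frames.append((time_sec, features))

    def window(self, start_sec, end_sec):
        """Return the stored frames whose time lies in [start_sec, end_sec)."""
        return [(t, f) for t, f in self.frames if start_sec <= t < end_sec]
```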

The rule storage unit 16 stores the various rules used in the expression determination process performed by the expression determination unit 18. The rules it stores are described in the explanation of the expression determination unit 18.

Reference melody data synchronized with the performance of the karaoke song (song data) is input to the reference buffer 17 from the automatic performance unit 11. Guide melody data for guiding the singing may be used as this reference melody data. The reference buffer 17 stores about as much reference melody data as corresponds to the time needed for one run of the feature determination process that the expression determination unit 18 executes repeatedly, and discards older data.

The expression determination unit 18 constructs, for each note of the reference melody data, an HMM that models the expression modes for that note, and determines how the expression mode changes within each note. Each time the determination of an expression mode is finalized, the expression determination unit 18 supplies expression mode information on that mode to the scoring unit 19 described later. The expression mode information includes information indicating the type of the expression mode and its start and end timings.

The scoring unit 19 receives the expression mode information from the expression determination unit 18 and the reference melody data from the automatic performance unit 11. By positioning the input expression mode information on the reference melody, the scoring unit 19 determines where in the song each expression mode was performed, and scores the singing on that basis. The scoring unit 19 scores the singing on a scale with, for example, 100 points as the maximum, and supplies the result to the display unit 20, which displays it to the singer. This scoring may be performed in real time for each short singing section (for example, about 10 seconds), or for the song as a whole after the karaoke song has ended. It is also possible to score each short section in real time and then give an overall evaluation after the singing has ended.

≪Description of the Expression Determination Process≫
The expression determination process executed by the expression determination unit 18 is described below. For each note of the reference melody data, this process determines the expression modes (expression types) that the note contains as HMM states and, in addition, an expression label (detailed feature) that describes a more detailed characteristic within each expression mode. The expression modes determined by the expression determination unit 18, and the expression labels of each mode (shown in parentheses), are as follows.

Attack (normal, scoop up, scoop fry)
Sustain (normal, fall down)
Vibrato (normal)
Release (normal, fall down)
Transition (normal, portamento, scoop up/down, staccato)

The expression determination unit 18 generates an expression transition model for determining the transitions of the above expression modes for each note of the reference melody data. FIG. 1, referred to above, shows an example of this expression transition model.

Conditions under which the expression mode changes within one note are registered in the expression determination unit 18 in advance, and the expression transition model shown in FIG. 1 is generated based on these conditions. The conditions are as follows.

For example: "A sound that follows silence always begins with an attack." "There is no transition to an attack from any state other than silence." "The last sound of a phrase, that is, the sound immediately before a silence, always ends with a release." "There is no transition from a release to any state other than silence." In addition, there are two ways in which the expression mode can change between two consecutive notes: one is release → silence → attack, and the other is transition. Within a single note, the sustain and vibrato expressions can occur.
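These conditions can be written down as a table of allowed state changes. The sketch below is an assumption-laden reading of the text, since FIG. 1 is not reproduced here: the arcs between sustain and vibrato in both directions and the arcs leaving the transition state are guesses, and the names `ALLOWED` and `is_valid_path` are hypothetical.

```python
# Allowed next states for each state; every state may also persist (self-loop).
ALLOWED = {
    "silence": {"silence", "attack"},          # a sound after silence starts with an attack
    "attack": {"attack", "sustain", "vibrato", "release", "transition"},
    "sustain": {"sustain", "vibrato", "release", "transition"},
    "vibrato": {"vibrato", "sustain", "release", "transition"},
    "release": {"release", "silence"},         # a release leads only to silence
    # Assumption: after a transition the next note continues in sustain or vibrato
    "transition": {"transition", "sustain", "vibrato"},
}

def is_valid_path(states):
    """Check that every consecutive pair of states is an allowed change."""
    return all(b in ALLOWED[a] for a, b in zip(states, states[1:]))
```

A table like this is what restricts the path search to sequences such as silence, attack, sustain, release, silence, and rules out, for instance, an attack that follows a sustain directly.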

Although FIG. 1 shows an expression transition model for two notes, in each determination cycle the expression determination unit 18 generates the expression transition models for the one or more notes belonging to the current observation section, linked by the silence and transition states.

The expression determination unit 18 may repeat the expression determination process at any suitable timing. If the process is executed at the note transition timings in the reference melody data, however, the changes of expression mode can be determined across at least two notes. In this case, the observation section for one run of the expression determination process may be about 2 seconds, that is, 1 second before and 1 second after the transition timing.
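The observation section around a note transition can be converted to frame indices as in the following sketch. The function name `observation_frames` and the clamping at the start of the song are assumptions; the 1 second half-width and the 25 ms frame length come from the description.

```python
def observation_frames(transition_sec, half_width_sec=1.0, frame_sec=0.025):
    """Return the range of frame indices within 1 s before and after a note transition."""
    first = int(round((transition_sec - half_width_sec) / frame_sec))
    last = int(round((transition_sec + half_width_sec) / frame_sec))
    # Clamp at frame 0 so a transition near the start of the song stays valid
    return range(max(0, first), last)
```

With these defaults, a transition at 5.0 seconds covers frames 160 to 239, which is 80 frames or 2 seconds.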

In the figure, the state transitions start from the silence at the left end, and the onset of a sound is always an attack. Transitions then proceed in the directions indicated by the arrows. Accordingly, a musical sound that has started with an attack can take any of the following kinds of transition:
It passes through sustain, vibrato, or both, and ends with a release (ordinary singing).
It moves directly from the attack to a release and ends (a short, popping style of singing).
It connects from attack, sustain, or vibrato to the next musical sound via a transition (legato, portamento, and so on).

The determination executed by the expression determination unit 18 works as follows. For one frame or a plurality of consecutive frames, it obtains the probability of each expression mode for the section containing those frames, and the probability that the section is appropriate as the span from the start timing to the end timing of that expression mode (that is, the probability that the length of the section is appropriate). It then determines the sequence of expression modes, and the timings of the changes between them, that has the highest probability over the observation section as a whole.

Accordingly, the determination rules do not evaluate the feature parameters frame by frame. Instead, they describe what features (what tendencies) a sequence of consecutive frames must show for the probability of that sequence being determined as a given expression mode to be high. A probability value is expressed as a real number from 0.0 to 1.0.

Because the measurement errors involved in these determinations often follow a Gaussian distribution, it is reasonable to use the Gaussian distribution
gaussian(mean,var) = exp(-0.5 * ((mean - x)/var) ^ 2)
in setting the determination rules. Alternatively, mainly to save computation, the determination rules may be expressed by, for example, polylines (piecewise-linear approximations) or in terms of fuzzy logic. Gaussian distributions, polylines, and fuzzy logic may also be combined piecewise.
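The Gaussian rule function can be transcribed directly. The sketch below follows the formula as written in the description, in which `var` is the divisor inside the square and therefore acts as a width parameter; making `x` an explicit argument is the only change.

```python
import math

def gaussian(x, mean, var):
    """Rule score in (0, 1]: 1.0 when x equals mean, falling off as x moves away."""
    return math.exp(-0.5 * ((mean - x) / var) ** 2)
```

Note that the function is not normalized to unit area; its peak value is 1.0, which fits the stated range of 0.0 to 1.0 for probability values.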

Furthermore, the probabilities obtained from a plurality of rules may be combined into a final probability, for example by taking the product of the probabilities obtained from the individual rules.

The determination rules described below are heuristic rules set by a person on the basis of common sense and experience, but rules obtained by machine learning may be used instead.

● Determination rules for silence
・The probability is lowered according to the proportion of voiced frames among all frames.

This means that a probability distribution, such as a Gaussian distribution with a mean of 0, is defined over the proportion of voiced frames, and the probability is obtained from the proportion measured in the input audio.

If three or more of the first 10 frames (or of the first half of the frames, if there are fewer than 10) are voiced, the probability is lowered. This means that a probability distribution, such as a Gaussian distribution with a mean of 3, is defined over the number of voiced frames among the first 10 frames, and the probability is obtained from the measured number of voiced frames.

If three or more of the last 10 frames (or of the second half of the frames, if there are fewer than 10) are voiced, the probability is lowered.

This means that a probability distribution, such as a Gaussian distribution with a mean of 3, is defined over the number of voiced frames among the last 10 frames, and the probability is obtained from the measured number of voiced frames.

The product of the above three probabilities is calculated and used as the cost probability of the silence mode.
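The three silence rules and their product can be sketched as follows. This is only an illustration: the widths `ratio_width` and `edge_width` are invented values, and treating the "3 or more" rules as one-sided (1.0 below the limit, Gaussian fall-off from 3 upward) is an assumption, since the source only names the mean of each distribution.

```python
import math


def gaussian(x, mean, var):
    # Same form as the formula in the text; `var` acts as a width.
    return math.exp(-0.5 * ((mean - x) / var) ** 2)


def edge_rule(voiced_count, limit=3, width=2.0):
    # Assumption: no penalty below `limit`, Gaussian fall-off at or above it.
    if voiced_count < limit:
        return 1.0
    return gaussian(voiced_count, limit, width)


def silence_cost(voiced, ratio_width=0.3, edge_width=2.0):
    """Cost probability of 'silence' for a section.

    voiced: list of booleans, one per frame (True means a voiced frame).
    """
    n = len(voiced)
    if n == 0:
        raise ValueError("section must contain at least one frame")
    # Use 10 frames at each edge, or half the section if it is shorter.
    edge = 10 if n >= 10 else n // 2
    p_ratio = gaussian(sum(voiced) / n, 0.0, ratio_width)
    p_head = edge_rule(sum(voiced[:edge]), width=edge_width)
    p_tail = edge_rule(sum(voiced[n - edge:]), width=edge_width)
    return p_ratio * p_head * p_tail
```

A fully unvoiced section scores 1.0, and the score drops quickly as voiced frames appear, especially near either end of the section.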

● Determination rules for attack
・Duration
If the voiced section under measurement is shorter than a preset threshold, the probability is low.

For example, with the threshold set to 6 frames, the probability is 1.0 when the voiced section is longer than the threshold and gaussian(6, 1.8) when it is shorter.
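The duration rule above amounts to a one-sided Gaussian. A minimal sketch, using the threshold of 6 frames and width of 1.8 given in the text (the function name `duration_probability` is hypothetical):

```python
import math


def gaussian(x, mean, var):
    # Same form as the formula in the text; `var` acts as a width.
    return math.exp(-0.5 * ((mean - x) / var) ** 2)


def duration_probability(length, threshold=6, width=1.8):
    """1.0 for sections at least `threshold` frames long, Gaussian below it."""
    if length >= threshold:
        return 1.0
    return gaussian(length, threshold, width)
```

The same function covers the release rule given later in the text by calling it with `threshold=4` and `width=1.2`.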

・Pitch
A pitch is present in the last frame of the determination section.

In this case the probability takes the value 1 or 0. That is, if the condition is satisfied, the probability is 1.

・Energy
The energy is low at the head of the determination section.

The energy increment is large at the head of the determination section.

The energy increment is small at the tail of the determination section.

These mean that a probability distribution, such as a Gaussian distribution, is defined over the energy value and the probability is obtained from the measured value. The better each condition is satisfied, the higher the corresponding probability.

Here, the head means the first few frames of the section and the tail means the last few frames. The same applies below.

・Pitch change degree
The pitch variation is small at the tail of the determination section.

This means that a probability distribution, such as a Gaussian distribution, is defined over the pitch change degree and the probability is obtained from the measured value. The better the condition is satisfied, the higher the probability.

・Timbre change degree
The timbre change degree is large at the head of the section.

This means that a probability distribution, such as a Gaussian distribution, is defined over the timbre change degree and the probability is obtained from the measured value. The better the condition is satisfied, the higher the probability.

・Vibrato
No vibrato is present in the determination section.

The product of the probabilities calculated for the duration, pitch, energy, pitch change degree, timbre change degree, and vibrato is calculated and used as the cost probability of the attack mode.

◆ Expression labeling rules for attack (normal/scoop-fry/scoop-up)
If the pitch variation is small at the head of the determination section, the attack is labeled normal.

If the pitch variation is large at the head of the determination section, the attack is labeled as follows:
If the number of voiced frames at the head of the determination section is small, the attack is labeled SCOOP-FRY (a scoop starting from very weak phonation).

If the number of voiced frames at the head of the determination section is large, the attack is labeled SCOOP-UP (a scoop in normal phonation).

● Determination rules for release
・Duration
If the voiced section under determination is shorter than a preset threshold, the probability is low.

For example, with the threshold set to 4 frames, the probability is 1.0 when the section is longer than the threshold and gaussian(4, 1.2) when it is shorter.

In general, the longer the voiced section, the lower the probability that it is judged to be a release.

This can be expressed, for example, with gaussian(0, c), where c is the number of frames corresponding to 2 seconds (for example, 80 when 40 frames are processed per second).
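As a worked example of the long-duration penalty, the sketch below derives c from the frame rate and evaluates gaussian(0, c). The function name and the `frames_per_second` parameter are hypothetical; the value of 40 frames per second is the example given in the text.

```python
import math


def gaussian(x, mean, var):
    # Same form as the formula in the text; `var` acts as a width.
    return math.exp(-0.5 * ((mean - x) / var) ** 2)


def release_length_probability(length, frames_per_second=40):
    """Penalty for long voiced sections when scoring 'release'.

    c is the number of frames in 2 seconds, as in the text.
    """
    c = 2 * frames_per_second
    return gaussian(length, 0, c)
```

With 40 frames per second, a section of 80 frames (2 seconds) scores exp(-0.5), roughly 0.61, and a section of 160 frames scores exp(-2), roughly 0.14.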

・Pitch
A pitch is present just before the determination section, that is, in the two preceding frames. In this case the probability takes the value 1 or 0. If the two preceding frames have no pitch, the probability is 0.

・Energy
The energy is low at the tail of the determination section.

The energy decreases sharply at the tail of the determination section.

The increase in energy is small at the head of the determination section.

The energy rarely increases within the determination section.

These mean that a probability distribution, such as a Gaussian distribution, is defined over the energy value and the probability is obtained from the measured value. The better each condition is satisfied, the higher the probability.

・Pitch change degree
The pitch change degree is small at the head of the determination section.

・Timbre change degree
The timbre change degree is large at the tail of the determination section.

・Harmonic stability
The harmonic stability is low at the head of the determination section.

This means that a probability distribution, such as a Gaussian distribution, is defined over the harmonic stability and the probability is obtained from the measured value. The better the condition is satisfied, the higher the probability.

・Vibrato
There is little vibrato in the determination section (the vibrato rate is slow and the vibrato depth is shallow).

This means that a probability distribution, such as a Gaussian distribution, is defined over the vibrato values (vibrato rate and vibrato depth) and the probability is obtained from the measured values. The better the condition is satisfied, the higher the probability.

◆ Expression labeling rules for release (fall-down)
・To determine whether the release should be labeled fall-down, a separate probability of labeling it fall-down is calculated.

If the determination section is shorter than a set minimum length, the release is labeled normal rather than fall-down.

If the determination section is longer than the set minimum length, the process proceeds to the next check.

If the difference between the highest pitch in the first half of the determination section and the lowest pitch in the second half is small, the release is unlikely to be a fall-down, so the probability of labeling it fall-down is lowered.

If a straight line is fitted to the pitch from the first frame of the section to the last voiced frame and its slope is negative, the release is likely to be a fall-down, so the probability of labeling it fall-down is raised.

If the resulting probability of labeling fall-down is higher than a set value, the section is labeled fall-down and the process ends. If it is lower than the set value, the section is labeled normal and the process ends.
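The fall-down procedure above can be sketched as follows. The source gives no numeric values, so the starting probability of 0.5, the multipliers 0.5 and 1.5, and the parameters `min_length`, `min_drop` and `decision` are all invented for illustration. Pitch is assumed to be given per frame in semitones, with `None` for unvoiced frames.

```python
def fit_slope(points):
    """Least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    den = sum((x - mean_x) ** 2 for x, _ in points)
    if den == 0:
        return 0.0
    return sum((x - mean_x) * (y - mean_y) for x, y in points) / den


def label_release(pitch, min_length=4, min_drop=1.0, decision=0.5):
    """Return 'fall-down' or 'normal' for a release section.

    pitch: per-frame pitch in semitones, None for unvoiced frames.
    All thresholds and multipliers are hypothetical.
    """
    if len(pitch) < min_length:
        return "normal"  # too short to be a fall-down
    half = len(pitch) // 2
    first = [p for p in pitch[:half] if p is not None]
    second = [p for p in pitch[half:] if p is not None]
    prob = 0.5
    # Small drop from the first-half peak to the second-half minimum: lower.
    if not first or not second or max(first) - min(second) < min_drop:
        prob *= 0.5
    # Negative slope of the fitted line up to the last voiced frame: raise.
    voiced = [(i, p) for i, p in enumerate(pitch) if p is not None]
    if len(voiced) >= 2 and fit_slope(voiced) < 0:
        prob = min(1.0, prob * 1.5)
    return "fall-down" if prob > decision else "normal"
```

Because only voiced frames enter the fit, the line naturally ends at the last voiced frame, as the rule requires.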

● Determination rules for transition
・Duration
The voiced section under determination is longer than a set minimum length.

・Staccato
If the first half of the determination section contains a part that can be regarded as silence, the section is regarded as staccato. Staccato is handled as a type of transition (rather than as release-silence-attack).

・Pitch
The two frames immediately preceding the head of the determination section have a pitch.

The first frame of the determination section has a pitch.

The last frame of the determination section has a pitch.

If the two subsequent (future) frames after the end of the determination section can be referenced, they have a pitch.

In these cases the probability takes the value 1 or 0. That is, if the condition is satisfied, the probability is 1.

・Energy
Both the head and the tail of the determination section have energy greater than a set minimum value.

・Harmonic stability
If the determination section is longer than a set minimum length and the stability is high at the head of the target section, the probability is lowered.

If the determination section is longer than the set minimum length and the stability is high at the tail of the target section, the probability is lowered.

If a state of high stability continues for a long time and there is no staccato in the section, the probability is lowered.

・Interval
For the note before the transition and the note after the transition that are assumed to exist within the determination section, the average pitch of each is obtained and the interval between them is calculated.

This value is compared with the interval between the corresponding two notes in the reference melody information; the closer the two values, the better.

・Pitch change degree
The pitch change degree is low at both ends of the determination section.

The product of the probabilities calculated for the duration, pitch, energy, vibrato, harmonic stability, interval, and pitch change degree is calculated and used as the cost probability of the transition mode.

◆ Transition labeling and final result calculation
・Scoop-up is determined by separately calculating a probability as follows.

The note after the transition is either shorter than a set minimum length, or it is longer and a section with stable pitch exists at its beginning.

The difference between the average pitch over the entire determination section and the pitch at the tail is at least a semitone.

In the part of the note after the transition before it stabilizes, there is no pitch deviation of 60 or more.

The stability is low at the tail of the determination section.

There is little vibrato at the tail of the determination section.

The better each condition is satisfied, the higher the corresponding probability. The product of these probabilities is calculated, and if the result is higher than a set minimum value, the transition is labeled scoop-up.

・Portamento is determined by separately calculating a probability as follows.

The note before the transition is longer than a set minimum length.

The part of the note after the transition is stable.

The part of the note before the transition has a large pitch change degree.

In the part of the note before the transition, the pitch varies over a range of one semitone or more.

If the part of the note before the transition contains an unvoiced frame, the probability is lowered.

If, after the above determinations, the transition is neither scoop-up nor portamento and is therefore normal:
If the pitch change is larger than a set positive value and staccato is present, the transition is labeled staccato-normal-up.

Otherwise, the transition is labeled normal-up.

If the pitch change is smaller than a set negative value and staccato is present, the transition is labeled staccato-normal-down.

Otherwise, the transition is labeled normal-down.
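One reading of the four normal-transition labels is sketched below. The source does not state what happens when the pitch change lies between the negative and positive thresholds, so this sketch returns `None` in that case. The threshold values and the function name are hypothetical, and the pitch change is assumed to be in semitones.

```python
def label_normal_transition(pitch_change, has_staccato,
                            up_threshold=0.5, down_threshold=-0.5):
    """Pick a label for a transition already judged 'normal'.

    pitch_change: pitch after the transition minus pitch before it.
    Returns None when the change lies between the two thresholds,
    a case the source text leaves unspecified.
    """
    if pitch_change > up_threshold:
        return "staccato-normal-up" if has_staccato else "normal-up"
    if pitch_change < down_threshold:
        return "staccato-normal-down" if has_staccato else "normal-down"
    return None
```

For example, a rise of two semitones with a silent gap yields `staccato-normal-up`, and the same rise without a gap yields `normal-up`.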

● Determination rules for sustain
・Duration
If the section under determination is shorter than a preset threshold, the probability is low.

This means that a probability distribution, such as a Gaussian distribution, is defined over the duration and the probability is obtained from the measured value. The better the condition is satisfied, the higher the probability.

・Pitch
The first frame of the determination section has a pitch.

The last frame of the determination section has a pitch.

If the two subsequent (future) frames after the end of the determination section can be referenced, they have a pitch.

・Pitch change degree
There is no large pitch change at the head of the target section.

・Energy
The energy is greater than a set minimum value.

The energy value is stable.

This means that a probability distribution, such as a Gaussian distribution, is defined over the energy value and the probability is obtained from the measured value. The better the condition is satisfied, the higher the probability.

・Timbre change degree
The timbre change degree is within a preset range.

・Vibrato
There is little vibrato in the determination section.

◆ Expression labeling rules for sustain
・The only label is normal, but the probability of the section being normal is calculated by the following rules and reflected in the final result.

The pitch is stable.

The slope of the straight line fitted to the pitch over the entire determination section is close to 0.

● Determination rules for vibrato
・Duration
If the voiced section under determination is shorter than a preset threshold, the probability is low.

・Timbre change degree
The timbre change degree is within a preset range.

・Pitch change degree
For the maximum and minimum pitch change degree found within the determination section:
The maximum is greater than a set lower limit.

The minimum is smaller than a set upper limit.

・Vibrato
There is a large amount of vibrato in the determination section.

This means that a probability distribution, such as a Gaussian distribution, is defined over the vibrato value and the probability is obtained from the measured value. The better the condition is satisfied, the higher the probability.

The product of the probabilities calculated for the duration, pitch, energy, timbre change degree, pitch change degree, and vibrato is calculated and used as the cost probability of the vibrato mode.

The probability distribution is not limited to a Gaussian distribution; a linear approximation of a Gaussian distribution may also be used.

Next, the method of expression determination and optimal route search is described with reference to FIGS. 5 and 6.

FIG. 6 shows, on a lattice diagram, the optimal route of expression transitions found by the Viterbi algorithm. The probability at each node (frame) corresponds to the best path that starts at the beginning of the performance and reaches that node. Two kinds of probability are considered: the transition probability Ptrans and the cost probability Pcost. In this embodiment, to simplify computation, the transition probability of every branch at every node is set to 1. The overall probability is therefore determined by the cost probabilities alone.

Because all transition probabilities are 1, the probability of the path shown in FIG. 2, for example, is given by
P = Pcost(1) Pcost(2) Pcost(3) Pcost(4)

At each node (frame), n cost probabilities are given: taking that frame as the end frame, each is the probability that a given state has persisted over a preceding sequence of 0 to n frames. Each cost probability is the product of the probabilities of all the rules described above.

After the entire observation section has been analyzed, backtracking is performed over the Viterbi matrix to determine the path with the highest probability.
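The search described above can be sketched as a segment-based variant of Viterbi dynamic programming. The source does not spell out the recursion, so this is one plausible reading rather than the patented implementation. It assumes a caller-supplied function `cost(state, start, end)` that returns the cost probability of frames `start` to `end - 1` being in `state`, a hypothetical `allowed` table of which state may follow which, and a `max_len` limit on segment length. Transition probabilities are all 1 and therefore do not appear.

```python
def best_path(n_frames, states, allowed, cost, max_len):
    """Return (probability, [(state, start, end), ...]) of the best path.

    best[t][s] holds (probability, segment_start, previous_state) for the
    best path covering frames [0, t) whose last segment is in state s.
    """
    best = [dict() for _ in range(n_frames + 1)]
    for end in range(1, n_frames + 1):
        for start in range(max(0, end - max_len), end):
            for s in states:
                p_seg = cost(s, start, end)
                if start == 0:
                    candidates = [(p_seg, None)]
                else:
                    candidates = [
                        (p_prev * p_seg, prev)
                        for prev, (p_prev, _, _) in best[start].items()
                        if s in allowed[prev]
                    ]
                for p, prev in candidates:
                    if p > best[end].get(s, (0.0,))[0]:
                        best[end][s] = (p, start, prev)
    if not best[n_frames]:
        return 0.0, []
    # Backtrack from the most probable final state.
    state = max(best[n_frames], key=lambda s: best[n_frames][s][0])
    prob = best[n_frames][state][0]
    path, end = [], n_frames
    while state is not None:
        _, start, prev = best[end][state]
        path.append((state, start, end))
        state, end = prev, start
    return prob, path[::-1]
```

Storing the segment start and previous state at each node is what makes the backtracking step possible: the path is read off from the final frame back to frame 0 and then reversed.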

FIG. 5 shows an example in which the expression transition path silence, attack, transition, vibrato, release, silence is selected as the most probable one. FIG. 6 shows the part of this path up to the transition as a Viterbi matrix.

Once the path of expression modes has been determined in this way, the most likely label is determined for each expression mode. This determination is also calculated on the basis of the heuristic rules described above.

The upper part of FIG. 5 shows the exact start and end times of the final expression types. The attack is labeled scoop-up, the transition normal, the vibrato regular, and the release fall-down.

The above expression determination process may be performed on an observation section of about 2 seconds, or it may be performed over the whole piece after the piece has ended.

≪Measures for real-time operation≫
The expression determination process has been described above. To run it in real time about every 2 seconds, and in particular to determine expressions while performing score matching that detects the actual note transition points of the singing voice (which deviate from the reference), the following additional operation is preferable.

That is, each time the start (or end) of a note is fixed by score matching, the most probable path of expression transitions is determined for the note that has ended.

This procedure is described with reference to FIGS. 7 and 8.

In FIG. 7, the time at which the first note of the reference melody starts is known, but its duration (end point) is not yet fixed, so the section of the first note is undetermined. However, the first expression mode of this note can already be determined to be an attack, because it is the first note of the phrase, that is, a note immediately following silence.

FIG. 8 shows the situation at the point when score matching has fixed the start of the second note of the reference melody. In this state, the most probable path within the range of the first note is determined. As a result, the path (expression modes) attack (scoop-up), vibrato (normal), transition is determined.

However, the duration and label of the transition cannot yet be determined, because its end point is determined within the following note. Therefore, the duration (end point) and label of the transition are determined after the duration of the next note has been fixed by score matching.

In this way, rather than fixing the expression modes of all sections within an observation section of about 2 seconds, the expression modes are determined successively for notes whose actual duration has been fixed by score matching. This allows expression modes to be determined in real time with higher accuracy.

As described in detail above, according to the present embodiment, an expression transition model is first generated from the reference melody information. Feature parameters such as pitch, volume, and spectral change degree are detected successively from the input musical sound. From these values, the probability that the expression changes and the probability that it stays unchanged are calculated according to predetermined rules. On the basis of these probabilities, the most probable expression transition path in the expression transition model is selected. The positions of the expression transitions (the lengths of the sections) are then fixed, and the characteristic tendency of each expression is labeled. This makes it possible to detect the expression modes and expression transitions of singing or performance more accurately from the input musical sound signal. If this is applied to a karaoke apparatus, for example, singing can be scored more accurately.

In addition, stream processing is realized by using a data structure in which the probability calculation results for sections that have been fixed are discarded one after another. This makes it possible to detect the expression modes and expression transitions of singing or performance from the input audio signal more accurately and in real time.

Although this embodiment shows an example in which the audio signal processing apparatus is applied to the karaoke apparatus 1, the application of the present invention is not limited to this. It can be applied to any apparatus that determines the expression of input performance sound. The input sound is not limited to singing and may be the sound of a musical instrument being played.

FIG. 1 is a diagram showing the reference melody data and the expression transition model used in the expression determination process according to the present invention.
FIG. 2 is a diagram showing an example of state transitions in the expression transition model, with the states arranged vertically and the frames arranged horizontally in time series.
FIG. 3 is a block diagram schematically showing the configuration of a karaoke apparatus serving as the audio signal processing apparatus.
FIG. 4 is a block diagram showing the configuration of the feature parameter acquisition unit of the karaoke apparatus in more detail.
FIG. 5 is a diagram explaining, in terms of the expression transition model, the expression determination process executed by the karaoke apparatus.
FIG. 6 is a diagram explaining the expression determination process.
FIG. 7 is a diagram explaining the processing performed when the expression determination process is carried out in real time.
FIG. 8 is a diagram explaining the processing performed when the expression determination process is carried out in real time.

Explanation of symbols

1 - Karaoke apparatus
12a - Connection terminal
13 - FFT processing unit
14 - Feature parameter acquisition unit
15 - Feature parameter buffer
17 - Reference buffer
18 - Expression determination unit

Claims

An audio signal input unit for inputting an audio signal for playing or singing a musical sound;
A feature parameter detection unit that divides an input speech signal into frames for each predetermined time and detects a feature parameter of the speech signal for each frame;
A plurality of expression modes of performance or singing are each modeled as one state, and the probability that one frame or a plurality of consecutive frames is a specific state section is calculated over a predetermined observation section based on the feature parameter of each frame. An expression determination unit that performs an expression determination process for determining an expression mode of the voice signal and the section by determining an optimum path of state transition in a predetermined observation section based on the probability;
An audio signal processing apparatus.

The audio signal processing apparatus according to claim 1, wherein, as part of the expression determination process, the expression determination unit further determines, based on the feature parameters, the detailed content of the expression mode for each section whose expression mode has been determined.

The audio signal processing apparatus according to claim 2, further comprising a buffer for storing the feature parameters detected by the feature parameter detection unit,
wherein the expression determination unit performs the expression determination process on a partial section of a piece of music while the piece is in progress, and the buffer has a storage capacity corresponding to the partial section of the piece.

The audio signal processing apparatus according to claim 3, wherein the expression determination unit detects a transition point of the musical sound in the input audio signal, and executes the expression determination process at the transition point.

The audio signal processing apparatus according to claim 1, wherein the expression determination unit determines the optimum path using a Viterbi algorithm.

An audio signal processing method comprising:
an audio signal input procedure of receiving an audio signal of a musical performance or singing;
a feature parameter detection procedure of dividing the input audio signal into frames of a predetermined duration and detecting a feature parameter of the audio signal for each frame; and
an expression determination procedure in which a plurality of expression modes of performance or singing are each modeled as one state, the probability that a section consisting of one frame or a plurality of consecutive frames is in a specific state is calculated over a predetermined observation section based on the feature parameter of each frame, and an optimum path of state transitions in the predetermined observation section is determined based on the calculated probabilities, thereby determining the expression modes of the audio signal and their sections.

An audio signal processing program that causes a signal processing device to execute:
an audio signal input procedure of receiving an audio signal of a musical performance or singing;
a feature parameter detection procedure of dividing the input audio signal into frames of a predetermined duration and detecting a feature parameter of the audio signal for each frame; and
an expression determination procedure in which a plurality of expression modes of performance or singing are each modeled as one state, the probability that a section consisting of one frame or a plurality of consecutive frames is in a specific state is calculated over a predetermined observation section based on the feature parameter of each frame, and an optimum path of state transitions in the predetermined observation section is determined based on the calculated probabilities, thereby determining the expression modes of the audio signal and their sections.
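
The optimum-path determination recited in the claims (a Viterbi search over per-frame state probabilities) can be illustrated with a minimal sketch. The Python below is an illustration under stated assumptions, not the implementation of the claimed apparatus: the two state names, the transition and initial probabilities, and the per-frame probabilities are hypothetical values chosen for the example, and `viterbi` and `to_sections` are names introduced here. In the claimed apparatus the per-frame probabilities would be derived from the feature parameters of each frame; here they are simply given.

```python
import math


def viterbi(log_emission, log_transition, log_initial):
    """Return the most probable state path and its log-probability.

    log_emission[t][s]   : log-probability that frame t is in state s
    log_transition[p][s] : log-probability of moving from state p to state s
    log_initial[s]       : log-probability of starting in state s
    """
    n_states = len(log_initial)
    # Best score of any path ending in each state at frame 0.
    score = [log_initial[s] + log_emission[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_emission)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_transition[p][s])
            new_score.append(score[best_prev] + log_transition[best_prev][s]
                             + log_emission[t][s])
            pointers.append(best_prev)
        back.append(pointers)
        score = new_score
    # Trace back from the best final state to recover the whole path.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


def to_sections(path, names):
    """Collapse a per-frame path into (name, first_frame, last_frame) sections."""
    sections, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            sections.append((names[path[start]], start, t - 1))
            start = t
    return sections


# Hypothetical example: two expression modes observed over five frames.
names = ["normal", "vibrato"]              # assumed state names
initial = [0.5, 0.5]                       # assumed initial probabilities
transition = [[0.8, 0.2], [0.2, 0.8]]      # assumed: states tend to persist
emission = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.2, 0.8], [0.1, 0.9]]


def to_log(row):
    return [math.log(p) for p in row]


path, log_prob = viterbi([to_log(r) for r in emission],
                         [to_log(r) for r in transition],
                         to_log(initial))
sections = to_sections(path, names)
print(path, sections)
```

Log-probabilities are used so that products over many frames become sums and do not underflow. The `to_sections` helper corresponds to reporting each expression mode together with the section of frames it occupies.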