JP4910854B2

JP4910854B2 - Kobushi detection device, kobushi detection method, and program

Info

Publication number: JP4910854B2
Application number: JP2007108527A
Authority: JP
Inventors: 達也 入山; 拓弥 高橋
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2007-04-17
Filing date: 2007-04-17
Publication date: 2012-04-04
Anticipated expiration: 2027-04-17
Also published as: JP2008268370A

Description

The present invention relates to a device, a method, and a program for detecting kobushi, a vocal ornamentation technique.

Various karaoke apparatuses that evaluate singing have been developed. For example, the karaoke apparatus described in Patent Document 1 extracts parameters such as pitch, note length, and timing from the singer's voice and evaluates the singing based on the extracted parameters.
Patent Document 1: Japanese Patent Laid-Open No. 10-78750

Skilled singers, such as professional vocalists, express emotion in a song through various means of expression: intentionally shifting the start or end of a phrase, changing voice quality or volume, and using techniques such as vibrato and kobushi. Many singers would rather sing with this kind of expression, as a professional does, than follow the score faithfully.

Conventional karaoke apparatuses evaluate singing against a reference such as a model performance or a guide melody. These references are often created with the pitch and rhythm exactly as written in the score, so a singer who uses the expressive techniques described above tends to receive a lower evaluation because the singing differs from the reference. A technique has therefore been desired by which a karaoke apparatus detects these expressive techniques and evaluates the singing based on each detected technique.

The present invention has been made in view of the above circumstances, and its object is to provide a kobushi detection device, a kobushi detection method, and a program capable of detecting, from a singing voice, a section sung using the kobushi technique.

A kobushi detection device according to a preferred aspect of the present invention comprises: receiving means for receiving singing voice data representing a singing voice; pitch detecting means for detecting pitch from the singing voice data received by the receiving means; candidate section selecting means for identifying, from the pitch detected by the pitch detecting means, sections in which the pitch rises and then falls, and selecting from those sections candidate sections in which (1) the absolute value of the rate of pitch change in the rising portion is greater than a predetermined threshold, (2) the absolute value of the rate of pitch change in the falling portion is greater than a predetermined threshold, and (3) the length of the section from the start of the rise to the end of the fall is within a predetermined range; second receiving means for receiving vibrato section data representing sections of the singing voice in which the vibrato technique is used; and kobushi section identifying means for identifying each candidate section selected by the candidate section selecting means as a section in which the kobushi technique is used, provided that the candidate section is not included in a vibrato section represented by the vibrato section data received by the second receiving means.
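The selection conditions (1) to (3) and the vibrato exclusion can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the choice of cents and seconds as units, the separate thresholds for the rise and the fall, and the reading of "included in a vibrato section" as full containment are all assumptions made for the example.

```python
def is_candidate(rise_cents, rise_sec, fall_cents, fall_sec, total_sec,
                 min_rise_rate, min_fall_rate, min_len, max_len):
    """Check conditions (1)-(3) for one rise-then-fall span.

    rise_cents / fall_cents: pitch change over the rising / falling part.
    total_sec: time from the start of the rise to the end of the fall.
    All thresholds are hypothetical tuning parameters.
    """
    if rise_sec <= 0 or fall_sec <= 0:
        return False
    rise_rate = abs(rise_cents / rise_sec)   # condition (1)
    fall_rate = abs(fall_cents / fall_sec)   # condition (2)
    return (rise_rate > min_rise_rate
            and fall_rate > min_fall_rate
            and min_len <= total_sec <= max_len)  # condition (3)


def inside_any(span, sections):
    """True if span (start, end) lies entirely within one of the sections."""
    start, end = span
    return any(s <= start and end <= e for s, e in sections)


def exclude_vibrato(candidates, vibrato_sections):
    """Keep only candidate spans that are not inside a vibrato section."""
    return [c for c in candidates if not inside_any(c, vibrato_sections)]
```

Treating containment as the exclusion test is the strictest reading; an overlap test would reject more candidates.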

In the kobushi detection device according to a preferred aspect of the present invention, the candidate section selecting means in the above configuration may additionally select the candidate sections on the condition that the time difference between the timing at which the note containing the section starts to sound and the timing at which the pitch starts to rise is smaller than a predetermined threshold.

In the kobushi detection device according to a preferred aspect of the present invention, the candidate section selecting means in the above configuration may additionally select the candidate sections on the condition that no pitch fluctuation exceeding a predetermined magnitude occurs in a section of predetermined length immediately before or immediately after the section.

In the kobushi detection device according to a preferred aspect of the present invention, the candidate section selecting means in the above configuration may additionally select the candidate sections on the condition that the width of pitch fluctuation in a section of predetermined length, located immediately after the section or a predetermined time after it, is smaller than a predetermined threshold.

A kobushi detection method according to a preferred aspect of the present invention comprises: a receiving step of receiving singing voice data representing a singing voice; a pitch detecting step of detecting pitch from the singing voice data received in the receiving step; a candidate section selecting step of identifying, from the pitch detected in the pitch detecting step, sections in which the pitch rises and then falls, and selecting from those sections candidate sections in which (1) the absolute value of the rate of pitch change in the rising portion is greater than a predetermined threshold, (2) the absolute value of the rate of pitch change in the falling portion is greater than a predetermined threshold, and (3) the length of the section from the start of the rise to the end of the fall is within a predetermined range; a second receiving step of receiving vibrato section data representing sections of the singing voice in which the vibrato technique is used; and a kobushi section identifying step of identifying each candidate section selected in the candidate section selecting step as a section in which the kobushi technique is used, provided that the candidate section is not included in a vibrato section represented by the vibrato section data received in the second receiving step.

A program according to a preferred aspect of the present invention causes a computer to function as: receiving means for receiving singing voice data representing a singing voice; pitch detecting means for detecting pitch from the singing voice data received by the receiving means; candidate section selecting means for identifying, from the pitch detected by the pitch detecting means, sections in which the pitch rises and then falls, and selecting from those sections candidate sections in which (1) the absolute value of the rate of pitch change in the rising portion is greater than a predetermined threshold, (2) the absolute value of the rate of pitch change in the falling portion is greater than a predetermined threshold, and (3) the length of the section from the start of the rise to the end of the fall is within a predetermined range; second receiving means for receiving vibrato section data representing sections of the singing voice in which the vibrato technique is used; and kobushi section identifying means for identifying each candidate section selected by the candidate section selecting means as a section in which the kobushi technique is used, provided that the candidate section is not included in a vibrato section represented by the vibrato section data received by the second receiving means.

According to the kobushi detection device, kobushi detection method, or program of the present invention, a section sung using the kobushi technique can be detected from a singing voice.

A karaoke apparatus according to an embodiment of the present invention is described below. In the following description, a person who practices singing with the karaoke apparatus is called a "singer".

(A: Configuration)
FIG. 1 is a block diagram showing the hardware configuration of the karaoke apparatus 1. In addition to an ordinary karaoke function for playing back karaoke accompaniment, the karaoke apparatus 1 has a kobushi detection function for detecting, from the singer's voice, sections in which the kobushi technique is used (hereinafter, kobushi sections). Kobushi is a technique of adding a decorative, undulating melodic turn.

In FIG. 1, a CPU (Central Processing Unit) 11 reads a control program stored in a ROM (Read Only Memory) 12, loads it into a RAM (Random Access Memory) 13, and executes it, thereby controlling each part of the karaoke apparatus 1.

The display unit 15 is, for example, a liquid crystal display. Under the control of the CPU 11, it displays various screens such as a menu screen for operating the karaoke apparatus 1 and a karaoke screen in which lyric captions are superimposed on a background image.
The operation unit 16 has various keys such as a numeric keypad, up and down keys, and a performance start key, and outputs an operation signal corresponding to the pressed key to the CPU 11.

The microphone 17 picks up the singing voice and generates an audio signal (analog data).
The audio processing unit 18 A/D-converts the audio signal generated by the microphone 17 into digital data (audio data) and outputs it to the CPU 11. The audio processing unit 18 also D/A-converts audio data received from the CPU 11 into an audio signal and outputs it to the speaker 19.
The speaker 19 emits sound based on the audio signal received from the audio processing unit 18.

The storage unit 14 is a large-capacity storage device such as an HDD (Hard Disk Drive) and has various storage areas.

The music data storage area 14a stores a plurality of pieces of music data. FIG. 2 schematically shows the contents of each piece of music data. Each piece of music data has a header, accompaniment data, lyrics data, and guide melody data.
The header includes song number data identifying the song, title data indicating the song title, genre data indicating the genre, performance time data indicating the playing time of the song, and the like.

The accompaniment data describes the sounds of the various instruments that accompany the song, following the progression of the song. The accompaniment data is in a data format such as MIDI (Musical Instrument Digital Interface).
The lyrics data describes the content (characters) of the lyrics in association with the timing at which they should be displayed and the position at which they should be displayed on the screen of the display unit 15.
The guide melody data contains a guide melody indicating the melody the singer should sing.

The singing voice data storage area 14b stores, for each song, audio data (hereinafter, singing voice data) generated when the audio processing unit 18 A/D-converts the audio signal that represents the singer's singing and is output from the microphone 17. The singing voice data is audio data in a format such as WAVE or MP3 (MPEG-1 Audio Layer-3).
The vibrato section data storage area 14c stores data indicating the sections of each song's singing voice data in which the vibrato technique is used (hereinafter, vibrato sections).

The kobushi section data storage area 14d stores data indicating the kobushi sections, that is, the sections of each song's singing voice data in which the kobushi technique is used.
The parameter storage area 14e stores the pitch extracted from each song's singing voice data and the various parameters extracted from that pitch.
The parts of the karaoke apparatus 1 described above exchange data with one another via the bus 20.

(B: Operation)
Next, the process by which the karaoke apparatus 1 identifies kobushi sections from the singing voice data is described.

(B-1: Karaoke accompaniment processing)
When the singer operates the operation unit 16 to select a song to sing, an operation signal identifying the song, such as its song number data, is output from the operation unit 16 to the CPU 11. The CPU 11 reads the music data indicated by the operation signal from the music data storage area 14a and performs the karaoke accompaniment processing on it.

FIG. 3 is a flowchart showing the flow of the karaoke accompaniment processing.
In step SA100, the CPU 11 causes the speaker 19 to emit the karaoke accompaniment and causes the display unit 15 to display the lyric captions. Specifically, the CPU 11 reads the accompaniment data included in the music data from the music data storage area 14a and outputs it to the audio processing unit 18, which converts the accompaniment data into an analog audio signal and outputs it to the speaker 19. The CPU 11 also reads the lyrics data included in the music data from the music data storage area 14a and causes the display unit 15 to display the lyric captions according to the lyrics data.

The singer sings along with the karaoke accompaniment emitted from the speaker 19 while watching the lyric captions on the display unit 15. The audio signal generated by the microphone 17 is A/D-converted to generate singing voice data (step SA110), which is written to the singing voice data storage area 14b.

In step SA120, the CPU 11 analyzes the pitch of the singing voice data as the song progresses and generates singing pitch data representing the result. Specifically, the CPU 11 reads the singing voice data generated in step SA110 from the singing voice data storage area 14b, detects the pitch in frames of a predetermined length (for example, 3 msec), and generates singing pitch data indicating the detected pitch. In the singing pitch data, the pitch is expressed as a value relative to the pitch of the guide melody included in the music data. The generated singing pitch data is written to the parameter storage area 14e.
In FIG. 4, graph A1 shows the singing pitch data generated in step SA120 for part of the song (from 12 s 000 ms to 18 s 000 ms). The horizontal axis represents time (elapsed time since the start of the song), and the vertical axis shows the singing pitch at each time relative to the guide melody.
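Expressing each frame's pitch relative to the guide melody can be sketched as below. The text only says the value is relative; the use of cents (the unit the document introduces later for relative pitch) and of frequencies in Hz as inputs are assumptions for this example, and `relative_pitch_cents` is a hypothetical name.

```python
import math


def relative_pitch_cents(sung_hz, guide_hz):
    """Return the sung pitch relative to the guide pitch, in cents.

    100 cents correspond to one equal-tempered semitone, so a result of
    +100 means the singer is a semitone above the guide melody.
    """
    if sung_hz <= 0 or guide_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 1200.0 * math.log2(sung_hz / guide_hz)
```

With this representation, a frame sung exactly on the guide melody has the value 0, which is what makes the zero-cross analysis in the later steps meaningful.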

In step SA130, the CPU 11 determines whether the performance of the whole song has finished. If the result of step SA130 is "Yes", the karaoke accompaniment processing ends. If the result is "No", steps SA100 to SA120 are performed for the remaining part of the song.

(B-2: Vibrato section identification processing)
Before identifying kobushi sections from the singing voice data, the CPU 11 performs vibrato section identification processing, which identifies the sections in which vibrato, a technique with characteristics similar to kobushi, is used. Vibrato is a singing technique that gives the sound a rich resonance by slightly raising and lowering the pitch of a sustained note to produce a trembling tone.

FIG. 5 is a flowchart showing the flow of the vibrato section identification processing. In step SB100, the CPU 11 applies a filter that extracts specific frequency components to the singing pitch data written in the parameter storage area 14e. In this embodiment, the CPU 11 processes the singing pitch data with a low-pass filter that passes components below 6 Hz and generates new pitch data (hereinafter, filtered singing pitch data). Graph A2 in FIG. 4 shows the filtered singing pitch data generated by applying this low-pass filter to the singing pitch data of graph A1.
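A low-pass stage of this kind can be sketched as below. The source gives only the 6 Hz cutoff; the first-order IIR design, the function name `low_pass`, and the default 3 ms frame period (taken from the example frame length in step SA120) are assumptions for illustration.

```python
import math


def low_pass(samples, cutoff_hz=6.0, frame_sec=0.003):
    """Smooth a per-frame pitch track with a first-order IIR low-pass filter.

    samples: one pitch value per analysis frame.
    cutoff_hz: -3 dB cutoff frequency of the filter.
    frame_sec: time between consecutive frames.
    """
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = frame_sec / (rc + frame_sec)  # smoothing factor in (0, 1)
    out = []
    prev = samples[0] if samples else 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out
```

A first-order filter attenuates gently and delays the signal slightly; a steeper or zero-phase design may be preferable in practice, but the source does not say which is used.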

As FIG. 4 shows, the unfiltered singing pitch data (A1) has fine irregularities in its waveform. Such irregularities are caused by reverb, for example: when pitch is detected from audio data to which reverb has been applied, the result is not a sine wave but a distorted waveform. This made it difficult to detect vibrato sections in reverberated audio, and it was also difficult to determine from the audio data whether reverb had been applied at all. In the low-pass-filtered singing pitch data, however, the influence of the reverb has been removed, so the vibrato sections can be identified properly in the processing described below.

In step SB110, the CPU 11 identifies sections of the singing voice data that show the characteristics of a vibrato section (hereinafter, vibrato candidate sections) as follows. First, the CPU 11 identifies as zero-cross points the points at which the pitch represented by the filtered singing pitch data generated in step SB100 changes from negative to positive or from positive to negative. In the example of FIG. 4, the times at which graph A2 of the filtered singing pitch data crosses zero (for example, times P1, P2, P3, and P4) are identified as zero-cross points.

Next, the CPU 11 measures the time intervals between successive zero-cross points in the filtered singing pitch data and identifies as a vibrato candidate section any section in which an interval within a predetermined range is detected a predetermined number of times or more in succession. In the example of FIG. 4, this processing identifies section A3, in which the zero-cross points appear at nearly equal intervals, as a vibrato candidate section. Only one vibrato candidate section appears in FIG. 4, but candidate sections are also identified in the parts of the song not shown in FIG. 4.
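The zero-cross detection and the run-length test of step SB110 can be sketched as follows. The function names and parameters are hypothetical, and frames whose value is exactly zero are ignored here for simplicity.

```python
def zero_cross_times(pitch, frame_sec=0.003):
    """Return the times (s) at which the track changes sign between frames."""
    times = []
    for i in range(1, len(pitch)):
        if pitch[i - 1] * pitch[i] < 0:
            times.append(i * frame_sec)
    return times


def vibrato_candidates(crossings, min_gap, max_gap, min_count):
    """Return (start, end) spans in which at least min_count consecutive
    zero-cross intervals fall within [min_gap, max_gap]."""
    spans = []
    run_start = 0
    run_len = 0
    for i in range(1, len(crossings)):
        gap = crossings[i] - crossings[i - 1]
        if min_gap <= gap <= max_gap:
            if run_len == 0:
                run_start = i - 1
            run_len += 1
        else:
            if run_len >= min_count:
                spans.append((crossings[run_start], crossings[i - 1]))
            run_len = 0
    if run_len >= min_count:
        spans.append((crossings[run_start], crossings[-1]))
    return spans
```

For example, `vibrato_candidates(zero_cross_times(track), 0.06, 0.12, 6)` would look for at least six consecutive intervals of 60 to 120 ms; these numbers are placeholders, since the source does not give the actual range or count.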

To analyze more rigorously whether the vibrato technique is actually used in the vibrato candidate sections identified in step SB110, the CPU 11 then extracts the following parameters for each of those sections (step SB120). In the following description, when the value of the filtered singing pitch data fluctuates periodically, as in section A3 of FIG. 4, the number of oscillations per unit time is called the "vibrato frequency".

(1) Average vibrato frequency (Af: average of frequency)
Parameter Af is the average vibrato frequency in each vibrato candidate section, calculated as the mean of the reciprocals of the time intervals between zero crossings of the filtered singing pitch data.
(2) Standard deviation of vibrato frequency (Df: deviation of frequency)
Parameter Df is calculated as the standard deviation of the reciprocals of the time intervals between zero crossings of the filtered singing pitch data. It indicates how much the vibrato frequency varies: the closer the value is to 0, the more uniform the frequency and the better the vibrato.
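Af and Df as defined above can be computed from the zero-cross times of one candidate section, as in this sketch. It follows the definition literally (the reciprocal of each zero-cross interval); whether a factor is applied to convert half-periods into full cycles is not stated in the source. The function name and the use of the population standard deviation are assumptions.

```python
import statistics


def vibrato_frequency_stats(crossings):
    """Return (Af, Df) for one candidate section.

    crossings: zero-cross times in seconds, in ascending order.
    Af is the mean and Df the standard deviation of the reciprocals
    of the intervals between consecutive zero crossings.
    """
    gaps = [b - a for a, b in zip(crossings, crossings[1:])]
    rates = [1.0 / g for g in gaps if g > 0]
    if len(rates) < 2:
        raise ValueError("need at least three distinct zero-cross times")
    return statistics.mean(rates), statistics.pstdev(rates)
```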

The "pitch vibration width" used in the parameters below is now explained. FIG. 6 is a graph showing only the filtered singing pitch data (A2) from FIG. 4. The CPU 11 calculates the pitch vibration width in a vibrato candidate section as follows. First, the CPU 11 differentiates the filtered singing pitch data with respect to time to locate the local maxima and local minima of the graph.

In FIG. 6, for example, Q2, Q4, Q6, Q8, and Q10 are local maxima, and Q1, Q3, Q5, Q7, and Q9 are local minima. The CPU 11 takes the difference between each local minimum and the local maximum that immediately follows it in time as a pitch vibration width, and places that width at the time midway between the minimum and the maximum used to calculate it. For example, the pitch vibration width W1 is obtained from the minimum Q1 and the maximum Q2. FIG. 6 shows the pitch vibration widths W1 to W5 obtained in this way.
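The extraction of pitch vibration widths can be sketched as below. The finite difference between neighbouring frames stands in for the time derivative, flat plateaus are ignored, and the function name is hypothetical.

```python
def pitch_vibration_widths(pitch, frame_sec=0.003):
    """Return [(time, width)] for each local minimum that is immediately
    followed by a local maximum.

    time is the midpoint between the minimum and the maximum, and
    width is the maximum value minus the minimum value.
    """
    extrema = []  # (frame index, "min" or "max")
    for i in range(1, len(pitch) - 1):
        d_prev = pitch[i] - pitch[i - 1]
        d_next = pitch[i + 1] - pitch[i]
        if d_prev < 0 and d_next > 0:
            extrema.append((i, "min"))
        elif d_prev > 0 and d_next < 0:
            extrema.append((i, "max"))
    widths = []
    for (i0, k0), (i1, k1) in zip(extrema, extrema[1:]):
        if k0 == "min" and k1 == "max":
            t_mid = 0.5 * (i0 + i1) * frame_sec
            widths.append((t_mid, pitch[i1] - pitch[i0]))
    return widths
```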

The description now returns to the parameters extracted in step SB120.
(3) Average pitch vibration width (Ap: average of pitch)
Parameter Ap is the average of the pitch vibration widths found in each vibrato candidate section.
(4) Standard deviation of pitch vibration width (Dp: deviation of pitch)
Parameter Dp is the standard deviation of the pitch vibration widths found in each vibrato candidate section. It indicates how much the pitch vibration width varies within the section: the closer the value is to 0, the more uniform the width of the pitch oscillation and the better the vibrato.

(5) Slope of the linear approximation of the pitch vibration width (Sp: slope of pitch)
Parameter Sp is the slope of the linear approximation line fitted to the pitch vibration widths. FIG. 7 shows the pitch vibration widths calculated in FIG. 6 plotted on their own. The CPU 11 determines a linear approximation line for the pitch vibration width points in the vibrato candidate section (A3 in the figure). For example, in section A3 of FIG. 7, the linear approximation is the straight line L1, expressed by (Equation 1).
(Equation 1) P = 15t + 150
Calculating the linear approximation in this way determines the slope Sp; in this example, Sp is 15.
This parameter indicates how stable the pitch vibration width is during the vibrato: the smaller the absolute value of Sp, the more uniform the width was kept throughout the vibrato, and the better the vibrato.
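Ap, Dp, and Sp can be computed together from the (time, width) points of one candidate section, as sketched here. The source says only "linear approximation line"; an ordinary least-squares fit is assumed, as are the function name and the use of the population standard deviation.

```python
import statistics


def width_stats(points):
    """Return (Ap, Dp, Sp) for a list of (time, width) points.

    Ap: mean width, Dp: standard deviation of the widths,
    Sp: slope of the least-squares line fitted to width versus time.
    """
    times = [t for t, _ in points]
    widths = [w for _, w in points]
    ap = statistics.mean(widths)
    dp = statistics.pstdev(widths)
    t_mean = statistics.mean(times)
    sxx = sum((t - t_mean) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("need at least two distinct times")
    sxy = sum((t - t_mean) * (w - ap) for t, w in points)
    return ap, dp, sxy / sxx
```

Points lying exactly on the line P = 15t + 150 from the example give a slope of 15, whatever time unit t is measured in (the source does not state it).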

In step SB130, the CPU 11 determines, for each vibrato candidate section identified in step SB110, whether to finally accept it as a vibrato section, using the following criteria:
(1) Df is smaller than a predetermined threshold
(2) Ap is within a predetermined range
(3) Dp is smaller than a predetermined threshold
(4) The absolute value of Sp is smaller than a predetermined threshold
A vibrato candidate section that satisfies all of conditions (1) to (4) is finally accepted as a vibrato section.
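The four acceptance conditions translate directly into a predicate, sketched below. The threshold container and its field names are hypothetical, and the source gives no concrete threshold values.

```python
from dataclasses import dataclass


@dataclass
class VibratoThresholds:
    """Hypothetical tuning parameters for the four conditions."""
    df_max: float      # upper bound for Df, condition (1)
    ap_min: float      # lower end of the allowed Ap range, condition (2)
    ap_max: float      # upper end of the allowed Ap range, condition (2)
    dp_max: float      # upper bound for Dp, condition (3)
    sp_abs_max: float  # upper bound for |Sp|, condition (4)


def is_vibrato(df, ap, dp, sp, th):
    """True only if all four conditions hold for a candidate section."""
    return (df < th.df_max
            and th.ap_min <= ap <= th.ap_max
            and dp < th.dp_max
            and abs(sp) < th.sp_abs_max)
```

Requiring all four conditions at once is what narrows the loosely selected candidates down to sections with steady oscillation rate and depth.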

Sections identified under the above conditions can be expected to contain vibrato with very high probability. This is because, in vibrato generally, the vibrato frequency and the pitch vibration width vary little, the vibration width lies within a certain range (for example, within 500 cents), and the width of the pitch fluctuation stays roughly constant throughout the vibrato section. A "cent" is a unit of relative pitch difference; for example, +100 cents denotes a pitch one semitone above the reference pitch. The CPU 11 stores vibrato section data representing the identified sections in the vibrato section data storage area 14c.

FIG. 8 shows the vibrato section data generated for the singing pitch data shown in FIG. 4. As FIG. 8 shows, the vibrato section data records the start time and end time of each vibrato section detected in the singing voice data for each song.
This completes the vibrato section specifying process.

As described above, step SB110 first narrows down the vibrato section candidates on the condition that, in the filtered singing pitch data, the time interval of the pitch oscillation is within a predetermined range and that such intervals are detected at least a predetermined number of times in succession. The parameters extracted in step SB120 are then used to judge strictly whether each vibrato candidate section qualifies as a vibrato section. Testing for the pitch variation characteristic of vibrato against several conditions in this way makes it possible to identify vibrato sections accurately.

(B-3: Fist section specifying process)
After finishing the vibrato section specifying process described above, the CPU 11 performs the fist section specifying process, which identifies fist sections in the singing voice data. FIG. 9 is a flowchart showing the flow of the fist section specifying process.

In step SC100, the CPU 11 reads the filtered singing pitch data from the parameter storage area 14e.
Next, in step SC110, the CPU 11 identifies sections of the singing voice data that may contain a fist section (hereinafter, "fist candidate sections") as follows. The description refers to FIG. 10, which schematically shows part of the filtered singing pitch data. In FIG. 10, P_A (> 0) and P_B (< 0) denote the pitch change in section A, where the pitch rises, and in section B, where the pitch falls, respectively. Section C is the section from the start of section A to the end of section B. t_A, t_B, and t_C denote the durations of sections A, B, and C, respectively.

From the filtered singing pitch data, the CPU 11 identifies as a "fist candidate section" any section that satisfies all of the following conditions (1) to (3). That is, in a section where the pitch rises and then falls again (section C):
(1) In the section where the pitch rises (section A), the absolute value of the rate of pitch change (|P_A / t_A|) is greater than a predetermined value.
(2) In the section where the pitch falls (section B), the absolute value of the rate of pitch change (|P_B / t_B|) is greater than a predetermined value.
(3) The length (t_C) of the section from when the pitch starts to rise until it finishes falling is within a predetermined range. In other words, the transient rise in pitch takes neither too short nor too long a time.
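
Conditions (1) to (3) can be sketched as a scan over a sampled pitch contour. The sketch below is an illustration under stated assumptions, not the patented implementation: it assumes the pitch is given in cents at strictly increasing sample times in seconds, treats a run of strictly increasing samples as section A and the following run of strictly decreasing samples as section B, and uses hypothetical default thresholds. The function name and all parameter names are invented for this example.

```python
from typing import List, Sequence, Tuple


def find_fist_candidates(
    times: Sequence[float],
    pitch: Sequence[float],
    min_rise_rate: float = 500.0,   # hypothetical threshold for |P_A / t_A|
    min_fall_rate: float = 500.0,   # hypothetical threshold for |P_B / t_B|
    tc_range: Tuple[float, float] = (0.1, 1.0),  # hypothetical range for t_C
) -> List[Tuple[float, float]]:
    """Return (start, end) times of rise-then-fall sections meeting (1)-(3)."""
    candidates = []
    n = len(pitch)
    i = 0
    while i < n - 1:
        # Section A: follow the contour while the pitch strictly rises.
        peak = i
        while peak < n - 1 and pitch[peak + 1] > pitch[peak]:
            peak += 1
        if peak == i:          # no rise starts at this sample
            i += 1
            continue
        # Section B: follow the contour while the pitch strictly falls.
        end = peak
        while end < n - 1 and pitch[end + 1] < pitch[end]:
            end += 1
        if end == peak:        # rise without a following fall
            i = peak
            continue
        p_a = pitch[peak] - pitch[i]      # P_A > 0
        p_b = pitch[end] - pitch[peak]    # P_B < 0
        t_a = times[peak] - times[i]
        t_b = times[end] - times[peak]
        t_c = times[end] - times[i]
        if (
            abs(p_a / t_a) > min_rise_rate                 # condition (1)
            and abs(p_b / t_b) > min_fall_rate             # condition (2)
            and tc_range[0] <= t_c <= tc_range[1]          # condition (3)
        ):
            candidates.append((times[i], times[end]))
        i = end   # the end of this fall may start the next rise
    return candidates
```

For a contour that is flat, jumps up by 200 cents over 0.1 s, and returns over the next 0.1 s, the function reports one candidate spanning the whole rise and fall. A slower bump of the same duration fails conditions (1) and (2) and is ignored.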

When conditions (1) to (3) are applied to FIG. 11, which shows the same filtered singing pitch data as FIG. 4, sections 1, 2, 3, 4, 5, and 6 are identified as fist candidate sections. The CPU 11 generates data indicating these sections as fist candidate section data and writes it to the fist section data storage area 14d.
FIG. 12 shows an example of fist candidate section data. The fist candidate section data records, for each fist candidate section identified in the singing voice data for each song, the start time and end time of the pitch variation. For example, the entry 00m14s500ms to 00m15s400ms in FIG. 12 corresponds to the pitch variation of section 3 in FIG. 11.

In step SC120, the CPU 11 reads the vibrato section data from the vibrato section data storage area 14c and the fist candidate section data from the fist section data storage area 14d. It then identifies the final fist sections by excluding, from the sections indicated by the fist candidate section data, those candidates that are included in a vibrato section.
For example, the filtered singing pitch data shown in FIG. 11 yields the fist candidate section data shown in FIG. 12 and the vibrato section data shown in FIG. 8. Of the fist candidate sections shown in FIG. 12, only the section from 00m12s200ms to 00m12s800ms is not included in a vibrato section; all the other candidates are included in a vibrato section and are therefore excluded from the fist sections. When the start or end time of a fist candidate section does not coincide with that of a vibrato section, the candidate may be judged to be included in the vibrato section if even part of it lies within the vibrato section.
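
The exclusion in step SC120 amounts to an interval filter. The sketch below assumes sections are stored as (start, end) pairs in milliseconds and applies the "even part of it" rule, that is, any overlap with a vibrato section removes the candidate. The function names and the vibrato interval in the usage lines are hypothetical; the contents of FIG. 8 are not reproduced in the text.

```python
from typing import List, Tuple

Section = Tuple[int, int]  # (start_ms, end_ms)


def overlaps(a: Section, b: Section) -> bool:
    """True if the two sections share any stretch of time."""
    return a[0] < b[1] and b[0] < a[1]


def select_fist_sections(
    candidates: List[Section], vibrato_sections: List[Section]
) -> List[Section]:
    """Keep only candidates that do not overlap any vibrato section."""
    return [
        c for c in candidates
        if not any(overlaps(c, v) for v in vibrato_sections)
    ]


# Two candidate times from the text; the vibrato section is made up.
candidates = [(12200, 12800), (14500, 15400)]
vibrato = [(14000, 17000)]
fist_sections = select_fist_sections(candidates, vibrato)
```

With these inputs only the candidate from 12200 ms to 12800 ms survives, mirroring the outcome described for FIG. 12.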

As described above, a fist candidate section is identified on the condition that the pitch rises transiently within a predetermined time. Consequently, if the singing voice contains vibrato, the individual pitch oscillations that make up the vibrato are also picked up as candidates. Fist candidate sections are therefore identified under conditions that capture both the fist and vibrato, and the fist sections are then obtained by excluding the separately identified vibrato sections from those candidates.

(C: Modifications)
One embodiment of the present invention has been described above, but the present invention is not limited to that embodiment and can be implemented in various other forms. Examples are given below.

(1) In the embodiment described above, a section of the filtered singing pitch data whose pitch variation satisfies conditions (1) to (3) is identified as a "fist candidate section". In addition to those conditions, however, fist candidate sections may be identified using, for example, the following conditions (a), (b), (c), and (d).
(a) The pitch rise occurs within a predetermined period immediately after the sounding timing of a note (musical tone).
As shown in FIG. 13(a), which shows an excerpt of the filtered singing pitch data, for a note sustained from time t1 to time t3, if the pitch rise (time t2) appears only after a predetermined time or more has elapsed since the start of sounding (time t1), the section is not identified as a fist candidate section. This is because the fist is generally performed immediately after the start of each note.
(b) There is no pitch drop exceeding a predetermined level immediately before or immediately after the fist.
As shown in FIG. 13(b), if a large pitch drop with a local minimum at time t4 appears immediately before the transient pitch rise that peaks at time t5, the section of the transient rise peaking at time t5 is not identified as a fist candidate section.
(c) There is no pitch rise exceeding a predetermined level immediately after the fist.
As shown in FIG. 13(c), if a large pitch rise peaking at time t7 appears immediately after the transient pitch rise that peaks at time t6, the section of the transient rise peaking at time t6 is not identified as a fist candidate section.
(d) After the transient pitch rise there is a flat portion (a portion in which the difference between the maximum and minimum pitch is within a certain value) lasting a certain period. That is, there is a flat portion in a section of predetermined length immediately following the section in which the pitch rose transiently, or in a section of predetermined length beginning a predetermined time after that section.
As shown in FIG. 13(d), after the transient pitch rise peaking at time t8 ends, a pitch drop with a local minimum at time t9 and a pitch rise peaking at time t10 appear, so the pitch variation in that section exceeds the predetermined threshold. The section of the transient rise peaking at time t8 is therefore not identified as a fist candidate section.
Fist candidate sections may also be identified using several conditions selected from conditions (1) to (3) of the embodiment and conditions (a), (b), (c), and (d) above; the combination of conditions may be set as appropriate.
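
Two of the optional conditions, (a) and (d), lend themselves to a short sketch. It assumes note onset times are available (for example from the music data), that times and pitch values are plain numbers in consistent units, and that the thresholds are supplied by the caller. All function and parameter names are hypothetical.

```python
from typing import Sequence


def rise_follows_onset(
    note_onset: float, rise_start: float, max_delay: float
) -> bool:
    """Condition (a): the pitch rise begins soon after the note starts."""
    delay = rise_start - note_onset
    return 0 <= delay < max_delay


def has_flat_portion(
    times: Sequence[float],
    pitch: Sequence[float],
    window_start: float,
    window_length: float,
    max_spread: float,
) -> bool:
    """Condition (d): max minus min pitch in the window stays small."""
    window = [
        p for t, p in zip(times, pitch)
        if window_start <= t <= window_start + window_length
    ]
    return len(window) > 0 and max(window) - min(window) <= max_spread
```

Setting `window_start` to the end of the candidate section checks the section immediately after the rise; adding an offset checks a section that begins a predetermined time later, as condition (d) allows.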

(2) In the embodiment described above, identifying vibrato sections from the manner in which the pitch varies was described as one example. However, the method of identifying vibrato sections is not limited to this. For example, any of the conditions used in the embodiment to identify vibrato sections may be omitted, and other conditions may be used in combination with them.

(3) In the embodiment described above, the singing voice data is in WAVE or MP3 format. However, the data format is not limited to these; any format may be used as long as the data represents the singing voice.

(4) In the embodiment described above, the karaoke apparatus 1 implements all of the functions of the embodiment. However, two or more devices connected over a network may share those functions, so that a system comprising those devices implements the functions of the karaoke apparatus 1 of the embodiment.

(5) In the embodiment described above, vibrato sections are identified from the singing voice data and excluded from the fist candidate sections to identify the fist sections. However, when data indicating the sections to be sung with vibrato is available, for example as data accompanying the music data, the vibrato sections may be taken from that data, and the vibrato section specifying process described above need not be performed. In that case, the vibrato sections indicated by the data are written to the vibrato section data storage area 14c in advance, and the stored vibrato section data is read and used during the fist section specifying process.

(6) In the embodiment described above, the pitch is detected from the singing voice data, and the singing pitch data is generated, as the karaoke accompaniment progresses. However, this processing need not run alongside the accompaniment. For example, the singing voice may first be accumulated from the beginning to the end of the song, and the pitch detected from the resulting singing voice data once the karaoke accompaniment has finished.

(7) In the embodiment described above, fist sections are identified from the singing voice data. However, fist sections may instead be identified in advance, by the method described in the embodiment, from data representing a model performance such as a professional singer's rendition, and karaoke music data containing data that indicates the identified sections may be created. In that case, if the karaoke apparatus 1 shows on the display unit 15 which sections of the song should be sung with the fist, the singer can use the fist at the appropriate timing.

(8) In the embodiment described above, fist sections in a song are identified from the singer's singing voice data. However, audio data stored in some storage means may, for example, be read out and the fist detected from that audio data.

(9) In the embodiment described above, the CPU 11 applies low-pass filtering, which extracts frequency components at or below a specific frequency, to the singing voice data. However, the filtering performed by the CPU 11 is not limited to this; for example, a filter that extracts frequency components within a predetermined frequency width may be used. In short, any filtering that extracts the components of a specific frequency band may be used.

(10) In the embodiment described above, fist sections are identified from singing voice data representing a singer's performance. However, the audio data to be processed is not limited to data representing a singing voice; it may be audio data representing the sound of a musical instrument such as a violin or a flute. This makes it possible to detect, for example, instrumental playing techniques whose characteristics resemble those of the fist.

(11) The program executed by the CPU 11 of the karaoke apparatus 1 in the embodiment described above may be provided recorded on a recording medium such as a magnetic tape, a magnetic disk, a flexible disk, an optical recording medium, a magneto-optical recording medium, a CD (Compact Disk)-ROM, a DVD (Digital Versatile Disk), a RAM, or a ROM. It may also be downloaded to the karaoke apparatus 1 over a network such as the Internet.

(12) In the embodiment described above, when identifying fist sections, a fist candidate section is judged to be included in a vibrato section if even part of it lies within the vibrato section. However, the method of judging whether a fist candidate section is included in a vibrato section is not limited to this. For example, a fist candidate section may be judged to be included in a vibrato section when the whole of it, from its start time to its end time, lies within the vibrato section. Alternatively, a fist candidate section may be judged to be included in a vibrato section when the time at which its pitch reaches the local maximum lies within the vibrato section.
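
The three inclusion criteria named in modification (12) differ only in which part of the candidate must lie inside the vibrato section. The sketch below places them side by side, assuming sections are (start, end) pairs in milliseconds and that the time of the pitch peak within each candidate is known. The function names and the sample values are hypothetical.

```python
from typing import Tuple

Section = Tuple[int, int]  # (start_ms, end_ms)


def partly_inside(candidate: Section, vibrato: Section) -> bool:
    """Embodiment: any overlap counts as included."""
    return candidate[0] < vibrato[1] and vibrato[0] < candidate[1]


def fully_inside(candidate: Section, vibrato: Section) -> bool:
    """Alternative 1: the whole candidate lies within the vibrato section."""
    return vibrato[0] <= candidate[0] and candidate[1] <= vibrato[1]


def peak_inside(peak_time_ms: int, vibrato: Section) -> bool:
    """Alternative 2: the candidate's pitch peak lies within the vibrato section."""
    return vibrato[0] <= peak_time_ms <= vibrato[1]


# A candidate that straddles the start of a made-up vibrato section.
candidate = (14500, 15400)
vibrato = (15000, 18000)
results = (
    partly_inside(candidate, vibrato),   # overlap exists
    fully_inside(candidate, vibrato),    # candidate starts before the vibrato
    peak_inside(14900, vibrato),         # hypothetical peak before the vibrato
)
```

For a candidate that straddles the boundary of a vibrato section, the partial-overlap rule excludes it from the fist sections, while the other two rules may keep it, so the choice determines how aggressively boundary cases are discarded.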

FIG. 1 is a block diagram showing the configuration of the karaoke apparatus 1.
FIG. 2 is a diagram showing the contents of the music data.
FIG. 3 is a flowchart showing the flow of the karaoke accompaniment process.
FIG. 4 is a diagram showing singing pitch data and filtered singing pitch data.
FIG. 5 is a flowchart showing the flow of the vibrato section specifying process.
FIG. 6 is a diagram for explaining the method of calculating the pitch vibration width.
FIG. 7 is a diagram showing the method of calculating the linear approximation line of the pitch vibration width.
FIG. 8 is a diagram showing an example of vibrato section data.
FIG. 9 is a flowchart showing the flow of the fist section specifying process.
FIG. 10 is a diagram for explaining the method of identifying fist candidate sections.
FIG. 11 is a diagram showing filtered singing pitch data.
FIG. 12 is a diagram showing an example of fist candidate section data.
FIG. 13 is a diagram for explaining the method of identifying fist sections.

Explanation of symbols

1 ... karaoke apparatus, 11 ... CPU, 12 ... ROM, 13 ... RAM, 14 ... storage unit, 15 ... display unit, 16 ... operation unit, 17 ... microphone, 18 ... audio processing unit, 19 ... speaker, 20 ... bus.

Claims

A fist detection device comprising:
receiving means for receiving singing voice data representing a singing voice;
pitch detection means for detecting a pitch from the singing voice data received by the receiving means;
candidate section selecting means for identifying, from the pitch detected by the pitch detection means, sections in which the pitch rises and then falls, and for selecting from those sections candidate sections in which
(1) the absolute value of the rate of pitch change in the section where the pitch rises is greater than a predetermined threshold,
(2) the absolute value of the rate of pitch change in the section where the pitch falls is greater than a predetermined threshold, and
(3) the length of the section from the start of the pitch rise to the end of the fall is within a predetermined range;
second receiving means for receiving vibrato section data representing sections of the singing voice in which the vibrato technique is used; and
fist section specifying means for identifying each candidate section selected by the candidate section selecting means as a section in which the fist technique is used, if that candidate section is not included in a vibrato section represented by the vibrato section data received by the second receiving means.

The fist detection device according to claim 1, wherein the candidate section selecting means selects the candidate section using the further condition that the time difference between the timing at which the note (musical tone) containing the section starts sounding and the timing at which the pitch starts to rise is smaller than a predetermined threshold.

The fist detection device according to claim 1, wherein the candidate section selecting means selects the candidate section using the further condition that there is no pitch variation exceeding a predetermined magnitude in a section of predetermined length immediately before or after the section.

The fist detection device according to claim 1, wherein the candidate section selecting means selects the candidate section using the further condition that the pitch variation is smaller than a predetermined threshold in a section of predetermined length immediately after the section or beginning a predetermined time after the section.

A fist detection method comprising:
a receiving step of receiving singing voice data representing a singing voice;
a pitch detection step of detecting a pitch from the singing voice data received in the receiving step;
a candidate section selection step of identifying, from the pitch detected in the pitch detection step, sections in which the pitch rises and then falls, and of selecting from those sections candidate sections in which
(1) the absolute value of the rate of pitch change in the section where the pitch rises is greater than a predetermined threshold,
(2) the absolute value of the rate of pitch change in the section where the pitch falls is greater than a predetermined threshold, and
(3) the length of the section from the start of the pitch rise to the end of the fall is within a predetermined range;
a second receiving step of receiving vibrato section data representing sections of the singing voice in which the vibrato technique is used; and
a fist section specifying step of identifying each candidate section selected in the candidate section selection step as a section in which the fist technique is used, if that candidate section is not included in a vibrato section represented by the vibrato section data received in the second receiving step.

A program for causing a computer to function as:
receiving means for receiving singing voice data representing a singing voice;
pitch detection means for detecting a pitch from the singing voice data received by the receiving means;
candidate section selecting means for identifying, from the pitch detected by the pitch detection means, sections in which the pitch rises and then falls, and for selecting from those sections candidate sections in which
(1) the absolute value of the rate of pitch change in the section where the pitch rises is greater than a predetermined threshold,
(2) the absolute value of the rate of pitch change in the section where the pitch falls is greater than a predetermined threshold, and
(3) the length of the section from the start of the pitch rise to the end of the fall is within a predetermined range;
second receiving means for receiving vibrato section data representing sections of the singing voice in which the vibrato technique is used; and
fist section specifying means for identifying each candidate section selected by the candidate section selecting means as a section in which the fist technique is used, if that candidate section is not included in a vibrato section represented by the vibrato section data received by the second receiving means.