JP5549166B2

JP5549166B2 - Audio processing device, program

Info

Publication number: JP5549166B2
Application number: JP2009217173A
Authority: JP
Inventors: 典昭阿瀬見; 誠司黒川
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2009-09-18
Filing date: 2009-09-18
Publication date: 2014-07-16
Anticipated expiration: 2029-09-18
Also published as: JP2011065044A

Description

The present invention relates to a speech processing apparatus and a program that estimate note periods, each representing a period in an input speech that can be regarded as one note.

Conventionally, a speech processing apparatus is known that identifies, from an input speech, note periods each representing a period that can be regarded as one note (see, for example, Patent Document 1).
In the speech processing apparatus described in Patent Document 1, if the average sound pressure increases by a predefined value or more between two analysis sections that are consecutive in time, the earlier of the two analysis sections is identified as a note start timing. Based on the identified note start timings, the interval between two note start timings that are consecutive in time in the input speech is estimated as a note period (hereinafter, this note period estimation technique is referred to as the conventional estimation technique).

Japanese Patent No. 4128848

When vibrato is applied to the input speech, the transition of the sound pressure of the input speech along the time axis (hereinafter referred to as the sound pressure transition) fluctuates continuously, alternating between upward-convex and downward-convex regions. In each upward-convex region, the average sound pressure may increase by the specified value or more between two consecutive analysis sections.

In this case, within the period uttered with vibrato (hereinafter referred to as the vibrato period), the conventional estimation technique identifies each upward-convex region of the sound pressure transition as a note start timing. As a result, the conventional estimation technique splits the vibrato period into multiple note periods that should not have been estimated, and estimates each of the resulting pieces as a note period.
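The failure described above can be reproduced with a toy version of the conventional estimation technique. The sketch below is an illustration only: the function name, the list-of-averages input, and the numeric values are assumptions and are not taken from Patent Document 1.

```python
def conventional_note_starts(mean_pressure, min_increase):
    """Flag note starts the way the conventional technique is described.

    mean_pressure: average sound pressure per analysis section, in time order
                   (hypothetical representation).
    min_increase:  the predefined value by which the pressure must rise.
    Returns the indices of the earlier section of each qualifying pair.
    """
    starts = []
    for i in range(len(mean_pressure) - 1):
        if mean_pressure[i + 1] - mean_pressure[i] >= min_increase:
            starts.append(i)  # the earlier of the two consecutive sections
    return starts


# A single sustained note sung with vibrato: the pressure rises and falls.
vibrato_note = [0, 5, 3, 8, 4, 9]
starts = conventional_note_starts(vibrato_note, min_increase=4)
```

With these made-up values every upward swing of the vibrato is flagged, so one sung note would be split into three note periods.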

In other words, the conventional estimation technique has the problem that note periods cannot be estimated accurately.
Accordingly, an object of the present invention is to improve estimation accuracy in a technique for estimating note periods.

To achieve the above object, the present invention is a speech processing apparatus that estimates, from an input speech that is continuous in time, note periods each representing a period that can be regarded as one note, and transcribes the speech by identifying the pitch in each note period.
In the speech processing apparatus of the present invention, sound pressure transition specifying means identifies, from the input speech, a sound pressure transition representing how the sound pressure of the input speech changes over time. Start timing detection means then detects, as a note start timing, each time point at which, within a section where the identified sound pressure transition is monotonically increasing, the rate of increase of the sound pressure over a first specified period defined on the sound pressure transition first becomes equal to or greater than a predefined specified value as time progresses. Here, a note start timing is the start timing of a note period.
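One possible reading of the start timing detection is sketched below. It assumes the sound pressure transition is a list of samples, that the first specified period is a fixed number of samples (`window`), and that the rate of increase is the rise over that window divided by its length; these choices and all names are assumptions, not details given in the text.

```python
def detect_note_starts(pressure, window, min_rate):
    """Return one candidate note start per monotonically increasing run.

    pressure: sound pressure samples in time order (assumed representation).
    window:   length of the first specified period, in samples (assumed).
    min_rate: specified value for the rate of increase per sample.
    """
    starts = []
    n = len(pressure)
    i = 0
    while i < n - 1:
        if pressure[i + 1] <= pressure[i]:
            i += 1
            continue
        # Find the end j of the strictly increasing run that begins at i.
        j = i
        while j + 1 < n and pressure[j + 1] > pressure[j]:
            j += 1
        # Keep only the first point in the run whose rise over `window`
        # samples reaches the specified rate.
        for k in range(i, j - window + 1):
            rate = (pressure[k + window] - pressure[k]) / window
            if rate >= min_rate:
                starts.append(k)
                break
        i = j
    return starts


pressure = [0, 0, 1, 3, 6, 6, 5, 5.5, 6, 4]
starts = detect_note_starts(pressure, window=2, min_rate=1.5)
```

In this example the steep rise beginning at index 1 is detected, while the gentle rise beginning at index 6 is ignored because its rate stays below the specified value.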

In the speech processing apparatus of the present invention, vibrato period specifying means identifies vibrato periods based on a pitch transition representing how the pitch of the input speech changes over time. In-period timing removal means then removes, from the detection results of the start timing detection means, those note start timings that fall within a vibrato period (that is, in-period timings). Here, a vibrato period is a period of the input speech that was uttered with vibrato.
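The removal of in-period timings amounts to a filter over the detected note start timings. The sketch below assumes timings are sample indices and vibrato periods are half-open `(begin, end)` index pairs; the treatment of the boundaries is an assumption, since the text does not say whether they are inclusive.

```python
def remove_in_period_timings(starts, vibrato_periods):
    """Drop note start timings that fall inside any vibrato period.

    starts:          detected note start timings (assumed to be indices).
    vibrato_periods: list of (begin, end) pairs, end exclusive (assumed).
    """
    return [
        t for t in starts
        if not any(begin <= t < end for begin, end in vibrato_periods)
    ]


remaining = remove_in_period_timings([1, 5, 7, 12], [(4, 10)])
```

Here the timings 5 and 7 fall inside the vibrato period and are discarded, leaving only 1 and 12 as note period starts.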

Further, in the speech processing apparatus of the present invention, note period estimation means identifies a note end timing paired with each note start timing that remains after the in-period timings have been removed, and estimates each period between a paired note start timing and note end timing as a note period.
Further, in the vibrato period specifying means of the present invention, if the fluctuation range of the pitch in a period pitch transition is equal to or less than a specified width, increase/decrease detection means detects the increase sections, in which the pitch rises, and the decrease sections, in which the pitch falls, in that period pitch transition. If the number of detected increase sections and decrease sections is equal to or greater than a specified number, period specifying means identifies the second specified period corresponding to that period pitch transition as a vibrato period. Here, a period pitch transition is the pitch transition within a second specified period, and the second specified periods are a plurality of periods defined so as to cover the entire pitch transition and to follow one another consecutively in time.
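The two conditions for a vibrato period (small fluctuation range, many increase and decrease sections) can be sketched as follows. The sketch assumes pitch values in semitones, fixed-length second specified periods given as a sample count, and that an increase or decrease section is a maximal run of rising or falling values; all names and thresholds are hypothetical.

```python
def count_monotonic_sections(values):
    """Count maximal runs of rising values plus maximal runs of falling values."""
    count = 0
    prev_sign = 0
    for a, b in zip(values, values[1:]):
        sign = (b > a) - (b < a)  # +1 rising, -1 falling, 0 flat
        if sign != 0:
            if sign != prev_sign:
                count += 1  # a new increase or decrease section begins
            prev_sign = sign
    return count


def find_vibrato_periods(pitch, period_len, max_width, min_sections):
    """Return (begin, end) index pairs of periods judged to be vibrato."""
    periods = []
    for begin in range(0, len(pitch) - period_len + 1, period_len):
        segment = pitch[begin:begin + period_len]
        if max(segment) - min(segment) > max_width:
            continue  # fluctuation range exceeds the specified width
        if count_monotonic_sections(segment) >= min_sections:
            periods.append((begin, begin + period_len))
    return periods


pitch = ([60, 60, 60, 60, 60, 60]            # steady tone
         + [60, 60.5, 60, 60.5, 60, 60.5]    # narrow, rapid oscillation
         + [60, 62, 64, 66, 68, 70])         # wide glide, not vibrato
vibrato = find_vibrato_periods(pitch, period_len=6, max_width=1.0, min_sections=4)
```

Only the middle period satisfies both conditions: the steady tone has no increase or decrease sections, and the glide exceeds the specified width.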

In other words, the speech processing apparatus of the present invention uses, as the start timings of note periods, only the note start timings that remain after the in-period timings have been removed from all detected note start timings.

Therefore, the speech processing apparatus of the present invention prevents note start timings that fall within a vibrato period from being used when estimating note periods. As a result, at the very least, no note period that begins inside a vibrato period is estimated, which prevents a single vibrato period from being split into and estimated as two or more note periods.

In other words, the speech processing apparatus of the present invention can estimate note periods in the input speech more accurately than the conventional estimation technique.
The first specified period referred to here may be shorter than the entire section in which the sound pressure transition is monotonically increasing, or may be that entire section.

The vibrato period specifying means configured in this way can identify, as vibrato periods, the second specified periods that satisfy the conditions, from among those between the start and the end of the input speech.

The note period estimation means may be configured to identify, as the note end timing paired with a note start timing, the sound pressure fluctuation time point: the first time point after that note start timing (one remaining after the in-period timings have been removed) at which the sound pressure in the sound pressure transition becomes equal to or lower than the sound pressure at the note start timing.
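As a minimal sketch of this rule, assuming the sound pressure transition is a list of samples and timings are indices into it (both assumptions), the note end timing is the first later sample that is no louder than the sample at the note start:

```python
def note_end_from_pressure(pressure, start):
    """Return the first index after `start` whose pressure is at or below
    the pressure at `start`, or None if the pressure never drops that far."""
    level = pressure[start]
    for t in range(start + 1, len(pressure)):
        if pressure[t] <= level:
            return t
    return None


pressure = [1, 4, 6, 5, 3, 1, 0]
end_for_first = note_end_from_pressure(pressure, start=0)
end_for_second = note_end_from_pressure(pressure, start=1)
```

The `None` return corresponds to the case in which no sound pressure fluctuation time point exists for a given note start timing.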

The end timing estimation means configured in this way can estimate the sound generation end timing directly from the sound pressure transition.
When the note period estimation means is configured as described above, the time point representing the note end timing may be one estimated before the in-period timings are removed, or one estimated after they are removed.

However, when the note period estimation means is configured as described above, a section in which the sound pressure decreases monotonically within a vibrato period may be estimated as a note end timing.

To prevent this, the note period estimation means may treat, as the note end timing paired with a note start timing, the first sound pressure fluctuation time point that comes both after that note start timing (one remaining after the in-period timings have been removed) and after the end timing of the vibrato period.
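The effect of skipping past the vibrato period can be illustrated with a small sketch. It assumes index-based timings and that the end timing of the vibrato period is itself a valid candidate; both points, like the function name, are assumptions.

```python
def note_end_after_vibrato(pressure, start, vibrato_end):
    """Return the first index at or after `vibrato_end` (and after `start`)
    whose pressure is at or below the pressure at `start`, else None."""
    level = pressure[start]
    first_candidate = max(start + 1, vibrato_end)
    for t in range(first_candidate, len(pressure)):
        if pressure[t] <= level:
            return t
    return None


# The dips at indices 2 and 4 belong to the vibrato and must not end the note.
pressure = [2, 5, 1, 5, 1, 5, 4, 2, 0]
end = note_end_after_vibrato(pressure, start=0, vibrato_end=6)
```

Without the vibrato constraint the note would end at index 2; with it, the search begins at index 6 and the note ends at index 7.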

When a person sings, the sound pressure of the input speech may rise again without first dropping to the sound pressure at the note start timing, and utterance of the next sound may begin. In such a situation, the note end timing itself may not be identified.

For this reason, in the present invention, the note period estimation means may be configured to identify the time point that precedes the later start timing by a set time length as the note end timing paired with the earlier start timing. Here, of two note start timings that are adjacent in time after the in-period timings have been removed, the earlier start timing is the one that comes first in time, and the later start timing is the one that comes second.

The note period estimation means configured in this way can estimate the note end timing from the note start timings even when the note end timing cannot be estimated from the input speech. The speech processing apparatus of the present invention thus avoids the situation in which no note end timing is estimated.

In particular, in the speech processing apparatus configured in this way, if the set time length is made short enough to be regarded as 0 [s], the identified note end timing falls at or after the end timing of the vibrato period. As a result, the entire vibrato period can be contained in a single independent note period.

It is desirable that the note period estimation means of the present invention be further configured so that, if a sound pressure fluctuation time point, at which the sound pressure in the sound pressure transition becomes equal to or lower than the sound pressure at the earlier start timing, exists before the later start timing, that sound pressure fluctuation time point is identified as the note end timing paired with the earlier start timing.

With the speech processing apparatus configured in this way, even when the method of identifying the note end timing from the sound pressure transition is used together with the method of identifying it from the note start timings, exactly one note end timing can be identified for each note start timing.

When the note period estimation means is configured as described above, the note end timing paired with the last note start timing in the input speech may not be identified.
For this reason, the note period estimation means of the present invention may be configured to identify the end of the input speech as the note end timing paired with the last of the note start timings that remain after the in-period timings have been removed.
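The end timing rules described so far can be combined into one routine, sketched below under stated assumptions: timings are sample indices, `gap` is the set time length in samples, and the last index of the input stands for the end of the input speech. The order in which the rules are applied (pressure drop first, then the fallbacks) is one reading of the text, not a confirmed implementation.

```python
def estimate_note_periods(pressure, starts, gap=0):
    """Pair each note start timing with exactly one note end timing.

    pressure: sound pressure samples in time order (assumed representation).
    starts:   note start timings left after in-period timings were removed.
    gap:      set time length, in samples, placed before the next start.
    """
    periods = []
    for idx, start in enumerate(starts):
        is_last = idx == len(starts) - 1
        limit = len(pressure) if is_last else starts[idx + 1]
        end = None
        # Rule 1: first point before the next start where the pressure
        # falls to or below the pressure at this start.
        for t in range(start + 1, limit):
            if pressure[t] <= pressure[start]:
                end = t
                break
        if end is None:
            if is_last:
                end = len(pressure) - 1  # Rule 3: end of the input speech
            else:
                end = limit - gap        # Rule 2: next start minus set length
        periods.append((start, end))
    return periods


pressure = [1, 3, 5, 4, 6, 7, 3, 1, 4, 5]
periods = estimate_note_periods(pressure, starts=[0, 3, 7])
```

In this example the first note ends at the next start (rule 2 with `gap=0`), the second ends where the pressure drops (rule 1), and the third runs to the end of the input (rule 3). The sketch does not validate that `gap` is shorter than the distance between adjacent starts.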

With the speech processing apparatus configured in this way, the end of the input speech is identified as a note end timing, which prevents the situation in which the note end timing paired with the last note start timing is not identified.
Further, the speech processing apparatus of the present invention includes fundamental frequency detection means, discontinuity detection means, frequency correction means, and pitch estimation means.
The fundamental frequency determination means determines, for each unit section that is consecutive along the time axis in the input speech, the speech fundamental frequency in that unit section. The discontinuity detection means detects, in the frequency transition obtained by arranging the speech fundamental frequencies detected by the fundamental frequency determination means along the time axis, discontinuous regions in which the speech fundamental frequency is discontinuous between consecutive unit sections.
Further, the frequency correction means performs frequency correction on the frequency transition specified by the sound pressure transition specifying means, so that the fundamental frequencies of the unit sections corresponding to a discontinuous region detected by the discontinuity detection means progress from the immediately preceding frequency, which is the speech fundamental frequency of the unit section just before the discontinuous region, to the immediately following frequency, which is the speech fundamental frequency of the unit section just after it. The pitch estimation means identifies the speech fundamental frequency of a unit section after the frequency correction as the pitch in the note period corresponding to that unit section.
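The frequency correction can be pictured as bridging a discontinuous region with values that lead from the immediately preceding frequency to the immediately following one. The sketch below uses linear interpolation for the bridge, which is an assumption: the text only requires that the corrected values reach the following frequency from the preceding one. The region is passed in as indices; how it is detected is not covered here.

```python
def correct_discontinuity(f0, begin, end):
    """Return a copy of f0 with f0[begin:end] replaced by a linear ramp.

    f0:    speech fundamental frequency per unit section (Hz).
    begin: index of the first unit section in the discontinuous region.
    end:   index of the first unit section after the region.
    Requires 0 < begin < end < len(f0).
    """
    before, after = f0[begin - 1], f0[end]
    steps = end - begin + 1
    corrected = list(f0)
    for k in range(begin, end):
        fraction = (k - begin + 1) / steps
        corrected[k] = before + (after - before) * fraction
    return corrected


f0 = [200.0, 200.0, 400.0, 100.0, 230.0, 230.0]  # indices 2 and 3 are outliers
smoothed = correct_discontinuity(f0, begin=2, end=4)
```

The two outlying sections are replaced by values that step evenly from 200 Hz up to 230 Hz, and the input list is left unchanged.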

The present invention may also be implemented as a program to be executed by a computer.
In that case, the program of the present invention causes a computer to execute: a sound pressure transition specifying procedure that identifies, from an input speech, the sound pressure transition of the input speech; a start timing detection procedure that detects, as a note start timing, each time point at which the rate of increase of the sound pressure over the first specified period, within a section where the identified sound pressure transition is monotonically increasing, first becomes equal to or greater than the specified value as time progresses; and a vibrato period specifying procedure that identifies vibrato periods based on the pitch transition. The program must further cause the computer to execute: an in-period timing removal procedure that treats the note start timings detected in the start timing detection procedure that fall within a vibrato period as in-period timings and removes them from the detection results of the start timing detection procedure; and a note period estimation procedure that identifies a note end timing paired with each note start timing remaining after the in-period timing removal procedure, and estimates each period between a paired note start timing and note end timing as a note period.
Further, in the vibrato period specifying procedure, with each of a plurality of periods defined so as to cover the entire pitch transition and to follow one another consecutively in time taken as a second specified period, and the pitch transition within a second specified period taken as a period pitch transition, the program must cause the computer to execute: an increase/decrease detection procedure that, if the fluctuation range of the pitch in a period pitch transition is equal to or less than a predefined specified width, detects the increase sections, in which the pitch rises, and the decrease sections, in which the pitch falls, in that period pitch transition; and a period specifying procedure that, if the number of increase sections and decrease sections detected in the increase/decrease detection procedure is equal to or greater than a predefined specified number, identifies the second specified period corresponding to that period pitch transition as a vibrato period.

Such a program can be recorded on a computer-readable recording medium such as a DVD-ROM, CD-ROM, hard disk, or flash memory and loaded into and started on a computer as needed, or acquired by the computer over a communication line and started as needed. Causing the computer to execute each procedure makes the computer function as the speech processing apparatus described in claim 1.

A block diagram showing the schematic configuration of the music search system.
A flowchart showing the procedure of the music search process.
A flowchart showing the procedure of the pitch estimation process.
An explanatory diagram schematically showing the method of detecting correlation peaks.
An explanatory diagram schematically showing the method of determining the speech fundamental frequency f0.
A flowchart showing the procedure of the reliability calculation process.
An explanatory diagram illustrating the process of deriving the f0 candidate reliability.
A flowchart showing the procedure of the f0 correction process.
An explanatory diagram for describing an operation example of the f0 correction process.
A flowchart showing the procedure of the start/end timing estimation process.
An explanatory diagram illustrating the process of estimating start and end timings.
An explanatory diagram illustrating the process of estimating start and end timings.
A flowchart showing the procedure of the transcription process.
An explanatory diagram illustrating the process of determining the note pitch in the transcription process.
A flowchart showing the procedure of the transcription result matching process.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
First, FIG. 1 is a block diagram showing the schematic configuration of a music search system that includes a speech processing apparatus to which the present invention is applied.
<About the music search system>
The music search system 1 searches, based on an input speech entered by a user's utterance, for the song that the user is presumed to have intended when entering that speech (hereinafter referred to as the predicted intended song).

For this purpose, as shown in FIG. 1, the music search system 1 includes a server 40 that stores music data prepared in advance for each song, and a speech processing apparatus 20 that transcribes the input speech and searches for the predicted intended song by matching the transcription result against the music data. The speech processing apparatus 20 is connected to the server 40 via a network (for example, a dedicated line or a WAN).

Of these, the server 40 is a well-known service server apparatus built around an information processing apparatus that includes a storage device 41 for storing music data and a well-known microcomputer 42 having at least a ROM, a RAM, and a CPU.
<About the music data>
Next, the music data stored in the storage device 41 will be described.

Each piece of music data has music information, which is data for identifying the song; time information, which indicates the time required from the start to the end of a performance of the song; and a guide melody, which is data on the melody of the song.

The music information includes at least song number data for identifying the song and song title data indicating the title of the song.
The guide melody is well-known data representing the pitch and note value of each constituent note that forms the main melody of the song (hereinafter referred to as the reference melody). Specifically, the duration of a constituent note in this embodiment is expressed by a musical sound output start time and a musical sound output end time. Here, the musical sound output start time is the time from the start of the performance of the song until output of the constituent note begins, and the musical sound output end time is the time from the start of the performance until output of the constituent note ends. That is, the length of time between the musical sound output start time and the musical sound output end time is the duration of the constituent note.
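A constituent note of the guide melody can be modeled as a small record whose duration is derived from its two output times. The class and field names below are hypothetical, and representing the pitch as a MIDI note number and the times in seconds are assumptions; the text specifies only that a pitch, an output start time, and an output end time are held.

```python
from dataclasses import dataclass


@dataclass
class ConstituentNote:
    pitch: int          # assumed: MIDI note number (60 = middle C)
    start_time: float   # assumed: seconds from the start of the performance
    end_time: float     # assumed: seconds from the start of the performance

    @property
    def duration(self) -> float:
        """Length of time between output start and output end."""
        return self.end_time - self.start_time


# A guide melody is then an ordered list of constituent notes.
guide_melody = [
    ConstituentNote(pitch=60, start_time=1.5, end_time=2.0),
    ConstituentNote(pitch=62, start_time=2.0, end_time=3.0),
]
durations = [note.duration for note in guide_melody]
```

Keeping the list in time order also preserves the association between each note and its position in the reference melody.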

Hereinafter, the information in the guide melody that represents the pitch and note value of each constituent note is referred to as reference note data. In the reference note data, the pitch and note value of each constituent note are associated with the position of that constituent note in the time order of the reference melody.
<About the speech processing apparatus>
Next, the speech processing apparatus 20 will be described.

Returning to FIG. 1, the speech processing apparatus 20 includes a communication unit 21, a display unit 22, an operation reception unit 23, a microphone 24, an audio input unit 25, an audio output unit 26, a speaker 27, a storage unit 28, and a control unit 30.

Of these, the communication unit 21 is a communication interface that connects the speech processing apparatus 20 to a network (for example, a dedicated line or a WAN) and communicates with the outside (that is, the server 40) via that network.

The display unit 22 is a well-known display device such as a liquid crystal display. The operation reception unit 23 consists of well-known input devices such as a keyboard and a pointing device (for example, a mouse).

The microphone 24 is a well-known device for inputting sound. The audio input unit 25 is configured as an AD converter that samples the sound (an analog signal) input via the microphone 24 and inputs the sampled values to the control unit 30. Hereinafter, the sound converted into sampled values by the audio input unit 25 is referred to as audio data.

The audio output unit 26 is configured to output, to the speaker 27, a control signal based on a command from the control unit 30. The speaker 27 is configured to convert the control signal from the audio output unit 26 into sound and emit it.

The storage unit 28 is a storage device (for example, a hard disk drive) that retains its contents even when the power is turned off and that allows its contents to be read and written. It stores programs, music data acquired from the server 40 via the communication unit 21, and the like.

Next, the control unit 30 is built around a well-known microcomputer having at least a ROM 31, a RAM 32, and a CPU 33.
Of these, the ROM 31 stores programs and data whose contents must be retained even when the power is turned off. The RAM 32 temporarily stores programs and data; processing programs are transferred from the storage unit 28 and stored there.

The CPU 33 executes each process (various computations) in accordance with the processing programs stored in the ROM 31 and the RAM 32, and controls the units 21, 22, 23, 25 (24), 26 (27), and 28 that make up the speech processing apparatus 20.

In this embodiment, a processing program is provided with which the control unit 30 (more precisely, the CPU 33) executes a music search process. Based on the input speech that the user entered via the microphone 24, this process generates speech note data, which is a transcription of the input speech, and searches for the predicted intended song according to the result of matching the generated speech note data against each set of reference note data.
<About the music search process>
Next, the music search process executed by the control unit 30 will be described.

FIG. 2 is a flowchart showing the procedure of the music search process.
The music search process is started when a start command is received via the operation reception unit 23 after at least one piece of audio data based on input speech entered via the microphone 24 has been stored in the storage unit 28. The input speech here continues for at least a certain length of time.

As shown in FIG. 2, when the music search process is started, first, in S110, one piece of audio data is acquired from the audio data stored in the storage unit 28.
Next, in S120, well-known downsampling, DC component removal, noise removal, compressor processing, and normalization are each applied to the audio data acquired in S110 as preprocessing. Hereinafter, audio data for which the preprocessing in S120 has been completed is referred to as processed audio data.

In S130, a pitch estimation process is executed that estimates, for each unit section defined along the time progression of the input voice in the processed audio data, the pitch of the input voice in that unit section (the voice fundamental frequency f0).

Further, in S150, a start/end timing estimation process is executed that estimates the start timing and end timing of each sounding period, that is, each period in the input voice during which utterance continues at or above a specified sound pressure. Hereinafter, the start timing and end timing estimated by this process are referred to as the sounding start timing and the sounding end timing, respectively.

In the subsequent S190, a transcription process is executed. This process estimates periods that can each be regarded as one note (hereinafter, note periods) based on the sounding start timings and sounding end timings estimated in S150, and identifies the pitch in each estimated note period (hereinafter, the note pitch) based on the voice fundamental frequency f0 of each unit section estimated in S130. The transcription process generates voice note data, that is, data in which the length of each note period (the note length, or the note value obtained by quantizing the note length) is associated with the note pitch, as a note representation of the input voice.

In S210, a transcription result collation process is executed. This process collates the voice note data generated in S190 with the reference note data, identifies the intended expected song based on the collation result, and notifies the user of the voice processing device 20 of the identified song.

Thereafter, the music search process ends.
<Pitch estimation process>
Next, the pitch estimation process started in S130 of the music search process will be described.

FIG. 3 is a flowchart showing the procedure of the pitch estimation process.
As shown in FIG. 3, when the pitch estimation process is started, the processed audio data is frequency-analyzed in S310. In the present embodiment, this frequency analysis applies an FFT (Fast Fourier Transform) to a predetermined number of sample values of the processed audio data. The sample values are taken repeatedly from the start to the end of the processed audio data, with consecutive blocks partially overlapping along the time progression. As a result, the amplitude spectrum of the input voice (that is, the distribution of its frequency components) is derived for each unit section corresponding to that number of samples.
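The overlapping frame analysis of S310 can be sketched as follows. This is only an illustration under stated assumptions: the function name `amplitude_spectra`, the frame size, and the hop size are hypothetical, and a naive DFT stands in for the FFT so that the sketch needs no external library.

```python
import cmath
import math

def amplitude_spectra(samples, frame_size, hop):
    """Return one amplitude spectrum per overlapping frame (unit section)."""
    spectra = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        spectrum = []
        # Naive DFT for clarity; a practical implementation would use an FFT.
        for k in range(frame_size // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n, x in enumerate(frame))
            spectrum.append(abs(acc))
        spectra.append(spectrum)
    return spectra

# A pure tone with 4 cycles per 32-sample frame, analysed with 50% overlap
# (both values are assumptions chosen for the illustration).
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(64)]
spectra = amplitude_spectra(tone, frame_size=32, hop=16)
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spectra]
```

Each element of `spectra` corresponds to one unit section, and `peak_bins` holds the index of the strongest frequency bin in each section.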

In the subsequent S330, based on the amplitude spectra derived in S310, autocorrelation values are derived that represent how likely a frequency component of the amplitude spectrum is to be the fundamental frequency component.
Specifically, the sum of products of the amplitude value at each frequency component of one amplitude spectrum and the amplitude value at the frequency component higher by a specified frequency width is derived as the autocorrelation value. An autocorrelation value is derived each time the spectrum is displaced by the specified frequency width, and it becomes large when the displacement brings the fundamental frequency component, or harmonic components of the fundamental frequency, into alignment.
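The sum of products described for S330 can be illustrated with a short sketch. It assumes that the specified frequency width equals one spectral bin and that the displacement is applied in whole bins; the function name `spectral_autocorrelation` and the toy spectrum are hypothetical.

```python
def spectral_autocorrelation(spectrum, max_shift):
    """autocorr[s] = sum over k of spectrum[k] * spectrum[k + s]."""
    values = []
    for shift in range(max_shift + 1):
        values.append(sum(spectrum[k] * spectrum[k + shift]
                          for k in range(len(spectrum) - shift)))
    return values

# Toy amplitude spectrum with harmonics at bins 3, 6, 9 and 12.
spectrum = [0.0] * 16
for harmonic_bin in (3, 6, 9, 12):
    spectrum[harmonic_bin] = 1.0
autocorr = spectral_autocorrelation(spectrum, max_shift=12)
```

In this toy case the autocorrelation value is non-zero only at shifts that are multiples of the harmonic spacing (3 bins), which is the behavior the text describes for displacements that align the fundamental or its harmonics.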

In S330, the average of all autocorrelation values derived from an amplitude spectrum (hereinafter, the autocorrelation average value) is also derived for each amplitude spectrum (that is, for each unit section).
Further, in S350, the autocorrelation values derived in S330 are subjected to smoothing differentiation to detect section f0 candidates, which represent frequencies that are candidates for the voice fundamental frequency f0 in each unit section.

As shown in FIG. 4, a section f0 candidate is a frequency corresponding to a local maximum (hereinafter, a correlation peak) in the locus of autocorrelation values. Here, the locus of autocorrelation values is the sequence of autocorrelation values for each specified frequency arranged along the frequency axis.

In the present embodiment, however, only correlation peaks whose autocorrelation value is equal to or greater than the autocorrelation average value are treated as section f0 candidates (in the example shown in FIG. 4, the fourth correlation peak is below the autocorrelation average value, so the frequency corresponding to it is not detected as a section f0 candidate). Furthermore, in S350 of the present embodiment, the autocorrelation value at any frequency component not detected as a section f0 candidate is set to 0.
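The candidate detection of S350 might look like the sketch below. It is a simplification: a direct comparison with neighboring values replaces the smoothing differentiation named in the text, and the function name `detect_f0_candidates` and the sample values are hypothetical.

```python
def detect_f0_candidates(autocorr):
    """Keep local maxima at or above the mean; set every other value to 0."""
    mean_value = sum(autocorr) / len(autocorr)
    kept = [0.0] * len(autocorr)
    candidates = []
    for i in range(1, len(autocorr) - 1):
        # Simplified peak test (assumption): compare with both neighbors.
        is_peak = autocorr[i - 1] < autocorr[i] >= autocorr[i + 1]
        if is_peak and autocorr[i] >= mean_value:
            kept[i] = autocorr[i]
            candidates.append(i)
    return candidates, kept

# Three correlation peaks; the last one lies below the average value.
autocorr = [0.0, 1.0, 5.0, 1.0, 0.5, 4.0, 0.5, 0.2, 1.2, 0.2]
candidates, kept = detect_f0_candidates(autocorr)
```

Here `candidates` lists the indices that survive as section f0 candidates, and `kept` is the autocorrelation sequence with all non-candidate values set to 0.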

S350 is executed repeatedly until all unit sections defined in the processed audio data have been processed.
In the subsequent S370, a reliability calculation process is executed that calculates an f0 candidate reliability representing the likelihood that each section f0 candidate detected in S350 is the voice fundamental frequency f0. The f0 candidate reliability derived by this process takes a larger value as the likelihood is higher.

The reliability calculation process is executed for each unit section. One execution therefore calculates the f0 candidate reliability for each of the section f0 candidates derived from one amplitude spectrum.

In S390, the voice fundamental frequency f0 of the unit section is determined based on the f0 candidate reliabilities derived in S370 for that unit section. As shown in FIG. 5, the section f0 candidate determined as the voice fundamental frequency f0 is the one corresponding to the highest of all the f0 candidate reliabilities derived in S370 (in the example shown in FIG. 5, the first section f0 candidate is determined as the voice fundamental frequency f0).

In S390 of the present embodiment, however, any f0 candidate reliability below a predetermined reliability threshold is set to 0 (in the example shown in FIG. 5, the f0 candidate reliability of the third section f0 candidate is below the reliability threshold and is therefore set to 0). If all f0 candidate reliabilities in a unit section are below the reliability threshold, the voice fundamental frequency f0 of that unit section is set to 0 [Hz]. In other words, when an f0 candidate reliability is low, the corresponding section f0 candidate is prevented from being determined as the voice fundamental frequency f0.
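The selection rule of S390 can be sketched as below. The function name `select_f0`, the dictionary layout (candidate frequency in Hz mapped to its f0 candidate reliability), and the numeric values are assumptions made for illustration.

```python
def select_f0(reliability_by_freq, threshold):
    """Return the candidate frequency with the highest reliability, or 0.0 Hz."""
    # Reliabilities below the threshold are treated as 0.
    gated = {f: (r if r >= threshold else 0.0)
             for f, r in reliability_by_freq.items()}
    best_freq = max(gated, key=gated.get, default=None)
    if best_freq is None or gated[best_freq] == 0.0:
        return 0.0      # no trustworthy candidate in this unit section
    return best_freq

voiced = select_f0({220.0: 0.9, 440.0: 0.4, 660.0: 0.05}, threshold=0.1)
unvoiced = select_f0({220.0: 0.05, 440.0: 0.02}, threshold=0.1)
```

In this example `voiced` is the 220 Hz candidate, while `unvoiced` is 0.0 because every reliability falls below the threshold.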

Further, in S410, it is determined whether S370 and S390 have been executed for all unit sections defined in the processed audio data.
If the determination in S410 finds that S370 and S390 have not been executed for all unit sections, the process returns to S370. In that S370, the f0 candidate reliability is calculated for the unit section that follows, along the time progression of the processed audio data, the unit section handled in the previous S370, and the process then proceeds to S390.

When S370 and S390 have been executed for all unit sections, the process proceeds to S430.
In S430, an f0 correction process is executed that corrects the voice fundamental frequency f0 determined for each unit section in S390. The f0 correction process is applied to discontinuous regions, that is, regions where the voice fundamental frequency f0 can be regarded as discontinuous in the frequency transition distribution formed by arranging the voice fundamental frequency f0 of each unit section along the time progression of the input voice.

In the subsequent S450, the voice fundamental frequency f0 of each unit section after the correction in S430 is quantized in semitone units, so that each voice fundamental frequency f0 is snapped to a semitone. This quantization is a well-known process, and a detailed description is omitted here.
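Because the text leaves the quantization of S450 unspecified, the following is only one common way to do it, assuming twelve-tone equal temperament with a 440 Hz reference. The function name `quantize_to_semitone` and the reference pitch are assumptions, not details taken from the embodiment.

```python
import math

def quantize_to_semitone(f0_hz, reference_hz=440.0):
    """Snap a frequency to the nearest equal-tempered semitone."""
    if f0_hz <= 0.0:
        return 0.0      # keep 0 Hz (no pitch determined) unchanged
    semitones = round(12.0 * math.log2(f0_hz / reference_hz))
    return reference_hz * 2.0 ** (semitones / 12.0)

slightly_sharp = quantize_to_semitone(445.0)   # snaps back to the reference
octave_up = quantize_to_semitone(880.0)
```

Working on a logarithmic scale is the design choice here, since semitones are equally spaced in log frequency rather than in Hz.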

Thereafter, the pitch estimation process ends, and the process proceeds to S150 of the music search process.
<Reliability calculation process>
Next, the reliability calculation process started in S370 of the pitch estimation process will be described.

FIG. 6 is a flowchart showing the procedure of the reliability calculation process.
As shown in FIG. 6, when the reliability calculation process is started in S370 of the pitch estimation process, in S3710 the autocorrelation value of a section f0 candidate included in a specific frequency band (hereinafter, the specific f0 candidate) is extracted from all section f0 candidates in one unit section. In the present embodiment, the specific f0 candidate is the one corresponding to the lowest frequency among the section f0 candidates included in the specific frequency band. The specific frequency band is the frequency band from a lower limit frequency to an upper limit frequency that is determined automatically by the derivation of the autocorrelation values.

In the subsequent S3720, the autocorrelation value of one section f0 candidate included in a harmonic range of the specific f0 candidate corresponding to the autocorrelation value extracted in S3710 (hereinafter, a harmonic f0 candidate) is acquired. A harmonic range is a frequency range centered on a harmonic component of that specific f0 candidate and defined so as to extend on both sides of the harmonic component.

In S3730, the autocorrelation value of the specific f0 candidate extracted in S3710 is subtracted from the autocorrelation value of the harmonic f0 candidate acquired in S3720. The result of the subtraction is then set as the new autocorrelation value of that harmonic f0 candidate (that is, the value is updated).

In the subsequent S3740, it is determined whether S3730 has been executed for the autocorrelation values of all harmonic f0 candidates in the unit section. If S3730 has not been executed for all of them, the process returns to S3720.

In that S3720, the section f0 candidate included in the next higher harmonic range after the one containing the section f0 candidate handled in the previous S3720 is taken as the harmonic f0 candidate, its autocorrelation value is acquired, and the process proceeds to S3730.

In other words, by repeating S3720 through S3740, the autocorrelation value of each harmonic f0 candidate is changed, as shown in FIG. 7(A), from the value derived in S330 of the pitch estimation process to that value minus the autocorrelation value of the specific f0 candidate.

Further, in S3750, each autocorrelation value is multiplied by an attenuation coefficient. As shown in FIG. 7(B), the attenuation coefficient is larger the lower the frequency corresponding to the autocorrelation value being multiplied, and smaller the higher that frequency.

The autocorrelation values multiplied by the attenuation coefficient are those of all section f0 candidates in the unit section, including the autocorrelation value of the specific f0 candidate and the autocorrelation values of all harmonic f0 candidates changed by repeating S3720 through S3740.

In the subsequent S3760, each autocorrelation value multiplied by the attenuation coefficient in S3750 is further multiplied by the spectral amplitude value of the corresponding section f0 candidate. The product is derived as the f0 candidate reliability of that section f0 candidate.

The autocorrelation values corresponding to frequency components other than the section f0 candidates (hereinafter, non-candidate frequencies) were set to 0 in S350 of the pitch estimation process. Consequently, even if the multiplication by the attenuation coefficient in S3750 and the calculation of the f0 candidate reliability in S3760 are applied to the autocorrelation value of a non-candidate frequency, the result is 0.

Therefore, the multiplication by the attenuation coefficient in S3750 and the calculation in S3760 yield f0 candidate reliabilities only for the section f0 candidates of the unit section.
Thereafter, the reliability calculation process ends, and the process returns to S390 of the pitch estimation process.
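The steps S3710 through S3760 can be summarized in a rough sketch. Several details are assumptions because the text does not give them: all candidates are taken to lie inside the specific frequency band, a harmonic range is taken to be a harmonic bin plus or minus `tolerance` bins, and the attenuation coefficient is taken to fall linearly from 1.0 to 0.5 across the spectrum. The function name `f0_candidate_reliability` is hypothetical.

```python
def f0_candidate_reliability(candidates, autocorr, amplitude, tolerance=1):
    """candidates: ascending bin indices; autocorr, amplitude: per-bin lists."""
    if not candidates:
        return {}
    values = {c: autocorr[c] for c in candidates}
    base = candidates[0]            # lowest-frequency ("specific") candidate
    base_value = values[base]
    for c in candidates[1:]:
        multiple = round(c / base)
        # Candidate inside a harmonic range of the specific candidate?
        if multiple >= 2 and abs(c - multiple * base) <= tolerance:
            values[c] -= base_value             # subtraction of S3730
    top = len(autocorr) - 1
    reliability = {}
    for c in candidates:
        attenuation = 1.0 - c / (2.0 * top)     # assumed linear decay
        reliability[c] = values[c] * attenuation * amplitude[c]
    return reliability

autocorr = [0.0] * 17
amplitude = [0.0] * 17
autocorr[5], autocorr[10], autocorr[13] = 6.0, 7.0, 2.0
amplitude[5], amplitude[10], amplitude[13] = 2.0, 1.0, 1.0
reliability = f0_candidate_reliability([5, 10, 13], autocorr, amplitude)
best_bin = max(reliability, key=reliability.get)
```

In this toy case the candidate at bin 10 has the largest raw autocorrelation value, but after the subtraction and attenuation the candidate at bin 5 ends up with the highest reliability.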

In short, in the reliability calculation process of the present embodiment, multiplying the autocorrelation value of each frequency component by the attenuation coefficient suppresses the autocorrelation values of section f0 candidates in the high frequency band, where harmonic components of the voice fundamental frequency f0 are likely to be found. As a result, the f0 candidate reliability, obtained by multiplying each suppressed autocorrelation value by the amplitude value of the corresponding section f0 candidate, is larger for candidates that correspond to the fundamental frequency component.
<f0 correction process>
Next, the f0 correction process started in S430 of the pitch estimation process will be described.

FIG. 8 is a flowchart showing the procedure of the f0 correction process.
As shown in FIG. 8, when the f0 correction process is started, first, in S4310, one unit section is selected from all the unit sections frequency-analyzed in S310 of the pitch estimation process. Each time the process reaches S4310, the unit sections are selected one at a time, starting from the beginning of the processed audio data and following its time progression.

In the subsequent S4320, it is determined whether the voice fundamental frequency f0 of the unit section selected in S4310 is 0 [Hz].
If the voice fundamental frequency f0 is 0 [Hz], the process proceeds to S4330. In S4330, the section counter is incremented by one, and the process returns to S4310.

In other words, executing S4310 through S4330 counts the number of unit sections whose voice fundamental frequency f0 is 0 [Hz] (hereinafter, non-regular frequency sections) that occur consecutively along the time progression of the processed audio data.

On the other hand, if the determination in S4320 finds that the voice fundamental frequency f0 is a frequency other than 0 [Hz], the process proceeds to S4340.
In S4340, the voice fundamental frequency f0 that had been held as the first section f0 until this transition to S4340 is set as the second section f0, and the voice fundamental frequency f0 of the unit section that was selected in S4310 and triggered this transition to S4340 is set as the first section f0. That is, on reaching S4340, of the voice fundamental frequencies f0 acquired so far along the time progression of the processed audio data, the one from the unit section nearer the start of the voice becomes the second section f0 and the one from the unit section nearer the end of the voice becomes the first section f0.

In the subsequent S4350, it is determined whether the count value, which is the value of the section counter, is equal to or greater than a predetermined first specified value.
If the count value is equal to or greater than the first specified value, the process proceeds to S4360. That is, if the number of consecutive non-regular frequency sections along the time progression of the processed audio data is equal to or greater than the first specified value, those consecutive non-regular frequency sections are detected as a discontinuous region in the frequency transition distribution. Hereinafter, a run of non-regular frequency sections at least as long as the first specified value is referred to as a long-term discontinuous region.

In S4360, the voice fundamental frequency f0 of every unit section from the one immediately after the unit section corresponding to the second section f0 to the one immediately before the unit section most recently selected in S4310 is corrected to the second section f0. The process then proceeds to S4390.

In other words, in S4360, the voice fundamental frequency f0 of the non-regular frequency sections forming the long-term discontinuous region is changed from 0 [Hz] to the second section f0.
If, on the other hand, the determination in S4350 finds that the count value is less than the first specified value, the process proceeds to S4370. In S4370, it is determined whether the count value is 1 or more. If the count value is 1 or more, the process proceeds to S4380. That is, if the number of consecutive non-regular frequency sections along the time progression of the input voice is at least one and less than the first specified value, those consecutive non-regular frequency sections are detected as a discontinuous region in the frequency transition distribution. Hereinafter, a run of at least one and fewer than the first specified value of non-regular frequency sections is referred to as a short-term discontinuous region.

In S4380, the voice fundamental frequency f0 of the unit sections corresponding to the short-term discontinuous region is corrected so that it changes by a constant step from the second section f0 and reaches the first section f0 along a straight line. The process then proceeds to S4390.

In other words, in S4380, the voice fundamental frequency f0 of the non-regular frequency sections forming the short-term discontinuous region is changed from 0 [Hz] to frequencies on the straight line connecting the second section f0 and the first section f0.

In the subsequent S4390, the section counter is initialized (here, set to 0).
Thereafter, in S4400, it is determined whether all unit sections defined in the processed audio data have been selected in S4310. If an unselected unit section remains, the process returns to S4310.
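The handling of runs of 0 [Hz] sections in S4310 through S4390 can be sketched as follows. The function name `fill_zero_runs` and the parameter `long_run` (standing for the first specified value) are hypothetical, and the sketch assumes that runs at the very start or end of the data, which have no voiced section on one side, are left untouched.

```python
def fill_zero_runs(f0, long_run):
    """Fill runs of 0 Hz that lie between two voiced unit sections."""
    out = list(f0)
    prev_index = None          # index of the last non-zero section seen
    for i, value in enumerate(f0):
        if value == 0.0:
            continue           # still inside a run of non-regular sections
        if prev_index is not None:
            gap = i - prev_index - 1
            before, after = f0[prev_index], value
            for j in range(1, gap + 1):
                if gap >= long_run:
                    # Long-term region: hold the preceding frequency.
                    out[prev_index + j] = before
                else:
                    # Short-term region: straight line between neighbors.
                    step = (after - before) / (gap + 1)
                    out[prev_index + j] = before + step * j
        prev_index = i
    return out

track = [100.0, 0.0, 0.0, 0.0, 200.0, 0.0, 0.0, 0.0, 0.0, 0.0, 300.0]
filled = fill_zero_runs(track, long_run=4)
```

With `long_run=4`, the three-section gap is interpolated linearly and the five-section gap is held at the preceding 200 Hz.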

If the determination in S4370 finds that the count value is less than 1, the process proceeds to S4410. That is, the process proceeds to S4410 if, in the frequency transition distribution, no non-regular frequency section lies between the unit section corresponding to the second section f0 and the unit section corresponding to the first section f0.

In S4410, it is determined whether the sound skip flag has been set. When set, the sound skip flag indicates that the start point of a harmonic misdetection region, which is one kind of discontinuous region in the frequency transition distribution, has been detected.

In other words, the steps from S4410 onward detect a discontinuous region (that is, a harmonic misdetection region) in which the frequency transition becomes discontinuous because the ratio between the voice fundamental frequencies f0 of unit sections adjacent along the time progression of the processed audio data falls outside a special range, which is a preset range of ratios. The steps from S4410 onward also correct the voice fundamental frequency f0 of the unit sections forming the harmonic misdetection region.

If the determination in S4410 finds that the sound skip flag is not set, the process proceeds to S4420. In S4420, it is determined whether the result of dividing the second section f0 by the first section f0 (hereinafter, the first frequency ratio) is outside the special range.

If the first frequency ratio is outside the special range, the process proceeds to S4430, where the sound skip flag is set. That is, when the ratio between the voice fundamental frequencies f0 of unit sections adjacent along the time progression of the processed audio data falls outside the special range, the later of the two adjacent unit sections is taken as the start point of a harmonic misdetection region.

In the subsequent S4440, the second section f0 is set as the third section f0. The process then proceeds to S4400.
If the determination in S4420 finds that the first frequency ratio is within the special range, it is judged that no harmonic misdetection region has started in the frequency transition distribution, and the process proceeds to S4400. In S4400, if an unselected unit section remains among all the unit sections (S4400: NO), the process returns to S4310.

If, on the other hand, the determination in S4410 finds that the sound skip flag has been set, the process proceeds to S4450. In S4450, the harmonic misdetection region is regarded as continuing in the frequency transition distribution, and the sound skip counter is incremented by one.

Thereafter, in S4460, it is determined whether the result of dividing the third section f0 by the first section f0 (hereinafter, the second frequency ratio) is outside the special range. If the second frequency ratio is outside the special range, the process proceeds to S4470.

In S4470, it is determined whether the sound skip value, which is the value of the sound skip counter, is equal to or greater than a predetermined second specified value. If the sound skip value is less than the second specified value, the harmonic misdetection region is regarded as continuing in the frequency transition distribution, and the process proceeds to S4400.

If the determination in S4460 finds that the second frequency ratio is within the special range, the harmonic misdetection region is regarded as having ended in the frequency transition distribution, and the process proceeds to S4480. That is, a harmonic misdetection region is the region of the frequency transition distribution from the point at which the voice fundamental frequency f0 of unit sections adjacent along the time progression changes by more than the special range (hereinafter, a specific variation) to the point at which, after that specific variation, the voice fundamental frequency f0 of a subsequent unit section returns to within the special range relative to the third section f0. However, a harmonic misdetection region is limited to regions consisting of fewer unit sections than the second specified value.

In S4480, the voice fundamental frequency f0 of the unit sections corresponding to the harmonic misdetection region is corrected so that it changes by a constant step from the third section f0 and reaches the first section f0 along a straight line. The process then proceeds to S4490.

If the determination in S4470 finds that the sound skip value is equal to or greater than the second specified value, the region made up of the unit sections following the corresponding specific variation is regarded not as a discontinuous region but as the actual transition of the voice fundamental frequency f0 in the input voice, and the process proceeds to S4490. In S4490, the sound skip counter is initialized, the sound skip flag is cleared, and the process proceeds to S4400.

In S4400, if an unselected unit section remains among all the unit sections (S440: NO), the process returns to S4310. If no unselected unit section remains on reaching S4400, the f0 correction process ends and the process proceeds to S450 of the pitch estimation process.

Next, an operation example of the f0 correction process in the present embodiment will be described.
Here, FIG. 9A shows the frequency transition distribution before the f0 correction process is executed, and FIG. 9B shows the frequency transition distribution after the f0 correction process is executed.

When the f0 correction process is applied to the voice fundamental frequency f0 of each unit section forming the frequency transition distribution shown in FIG. 9A, unit sections of the frequency transition distribution are first selected in order along the time progression of the input voice (S4310). Up to the voice fundamental frequency f0_t1 in unit section t1, every selected unit section has a voice fundamental frequency f0 other than 0 [Hz], and the ratio between the voice fundamental frequencies f0 of consecutive unit sections is within the special range. Therefore, from the start of the frequency transition distribution through unit section t1, no frequency correction is performed and the voice fundamental frequency f0 determined in S390 of the pitch transition process is retained.

However, the ratio between the voice fundamental frequency f0_t1 in unit section t1 and the voice fundamental frequency f0_t2 in unit section t2 exceeds the special range.
Therefore, when unit section t2 is selected in S4310, a negative determination is made in S4420 and the sound skip flag is set. Next, the voice fundamental frequency f0_t3 in unit section t3, selected in S4310, has a ratio to the voice fundamental frequency f0_t1 that exceeds the special range. An affirmative determination is therefore made in S4460, and because the sound skip value at this point is less than the second prescribed value (in this operation example, the second prescribed value is assumed to be 2 or more), a negative determination is made in S4470.

Then the voice fundamental frequency f0_t4 in unit section t4, the next unit section selected in S4310 along the time progression of the processed voice data, has a ratio to the voice fundamental frequency f0_t1 that is within the special range. A negative determination is therefore made in S4460, and the span from unit section t2 to unit section t3 is detected as an overtone false-detection region. The voice fundamental frequencies f0_t2 and f0_t3 in the detected region are corrected, as shown in FIG. 9B, so that they change by a constant step from the voice fundamental frequency f0_t1 and reach the voice fundamental frequency f0_t4 linearly.

Returning to FIG. 9A, the f0 correction process continues to select unit sections along the time progression of the input voice. In the frequency transition distribution shown in FIG. 9A, the voice fundamental frequency f0 is 0 [Hz] in every unit section from unit section t5 to unit section t9 (f0_t5 to f0_t9 in the figure).

Accordingly, in the f0 correction process, each time one of the unit sections t5 to t10 is selected in S4310, the process moves to S4330 and the section counter is incremented, reaching 5. The voice fundamental frequency f0_t10 in unit section t10, the next unit section selected in S4310 along the input voice, is a frequency other than 0 [Hz], so an affirmative determination is made in S4320. Because the count value is less than the first prescribed value (in this operation example, the first prescribed value is assumed to be 6 or more) and is 1 or more, a negative determination is made in S4370. The span from unit section t5 to unit section t10 is therefore detected as a short-term discontinuous region. The voice fundamental frequencies f0_t5 to f0_t9 in the detected short-term discontinuous region are corrected, as shown in FIG. 9B, so that they change by a constant step from the voice fundamental frequency f0_tA and reach the voice fundamental frequency f0_t10 linearly.

In short, the f0 correction process of the present embodiment detects overtone false-detection regions, short-term discontinuous regions, and long-term discontinuous regions as discontinuous regions in the frequency transition distribution.
When an overtone false-detection region or a short-term discontinuous region is detected, the f0 correction process corrects the region so that the voice fundamental frequency f0 changes by a constant step from the value in the unit section immediately before the region to the value in the unit section immediately after it. When a long-term discontinuous region is detected, the f0 correction process assigns the voice fundamental frequency f0 of the unit section immediately before the region to every unit section forming the long-term discontinuous region.
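The different treatment of short-term and long-term discontinuous regions can be illustrated with a small Python sketch. It covers only runs of 0 [Hz] sections, and several details are assumptions made for the example rather than statements from the description: the name `fill_zero_runs`, the rule that a run counts as long-term when its length is equal to or greater than `first_prescribed`, and the handling of runs at the very beginning (left unchanged) or very end (held at the previous value) of the data.

```python
def fill_zero_runs(f0, first_prescribed):
    """Fill runs of 0 Hz in an f0 sequence.

    Short runs (length < first_prescribed) are interpolated in equal steps
    between the neighbouring non-zero values; longer runs are held at the
    value of the unit section immediately before the run.
    """
    out = list(f0)
    n = len(out)
    i = 0
    while i < n:
        if out[i] != 0:
            i += 1
            continue
        j = i
        while j < n and out[j] == 0:     # find the end of the zero run
            j += 1
        run = j - i
        if i > 0:                        # assumption: a leading run is left as is
            before = out[i - 1]
            if run < first_prescribed and j < n:
                step = (out[j] - before) / (run + 1)
                for k in range(i, j):
                    out[k] = before + step * (k - i + 1)
            else:
                for k in range(i, j):
                    out[k] = before      # hold the preceding value
        i = j
    return out
```

With `first_prescribed = 6`, a run of three zeros between 100 and 108 becomes 102, 104, 106, while a run of six zeros after 108 is filled with 108.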
<Start / end timing estimation process>
Next, the start / end timing estimation process started in S150 of the music search process will be described.

FIG. 10 is a flowchart showing the procedure of the start/end timing estimation process.
As shown in FIG. 10, when the start/end timing estimation process is started, S510 first derives the sound pressure of each unit section on which frequency analysis was performed in S310 of the pitch estimation process. The derived sound pressure is the sum of the spectral amplitude values in the amplitude spectrum derived in S310.

Next, in S520, a sound pressure transition representing the change in sound pressure along the time progression of the input voice is derived from the per-section sound pressures derived in S510. S520 also smooths the derived sound pressure transition with a moving average. In the present embodiment, the moving average is computed over windows of a prescribed number of unit sections that are defined repeatedly so as to overlap one another along the time progression, each window being shifted by one unit section from the previous one. As a result, the smoothed sound pressure transition, like the transition before smoothing, has a sound pressure value for every unit section.
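A moving average that shifts by one unit section and still yields one value per unit section might look like the following Python sketch. The description does not say how the window is aligned or how the ends of the data are handled, so the centred window and the truncation at the edges are assumptions, as is the name `moving_average`.

```python
def moving_average(pressures, window):
    """Smooth a per-section sound pressure sequence.

    A window of `window` unit sections is centred on each section and moved
    one section at a time; near the ends the window is truncated, so the
    output has the same length as the input.
    """
    half = window // 2
    smoothed = []
    for i in range(len(pressures)):
        lo = max(0, i - half)
        hi = min(len(pressures), i + half + 1)
        segment = pressures[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed
```

Because the window advances by a single unit section, consecutive windows overlap and every unit section receives a smoothed value.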

In S530, as shown in FIG. 11A, a noise sound pressure of a predetermined magnitude is subtracted from the sound pressure of each unit section in the smoothed sound pressure transition. Any sound pressure for which the subtraction result is negative is set to 0.
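The subtraction in S530 amounts to removing a fixed noise floor and clamping at zero. A minimal Python sketch, with the hypothetical name `subtract_noise`:

```python
def subtract_noise(smoothed, noise_pressure):
    """Subtract a fixed noise sound pressure from every unit section and
    replace negative results with 0."""
    return [max(p - noise_pressure, 0.0) for p in smoothed]
```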

In the following S540, one unit section is selected from all the unit sections in the sound pressure transition, and the sound pressure in the selected unit section is acquired. Each time the process reaches S540, the unit sections are selected one at a time in order from the start of the processed voice data along its time progression.

In S550, the sound pressure that was held as the first sound pressure Pv1 before this entry into S550 is set as the second sound pressure Pv2, and the sound pressure of the unit section selected in S540 on this pass is set as the first sound pressure Pv1. That is, of the sound pressures acquired so far along the time progression of the processed voice data, the one in the unit section closer to the start of the voice becomes the second sound pressure Pv2 and the one in the unit section closer to the end of the voice becomes the first sound pressure Pv1.

In S560, the first sound pressure Pv1 is divided by the second sound pressure Pv2 (the result is hereinafter referred to as the sound pressure increase rate).
In the following S570, it is determined whether the sound pressure increase rate derived in S560 is equal to or greater than a predetermined threshold Th. If it is, the process proceeds to S580, where the sound generation counter is incremented by one.

In the following S590, it is determined whether the sound generation count value, that is, the value of the sound generation counter, is equal to or greater than a predetermined first threshold. If the sound generation count value is less than the first threshold, the process proceeds to S600. In S600, it is determined whether the sound generation count value is equal to or greater than a second threshold, which is defined in advance as a value one smaller than the first threshold. If the sound generation count value is less than the second threshold, the process returns to S540 and steps S540 to S590 are repeated.

If, on the other hand, the determination in S600 finds that the sound generation count value is equal to or greater than the second threshold, that is, if an affirmative determination has been made in S570 a number of consecutive times equal to the second threshold while steps S540 to S590 were repeated, the process proceeds to S610. An affirmative determination in S600 thus means that a region has been detected in which unit sections whose sound pressure increase rate is equal to or greater than the threshold Th continue for a number of sections equal to the first threshold plus one (hereinafter, a start determination target section).

In S610, the first unit section, along the time progression of the input voice, of the unit sections forming the start determination target section is identified as the sound generation start timing. The sound pressure at the identified sound generation start timing (hereinafter, the sound generation start sound pressure) is also acquired. S610 then stores the identified sound generation start timing and the acquired sound generation start sound pressure in the storage unit 28. In FIGS. 10 to 12, the start timing is denoted "ST".

If the determination in S590 finds that the sound generation count value is equal to or greater than the first threshold, the process proceeds to S630 without executing S600 and S610. That is, in the smoothed sound pressure transition, when the sound pressure increase rate after a sound generation start timing remains equal to or greater than the threshold Th continuously from that timing, a negative determination is made in S590.

If the determination in S570 finds that the sound pressure increase rate is less than the threshold Th, the sound generation counter is initialized (here, set to 0) in S620. In other words, counting of the number of consecutive unit sections whose sound pressure increase rate is equal to or greater than the threshold Th ends. The process then proceeds to S630.

In S630, it is determined whether the first sound pressure Pv1 is equal to or less than the sound generation start sound pressure most recently stored in the storage unit 28 in S610 (hereinafter, the end determination sound pressure). If the first sound pressure Pv1 is less than the end determination sound pressure, the process proceeds to S640.

In S640, the unit section corresponding to the first sound pressure Pv1 is stored in the storage unit 28 as a sound generation end timing. The process then proceeds to S650. In FIGS. 10 to 12, the end timing is denoted "ET".

If the determination in S630 finds that the first sound pressure Pv1 is equal to or greater than the end determination sound pressure, the unit section selected in S540 is judged not to be a sound generation end timing, and the process proceeds to S650 without executing S640.

In S650, it is determined whether every unit section defined in the processed voice data has been selected in S540. If an unselected unit section remains, the process returns to S540. If no unselected unit section remains, the process proceeds to S660.

That is, as shown in FIG. 11B, when the smoothed sound pressure transition contains monotonically increasing sections in which the sound pressure increase rate stays at or above the threshold, repeating steps S540 to S650 in the start/end timing estimation process identifies the first unit section of each such monotonically increasing section as a sound generation start timing (the first, second, third, and fourth sound generation ST in the figure). A monotonically increasing section here means a run of consecutive unit sections at least as long as the number of unit sections forming a start determination target section.

Furthermore, by repeating steps S540 to S650, the first unit section after a sound generation start timing, along the time progression of the processed voice data, whose sound pressure in the smoothed sound pressure transition falls to or below the end determination sound pressure is identified as a sound generation end timing (the first and second sound generation ET in the figure).
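The loop of S540 to S650 can be sketched in Python as below. This is an illustration under stated assumptions, not the patented implementation. The following points are not given in the description and are assumed here: the first unit section is skipped because it has no predecessor; a zero Pv2 yields an infinite rate when Pv1 is positive and a rate of 0 otherwise; the start timing is taken as index `i - count + 1`; control continues to S630 after S610; and a `sounding` flag ensures that only the first section at or below the end determination sound pressure is recorded. All function and variable names are hypothetical.

```python
import math

def detect_timings(pressure, th, first_threshold):
    """Return (start_indices, end_indices) for a smoothed, noise-subtracted
    sound pressure sequence, loosely following S540 to S650."""
    second_threshold = first_threshold - 1
    starts, ends = [], []
    count = 0                  # sound generation counter
    start_pressure = None      # end determination sound pressure
    sounding = False           # assumption: True between a start and its end
    pv1 = None
    for i, p in enumerate(pressure):           # S540
        pv2, pv1 = pv1, p                      # S550
        if pv2 is None:
            continue                           # assumption: no ratio for the first section
        if pv2 > 0:
            rate = pv1 / pv2                   # S560
        else:
            rate = math.inf if pv1 > 0 else 0.0
        if rate >= th:                         # S570
            count += 1                         # S580
            if count < first_threshold:        # S590
                if count < second_threshold:   # S600
                    continue                   # back to S540
                onset = i - count + 1          # S610: first section of the run
                starts.append(onset)
                start_pressure = pressure[onset]
                sounding = True
        else:
            count = 0                          # S620
        if sounding and pv1 <= start_pressure: # S630
            ends.append(i)                     # S640
            sounding = False
    return starts, ends
```

For example, with the pressure sequence `[1, 1, 2, 4, 8, 8, 4, 1, 1]`, `th = 1.5`, and `first_threshold = 3`, the sketch reports a start at index 2 and an end at index 7.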

Returning to FIG. 10, in the following S660, the last unit section along the time progression of the processed voice data, among the unit sections set in the processed voice data, is stored in the storage unit 28 as a sound generation end timing.

In the following S670, the voice fundamental frequencies f0 in a determination target section are acquired from among all the unit sections defined in the processed voice data. The determination target section consists of a predetermined number of unit sections, and successive determination target sections are defined repeatedly so that they are contiguous and overlap one another along the time progression of the processed voice data.

In S680, the fluctuation width of the voice fundamental frequencies f0 in the determination target section acquired in S670 is derived. The fluctuation width is the difference between the highest voice fundamental frequency f0 and the lowest voice fundamental frequency f0 in the determination target section.

In the following S690, it is determined whether the fluctuation width derived in S680 is less than a prescribed width, which is a predetermined frequency width. If the fluctuation width is less than the prescribed width, the process proceeds to S700.

In S700, a frequency trajectory is derived by arranging all the voice fundamental frequencies f0 in the determination target section along the time progression of the processed voice data. The derived frequency trajectory is then subjected to smoothed differentiation to detect its extrema.

In the following S710, it is determined whether the smoothed differentiation in S700 detected any extremum in the frequency trajectory. If an extremum was detected, the process proceeds to S720.

In S720, the number of extrema detected in S700 within the determination target section is counted. In S730, it is determined whether the vibrato value, that is, the number of extrema counted in S720, is equal to or greater than a predetermined third threshold. If the vibrato value is equal to or greater than the third threshold, the process proceeds to S740.

That is, by executing steps S670 to S730, a determination target section in which the fluctuation width of the voice fundamental frequency f0 is less than the prescribed width and in which the total number of increasing sections, where the voice fundamental frequency f0 rises, and decreasing sections, where it falls, is equal to or greater than the third threshold is detected as a vibrato period. A vibrato period is a period during which the user of the voice processing device 20 vocalized with vibrato.
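The vibrato test of S680 to S730 can be approximated in a few lines of Python. The description calls for smoothed differentiation. The sketch below substitutes a plain first difference and counts sign changes, which is a deliberate simplification. The names `count_extrema` and `is_vibrato` are hypothetical.

```python
def count_extrema(track):
    """Count local extrema in an f0 trajectory as sign changes of the first
    difference (flat steps are ignored). This replaces the smoothed
    differentiation of the description with a simpler stand-in."""
    diffs = [b - a for a, b in zip(track, track[1:])]
    signs = [1 if d > 0 else -1 for d in diffs if d != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

def is_vibrato(track, prescribed_width, third_threshold):
    """Return True when the f0 values of one determination target section
    stay within a narrow band yet change direction often enough."""
    if max(track) - min(track) >= prescribed_width:       # S680, S690
        return False
    return count_extrema(track) >= third_threshold        # S700 to S730
```

A trajectory such as `[220, 222, 220, 218, 220, 222, 220]` has a fluctuation width of 4 and three extrema, so it is classified as vibrato for `prescribed_width = 10` and `third_threshold = 3`, whereas a wide glide is rejected by the width test.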

In the following S740, the vibrato value counted in S720 is initialized (here, set to 0). Then, in S750, any sound generation start timing that falls within the vibrato period (hereinafter, an in-period timing) is deleted from the sound generation start timings stored in the storage unit 28. The process then proceeds to S770.

If the determination in S690 finds that the fluctuation width of the voice fundamental frequency f0 within the determination target section is equal to or greater than the prescribed width, or if the determination in S710 finds that the determination target section contains no extremum, the process proceeds to S760. The process also proceeds to S760 if the determination in S730 finds that the vibrato value is less than the third threshold.

In other words, if the determination target section defined in S670 is not a vibrato period, the process proceeds to S760. In S760, the vibrato value is initialized, and the process proceeds to S770.
In S770, it is determined whether every unit section set in the processed voice data has been defined as a determination target section. If not, the process returns to S680, a new determination target section is set, and the process proceeds to S680. S680 to S770 are repeated until every unit section has been defined as a determination target section.

For example, suppose that executing the start/end timing estimation process identifies the sound generation start timings (first to fourth sound generation start timings) and the sound generation end timings (first and second sound generation end timings) shown in FIG. 12A, and that a determination target section containing the third and fourth sound generation start timings is identified as a vibrato period. In this case, the third and fourth sound generation start timings are removed as in-period timings, so that, as shown in FIG. 12B, only the first and second sound generation start timings remain. The sound generation end timings are all retained without removal.
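The removal of in-period timings in this example can be expressed compactly. In the Python sketch below, start timings and vibrato periods are given as unit-section indices. The name `remove_in_period_starts`, the representation of a period as an inclusive `(first, last)` pair, and the sample index values are assumptions made for illustration.

```python
def remove_in_period_starts(starts, vibrato_periods):
    """Drop every start timing that lies inside any vibrato period.

    Each period is an inclusive (first_section, last_section) pair.
    End timings are not passed in because they are never removed.
    """
    return [
        s for s in starts
        if not any(first <= s <= last for first, last in vibrato_periods)
    ]
```

With four start timings at sections 10, 40, 72, and 80 and one vibrato period spanning sections 70 to 90, only the first two start timings remain, which mirrors the situation of FIG. 12A and FIG. 12B.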

If the determination in S770 finds that every unit section has been defined as a determination target, the start/end timing estimation process ends and the process proceeds to S190 of the music search process.
In short, the start/end timing estimation process of the present embodiment detects sound generation start timings and sound generation end timings from the sound pressure transition of the input voice, and identifies vibrato periods from the transition of the voice fundamental frequency f0 along the time progression of the input voice (that is, the frequency trajectory). The process then deletes the sound generation start timings that fall within an identified vibrato period and keeps only those that fall outside the vibrato periods.
<About transcription processing>
Next, the music transcription process started in S190 of the music search process will be described.

FIG. 13 is a flowchart showing the procedure of the transcription process.
As shown in FIG. 13, when the transcription process is started, S910 first selects one unit section from among all the unit sections on which frequency analysis was performed in S310 of the pitch estimation process. Each time the process reaches S910, the unit sections are selected one at a time in order from the start of the processed voice data along its time progression.

In the following S920, it is determined whether the unit section selected in S910 is a sound generation start timing. If it is not, the process proceeds to S930.

In S930, it is determined whether the unit section selected in S910 is a sound generation end timing. If it is not, the process returns to S910. That is, steps S910 to S930 are repeated as long as the unit section selected in S910 is neither a sound generation start timing nor a sound generation end timing.

If, on the other hand, the determination in S920 finds that the unit section selected in S910 is a sound generation start timing, the process proceeds to S940. In S940, the sound generation start timing that was held as the first start timing before this entry into S940 is set as the second start timing, and the unit section selected in S910 on the way to S550 (that is, the sound generation start timing) is set as the first start timing. That is, of the sound generation start timings along the time progression of the processed voice data, the one closer to the start of vocalization becomes the second start timing and the one closer to the end of the voice becomes the first start timing. In FIG. 13, the start timing is denoted ST.

In the following S950, it is determined whether the start acquisition flag (hereinafter written as start acquisition F) has been set. If the start acquisition flag has not been set, the process proceeds to S960, where the flag is set. The process then returns to S910.

If the determination in S930 finds that the unit section selected in S910 is a sound generation end timing, the process proceeds to S970. The process thus reaches S970 when a sound generation start timing and the sound generation end timing to be paired with it have been acquired along the time progression of the processed voice data. In S970, the start acquisition flag is cleared and the process proceeds to S980.

If the determination in S950 finds that the start acquisition flag has been set, the process proceeds to S980. That is, an affirmative determination is made in S950 when two sound generation start timings occur along the time progression of the processed voice data with no sound generation end timing between them.

In S980, the note period is identified from the sound generation start timing or sound generation end timing acquired by the time the process reaches S980.
Specifically, in the present embodiment, when S980 is reached through an affirmative determination in S930, the first start timing is taken as the note start timing and the sound generation end timing is taken as the note end timing. When S980 is reached through a negative determination in S950, the second start timing is taken as the note start timing, and the point that precedes the first start timing by a set time length along the time progression of the processed voice data is taken as the note end timing. In either case, the period between the note start timing and the note end timing is identified as the note period. S980 in the present embodiment also derives the length of the identified note period as the note length.

Next, in S990, the pitch in every unit section belonging to the note period specified in S980 is acquired, that is, the voice fundamental frequency f0 quantized in S450 of the pitch estimation process (hereinafter also called the quantized frequency). One quantized frequency is therefore acquired for each unit section that makes up the note period.

In S1000, based on the quantized frequencies acquired in S990, the first pitch frequency and the second pitch frequency are specified, and the first pitch count and the second pitch count are tallied. The first pitch frequency is the quantized frequency that occupies the largest share of the note period specified in S980, and the second pitch frequency is the quantized frequency that occupies the second-largest share. In S1000 of this embodiment, when several quantized frequencies qualify as the second pitch frequency in that note period, the one with the highest frequency is taken as the second pitch frequency.

The first pitch count tallied in S1000 is the number of unit sections, among those in the note period specified in S980, that correspond to the first pitch frequency. Likewise, the second pitch count is the number of unit sections in that note period that correspond to the second pitch frequency.

Next, in S1010, it is determined whether the second pitch frequency specified in S1000 is higher than the first pitch frequency. If it is, the process proceeds to S1020.

In S1020, it is determined whether the second pitch count is equal to or greater than the pitch determination threshold. This threshold is the product of a predefined ratio A (1/2.3 in this embodiment) and the first pitch count. If the second pitch count is equal to or greater than the threshold, the process proceeds to S1030.

In S1030, the pitch corresponding to the second pitch frequency is specified as the pitch of the note period specified in S980 (that is, the note pitch). Voice note data is then generated that consists of the specified note pitch and a note length equal to the sound length derived in S980. The process then proceeds to S1050.

If the determination in S1010 shows that the second pitch frequency is equal to or lower than the first pitch frequency, or the determination in S1020 shows that the second pitch count is below the pitch determination threshold, the process proceeds to S1040.

In S1040, the pitch corresponding to the first pitch frequency is specified as the pitch of the note period specified in S980 (that is, the note pitch). Voice note data is then generated that consists of the specified note pitch and a note length equal to the sound length derived in S980. The process then proceeds to S1050.
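The pitch decision of S1000 to S1040 can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function name `select_note_pitch`, the representation of quantized frequencies as plain numbers, the handling of a note period with a single frequency, and the tie-break applied when two frequencies share the largest count are all assumptions that the description does not specify.

```python
from collections import Counter

RATIO_A = 1 / 2.3  # specified ratio A given in the embodiment


def select_note_pitch(quantized_f0, ratio_a=RATIO_A):
    """Return the quantized frequency chosen as the note pitch.

    quantized_f0 holds one quantized frequency per unit section of the
    note period (hypothetical representation).
    """
    counts = Counter(quantized_f0)
    # Rank by count (descending). Among equal counts the higher frequency
    # comes first, which reproduces the S1000 rule for the second pitch
    # frequency; applying it to the first pitch frequency is an assumption.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], -item[0]))
    first_freq, first_count = ranked[0]
    if len(ranked) == 1:
        # Assumption: with no second pitch frequency, use the first one.
        return first_freq
    second_freq, second_count = ranked[1]
    threshold = ratio_a * first_count  # pitch determination threshold (S1020)
    if second_freq > first_freq and second_count >= threshold:
        return second_freq  # S1030
    return first_freq  # S1040
```

With five unit sections at one frequency and three at a higher one, the threshold is 5 × (1/2.3) ≈ 2.17, so the higher frequency is selected; if the three-section frequency is lower than the dominant one, the dominant frequency is kept regardless of the counts.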

For example, suppose that after steps S910 to S980 are repeated, the period between the first note start timing and the first note end timing shown in FIG. 14(A) is specified in S980 as the first note period. In this first note period, the second pitch frequency f02_t1_hi is higher than the first pitch frequency f01_t1. The second pitch count is 3, which is greater than the first pitch count of 5 multiplied by the ratio A (A = 1/2.3 in this embodiment), that is, about 2.17.

In this case, as shown in FIG. 14(B), the note pitch for the first note period (the first note pitch in the figure) is specified in S1030 as the pitch corresponding to the second pitch frequency f02_t1_hi.

Note that the frequency f02_t1_low occupies the same share of the first note period as the second pitch frequency f02_t1_hi. Because f02_t1_hi is the higher of the two, however, it is the one taken as the second pitch frequency, and the pitch of the first note period becomes the pitch corresponding to f02_t1_hi.

Further, suppose that after steps S910 to S980 are repeated, the period between the second note start timing and the second note end timing shown in FIG. 14(A) is specified in S980 as the second note period. In this second note period, the second pitch frequency f02_t2_hi is lower than the first pitch frequency f01_t2, so the determination in S1010 is negative. The second pitch count is 3 and the first pitch count is 4; with A = 1/2.3 the pitch determination threshold would be about 1.74, but the comparison in S1020 is not reached.

In this case, as shown in FIG. 14(B), the note pitch for the second note period (the second note pitch in the figure) is specified in S1040 as the pitch corresponding to the first pitch frequency f01_t2.
Next, in S1050, it is determined whether every unit section defined in the processed audio data has been selected in S910. If an unselected unit section remains, the process returns to S910, and S910 to S1050 are repeated.

If, on the other hand, the determination in S1050 shows that no unselected unit section remains, the transcription process ends and the process proceeds to S210 of the music search process.
In short, when a sound generation start timing and the sound generation end timing to be paired with it exist along the time progression of the processed audio data, the transcription process takes that sound generation start timing as the note start timing and that sound generation end timing as the note end timing. When two sound generation start timings exist along the time progression of the processed audio data with no sound generation end timing between them, the earlier sound generation start timing is taken as the note start timing and the later one as the note end timing. In either case, the period between the note start timing and the note end timing is specified as the note period.
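The pairing rule summarized above can be sketched as a small state machine. This is a hedged illustration rather than the embodiment itself: the event tuples, the function name `estimate_note_periods`, and the choice to ignore an end timing that has no pending start are assumptions, and the sketch follows the summary rule (the later start timing itself closes the previous note) rather than any offset by a set time length.

```python
def estimate_note_periods(events):
    """Pair sound generation timings into (note_start, note_end) periods.

    events is a time-ordered list of ("start", t) or ("end", t) tuples
    (hypothetical representation of the detected timings).
    """
    periods = []
    pending_start = None  # plays the role of the start acquisition flag
    for kind, t in events:
        if kind == "start":
            if pending_start is not None:
                # Two starts with no end between them: the later start
                # closes the note opened by the earlier start.
                periods.append((pending_start, t))
            pending_start = t
        elif kind == "end":
            if pending_start is not None:
                periods.append((pending_start, t))
                pending_start = None  # flag cleared, as in S970
            # Assumption: an end with no pending start is ignored.
    return periods
```

For instance, the sequence start 0, end 5, start 6, start 10, end 14 yields the periods (0, 5), (6, 10), and (10, 14).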

In addition, in the transcription process, if the second pitch frequency is higher than the first pitch frequency and the second pitch count is at least the specified ratio A of the first pitch count, the pitch corresponding to the second pitch frequency is specified as the note pitch of that note period. Conversely, if the second pitch frequency is lower than the first pitch frequency, or the second pitch count is less than the specified ratio A of the first pitch count, the pitch corresponding to the first pitch frequency is specified as the note pitch of that note period.
<Transcription result matching process>
Next, the transcription result matching process started in S210 of the music search process will be described.

FIG. 15 is a flowchart showing the procedure of the transcription result matching process.
As shown in FIG. 15, when the transcription result matching process is started, in S1210 the voice note data generated in the preceding transcription process is grouped into words, each consisting of a predefined number of notes (the specified number of notes) that are consecutive along the time progression of the processed audio data. The grouping is performed so that the words partly overlap one another in the voice note data they contain. Hereinafter, each group of voice note data formed in this way is called word note data.
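The overlapping grouping of S1210 resembles a sliding window over the note sequence. The sketch below is illustrative only: the function name `make_words`, the representation of each note as a single value, and the default step of one note between successive words are assumptions, since the description states only that the words partly overlap.

```python
def make_words(notes, word_size, step=1):
    """Group consecutive notes into overlapping words.

    notes is a time-ordered list of note values (hypothetical
    representation); word_size is the specified number of notes.
    A step smaller than word_size makes neighbouring words overlap.
    """
    if word_size <= 0 or step <= 0:
        raise ValueError("word_size and step must be positive")
    return [
        tuple(notes[i:i + word_size])
        for i in range(0, len(notes) - word_size + 1, step)
    ]
```

With four notes and a word size of three, this produces two words that share their middle two notes.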

Further, in S1220, one song against whose reference note data (that is, guide melody) the word note data is to be collated (hereinafter, the note collation song) is selected from the songs corresponding to the music data acquired from the server 40 and stored in the storage unit 28.

Next, in S1230, one item of word note data is acquired from all the word note data generated in S1210. The item acquired is the one containing the voice note data closest to the start of the voice in the time progression of the processed audio data.

In S1240, reference note data for the specified number of notes, consecutive along the time progression, is grouped into a word and acquired from the reference note data of the note collation song determined in S1220. This grouping starts from the reference note data for the constituent notes closest to the start of the reference melody. Hereinafter, the reference note data grouped and acquired in S1240 is called comparison note data.

Next, in S1250, the word note data acquired in S1230 is collated with the comparison note data acquired in S1240. If the word note data and the comparison note data match (S1260: YES), the process proceeds to S1270.

In S1270, the note coincidence degree and the in-music cumulative coincidence degree, both described in detail later, are derived, and the derived in-music cumulative coincidence degree is stored in association with a constituent note number. The process then proceeds to S1280. The constituent note number associated with the in-music cumulative coincidence degree is the number of the first constituent note, along the time progression of the reference melody, among the specified number of constituent notes forming the comparison note data.

If, on the other hand, the collation in S1250 shows that the word note data and the comparison note data do not match (S1260: NO), the process proceeds to S1280.
In S1280, it is determined whether all the reference note data has been grouped into words and the word note data acquired in S1230 has been collated with every item of comparison note data generated by that grouping. If the word note data has not yet been collated with every item of comparison note data, the process returns to S1240. In S1240 reached in this way, reference note data for the specified number of notes is grouped and acquired so that it partly overlaps, along the time progression of the reference melody, the reference note data grouped in the previous S1240. That is, new comparison note data is generated, and the process proceeds to S1250.

As a result, S1240 to S1280 are repeated until one item of word note data has been collated with all the reference note data of one song.
If the determination in S1280 shows that all the reference note data has been grouped into words and the word note data has been collated with every item of comparison note data generated by that grouping, the process proceeds to S1290. In S1290, it is determined whether every item of word note data has been acquired and collated with the comparison note data.

If the determination in S1290 shows that not every item of word note data has been collated with the comparison note data, the process returns to S1230. In S1230, one item of word note data is acquired from the word note data not yet collated with the comparison note data. The item acquired is the one made up of the voice note data closest to the start of the voice in the time progression of the input voice.

Thereafter, the steps from S1230 to S1290 are repeated until an affirmative determination is made in S1290. Hereinafter, one pass through S1230 to S1290 is called a separate-note collation cycle. Within a separate-note collation cycle, one pass through S1240 to S1280, performed between the acquisition of one item of word note data and the acquisition of the next, is called a same-note collation cycle.

When an affirmative determination is made in S1260 while the same-note collation cycle is being repeated, the process proceeds to S1270. In S1270, it is determined whether the comparison note data that matched the word note data in the current separate-note collation cycle is continuous, in the time progression of the reference melody, with comparison note data that matched word note data in the previous separate-note collation cycle (hereinafter, the note connection determination). Specifically, the note connection determination is affirmative if the constituent note numbers associated with a note coincidence degree in the previous separate-note collation cycle include the number of the constituent note that immediately precedes, in the time progression of the reference melody, the constituent note number of the comparison note data determined to match the word note data this time.

If the note connection determination is affirmative, the note coincidence degree is derived as the initial specified value raised to a power whose exponent is the number of separate-note collation cycles that have been determined affirmative in succession. If the note connection determination is negative, the initial specified value itself is used as the note coincidence degree.

In other words, the note coincidence degree becomes larger the more word note data, taken in order along the time progression of the processed audio data, match in succession comparison note data taken in order along the time progression of the reference melody of the note collation song.

Further, the sum of the derived note coincidence degrees is derived as the in-music cumulative coincidence degree.
When an affirmative determination is made in S1290, the process proceeds to S1300. In S1300, the largest of the in-music cumulative coincidence degrees for the note collation song determined in S1220 is stored in the storage unit 28 in association with the song title data of that note collation song. That is, the in-music cumulative coincidence degree associated with the song title data in S1300 is the largest of all the in-music cumulative coincidence degrees derived through the repeated separate-note collation cycles for one note collation song.
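One possible reading of S1230 to S1300 for a single song is sketched below. It rests on several assumptions that the description leaves open: words match only when they are exactly equal, successive comparison words are offset by one constituent note (so a word's index doubles as its first constituent note number), the initial specified value is 2.0, an unconnected match counts as a run of length 1, and the cumulative degree is summed along each connected run. The function name `score_song` is hypothetical.

```python
def score_song(query_words, reference_words, base=2.0):
    """Return the largest in-music cumulative coincidence degree.

    query_words and reference_words are time-ordered lists of words
    (tuples of notes). base stands in for the initial specified value.
    """
    previous = {}  # reference index -> (run length, cumulative degree)
    best = 0.0
    for word in query_words:  # one separate-note collation cycle
        current = {}
        for index, reference in enumerate(reference_words):
            if word != reference:  # S1260: NO
                continue
            if index - 1 in previous:
                # Note connection determination is affirmative.
                run, cumulative = previous[index - 1]
                run += 1
                degree = base ** run
                cumulative += degree
            else:
                # Not connected: the initial specified value itself.
                run, degree = 1, base
                cumulative = degree
            current[index] = (run, cumulative)
            best = max(best, cumulative)
        previous = current
    return best
```

Under these assumptions, three query words that match three consecutive reference words score 2 + 4 + 8 = 14, whereas the same three matches separated by a gap score only 2 each.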

Next, in S1310, it is determined whether every song corresponding to the music data stored in the storage unit 28 has been selected as the note collation song. If not, the process returns to S1220. In S1220 reached in this way, a new song is selected as the note collation song from the songs not yet selected, and the process proceeds to S1230. That is, the steps from S1230 to S1310 are repeated until the word note data has been collated with the reference note data in all the music data stored in the storage unit 28.

If the determination in S1310 shows that every song stored in the storage unit 28 has been selected as the note collation song, the process proceeds to S1320.
In S1320, the song corresponding to the largest of the in-music cumulative coincidence degrees stored in the storage unit 28 in S1300 is specified as the intended expected song. Further, in S1320, the song title data for the specified intended expected song is acquired, the corresponding song title is displayed on the display unit 22, and the song title is output as voice from the speaker 27. That is, the title of the intended expected song is announced.
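The selection in S1320 amounts to taking the song with the largest stored score. The following is a minimal sketch under the assumption that the stored values are available as a mapping from song title to in-music cumulative coincidence degree; the function name `pick_expected_song` and the behaviour on ties (the first title encountered wins) are not taken from the description.

```python
def pick_expected_song(scores):
    """Return the title whose in-music cumulative coincidence degree is largest.

    scores maps a song title to the value stored for it in S1300
    (hypothetical representation of the storage unit contents).
    """
    if not scores:
        raise ValueError("no songs were collated")
    # max() keeps the first title encountered among equal scores.
    return max(scores, key=scores.get)
```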

After that, the transcription result matching process ends, and the music search process ends as well.
In short, the transcription result matching process of this embodiment collates the voice note data generated in the transcription process with the reference note data prepared in advance for each song. As the collation result, it derives an in-music cumulative coincidence degree that grows larger the more comparison note data, consecutive along the time progression of the reference melody of the note collation song, are matched in succession by voice note data consecutive along the time progression of the processed audio data. The song corresponding to the largest of the derived in-music cumulative coincidence degrees is then detected as the intended expected song.
[Effect of the embodiment]
As described above, the start/end timing estimation process of this embodiment deletes, from the detected sound generation start timings, those that fall within a vibrato period, and leaves only those that fall outside the vibrato periods.

If a remaining sound generation start timing has a sound generation end timing to be paired with it, the transcription process takes that sound generation start timing and sound generation end timing as the note start timing and note end timing, respectively. If, on the other hand, two remaining sound generation start timings exist with no sound generation end timing between them, the transcription process takes the earlier of the two as the note start timing and the later as the note end timing. The transcription process then estimates the period between the note start timing and the note end timing as the note period.

Therefore, the audio processing device 20 of this embodiment prevents a note period that begins inside a vibrato period from being estimated. As a result, the audio processing device 20 of this embodiment improves the accuracy with which note periods in the input voice are estimated.

Note that, because the audio processing device 20 of this embodiment treats the end of the input voice as a sound generation end timing, a note period can also be estimated for the last sound generation start timing (that is, note start timing) in the processed audio data.
[Other Embodiments]
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be carried out in various forms without departing from the gist of the invention.

For example, the pitch estimation process executed in the music search process is not limited to the one described in the above embodiment. Any process may be used as long as it detects, for each unit section defined in the processed audio data, the voice fundamental frequency f0 of the input voice in that unit section.

In the start/end timing estimation process of the above embodiment, a sound generation end timing is detected at or after the paired sound generation start timing. The detection is not limited to this and may instead be performed, for example, at or after the end of the vibrato period.

Further, in the above embodiment, the sound generation end timing may be detected at or after the unit section that follows the sound generation start timing, along the time progression of the processed audio data, by a predefined number of unit sections.

In S530 of the start/end timing estimation process of the above embodiment, a predefined value is used as the noise sound pressure, but the noise sound pressure is not limited to this. For example, the average sound pressure from the start of the processed audio data to the first sound generation start timing along the time progression may be used as the noise sound pressure, or the larger of the predefined value and that average sound pressure may be used.
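The three alternatives for the noise sound pressure can be compared side by side in a short sketch. The function name `noise_sound_pressure`, the `mode` labels, the per-unit-section list of sound pressures, and the fallback to the predefined value when no unit section precedes the first start timing are assumptions made for illustration.

```python
def noise_sound_pressure(pressures, first_onset_index, specified_value, mode="max"):
    """Choose the noise sound pressure by one of the described alternatives.

    pressures holds one sound pressure per unit section, and
    first_onset_index is the index of the first sound generation start
    timing (both hypothetical representations).
    """
    if mode not in ("specified", "average", "max"):
        raise ValueError("unknown mode: " + mode)
    if mode == "specified" or first_onset_index <= 0:
        # Assumption: fall back to the predefined value when there is
        # nothing to average before the first start timing.
        return specified_value
    leading = pressures[:first_onset_index]
    average = sum(leading) / len(leading)
    if mode == "average":
        return average
    return max(specified_value, average)  # larger of the two
```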

In the music search process of the above embodiment, the audio data processed is data that was input through the microphone 24 and then stored in the storage unit 28. The audio data to be processed is not limited to data input through the microphone 24 and may be, for example, audio data acquired from the server 40 or from another audio processing device 20. In that case, the microphone 24 and the voice input unit 25 may be omitted from the audio processing device 20.

Conversely, the music search process of the above embodiment may directly process the audio data immediately after it is sampled by the voice input unit 25. That is, the music search process may process the voice input through the microphone 24 in real time.

The audio processing device 20 of the above embodiment also need not include the speaker 27 and the voice output unit 26.
In the above embodiment, the music search process is executed by the audio processing device 20, but it may instead be executed by the server 40. In that case, the audio data must be transferred from the audio processing device 20 to the server 40.

The music search system 1 may also consist of the audio processing device 20 alone. In that case, the music data must be stored in the storage unit 28 in advance.
In the music search process of the above embodiment, the transcription result matching process is executed in S210, but S210 may be omitted from the music search process. That is, the audio processing device 20 of the above embodiment may be configured as a so-called transcription device.
[Correspondence between Embodiment and Claims]
Finally, the correspondence between the description of the above embodiment and the description of the claims will be explained.

The function obtained by executing S510 and S520 in the start/end timing estimation process of the above embodiment corresponds to the sound pressure transition specifying means of the present invention, and the function obtained by executing S540 to S610 corresponds to the start timing detection means of the present invention.

The function obtained by executing S670 to S730 in the start/end timing estimation process of the above embodiment corresponds to the vibrato period specifying means of the present invention, and the function obtained by executing S750 corresponds to the in-period timing removing means of the present invention.

The function obtained by executing S910 to S980 in the transcription process of the above embodiment corresponds to the note period estimating means of the present invention.

DESCRIPTION OF SYMBOLS
1 … music search system; 20 … audio processing device; 21 … communication unit; 22 … display unit; 23 … operation reception unit; 24 … microphone; 25 … voice input unit; 26 … voice output unit; 27 … speaker; 28 … storage unit; 30 … control unit; 31 … ROM; 32 … RAM; 33 … CPU; 40 … server; 41 … storage device; 42 … microcomputer

Claims

An audio processing device that estimates, from input speech that is continuous along the progress of time, note periods each representing a period that can be regarded as one note, and performs transcription by specifying the pitch in each note period, the device comprising:
sound pressure transition specifying means for specifying, from the input speech, a sound pressure transition representing the transition of the sound pressure in the input speech along the progress of time;
start timing detection means for detecting, with the start timing of each note period taken as a note start timing, each time point in a section where the sound pressure transition specified by the sound pressure transition specifying means is monotonically increasing at which the rate of increase in sound pressure over a first specified period defined in the sound pressure transition first becomes equal to or greater than a predetermined specified value along the progress of time, as the note start timing;
vibrato period specifying means for specifying, with a period of the input speech uttered with vibrato taken as a vibrato period, the vibrato period based on a pitch transition representing the transition of the pitch in the input speech along the progress of time;
intra-period timing removing means for removing, with each note start timing detected by the start timing detection means that corresponds to the vibrato period specified by the vibrato period specifying means taken as an intra-period timing, the intra-period timing from the detection result of the start timing detection means; and
note period estimating means for specifying a note end timing paired with each note start timing remaining after the intra-period timing has been removed by the intra-period timing removing means, and estimating each period between a paired note start timing and note end timing as the note period,
wherein the vibrato period specifying means comprises:
increase/decrease detection means for detecting, with each of a plurality of periods defined so as to be continuous with one another along the progress of time over the entire pitch transition taken as a second specified period, and the pitch transition within a second specified period taken as a period pitch transition, increase sections in which the pitch increases and decrease sections in which the pitch decreases in the period pitch transition, if the range of pitch fluctuation in the period pitch transition is equal to or less than a predetermined specified range; and
period specifying means for specifying the second specified period corresponding to the period pitch transition as the vibrato period if the number of increase sections and decrease sections detected by the increase/decrease detection means is equal to or greater than a predetermined specified number.
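
The detection, vibrato test, and removal steps recited in this claim can be illustrated with a minimal sketch. It is not the embodiment's implementation: the function names, the frame-indexed lists, the parameters `window`, `min_rate`, `period_len`, `max_range`, and `min_sections`, and the choice to report the frame at the end of the first specified period as the note start timing are all assumptions made for illustration.

```python
from typing import List, Tuple


def detect_note_starts(pressure: List[float], window: int, min_rate: float) -> List[int]:
    """Report, once per strictly increasing run, the first frame whose rise
    over the preceding `window` frames reaches `min_rate` per frame."""
    starts = []
    run_start = 0      # first frame of the current increasing run
    reported = False   # True once this run has produced a start timing
    for i in range(1, len(pressure)):
        if pressure[i] <= pressure[i - 1]:
            run_start, reported = i, False   # the increasing run is broken
            continue
        if not reported and i - run_start >= window:
            rate = (pressure[i] - pressure[i - window]) / window
            if rate >= min_rate:
                starts.append(i)
                reported = True
    return starts


def find_vibrato_periods(pitch: List[float], period_len: int,
                         max_range: float, min_sections: int) -> List[Tuple[int, int]]:
    """Return [begin, end) frame ranges of fixed-length periods judged as vibrato.
    A trailing partial period is ignored (an assumption of this sketch)."""
    periods = []
    for begin in range(0, len(pitch) - period_len + 1, period_len):
        seg = pitch[begin:begin + period_len]
        if max(seg) - min(seg) > max_range:
            continue   # pitch moves too far to be treated as vibrato
        sections = 0   # number of increase sections plus decrease sections
        prev_sign = 0
        for a, b in zip(seg, seg[1:]):
            sign = (b > a) - (b < a)
            if sign != 0 and sign != prev_sign:
                sections += 1
                prev_sign = sign
        if sections >= min_sections:
            periods.append((begin, begin + period_len))
    return periods


def remove_in_period_starts(starts: List[int],
                            periods: List[Tuple[int, int]]) -> List[int]:
    """Drop note start timings that fall inside any vibrato period."""
    return [s for s in starts if not any(b <= s < e for b, e in periods)]
```

In this sketch a start timing produced by a sound pressure swell inside a vibrato period is discarded, while a start timing outside any vibrato period is kept.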

The audio processing device according to claim 1, wherein the note period estimating means specifies, as the note end timing paired with a note start timing remaining after the intra-period timing has been removed, a sound pressure fluctuation time point at which, after that note start timing, the sound pressure in the sound pressure transition first becomes equal to or lower than the sound pressure at that note start timing.

The audio processing device according to claim 1, wherein the note period estimating means, taking, of two note start timings that are adjacent to each other along the progress of time after the intra-period timing has been removed, the earlier note start timing as a previous start timing and the later note start timing as a subsequent start timing, specifies a time point that precedes the subsequent start timing by a preset time length as the note end timing paired with the previous start timing.

The audio processing device according to claim 3, wherein, if a sound pressure fluctuation time point at which the sound pressure in the sound pressure transition becomes equal to or lower than the sound pressure at the previous start timing exists before the subsequent start timing, the note period estimating means specifies that sound pressure fluctuation time point as the note end timing paired with the previous start timing.

The audio processing device according to any one of claims 1 to 4, wherein the note period estimating means specifies the end of the input speech along the progress of time as the note end timing paired with the last note start timing along the progress of time among the note start timings remaining after the intra-period timing has been removed.
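
The note end timing rules in the preceding dependent claims can be combined in several ways. The following is a minimal sketch of one possible combination, not the embodiment's implementation: the function name `estimate_note_periods`, the frame-indexed `pressure` list, and the `gap` parameter (standing in for the preset time length) are assumptions, as is the priority order used here (sound pressure drop first, then the fixed gap, with the end of the input used for the last note).

```python
from typing import List, Tuple


def estimate_note_periods(pressure: List[float], starts: List[int],
                          gap: int) -> List[Tuple[int, int]]:
    """Pair each note start frame with a note end frame.

    pressure: sound pressure per frame
    starts:   note start frames, ascending, after vibrato-related starts are removed
    gap:      preset number of frames placed before the next start
    """
    periods = []
    last_frame = len(pressure) - 1
    for k, start in enumerate(starts):
        if k == len(starts) - 1:
            # Last note: it ends at the end of the input.
            periods.append((start, last_frame))
            continue
        next_start = starts[k + 1]
        # First frame before the next start whose sound pressure has fallen
        # to or below the sound pressure at this note's start, if any.
        drop = next((i for i in range(start + 1, next_start)
                     if pressure[i] <= pressure[start]), None)
        if drop is not None:
            end = drop
        else:
            # Otherwise end a fixed gap before the next start.
            end = max(start, next_start - gap)
        periods.append((start, end))
    return periods
```

Each returned pair is one estimated note period, from note start timing to note end timing.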

The audio processing device according to any one of claims 1 to 5, further comprising:
fundamental frequency determination means for determining, for each unit section continuous along the time axis in the input speech, a speech fundamental frequency in that unit section;
discontinuity detecting means for detecting, in a frequency transition in which the speech fundamental frequencies determined by the fundamental frequency determination means are arranged along the time axis, a discontinuous region in which the speech fundamental frequency is discontinuous between successive unit sections;
frequency correction means for performing frequency correction in which the speech fundamental frequency of each unit section corresponding to the discontinuous region detected by the discontinuity detecting means, the speech fundamental frequency being that specified by the sound pressure transition specifying means, is corrected so as to run from an immediately preceding frequency, which is the speech fundamental frequency of the unit section immediately before the discontinuous region, to an immediately following frequency, which is the speech fundamental frequency of the unit section immediately after the discontinuous region; and
pitch estimation means for specifying the speech fundamental frequency in a unit section after the frequency correction by the frequency correction means has been performed as the pitch in the note period corresponding to that unit section.
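
As an illustration of the frequency correction recited in this claim, the sketch below bridges a discontinuous region by linear interpolation between the immediately preceding and immediately following frequencies. Both the linear interpolation and the detector are assumptions: the claim only requires that the corrected values run from the preceding frequency to the following frequency, and this hypothetical detector treats a region as discontinuous when the frequency jumps by more than `max_jump` and later returns to within `max_jump` of the pre-jump value. The function names and `max_jump` are not taken from the source.

```python
from typing import List, Tuple


def find_discontinuous_regions(f0: List[float], max_jump: float) -> List[Tuple[int, int]]:
    """Return [begin, end) frame ranges that jump away from the preceding frame
    and later return to within max_jump of the pre-jump frequency."""
    regions = []
    i = 1
    while i < len(f0):
        if abs(f0[i] - f0[i - 1]) > max_jump:
            before = f0[i - 1]
            j = i + 1
            while j < len(f0) and abs(f0[j] - before) > max_jump:
                j += 1
            if j < len(f0):
                regions.append((i, j))   # frames i .. j-1 are discontinuous
                i = j + 1                # resume after the first stable frame
                continue
        i += 1
    return regions


def correct_regions(f0: List[float], regions: List[Tuple[int, int]]) -> List[float]:
    """Replace each region with a straight line from the frame just before it
    to the frame just after it."""
    out = list(f0)
    for begin, end in regions:
        before, after = f0[begin - 1], f0[end]
        steps = end - begin + 1
        for k in range(begin, end):
            out[k] = before + (after - before) * (k - begin + 1) / steps
    return out
```

A jump that never returns to the earlier frequency is left untouched by this detector, on the assumption that it reflects a real change of note rather than an estimation error.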

A program for causing a computer to function as an audio processing device that performs transcription by specifying, from input speech that is continuous along the progress of time, note periods each representing a period that can be regarded as one note, and the pitch in each note period, the program causing the computer to execute:
a sound pressure transition specifying procedure for specifying, from the input speech, a sound pressure transition representing the transition of the sound pressure in the input speech along the progress of time;
a start timing detection procedure for detecting, with the start timing of each note period taken as a note start timing, each time point in a section where the sound pressure transition specified in the sound pressure transition specifying procedure is monotonically increasing at which the rate of increase in sound pressure over a first specified period defined in the sound pressure transition becomes equal to or greater than a predetermined specified value along the progress of time, as the note start timing;
a vibrato period specifying procedure for specifying, with a period of the input speech uttered with vibrato taken as a vibrato period, the vibrato period based on a pitch transition representing the transition of the pitch in the input speech along the progress of time;
an intra-period timing removal procedure for removing, with each note start timing detected in the start timing detection procedure that corresponds to the vibrato period specified in the vibrato period specifying procedure taken as an intra-period timing, the intra-period timing from the detection result of the start timing detection procedure; and
a note period estimation procedure for specifying a note end timing paired with each note start timing remaining after the intra-period timing has been removed in the intra-period timing removal procedure, and estimating each period between a paired note start timing and note end timing as the note period,
wherein, as the vibrato period specifying procedure, the program further causes the computer to execute:
an increase/decrease detection procedure for detecting, with each of a plurality of periods defined so as to be continuous with one another along the progress of time over the entire pitch transition taken as a second specified period, and the pitch transition within a second specified period taken as a period pitch transition, increase sections in which the pitch increases and decrease sections in which the pitch decreases in the period pitch transition, if the range of pitch fluctuation in the period pitch transition is equal to or less than a predetermined specified range; and
a period specifying procedure for specifying the second specified period corresponding to the period pitch transition as the vibrato period if the number of increase sections and decrease sections detected in the increase/decrease detection procedure is equal to or greater than a predetermined specified number.