JP4809913B2

JP4809913B2 - Phoneme division apparatus, method, and program

Info

Publication number: JP4809913B2
Application number: JP2009159513A
Authority: JP
Inventors: 孝中村; 昇宮崎; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-07-06
Filing date: 2009-07-06
Publication date: 2011-11-09
Anticipated expiration: 2029-07-06
Also published as: JP2011013594A

Description

The present invention relates to a technique for automatically determining phoneme boundary times from speech.

In conventional automatic phoneme segmentation, the statistical average spectral pattern of each phoneme is prepared as a distribution, the similarity (likelihood) between that distribution and the spectral pattern of the given speech is calculated, and the most likely phoneme is assigned to each frame to obtain the phoneme boundary times. That is, a frame at which the assigned phoneme changes is taken as a phoneme boundary time (see, for example, Non-Patent Document 1).

[Non-Patent Document 1] 河井恒, 戸田智基 (Hisashi Kawai and Tomoki Toda), "Evaluation of automatic phoneme segmentation for concatenative speech synthesis" (波形接続型音声合成のための自動音素セグメンテーションの評価), IEICE Technical Report, SP2002-170, pp. 5-10, January 2003

However, the statistical average spectral pattern of a phoneme is smoothed by the averaging, and fine detail of the spectral pattern is often lost. Consequently, when the spectral pattern changes continuously and smoothly, it does not differ much before and after the phoneme boundary time, the likelihoods computed from the statistical average spectral patterns show no clear difference, and the estimated phoneme boundary time can deviate substantially from the correct one.

To solve this problem, the speech features of each frame of the input speech are extracted. Using statistics on the speech features of a plurality of phonemes, the most likely phoneme is assigned to each frame, and when the phonemes assigned to two consecutive frames differ, a phoneme boundary time is estimated by taking any time within the time range spanning those two frames as the phoneme boundary time. Whether each phoneme boundary time is reliable is then determined. To each phoneme forming a phoneme boundary whose time was determined to be unreliable, a time is assigned that is longer the larger the mean of that phoneme's duration and that is stretched or compressed more the larger the variance of that duration; the time of the unreliable phoneme boundary is thereby re-estimated.

Re-estimating the unreliable phoneme boundary times using the mean and variance of the phoneme durations allows phoneme boundary times to be estimated more accurately than before.

FIG. 1: Functional block diagram of an example of the phoneme division apparatus.
FIG. 2: Functional block diagram of Example 1 of the phoneme boundary time estimation result reliability determination unit.
FIG. 3: Functional block diagram of Example 2 of the phoneme boundary time estimation result reliability determination unit.
FIG. 4: Functional block diagram of an example of the second phoneme boundary time estimation unit.
FIG. 5: Flowchart of an example of the phoneme division method.
FIG. 6: Flowchart of Example 2 of the processing of the phoneme boundary time estimation result reliability determination unit.
FIG. 7: Flowchart of an example of the processing of the second phoneme boundary time estimation unit.
FIG. 8: Diagram for explaining the phoneme boundary time.
FIG. 9: Diagram showing the conditions for deciding whether a boundary can be trusted.
FIG. 10: Diagram showing a concrete example of phoneme boundary time estimation.

Embodiments of the present invention are described in detail below.

FIG. 1 is a functional block diagram of an example of the phoneme division apparatus according to the present invention. FIG. 5 is a flowchart of an example of the phoneme division method according to the present invention.

The phoneme division apparatus includes, for example, a speech feature extraction unit 1, a first phoneme boundary time estimation unit 2, a phoneme boundary time estimation result reliability determination unit 3, a second phoneme boundary time estimation unit 4, a speech feature storage unit 6, and a duration distribution storage unit 7.

<Step S1>
The input speech is supplied to the speech feature extraction unit 1. The speech feature extraction unit 1 divides the input speech into frames of a fixed length and calculates speech features for each frame (step S1). The speech features of each frame are sent to the first phoneme boundary time estimation unit 2.

Any speech feature may be used as long as phonemes can be assigned to frames with it. For example, MFCCs, cepstra, mel-cepstra, filter bank outputs, or mel filter bank outputs, which are commonly used in speech recognition and similar fields, can be used.
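Step S1 can be illustrated with a minimal sketch. The text only says that the speech is divided into fixed-length frames; the hop size, the dropping of a trailing partial frame, and the toy feature `mean_energy` (a stand-in for MFCCs or similar) are assumptions made for illustration, and the function names are hypothetical.

```python
def split_into_frames(samples, frame_len, hop):
    """Split a sequence of samples into fixed-length frames.

    frame_len and hop are given in samples. Trailing samples that do not
    fill a whole frame are dropped (an assumption of this sketch).
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop
    return frames


def mean_energy(frame):
    """Toy per-frame feature: mean squared amplitude (stand-in for MFCC etc.)."""
    return sum(x * x for x in frame) / len(frame)


# Ten made-up samples, frames of 4 samples, advancing 2 samples per frame.
samples = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
frames = split_into_frames(samples, frame_len=4, hop=2)
features = [mean_energy(f) for f in frames]
```

A real front end would replace `mean_energy` with a feature vector per frame; only the framing logic carries over.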

<Step S2>
The speech feature storage unit 6 is assumed to store statistics on the speech features of a plurality of phonemes. For example, statistical spectral patterns of phonemes are stored in a form commonly used in speech recognition and similar fields, such as an HMM (hidden Markov model), vector quantization, or a neural network.

For example, a statistical average spectrum is stored for each monophone or triphone unit. A monophone is a phoneme model that ignores which phonemes precede and follow the phoneme in question. A triphone is a phoneme model that takes the preceding and following phonemes into account; it treats a-i+u (/i/ preceded by /a/ and followed by /u/) and o-i+e (/i/ preceded by /o/ and followed by /e/) as distinct.

Using the speech features and the statistics read from the speech feature storage unit 6, the first phoneme boundary time estimation unit 2 assigns the most likely phoneme to each frame and, when the phonemes assigned to two consecutive frames differ, estimates a phoneme boundary time by taking any time within the time range spanning those two frames as the phoneme boundary time (step S2). The phoneme boundary time estimation result, which includes the phoneme boundary times, is sent to the phoneme boundary time estimation result reliability determination unit 3.

Specifically, using the speech features and the statistics read from the speech feature storage unit 6, the first phoneme boundary time estimation unit 2 calculates, for each frame, the likelihood of assigning each phoneme to that frame, and assigns the phoneme with the highest likelihood to the frame. Where needed, the likelihood of the phoneme assigned to each frame is sent to the phoneme boundary time estimation result reliability determination unit 3.

The time range spanning two frames is explained with reference to FIG. 8. When the phoneme assigned to frame k (/a/ in this example) differs from the phoneme assigned to frame k+1 (/i/ in this example), any time within the range from the start time of frame k to the end time of frame k+1 is taken as the phoneme boundary time, as illustrated in FIG. 8. For example, the center time of frame k+1 is taken as the phoneme boundary time.
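The first-pass estimate of step S2 can be sketched as follows. The per-frame likelihoods are made-up numbers, the frame length and hop are hypothetical parameters, and the choice of the center of frame k+1 follows the example given above; `assign_phonemes` and `first_pass_boundaries` are illustrative names, not part of the described apparatus.

```python
def assign_phonemes(frame_likelihoods):
    """For each frame, pick the phoneme with the highest likelihood.

    frame_likelihoods: list of dicts mapping phoneme -> likelihood.
    Returns a list of (phoneme, likelihood) tuples, one per frame.
    """
    return [max(scores.items(), key=lambda kv: kv[1]) for scores in frame_likelihoods]


def first_pass_boundaries(labels, frame_len_ms, hop_ms):
    """Return (time_ms, left_phoneme, right_phoneme) for each label change.

    Frame k is assumed to start at k * hop_ms. The boundary time is the
    center of frame k+1, one of the admissible choices within the range
    from the start of frame k to the end of frame k+1.
    """
    boundaries = []
    for k in range(len(labels) - 1):
        if labels[k] != labels[k + 1]:
            center = (k + 1) * hop_ms + frame_len_ms / 2.0
            boundaries.append((center, labels[k], labels[k + 1]))
    return boundaries


# Made-up likelihoods for four 10 ms frames covering /a/ followed by /i/.
frame_likelihoods = [
    {"a": 0.9, "i": 0.1},
    {"a": 0.7, "i": 0.3},
    {"a": 0.2, "i": 0.8},
    {"a": 0.1, "i": 0.9},
]
assigned = assign_phonemes(frame_likelihoods)
labels = [phoneme for phoneme, _ in assigned]
boundaries = first_pass_boundaries(labels, frame_len_ms=10, hop_ms=10)
```

The likelihoods kept in `assigned` are the values that may be forwarded to the reliability determination unit when it works from likelihoods.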

The first phoneme boundary time estimation unit 2 may estimate the phoneme boundary times by the method described in Non-Patent Document 1.

<Step S3>
The phoneme boundary time estimation result reliability determination unit 3 determines whether each phoneme boundary time estimated by the first phoneme boundary time estimation unit 2 is reliable (step S3). The reliability determination result, which indicates whether each phoneme boundary time is reliable, is sent to the second phoneme boundary time estimation unit 4.

Two examples of the determination processing performed by the phoneme boundary time estimation result reliability determination unit 3 are described below.

<<Example 1>>
FIG. 2 shows a functional block diagram of the phoneme boundary time estimation result reliability determination unit 3 according to Example 1.

The reliability determination condition storage unit 32 stores conditions on the phonemes assigned to the two consecutive frames that correspond to a phoneme boundary time that cannot be trusted. FIG. 9 shows an example of these conditions. The example of FIG. 9 has three conditions:
(1) if the phoneme boundary lies between a voiced sound and an unvoiced sound, the time of that boundary is trusted;
(2) if either of the phonemes forming the boundary is a fricative, the time of that boundary is trusted;
(3) if (2) does not apply and the boundary lies between two voiced sounds, the time of that boundary is not trusted.

The phonemes assigned to the two consecutive frames corresponding to a phoneme boundary time are input to the condition reliability determination unit 31. In the example of FIG. 8, the phonemes /a/ and /i/ are input. The condition reliability determination unit 31 decides whether the phoneme boundary time can be trusted by determining whether the input phonemes satisfy the conditions stored in the reliability determination condition storage unit 32.

The condition reliability determination unit 31 is explained using the case where the utterance "現実を" (genjitsu o) is input. Suppose that, as shown in FIG. 10, the phonemes /g/, /e/, /n/, /j/, /i/, /ts/, /u/, /o/ are associated with this utterance in that order.

For the phoneme boundaries /i/-/ts/ and /ts/-/u/, the phoneme /ts/ is unvoiced while /i/ and /u/ are voiced, so by condition (1) the times of these boundaries are determined to be reliable.

For the phoneme boundaries /n/-/j/ and /j/-/i/, the phoneme /j/ is a fricative while /n/ and /i/ are not, so by condition (2) the times of these boundaries are determined to be reliable.

The phoneme boundaries /g/-/e/, /e/-/n/, and /u/-/o/ do not satisfy condition (2), and the phonemes /g/, /e/, and /n/ (and likewise /u/ and /o/) are voiced, so by condition (3) the times of these boundaries are determined to be unreliable.
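The three conditions of FIG. 9 and the "genjitsu o" walk-through can be sketched as a small rule function. The phoneme classes below are hypothetical and cover only the phonemes of this example; treating /j/ as voiced and treating any pair not covered by conditions (1) to (3) as reliable are assumptions of the sketch, not statements from the text.

```python
# Hypothetical phoneme classes, limited to the phonemes of the example.
VOICED = {"g", "e", "n", "j", "i", "u", "o"}   # /ts/ is unvoiced
FRICATIVE = {"j"}                              # the text treats /j/ as a fricative


def is_boundary_reliable(left, right):
    """Apply conditions (1) to (3) to the two phonemes forming a boundary."""
    left_voiced = left in VOICED
    right_voiced = right in VOICED
    # (1) boundary between a voiced and an unvoiced sound: trusted
    if left_voiced != right_voiced:
        return True
    # (2) either phoneme is a fricative: trusted
    if left in FRICATIVE or right in FRICATIVE:
        return True
    # (3) not (2) and both phonemes voiced: not trusted
    if left_voiced and right_voiced:
        return False
    # Case not covered by the three conditions: assumed trusted in this sketch.
    return True


phonemes = ["g", "e", "n", "j", "i", "ts", "u", "o"]
verdicts = {(a, b): is_boundary_reliable(a, b) for a, b in zip(phonemes, phonemes[1:])}
unreliable = [pair for pair, ok in verdicts.items() if not ok]
```

With these classes, `unreliable` contains the same three boundaries that the walk-through marks as unreliable.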

<<Example 2>>
FIG. 3 shows a functional block diagram of the phoneme boundary time estimation result reliability determination unit 3 according to Example 2. FIG. 6 shows a flowchart of the processing according to Example 2.

In this example, the likelihood of the phoneme assigned to each frame, as calculated by the first phoneme boundary time estimation unit 2, and the phoneme boundary time estimation result are input to the phoneme boundary time estimation result reliability determination unit 3.

The likelihood change degree calculation unit 33 of the phoneme boundary time estimation result reliability determination unit 3 obtains the likelihood change degree, an index of how much the likelihood of the assigned phonemes changes over a predetermined number of frames that include the frame of the phoneme boundary time (step S31). The calculated likelihood change degree is sent to the likelihood reliability determination unit 34.

D defined by the following equation can be used as the likelihood change degree. In the equation, $t$ is the frame number of the phoneme boundary time; $K$ is a predetermined positive integer; $L_i$ is the likelihood of the phoneme assigned to frame $i$; and $w_k$ is the weight of the likelihood difference $|L_t - L_{t-k}| + |L_t - L_{t+k}|$, a predetermined non-negative real number that preferably decreases monotonically as $k$ increases. $K$ and $w_k$ are set appropriately, for example by experiment, according to the required specifications and performance. The frame of the phoneme boundary time is either of the two frames that form the phoneme boundary time; in the example of FIG. 8, it is frame k or frame k+1.
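The equation itself is not reproduced in this text. A form consistent with the definitions of $t$, $K$, $L_i$, and $w_k$ given here, offered as an assumption rather than as the original equation, is the weighted sum of the likelihood differences over $k = 1, \dots, K$:

$$
D = \sum_{k=1}^{K} w_k \left( \left| L_t - L_{t-k} \right| + \left| L_t - L_{t+k} \right| \right)
$$

Under this reading, D is large when the likelihood at the boundary frame differs sharply from the likelihoods of its neighbors, and small when the likelihood varies smoothly across the boundary.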

The likelihood reliability determination unit 34 compares the likelihood change degree with a predetermined threshold θ and determines that the phoneme boundary time is unreliable if the likelihood change degree is less than or equal to the threshold (step S32). The threshold is set appropriately, for example by experiment, according to the required specifications and performance.

For example, D−θ is calculated; if D−θ > 0, the phoneme boundary time is determined to be reliable, and if D−θ ≤ 0, it is determined to be unreliable. The likelihood change degree D and the threshold θ may of course also be compared directly, and the decision made from which is larger.
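Steps S31 and S32 can be sketched as below. The sketch assumes that D is the weighted sum over k = 1..K of |L_t − L_{t−k}| + |L_t − L_{t+k}|, which matches the stated definitions but is not confirmed by an equation in this text. The likelihood values, the weights, the threshold, and the handling of offsets that fall outside the utterance are all made up for illustration.

```python
def likelihood_change(likelihoods, t, weights):
    """Assumed form: D = sum_{k=1..K} w_k * (|L_t - L_{t-k}| + |L_t - L_{t+k}|).

    likelihoods[i] is the likelihood of the phoneme assigned to frame i,
    t is the boundary frame, and K = len(weights) with weights[k-1] = w_k.
    Offsets that fall outside the utterance are skipped (an assumption).
    """
    d = 0.0
    for k, w in enumerate(weights, start=1):
        if t - k < 0 or t + k >= len(likelihoods):
            break
        d += w * (abs(likelihoods[t] - likelihoods[t - k])
                  + abs(likelihoods[t] - likelihoods[t + k]))
    return d


def is_reliable(d, theta):
    """Reliable if D - theta > 0, unreliable if D - theta <= 0."""
    return d - theta > 0


weights = [1.0, 0.5]                       # non-negative, decreasing in k
sharp = [-2.0, -2.0, -9.0, -2.5, -2.5]     # likelihood changes abruptly at frame 2
flat = [-2.0, -2.1, -2.2, -2.1, -2.0]      # likelihood varies smoothly
d_sharp = likelihood_change(sharp, t=2, weights=weights)
d_flat = likelihood_change(flat, t=2, weights=weights)
theta = 1.0                                # hypothetical threshold
```

With these made-up numbers the abrupt case exceeds the threshold and the smooth case does not, which is the situation Example 2 is meant to detect.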

<Step S4>
The duration distribution storage unit 7 stores the mean and variance of the durations of a plurality of phonemes.

To each phoneme forming a phoneme boundary whose time was determined to be unreliable, the second phoneme boundary time estimation unit 4 assigns a time that is longer the larger the mean of that phoneme's duration and that is stretched or compressed more the larger the variance of that duration, and thereby estimates the time of the unreliable phoneme boundary (step S4). The mean and variance of each phoneme's duration are the values read from the duration distribution storage unit 7.

FIG. 4 shows a functional block diagram of an example of the second phoneme boundary time estimation unit 4. FIG. 7 shows a flowchart of an example of the processing of the second phoneme boundary time estimation unit 4.

For example, the phoneme duration maximum likelihood estimation unit 41 of the second phoneme boundary time estimation unit 4 calculates the phoneme duration $m_i^{*}$ of the i-th phoneme whose phoneme boundary time was determined to be unreliable, based on the following equation (step S41).

$$
m_i^{*} = m_i + \frac{\sigma_i^{2}}{\sum_{j=1}^{I} \sigma_j^{2}} \left( T - \sum_{j=1}^{I} m_j \right)
$$

Here $T$ is the total time length occupied by the $I$ consecutive phonemes ($I$ is a positive integer) whose phoneme boundary times were determined to be unreliable, $m_i^{*}$ is the estimated duration of the i-th phoneme within the time length $T$, $m_i$ is the mean duration of the i-th phoneme read from the duration distribution storage unit 7, and $\sigma_i^{2}$ is the variance of the duration of the i-th phoneme read from the duration distribution storage unit 7.
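The name of unit 41 suggests a maximum-likelihood reading of this allocation. As a sketch, and assuming (this is not stated in the text) that the durations $x_i$ of the $I$ phonemes are independent Gaussian variables with mean $m_i$ and variance $\sigma_i^{2}$ that must sum to $T$, maximizing the log-likelihood under that constraint with a Lagrange multiplier $\lambda$ gives:

$$
\begin{aligned}
&\max_{x_1,\dots,x_I} \; -\sum_{i=1}^{I} \frac{(x_i - m_i)^{2}}{2\sigma_i^{2}}
\quad \text{subject to} \quad \sum_{i=1}^{I} x_i = T, \\
&-\frac{x_i - m_i}{\sigma_i^{2}} - \lambda = 0
\;\Rightarrow\; x_i = m_i - \lambda \sigma_i^{2}, \\
&T = \sum_{j=1}^{I} m_j - \lambda \sum_{j=1}^{I} \sigma_j^{2}
\;\Rightarrow\; \lambda = -\frac{T - \sum_{j=1}^{I} m_j}{\sum_{j=1}^{I} \sigma_j^{2}}, \\
&m_i^{*} = x_i = m_i + \frac{\sigma_i^{2}}{\sum_{j=1}^{I} \sigma_j^{2}} \left( T - \sum_{j=1}^{I} m_j \right).
\end{aligned}
$$

Each phoneme starts from its mean duration, and the gap between $T$ and the sum of the means is shared out in proportion to the variances, so phonemes whose durations vary more absorb more of the stretch or compression.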

In the example of FIG. 10, the times of the phoneme boundaries /g/-/e/, /e/-/n/, and /u/-/o/ were determined to be unreliable. The duration of the phoneme /g/, the first phoneme in the time length T occupied by the three consecutive phonemes /g/, /e/, /n/, can be calculated with the above equation as follows. In this example, T = 180 ms; the mean duration of /g/ is $m_1$ = 20 ms and its variance is $\sigma_1^{2}$ = 0.003; the mean duration of /e/ is $m_2$ = 95 ms and its variance is $\sigma_2^{2}$ = 0.012; and the mean duration of /n/ is $m_3$ = 45 ms and its variance is $\sigma_3^{2}$ = 0.005.

$$
\begin{aligned}
m_1^{*} &= m_1 + \frac{\sigma_1^{2}}{\sum_{i=1}^{I} \sigma_i^{2}} \left( T - \sum_{i=1}^{I} m_i \right) \\
&= 20 + \frac{0.003}{0.003 + 0.012 + 0.005} \cdot \bigl( 180 - (20 + 95 + 45) \bigr) \\
&= 20 + \frac{0.003}{0.020} \cdot 20 \\
&= 23
\end{aligned}
$$
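The same arithmetic can be checked for all three phonemes with a short sketch. The function name `reestimate_durations` is hypothetical; the inputs are the values of the worked example (durations in ms, variances as given).

```python
def reestimate_durations(total, means, variances):
    """Share the gap between `total` and the sum of the mean durations
    among the phonemes in proportion to their duration variances."""
    var_sum = sum(variances)
    if var_sum <= 0:
        raise ValueError("the variances must sum to a positive value")
    slack = total - sum(means)
    return [m + (v / var_sum) * slack for m, v in zip(means, variances)]


# /g/, /e/, /n/ from the worked example: T = 180 ms.
durations = reestimate_durations(
    total=180.0,
    means=[20.0, 95.0, 45.0],
    variances=[0.003, 0.012, 0.005],
)
```

Up to floating-point rounding, this yields about 23 ms for /g/, 107 ms for /e/, and 50 ms for /n/, which together fill the 180 ms span.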

The phoneme boundary time determination unit 42 of the second phoneme boundary time estimation unit 4 determines the final phoneme boundary times from the time lengths assigned to the phonemes and the phoneme boundary times marked as reliable in the reliability determination result, and outputs the result (step S4).
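One way to picture this final step is to accumulate the assigned durations from the start of the unreliable span and merge the resulting times with the boundaries that were kept. The sketch below is an illustration only: the start time of 100 ms and the two reliable boundary times are hypothetical values, and `interior_boundaries` is an illustrative name.

```python
def interior_boundaries(segment_start, durations):
    """Turn per-phoneme durations into the boundary times inside a span.

    The span starts at segment_start. The end of the last phoneme is the
    end of the span, which is fixed elsewhere, so it is not returned.
    """
    times = []
    current = segment_start
    for duration in durations[:-1]:
        current += duration
        times.append(current)
    return times


# Durations assigned to /g/, /e/, /n/ (ms); the span is assumed to start at 100 ms.
reestimated = interior_boundaries(100.0, [23.0, 107.0, 50.0])

# Hypothetical boundary times that passed the reliability check.
reliable = [280.0, 330.0]

final_boundaries = sorted(reliable + reestimated)
```

Here `reestimated` holds the new /g/-/e/ and /e/-/n/ boundary times, and the end of the span (280 ms) coincides with a boundary that was already trusted.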

In this way, the phoneme boundary times are first estimated provisionally, and those judged to have low estimation accuracy are re-estimated using the mean and variance of the phoneme durations, so that phoneme boundary times can be estimated more accurately than before.

A detailed phoneme boundary time estimation unit 5 may further correct the phoneme boundary times determined by the second phoneme boundary time estimation unit 4, using the technique described in Reference 1, to obtain even more accurate phoneme boundary times. In the technique of Reference 1, a search window is set around a previously determined phoneme boundary time, and a more accurate phoneme boundary time is obtained using a Markov model trained on the spectral patterns near phoneme boundary times. Narrowing the search window when using the technique of Reference 1 makes it possible to prevent a context identical to that of the correct boundary from appearing within the window.

[Reference 1] Lijuan Wang, Yong Zhao, Min Chu, Frank K. Soong, Jian-Lai Zhou and Zhigang Cao, "Context-Dependent Boundary Model for Refining Boundaries Segmentation of TTS Units," IEICE Transactions 89-D(3), pp. 1082-1091, 2006

The phoneme division apparatus can be realized by a computer. In that case, the processing of each function that the apparatus should have is described by a program, and executing this program on the computer realizes each processing function of the apparatus on the computer.

The program describing this processing can be recorded on a computer-readable recording medium. In this embodiment the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing may instead be implemented in hardware.

The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from the spirit of the invention.

Description of symbols
1 Speech feature extraction unit
2 First phoneme boundary time estimation unit
3 Phoneme boundary time estimation result reliability determination unit
31 Condition reliability determination unit
32 Reliability determination condition storage unit
33 Likelihood change degree calculation unit
34 Likelihood reliability determination unit
4 Second phoneme boundary time estimation unit
41 Phoneme duration maximum likelihood estimation unit
42 Phoneme boundary time determination unit
5 Detailed phoneme boundary time estimation unit
6 Speech feature storage unit
7 Duration distribution storage unit

Claims

1. A phoneme division apparatus comprising:
a speech feature extraction unit that extracts speech features of each frame of input speech;
a speech feature storage unit in which statistics on the speech features of a plurality of phonemes are stored;
a first phoneme boundary time estimation unit that assigns the most likely phoneme to each of the frames using the speech features and the statistics read from the speech feature storage unit and, when the phonemes assigned to two consecutive frames differ, estimates a phoneme boundary time by taking any time within the time range spanning those two frames as the phoneme boundary time;
a phoneme boundary time estimation result reliability determination unit that determines whether the phoneme boundary time is reliable;
a duration distribution storage unit in which the mean and variance of the durations of a plurality of phonemes are stored; and
a second phoneme boundary time estimation unit that estimates the phoneme boundary time of a phoneme boundary whose phoneme boundary time was determined to be unreliable by assigning, to each phoneme forming that boundary, a time that is longer the larger the mean, and stretched or compressed more the larger the variance, of that phoneme's duration as read from the duration distribution storage unit.

2. The phoneme division apparatus according to claim 1, wherein
the phoneme boundary time estimation result reliability determination unit includes: a reliability determination condition storage unit that stores conditions on the phonemes assigned to the two consecutive frames corresponding to a phoneme boundary time that cannot be trusted; and a condition reliability determination unit that determines that the phoneme boundary time is unreliable when the phonemes assigned to the two consecutive frames corresponding to that phoneme boundary time satisfy the conditions read from the reliability determination condition storage unit.

3. The phoneme division apparatus according to claim 1, wherein
the first phoneme boundary time estimation unit calculates, for each of the frames, the likelihood of assigning each phoneme to that frame, and assigns the phoneme with the highest likelihood to that frame.

4. The phoneme division apparatus according to claim 3, wherein
the phoneme boundary time estimation result reliability determination unit includes: a likelihood change degree calculation unit that obtains a likelihood change degree, which is an index of the magnitude of the change in the likelihood of the phonemes assigned to a predetermined number of frames including the frame of the phoneme boundary time; and a likelihood reliability determination unit that determines that the phoneme boundary time is unreliable if the likelihood change degree is less than or equal to a predetermined threshold.

5. A phoneme division method in which statistics on the speech features of a plurality of phonemes are stored in a speech feature storage unit and the mean and variance of the durations of a plurality of phonemes are stored in a duration distribution storage unit, the method comprising:
a speech feature extraction step in which a speech feature extraction unit extracts speech features of each frame of input speech;
a first phoneme boundary time estimation step in which a first phoneme boundary time estimation unit assigns the most likely phoneme to each of the frames using the speech features and the statistics read from the speech feature storage unit and, when the phonemes assigned to two consecutive frames differ, estimates a phoneme boundary time by taking any time within the time range spanning those two frames as the phoneme boundary time;
a phoneme boundary time estimation result reliability determination step in which a phoneme boundary time estimation result reliability determination unit determines whether the phoneme boundary time is reliable; and
a second phoneme boundary time estimation step in which a second phoneme boundary time estimation unit estimates the phoneme boundary time of a phoneme boundary whose phoneme boundary time was determined to be unreliable by assigning, to each phoneme forming that boundary, a time that is longer the larger the mean, and stretched or compressed more the larger the variance, of that phoneme's duration as read from the duration distribution storage unit.

6. A phoneme division program for causing a computer to function as each unit of the phoneme division apparatus according to claim 1.