JP4809918B2

JP4809918B2 - Phoneme division apparatus, method, and program

Info

Publication number: JP4809918B2
Application number: JP2009201990A
Authority: JP
Inventors: 孝中村; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-09-01
Filing date: 2009-09-01
Publication date: 2011-11-09
Anticipated expiration: 2029-09-01
Also published as: JP2011053425A

Description

The present invention relates to a technique for automatically determining phoneme boundary times from speech.

A known technique sets a search window before and after a previously determined phoneme boundary and uses a Markov model trained on spectral patterns near phoneme boundaries to obtain a more accurate phoneme boundary (see, for example, Non-Patent Document 1).

Lijuan Wang, Yong Zhao, Min Chu, Frank K. Soong, Jian-Lai Zhou and Zhigang Cao, “Context-Dependent Boundary Model for Refining Boundaries Segmentation of TTS Units,” IEICE Transactions 89-D(3), pp.1082-1091, 2006

However, Non-Patent Document 1 estimates each phoneme boundary independently, so the estimated phoneme boundaries are not optimal as a whole.

To solve the above problem, a spectrum template storage unit stores spectrum templates, each indicating the speech feature amounts of the frames that make up a phoneme boundary, and the speech feature amount of each frame of the input speech is extracted. The matching score of a frame is defined as the number of spectrum templates whose distance to the input speech is smallest when that frame is taken as the center of the template. A plurality of spectrum templates corresponding to an initial phoneme boundary estimated in advance are read from the spectrum template storage unit. Taking each frame in a predetermined frame section that includes the initial phoneme boundary as the center of each read template, the distance between each read template and the input speech is calculated using the speech feature amounts. For each read template, the frame in the frame section at which that distance is smallest is found, and the matching score of each frame is calculated. Frames corresponding to local maxima of the matching score are determined as phoneme boundary candidates for the initial phoneme boundary.

The search score function is monotonically non-increasing in the absolute value of the difference between the duration of each phoneme delimited by a set of phoneme boundary candidates and the duration of the corresponding phoneme delimited by the set of initial phoneme boundaries, monotonically non-decreasing in the variance of the duration of each phoneme delimited by the set of phoneme boundary candidates, and monotonically non-decreasing in the matching score of each phoneme boundary candidate in the set. With R an integer of 2 or more, when there are a plurality of sets of phoneme boundary candidates that delimit R consecutive phonemes, the search score of each set is calculated by inputting to the search score function at least one of the following: the duration of each phoneme delimited by the set together with the duration of the corresponding phoneme delimited by the set of initial phoneme boundaries; the variance of the duration of each phoneme delimited by the set, read from a duration distribution storage unit that stores the duration variances of a plurality of phonemes; and the matching score of each phoneme boundary candidate in the set. The phoneme boundaries of the set that maximizes the search score are taken as the optimal phoneme boundaries.

When there are a plurality of sets of phoneme boundary candidates that delimit R consecutive phonemes, the optimal set is selected by considering those R consecutive phonemes as a whole, which makes phoneme boundary estimation more accurate than before.

FIG. 1: Functional block diagram of an example of the phoneme division apparatus.
FIG. 2: Functional block diagram of an example of the matching score calculation unit.
FIG. 3: Functional block diagram of an example of the optimal phoneme boundary search unit.
FIG. 4: Flowchart of an example of the phoneme division method.
FIG. 5: Flowchart of an example of the processing of the matching score calculation unit.
FIG. 6: Flowchart of an example of the processing of the optimal phoneme boundary search unit.
FIG. 7: Diagram for explaining a spectrum template.
FIG. 8: Diagram for explaining the processing of the phoneme boundary candidate calculation unit.
FIG. 9: Diagram for explaining the processing of the optimal phoneme boundary search unit.

Embodiments of the present invention are described in detail below.

FIG. 1 is a functional block diagram of an example of a phoneme division apparatus according to the present invention. FIG. 4 is a flowchart of an example of a phoneme division method according to the present invention.

The phoneme division apparatus includes, for example, a speech feature amount extraction unit 1, a search range determination unit 2, a matching score calculation unit 3, a spectrum template storage unit 4, a phoneme boundary candidate calculation unit 5, an optimal phoneme boundary search unit 6, and a duration distribution storage unit 7.

<Step S1>
The input speech is supplied to the speech feature amount extraction unit 1. The speech feature amount extraction unit 1 divides the input speech into frames of a fixed time length and calculates a speech feature amount for each frame (step S1). The speech feature amount of each frame is sent to the matching score calculation unit 3.

Any speech feature amount may be used as long as it allows a phoneme to be assigned to a frame. For example, features commonly used in speech recognition, such as MFCC, cepstrum, mel-cepstrum, filter bank, and mel filter bank features, can be used.
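Step S1 can be sketched as follows. This is a minimal illustration, not the apparatus's implementation: the function names, the hop length, and the toy log-energy feature are assumptions introduced here, standing in for MFCC or a similar feature.

```python
import math

def split_into_frames(samples, frame_len, hop_len):
    """Split samples into fixed-length frames; a trailing partial frame is dropped."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop_len)]

def log_energy(frame, eps=1e-10):
    """Toy 1-dimensional feature (assumption): log of the mean squared amplitude."""
    return [math.log(sum(x * x for x in frame) / len(frame) + eps)]

def extract_features(samples, frame_len=400, hop_len=160):
    """Return one feature vector per frame, as in step S1."""
    return [log_energy(f) for f in split_into_frames(samples, frame_len, hop_len)]
```

In practice `log_energy` would be replaced by a multi-dimensional feature extractor; the rest of the pipeline only needs one vector per frame.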

<Step S2>
Information about the initial phoneme boundaries estimated in advance is input to the search range determination unit 2. The search range determination unit 2 determines a search range from each initial phoneme boundary (step S2). The search range is a frame section that includes the initial phoneme boundary; the matching score calculation unit 3, described later, calculates a matching score for each frame in that section.

For example, with n a real number from 0 to 1, the search range is set to n times the average length of the morae (excluding pauses) delimited by the initial phoneme boundaries. For example, n is set to 0.5 to 0.7.
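A minimal sketch of step S2 under stated assumptions: the text gives only the width of the search range (n times the average mora length), so centering the range on the initial boundary, rounding to whole frames, and clamping at frame 0 are choices made here. The function name `search_range` is hypothetical.

```python
def search_range(boundary_frame, mora_lengths, n=0.6):
    """Return (first, last) frame of a search range around an initial boundary.

    mora_lengths: lengths in frames of the morae (pauses excluded) delimited
    by the initial phoneme boundaries.
    """
    avg_len = sum(mora_lengths) / len(mora_lengths)
    # Assumption: the range is centered on the initial boundary.
    half = max(1, round(n * avg_len / 2))
    return max(0, boundary_frame - half), boundary_frame + half
```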

<Step S3>
The spectrum template storage unit 4 stores spectrum templates, each indicating the speech feature amounts of the frames that make up a phoneme boundary. As shown for example in FIG. 7, a spectrum template includes the speech feature amount of each frame in a predetermined frame section containing the phoneme boundary, and the speech feature amount of each frame in predetermined frame sections containing the centers of the preceding phoneme and the following phoneme that form the boundary. The center of a spectrum template is the frame containing the phoneme boundary.

Among the speech feature amounts of the frames in the predetermined frame section containing the phoneme boundary, that of the frame containing the phoneme boundary is called the phoneme boundary pattern, and those of the frames not containing the phoneme boundary (the frames before and after the boundary frame) are called the phoneme boundary neighborhood patterns. The speech feature amounts of the frames in the predetermined frame sections containing the centers of the preceding and following phonemes that form the boundary are called the phoneme center neighborhood patterns.

The matching score calculation unit 3 calculates a matching score for each frame in the frame section (search range) that includes the initial phoneme boundary (step S3). The matching score of each frame is sent to the phoneme boundary candidate calculation unit 5. The matching score of a frame is the number of spectrum templates whose distance to the input speech is smallest when that frame is taken as the center of the template.

A specific example of the matching score calculation unit 3 is described below. As illustrated in FIG. 2, the matching score calculation unit 3 includes a spectrum template selection unit 31, a distance calculation unit 32, a frame selection unit 33, an accumulation unit 34, and a control unit 35. The processing flow of the matching score calculation unit 3 is illustrated in FIG. 5.

The spectrum template selection unit 31 reads, from the spectrum template storage unit 4, a plurality of spectrum templates corresponding to the initial phoneme boundary estimated in advance (step S31). For example, it reads the spectrum templates of phoneme boundaries formed by the same two phonemes as the initial phoneme boundary; that is, when the initial phoneme boundary is /A/-/W/, it reads the spectrum templates of the phoneme boundary /A/-/W/. Spectrum templates that match in at least one of manner of articulation, place of articulation, and voiced/unvoiced may also be read. Let N be the number of spectrum templates, corresponding to the initial phoneme boundary, that the spectrum template selection unit 31 has read.

The distance calculation unit 32 takes each frame in the predetermined frame section that includes the initial phoneme boundary as the center of each read spectrum template, and calculates the distance between each read template and the input speech using the speech feature amounts (step S32). Any of the cosine distance, the Euclidean distance, and the Mahalanobis distance can be used as the distance.

For example, the distance d(m, n) between spectrum template n and the input speech, when the center of spectrum template n is placed at frame m, is calculated by the following equation. The calculated distance d(m, n) is sent to the frame selection unit 33.

Here, the symbols are defined as follows.
V: the number of dimensions of the speech feature amount.
α: the number of frames in the phoneme boundary neighborhood pattern.
β: the number of frames to the left or to the right of the center frame in the phoneme center neighborhood pattern, excluding the center frame.
C_ref(m, v): the value of the v-th dimension of the speech feature amount of frame m of the input speech.
C_{tem,Bound}(n, v): the value of the v-th dimension of the speech feature amount of the phoneme boundary pattern of spectrum template n.
C_{tem,Center,L}(i, n, v): the value of the v-th dimension of the speech feature amount of the i-th frame from the left in the phoneme center neighborhood pattern of the preceding phoneme of spectrum template n.
C_{tem,Center,R}(i, n, v): the value of the v-th dimension of the speech feature amount of the i-th frame from the left in the phoneme center neighborhood pattern of the following phoneme of spectrum template n.
C_{tem,Round,L}(i, n, v): the value of the v-th dimension of the speech feature amount of the i-th frame from the left in the left-side phoneme boundary neighborhood pattern of spectrum template n.
C_{tem,Round,R}(i, n, v): the value of the v-th dimension of the speech feature amount of the i-th frame from the left in the right-side phoneme boundary neighborhood pattern of spectrum template n.
L_l: the distance between the frame containing the initial phoneme boundary and the frame containing the center of the phoneme preceding the initial phoneme boundary.
L_r: the distance between the frame containing the initial phoneme boundary and the frame containing the center of the phoneme following the initial phoneme boundary.
L_l and L_r are measured in frames. The distance calculation unit 32 obtains L_l and L_r from the information about the initial phoneme boundary.

In this way, when the distance between a spectrum template and the input speech is calculated, the template's boundary frame and the frame containing the center of its preceding phoneme are separated by the distance between the frame containing the initial phoneme boundary and the frame containing the center of the phoneme preceding that boundary. Likewise, the template's boundary frame and the frame containing the center of its following phoneme are separated by the distance between the frame containing the initial phoneme boundary and the frame containing the center of the phoneme following that boundary. This adapts the distance calculation to the speaking rate of the input speech and improves the accuracy of phoneme boundary estimation.
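The placement of template patterns described above can be sketched as follows. The concrete formula is an assumption: this sketch sums squared Euclidean differences over all aligned frames, treats α as a per-side count, and returns infinity when a pattern would fall outside the input. The dictionary layout of a template (`bound`, `round_l`, `round_r`, `center_l`, `center_r`) is hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def template_distance(feats, m, tpl, L_l, L_r):
    """Distance d(m, n) of one template centered at frame m (sketch).

    feats: list of per-frame feature vectors of the input speech.
    tpl:   dict with 'bound' (vector), 'round_l'/'round_r' (alpha vectors each),
           'center_l'/'center_r' (2*beta+1 vectors each).
    L_l, L_r: frame offsets from the initial boundary to the centers of the
           preceding and following phonemes, so the template stretches with
           the speaking rate of the input.
    """
    alpha = len(tpl["round_l"])
    beta = len(tpl["center_l"]) // 2
    pairs = [(m, tpl["bound"])]
    pairs += [(m - alpha + i, p) for i, p in enumerate(tpl["round_l"])]
    pairs += [(m + 1 + i, p) for i, p in enumerate(tpl["round_r"])]
    pairs += [(m - L_l - beta + i, p) for i, p in enumerate(tpl["center_l"])]
    pairs += [(m + L_r - beta + i, p) for i, p in enumerate(tpl["center_r"])]
    total = 0.0
    for idx, pattern in pairs:
        if not 0 <= idx < len(feats):
            return float("inf")  # template does not fit inside the input
        total += sq_dist(feats[idx], pattern)
    return total
```

A cosine or Mahalanobis distance could be substituted for `sq_dist` without changing the alignment logic.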

For spectrum template n, the frame selection unit 33 selects, from among the frames R of the search range, the frame that minimizes the distance d(m, n) (step S33). For example, with S(·) the sub-score function shown below, S(d(m, n)) is calculated so that the frame minimizing the distance is counted as 1. The sub-score function value S(d(m, n)) is sent to the accumulation unit 34.

The control unit 35 determines whether n = N (step S34). If n = N, the process proceeds to step S35; otherwise, n is incremented by 1 (step S36) and the process returns to step S31. In this way, the processing of steps S32 and S33 is performed for each spectrum template n (n = 1, ..., N). N is the number of spectrum templates, corresponding to the initial phoneme boundary, that the spectrum template selection unit 31 has read.

The accumulation unit 34 adds up S(d(m, n)) over the spectrum templates n (n = 1, ..., N) and takes the sum as the matching score MS(m) for frame m (step S35).

The above equation does not weight S(d(m, n)) for spectrum template n, but S(d(m, n)) may also be added with weights, as in the following equation.

w_n is the weight of spectrum template n. It is, for example, a real number from 0 to 1, and is set as appropriate according to the required specifications and performance. For example, if the phonemes forming the initial phoneme boundary match the phonemes forming the phoneme boundary of spectrum template n, the weight is set to w_n = 1; if the manner of articulation matches, w_n = 0.8; and if the place of articulation matches, w_n = 0.6. That is, the higher the degree of match between the initial phoneme boundary and the spectrum template, the larger the weight.
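The counting in steps S33 to S35, with and without weights, can be sketched as follows. It assumes the sub-score S is 1 for the frame that minimizes d(m, n) for a given template and 0 elsewhere, which is how the text describes it; breaking ties in favor of the earliest frame is an additional assumption. The function name `matching_scores` is hypothetical.

```python
def matching_scores(dist, weights=None):
    """Matching score MS for every frame of the search range.

    dist[n][k]: distance of template n centered at the k-th frame of the range.
    weights:    optional per-template weights w_n (defaults to 1 for all).
    """
    n_frames = len(dist[0])
    if weights is None:
        weights = [1.0] * len(dist)
    scores = [0.0] * n_frames
    for row, w in zip(dist, weights):
        # The frame where this template fits best receives the sub-score.
        best = min(range(n_frames), key=row.__getitem__)
        scores[best] += w
    return scores
```

With unit weights, each score is simply the number of templates that fit best at that frame.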

<Step S4>
The phoneme boundary candidate calculation unit 5 selects frames with large matching scores as phoneme boundary candidates for the initial phoneme boundary (step S4). Information about the phoneme boundary candidates for each initial phoneme boundary is sent to the optimal phoneme boundary search unit 6.

For example, frames corresponding to local maxima of the matching score are selected as phoneme boundary candidates for the initial phoneme boundary. If the matching scores over the search range containing a given initial phoneme boundary are as shown in FIG. 8, the frames m_1 and m_2 corresponding to the two local maxima are selected as the phoneme boundary candidates for that initial phoneme boundary. Because FIG. 8 is a conceptual diagram, the graph of matching score against frame is drawn as a continuous function; since frame numbers are discrete, the actual graph is a discontinuous function.
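Picking local maxima from a discrete score sequence can be sketched as follows. How plateaus, range edges, and zero scores are handled is not specified in the text; here a frame qualifies if its score is positive, strictly greater than its left neighbor, and at least as large as its right neighbor. The function name `local_maxima` is hypothetical.

```python
def local_maxima(scores):
    """Return indices (within the search range) of local maxima of the scores."""
    peaks = []
    for k, s in enumerate(scores):
        left = scores[k - 1] if k > 0 else float("-inf")
        right = scores[k + 1] if k + 1 < len(scores) else float("-inf")
        # Assumption: on a plateau only the first frame is kept.
        if s > 0 and s > left and s >= right:
            peaks.append(k)
    return peaks
```

The returned indices are offsets into the search range; adding the first frame of the range gives absolute frame numbers.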

<Step S5>
With R an integer of 2 or more, when there are a plurality of sets of phoneme boundary candidates that delimit R consecutive phonemes, the optimal phoneme boundary search unit 6 obtains a search score for each set and takes the phoneme boundaries of the set that maximizes the search score as the optimal phoneme boundaries (step S5).

The processing of the optimal phoneme boundary search unit 6 is illustrated with FIG. 9, taking R = 3 as an example. When the initial phoneme boundary /A/-/W/ has two phoneme boundary candidates A1 and A2, and the initial phoneme boundary /W/-/A/ has two phoneme boundary candidates B1 and B2, there are 4 (= 2 × 2) sets of phoneme boundary candidates, as shown in FIG. 9: (A1, B1), (A1, B2), (A2, B1), and (A2, B2). The optimal phoneme boundary search unit 6 obtains a search score for each set and takes the phoneme boundaries of the set that maximizes the search score as the optimal phoneme boundaries.
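The enumeration in the R = 3 example is a Cartesian product over the candidates of each boundary. A minimal sketch follows; the function name is hypothetical, and dropping combinations whose frames are not in increasing time order is an assumption added here.

```python
from itertools import product

def candidate_sets(candidates_per_boundary):
    """Return every set that picks one candidate frame per boundary.

    candidates_per_boundary: list of lists of frame numbers, one inner list
    per initial phoneme boundary, ordered in time.
    """
    return [combo for combo in product(*candidates_per_boundary)
            if all(a < b for a, b in zip(combo, combo[1:]))]
```

For two boundaries with two candidates each, such as A1, A2 and B1, B2 above, this yields the four sets listed in the text.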

The search score is an index of how plausible a set of phoneme boundary candidates is, and is obtained by evaluating a search score function. The search score function is, for example, a function that is monotonically non-increasing in the absolute value of the difference between the duration of each phoneme delimited by the set of phoneme boundary candidates and the duration of the corresponding phoneme delimited by the set of initial phoneme boundaries, monotonically non-decreasing in the variance of the duration of each phoneme delimited by the set of phoneme boundary candidates, and monotonically non-decreasing in the matching score of each phoneme boundary candidate in the set. An example of the search score function is given below.

S_pr is the matching score of the r-th phoneme. The matching score of the phoneme boundary between the r-th and (r-1)-th phonemes, or of the phoneme boundary between the r-th and (r+1)-th phonemes, is used as the matching score of the r-th phoneme. w_p and w_d are weights; for example, they are varied from 0 to 1 in steps of 0.1, and the weights that give the best phoneme boundary estimation results are used.

S_dr is the duration score of the r-th phoneme, d_r is the duration of the r-th phoneme delimited by the set of phoneme boundary candidates, m'_r is the duration of the r-th phoneme delimited by the set of initial phoneme boundaries, and σ_r^2 is the variance of the duration of the r-th phoneme.
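One pair of functions with the monotonicity properties stated above can be sketched as follows. The concrete forms are assumptions chosen only to satisfy those properties (a Gaussian-shaped duration score and a weighted sum over phonemes); they are not taken from equations (1) and (2). The function names are hypothetical.

```python
import math

def duration_score(d_r, m_r, var_r):
    """Assumed duration score S_dr for one phoneme.

    Non-increasing in |d_r - m_r| and non-decreasing in the variance var_r.
    """
    return math.exp(-((d_r - m_r) ** 2) / (2.0 * var_r))

def search_score(match_scores, duration_scores, w_p=0.5, w_d=0.5):
    """Assumed search score: weighted sum of S_pr and S_dr over the R phonemes."""
    return sum(w_p * s_p + w_d * s_d
               for s_p, s_d in zip(match_scores, duration_scores))
```

A candidate duration equal to the initial duration gives the maximum duration score of 1; a larger variance makes the score more tolerant of deviations.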

As illustrated in FIG. 3, the optimal phoneme boundary search unit 6 includes a duration score calculation unit 61, a search score calculation unit 62, an optimal candidate sequence search unit 63, and a control unit 64. The processing flow of the optimal phoneme boundary search unit 6 is illustrated in FIG. 6.

The control unit 64 sets r = 1 (step S51).

The duration score calculation unit 61 calculates the duration score of the r-th phoneme, defined for example by equation (2), using the duration d_r of the r-th phoneme delimited by the set of phoneme boundary candidates, the duration m'_r of the r-th phoneme delimited by the corresponding set of initial phoneme boundaries, and the variance of the duration of the r-th phoneme read from the duration distribution storage unit 7, which stores the duration variances of a plurality of phonemes (step S52). The calculated duration score S_dr is sent to the search score calculation unit 62.

The control unit 64 determines whether r = R (step S53). If r = R, the process proceeds to step S55; otherwise, r is incremented by 1 (step S54) and the process returns to step S52. In this way, the duration score S_dr is calculated for each of the r-th phonemes (r = 1, ..., R).

The search score calculation unit 62 calculates the search score, defined for example by equation (1), using the calculated duration scores S_dr and the matching score of each phoneme boundary candidate in the set (step S55). The calculated search score is sent to the optimal candidate sequence search unit 63.

The control unit 64 determines whether search scores have been calculated for all sets of phoneme boundary candidates (step S56). If any set has not yet been scored, the processing of steps S51 to S55 is performed for that set. In this way, search scores are calculated for all sets of phoneme boundary candidates.

The optimal candidate sequence search unit 63 selects the set of phoneme boundary candidates that maximizes the search score and takes the phoneme boundaries of that set as the optimal phoneme boundaries (step S57).
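Steps S51 to S57 amount to an exhaustive search over candidate sets. The sketch below makes several assumptions that the text does not fix: the outer edges `start` and `end` of the R phonemes are given, the duration score has a Gaussian shape, the search score is a weighted sum, and each phoneme takes the matching score of its right boundary (the last phoneme takes its left boundary). All names are hypothetical.

```python
import math
from itertools import product

def best_boundaries(start, end, candidates, init_bounds, variances,
                    w_p=0.5, w_d=0.5):
    """Return (best_frames, best_score) over all candidate sets.

    candidates[j]:  list of (frame, matching_score) for the j-th interior boundary.
    init_bounds[j]: frame of the j-th initial phoneme boundary.
    variances[r]:   duration variance of the r-th phoneme.
    """
    init_edges = [start, *init_bounds, end]
    init_durs = [b - a for a, b in zip(init_edges, init_edges[1:])]
    best, best_score = None, -math.inf
    for combo in product(*candidates):          # loop closed by step S56
        frames = [f for f, _ in combo]
        edges = [start, *frames, end]
        durs = [b - a for a, b in zip(edges, edges[1:])]
        if min(durs) <= 0:
            continue                            # boundaries out of time order
        ms = [s for _, s in combo]
        s_p = ms + [ms[-1]]                     # per-phoneme matching scores
        score = 0.0
        for sp, d, m0, var in zip(s_p, durs, init_durs, variances):
            s_d = math.exp(-((d - m0) ** 2) / (2.0 * var))   # step S52
            score += w_p * sp + w_d * s_d                    # step S55
        if score > best_score:                  # step S57
            best, best_score = tuple(frames), score
    return best, best_score
```

Because every combination is scored jointly, a candidate with a slightly lower matching score can still win if it yields more plausible durations for the neighboring phonemes. The number of sets grows multiplicatively with the number of boundaries, so this brute-force form is only practical for small R.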

Thus, when there are a plurality of sets of phoneme boundary candidates that delimit R consecutive phonemes, the optimal set is selected by considering those R consecutive phonemes as a whole, which makes phoneme boundary estimation more accurate than before.

[Modification]
In the above example, initial phoneme boundaries estimated in advance are input to the search range determination unit 2. Alternatively, an initial phoneme boundary estimation unit 8, shown by a broken line in FIG. 1, may be provided; it estimates the initial phoneme boundaries from the input speech and sends information about them to the search range determination unit 2. An existing phoneme boundary estimation technique is used for this. Because the present invention estimates more accurate phoneme boundaries on the basis of the initial phoneme boundaries, a rough estimate of the initial phoneme boundaries is sufficient.

In the above example, the search score function takes all of the following as inputs: the duration of each phoneme delimited by the set of phoneme boundary candidates together with the duration of the corresponding phoneme delimited by the set of initial phoneme boundaries; the variance of the duration of each phoneme delimited by the set, read from the duration distribution storage unit that stores the duration variances of a plurality of phonemes; and the matching score of each phoneme boundary candidate in the set. However, the value of the search score function may also be calculated from at least one of these inputs.

The phoneme division apparatus can be realized by a computer. In this case, the processing of each function that the apparatus should have is described by a program. Executing this program on a computer realizes each processing function of the apparatus on the computer.

The program describing this processing can be recorded on a computer-readable recording medium. In this embodiment the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.

The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from the spirit of the invention.

DESCRIPTION OF SYMBOLS
1 Speech feature amount extraction unit
2 Search range determination unit
3 Matching score calculation unit
31 Spectrum template selection unit
32 Distance calculation unit
33 Frame selection unit
34 Accumulation unit
35 Control unit
4 Spectrum template storage unit
5 Phoneme boundary candidate calculation unit
6 Optimal phoneme boundary search unit
61 Duration score calculation unit
62 Search score calculation unit
63 Optimal candidate sequence search unit
64 Control unit
7 Duration distribution storage unit
8 Initial phoneme boundary estimation unit

Claims

A speech feature amount extraction unit that extracts a speech feature amount of each frame of input speech;
A spectrum template storage unit in which spectrum templates, each indicating the speech feature amount of each frame constituting a phoneme boundary, are stored;
A matching score calculation unit that, with the matching score of a frame defined as the number of spectrum templates whose distance to the input speech is smallest when that frame is taken as the center of the spectrum template, reads from the spectrum template storage unit a plurality of spectrum templates corresponding to an initial phoneme boundary estimated in advance, calculates, using the speech feature amounts, the distance between each read spectrum template and the input speech with each frame included in a predetermined frame section containing the initial phoneme boundary taken as the center of that spectrum template, finds, for each read spectrum template, the frame in the frame section at which the distance between that spectrum template and the input speech is smallest, and thereby calculates the matching score of each frame;
A phoneme boundary candidate determination unit that determines a frame corresponding to a local maximum of the matching score as a phoneme boundary candidate for the initial phoneme boundary; and
An optimal phoneme boundary search unit that, with a search score function defined as a function that is monotonically non-increasing in the absolute value of the difference between the duration of each phoneme delimited by a set of phoneme boundary candidates and the duration of the phoneme delimited by the set of initial phoneme boundaries corresponding to that phoneme, monotonically non-decreasing in the variance of the duration of each phoneme divided by the set of phoneme boundary candidates, and monotonically non-decreasing in the matching score of each phoneme boundary candidate in the set, and with R being an integer of 2 or more, when there are a plurality of sets of phoneme boundary candidates that delimit R consecutive phonemes, calculates the search score of each of those sets by inputting to the search score function at least one of (i) the duration of each phoneme delimited by that set of phoneme boundary candidates and the duration of the phoneme delimited by the set of initial phoneme boundaries corresponding to each such phoneme, (ii) the variance of the duration of each phoneme divided by that set, read from a duration distribution storage unit in which the variances of the durations of a plurality of phonemes are stored, and (iii) the matching score of each phoneme boundary candidate in that set, and takes the phoneme boundaries constituting the set of phoneme boundary candidates that maximizes the search score as the optimal phoneme boundaries;
A phoneme division apparatus comprising the above units.

The phoneme division apparatus according to claim 1, wherein
Each spectrum template comprises the speech feature amount of each frame in a predetermined frame section containing a phoneme boundary and the speech feature amount of each frame in predetermined frame sections containing the centers of the preceding phoneme and the following phoneme that constitute the phoneme boundary, the frame containing the phoneme boundary is taken as the center of the spectrum template, and spectrum templates for a plurality of phoneme boundaries are stored in the spectrum template storage unit, and
The matching score calculation unit separates, in each read spectrum template, the frame containing the phoneme boundary from the frame containing the center of the preceding phoneme constituting that phoneme boundary by the distance between the frame containing the initial phoneme boundary and the frame containing the center of the preceding phoneme of the initial phoneme boundary, separates the frame containing the phoneme boundary from the frame containing the center of the following phoneme constituting that phoneme boundary by the distance between the frame containing the initial phoneme boundary and the frame containing the center of the following phoneme of the initial phoneme boundary, and then calculates the distance between each read spectrum template and the input speech.
A phoneme division apparatus characterized by the above.

Spectrum templates, each indicating the speech feature amount of each frame constituting a phoneme boundary, being stored in a spectrum template storage unit,
A speech feature amount extraction step in which a speech feature amount extraction unit extracts a speech feature amount of each frame of input speech;
A matching score calculation step in which a matching score calculation unit, with the matching score of a frame defined as the number of spectrum templates whose distance to the input speech is smallest when that frame is taken as the center of the spectrum template, reads from the spectrum template storage unit a plurality of spectrum templates corresponding to an initial phoneme boundary estimated in advance, calculates, using the speech feature amounts, the distance between each read spectrum template and the input speech with each frame included in a predetermined frame section containing the initial phoneme boundary taken as the center of that spectrum template, finds, for each read spectrum template, the frame in the frame section at which the distance between that spectrum template and the input speech is smallest, and thereby calculates the matching score of each frame;
A phoneme boundary candidate determination step in which a phoneme boundary candidate determination unit determines a frame corresponding to a local maximum of the matching score as a phoneme boundary candidate for the initial phoneme boundary; and
An optimal phoneme boundary search step in which an optimal phoneme boundary search unit, with a search score function defined as a function that is monotonically non-increasing in the absolute value of the difference between the duration of each phoneme delimited by a set of phoneme boundary candidates and the duration of the phoneme delimited by the set of initial phoneme boundaries corresponding to that phoneme, monotonically non-decreasing in the variance of the duration of each phoneme divided by the set of phoneme boundary candidates, and monotonically non-decreasing in the matching score of each phoneme boundary candidate in the set, and with R being an integer of 2 or more, when there are a plurality of sets of phoneme boundary candidates that delimit R consecutive phonemes, calculates the search score of each of those sets by inputting to the search score function at least one of (i) the duration of each phoneme delimited by that set of phoneme boundary candidates and the duration of the phoneme delimited by the set of initial phoneme boundaries corresponding to each such phoneme, (ii) the variance of the duration of each phoneme divided by that set, read from a duration distribution storage unit in which the variances of the durations of a plurality of phonemes are stored, and (iii) the matching score of each phoneme boundary candidate in that set, and takes the phoneme boundaries constituting the set of phoneme boundary candidates that maximizes the search score as the optimal phoneme boundaries;
A phoneme division method comprising the above steps.

The phoneme division method according to claim 3, wherein
Each spectrum template comprises the speech feature amount of each frame in a predetermined frame section containing a phoneme boundary and the speech feature amount of each frame in predetermined frame sections containing the centers of the preceding phoneme and the following phoneme that constitute the phoneme boundary, the frame containing the phoneme boundary is taken as the center of the spectrum template, and spectrum templates for a plurality of phoneme boundaries are stored in the spectrum template storage unit, and
In the matching score calculation step, in each read spectrum template, the frame containing the phoneme boundary is separated from the frame containing the center of the preceding phoneme constituting that phoneme boundary by the distance between the frame containing the initial phoneme boundary and the frame containing the center of the preceding phoneme of the initial phoneme boundary, the frame containing the phoneme boundary is separated from the frame containing the center of the following phoneme constituting that phoneme boundary by the distance between the frame containing the initial phoneme boundary and the frame containing the center of the following phoneme of the initial phoneme boundary, and the distance between each read spectrum template and the input speech is then calculated.
A phoneme division method characterized by the above.

A phoneme division program for causing a computer to function as each unit of the phoneme division apparatus according to claim 1.
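
For readers who want a concrete picture of the matching score recited in the claims above, the following Python sketch counts, for each frame in a search section, how many spectrum templates fit the input speech best when centered on that frame, and then picks local maxima as boundary candidates. It is a non-authoritative illustration: the sum-of-squared-differences distance, the function names, the handling of plateaus, and the treatment of templates that extend past the utterance are assumptions of this sketch, and the template adjustment of claims 2 and 4 is omitted.

```python
def template_distance(features, template, center):
    """Squared-error distance with the template's middle row on `center`.

    Assumption: a plain sum of squared differences stands in for whatever
    distance measure an implementation actually uses.
    """
    half = len(template) // 2
    total = 0.0
    for k, row in enumerate(template):
        t = center - half + k
        if t < 0 or t >= len(features):
            return float("inf")  # template would run off the utterance
        total += sum((a - b) ** 2 for a, b in zip(features[t], row))
    return total


def matching_scores(features, templates, section):
    """For each frame in `section`, count the templates that fit best there."""
    scores = {t: 0 for t in section}
    for template in templates:
        best = min(section,
                   key=lambda t: template_distance(features, template, t))
        scores[best] += 1
    return scores


def boundary_candidates(scores):
    """Frames whose nonzero matching score is a local maximum.

    Assumption: ties with a neighbor still count as a local maximum.
    """
    frames = sorted(scores)
    found = []
    for i, t in enumerate(frames):
        left = scores[frames[i - 1]] if i > 0 else -1
        right = scores[frames[i + 1]] if i + 1 < len(frames) else -1
        if scores[t] > 0 and scores[t] >= left and scores[t] >= right:
            found.append(t)
    return found
```

Here `features` is a list of per-frame feature vectors, each template is a short list of feature vectors whose middle row is its center, and `section` is the range of frames around the initial phoneme boundary.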