JP2014219635A - Pause insertion device and method and program thereof

Info

Publication number: JP2014219635A
Application number: JP2013100502A
Authority: JP
Inventors: Hideji Nakajima (中嶋秀治); Hiroko Murakami (村上博子)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-05-10
Filing date: 2013-05-10
Publication date: 2014-11-20
Anticipated expiration: 2033-05-10
Also published as: JP6009403B2

Abstract

PROBLEM TO BE SOLVED: To provide a pause insertion device that inserts pauses at appropriate positions, according to the reading speed, in text information to be speech-synthesized. SOLUTION: A feature extraction unit 110 receives text information and extracts a feature vector representing the features of the text information. A reading-speed-specific model associates the feature vector with pause information indicating whether or not a pause is inserted. A model selection unit 130 receives a specified reading speed and the reading-speed-specific models, and selects and outputs the model corresponding to the specified reading speed. A pause position prediction unit 140 receives the feature vector and matches it against the selected reading-speed-specific model to output the pause insertion positions.

Description

The present invention relates to a pause insertion device that determines pause positions from text information for speech synthesis, and to a method and a program therefor.

In conventional speech synthesis, the text to be synthesized was information guidance, news, and the like, and a constant reading speed was assumed. Pause insertion devices, methods, and programs were therefore also built on the premise of generating synthesized speech at a constant speed: target speech was collected, and the device was designed to reproduce the pause placement found in that speech. A representative example is the pause insertion technique described in Non-Patent Document 1.

The pause insertion technique disclosed in Non-Patent Document 1 statistically models features such as the surface form, part of speech, and reading (syllables) of words, and decides whether the position after (or before) each word in the text should be a pause position. When rules for predicting pause positions are written by hand, they are likewise written to reproduce the pause positions found in target speech that was collected on the premise of a constant reading speed.

Non-Patent Document 1: V. Keri et al., “Pause Prediction from Lexical and Syntax Information”, International Conference on Natural Language Processing (ICON-2007), 2007.

As speech synthesis technology advances, the synthesized speech required is becoming more diverse. In addition to text read at a constant speed, such as conventional information guidance and news, there is a growing amount of text, such as dialogue text and advertising text, that must be read at various speeds to match human emotions and other circumstances. As described above, however, pause insertion for speech synthesis text has remained limited to techniques based on rules (models) that assume a constant reading speed or ignore differences in reading speed. Because those conventional rules are built to reproduce the average pause positions in the collected speech data, they cannot insert pauses at appropriate positions when synthesizing text, such as dialogue, that must be read faster or slower than average. Moreover, since differences in reading speed are not accounted for in the first place, the pause insertion rules are themselves averaged, which also harms the accuracy of the rules (models). In short, conventional pause insertion devices cannot handle multiple reading speeds.

The present invention has been made in view of these problems, and its object is to provide a pause insertion device that can add accurate pause information matched to the reading speed to text information, together with a method and a program therefor.

The pause insertion device of the present invention comprises a feature extraction unit, reading-speed-specific models, a model selection unit, and a pause position prediction unit. The feature extraction unit receives text information and extracts a feature vector, i.e. the features needed for pause position prediction. Each reading-speed-specific model associates the feature vector with pause information indicating whether or not a pause is placed. The model selection unit receives a specified reading speed and the reading-speed-specific models, and selects and outputs the model corresponding to the specified reading speed. The pause position prediction unit receives the feature vector and predicts the pause insertion positions in the text information by matching the feature vector against the reading-speed-specific model selected by the model selection unit.

Because the pause insertion device of the present invention inserts pauses into text information using reading-speed-specific models, it can predict accurate pause insertion positions that match the reading speed. In addition, because the rules (models) are built on the premise that the reading speed varies, the accuracy of pause insertion can also be improved.

FIG. 1 is a diagram showing an example functional configuration of the pause insertion device 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the pause insertion device 100.
FIG. 3 is a diagram showing part of an example of a reading-speed-specific model.
FIG. 4 is a diagram showing an example functional configuration of the pause insertion device 200 of the present invention.
FIG. 5 is a diagram showing the operation flow of the pause insertion device 200.
FIG. 6 is a diagram conceptually showing the all-speed model 220.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings are given the same reference numerals, and their description is not repeated.
[Concept of the invention]
Before the embodiments are described, the new idea behind this invention is explained. When a sentence of more than a certain length is spoken, the pauses used when speaking slowly usually differ from those used when speaking quickly. For example, when the sentence 「大変貴重なお土産を頂きましてどうもありがとうございます。」 ("Thank you very much for the very valuable souvenir.") is read quickly, the first pause P_1 is likely to be placed after 「て」. When it is read slowly, on the other hand, it is natural for pause P_1 to be placed after 「貴重な」, pause P_2 after 「お土産を」, and pause P_3 after 「て」. The positions at which pauses are inserted thus differ naturally with the speed at which the text information is read.

If, for example, part-of-speech sequences are used as the features for pause insertion and matches are sought between partial part-of-speech sequences a few words long, partial sequences with exactly the same parts of speech are likely to recur. By extracting, from among those partial part-of-speech sequences, the characteristic ones at which a pause is inserted and building a pause insertion model from them, accurate and natural pauses can be inserted, and synthesized speech that is natural and easy to follow can be generated.

In this invention, a conventional model that decides pause positions from features such as partial part-of-speech sequences is prepared for each reading speed, and the pause insertion positions are predicted from the correspondence between the model for the specified reading speed and the features extracted from the text information. Because the pause insertion device of this invention has a pause insertion model for each reading speed, it can predict accurate (natural) pause positions that match the reading speed.

FIG. 1 shows an example functional configuration of a pause insertion device 100 of this invention, and FIG. 2 shows its operation flow. The pause insertion device 100 comprises a feature extraction unit 110, reading-speed-specific models 120, a model selection unit 130, a pause position prediction unit 140, and a control unit 150. The pause insertion device 100 is realized by loading a predetermined program into a computer made up of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program. The same applies to the other embodiments described below.

The feature extraction unit 110 receives text information and extracts a feature vector representing the features of that text information (step S110). The features are, for example, the syllables and parts of speech of the words making up the text information. They can be obtained by, for example, the method of Reference 1 (Yuji Matsumoto, "Morphological Analysis System ChaSen", Joho Shori (IPSJ Magazine), 41(11), pp. 1208-1214, 2000). The surface form of a word, i.e. its written form, may be used instead of its syllables.

Table 1 shows an example of a feature vector: the word surface forms and the part-of-speech sequence for the sentence 「大変貴重なお土産を頂きましてどうもありがとうございます。」.

The feature vector consists of the surface forms, the parts of speech, or both, arranged along the time axis. A feature vector made up of surface forms W and parts of speech P is given by Equation (1).
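Equation (1) itself is not reproduced in this text. A plausible reconstruction, assuming the vector simply interleaves the surface form W_i and part of speech P_i of each of n consecutive words in the context window (the exact ordering in the original equation is an assumption), is:

```latex
x = (W_1, P_1, W_2, P_2, \ldots, W_n, P_n) \tag{1}
```

Dropping every P_i or every W_i gives the surface-form-only or part-of-speech-only variants of the vector.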

The feature vector may consist of surface forms W only or of parts of speech P only. It is built by taking the surface forms and parts of speech of several words on the sentence-initial side of the word under judgment and several words on its sentence-final side. n is the largest index when the words are counted from the one nearest the start of the sentence to the one nearest the end, and can be determined in advance by experiment.
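The window-based construction described here can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `feature_vector`, the `left`/`right` window parameters, and the `<pad>` filler are assumptions introduced for the example, and the part-of-speech tags are English renderings of the tags mentioned in the FIG. 3 walk-through.

```python
from typing import List, Tuple

PAD = ("<pad>", "<pad>")  # hypothetical filler for positions outside the sentence


def feature_vector(words: List[Tuple[str, str]], i: int,
                   left: int = 2, right: int = 2) -> List[str]:
    """Return [W_1, P_1, ..., W_n, P_n] for the window around word i.

    words: (surface form, part of speech) pairs from a morphological analyzer.
    left / right: number of words taken toward the sentence start / end,
    so that n = left + right + 1.
    """
    vec: List[str] = []
    for j in range(i - left, i + right + 1):
        w, p = words[j] if 0 <= j < len(words) else PAD
        vec.extend([w, p])
    return vec


# First three words of the example sentence.
sentence = [("大変", "noun"),
            ("貴重", "noun-adjectival stem"),
            ("な", "auxiliary verb")]

# Judge the position after "な" using only words on the sentence-initial side.
vec = feature_vector(sentence, i=2, left=2, right=0)
```

Here `vec` holds W_1, P_1, W_2, P_2, W_3, P_3 for 「大変貴重な」, with n = 3.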

The reading-speed-specific models 120 are statistical models, one per reading speed, each built from data in which the features extracted from text information are recorded together with the pause positions for that speed. Matching a feature vector against such a model yields pause information indicating whether or not a pause is placed. The statistical model is, for example, a classification tree that classifies the presence or absence of a pause. The classification tree is the well-known Classification and Regression Tree described in Non-Patent Document 1. The classification tree may be replaced by a probabilistic model (CRF: Conditional Random Field) expressed by Equation (2). The probabilistic model P(y|x) is also well known, as described in Reference 2 (Okumura (supervising ed.), Takamura, "Introduction to Machine Learning for Language Processing", Corona Publishing, p. 153).

Here x is the feature vector made up of surface forms W and parts of speech P, and y is a variable indicating whether or not a pause is inserted. φ is the feature-function vector: it takes x and y as arguments and stacks, for the various patterns of x described above, functions that return 1 when y indicates that a pause is placed and 0 when it does not. w is the weight vector for the feature functions φ. Z_{x,w} is the normalization term that makes P(y|x) sum to 1. "·" denotes the inner product of vectors. The reading-speed-specific model 120 is made up of w, φ, and Z_{x,w}. Both the classification tree and the probabilistic model can be built by well-known, general methods. The pause decision is then made by choosing whichever of "insert" and "do not insert" has the larger probability.
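Equation (2) is not reproduced in this text. Given the symbols defined here (w, φ, Z_{x,w}, and the inner product "·"), it is presumably the standard log-linear form; the following is a reconstruction under that assumption, not a copy of the original equation:

```latex
P(y \mid x) = \frac{\exp\bigl(w \cdot \phi(x, y)\bigr)}{Z_{x,w}}, \qquad
Z_{x,w} = \sum_{y'} \exp\bigl(w \cdot \phi(x, y')\bigr) \tag{2}
```

Because Z_{x,w} is the same for both values of y, comparing the two probabilities is equivalent to comparing the two inner products w · φ(x, y).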

The reading-speed-specific model 120 may be a classification tree, a probabilistic model, or an n-gram model. Such models are prepared separately for each reading speed, for example a low-speed model 120_1, a medium-speed model 120_2, and a high-speed model 120_3.

The medium-speed model 120_2 can be, for example, a model that assumes a reading speed of about 400 characters per minute, the standard reading speed of an announcer (Reference 3: Tomono Miki (三木朋乃) et al., "NHK Science & Technology Research Laboratories, NHK Engineering Services, Victor Company of Japan: Development of radios and televisions equipped with speech-rate conversion technology", Okochi Prize Case Research Project, IIR Case Study CASE#10-03, Hitotsubashi University Institute of Innovation Research, April 2010). The low-speed model 120_1 can be a model that assumes a reading speed of, for example, about 80% of that of the medium-speed model 120_2, and the high-speed model 120_3 a model that assumes, for example, about 120%. Reading-speed-specific models 120 trained with concrete reading speeds as the index in this way can insert pauses more accurately than conventional models, which were never designed to account for reading speed.

The model selection unit 130 receives a specified reading speed and the reading-speed-specific models, and selects and outputs the model corresponding to the specified reading speed (step S130). When the externally specified reading speed is "low", the model selection unit 130 selects the low-speed model 120_1 and outputs it to the pause position prediction unit 140. Likewise, the medium-speed model 120_2 is selected when the specified reading speed is "medium", and the high-speed model 120_3 when it is "high".
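Step S130 amounts to a lookup keyed by the specified speed. The sketch below illustrates that idea only; the names `PauseModel`, `NOMINAL_SPEED`, and `select_model` and the string labels "low", "medium", and "high" are assumptions made for the example. The nominal speeds use the 400 characters-per-minute figure and the 80% and 120% ratios given as examples in the text.

```python
from typing import Callable, Dict, List

# A model maps a feature vector to True (insert a pause) or False (no pause).
PauseModel = Callable[[List[str]], bool]

MEDIUM_CHARS_PER_MIN = 400  # announcer-style reading speed cited in the text

# Nominal speeds (characters per minute) each model is assumed to target.
NOMINAL_SPEED: Dict[str, int] = {
    "low": MEDIUM_CHARS_PER_MIN * 80 // 100,
    "medium": MEDIUM_CHARS_PER_MIN,
    "high": MEDIUM_CHARS_PER_MIN * 120 // 100,
}


def select_model(models: Dict[str, PauseModel], speed: str) -> PauseModel:
    """Step S130: return the model registered for the specified reading speed."""
    if speed not in models:
        raise ValueError(f"no model for reading speed {speed!r}")
    return models[speed]


# Placeholder models standing in for trained models 120_1 to 120_3.
models: Dict[str, PauseModel] = {
    "low": lambda vec: True,
    "medium": lambda vec: False,
    "high": lambda vec: False,
}
selected = select_model(models, "low")
```

The selected model is then handed to the pause position prediction step unchanged, so the prediction code does not need to know which speed was requested.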

The pause position prediction unit 140 receives the feature vector output by the feature extraction unit 110 and predicts the pause insertion positions in the text information by matching the feature vector against the reading-speed-specific model selected by the model selection unit 130 (step S140). Steps S110 and S140 are repeated until all of the text information has been processed (No in step S150). This repetition is handled by the control unit 150. The control unit 150 is an ordinary component that controls the time-series operation of each part of the pause insertion device 100 and performs no special processing.

Prediction of the pause insertion position is described with reference to FIG. 3. FIG. 3 is part of an example of a reading-speed-specific model represented as a classification tree. It is a classification tree for feature vectors made up of part-of-speech sequences and surface forms, and shows the portion that handles the 「大変貴重な」 ("very valuable") part of the sentence above. As in Table 1, 「大変」 is W_1, 「貴重」 is W_2, and 「な」 is W_3. Consider deciding whether to place a pause immediately after W_3. Following the classification tree with the feature vector for 「大変貴重な」, the part of speech P_1 of surface form W_1 is a noun (Yes in step S1201), the part of speech P_2 of surface form W_2 is a noun (adjectival-verb stem) (Yes in step S1202), and the part of speech P_3 of surface form W_3 is an auxiliary verb (Yes in step S1203); since surface form W_3 = 「な」 (Yes in step S1204), a pause is inserted after surface form W_3 (step S1205). In this case the pause position prediction unit 140 outputs, for example, "insert pause", which means that a pause is inserted after surface form W_3. If surface form W_3 is anything other than 「な」, no pause is inserted after it (step S1206).

Matching a feature vector against a reading-speed-specific model thus means following the model 120, built for example as a classification tree, according to the feature vector. Following the tree in this way predicts the positions at which pauses are inserted. In this example only words on the sentence-initial side of W_3 were used to decide whether to insert a pause, but when words on the sentence-final side of W_3 appear at branch points of the classification tree, the decision is made by following the tree in the same way.
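The traversal described for FIG. 3 can be sketched as a small binary decision tree. This is an illustrative sketch only: the `Node` class and `predict` function are assumptions, and only the path through steps S1201 to S1206 comes from the text. The "No" branches of the first three tests are not described, so the sketch fills them with a placeholder `False` leaf.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Features = Dict[str, str]  # e.g. {"W1": "大変", "P1": "noun", ...}


@dataclass
class Node:
    test: Callable[[Features], bool]  # question asked at this branch point
    yes: Union["Node", bool]          # subtree or leaf taken when the test holds
    no: Union["Node", bool]           # subtree or leaf taken otherwise


def predict(tree: Union[Node, bool], x: Features) -> bool:
    """Follow the branches selected by x until a leaf (True = insert pause)."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.test(x) else tree.no
    return tree


# Fragment corresponding to steps S1201-S1206. Leaves marked "placeholder"
# are not described in the text and are assumed here to mean "no pause".
fig3_fragment = Node(
    lambda x: x["P1"] == "noun",                                # S1201
    Node(
        lambda x: x["P2"] == "noun-adjectival stem",            # S1202
        Node(
            lambda x: x["P3"] == "auxiliary verb",              # S1203
            Node(lambda x: x["W3"] == "な", True, False),       # S1204 -> S1205 / S1206
            False,                                              # placeholder
        ),
        False,                                                  # placeholder
    ),
    False,                                                      # placeholder
)

x = {"W1": "大変", "P1": "noun",
     "W2": "貴重", "P2": "noun-adjectival stem",
     "W3": "な", "P3": "auxiliary verb"}
insert_pause_after_w3 = predict(fig3_fragment, x)
```

A branch that tests a word on the sentence-final side of W_3 would be written the same way, with a test on a later key of the feature dictionary.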

Suppose, for example, that the reading-speed-specific model consisting of the classification tree in FIG. 3 is the low-speed model 120_1, that the text information is 「大変貴重なお土産を頂きましてどうもありがとうございます」, and that pause positions are written with the comma 「、」. Pauses are then inserted at three places: 「大変貴重な、お土産を、頂きまして、ありがとうございます」.

With the high-speed model 120_3, on the other hand, a pause is inserted at one place, for example 「大変貴重なお土産を頂きまして、ありがとうございます」. Preparing a model for each reading speed in this way makes it possible to insert accurate, natural pauses matched to the specified reading speed into the same text information.
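To make the two outcomes concrete, the sketch below renders per-word pause decisions as commas in the output text. The word segmentation and the function `render_with_pauses` are assumptions made for the example, and the pause flags are set by hand to mirror the low-speed and high-speed examples rather than produced by a trained model.

```python
from typing import List


def render_with_pauses(words: List[str], pause_after: List[bool],
                       mark: str = "、") -> str:
    """Join words, writing `mark` after each word flagged for a pause."""
    if len(words) != len(pause_after):
        raise ValueError("words and pause_after must have the same length")
    return "".join(w + (mark if p else "") for w, p in zip(words, pause_after))


# Hypothetical segmentation of the example sentence.
words = ["大変", "貴重", "な", "お土産", "を", "頂き", "まし", "て",
         "どうも", "ありがとう", "ござい", "ます"]

slow_flags = [w in ("な", "を", "て") for w in words]  # three pauses
fast_flags = [w == "て" for w in words]                # one pause

slow_text = render_with_pauses(words, slow_flags)
fast_text = render_with_pauses(words, fast_flags)
```

The same word list yields two differently punctuated strings, which is the behaviour the speed-specific models are meant to provide.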

The method of inserting pauses with the probabilistic model of Equation (2) is now described. With this model, a weight w is computed at model-building time for each φ, where the φ are the various combinations of a vector made up of a subset of the elements of the feature vector x with the variable y that indicates whether or not a pause is placed. The subsets of x referenced by φ can be chosen and configured in various ways by the model designer, based on knowledge about pause insertion.

For example, three φ can be formed: from W_1 alone and y; from W_2, W_3, and y; and from W_1, W_2, W_3, and y. The weights w are determined from the data used to build the model. To decide whether to insert a pause after 「な」 in 「大変貴重な」 above, each of the three φ is evaluated as 1 or 0, each value is multiplied by its weight and the products are summed (that is, the inner product of w and φ is computed), and the sum is passed through the exponential function and normalized by Z_{x,w} to give the probability. This probability is computed both for inserting a pause and for not inserting one, and the alternative with the larger probability is adopted as the result.
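The calculation above can be sketched as follows, using the three feature functions named in the text. The weights are made-up numbers for illustration, and the function names `phi` and `pause_probabilities` are assumptions. As described earlier in the text, each feature returns 1 only when its pattern matches and y indicates a pause.

```python
import math
from typing import Dict, List


def phi(x: Dict[str, str], y: int) -> List[float]:
    """Three binary features; each fires only when y == 1 (pause)."""
    fires = [
        x["W1"] == "大変",                                        # W_1 alone
        x["W2"] == "貴重" and x["W3"] == "な",                    # W_2 and W_3
        x["W1"] == "大変" and x["W2"] == "貴重" and x["W3"] == "な",  # W_1, W_2, W_3
    ]
    return [1.0 if (f and y == 1) else 0.0 for f in fires]


def pause_probabilities(x: Dict[str, str], w: List[float]) -> Dict[int, float]:
    """Return {0: P(no pause | x), 1: P(pause | x)} for a log-linear model."""
    scores = {y: math.exp(sum(wi * fi for wi, fi in zip(w, phi(x, y))))
              for y in (0, 1)}
    z = sum(scores.values())  # normalization term Z_{x,w}
    return {y: s / z for y, s in scores.items()}


w = [0.2, 0.5, 0.8]  # hypothetical weights, not taken from the patent
x = {"W1": "大変", "W2": "貴重", "W3": "な"}
p = pause_probabilities(x, w)
insert_pause = p[1] > p[0]  # adopt the alternative with the larger probability
```

With these weights all three features fire for y = 1, so the pause alternative receives the larger probability; a trained model would learn the weights from speed-specific data instead.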

FIG. 4 shows an example functional configuration of a pause insertion device 200 of this invention, and FIG. 5 shows its operation flow. The pause insertion device 200 comprises a feature extraction unit 110, an all-speed model 220, and a pause position prediction unit 240.

The feature extraction unit 110 receives text information and extracts a feature vector representing the features of that text information (step S110). As its reference numeral indicates, the feature extraction unit 110 is the same as that of the pause insertion device 100, and it extracts the same features.

The all-speed model 220 is a statistical model that covers multiple reading speeds and is built from features extracted from text information. Matching the feature vector against this statistical model yields pause information indicating whether or not a pause is placed.

The all-speed model 220 contains the statistical models for multiple reading speeds, such as the low-speed model 120_1, the medium-speed model 120_2, and the high-speed model 120_3 described above. FIG. 6 shows the all-speed model 220 conceptually.

FIG. 6(a) shows an example in which the speed feature is placed at the root branch point of the classification tree that constitutes the all-speed model 220. The low-speed model 120_1, the medium-speed model 120_2, and the high-speed model 120_3, shown as the reading-speed-specific models 120 in FIG. 1, together form the all-speed model 220 as a single classification tree.

FIG. 6(b) shows an example of an all-speed model 220′ in which speed features are placed at various branch points of the classification tree. Branches on the speed feature may thus appear in various parts of the tree. The labels "¬V1", "¬V2", and "¬V3" in FIG. 6 denote branches on features other than speed, each a different feature. All-speed models 220 and 220′ with the structures shown in FIG. 6 are ordinary models that can be built by conventional methods, in the same way as the reading-speed-specific models described above.

The pause position prediction unit 240 builds a feature vector from the externally specified reading speed and the features output by the feature extraction unit 110, and predicts the pause insertion positions by matching that feature vector against the all-speed model 220 (step S240). Steps S110 and S240 are repeated until all of the text information has been processed (No in step S250). This repetition is handled by the control unit 250.
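The difference from the first embodiment is that the speed becomes one more feature rather than a selector outside the model. The sketch below illustrates the FIG. 6(a) arrangement, in which the first decision is on speed. All names (`make_all_speed_model`, `predict_pause`, the "speed" key) and the placeholder sub-models are assumptions for the example.

```python
from typing import Callable, Dict

Features = Dict[str, str]
Model = Callable[[Features], bool]


def make_all_speed_model(sub_models: Dict[str, Model]) -> Model:
    """Build one model whose root decision is on the 'speed' feature."""
    def model(x: Features) -> bool:
        return sub_models[x["speed"]](x)
    return model


def predict_pause(model: Model, text_features: Features, speed: str) -> bool:
    """Step S240: add the specified speed to the features, then query the model."""
    x = dict(text_features, speed=speed)
    return model(x)


# Placeholder sub-models, invented for illustration only.
all_speed_model = make_all_speed_model({
    "low": lambda x: x["W3"] == "な",
    "medium": lambda x: False,
    "high": lambda x: False,
})

text_features = {"W1": "大変", "W2": "貴重", "W3": "な"}
low_result = predict_pause(all_speed_model, text_features, "low")
high_result = predict_pause(all_speed_model, text_features, "high")
```

In the FIG. 6(b) arrangement the speed test would instead appear at inner branch points, which a learned classification tree can produce on its own when speed is included among the training features.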

The configuration of the pause insertion device 200 described above can also insert pauses at accurate, natural positions matched to the reading speed specified for the speech synthesis text information.

When the processing means of the above devices are realized by a computer, the processing of the functions each device should have is described by a program. Executing this program on the computer realizes the processing means of each device on the computer.

The program is distributed by, for example, selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it over a network from the server computer to other computers.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing may be realized in hardware.

Claims

A pause insertion device comprising:
a feature extraction unit that receives text information and extracts a feature vector, i.e. the features needed for pause position prediction;
reading-speed-specific models that each associate the feature vector with pause information indicating whether or not a pause is placed;
a model selection unit that receives a specified reading speed and the reading-speed-specific models, and selects and outputs the reading-speed-specific model corresponding to the specified reading speed; and
a pause position prediction unit that receives the feature vector and predicts the pause insertion positions in the text information by matching the feature vector against the reading-speed-specific model selected by the model selection unit.

A pause insertion device comprising:
a feature extraction unit that receives text information and extracts a feature vector, i.e. the features needed for pause position prediction;
an all-speed model that associates a specified reading speed and the feature vector, both given as input, with pause information indicating whether or not a pause is placed; and
a pause position prediction unit that receives the specified reading speed and the feature vector and predicts the pause insertion positions in the text information by matching the specified reading speed and the feature vector against the all-speed model.

The pause insertion device according to claim 1 or 2,
wherein the feature vector is expressed by the following equation (omitted in this text)
and is a vector of word surface forms W and parts of speech P.

A pause insertion method comprising:
a feature extraction step of receiving text information and extracting a feature vector, i.e. the features needed for pause position prediction;
a model selection step of receiving a specified reading speed and reading-speed-specific models, each of which associates the feature vector with pause information indicating whether or not a pause is placed, and selecting and outputting the reading-speed-specific model corresponding to the specified reading speed; and
a pause position prediction step of receiving the feature vector and predicting the pause insertion positions in the text information by matching the feature vector against the reading-speed-specific model selected in the model selection step.

A pause insertion method comprising:
a feature extraction step of receiving text information and extracting a feature vector, i.e. the features needed for pause position prediction; and
a pause position prediction step of receiving a specified reading speed, the feature vector, and an all-speed model that associates the specified reading speed and the feature vector with pause information indicating whether or not a pause is placed, and predicting the pause insertion positions in the text information by matching the specified reading speed and the feature vector against the all-speed model.

A program for causing a computer to function as the pause insertion device according to claim 1.