JP2010237307A

JP2010237307A - Speech learning/synthesis system and speech learning/synthesis method

Info

Publication number: JP2010237307A
Application number: JP2009083164A
Authority: JP
Inventors: Hideyuki Mizuno; 秀之水野; Noboru Miyazaki; 昇宮崎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-03-30
Filing date: 2009-03-30
Publication date: 2010-10-21
Anticipated expiration: 2029-03-30
Also published as: JP5049310B2

Abstract

PROBLEM TO BE SOLVED: To provide an easy-to-use speech learning/synthesis system and method that can reduce facility costs and the burden of the costs on a user, and execute speech learning processing even during use of a portable terminal.

SOLUTION: The speech learning/synthesis system includes a user terminal 100 to which speech data and a text are input, and a server 200 connected to the user terminal 100 through a network. The user terminal 100 includes a feature quantity analyzer 110 which analyzes/extracts a feature quantity from speech data, and a waveform generator 140 which generates a synthesized speech from intermediate information, and the server 200 includes a DB generator 210 which generates a sound source DB using feature quantities, and an intermediate information generator 220 which generates the intermediate information from the text. The feature quantity and text are transmitted from the user terminal 100 to the server 200, and the intermediate information is transmitted from the server 200 to the user terminal 100.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a speech learning/synthesis system and a speech learning/synthesis method that learn input speech data and then, when text is input, generate and output synthesized speech having the voice characteristics obtained through the learning.

In recent years, statistical speech synthesis techniques have been developed that generate synthesized speech from arbitrary text by automatically learning and modeling the statistical characteristics of a given speech corpus (see, for example, Non-Patent Document 1).

With such speech synthesis techniques, it has in principle become possible to generate synthesized speech that reproduces the characteristics of a speech corpus. Specifically, static and dynamic feature quantities relating to the speech spectrum, the fundamental frequency (F0), and phoneme durations are analyzed and extracted from the speech corpus in advance, and hidden Markov models (HMMs) are trained using the EM algorithm. When speech is synthesized from text, appropriate HMMs are selected according to the input text and concatenated, feature quantity sequences for the spectrum, fundamental frequency, and phoneme duration are generated, and synthesized speech is generated from these sequences.

Non-Patent Document 1: Yoshimura et al., "Simultaneous Modeling of Spectrum, Pitch, and State Duration in HMM-Based Speech Synthesis", IEICE Transactions (in Japanese), J83-D-II, No. 11, pp. 2099-2107, November 2000

Speech learning and synthesis as described above face several problems.
The first problem is the large amount of processing required for speech learning. In general, learning-related processing is computationally expensive and also requires a large amount of memory. One application area for speech synthesis is voice services on terminals such as telephones, but even the latest mobile terminals do not have the computing power and resources needed for such learning, so it is difficult to complete the learning and synthesis processing on the terminal side.

The second problem is equipment cost. An obvious way to solve the first problem is to perform the processing on the server side rather than on the terminal side. In that case, however, the processing is concentrated on the server, so large-scale equipment must be provided in proportion to the number of terminals, which is a major problem in terms of cost.

The third problem concerns usability for the user, in terms of the delay associated with network (NW) transmission and the per-packet charge. When processing is performed on the server, speech for learning and synthesized speech must be transmitted between the terminal and the server over the network. Because network transmission involves delay and packet loss, processing such as buffering is required in order to play back the synthesized speech on the terminal without interruption, and this inevitably introduces waiting time. In addition, on networks such as mobile phone networks where a usage-based per-packet charge applies, delivering large amounts of speech data places a heavy cost burden on the user.

An object of the present invention is to provide a distributed speech learning/synthesis system and method capable of solving these problems.

According to the invention of claim 1, a speech learning/synthesis system that learns input speech data and, based on the learning, generates synthesized speech for input text comprises a user terminal to which the speech data and the text are input, and a server connected to the user terminal via a network. The user terminal includes a feature quantity analysis unit that analyzes and extracts feature quantities from the speech data, and a waveform generation unit that generates synthesized speech from intermediate information. The server includes a DB generation unit that generates a sound source DB using the feature quantities, and an intermediate information generation unit that generates the intermediate information from the text. The feature quantities and the text are transmitted from the user terminal to the server, and the intermediate information is transmitted from the server to the user terminal.

In the invention of claim 2, in the invention of claim 1, the user terminal includes a DB request unit that requests the server to transmit the sound source DB, and the server transmits the sound source DB to the user terminal in response to the request.

According to the invention of claim 3, a speech learning/synthesis method that learns input speech data and, based on the learning, generates synthesized speech for input text uses a user terminal and a server connected via a network. The learning comprises a step in which the user terminal analyzes and extracts feature quantities from the input speech data, a step in which the user terminal transmits the feature quantities to the server, and a step in which the server generates a sound source DB using the received feature quantities. The synthesis comprises a step in which the user terminal transmits the input text to the server, a step in which the server generates intermediate information from the received text, a step in which the server transmits the intermediate information to the user terminal, and a step in which the user terminal generates synthesized speech from the received intermediate information.

In the invention of claim 4, the invention of claim 3 further includes a step in which the user terminal requests the server to transmit the sound source DB, and a step in which the server transmits the sound source DB to the user terminal in response to the request.

According to the present invention, processing is shared between the user terminal and the server: the server executes the learning processing and other processing that requires a large amount of computation or memory, and the user terminal executes only processing that requires little computation and memory. Speech learning processing can therefore be executed even when, for example, a mobile terminal is used as the user terminal, which solves the first problem described above.

Moreover, because part of the processing is executed on the user terminal instead of all processing being executed on the server alone, the amount of processing required on the server side can be reduced. As a result, the number of servers that must be provided per user terminal can be reduced, or low-cost servers with lower computing capacity can be used, so the equipment cost can be reduced and the second problem described above can be solved.

Furthermore, 1) only the intermediate information is transmitted at synthesis time and the user terminal performs the synthesis processing from that intermediate information, so the amount of transmitted data is smaller than when synthesized speech is transmitted from the server, and the waiting time for buffering can be shortened.

2) When randomly occurring network delays arise, transmitting speech unavoidably leads to unexpected interruptions of the audio due to buffering, which makes the speech hard to follow. In contrast, when the intermediate information is transmitted in units that are easy for a listener to follow, such as accent phrases, and converted into speech on the user terminal, any delay that does occur falls on the boundary of an accent phrase or similar unit, so the speech is less difficult to listen to.

3) At learning time, the analysis processing is executed on the user terminal and the feature quantities are transmitted, so the amount of transmitted data is smaller than when the speech itself is transmitted. Together with the fact that only the intermediate information is transmitted to the user terminal at synthesis time, this reduces the amount of data exchanged between the user terminal and the server over the network, and thus reduces the cost burden on the user.
The effects 1) to 3) therefore solve the third problem described above.
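The data-volume argument in effects 1) and 3) can be made concrete with a rough calculation. The sketch below compares bytes per second for raw PCM speech, frame-wise feature vectors, and index-only intermediate information. Every numeric setting (16 kHz 16-bit audio, a 5 ms frame shift, 26 values per frame stored in 2 bytes each, 10 phonemes per second with 4 bytes of index data each) is an illustrative assumption and does not come from the patent text.

```python
# Back-of-the-envelope payload comparison. All default figures are assumptions
# chosen for illustration; the patent does not specify any of them.

def pcm_bytes_per_second(sample_rate_hz: int = 16_000, bytes_per_sample: int = 2) -> int:
    """Raw mono PCM speech sent as-is."""
    return sample_rate_hz * bytes_per_sample


def feature_bytes_per_second(frame_shift_ms: int = 5,
                             values_per_frame: int = 26,  # e.g. 25 spectral values + F0
                             bytes_per_value: int = 2) -> int:
    """Frame-wise feature vectors sent instead of the waveform."""
    frames_per_second = 1000 // frame_shift_ms
    return frames_per_second * values_per_frame * bytes_per_value


def index_bytes_per_second(phonemes_per_second: int = 10, bytes_per_phoneme: int = 4) -> int:
    """Index-only intermediate information (one small record per phoneme)."""
    return phonemes_per_second * bytes_per_phoneme


if __name__ == "__main__":
    for name, fn in [("pcm", pcm_bytes_per_second),
                     ("features", feature_bytes_per_second),
                     ("indices", index_bytes_per_second)]:
        print(f"{name:>8}: {fn()} bytes/s")
```

Under these assumed settings the ordering is raw speech, then feature quantities, then indices; with a different frame shift or quantization the ratios change, so the figures should be read as an order-of-magnitude illustration only.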

In addition, the feature quantity analysis performed at learning time and the speech synthesis from intermediate information are, in theory, inverse processes of each other and have much in common as computational algorithms. Because the user terminal carries both the feature quantity analysis unit and the waveform generation unit that performs the synthesis processing, a large part of the program code can be shared, and the program code size required on the user terminal is smaller than if a complete learning unit and a complete synthesis unit were simply installed. This also reduces program development cost and makes the invention applicable to terminals with little installed memory.

FIG. 1 is a diagram showing the overall configuration of an embodiment of the speech learning/synthesis system according to the present invention.
FIG. 2 is a diagram showing the configuration of the user terminal in FIG. 1.
FIG. 3 is a diagram showing the configuration of the server in FIG. 1.
FIG. 4 is a sequence diagram for explaining the processing procedure of an embodiment of the speech learning/synthesis system according to the present invention.

Embodiments of the present invention are described below by way of examples with reference to the drawings.
FIG. 1 shows the overall configuration of an embodiment of the speech learning/synthesis system according to the present invention. A user terminal 100 and a server 200 are connected to each other via a network 10, and in this example the user terminal 100 and the server 200 together constitute the speech learning/synthesis system.

FIG. 2 shows the configuration of the user terminal 100, which is described first with reference to FIG. 2.
In this example, the user terminal 100 includes a feature quantity analysis unit 110, a text preprocessing unit 120, a DB request unit 130, a waveform generation unit 140, an input unit 150, an output unit 160, a network interface 170, and a control unit 180. Speech data for learning and text are input from the input unit 150.

The feature quantity analysis unit 110 analyzes and extracts feature quantities (speech features) from the input speech data for learning. The feature quantities are, for example, the spectrum, the fundamental frequency (F0), and phoneme durations.
The text preprocessing unit 120 preprocesses the input text according to its type: it converts character codes and removes tags, headers, and other elements that are not to be synthesized from e-mail or HTML text.
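The kind of preprocessing attributed to the text preprocessing unit 120 can be sketched with the Python standard library. This is a minimal illustration, not the patent's implementation: the function name `preprocess`, the `kind` labels, and the whitespace normalization are assumptions introduced here.

```python
from email import message_from_string
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and drops tags, scripts, and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return " ".join(" ".join(self._chunks).split())


def preprocess(raw: bytes, kind: str, encoding: str = "utf-8") -> str:
    """Decode the input, then strip whatever should not be spoken.

    kind is a hypothetical label: "plain", "mail", or "html".
    """
    text = raw.decode(encoding, errors="replace")   # character code conversion
    if kind == "mail":
        payload = message_from_string(text).get_payload()   # drops the headers
        text = payload if isinstance(payload, str) else ""  # multipart not handled here
    elif kind == "html":
        parser = _TextOnly()
        parser.feed(text)
        parser.close()
        return parser.text()
    return " ".join(text.split())
```

Only the cleaned text would then be sent to the server, so the payload excludes markup and headers that would never be synthesized.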

The waveform generation unit 140 generates synthesized speech from the intermediate information received from the server 200. The intermediate information is, for example, the fundamental frequency (F0), phoneme durations, and model indices. If a sound source DB (database) is needed to generate synthesized speech from the intermediate information, it is used. Whether the sound source DB is needed depends on the content of the intermediate information, as described in detail later.

If a sound source DB is needed when the waveform generation unit 140 generates synthesized speech, the DB request unit 130 requests the server 200 to transmit the sound source DB. In FIG. 2, the sound source DB 190 that has been requested, received, and stored in this way is drawn with a broken line.
The synthesized speech generated by the waveform generation unit 140 is output from the output unit 160. The control unit 180 controls the overall operation of the user terminal 100, and the network interface 170 handles the connection to the network 10 and enables communication with the server 200.

Next, the configuration of the server 200 is described with reference to FIG. 3.
The server 200 includes a DB generation unit 210, an intermediate information generation unit 220, a network interface 230, and a control unit 240. The network interface 230 communicates with the user terminal 100. The control unit 240 controls the overall operation of the server 200.

The DB generation unit 210 uses the feature quantities transmitted from the user terminal 100 to generate the sound source DB needed for speech synthesis. Because a sound source DB basically has to carry the characteristics of a speaker, a different DB is generated for each speaker. In FIG. 3, the sound source DBs generated for the individual speakers are denoted 250_1 to 250_N.

The intermediate information generation unit 220 generates the intermediate information from the text transmitted from the user terminal 100.
FIG. 4 shows the processing procedure of the speech learning/synthesis system consisting of the user terminal 100 and the server 200 configured as described above. The procedure and the details of each process are described below.
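As an orientation aid, the division of work between the terminal and the server, and the three items that cross the network (feature quantities, text, intermediate information), can be sketched as follows. The class and method names are hypothetical, and every processing body is a placeholder standing in for the real analysis, training, text analysis, and waveform generation; only the step labels S12 to S27 are taken from the description.

```python
class Server:
    """Stand-in for server 200; the method bodies are placeholders."""

    def __init__(self):
        self.sound_source_dbs = {}  # speaker_id -> sound source DB

    def learn(self, speaker_id, features):
        # S14: build a per-speaker sound source DB (placeholder: store the features).
        self.sound_source_dbs[speaker_id] = list(features)

    def make_intermediate(self, text):
        # S24-S25: text analysis and intermediate information generation
        # (placeholder: one integer code per character).
        return [ord(ch) for ch in text]


class UserTerminal:
    """Stand-in for user terminal 100; records what crosses the network."""

    def __init__(self, server):
        self.server = server
        self.network_log = []

    def learn(self, speaker_id, samples):
        features = [abs(s) for s in samples]                          # S12 (placeholder)
        self.network_log.append(("terminal->server", "features"))     # S13
        self.server.learn(speaker_id, features)

    def speak(self, text):
        cleaned = " ".join(text.split())                              # S22 (placeholder)
        self.network_log.append(("terminal->server", "text"))         # S23
        intermediate = self.server.make_intermediate(cleaned)
        self.network_log.append(("server->terminal", "intermediate")) # S26
        return [code / 1000.0 for code in intermediate]               # S27 (placeholder)
```

Neither raw speech for learning nor synthesized speech appears in `network_log`, which is the point of the split.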

<Learning>
Speech data for learning is input to the user terminal 100 (step S11). The user terminal 100 analyzes and extracts feature quantities from the input speech data (step S12). Examples of the feature quantities are the spectrum, the fundamental frequency (F0), and phoneme durations.
There are various spectrum analysis methods. Classical examples are frequency analysis by FFT and spectrum estimation by LPC analysis. Other proposed methods include an estimation method based on a sinusoidal superposition model (Kameoka et al., "Derivation of a parameter optimization algorithm for the sinusoidal superposition model", IEICE Technical Report, Vol. 106, EA2000-97, pp. 49-54, 2006) and the STRAIGHT analysis method (H. Kawahara et al., "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds", Speech Communication, Vol. 27, No. 3-4, pp. 187-207, 1999).
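The classical frequency analysis mentioned above can be illustrated with a direct discrete Fourier transform of a single windowed frame. This is a teaching sketch only: it uses an O(N^2) DFT from the standard library rather than an FFT routine, and the function names and the Hann window are choices made here, not details from the patent.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Hann-windowed DFT magnitudes for bins 0..N/2 (direct O(N^2) evaluation)."""
    n = len(frame)
    windowed = [x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        spectrum.append(abs(acc))
    return spectrum


def peak_frequency(frame, sample_rate_hz):
    """Frequency (Hz) of the strongest non-DC bin in the frame."""
    spectrum = magnitude_spectrum(frame)
    k = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    return k * sample_rate_hz / len(frame)
```

In a real terminal an FFT would replace the inner loop; the per-frame magnitude vector is the kind of spectral feature quantity that would be sent to the server instead of the waveform.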

Various F0 analysis methods have also been proposed. For example, the STRAIGHT analysis method mentioned above includes an F0 estimation method, which is one of the representative methods of F0 estimation.
A representative method for estimating phoneme durations is HMM-based phoneme segmentation (Ljolje A. and Riley M.D., "Automatic Segmentation and Labeling of Speech", Proc. of ICASSP'91, pp. 473-476, 1991). Which feature quantities are analyzed, however, is determined by the synthesis method: if the synthesis does not require F0 or phoneme duration data, F0 and phoneme durations need not be analyzed.
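To give a feel for F0 analysis on the terminal, the sketch below estimates F0 for one frame from the peak of the normalized autocorrelation. This is a textbook technique swapped in for illustration; it is not the STRAIGHT F0 extractor cited above, and the search range of 60 to 400 Hz is an assumption.

```python
def estimate_f0(frame, sample_rate_hz, f0_min=60.0, f0_max=400.0):
    """Return the F0 (Hz) whose lag maximizes the autocorrelation.

    Returns None for a silent frame or when no lag in the search range
    has a positive correlation.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return None
    lag_min = max(1, int(sample_rate_hz / f0_max))
    lag_max = min(n - 1, int(sample_rate_hz / f0_min))
    best_lag, best_corr = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return None if best_lag is None else sample_rate_hz / best_lag
```

A practical estimator would add a voicing threshold and guard against octave errors; those refinements are omitted here to keep the idea visible.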

The analyzed and extracted feature quantities are transmitted to the server 200 (step S13). The server 200 generates a sound source DB using the received feature quantities (step S14).
There are various methods for generating the sound source DB. In a unit-based approach, the DB typically takes the form of a speech unit database, as in the closed-loop method (Kagoshima et al., "Analytical generation of optimal speech units based on closed-loop training", IEICE Transactions (in Japanese), J83-D-II, No. 6, pp. 1405-1411, 2000). In an approach based on a statistical model such as an HMM, the HMM-based speaker model creation method (see Non-Patent Document 1 above) is representative.
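The bookkeeping side of step S14, one sound source DB per speaker built from received feature quantities, can be sketched as below. The "training" here is plain per-phoneme averaging, a drastically simplified stand-in chosen for brevity; it is neither the closed-loop unit generation nor the HMM training cited above. The class name, the input format of labeled frames, and the DB layout are all assumptions.

```python
from collections import defaultdict


class DBGenerator:
    """Hypothetical stand-in for the DB generation unit 210."""

    def __init__(self):
        # speaker_id -> {phoneme label: mean feature vector}
        self.sound_source_dbs = {}

    def generate(self, speaker_id, labeled_frames):
        """Build and store one speaker's DB.

        labeled_frames: iterable of (phoneme_label, feature_vector) pairs,
        an assumed input format for this sketch.
        """
        sums = {}
        counts = defaultdict(int)
        for phoneme, vec in labeled_frames:
            if phoneme not in sums:
                sums[phoneme] = [0.0] * len(vec)
            sums[phoneme] = [s + v for s, v in zip(sums[phoneme], vec)]
            counts[phoneme] += 1
        db = {p: [s / counts[p] for s in total] for p, total in sums.items()}
        self.sound_source_dbs[speaker_id] = db
        return db
```

Keying the store by speaker mirrors the per-speaker DBs 250_1 to 250_N in FIG. 3.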

<Synthesis>
Text is input to the user terminal 100 (step S21). The user terminal 100 preprocesses the input text (step S22) and transmits the preprocessed text to the server 200 (step S23). The server 200 analyzes the received text (step S24) and determines the reading information and prosodic information such as accents.

Text analysis consists mainly of morphological analysis and the assignment of readings and accents. Various methods for these processes already exist, and the processing can be performed, for example, by the methods described in References 1 and 2 below.
Reference 1: Japanese Patent No. 3379643, "Morphological analysis method and recording medium storing a morphological analysis program"
Reference 2: Japanese Patent No. 3518340, "Reading and prosody information setting method and apparatus, and storage medium storing a reading and prosody information setting program"

The text analysis is performed by the intermediate information generation unit 220, which then generates the intermediate information (step S25). The generated intermediate information is transmitted from the server 200 to the user terminal 100 (step S26).
Intermediate information generation is broadly divided into a prosodic parameter generation step and a synthesis parameter generation step.

a) Prosodic parameter generation step
Various prosodic parameters are obtained from the morphological information, readings, and prosodic information. The prosodic parameters include F0, phoneme durations, and power, and methods for obtaining them already exist. For example, the pitch (fundamental frequency) can be obtained from the F0 data contained in the sound source DB by the method described in Reference 3 below, and phoneme durations can be obtained from the duration data contained in the sound source DB by, for example, the method described in Reference 4 below.
Reference 3: Japanese Patent No. 3240691, "Speech recognition method"
Reference 4: M. D. Riley, "Tree-based modeling for speech synthesis", in G. Bailly, C. Benoit, and T. R. Sawallis, editors, Talking Machines: Theories, Models, and Designs, pp. 265-273, Elsevier, 1992
Note that the sound source DB is not needed when F0 and phoneme durations are determined entirely by rule, as in a classical point pitch model or a duration model that treats all morae (beats) as equal.

b) Synthesis parameter generation step
The information needed for synthesis is generated using the prosodic parameters described above. Exactly what information is generated depends on the synthesis method.
b-1) Unit concatenation type
A unit sequence, that is, the optimal combination of speech units matching the reading information and prosodic parameters obtained as described above, is determined on the basis of the sound source DB. The sequence of synthesis units can be determined, for example, by the method described in Reference 5 below.
Reference 5: Japanese Patent No. 3515406, "Speech synthesis method and apparatus"

The subsequent processing (intermediate information generation and transmission) can take two forms. In the first form, only unit index information, which indicates the positions of the unit data in the sound source DB, is transmitted as the intermediate information (as the synthesis parameters). In the second form, the unit data are read from the sound source DB one after another on the basis of the index information and concatenated to generate a sequence of spectral feature quantities, and that sequence is transmitted as the intermediate information.

b-2) Statistical model type
For example, following the method of the paper by Yoshimura et al. cited above (Non-Patent Document 1), the optimal context-dependent HMMs are selected from the reading information and prosodic parameters using decision trees.

As with the unit concatenation type, the subsequent processing (intermediate information generation and transmission) can take two forms. In the first form, only model index information, which indicates which models in the sound source DB are to be used, is transmitted as the intermediate information (as the synthesis parameters). In the second form, the model data are read from the sound source DB one after another on the basis of the index information, a sequence of spectral feature quantities is generated from the models as in the paper by Yoshimura et al., and that sequence is transmitted as the intermediate information.

On receiving the intermediate information, the user terminal 100 generates synthesized speech from it (step S27). In both cases b-1) and b-2), if a sequence of spectral feature quantities is received as the intermediate information, the sound source DB is not needed and the speech waveform is simply generated from that sequence. The waveform generation method depends on how the spectral feature quantities were analyzed. For example, feature quantities analyzed by the STRAIGHT analysis method mentioned above can be synthesized by the STRAIGHT synthesis method.

In case b-1), on the other hand, if unit index information is received as the intermediate information, the sound source DB is needed on the user terminal 100 and must be received from the server 200 before speech synthesis. The user terminal 100 requests the server 200 to transmit the sound source DB (step S31), and the server 200 transmits the sound source DB to the user terminal 100 in response to the request (step S32).

The waveform generation unit 140 of the user terminal 100 reads the unit data from the sound source DB one after another on the basis of the unit index information, concatenates them to generate the spectral feature quantities, and then generates the speech waveform from the spectral feature quantities as described above.

In case b-2), if model index information is received as the intermediate information, the sound source DB is likewise needed on the user terminal 100. The user terminal 100 requests the server 200 to transmit the sound source DB (step S31), and the server 200 transmits the sound source DB to the user terminal 100 in response to the request (step S32).
The waveform generation unit 140 of the user terminal 100 reads the model data from the sound source DB one after another on the basis of the index information, generates a sequence of spectral feature quantities from the models as in the paper by Yoshimura et al., and then generates the speech waveform from the spectral feature quantities as described above.
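The terminal-side decision described above, whether the received intermediate information can be used directly or first needs the sound source DB, can be sketched as follows. The type names `IndexInfo` and `SpectralInfo`, the function `spectral_frames`, and the DB layout (index mapped to a list of frames) are hypothetical. Simple concatenation stands in for both unit concatenation and model-based parameter generation, which in practice differ considerably.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple, Union

Frame = List[float]
SoundSourceDB = Dict[int, List[Frame]]


@dataclass
class IndexInfo:
    """First form: unit or model indices into the sound source DB."""
    indices: List[int]


@dataclass
class SpectralInfo:
    """Second form: a ready-made sequence of spectral feature frames."""
    frames: List[Frame]


def spectral_frames(info: Union[IndexInfo, SpectralInfo],
                    local_db: Optional[SoundSourceDB],
                    request_db: Callable[[], SoundSourceDB],
                    ) -> Tuple[List[Frame], Optional[SoundSourceDB]]:
    """Return the spectral frame sequence plus the (possibly newly fetched) DB."""
    if isinstance(info, SpectralInfo):
        return info.frames, local_db          # no sound source DB needed
    if local_db is None:
        local_db = request_db()               # corresponds to steps S31/S32
    frames: List[Frame] = []
    for idx in info.indices:
        frames.extend(local_db[idx])          # read entries in order and join them
    return frames, local_db
```

Returning the DB lets the caller keep it for later utterances, so the request is made at most once per speaker in this sketch.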

As described above, in this example, among the processes involved in speech learning and synthesis, those that are close to the user interface and relatively light, such as the processing immediately after speech input and immediately before speech output, are executed on the user terminal 100. In other words, feature quantity analysis and speech synthesis, which have much in common computationally, are executed on the user terminal 100.

The user terminal 100 is not limited to a mobile terminal such as a telephone and may be, for example, a PC (personal computer). A PC may be shared by a plurality of users (speakers). In that case the speech synthesis method may differ from speaker to speaker, that is, the content of the intermediate information generated by the server 200 may differ for each speaker. Because whether the user terminal 100 needs a sound source DB depends on the speech synthesis method, whether the user terminal 100 holds a sound source DB is decided per speaker, so a sound source DB may be present for some speakers and absent for others.

Claims

A speech learning/synthesis system that learns input speech data and, based on the learning, generates synthesized speech for input text, the system comprising:
a user terminal to which the speech data and the text are input, and a server connected to the user terminal via a network, wherein
the user terminal includes a feature quantity analysis unit that analyzes and extracts feature quantities from the speech data, and a waveform generation unit that generates the synthesized speech from intermediate information,
the server includes a DB generation unit that generates a sound source DB using the feature quantities, and an intermediate information generation unit that generates the intermediate information from the text,
the feature quantities and the text are transmitted from the user terminal to the server, and
the intermediate information is transmitted from the server to the user terminal.

The speech learning/synthesis system according to claim 1, wherein
the user terminal includes a DB request unit that requests the server to transmit the sound source DB, and
the server transmits the sound source DB to the user terminal in response to the request.

A speech learning/synthesis method that learns input speech data and, based on the learning, generates synthesized speech for input text, using a user terminal and a server connected via a network, wherein
the learning comprises:
a step in which the user terminal analyzes and extracts feature quantities from the input speech data;
a step in which the user terminal transmits the feature quantities to the server; and
a step in which the server generates a sound source DB using the received feature quantities, and
the synthesis comprises:
a step in which the user terminal transmits the input text to the server;
a step in which the server generates intermediate information from the received text;
a step in which the server transmits the intermediate information to the user terminal; and
a step in which the user terminal generates synthesized speech from the received intermediate information.

The speech learning/synthesis method according to claim 3, further comprising:
a step in which the user terminal requests the server to transmit the sound source DB; and
a step in which the server transmits the sound source DB to the user terminal in response to the request.