JP2010091829A

JP2010091829A - Voice synthesizer, voice synthesis method and voice synthesis program

Info

Publication number: JP2010091829A
Application number: JP2008262330A
Authority: JP
Inventors: Fumihiko Aoyama; 文彦青山
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2008-10-09
Filing date: 2008-10-09
Publication date: 2010-04-22
Anticipated expiration: 2028-10-09
Also published as: JP5765874B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice synthesizer that performs voice synthesis output of text information in a short time and can be configured comparatively inexpensively, and to provide a voice synthesis method and a voice synthesis program.

SOLUTION: This voice synthesis method includes a step (S101) of receiving text information; a step (S106) of reading phoneme data corresponding to the text information with reference to a hard disk device storing phoneme data and its cache memory, and creating voice data based on the read phoneme data; a step (S108) of determining whether an instruction for voice reproduction is given; and a step (S109) of outputting the created voice data to a voice output means when it is determined that an instruction for voice reproduction is given, and deleting the voice data (S110) and outputting empty voice data to the voice output means when it is determined that no instruction for voice reproduction is given.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a speech synthesizer that performs speech synthesis of text information, and more particularly to a speech synthesizer used in an in-vehicle navigation device.

Some in-vehicle electronic systems, in addition to outputting audio of content acquired from radio broadcasts, television broadcasts, and other media, can synthesize speech from text information acquired from an external source such as a radio broadcast and output it as audio. In a navigation device in particular, outputting real-time road traffic information as speech helps the driver avoid looking away from the road and improves convenience. In Europe and elsewhere, services such as the Traffic Message Channel (hereinafter, TMC), which provides road traffic information as text information on a subcarrier of FM radio broadcasts, are in practical use. In-vehicle electronic systems use TMC and output the received road traffic information as synthesized speech.

There have been many reports on speech synthesis. For example, Patent Document 1 relates to a unit-concatenation speech synthesizer and discloses a technique that uses sub-costs including at least an access speed cost for the storage means that stores the waveform data of speech units, and selects candidates for which the cost calculated with those sub-costs satisfies a predetermined condition. Patent Document 2 relates to a speech synthesizer and discloses a technique that calculates the distortion produced when speech units are concatenated to generate synthesized speech and selects a speech unit for each synthesis unit based on that distortion.

Patent Document 1: JP 2005-266010 A
Patent Document 2: JP 2007-310176 A

FIG. 1 shows the configuration of a conventional speech synthesizer. The speech synthesizer includes a TMC receiving unit 20 that receives the TMC carried in an FM broadcast, a speech synthesis module 30, a hard disk device 40 that stores phoneme data, and a voice output unit 50. The TMC receiving unit 20 includes a text information extraction unit 22 that extracts text information contained in the FM broadcast, a latest text information storage unit 24 that stores the latest text information, and a text information transmission unit 26 that transmits the text information to the speech synthesis module 30 in response to an instruction from a play button 28. The speech synthesis module 30 has a syntax analysis unit that parses the received text information, a phoneme selection unit 34 that reads phoneme data corresponding to the parsed words or phrases from the hard disk device 40, and a phoneme combining unit 36 that combines the read phoneme data. The voice output unit 50 converts the voice data formed by combining the phoneme data into an analog signal and outputs it from a speaker.

In the speech synthesizer shown in FIG. 1, accessing the storage device 40 to read phoneme data and combining the read phoneme data both take time. As a result, even after the user issues a playback instruction, speech synthesis takes time and the voice output is delayed. The longer the text information, the greater the delay.

One conceivable way to eliminate this delay is to store all the phoneme data held on the hard disk device 40 in a work area such as the main memory of the speech synthesis module. With this approach, however, loading the phoneme data from the hard disk device 40 into main memory takes time, and the main memory capacity must be increased, which raises cost. Another conceivable approach is to use a high-performance central processing unit (CPU) in the speech synthesis module to raise processing speed, but this also raises cost because it requires an expensive CPU.

The present invention solves these conventional problems, and its object is to provide a speech synthesizer, a speech synthesis method, and a speech synthesis program that can output synthesized speech for text information in a short time and can be configured relatively inexpensively.

A speech synthesizer according to the present invention includes: acquisition means for acquiring text information from outside; first storage means for storing phoneme data necessary for speech synthesis; second storage means that can store the same phoneme data as the phoneme data read from the first storage means and has a shorter access time than the first storage means; creation means for referring to the first and second storage means, reading phoneme data corresponding to the text information from at least one of the first and second storage means, and creating voice data based on the read phoneme data; voice output means for outputting voice; input means for inputting an instruction for voice reproduction of the text information; and control means for determining whether a voice reproduction instruction has been given from the input means, causing the voice output means to output the voice data created by the creation means when a voice reproduction instruction has been given, and, when no voice reproduction instruction has been given, deleting the voice data created by the creation means and causing the voice output means to output empty voice data.

Preferably, the control means instructs the creation means to create voice data corresponding to the acquired text information in response to the acquisition of text information by the acquisition means. Preferably, when a predetermined time has elapsed since the creation means created voice data, the control means causes the acquisition means to transmit text information to the creation means and instructs the creation means to create voice data corresponding to the transmitted text information. Preferably, in response to a voice reproduction instruction from the input means, the control means causes the acquisition means to transmit text information to the creation means and instructs the creation means to create voice data for the transmitted text information.

Preferably, the first storage means is a mass storage device and the second storage means is a cache memory for the first storage means; the creation means reads phoneme data corresponding to the text information from the cache memory and reads phoneme data that did not hit in the cache memory from the mass storage device. Preferably, the acquisition means receives a broadcast wave and extracts text information contained in the broadcast wave.

A speech synthesis method or program according to the present invention includes: a step of acquiring text information; a step of referring to a memory that stores phoneme data necessary for speech synthesis and to a cache memory that can store the same phoneme data as the phoneme data read from the memory, and reading phoneme data corresponding to the acquired text information from at least one of the memory and the cache memory; a step of creating voice data based on the read phoneme data; a step of determining whether a voice reproduction instruction has been given; and a step of causing voice output means to output the created voice data when it is determined that a voice reproduction instruction has been given, and deleting the created voice data and causing the voice output means to output empty voice data when it is determined that no voice reproduction instruction has been given.

According to the present invention, voice data is created with reference to a second storage device that can store the same phoneme data as the phoneme data read from the first storage device and has a shorter access time, so the time from a voice reproduction instruction to voice output can be shortened. In addition, when no voice reproduction instruction has been given, the voice data is in effect not output as sound, so voice data can be created at a regular frequency without disturbing the user, which prepares the device for voice reproduction instructions that arrive at irregular times.

The best mode for carrying out the present invention is described in detail below with reference to the drawings.

FIG. 2 is a block diagram showing the configuration of a speech synthesizer according to an embodiment of the present invention. The speech synthesizer 100 according to this embodiment includes an input unit 110 that inputs instructions from the user, a receiving unit 120 that receives text information from outside, a speech synthesis module 130 that creates voice data for synthesized speech output of the text information, a memory 140 that stores the voice data and other data for synthesized speech output, a voice output unit 150 that outputs voice based on the voice data, an external I/F 160 that forms an interface for connecting to external devices, and an internal bus 170 that connects these components.

The input unit 110 has an input device such as a remote controller or a mouse. The user can instruct voice reproduction of text information through the input unit 110. The receiving unit 120 receives text information from radio broadcasts, television broadcasts, and other media; for example, it receives road traffic information superimposed on a subcarrier of an FM broadcast, as with TMC. When receiving road traffic information such as TMC, the receiving unit 120 receives the FM broadcast continuously and extracts the latest road traffic information.

The speech synthesis module 130 creates voice data corresponding to the text information received by the receiving unit 120 and stores the created voice data in the memory 140. When the speech synthesis module 130 issues a voice reproduction instruction, the voice output unit 150 converts the voice data stored in the memory 140 into an analog signal and outputs it from a speaker.

FIG. 3 shows the configuration of the receiving unit 120. The receiving unit 120 includes a text information extraction unit 121, a text information transfer unit 122, and a text information storage unit 123. The text information extraction unit 121 extracts text information from the FM broadcast as described above. The text information transfer unit 122 preferably transfers text information to the speech synthesis module 130 when the receiving unit 120 receives text information (condition 1), when the user gives a voice reproduction instruction (condition 2), or when a predetermined time has elapsed since voice data was created (condition 3). Whether condition 2 or 3 is met is determined by the speech synthesis module 130, which then sends a transmission request to the receiving unit 120. The text information storage unit 123 stores the latest received text information and the one immediately before it, and deletes any older text information.
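The transfer conditions and the two-entry text store described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `TextInfoStore` and `should_transfer` and the 60-second `refresh_interval` are assumptions introduced here, since the specification only says "a predetermined time".

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class TextInfoStore:
    """Hypothetical stand-in for storage unit 123: keeps the latest and previous text."""
    items: Deque[str] = field(default_factory=lambda: deque(maxlen=2))

    def put(self, text: str) -> None:
        # deque(maxlen=2) drops the oldest entry automatically when a third arrives.
        self.items.append(text)

    def latest(self) -> Optional[str]:
        return self.items[-1] if self.items else None


def should_transfer(new_text_received: bool,
                    playback_requested: bool,
                    seconds_since_last_creation: float,
                    refresh_interval: float = 60.0) -> bool:
    """Return True if any of conditions 1 to 3 holds (interval value is assumed)."""
    condition1 = new_text_received
    condition2 = playback_requested
    condition3 = seconds_since_last_creation >= refresh_interval
    return condition1 or condition2 or condition3
```

In this sketch the three conditions are simply OR-ed; in the embodiment, conditions 2 and 3 are evaluated on the speech synthesis module side, which then requests the transfer.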

FIG. 4 shows the internal configuration of the speech synthesis module. The speech synthesis module 130 has a voice data creation device 200 that creates voice data, a phoneme data storage device 210 that stores phoneme data, and a speech synthesis control device 220 that controls the voice data creation device 200, the receiving unit 120, the voice output unit 150, and other components.

In response to a command from the speech synthesis control device 220, the voice data creation device 200 reads the phoneme data needed for the text information from the phoneme data storage device 210 and combines the read phoneme data to create voice data. The phoneme data storage device 210 includes a large-capacity hard disk device 212 that stores phoneme data and a cache memory 214. When looking up phoneme data, the voice data creation device 200 first accesses the cache memory 214. If the phoneme data hits in the cache memory 214, the device reads it from there; if it does not, the device searches the hard disk device 212 for the phoneme data and reads it from there.

The cache memory 214 stores the same phoneme data as the phoneme data read from the hard disk device 212, thereby shortening the time required to look up phoneme data. Because the storage capacity of the cache memory 214 is limited, entries may be overwritten in order starting from the oldest phoneme data, or the least recently accessed phoneme data may be replaced according to an algorithm such as LRU (Least Recently Used). The cache memory 214 is composed of a memory with a shorter access time than the hard disk device 212, for example SRAM.
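The lookup order (cache first, hard disk on a miss) and the LRU replacement option can be sketched in software as below. This is only an illustration under assumptions: the class name `PhonemeCache`, the `load_from_disk` callback, and the default capacity are hypothetical, and the embodiment describes a hardware cache such as SRAM rather than a Python object.

```python
from collections import OrderedDict
from typing import Callable


class PhonemeCache:
    """Small LRU cache in front of a slow phoneme store (illustrative only)."""

    def __init__(self, load_from_disk: Callable[[str], bytes], capacity: int = 4):
        self._load = load_from_disk          # stand-in for hard disk device 212
        self._capacity = capacity
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, phoneme: str) -> bytes:
        if phoneme in self._entries:
            self.hits += 1
            self._entries.move_to_end(phoneme)   # mark as most recently used
            return self._entries[phoneme]
        self.misses += 1
        data = self._load(phoneme)               # slow path: read from disk
        self._entries[phoneme] = data            # keep a copy in the cache
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)    # evict the least recently used
        return data
```

Synthesizing the same text twice through such a cache would serve the second pass mostly from hits, which is the effect the pseudo playback described later relies on.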

FIG. 5 shows the configuration of the voice data creation device 200. The voice data creation device 200 has a text string acquisition unit 201 that acquires a text string from the receiving unit 120, a syntax analysis unit 202 that parses the acquired text string, a phoneme selection unit 203 that accesses the phoneme data storage device 210 based on the parsing result and selects phoneme data from it, a phoneme combining unit 204 that combines the selected phoneme data to create voice data, and a voice data transmission unit 205 that transmits the created voice data to a designated area of the memory 140.

The syntax analysis unit 202 parses a text string into words or phrases such as subjects, predicates, and particles. The phoneme selection unit 203 reads phoneme data corresponding to the phonemes contained in the parsed words or phrases from the phoneme data storage device 210, and the phoneme combining unit 204 combines these phoneme data. The voice data is thus a concatenation of phoneme data, and the created voice data is written into the memory 140 by the voice data transmission unit 205 in response to a command from the speech synthesis control device 220.
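The parse, select, and combine stages can be pictured with a deliberately simplified sketch. The assumptions are significant: `parse` only splits on whitespace, each letter is treated as one "phoneme", and `read_phoneme` is a hypothetical lookup callback. Real syntax analysis and phoneme selection are far more involved, and the specification does not detail them.

```python
from typing import Callable, List


def parse(text: str) -> List[str]:
    """Toy stand-in for syntax analysis unit 202: split the text into words."""
    return text.split()


def synthesize(text: str, read_phoneme: Callable[[str], bytes]) -> bytes:
    """Toy stand-in for units 203 and 204: select phoneme data, then join it."""
    pieces: List[bytes] = []
    for word in parse(text):
        for symbol in word.lower():        # toy assumption: one letter = one phoneme
            pieces.append(read_phoneme(symbol))
    return b"".join(pieces)                # voice data = concatenated phoneme data
```

Passing a cache-backed lookup as `read_phoneme` would route every selection through the cache before the disk.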

FIG. 6 is a functional block diagram of the speech synthesis control device. The speech synthesis control device 220 includes a text information transfer request unit 221 that requests transfer of text information, a voice instruction determination unit 222 that determines whether a voice reproduction instruction has been given from the input unit 110, a voice data creation request unit 223 that requests the voice data creation device 200 to create voice data, a voice data transmission request unit 224 that requests the voice data creation device 200 to transmit voice data, a voice data deletion unit 225 that deletes the voice data stored in the memory 140 when pseudo playback is performed, and a voice output request unit 226 that requests the voice output unit 150 to output the voice data. The speech synthesis control device 220 may include, for example, a microcontroller, a microcomputer, or a microprocessor that executes a program to perform these functions.

Next, the operation of the speech synthesizer according to this embodiment is described with reference to the flowchart in FIG. 7. As described above, when the receiving unit 120 receives the latest text information, it transmits that text information to the speech synthesis module 130. On receiving the text information (step S101), the speech synthesis control device 220 instructs the voice data creation device to create voice data (step S102). In addition, when the voice instruction determination unit 222 determines that the user has given a voice reproduction instruction (condition 2), or when a predetermined time has elapsed since the voice data creation request unit 223 sent a creation command (condition 3), the text information transfer request unit 221 requests the receiving unit 120 to transmit text information. The receiving unit 120 transmits the text information in response to this request, and when the speech synthesis module 130 receives it (step S101), a voice data creation command is issued (step S102).

Next, when the text string acquisition unit 201 of the voice data creation device 200 acquires the text information (step S103), the syntax analysis unit 202 parses the text information (step S104). The phoneme selection unit 203 then accesses the storage device 210 and reads from it the phoneme data corresponding to the parsed words or phrases. The phoneme combining unit 204 combines the read phoneme data (step S105) and creates voice data (step S106). When phoneme data is read from the hard disk device 212, the same phoneme data is also stored in the cache memory 214.

Next, the voice data transmission request unit 224 of the speech synthesis control device 220 requests the voice data creation device 200 to transmit the voice data. This request includes the address information of the area of the memory 140 in which the voice data is to be stored. In response to the transmission request, the voice data transmission unit 205 writes the created voice data into the memory 140 (step S107).

Next, the voice instruction determination unit 222 of the speech synthesis control device 220 determines whether the user has given a voice reproduction instruction (step S108). If a reproduction instruction has been given, the voice output request unit 226 requests voice output from the voice output unit 150 and transmits the voice data stored in the memory 140 to the voice output unit 150 (step S109). The voice output unit 150 converts the transmitted voice data into an analog signal and outputs it from the speaker.

If, on the other hand, the user has not given a voice reproduction instruction, the voice data deletion unit 225 deletes the voice data stored in the memory 140 (step S110), and the voice output request unit 226 issues a voice output request to the voice output unit 150 in the same way as above. Because the voice data has been deleted from the memory 140, the voice output unit 150 outputs empty data, so in effect no sound is produced; this is so-called pseudo playback. The speech synthesis control device 220 determines whether the voice reproduction processing has finished and, if not, continues processing from step S105 (step S111). Including this pseudo playback means that voice data corresponding to the text information is created even without a voice reproduction instruction from the user, so the cache memory 214 can hold phoneme data with a high hit rate. When the user later gives a voice reproduction instruction, the voice data creation device 200 creates the voice data, and if the text information to be reproduced is the same as the text information that was pseudo-played, referring to the cache memory greatly shortens the time needed to read the phoneme data.
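The branch between real playback and pseudo playback can be sketched as a single function. This is a simplified reading of steps S101 to S110 under assumptions: `handle_text`, `create_voice_data`, and `output` are hypothetical names, a `bytearray` stands in for the memory 140, and the loop back at step S111 is omitted.

```python
from typing import Callable


def handle_text(text: str,
                create_voice_data: Callable[[str], bytes],
                playback_requested: bool,
                output: Callable[[bytes], None]) -> None:
    """One pass over a single piece of text information (illustrative only)."""
    voice_data = create_voice_data(text)   # S102 to S106: always synthesize
    memory = bytearray(voice_data)         # S107: store in working memory
    if not playback_requested:             # S108: was playback instructed?
        memory.clear()                     # S110: delete the voice data
    output(bytes(memory))                  # S109, or empty data for pseudo playback
```

The point of the design is that `create_voice_data` runs in both branches, so the phoneme lookups warm the cache even when nothing audible is produced.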

FIG. 8 is an example comparing the processing time of the speech synthesizer according to this embodiment with that of a conventional speech synthesizer. When the input texts "TEST" and "FAST" were each output as voice, the time until voice output was shortened by 91.20% and 85.70%, respectively, compared with the conventional speech synthesizer.

FIG. 9 shows a configuration in which the speech synthesizer according to this embodiment is applied to a navigation device. The in-vehicle navigation device 300 includes a tuner 310 that receives radio and television broadcasts; by extracting the text information (road traffic information) superimposed on those broadcasts, it can have the speech synthesizer 100 output that text information as voice. Furthermore, when the navigation device 300 includes wireless communication means 320, text information received through the wireless communication means can be output by the speech synthesizer 100.

In the embodiment above, the phoneme data storage device 210 is included in the speech synthesis module 130, but this is not a limitation; for example, as shown in FIG. 10, a storage device 210A may be connected to the internal bus 170. In that case, the speech synthesis module reads phoneme data through the internal bus 170. The hard disk device included in the storage device 210A may store other data in addition to phoneme data. For example, when combined with a navigation device as shown in FIG. 9, the hard disk device may also store map data and the like. Furthermore, if access to the memory 140 through the internal bus 170 is faster than access to the storage device 210A (hard disk device), the memory 140 can be used as the cache memory.

In the embodiment above, the speech synthesis control device 220 deletes the voice data, but this is not a limitation; the voice data creation device 200 may delete the voice data when the speech synthesis control device 220 determines that no voice reproduction instruction has been given. In that case, the voice data transmission unit 205 of the voice data creation device 200 transmits empty voice data to the speech synthesis control device 220.

In the embodiment above, the voice data stored in an area of the memory 140 is transmitted to the voice output unit 150 for voice output, but if the address at which the voice data is stored is fixed, the voice data does not necessarily have to be transmitted. In that case, the voice output unit 150 can read the voice data from the predetermined address in the memory 140 and reproduce it.

The embodiment above stores the voice data in the memory 140, but the voice data does not necessarily have to be stored there. For example, the voice output unit 150 may include a ring buffer for voice output and store the voice data in the ring buffer. For pseudo playback, the voice data stored in the ring buffer is emptied. When a ring buffer is used, voice data is stored so that the oldest voice data is overwritten first, and the voice output unit 150 reproduces the voice data in order from the oldest.
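A ring buffer with the behavior described above (oldest data overwritten first, playback from the oldest, emptied for pseudo playback) might look like the following sketch. The class `RingBuffer` and its byte-level interface are assumptions for illustration; the specification does not define the buffer's size or API.

```python
class RingBuffer:
    """Fixed-size byte ring buffer; the oldest data is overwritten when full."""

    def __init__(self, capacity: int):
        self._buf = bytearray(capacity)
        self._capacity = capacity
        self._start = 0   # index of the oldest byte
        self._size = 0    # number of valid bytes

    def write(self, data: bytes) -> None:
        for byte in data:
            end = (self._start + self._size) % self._capacity
            self._buf[end] = byte
            if self._size < self._capacity:
                self._size += 1
            else:
                # Buffer was full: the oldest byte was just overwritten.
                self._start = (self._start + 1) % self._capacity

    def read(self, n: int) -> bytes:
        """Remove and return up to n bytes, oldest first."""
        n = min(n, self._size)
        out = bytes(self._buf[(self._start + i) % self._capacity] for i in range(n))
        self._start = (self._start + n) % self._capacity
        self._size -= n
        return out

    def clear(self) -> None:
        """Empty the buffer, as pseudo playback would."""
        self._start = 0
        self._size = 0
```

Calling `clear()` before the output stage reads the buffer corresponds to emptying the voice data for pseudo playback.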

The embodiment above uses road traffic information received at irregular times as an example of text information, but the text information is not limited to road traffic information and may be other information. For example, the text information may be e-mail or another text document.

Preferred embodiments of the present invention have been described in detail above. However, the present invention is not limited to any specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

A block diagram showing the configuration of a conventional speech synthesizer.
A block diagram showing the configuration of a speech synthesizer according to an embodiment of the present invention.
A block diagram showing the internal configuration of the receiving unit shown in FIG. 2.
A block diagram showing the configuration of the speech synthesis module shown in FIG. 2.
A block diagram showing the configuration of the voice data creation device shown in FIG. 4.
A functional block diagram of the speech synthesis control device shown in FIG. 4.
An operation flowchart of the speech synthesizer according to the present embodiment.
A diagram showing an example comparing the time required for voice reproduction by the speech synthesizer according to the present embodiment with that of the conventional device.
A diagram showing the speech synthesizer of the present embodiment applied to a navigation device.
A diagram showing another example of the speech synthesizer of the present embodiment.

Explanation of symbols

100: Speech synthesizer
110: Input unit
120: Receiving unit
130: Speech synthesis module
140: Memory
150: Voice output unit
160: External IF
170: Internal bus

Claims

A speech synthesizer comprising:
acquisition means for acquiring text information from the outside;
first storage means for storing phoneme data necessary for speech synthesis;
second storage means capable of storing the same phoneme data as the phoneme data read from the first storage means and having a shorter access time than the first storage means;
creation means for referring to the first and second storage means, reading phoneme data corresponding to the text information from at least one of the first and second storage means, and creating voice data based on the read phoneme data;
voice output means for outputting voice;
input means for inputting an instruction to reproduce the text information as voice; and
control means for determining whether there is a voice reproduction instruction from the input means, outputting the voice data created by the creation means to the voice output means when there is a voice reproduction instruction, and deleting the voice data created by the creation means and outputting empty voice data to the voice output means when there is no voice reproduction instruction.

The speech synthesizer according to claim 1, wherein the control means instructs the creation means to create voice data corresponding to the acquired text information in response to the acquisition of text information by the acquisition means.

The speech synthesizer according to claim 1, wherein, when a predetermined time has elapsed since the creation means created voice data, the control means causes the acquisition means to transmit text information to the creation means and instructs the creation means to create voice data corresponding to the transmitted text information.

The speech synthesizer according to claim 1, wherein, in response to a voice reproduction instruction from the input means, the control means causes the acquisition means to transmit text information to the creation means and instructs the creation means to create voice data corresponding to the transmitted text information.

The speech synthesizer according to claim 1, wherein the first storage means is a mass storage device, the second storage means is a cache memory of the first storage means, and the creation means reads phoneme data corresponding to the text information from the cache memory and reads phoneme data that does not hit in the cache memory from the mass storage device.

The speech synthesizer according to claim 1, wherein the acquisition means receives a broadcast wave and extracts text information included in the broadcast wave.

The speech synthesizer according to claim 6, wherein the speech synthesizer is mounted on a vehicle, and the text information is road traffic information.

A speech synthesis method for synthesizing speech based on voice data, the method comprising:
a step of acquiring text information;
a step of referring to a memory that stores phoneme data necessary for speech synthesis and to a cache memory capable of storing the same phoneme data as the phoneme data read from the memory, and reading phoneme data corresponding to the acquired text information from at least one of the memory and the cache memory;
a step of creating voice data based on the read phoneme data;
a step of determining whether there is a voice reproduction instruction; and
a step of outputting the created voice data to voice output means when it is determined that there is a voice reproduction instruction, and deleting the created voice data and outputting empty voice data to the voice output means when it is determined that there is no voice reproduction instruction.

A speech synthesis program for synthesizing speech based on voice data, the program comprising:
a step of acquiring text information;
a step of referring to a memory that stores phoneme data necessary for speech synthesis and to a cache memory capable of storing the same phoneme data as the phoneme data read from the memory, and reading phoneme data corresponding to the acquired text information from at least one of the memory and the cache memory;
a step of creating voice data based on the read phoneme data;
a step of determining whether there is a voice reproduction instruction; and
a step of outputting the created voice data to voice output means when it is determined that there is a voice reproduction instruction, and deleting the created voice data and outputting empty voice data to the voice output means when it is determined that there is no voice reproduction instruction.