JP2006349787A

JP2006349787A - Method and device for synthesizing voices

Info

Publication number: JP2006349787A
Application number: JP2005173241A
Authority: JP
Inventors: Kenichi Nakamura (中村 兼一); Kiyoshi Owada (大和田 潔)
Original assignee: Hitachi Information and Control Systems Inc; Hitachi Information and Control Solutions Ltd
Current assignee: Hitachi Information and Control Systems Inc; Hitachi Information and Control Solutions Ltd
Priority date: 2005-06-14
Filing date: 2005-06-14
Publication date: 2006-12-28

Abstract

PROBLEM TO BE SOLVED: To make it easier, when editing synthesized sound data for speech synthesis, to picture the synthesized sound that will actually be output.

SOLUTION: A speech synthesis method that generates synthesized sound from an input text sentence and includes a process of editing the phonetic symbols and reading format set for the text sentence. The method includes a sentence delimiting process, performed by a sentence delimiter processing unit 11, that divides the text sentence according to a predetermined sentence delimiting criterion into a set of delimited sentences of finite length (delimited sentence data 1551). The editing process is carried out in units of the delimited sentences obtained by the sentence delimiting process.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speech synthesis technique for generating synthesized sound from a given text sentence.

In speech synthesis processing, a given text sentence written in mixed kanji and kana is first parsed, a kana character string containing phonetic symbols and prosodic symbols is then generated, a desired reading format is set, and synthesized sound is generated from this synthesized sound data. More specifically, standard (default) settings are provided for pronunciation, prosody, and reading format. The kana character string is first generated and the reading format set using these defaults. The result is then displayed on an editing screen, and the default synthesized sound data is edited by manual operation.

Editing work in such speech synthesis processing is done while imagining the synthesized sound that will actually be output. How easily the actual output sound can be pictured from the editing screen therefore greatly affects work efficiency. If the output is hard to picture, the user must repeatedly listen to the actual synthesized sound and then re-edit, which lowers efficiency.

For this reason, various approaches to editing synthesized sound data have been devised. For example, the speech synthesizer disclosed in Patent Document 1 displays a character string that expresses the prosody of a sentence through the display state of each character and its position relative to the other characters. Specifically, voice quality is represented by display color, pitch by the vertical position of each character, speaking rate by character spacing, loudness by character brightness, and accent strength by an accent mark placed above the character. A character string expressing prosody in this way is displayed on the editing screen so that the output of the synthesized sound can be grasped visually.

Patent Document 1: JP-A-8-77152

The technique disclosed in Patent Document 1, which lets the output of the synthesized sound be grasped visually, can be expected to be reasonably effective. However, its display of prosody and the like is complex. Consequently, once the text sentence to be synthesized exceeds a certain length, the complexity of the display is likely to make the output harder, not easier, to picture. In other words, how easily the output synthesized sound can be pictured during editing depends heavily on the length of the text being processed, and unless this is taken into account there is a limit to how much easier that task can be made.

The above concerns picturing the output sound when editing synthesized sound data. Speech synthesis processing also has a problem in editing the reading speed and pause lengths with which the synthesized sound reads out the text sentence. Reading speed and the like are among the items edited in the synthesized sound data, but when the reading time for a given text sentence is limited, it is not always easy to set the optimum reading speed and so on in relation to that time limit.

Furthermore, depending on the intended use of the resulting synthesized sound, speech synthesis processing may be required to produce a more effective synthesized sound, and being able to meet such demands is another issue.

The present invention was made on the basis of these findings. Its first object is to make it easier to picture the actual output synthesized sound when editing synthesized sound data, and thereby to raise the efficiency of speech synthesis processing. Its second object is to make it easier to edit synthesized sound data when the reading of a given text sentence is subject to a time limit or similar constraint. Its third object is to enable generation of synthesized sound that is more effective for its intended use.

To achieve the first object, the present invention provides a speech synthesis method that generates synthesized sound from an input text sentence and includes a process of editing the phonetic symbols and reading format set for the text sentence. The method is characterized in that it includes a sentence delimiting process that divides the text sentence according to a predetermined sentence delimiting criterion into a set of delimited sentences of finite length, and in that the editing process is performed in units of the delimited sentences obtained by the sentence delimiting process.

In the present invention, the above speech synthesis method uses terminal characters such as the full stop, question mark, and exclamation mark as the sentence delimiting criterion.

In the present invention, the above speech synthesis method also includes a process of displaying a reading format information editing screen and editing the reading format information, and a process of displaying a pronunciation information editing screen and editing the pronunciation information.

Also to achieve the first object, the present invention provides a speech synthesizer that generates synthesized sound from an input text sentence, characterized by comprising a sentence delimiter processing unit that divides the text sentence according to a predetermined sentence delimiting criterion into a set of delimited sentences of finite length.

In the present invention, the above speech synthesizer uses terminal characters such as the full stop, question mark, and exclamation mark as the sentence delimiting criterion.

In the present invention, the above speech synthesizer further comprises a reading information processing unit for editing, in units of delimited sentences, the reading format information of the intermediate synthesized sound data generated by setting pronunciation information and reading format information for the text sentence.

In the present invention, the above speech synthesizer further comprises a pronunciation information processing unit for editing the pronunciation information of the intermediate synthesized sound data in units of delimited sentences.

To achieve the second object, the present invention provides a speech synthesizer that generates synthesized sound from an input text sentence, comprising a synthesized sound data display/editing unit for editing the intermediate synthesized sound data generated by setting pronunciation information and reading format information for the text sentence. The synthesized sound data display/editing unit is characterized by including a time calculation unit that obtains times relating to playback of the generated synthesized sound and displays them on the editing screen.

In the present invention, the above speech synthesizer is provided with a sentence delimiter processing unit that divides the text sentence according to a predetermined sentence delimiting criterion into a set of delimited sentences of finite length. The time calculation unit obtains, for each delimited sentence, the delimited sentence playback time required to play back its synthesized sound, and the delimited sentence playback start time, which is the time from the start of synthesized sound playback for the text sentence until playback of that delimited sentence begins.

To achieve the third object, the present invention provides a speech synthesizer that generates synthesized sound from an input text sentence, characterized by comprising a sound effect/background sound storage unit that stores sound effects and background sounds to be overlaid on the synthesized sound, and a reading information processing unit that performs editing to overlay the sound effects and background sounds on the synthesized sound.

In the present invention, a sentence delimiting process divides the text sentence according to a predetermined sentence delimiting criterion into a set of delimited sentences of finite length, and phonetic symbols and reading format can be edited in units of delimited sentences. As a result, even when the text sentence is long, the actual output synthesized sound is easier to picture during editing, and the efficiency of speech synthesis processing can be raised.

The present invention also obtains times relating to playback of the generated synthesized sound and can display them on the editing screen. This makes it easier to edit synthesized sound data when the reading of a given text sentence is subject to a time limit or similar constraint.

The present invention also provides a sound effect/background sound storage unit that stores sound effects and background sounds to be overlaid on the synthesized sound, and a reading information processing unit that performs editing to overlay them on the synthesized sound. Sound effects and background sounds can thus be overlaid on the synthesized sound, making it possible to generate synthesized sound that is more effective for its intended use.

Embodiments of the present invention are described below. To carry out the invention, a speech synthesizer is configured by installing a computer program implementing the speech synthesis method (the software element) on a data processing device such as a personal computer (the hardware element). FIG. 1 shows the system configuration of a speech synthesizer according to one embodiment. The speech synthesizer 1 of this embodiment comprises a sentence delimiter processing unit 11, a text analysis unit 12, a word dictionary 13, a display/operation unit 14, a synthesized sound data display/editing unit 15, a phoneme segment storage unit 16, an acoustic processing unit 17, an audio file creation unit 18, a sound effect/background sound storage unit 19, an audio file playback unit 20, and a speaker 21.

The sentence delimiter processing unit 11 divides the input text sentence T according to a predetermined sentence delimiting criterion into a set of delimited sentences of finite length. Terminal characters such as the full stop, comma, question mark, and exclamation mark are normally used as the sentence delimiting criterion. This embodiment uses the full stop, question mark, and exclamation mark. The text sentence T, divided into multiple delimited sentences by the sentence delimiter processing unit 11, is passed to the text analysis unit 12. It is also stored in the text storage unit 155 (described later) of the synthesized sound data display/editing unit 15 as a set of delimited sentence data 1551, one item per delimited sentence.
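The sentence delimiting step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_sentences` and the exact set of terminator characters (full-width and ASCII question and exclamation marks) are assumptions made for the example.

```python
import re

# Terminal characters used as the delimiting criterion in this embodiment:
# full stop, question mark, exclamation mark. Including the ASCII variants
# of "?" and "!" is an assumption made for this sketch.
TERMINATORS = "。？！?!"
_CLASS = re.escape(TERMINATORS)

# A delimited sentence is a run of non-terminators followed by one or more
# terminators, or a trailing run of text that has no terminator at all.
_SENTENCE = re.compile(rf"[^{_CLASS}]*[{_CLASS}]+|[^{_CLASS}]+$")


def split_sentences(text: str) -> list[str]:
    """Divide text into delimited sentences, keeping each terminator."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("今日は晴れです。明日は雨ですか？はい！"):
        print(sentence)
```

Consecutive terminators such as "！？" stay attached to the same delimited sentence, and text after the last terminator is kept as a final delimited sentence instead of being dropped.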

The text analysis unit 12 takes the text sentence T, divided into delimited sentences, as input and analyzes it using the word dictionary 13. The analysis consists of parsing the text sentence T and, based on the parse, applying the standard settings for pronunciation information. In the standard setting of pronunciation information, pronunciation-related settings (assigning readings, accents, and intonation, setting pauses, and so on) are made on the basis of the parse in order to generate default synthesized sound data, producing a kana character string with phonetic and prosodic symbols. These are well-known processes in the field of speech synthesis. The result of applying the standard phonetic symbols is stored in the text storage unit 155 as delimited sentence data 1551. Default synthesized sound data is generated by setting a standard reading format in addition to the standard pronunciation information set by the text analysis unit 12. However, producing the final synthesized sound data requires an acoustic conversion process that assigns phoneme data (phoneme segments) based on the pronunciation information and reading format information. Data in which only the pronunciation information and reading format information have been set is therefore intermediate, and is referred to as intermediate synthesized sound data. The standard reading format is set by the reading information processing unit 151 of the synthesized sound data display/editing unit 15, as described later.

The word dictionary 13 is a dictionary used by the text analysis unit 12 to convert the kanji in the mixed kanji-kana text sentence T into kana.

The display/operation unit 14 consists of a monitor that displays the data needed for speech synthesis processing and input/output devices, such as a keyboard and mouse, used to input and output that data.

The synthesized sound data display/editing unit 15 serves mainly for editing the intermediate synthesized sound data, that is, for editing in which the user manually changes the pronunciation information and reading format information set in the intermediate synthesized sound data. In this editing, the synthesized sound data already generated by the standard settings or otherwise is displayed on the monitor of the display/operation unit 14, and the user changes the reading format information and pronunciation information in that state. The synthesized sound data display/editing unit 15 includes a reading information processing unit 151, a pronunciation information processing unit 152, a time calculation unit 153, a display processing unit 154, and a text storage unit 155.

The reading information processing unit 151 has three functions. The first is an editing function that lets the user manually change the reading format information settings. A specific example of this function is described later.

The second function of the reading information processing unit 151 is editing that overlays sound effects such as a bell, or background sounds such as music, on the synthesized sound. Overlaying sound effects or background sounds in this way makes the synthesized sound more effective for its intended use. A specific example of this editing is described later.
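The source does not say how the overlay is computed. One common way to overlay one sound on another is additive mixing of PCM samples with clipping, sketched below. The function name `overlay`, the `start` and `gain` parameters, and the use of 16-bit signed samples held in plain lists are all assumptions for illustration.

```python
def overlay(speech, effect, start=0, gain=0.5):
    """Mix `effect` onto `speech` beginning at sample index `start`.

    Both inputs are sequences of 16-bit signed PCM samples at the same
    sampling rate (an assumption of this sketch). The effect is scaled by
    `gain`, added sample by sample, and the sum is clipped to 16 bits.
    """
    out = list(speech)
    end = start + len(effect)
    if end > len(out):
        # Pad with silence when the effect runs past the end of the speech.
        out.extend([0] * (end - len(out)))
    for i, sample in enumerate(effect):
        mixed = out[start + i] + int(sample * gain)
        out[start + i] = max(-32768, min(32767, mixed))
    return out


if __name__ == "__main__":
    print(overlay([100, 200, 300], [1000, 1000], start=2))
```

A background sound would be mixed the same way with `start=0` and a lower gain, while a sound effect such as a bell would typically be placed at the start of a particular delimited sentence.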

The third function of the reading information processing unit 151 is to generate default intermediate synthesized sound data by additionally setting a standard reading format for delimited sentence data in which standard pronunciation information has been set. This is done by setting the standard reading format for each item of delimited sentence data, and the setting information is displayed as predetermined symbols at the head of each delimited sentence. The reading format settings include "voice type: male / female", "audio output format: 8 kHz 8 bit / 11 kHz 8 bit / 16 kHz 8 bit / 22 kHz 8 bit / 8 kHz 16 bit / 11 kHz 16 bit / 16 kHz 16 bit / 22 kHz 16 bit", "reading style: normal reading / lively reading", "emotional effect: calm / joy / anger / sadness", "acoustic effect: none / echo / robot", "sentence-final pause length: 0.1 s to 2.5 s", "reading speed: 10 levels", "voice pitch: 10 levels", "volume: 10 levels", "intonation: 10 levels", "devoicing: on / off", "high-frequency emphasis: on / off", and "nasalization: on / off". Some of these items, such as devoicing and nasalization, overlap with items in the pronunciation information described later. In the reading format they apply to the delimited sentence as a whole, whereas in the pronunciation information they apply to individual characters within the delimited sentence. Devoicing means omitting the vowel sound "i" or "u" when it falls between voiceless consonants. For example, when "テキスト" (tekisuto, "text") is spoken, the vowel "u" of "su" is not clearly pronounced. Such devoicing is important for generating more natural-sounding Japanese synthesized sound.
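The per-sentence reading format settings listed above could be held in a record like the one sketched here. The value ranges (the eight output formats, the 0.1 s to 2.5 s pause length, the 10-level items) come from the text. The field names, the numbering of levels as 1 to 10, and every default value are assumptions, since the source does not state the actual standard settings.

```python
from dataclasses import dataclass

# The eight sampling-rate / bit-depth combinations listed in the text.
OUTPUT_FORMATS = {(khz, bits) for khz in (8, 11, 16, 22) for bits in (8, 16)}


@dataclass
class ReadingFormat:
    """Reading format for one delimited sentence (defaults are hypothetical)."""
    voice_type: str = "female"        # "male" or "female"
    sample_rate_khz: int = 16
    bit_depth: int = 16
    reading_style: str = "normal"     # "normal" or "lively"
    emotion: str = "calm"             # "calm", "joy", "anger", "sadness"
    acoustic_effect: str = "none"     # "none", "echo", "robot"
    end_pause_s: float = 0.5          # sentence-final pause, 0.1 to 2.5 s
    speed: int = 5                    # 10 levels, numbered 1..10 here
    pitch: int = 5
    volume: int = 5
    intonation: int = 5
    devoicing: bool = True            # applies to the whole sentence
    high_freq_emphasis: bool = False
    nasalization: bool = False

    def __post_init__(self):
        if (self.sample_rate_khz, self.bit_depth) not in OUTPUT_FORMATS:
            raise ValueError("unsupported audio output format")
        if not 0.1 <= self.end_pause_s <= 2.5:
            raise ValueError("end_pause_s must be between 0.1 and 2.5")
        for name in ("speed", "pitch", "volume", "intonation"):
            if getattr(self, name) not in range(1, 11):
                raise ValueError(f"{name} must be an integer from 1 to 10")


if __name__ == "__main__":
    print(ReadingFormat(emotion="joy", speed=7))
```

Validating in `__post_init__` means an out-of-range edit is rejected when the record is created, which mirrors an editing screen that only offers the permitted values.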

The pronunciation information processing unit 152 provides an editing function that lets the user manually change the pronunciation information settings. A specific example of this function is described later. The pronunciation information that can be edited includes "pause insertion / deletion and pause length: 0.1 s to 2.5 s", "accent boundary insertion / deletion and adjustment of accent rise", "nasalization: on / off", "devoicing: on / off", and "accent: on / off".

The time calculation unit 153 obtains times relating to playback for the case where synthesized sound is output from the synthesized sound data edited by the reading information processing unit 151 or the pronunciation information processing unit 152, and displays them on the editing screen. The times are obtained per delimited sentence: the time required to play back the synthesized sound of each delimited sentence (delimited sentence playback time), and the time from the start of synthesized sound playback for the text sentence T until playback of each delimited sentence begins (delimited sentence playback start time). Displaying this time data on the editing screen makes editing more efficient when, for example, the reading time for a given text sentence T is limited and the optimum reading speed and sentence-final pause length must be set in relation to that limit. The time calculation unit 153 calculates the time data using the audio file created by the audio file creation unit 18.
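The two times described above are related in a simple way: the playback start time of a delimited sentence is the sum of the playback times of all sentences before it. The sketch below assumes one WAVE file per delimited sentence and assumes that each file's duration already includes the sentence-final pause. The function names `wav_duration` and `playback_schedule` are hypothetical, and the source does not specify how the audio file is actually used.

```python
import wave
from itertools import accumulate


def wav_duration(source) -> float:
    """Return the duration in seconds of a WAVE file (path or file object)."""
    with wave.open(source, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def playback_schedule(durations):
    """Return (start_time, playback_time) for each delimited sentence.

    The start time of each sentence is the running total of the playback
    times of the sentences that precede it.
    """
    durations = list(durations)
    starts = [0.0] + list(accumulate(durations))[:-1]
    return list(zip(starts, durations))


if __name__ == "__main__":
    for start, length in playback_schedule([2.0, 3.5, 1.5]):
        print(f"start {start:.1f} s, length {length:.1f} s")
```

With this data an editor can see at once whether the total (the last start time plus the last playback time) fits within a given time limit, and which delimited sentence to speed up if it does not.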

The display processing unit 154 displays the intermediate synthesized sound data on the monitor of the display/operation unit 14 for editing, in a form that makes it easy for the user to picture the output of the synthesized sound.

The text storage unit 155 stores the text sentence T divided into delimited sentences, or the intermediate synthesized sound data generated by setting pronunciation information and reading format information for that text sentence T, as a set of delimited sentence data 1551, one item per delimited sentence.

The phoneme segment storage unit 16 stores the phoneme segments (phoneme data) from which synthesized sound is generated. Typically, several tens to well over a hundred phoneme segments are prepared for each kana character. The acoustic processing unit 17 extracts the appropriate phoneme segments from the phoneme segment storage unit 16 based on the pronunciation information and reading format information set in the intermediate synthesized sound data, and assigns them to the kana characters in the intermediate synthesized sound data to generate the final synthesized sound data. The audio file creation unit 18 creates an audio file, such as a WAVE file, from the synthesized sound data that the acoustic processing unit 17 generated by assigning phoneme segments. The functions of the phoneme segment storage unit 16, the acoustic processing unit 17, and the audio file creation unit 18 are all well known in the field of speech synthesis.

The sound effect/background sound storage unit 19 stores the sound effects and background sounds that are overlaid on the synthesized sound through editing by the reading information processing unit 151.

The audio file playback unit 20 plays back the audio file created by the audio file creation unit 18 as synthesized sound, and the synthesized sound is output through the speaker 21.

The intermediate synthesized sound data and its editing are described below using specific examples. FIG. 2 shows an example of intermediate synthesized sound data. In this example the data is written as an XML (Extensible Markup Language) document. An XML document consists of an XML declaration (optional), a document type declaration (optional), and the XML document body; the document type declaration is omitted in FIG. 2. The first line is the XML declaration, and the second and subsequent lines are the XML document body. Each description enclosed in <文節> ... </文節> tags from the second line onward represents one item of delimited sentence data (the delimited sentence data 1551 of FIG. 1). Each item of delimited sentence data contains reading format information, pronunciation information (a kana character string with phonetic and prosodic symbols), and time information relating to playback. When a sound effect or background sound is overlaid on the synthesized sound, an item of delimited sentence data is generated for it. In the example of FIG. 2, delimited sentence data for a sound effect has been generated. Note that "文節" (bunsetsu, literally "clause") in FIG. 2 does not have the meaning it usually has in the speech synthesis field; here it corresponds to a delimited sentence.
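Because FIG. 2 is not reproduced here, the following sketch only illustrates the general shape described above: an XML declaration followed by a body in which each <文節> element carries one item of delimited sentence data. The root element name `data`, the attribute names (`kind`, `speed`, `end_pause`, `file`, `start`, `duration`), and the sample values are hypothetical. Only the <文節> tag name is taken from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one speech item and one sound-effect item.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<data>
  <文節 kind="speech" speed="5" end_pause="0.5" start="0.0" duration="2.0">キョーワ ハレデス</文節>
  <文節 kind="effect" file="bell.wav" start="2.0" duration="1.0"/>
</data>"""


def load_segments(xml_text: str) -> list[dict]:
    """Parse delimited sentence data items out of the XML document."""
    # Encode to bytes so the parser honours the encoding declaration.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        {
            "kind": el.get("kind"),
            "text": (el.text or "").strip(),
            "start": float(el.get("start")),
            "duration": float(el.get("duration")),
        }
        for el in root.findall("文節")
    ]


if __name__ == "__main__":
    for segment in load_segments(SAMPLE):
        print(segment)
```

Keeping the sound effect as its own <文節> item, as the text describes, lets the same per-item time fields serve both speech and effects.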

FIG. 3 shows an example of the reading format information editing screen that the reading information processing unit 151 displays for editing the intermediate synthesized sound data. Drawn element 51 is a sound effect/background sound setting symbol. Sound effects and background sounds can be set per delimited sentence; when one is set for a delimited sentence, the setting symbol is displayed at the head of that sentence. The setting can be cancelled by an operation such as clicking the setting symbol with the mouse. Drawn element 52 is a group of reading format information symbols, shown per delimited sentence and placed at the head of the sentence. In the illustrated example, the items "emotional effect", "nasalization (bidakuon)", "devoicing", "volume", "reading speed", "voice pitch", "intonation", and "sentence-end pause length" are displayed as symbols. The set of displayed items can be changed: normally, items whose settings are often changed during editing are displayed, and items that are usually left at the standard setting are omitted. To change the reading format, the user clicks a displayed reading format information symbol (or performs a similar operation) to open a change operation screen and changes the setting there. Drawn element 53 shows the text content of the delimited sentence. Drawn element 54 is a sentence delimiter symbol marking the boundary between delimited sentences. Dividing the text sentence T into delimited sentences is basically done automatically by the sentence delimiter processing unit 11. However, dividing only at the criteria used by the automatic processing (full stops, question marks, exclamation marks, and the like) can sometimes produce delimited sentences that are too long. To handle such cases, the editing screen can also allow sentence delimiter symbols to be inserted manually so that the length of a delimited sentence can be adjusted. Drawn element 55 displays the time data obtained by the time calculation unit 153, namely the delimited sentence reproduction time and the delimited sentence reproduction start time. This display appears for a delimited sentence when the user designates that sentence with the mouse or the like.
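The two time values shown by drawn element 55 can be related as in the following Python sketch. It assumes that the reproduction time of each delimited sentence is already known (how the time calculation unit 153 derives it is not stated here) and that each duration already includes any sentence-end pause; the function name and the sample durations are hypothetical. The start time of a sentence is then the sum of the durations of all preceding sentences, measured from the start of reproduction of the whole text.

```python
from itertools import accumulate


def playback_schedule(durations):
    """Pair each delimited sentence's start time with its duration (seconds).

    The start time of sentence i is the total duration of sentences 0..i-1.
    """
    # Running totals give the END time of each sentence; shifting them by one
    # position (and starting from 0.0) gives the START times.
    starts = [0.0] + list(accumulate(durations))[:-1]
    return list(zip(starts, durations))


# Hypothetical reproduction times for three delimited sentences.
schedule = playback_schedule([2.0, 1.5, 3.0])
```

Because the start times depend on every preceding sentence, editing the reading speed or pause length of one sentence would shift the start times of all later sentences, which is one reason to recompute the schedule after each edit.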

FIG. 4 shows an example of the pronunciation information editing screen that the pronunciation information processing unit 152 displays for editing the intermediate synthesized sound data. On this screen, the pronunciation information, that is, the kana character string with phonetic and prosodic symbols, is displayed per delimited sentence. Drawn element 61 is an accent symbol indicating that the character immediately below it is accented. Drawn element 62 is an accent delimiter symbol marking an accent boundary. Drawn element 63 is a pause delimiter symbol marking a pause boundary. Drawn element 64 is a devoicing symbol indicating that the character immediately above it is unvoiced. Drawn element 65 is a nasalization symbol indicating that the character immediately above it is a nasalized sound (bidakuon). Pronunciation settings are changed by deleting these symbols or adding new ones.
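One way to model the symbols on this screen is to attach flags to each kana character, so that adding or deleting a symbol becomes toggling a flag. The Python sketch below is only an assumed data model for illustration: the class name, field names, and the `toggle` helper are hypothetical and do not come from the description.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Kana:
    char: str
    accent: bool = False                # accent symbol (drawn element 61)
    devoiced: bool = False              # devoicing symbol (drawn element 64)
    nasal: bool = False                 # nasalization symbol (drawn element 65)
    break_after: Optional[str] = None   # "accent" (62) or "pause" (63) boundary


def toggle(line, index, mark):
    """Return a new line with one boolean mark added or removed at `index`."""
    if mark not in ("accent", "devoiced", "nasal"):
        raise ValueError(f"unknown mark: {mark}")
    kana = line[index]
    updated = replace(kana, **{mark: not getattr(kana, mark)})
    return line[:index] + [updated] + line[index + 1:]


# Mark the final "ス" of "デス" as devoiced, leaving the original line intact.
line = [Kana("デ"), Kana("ス")]
edited = toggle(line, 1, "devoiced")
```

Returning a new list instead of mutating the old one keeps the previous state available, which is convenient if an edit needs to be undone.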

Because the intermediate synthesized sound data can be edited in units of delimited sentences as described above, the editor can more easily form a picture of the actual output synthesized sound while editing the intermediate synthesized sound data, even when the input text sentence is long. This makes it possible to improve the efficiency of speech synthesis processing.

The speech synthesis processing executed by the speech synthesizer of FIG. 1 is described below. FIG. 5 shows the flow of this processing. In step 40, the first step after processing starts, the sentence delimiter processing unit 11 performs sentence delimiter processing. Specifically, the text of the input text sentence T is divided at full stops, question marks, and exclamation marks into a set of delimited sentences. The text sentence T divided into delimited sentences is passed to the text analysis unit 12 and is also stored in the text storage unit 155 as a set of delimited sentence data 1551. In step 41, the text analysis unit 12 performs text analysis. As described above, this text analysis consists of syntactic analysis and, based on it, the setting of pronunciation information (generation of a kana character string with phonetic and prosodic symbols). In step 42, the reading information processing unit 151 makes the initial reading format settings. As described above, a standard reading format is set for each item of delimited sentence data, and the setting information is displayed as symbols at the head of each delimited sentence. In step 43, the intermediate synthesized sound data obtained through step 42 is displayed on the monitor. In step 44, it is determined whether the intermediate synthesized sound data displayed in step 43 needs to be edited. If the final synthesized sound data is to be created with the displayed pronunciation settings and reading format settings left as they are, that is, if no editing is needed (the determination is "No"), processing proceeds to step 46; if editing is needed (the determination is "Yes"), processing proceeds to step 45.
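The sentence delimiter processing of step 40, together with the manual insertion of a delimiter on the editing screen, can be sketched in Python as follows. This is a minimal sketch under assumptions: the set of terminator characters (full-width and half-width forms), the function names, and the choice to keep the terminator attached to its sentence are all illustrative rather than taken from the description.

```python
import re

# A run of non-terminators followed by one or more terminators, or a trailing
# remainder that has no terminator at all.
_SENTENCE = re.compile(r"[^。？！?!]*[。？！?!]+|[^。？！?!]+$")


def split_sentences(text):
    """Divide text into delimited sentences at full stops, ? and !."""
    pieces = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [p for p in pieces if p]


def insert_break(sentences, index, offset):
    """Manually split sentences[index] into two at character `offset`."""
    target = sentences[index]
    if not 0 < offset < len(target):
        raise ValueError("offset must fall strictly inside the sentence")
    return (sentences[:index]
            + [target[:offset], target[offset:]]
            + sentences[index + 1:])


sentences = split_sentences("今日は晴れです。明日は雨ですか？はい！")
shorter = insert_break(sentences, 0, 3)
```

The manual split operates on the already divided list, so it can be applied to any sentence that the automatic criterion left too long without rerunning the automatic step.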

In step 45, the reading information and pronunciation information of the intermediate synthesized sound data are edited. When the editing in step 45 is finished, processing returns to step 43 so that the result of the edit can be checked. In step 46, it is determined whether the synthesized sound is to be reproduced, that is, whether to continue with acoustic conversion processing or to stop at the intermediate synthesized sound data stage. If processing is to stop at the intermediate synthesized sound data stage without reproducing the synthesized sound, the determination is "No" and processing ends. If the intermediate synthesized sound data is to undergo acoustic conversion processing and the synthesized sound is to be reproduced, the determination is "Yes" and processing proceeds to step 47. In step 47, the acoustic processing unit 17 performs acoustic conversion processing. Specifically, based on the pronunciation information and reading format information set in the intermediate synthesized sound data, the corresponding phoneme segments are extracted from the phoneme segment storage unit 16 and assigned to each kana character in the intermediate synthesized sound data to generate the final synthesized sound data. In step 48, the audio file creation unit 18 generates an audio file from the final synthesized sound data generated in step 47. In step 49, the audio file reproduction unit 20 outputs the synthesized sound using the audio file created by the audio file creation unit 18. The user listens to this output synthesized sound; if it is not as intended, processing returns to step 44, and the processing from step 44 onward is repeated until the output synthesized sound is as intended.
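The control flow of steps 40 to 49 can be summarized in a short Python sketch. It is an outline only: the `steps` and `ui` objects and all of their method names are hypothetical stand-ins for the processing units and the user's decisions, and the actual analysis, acoustic conversion, and file creation are not implemented here.

```python
def run_pipeline(text, steps, ui):
    """Run the step 40-49 flow; return (intermediate_data, audio_or_None)."""
    segments = steps.split(text)                 # step 40: sentence delimiting
    data = steps.analyze(segments)               # step 41: text analysis
    data = steps.init_reading_format(data)       # step 42: default reading format
    ui.display(data)                             # step 43: show intermediate data
    while True:
        while ui.needs_edit(data):               # step 44: editing needed?
            data = ui.edit(data)                 # step 45: edit
            ui.display(data)                     # back to step 43: check result
        if not ui.wants_playback():              # step 46: reproduce or stop?
            return data, None                    # stop at intermediate data
        final = steps.acoustic_convert(data)     # step 47: acoustic conversion
        audio = steps.make_audio_file(final)     # step 48: audio file creation
        ui.play(audio)                           # step 49: output synthesized sound
        if ui.accepts(audio):                    # sound is as intended
            return data, audio
        # otherwise fall through and return to step 44
```

The inner loop corresponds to the edit-and-review cycle between steps 43 and 45, and the outer loop corresponds to returning to step 44 after listening to the output.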

In speech synthesis, the present invention makes it easier to form a picture of the actual output synthesized sound when editing synthesized sound data, and thereby improves the efficiency of speech synthesis processing, among other effects. It can be widely used in the field of speech synthesis.

FIG. 1 is a diagram showing the configuration of a speech synthesizer according to one embodiment. FIG. 2 is a diagram showing an example of intermediate synthesized sound data. FIG. 3 is a diagram showing an example of the reading format information editing screen. FIG. 4 is a diagram showing an example of the pronunciation information editing screen. FIG. 5 is a diagram showing the flow of the speech synthesis processing executed by the speech synthesizer of FIG. 1.

Explanation of symbols

1 Speech synthesizer
11 Sentence delimiter processing unit
12 Text analysis unit
15 Synthesized sound data display/editing unit
19 Sound effect/background sound storage unit
151 Reading information processing unit
152 Pronunciation information processing unit
153 Time calculation unit
1551 Delimited sentence data

Claims

1. A speech synthesis method that includes editing processing of the phonetic symbols and the reading format set for a text sentence in order to generate synthesized sound from the input text sentence, the method comprising sentence delimiter processing in which the text of the text sentence is divided according to a predetermined sentence delimiter criterion into a set of delimited sentences of finite length, wherein the editing processing is performed in units of the delimited sentences obtained by the sentence delimiter processing.

2. The speech synthesis method according to claim 1, wherein a sentence-ending character such as a full stop, a question mark, or an exclamation mark is used as the sentence delimiter criterion.

3. The speech synthesis method according to claim 1, further comprising: processing for displaying a reading format information editing screen and editing the reading format information; and processing for displaying a pronunciation information editing screen and editing the pronunciation information.

4. A speech synthesizer that generates synthesized sound from an input text sentence, comprising a sentence delimiter processing unit that divides the text of the text sentence according to a predetermined sentence delimiter criterion into a set of delimited sentences of finite length.

5. The speech synthesizer according to claim 4, wherein a sentence-ending character such as a full stop, a question mark, or an exclamation mark is used as the sentence delimiter criterion.

6. The speech synthesizer according to claim 5, further comprising a reading information processing unit for editing, in units of the delimited sentences, the reading format information of intermediate synthesized sound data generated by setting pronunciation information and reading format information for the text sentence.

7. The speech synthesizer according to claim 6, further comprising a pronunciation information processing unit for editing, in units of the delimited sentences, the pronunciation information of the intermediate synthesized sound data.

8. A speech synthesizer that generates synthesized sound from an input text sentence, comprising a synthesized sound data display/editing unit for editing intermediate synthesized sound data generated by setting pronunciation information and reading format information for the text sentence, wherein the synthesized sound data display/editing unit includes a time calculation unit that obtains a time related to reproduction when the synthesized sound is reproduced and output, and displays that time on the editing screen.

9. The speech synthesizer according to claim 8, further comprising a sentence delimiter processing unit that divides the text of the text sentence according to a predetermined sentence delimiter criterion into a set of delimited sentences of finite length, wherein the time calculation unit obtains, for each delimited sentence, a delimited sentence reproduction time required to reproduce the synthesized sound of that sentence, and a delimited sentence reproduction start time from the start of synthesized sound reproduction for the text sentence to the start of reproduction of that delimited sentence.

10. A speech synthesizer that generates synthesized sound from an input text sentence, comprising: a sound effect/background sound storage unit that stores sound effects and background sounds to be superimposed on the synthesized sound; and a reading information processing unit that performs editing processing for superimposing the sound effects and background sounds on the synthesized sound.