JP2008185911A - Voice synthesizer

Info

Publication number: JP2008185911A
Application number: JP2007021048A
Authority: JP
Inventors: Seiichi Amashiro; 成一天白; Yasuo Sobashima; 康雄傍島; Takaaki Moriyama; 高明森山; Yasuhiro Fujii; 泰宏藤井; Mutsuaki Miki; 睦明三木; Ikuko Hatta; 育子八田
Original assignee: Arcadia Co Ltd
Current assignee: Arcadia Co Ltd
Priority date: 2007-01-31
Filing date: 2007-01-31
Publication date: 2008-08-14
Anticipated expiration: 2027-01-31
Also published as: JP4856560B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer that allows a user to easily obtain the desired synthesized speech without setting or changing parameter values.

SOLUTION: (1) The user gives the speech synthesizer character string information, which is the text data of a certain character string. (2) The user gives the speech synthesizer speech information obtained by uttering the character string with the accent the user has in mind. (3) The speech synthesizer extracts the accent from the acquired speech information. (4) The speech synthesizer generates synthesized speech information corresponding to the character string based on the accent of the speech information. Synthesized speech corresponding to the character string is thus generated on the basis of the accent that the user inputs by voice, so the user can easily obtain synthesized speech having the intended accent.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a speech synthesizer, and more particularly to one that allows a user to obtain the desired synthesized speech without setting the parameter values required for speech synthesis.

A conventional speech synthesizer is described first. The speech synthesizer 100, one such conventional device, makes it easy to obtain the desired speech synthesis data by giving appropriate parameters to the speech synthesis engine.

When the speech synthesizer 100 acquires a character string of mixed kanji and kana from the user via the keyboard, it automatically determines the parameter values required for speech synthesis from the acquired string and generates synthesized speech based on the determined values. The parameters include the accent height (high or low) and the accent position.

The speech synthesizer 100 also allows the generated synthesized speech to be corrected. To make it easy for the user to modify the parameters of the generated synthesized speech, the speech synthesizer 100 visually displays the current parameter settings on the display, as shown in FIG. 16. In FIG. 16, the setting of the accent, which is one of the parameters, is displayed visually. Each of the kana characters "a", "ra", "shi", "ma", "cho", "ー" (long-vowel mark), and "no" that make up the kana string "arashimachōno" (あらしまちょーの) is placed at a vertical position corresponding to the accent assigned to it. In this figure, every kana character other than "a" is given a relatively higher accent than "a".

To lower the accent of the kana character "ma", the user operates the mouse and drags the kana character frame 72 downward. In response, the speech synthesizer 100 changes the parameter value so as to lower the accent of "ma". Then, as shown in FIG. 17, the speech synthesizer 100 moves the kana character frame 72 for "ma" downward and displays it. In this way, the user can easily edit the accent height, which is one of the parameters.

Japanese Unexamined Patent Application Publication No. 2004-246129 (JP 2004-246129 A)

The speech synthesizer 100 described above has the following problem. With the speech synthesizer 100, the user can correct the synthesized speech by changing parameter values. That is, if the user supplies appropriate parameters, the speech synthesizer 100 can provide the synthesized speech the user wants. Conversely, if the user cannot supply appropriate parameter values, the speech synthesizer 100 will not provide the synthesized speech the user wants.

In general, even when users have a concrete image of the synthesized speech they want, they often do not know which parameter must be changed, and by how much, to obtain it. As a result, the user has to repeat the cycle of changing a parameter setting and then checking the synthesized speech.

For example, when the user enters "Nakayama" (中山) via the keyboard, the speech synthesizer 100 automatically determines, as one of the parameters, the accent values "low", "high", "high", "high" for the kana characters "na", "ka", "ya", "ma" that make up "Nakayama", and generates synthesized speech for "Nakayama". The accent values "low", "high", "high", "high" that the speech synthesizer 100 assigns to "na", "ka", "ya", "ma" in this example are the accent that the personal name "Nakayama" generally has.

Now consider the case where the user entered "Nakayama" with a place name in a certain region of Kinki in mind. In this case, unless the accent values "low", "high", "low", "low" are set for the kana characters "na", "ka", "ya", "ma", the user cannot obtain the desired synthesized speech. That is, the user must first recognize that the accent values for "na", "ka", "ya", "ma" should be "low", "high", "low", "low" rather than "low", "high", "high", "high", and then correct the accent values accordingly.

However, it is not easy for a user who is not an expert in phonetics to identify the accent position and accent magnitude of a character string such as a word or phrase. The user therefore has to repeat the cycle of changing the accent position and value and then checking the synthesized speech until the intended synthesized speech is obtained.

Accordingly, an object of the present invention is to provide a speech synthesizer with which a user can easily obtain the desired synthesized speech without setting or changing parameter values.

The means for solving the problem and the effects of the invention are described below.

The speech synthesizer, speech synthesis program, and speech synthesis method according to the present invention acquire character string information representing a certain character string, acquire a certain speech as speech information, extract from the acquired speech information the accent that the speech information has as accent information, and, based on the accent information and the character string information, generate synthesized speech that corresponds to the character string and has that accent.

This makes it possible to generate synthesized speech corresponding to the character string based on the accent that the user inputs by voice. The user can therefore easily obtain synthesized speech having the intended accent.

The speech synthesizer or speech synthesis program according to the present invention further extracts the accent information using a fundamental frequency function that represents the change over time of the fundamental frequency of the acquired speech information. This makes it easy to determine the accent.

The speech synthesizer or speech synthesis program according to the present invention associates the character string represented by the acquired character string information with the speech information, determines which of the syllables making up the character string carries an accent, extracts the syllable determined to carry an accent as the accent information, and generates synthesized speech that corresponds to the character string in the character string information and has an accent on the syllable so determined. This makes it easy to identify the syllable on which an accent exists.

The speech synthesizer or speech synthesis program according to the present invention changes a feature amount of the generated synthesized speech, generates synthesized speech having the changed feature amount, and displays on the display means, as synthesized speech candidates, both the synthesized speech with the changed feature amount and the synthesized speech before the change. The displayed synthesized speech candidates are selectable through the input means, and when it is determined that one of the synthesized speech candidates displayed on the display means has been selected, that candidate is determined to be the synthesized speech.

This makes it possible to provide a plurality of synthesized speech candidates. The user can therefore obtain synthesized speech through the simple operation of selecting from the candidates provided.

In the speech synthesizer or speech synthesis program according to the present invention, the feature amount includes at least one of the pitch and the speed of the sound. This makes it easy to obtain synthesized speech candidates in which the pitch and/or speed of the sound has been changed.

In the speech synthesizer or speech synthesis program according to the present invention, when the feature amount consists of the pitch and the speed of the sound, the synthesized speech candidates are arranged on a plane whose two axes are pitch and speed.

This allows the user to easily grasp how the provided synthesized speech candidates relate to one another. The user can therefore easily select the desired one from the synthesized speech candidates.
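As a rough illustration of arranging candidates on a pitch/speed plane, the following minimal sketch builds a grid of candidate settings around an unmodified base setting. The function name `candidate_grid`, the 3 x 3 layout, the multiplier representation, and the step size of 0.1 are all assumptions made for illustration; the text above does not specify them.

```python
from itertools import product


def candidate_grid(base_pitch=1.0, base_speed=1.0, step=0.1, offsets=(-1, 0, 1)):
    """Map each grid position (column, row) to a (pitch, speed) setting.

    Columns vary pitch and rows vary speed, so position (0, 0) is the
    unmodified synthesized speech. All values are hypothetical multipliers.
    """
    grid = {}
    for col, row in product(offsets, offsets):
        grid[(col, row)] = (base_pitch + col * step, base_speed + row * step)
    return grid


grid = candidate_grid()
for position in sorted(grid):
    print(position, grid[position])
```

Each grid entry would be handed to the synthesis step to produce one candidate, and the grid position tells the display where to draw it.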

In the speech synthesizer or speech synthesis program according to the present invention, when it is determined that one of the synthesized speech candidates displayed on the display means has been selected, that candidate is played back, and when confirmation information that fixes the played-back candidate as the synthesized speech information is acquired, the candidate is determined to be the synthesized speech.

This allows the user to select a synthesized speech candidate after checking how it sounds when played back.

The correspondence between the elements recited in the claims and the elements in the embodiment is as follows. The speech synthesizer corresponds to the speech synthesizer 21. The speech information acquisition means corresponds to the CPU 211, the sound circuit 218, and the microphone 218m; the accent information extraction means to the CPU 211 and the memory 212; the speech synthesis means to the CPU 211 and the memory 212; the character string information acquisition means to the CPU 211, the memory 212, the keyboard 214, and the mouse 215; the modified synthesized speech generation means to the CPU 211 and the memory 212; the synthesized speech candidate display means to the CPU 211 and the memory 212; and the synthesized speech determination means to the CPU 211 and the memory 212.

The speech information acquisition means executes steps S501, S503, and S513; the accent information extraction means executes steps S515, S517, S801 to S815, and S901 to S913; the speech synthesis means executes steps S519 and S1101 to S1105; the character string information acquisition means executes steps S501, S503, and S511; the modified synthesized speech generation means executes steps S1001 to S1005 and S1201 to S1205; the synthesized speech candidate display means executes step S1207; and the synthesized speech determination means executes steps S1209 to S1215.

The accent information corresponds to the accent position information.

A "feature amount" is information that characterizes speech, and is a concept that includes whether the speech is high or low in pitch, whether it is fast or slow, and so on.

"Speech" means sound that is emitted, whether directly or indirectly, and the source of the sound is not limited to humans. It need only have an accent, and the concept includes sound whose meaning cannot be understood, such as humming.

An embodiment of the speech synthesizer according to the present invention is described below.

1. Overview
An overview of the speech synthesizer according to the present invention is given with reference to FIG. 1.

(1) The user gives the speech synthesizer character string information, which is the text data of a certain character string.

(2) The user gives the speech synthesizer speech information obtained by uttering the character string with the accent the user has in mind.

(3) The speech synthesizer extracts the accent from the acquired speech information.

(4) The speech synthesizer generates synthesized speech information corresponding to the character string based on the accent of the speech information.

2. Functional block diagram
The speech synthesizer M1 according to the present invention is described with reference to the functional block diagram shown in FIG. 2. The speech synthesizer M1 has speech information acquisition means M11, accent information extraction means M13, speech synthesis means M15, and character string information acquisition means M17.

The speech information acquisition means M11 acquires a certain speech as speech information.

The accent information extraction means M13 extracts, from the acquired speech information, the accent that the speech information has as accent information. The accent information extraction means M13 extracts the accent information using a fundamental frequency function that represents the change over time of the fundamental frequency of the acquired speech information. Further, the accent information extraction means M13 associates the character string represented by the acquired character string information with the speech information, determines which of the syllables making up the character string carries an accent, and extracts the syllable determined to carry an accent as the accent information.

Based on the accent information and the character string information, the speech synthesis means M15 generates synthesized speech that corresponds to the character string and has the accent. The speech synthesis means M15 also generates synthesized speech that corresponds to the character string in the character string information and has an accent on the syllable determined to carry an accent.

The character string information acquisition means M17 acquires character string information representing a certain character string.

This makes it possible to generate synthesized speech corresponding to the character string based on the accent that the user inputs by voice. The user can therefore easily obtain synthesized speech having the intended accent. In addition, the accent can be determined easily, and the syllable on which an accent exists can be identified easily.

3. Hardware configuration of the speech synthesizer 21
The hardware configuration of the speech synthesizer 21, which is a speech synthesizer according to the present invention, is described with reference to FIG. 3. The speech synthesizer 21 includes a CPU 211, a memory 212, a hard disk 213, a keyboard 214, a mouse 215, a display 216, a CD-ROM drive 217, a sound circuit 218, a speaker 218s, and a microphone 218m.

The CPU 211 performs processing based on the operating system (OS), the speech synthesis program, and other applications recorded on the hard disk 213. The memory 212 provides a work area for the CPU 211. The hard disk 213 records and holds the operating system (OS), the speech synthesis program and other applications, and various data. The data recorded on the hard disk 213 is described later.

The keyboard 214 and the mouse 215 accept commands from outside. The display 216 displays images such as the user interface. The CD-ROM drive 217 reads data from CD-ROMs; for example, it reads the speech synthesis program from the CD-ROM 210 on which that program is recorded, and reads other application programs from other CD-ROMs. The sound circuit 218 converts the speech synthesis data it is given into an analog waveform and outputs it to the speaker 218s. The sound circuit 218 also converts the analog waveform acquired via the microphone 218m into a digital waveform.

4. Data
The syllable duration database (hereinafter, "syllable duration DB") that the speech synthesizer 21 records on the hard disk 213 is described with reference to FIG. 4. The syllable duration DB is a database that associates each syllable with the standard duration of that syllable when it is uttered.

The syllable duration DB has a [syllable] column C401 and a [duration] column C405. The [syllable] column C401 holds the kinds of syllables commonly used in Japanese. The [duration] column C405 holds, as the duration, the time taken to utter the syllable in the [syllable] column C401 at a standard speed.
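The two-column structure of the syllable duration DB can be pictured as a simple mapping from syllable to duration. The sketch below is only an illustration: the name `SYLLABLE_DURATION_DB`, the helper `lookup_durations`, and in particular the duration values are placeholders, not values taken from FIG. 4.

```python
# Hypothetical stand-in for the syllable duration DB.
# Key: entry of the [syllable] column; value: entry of the [duration] column
# in seconds. The numbers are illustrative placeholders only.
SYLLABLE_DURATION_DB = {
    "な": 0.15,
    "か": 0.15,
    "や": 0.15,
    "ま": 0.15,
}


def lookup_durations(syllables):
    """Return the standard duration of each syllable, in the order given."""
    return [SYLLABLE_DURATION_DB[s] for s in syllables]


print(lookup_durations(["な", "か", "や", "ま"]))
```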

5. Operation of the speech synthesizer 21
An outline of the operation of the CPU 211 of the speech synthesizer 21 is described with reference to FIG. 5. The CPU 211 displays a character string / speech information acquisition screen D1 for acquiring character string information or speech information (S501).

FIG. 6 shows an example of the character string / speech information acquisition screen D1 displayed on the display 216 of the speech synthesizer 21. The screen D1 has a character string input area A601 and a speech information acquisition start button B601. The character string input area A601 is an area in which the user enters, using input means such as the keyboard 214, character string information representing the kana string to be synthesized. The speech information acquisition start button B601 is a button that the user selects with the mouse 215 or the like when about to input by voice the accent to be given to the character string.

Returning to FIG. 5, when the CPU 211 determines that the speech information acquisition start button B601 (see FIG. 6) has been selected (S503), it acquires the character string entered in the character string input area A601 of the screen D1 as character string information and stores it in the memory 212 (S511). The CPU 211 also acquires speech information via the microphone 218m (S513). Speaking into the microphone, the user inputs speech that corresponds to the character string entered in the character string input area A601 and that has the accent the user has in mind. When the CPU 211 determines that speech information has been acquired, it executes the syllable alignment process (S515), the accent position determination process (S517), and the speech synthesis process (S519). The CPU 211 then plays back the generated synthesized speech information through the speaker 218s (S521).

The syllable alignment process (S515), the accent position determination process (S517), and the speech synthesis process (S519) are described below.

5.1. Syllable alignment process
The syllable alignment process executed by the CPU 211 (see S515 in FIG. 5) determines where in the acquired speech information the syllable boundaries of the kana characters making up the character string information lie. The process is described with reference to the flowchart shown in FIG. 7.

The CPU 211 acquires, from the [duration] column C405 of the syllable duration DB (see FIG. 4), the duration corresponding to each kana character making up the character string information acquired in step S511 (see FIG. 5) (S801). The CPU 211 calculates the total duration by summing the acquired values of the [duration] column C405 (S803). For each kana character making up the character string information, the CPU 211 calculates the ratio of that character's duration to the calculated total duration (S805).

The CPU 211 also measures the utterance time of the speech information acquired in step S513 (see FIG. 5) (S811). Based on the ratios of each kana character's duration to the total duration calculated in step S805 and on the utterance time measured in step S811, the CPU 211 determines the correspondence between the kana characters making up the character string information and the speech information (S813), and stores it in the memory 212 as a character-speech correspondence table (S815).

The character-speech correspondence table is described with reference to FIG. 8. The table has a [character] column C1401 and a [corresponding time] column C1403. The [character] column C1401 holds the kana characters making up the character string information. The [corresponding time] column C1403 holds the time span in the speech information that corresponds to the kana character in the [character] column C1401.

For example, if the span from 0.00 seconds to 0.30 seconds of the speech information corresponds to the kana character "na" in the character string information "nakayama" (なかやま), the value 0'00"-0'30" is written in the [corresponding time] column C1403 for "na" in the [character] column C1401.

This makes it possible to grasp the correspondence between the speech information and the kana characters making up the character string information, that is, which time span of the speech information acquired from the microphone 218m corresponds to which kana character.
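The proportional split of steps S801 to S815 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `align_syllables`, the tuple layout of the returned table, the equal standard durations of 0.15 s, and the 1.2 s utterance length are all hypothetical.

```python
def align_syllables(syllables, standard_durations, utterance_seconds):
    """Divide the measured utterance time among the syllables in proportion
    to their standard durations.

    Returns a list of (syllable, start_seconds, end_seconds) tuples, playing
    the role of the character-speech correspondence table.
    """
    total = sum(standard_durations)  # total duration (S803)
    table = []
    start = 0.0
    for syllable, duration in zip(syllables, standard_durations):
        ratio = duration / total  # share of the total duration (S805)
        end = start + utterance_seconds * ratio  # boundary in the speech (S813)
        table.append((syllable, start, end))
        start = end
    return table


# Assumed inputs: equal standard durations and a 1.2 s utterance.
table = align_syllables(["な", "か", "や", "ま"], [0.15, 0.15, 0.15, 0.15], 1.2)
for syllable, start, end in table:
    print(f"{syllable}: {start:.2f}-{end:.2f} s")
```

With these assumed inputs the first syllable spans roughly 0.00 s to 0.30 s, which matches the form of the table entry given in the example above.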

5.2. Accent position determination process
The accent position determination process executed by the CPU 211 (see S517 in FIG. 5) is described with reference to the flowchart shown in FIG. 9. The CPU 211 calculates the fundamental frequency function for the speech information acquired in step S513 (see FIG. 5) (S901). The fundamental frequency in the fundamental frequency function is obtained by calculating the autocorrelation function of the acquired speech information and finding a period whose correlation value is at or above a certain threshold.
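An autocorrelation-based estimate of the fundamental frequency for a single analysis frame might look like the sketch below. Several details are assumptions that the text does not specify: the function name `estimate_f0`, the search range of 80 to 400 Hz, the threshold of 0.5, normalization by the frame energy, and choosing the lag with the highest correlation among those that reach the threshold.

```python
import math


def estimate_f0(frame, sample_rate, fmin=80.0, fmax=400.0, threshold=0.5):
    """Estimate the fundamental frequency (Hz) of one frame of samples.

    Computes the autocorrelation normalized by the frame energy for each
    candidate lag and returns sample_rate / lag for the strongest lag whose
    correlation is at or above the threshold, or None if there is none.
    """
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return None  # silent frame: no pitch
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), n - 1)
    best_lag, best_corr = None, threshold
    for lag in range(min_lag, max_lag + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        if corr >= best_corr:
            best_lag, best_corr = lag, corr
    if best_lag is None:
        return None
    return sample_rate / best_lag


# A 200 Hz sine sampled at 8 kHz, used here as a stand-in for voiced speech.
rate = 8000
frame = [math.sin(2 * math.pi * 200 * i / rate) for i in range(400)]
print(estimate_f0(frame, rate))
```

Applying such a function to successive frames yields one fundamental frequency value per frame, that is, a sampled form of the fundamental frequency function.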

The CPU 211 calculates the first derivative of the calculated fundamental frequency function (S903). The CPU 211 then determines that an accent position exists where the value of this first derivative changes from positive to negative (S905), and temporarily stores the time corresponding to the accent position in the memory 212 as accent position information (S907). The CPU 211 acquires the character-speech correspondence table from the memory 212 (S909) and determines which of the kana characters making up the character string information corresponds to the time in the accent position information (S911). The CPU 211 temporarily stores the kana character at which the accent position exists in the memory 212 as accent character information (S913).

The syllable alignment process has already associated the speech information with the kana characters making up the character string information, so it is known which time span of the speech information corresponds to which kana character. It is therefore also possible to determine which kana character corresponds to the time in the accent position information.
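Steps S903 to S913 can be sketched on a sampled fundamental frequency contour as follows. The sketch approximates the first derivative by differences between neighbouring samples; the function names, the table layout of (kana, start, end) tuples, and the sample F0 values are hypothetical and chosen only to illustrate the "low, high, low, low" case.

```python
def find_accent_times(times, f0_values):
    """Return the times at which the slope of the F0 contour changes from
    positive to negative (S903 to S907). Flat stretches are ignored."""
    accent_times = []
    for i in range(1, len(f0_values) - 1):
        rising_before = f0_values[i] - f0_values[i - 1] > 0
        falling_after = f0_values[i + 1] - f0_values[i] < 0
        if rising_before and falling_after:
            accent_times.append(times[i])
    return accent_times


def accent_characters(accent_times, table):
    """Map each accent time to the kana whose time span contains it
    (S909 to S913). `table` is a list of (kana, start, end) tuples."""
    result = []
    for t in accent_times:
        for kana, start, end in table:
            if start <= t < end:
                result.append(kana)
                break
    return result


# Assumed correspondence table and one F0 sample per syllable midpoint.
table = [("な", 0.0, 0.3), ("か", 0.3, 0.6), ("や", 0.6, 0.9), ("ま", 0.9, 1.2)]
times = [0.15, 0.45, 0.75, 1.05]
f0 = [120.0, 160.0, 130.0, 110.0]
print(accent_characters(find_accent_times(times, f0), table))
```

In practice the contour would contain many samples per syllable and would likely need smoothing before the slope test, which this sketch omits.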

5.3. Speech synthesis process
The speech synthesis process executed by the CPU 211 (see S519 in FIG. 5) is described with reference to the flowchart shown in FIG. 10. The CPU 211 acquires from the memory 212 the character string information acquired in step S511 (S1101). The CPU 211 also acquires from the memory 212 the accent character information obtained in the accent position determination process (see S517 in FIG. 5) (S1103).

Based on the character string information and the accent character information, the CPU 211 generates synthesized speech information (S1105) and stores it in the memory 212 (S1107). The synthesized speech itself is generated using a conventional, commonly used speech synthesis technique.

In this way, the speech synthesizer 21 obtains the accent to be given to the character string for which synthesized speech is to be generated from speech information uttered by the user. In other words, the user can supply the speech synthesizer 21 with speech carrying the accent to be given to the kana character string simply by uttering it. The user can therefore easily obtain synthesized speech of the character string with the accent that he or she uttered.

6. Specific Example
The operation of the CPU 211 of the speech synthesizer 21 described so far will now be illustrated with a specific example.

The user brings up the character string / speech information acquisition screen D1 for acquiring character string information or speech information (see FIG. 5: S501). Suppose that the user uses the keyboard 214 to enter the kana character string "nakayama" (なかやま) in the character string input area A601 of the character string / speech information acquisition screen D1 (see FIG. 6) displayed on the display 216 of the speech synthesizer 21, and then selects the speech information acquisition start button B601 with the mouse 215. Suppose also that the user's aim is to generate synthesized speech for the place name "Nakayama" (中山) as pronounced in a certain area of the Kinki region.

When the CPU 211 determines that the speech information acquisition start button B601 has been selected (see FIG. 5: S503), it acquires the kana character string "nakayama" (なかやま) entered in the character string input area A601 as character string information and stores it in the memory 212 (see FIG. 5: S511).

The CPU 211 also acquires speech information via the microphone 218m (see FIG. 5: S513). Speaking into the microphone, the user pronounces the character string "nakayama" (なかやま) as the place name, with the accents on the kana characters "na" (な), "ka" (か), "ya" (や), and "ma" (ま) being "low", "high", "low", and "low" respectively, and thereby inputs it as speech information. When the CPU 211 determines that it has acquired the speech information, it retrieves the duration corresponding to each of the kana characters "na", "ka", "ya", and "ma" constituting the acquired character string information "nakayama" from the [duration] column C405 of the syllable duration DB (see FIG. 7: S801). The CPU 211 calculates the total duration by summing the retrieved [duration] column C405 values (see FIG. 7: S803). For each kana character constituting the character string information, the CPU 211 calculates the ratio between the calculated total duration and the duration of that kana character (see FIG. 7: S805).

The CPU 211 also measures the utterance time of the speech information acquired from the microphone 218m (see FIG. 7: S811). Based on the ratio between the total duration and the duration of each kana character, and on the measured utterance time, the CPU 211 determines the correspondence between each kana character constituting the character string information and the speech information (see FIG. 7: S813), generates the character-speech correspondence table (see FIG. 8), and stores it in the memory 212 (see FIG. 7: S815).
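The syllable alignment of steps S801 to S815 can be sketched as follows. This is only an illustration under stated assumptions: the standard durations below are invented values standing in for the [duration] column C405, the measured utterance time is assumed to be 1.0 s, and the table layout (kana, start time, end time) and the function name are hypothetical. Each kana character receives a share of the measured utterance time proportional to its standard duration.

```python
from typing import Dict, List, Tuple


def align_kana(
    kana: List[str],
    standard_ms: Dict[str, float],
    utterance_s: float,
) -> List[Tuple[str, float, float]]:
    """Split the measured utterance time among the kana by standard-duration ratio."""
    # Total of the standard durations (S803).
    total_ms = sum(standard_ms[k] for k in kana)
    table = []
    start = 0.0
    for k in kana:
        # Share of this kana in the total (S805), scaled to the utterance (S813).
        share = standard_ms[k] / total_ms
        end = start + share * utterance_s
        table.append((k, start, end))
        start = end
    return table


# Assumed standard durations in milliseconds (not taken from the source).
standard_ms = {"な": 120.0, "か": 130.0, "や": 140.0, "ま": 110.0}
table = align_kana(["な", "か", "や", "ま"], standard_ms, utterance_s=1.0)
for kana, start, end in table:
    print(f"{kana}: {start:.2f} s to {end:.2f} s")
```

Because only ratios are used, a user who speaks faster or slower than the standard durations still gets boundaries that scale with the actual utterance length.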

The CPU 211 calculates the fundamental frequency function of the acquired speech information (see FIG. 9: S901). The CPU 211 calculates the first derivative of the calculated fundamental frequency function (see FIG. 9: S903) and determines that an accent position lies where the value of the first derivative changes from positive to negative (see FIG. 9: S905). The CPU 211 temporarily stores the time corresponding to the accent position in the memory 212 as accent position information (see FIG. 9: S907). The CPU 211 acquires the character-speech correspondence table (see FIG. 9: S909) and determines which of the kana characters "na" (な), "ka" (か), "ya" (や), and "ma" (ま) constituting the character string information corresponds to the time given by the accent position information (see FIG. 9: S911). On determining that the kana character at which the accent position lies is "ka" (か), the CPU 211 temporarily stores it in the memory 212 as accent character information (see FIG. 9: S913).

The CPU 211 then retrieves the character string information "nakayama" (なかやま) from the memory 212 (see FIG. 10: S1101). The CPU 211 also retrieves the kana character "ka" (か), which is the accent character information, from the memory 212 (see FIG. 10: S1103).

Based on the character string information and the accent character information, the CPU 211 generates synthesized speech information (see FIG. 10: S1105) and stores it in the memory 212 (see FIG. 10: S1107).

The CPU 211 plays back the generated synthesized speech information through the speaker 218s (see FIG. 5: S521).

The user can thus easily obtain synthesized speech of the character string "nakayama" (なかやま) with the accent that he or she uttered.

1. Overview
An overview of the second embodiment of the speech synthesizer according to the present invention will be described. In the first embodiment described above, the speech synthesizer 21 extracted the accent to be given to a character string from speech information in which the user uttered speech carrying that accent, and thereby generated synthesized speech of the character string with that accent. As a result, the user could easily obtain synthesized speech of the character string with the accent that he or she uttered.

On the other hand, the speech synthesizer 21 generated the synthesized speech using syllable speech data having the standard pitch, speed, and so on of each syllable. Some users, however, want something other than the standard pitch and speed, and make requests such as "a slightly higher tone" or "a slightly slower speed".

To meet the needs of users who want to modify synthesized speech easily, the speech synthesizer of this embodiment offers several variants of the synthesized speech in which predetermined parameters have been changed, so that the user can easily obtain the synthesized speech he or she wants.

In this embodiment, configurations and operations that are the same as in the first embodiment are given the same reference numerals as in the first embodiment.

2. Functional Block Diagram
The speech synthesizer M51 according to the present invention will be described with reference to the functional block diagram shown in FIG. 11. The speech synthesizer M51 has speech information acquisition means M11, accent information extraction means M13, speech synthesis means M15, character string information acquisition means M17, modified synthesized speech generation means M21, synthesized speech candidate display means M23, and synthesized speech determination means M25. The speech information acquisition means M11, accent information extraction means M13, speech synthesis means M15, and character string information acquisition means M17 have the same configuration as in the first embodiment, so their description is omitted below.

The modified synthesized speech generation means M21 changes a feature amount of the generated synthesized speech and generates synthesized speech having the changed feature amount.

The synthesized speech candidate display means M23 displays, on the display means, the synthesized speech whose feature amount has been changed and the synthesized speech before the change as synthesized speech candidates. The synthesized speech candidate display means M23 is configured so that the displayed synthesized speech candidates can be selected via the input means. When the feature amount consists of the pitch and the speed of the sound, the synthesized speech candidate display means M23 arranges the synthesized speech candidates on a plane whose two axes are pitch and speed.

When the synthesized speech determination means M25 determines that one of the synthesized speech candidates displayed on the display means has been selected, it determines that candidate to be the synthesized speech. Furthermore, when the synthesized speech determination means M25 determines that one of the displayed synthesized speech candidates has been selected, it plays back that candidate, and when it acquires confirmation information that confirms the played-back candidate as the synthesized speech information, it determines that candidate to be the synthesized speech.

This makes it possible to provide a plurality of synthesized speech candidates, so the user can obtain synthesized speech by the simple operation of selecting from the candidates provided. Synthesized speech candidates in which the pitch and/or the speed of the sound has been changed can also be obtained easily. In addition, the user can readily grasp how the provided candidates relate to one another and can therefore easily select the desired one. The user can also select a candidate after checking how it sounds when played back.

3. Hardware Configuration of Speech Synthesizer 51
The hardware configuration of the speech synthesizer 51, which is a speech synthesizer according to the present invention, is the same as the hardware configuration in the first embodiment (see FIG. 3).

4. Data
The data of the speech synthesizer 51, which is a speech synthesizer according to the present invention, is the same as the data in the first embodiment (see FIG. 4).

5. Operation of Speech Synthesizer 51
An overview of the operation of the CPU 211 of the speech synthesizer 51 will be described with reference to FIG. 12. The processing of steps S501 to S519 and S521 in FIG. 12 is the same as the operation of the CPU 211 of the speech synthesizer 21 of the first embodiment. Accordingly, only the processing of steps S551 and S553 is described below.

When the speech synthesis process of step S519 ends, the CPU 211 executes the pitch/speed extraction process (S551). When the pitch/speed extraction process (S551) ends, the CPU 211 executes the synthesized speech candidate provision process (S553).

The pitch/speed extraction process (S551) and the synthesized speech candidate provision process (S553) are described below.

5.1. Pitch/Speed Extraction Processing
The pitch/speed extraction process executed by the CPU 211 (see FIG. 12: S551) will be described with reference to the flowchart shown in FIG. 13. The CPU 211 acquires, as pitch information, the pitch of the sound, which is one of the syllable feature amounts of the speech information acquired in step S513 (S1001). The CPU 211 further calculates, as speed information, the speed of the speech information, which is another of the syllable feature amounts of the speech information acquired in step S513 (S1003). The CPU 211 temporarily stores the pitch information and speed information thus obtained in the memory 212 (S1005). The pitch information and speed information may be obtained using, for example, acoustic information and prosodic information contained in the frequency components of the speech information.
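One possible reading of steps S1001 to S1005 is sketched below. The text does not specify how pitch and speed are computed, so the definitions used here are assumptions: pitch is taken as the mean fundamental frequency over voiced samples, and speed as the number of kana characters uttered per second. The function name and sample values are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def extract_pitch_and_speed(
    f0_hz: List[float],
    kana_count: int,
    utterance_s: float,
) -> Tuple[float, float]:
    """Return (pitch, speed) under the assumed definitions described above."""
    # Unvoiced frames are assumed to be marked with an F0 of 0 and are skipped.
    voiced = [f for f in f0_hz if f > 0]
    pitch = mean(voiced) if voiced else 0.0
    # Speed as kana characters per second (an assumed measure).
    speed = kana_count / utterance_s
    return pitch, speed


# Assumed sample data: a four-kana utterance lasting 0.8 s.
pitch, speed = extract_pitch_and_speed([0.0, 120.0, 160.0, 140.0, 0.0], 4, 0.8)
print(pitch, speed)
```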

5.2. Synthesized Speech Candidate Provision Processing
The synthesized speech candidate provision process executed by the CPU 211 (see FIG. 12: S553) will be described with reference to the flowchart shown in FIG. 14. The CPU 211 retrieves from the memory 212 the speed information and pitch information extracted by the pitch/speed extraction process S551 (see FIG. 12) (S1201). The CPU 211 changes the values of the speed information and pitch information of the synthesized speech information and generates synthesized speech information having the changed feature amounts (S1203). Specifically, parameters that characterize synthesized speech are set in advance, and a cost value based on the values of those parameters is calculated as a reference cost value. A cost value based on the values that the synthesized speech information synthesized in step S1203 has for those parameters is likewise calculated as a synthesis cost value. The reference cost value and the synthesis cost value are then compared, and synthesized speech information whose synthesis cost value satisfies a predetermined condition is generated. The CPU 211 generates as many pieces of synthesized speech information as it can.

The CPU 211 adds the synthesized speech information generated in step S1105 (see FIG. 10) to the synthesized speech information generated in step S1203 to form the synthesized speech candidate information (S1204). Based on the speed information value and pitch information value of each piece of synthesized speech candidate information, the CPU 211 then places each piece on a synthesized speech selection diagram whose two axes are speed and pitch (S1205). The CPU 211 displays the generated synthesized speech selection diagram on the display 216 (S1207).

FIG. 15 shows the synthesized speech selection diagram displayed on the display 216. The diagram shows a plane formed by a vertical axis representing the pitch of the sound and a horizontal axis representing its speed. Each piece of synthesized speech candidate information is located by its speed information value and pitch information value and is displayed on this plane. The synthesized speech information synthesized in step S1105 (see FIG. 10) is placed at the intersection of the vertical and horizontal axes (the origin). The synthesized speech candidate information displayed on the diagram can be selected by the user with the mouse 215 or the like.
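The placement of candidates on the two-axis plane can be sketched as follows. This is an illustration under assumptions, not the disclosed layout: coordinates are taken as the relative deviation of each candidate's speed (horizontal) and pitch (vertical) from the unmodified synthesized speech, which puts that unmodified speech at the origin as described. The class, function, labels, and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    label: str        # hypothetical identifier for a candidate
    pitch_hz: float   # pitch information value
    speed: float      # speed information value


def layout(baseline: Candidate, variants: List[Candidate]) -> Dict[str, Tuple[float, float]]:
    """Map each candidate to (x, y): x is speed, y is pitch, baseline at (0, 0)."""
    points = {baseline.label: (0.0, 0.0)}
    for c in variants:
        x = c.speed / baseline.speed - 1.0        # horizontal axis: speed
        y = c.pitch_hz / baseline.pitch_hz - 1.0  # vertical axis: pitch
        points[c.label] = (x, y)
    return points


# Assumed values: the unmodified speech plus two modified variants.
baseline = Candidate("standard", pitch_hz=140.0, speed=5.0)
variants = [
    Candidate("higher", pitch_hz=168.0, speed=5.0),
    Candidate("slower", pitch_hz=140.0, speed=4.0),
]
points = layout(baseline, variants)
for label, (x, y) in points.items():
    print(f"{label}: x={x:+.2f}, y={y:+.2f}")
```

With this layout a candidate above the origin sounds higher and one to the left sounds slower, which is one way to let the user see how the candidates relate to one another.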

Returning to FIG. 14, the CPU 211 determines whether the decision button B1301 (see FIG. 15) has been selected (S1209). When the CPU 211 determines that the decision button B1301 has not been selected and that one piece of synthesized speech candidate information has been selected (S1211), it outputs the speech for the selected candidate from the speaker 218s via the sound circuit 218 (S1213).

When the CPU 211 determines that the decision button B1301 has been selected, it treats the synthesized speech candidate information selected at that time as the synthesized speech information and records it on the hard disk 213 (S1215).
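The selection loop of steps S1209 to S1215 can be sketched as a small event handler. This is a simplified illustration: the event tuples, the `play` and `save` callbacks, and the function name are hypothetical stand-ins for the mouse input, the speaker 218s, and the hard disk 213. The text does not say what happens if the decision button is pressed before any candidate is selected; this sketch assumes the press is ignored.

```python
from typing import Callable, Iterable, Optional, Tuple

Event = Tuple[str, Optional[str]]  # ("select", candidate label) or ("confirm", None)


def run_selection(
    events: Iterable[Event],
    play: Callable[[str], None],
    save: Callable[[str], None],
) -> Optional[str]:
    """Play each selected candidate; save and return the current one on confirm."""
    selected: Optional[str] = None
    for kind, value in events:
        if kind == "confirm":                 # decision button selected (S1209)
            if selected is not None:
                save(selected)                # record the current selection (S1215)
                return selected
            # Assumption: a confirm with nothing selected is ignored.
        elif kind == "select" and value is not None:  # candidate selected (S1211)
            selected = value
            play(selected)                    # play it back (S1213)
    return None


# Usage with lists standing in for the speaker and the hard disk.
played, saved = [], []
result = run_selection(
    [("select", "higher"), ("select", "slower"), ("confirm", None)],
    play=played.append,
    save=saved.append,
)
print(result, played, saved)
```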

[Other Embodiments]
(1) Syllable alignment processing
In the first embodiment described above, the syllable alignment process, which determines where in the acquired speech information the syllable boundaries lie, summed the standard durations of the syllables constituting the syllable sequence and determined the accent position from the relationship between that sum and the duration of the speech information. However, the process is not limited to this example as long as it can determine where in the speech information the syllable boundaries lie. For example, speech recognition technology using a hidden Markov model may be used to determine where the syllable boundaries lie.

(2) Syllable duration DB
In the first embodiment described above, the durations corresponding to the kana characters constituting the character string information were obtained from a simple syllable duration DB consisting of a [syllable] column and a [duration] column. In practice, however, the duration of even the same syllable varies with the preceding and following syllables, the position of the syllable within the spoken phrase, and the strength of the utterance. A more complex database that includes these conditions may therefore be used. Furthermore, instead of a simple table lookup, a probabilistic model, a decision tree, or the like may be used to obtain the duration of a syllable from the database.

(3) Speech information
In the first embodiment described above, the speech information was acquired through the sound circuit 218 and the microphone 218m, but any means that can obtain speech information may be used. For example, the speech information may be acquired from a recording medium such as a CD-ROM, or over a communication line.

(4) Character string information
In the first embodiment described above, Japanese kana characters were given as an example of the character string information, but any character string may be used. For example, it may be in a foreign language such as English.

(5) Feature amounts
In the second embodiment described above, pitch and speed were used as the feature amounts when generating the synthesized speech candidate information, but the feature amounts are not limited to these as long as synthesized speech candidate information can be generated. For example, information expressing subjective characteristics of a voice, such as bright or dark, may be used.

(6) Synthesized speech candidate information
In the second embodiment described above, the synthesized speech candidate information was generated using a reference cost value and a synthesis cost value calculated with the pitch and speed values of the speech information and of the synthesized speech information as parameter values. However, the method is not limited to this example as long as synthesized speech candidate information can be generated.

In addition, although the pitch and speed obtained from the speech information were used to calculate the reference cost value, this is only an example. For instance, the reference cost value may be calculated from preset parameter values, in which case the pitch and speed need not be extracted from the speech information.

(7) Display of candidates
In the second embodiment described above, the synthesized speech selection diagram was generated with pitch and speed as the feature amounts and as its two axes, but the number of feature amounts is not limited to two. For example, three or four feature amounts may be selected.

Likewise, the synthesized speech selection diagram is not limited to the two-axis display of the example as long as synthesized speech candidate information can be selected from it. For example, it may be a three-axis or four-axis display.

Furthermore, although the synthesized speech information synthesized in step S1105 (see FIG. 10) was placed at the intersection of the vertical and horizontal axes (the origin) of the synthesized speech selection diagram, the speech information acquired from the user in step S513 may be displayed at the origin instead.

(8) Order of processing in the flowcharts
In the first and second embodiments described above, each process was implemented according to the flowcharts shown in the figures. However, the order of the steps within each flowchart is not limited to that shown, as long as each process can be carried out.

A diagram showing an overview of the speech synthesizer according to the present invention.
A functional block diagram of the speech synthesizer in the first embodiment.
A diagram showing the hardware configuration of the speech synthesizer 21.
A diagram showing the data structure of the syllable duration DB.
A flowchart showing the operation of the speech synthesizer 21.
A diagram showing the character string / speech information acquisition screen D1.
A flowchart showing the syllable alignment process.
A diagram showing the character-speech correspondence table.
A flowchart showing the accent position determination process.
A flowchart showing the speech synthesis process.
A functional block diagram of the speech synthesizer in the second embodiment.
A flowchart showing the operation of the speech synthesizer 51.
A flowchart showing the pitch/speed extraction process.
A flowchart showing the synthesized speech candidate provision process.
A diagram showing the synthesized speech selection diagram.
A diagram for explaining a conventional speech synthesizer.
A diagram for explaining a conventional speech synthesizer.

Explanation of symbols

21 ..... Speech synthesizer
51 ..... Speech synthesizer
M11 ..... Speech information acquisition means
M13 ..... Accent information extraction means
M15 ..... Speech synthesis means
M17 ..... Character string information acquisition means
M21 ..... Modified synthesized speech generation means
M23 ..... Synthesized speech candidate display means
M25 ..... Synthesized speech determination means

Claims

A speech synthesizer that generates synthesized speech corresponding to a character string,
the speech synthesizer comprising:
character string information acquisition means for acquiring character string information representing a character string;
speech information acquisition means for acquiring speech as speech information;
accent information extraction means for extracting, from the acquired speech information, the accent of the speech information as accent information; and
speech synthesis means for generating, based on the accent information and the character string information, synthesized speech that corresponds to the character string and has the accent.

A speech synthesis program that generates synthesized speech corresponding to a character string,
the speech synthesis program causing a computer to function as:
character string information acquisition means for acquiring character string information representing a character string;
speech information acquisition means for acquiring speech as speech information;
accent information extraction means for extracting, from the acquired speech information, the accent of the speech information as accent information; and
speech synthesis means for generating, based on the accent information and the character string information, synthesized speech that corresponds to the character string and has the accent.

The speech synthesizer according to claim 1 or the speech synthesis program according to claim 2,
wherein the accent information extraction means further
extracts the accent information using a fundamental frequency function representing the change over time in the fundamental frequency of the acquired speech information.

The speech synthesizer or the speech synthesis program according to claim 3,
wherein the accent information extraction means further
determines, by associating the character string represented by the acquired character string information with the speech information, which of the syllables constituting the character string carries the accent, and extracts the syllable determined to carry the accent as the accent information, and
the speech synthesis means further
generates synthesized speech that corresponds to the character string of the character string information and has the accent on the syllable determined to carry it.

The speech synthesizer or the speech synthesis program according to any one of claims 1 to 4, further comprising:
modified synthesized speech generation means for changing a feature amount of the generated synthesized speech and generating synthesized speech having the changed feature amount;
synthesized speech candidate display means for displaying, on display means, the synthesized speech whose feature amount has been changed and the synthesized speech before the change as synthesized speech candidates, the displayed synthesized speech candidates being selectable via input means; and
synthesized speech determination means for determining, upon determining that one of the synthesized speech candidates displayed on the display means has been selected, that synthesized speech candidate to be the synthesized speech.

The speech synthesizer or the speech synthesis program according to claim 5,
wherein the feature amount
includes at least one of the pitch and the speed of the sound.

The speech synthesizer or the speech synthesis program according to claim 6,
wherein the synthesized speech candidate display means,
when the feature amount consists of the pitch and the speed of the sound, arranges the synthesized speech candidates on a plane whose two axes are the pitch and the speed of the sound.

The speech synthesizer or the speech synthesis program according to any one of claims 5 to 7,
wherein the synthesized speech determination means further,
upon determining that one of the synthesized speech candidates displayed on the display means has been selected, plays back that synthesized speech candidate, and, upon acquiring confirmation information that confirms the played-back synthesized speech candidate as synthesized speech information, determines that synthesized speech candidate to be the synthesized speech.

A synthesized speech generation method for generating, using a computer, synthesized speech corresponding to a character string, wherein:
the computer acquires character string information representing a character string;
the computer acquires speech as speech information;
the computer extracts, from the acquired speech information, the accent of the speech information as accent information; and
the computer generates, based on the accent information and the character string information, synthesized speech that corresponds to the character string and has the accent.