JP2010032978A - Voice message creation device and method

Info

Publication number: JP2010032978A
Application number: JP2008197827A
Authority: JP
Inventors: Ryota Kamoshita; 亮太鴨志田; Kenji Nagamatsu; 健司永松; Yusuke Fujita; 雄介藤田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2008-07-31
Filing date: 2008-07-31
Publication date: 2010-02-12
Anticipated expiration: 2028-07-31
Also published as: JP5218971B2

Abstract

PROBLEM TO BE SOLVED: To automatically create a voice message with uniform voice quality. SOLUTION: Real voice data corresponding to an input text is identified, and the difference between an acoustic feature quantity of the identified real voice data and a reference acoustic feature quantity defined for that data is calculated. Based on the calculated difference and a value indicating the importance of the identified real voice data, it is determined whether to convert the acoustic feature quantity of the identified real voice data. If it is determined that the acoustic feature quantity is to be converted, it is converted and a voice message is created from the converted real voice data. Otherwise, the voice message is created from the identified real voice data as is. COPYRIGHT: (C)2010, JPO&INPIT

Description

The technology disclosed in this specification relates to an apparatus for creating voice messages, and more particularly to an apparatus that creates a voice message by concatenating a plurality of sound pieces.

Systems that create voice messages by concatenating a plurality of sound pieces are widely used, for example for public-address announcements at railway stations and for route guidance in car navigation systems. A sound piece here is a unit such as a word, a syllable, or a phrase made up of several of these. A sound piece may be recorded real voice used as is, or synthesized speech generated by speech synthesis technology.

In such systems, two factors degrade the quality of the created voice message.

The first factor is that, when recorded real voice is used directly as sound pieces, voice quality differs between sound pieces. Recording work for creating sound pieces often extends over a long period, so it is extremely difficult to keep the speaker's voice quality constant over the whole recording period. The resulting sound piece database therefore contains variation in voice quality, which degrades quality when the pieces are concatenated into a voice message.

The second factor is the difference between the voice quality of recorded real voice sound pieces and that of synthesized sound pieces. Speech synthesis technology has advanced remarkably in recent years, and the quality of synthesized speech has improved greatly. Nevertheless, synthesized speech still differs in voice quality from real voice, and this difference degrades quality when the pieces are concatenated into a voice message.

As a countermeasure against quality degradation caused by such sound quality differences between concatenated sound pieces, there is, for example, the technique described in Patent Document 1. According to Patent Document 1, the sound quality difference between real voice sound pieces and synthesized sound pieces is reduced by lowering the bit rate of the real voice sound pieces.

Patent Document 2 discloses a technique for creating synthesized speech with an arbitrary prosody.

Non-Patent Document 1 and Non-Patent Document 2 disclose techniques for converting the voice quality of sound pieces.

Patent Document 1: JP-A-62-215299
Patent Document 2: JP-A-11-249677
Non-Patent Document 1: Kazuo Nakata, "Speech" (revised edition, first printing), Corona Publishing, 1995, pp. 126-128
Non-Patent Document 2: Tamura and one other, "Voice quality conversion for multiple speech unit selection / fusion speech synthesis", Proceedings of the Acoustical Society of Japan, March 2006, pp. 237-238

The technique disclosed in Patent Document 1 assumes that the sound quality of synthesized sound pieces is lower than that of real voice sound pieces, and reduces the difference between the two by lowering the sound quality of the real voice sound pieces. This mitigates the quality degradation that results from non-uniform sound quality across the voice message. However, lowering the bit rate greatly degrades the sound quality of the entire voice message. Moreover, lowering the bit rate can reduce sound quality but cannot adjust voice quality. The technique of Patent Document 1 therefore cannot eliminate quality degradation caused by voice quality differences between real voice sound pieces, nor can it eliminate the voice quality difference between synthesized sound pieces and real voice sound pieces.

A representative invention disclosed in the present application is a voice message creation device that creates a voice message using real voice data, comprising a storage device in which the real voice data is stored in advance, a processor connected to the storage device, and an input device and an output device connected to the processor. When information specifying a text is input, the device identifies the real voice data corresponding to the text specified by the input information; calculates the difference between an acoustic feature quantity of the identified real voice data and a reference acoustic feature quantity defined for the identified real voice data; and determines, based on the calculated difference and a value indicating the importance of the identified real voice data, whether to convert the acoustic feature quantity of the identified real voice data. When it is determined that the acoustic feature quantity is to be converted, the device converts the acoustic feature quantity of the identified real voice data and creates a voice message based on the converted real voice data. When it is determined that the acoustic feature quantity is not to be converted, the device creates a voice message based on the identified real voice data.

According to an embodiment of the present invention, a voice quality conversion cost is calculated for each real voice sound piece, and voice quality conversion is performed only on those sound pieces whose voice quality conversion cost is equal to or less than a threshold. This makes the voice quality within the voice message uniform while minimizing the quality degradation caused by voice quality conversion, so that a high-quality voice message can be created.
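The threshold rule described above can be sketched as follows. This is only an illustration: the function name, the piece IDs, the cost values, and the threshold are all hypothetical, and the costs are taken as given inputs rather than computed.

```python
def pieces_to_convert(costs: dict[str, float], threshold: float) -> list[str]:
    """Return the IDs of sound pieces whose conversion cost is at or below the threshold."""
    return [piece_id for piece_id, cost in costs.items() if cost <= threshold]


# Hypothetical costs and threshold, for illustration only.
example_costs = {"0001": 0.9, "0002": 2.5, "0003": 1.0}
selected = pieces_to_convert(example_costs, threshold=1.0)
```

Pieces not selected would be used in the voice message without conversion.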

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram showing the schematic configuration of a voice message creation device according to the embodiment of the present invention.

The voice message creation device of this embodiment includes an input unit 1, an output unit 2, a storage unit 3, a voice message creation unit 4, a speech synthesis unit 5, a voice quality difference amount calculation unit 6, a voice quality conversion sound piece determination unit 7, and a voice quality conversion unit 8.

The input unit 1 accepts input of text or sound piece string information for creating a voice message.

The output unit 2 outputs the created voice message, as well as the progress and results of operations.

The storage unit 3 stores the programs and data needed to carry out the present invention.

The voice message creation unit 4 determines the sound pieces needed to create the voice message based on the text or sound piece string information input to the input unit 1, and creates the voice message by concatenating the voice-quality-converted sound pieces.

The speech synthesis unit 5 synthesizes the sound pieces needed for the voice message based on the input text.

The voice quality difference amount calculation unit 6 calculates, for each real voice sound piece, the voice quality difference amount between that real voice sound piece and a synthesized sound piece created so that its prosody equals that of the real voice sound piece.

The voice quality conversion sound piece determination unit 7 calculates a voice quality conversion cost based on the voice quality difference amount calculated by the voice quality difference amount calculation unit 6 and on the sound piece information of each real voice sound piece, and determines from that cost which real voice sound pieces are to undergo voice quality conversion.

The voice quality conversion unit 8 converts the voice quality of the real voice sound pieces determined by the voice quality conversion sound piece determination unit 7.

The concepts of "prosody" and "voice quality" are explained here. A listener can perceive various auditory characteristics of a voice. Among these characteristics, pitch, loudness, and speaking rate are referred to as "prosody" in this embodiment. The characteristics of a voice that are not part of prosody are referred to as "voice quality" in this embodiment. Voice quality is perceived, for example, as the thickness of the voice or its degree of hoarseness. Voice characteristics that have been quantified by measurement are called acoustic feature quantities. Typical parameters treated as acoustic feature quantities include the spectrum, cepstrum, delta cepstrum, and mel cepstrum.

FIG. 2 is a block diagram showing the hardware configuration of the voice message creation device according to the embodiment of the present invention.

The voice message creation device shown in FIG. 1 is realized by the hardware shown in FIG. 2.

The voice message creation device of this embodiment includes a control device 110, a storage device 120, an input device 130, and an output device 140, which are connected so that they can communicate with one another.

The control device 110 controls the operation of this embodiment. The control device 110 includes a CPU 111 and a memory 112. The CPU 111 is a processor that executes programs stored in the memory 112. The memory 112 is, for example, a semiconductor memory, and stores the programs executed by the CPU 111 and the data referenced by the CPU 111. The programs and data held in the memory 112 may be stored in the storage device 120 and copied from the storage device 120 to the memory 112 as needed. By executing the programs stored in the memory 112, the CPU 111 controls data input and output in the storage device 120, the input device 130, and the output device 140, as well as various other processing.

The storage device 120 corresponds to the storage unit 3 in FIG. 1. The storage device 120 stores the programs executed by the CPU 111 and the data referenced by the CPU 111. The storage device 120 may be, for example, a disk device such as a hard disk drive (HDD) or a semiconductor memory such as a flash memory. The storage device 120 of this embodiment stores a voice message creation unit 121, a speech synthesis unit 122, a voice quality difference amount calculation unit 123, a voice quality conversion sound piece determination unit 124, and a voice quality conversion unit 125. When the CPU 111 executes these, the voice message creation unit 4, the speech synthesis unit 5, the voice quality difference amount calculation unit 6, the voice quality conversion sound piece determination unit 7, and the voice quality conversion unit 8 shown in FIG. 1 are realized.

The storage device 120 also stores a speech database 126. The speech database 126 contains sound piece data corresponding to various texts, that is, data obtained by A/D conversion of real voice sound pieces recorded as a speaker utters those texts.

The input device 130 corresponds to the input unit 1 in FIG. 1. The input device 130 includes a keyboard 133 and a mouse 134. The keyboard 133 and the mouse 134 are interfaces that accept instructions from an operator and send them to the control device 110. The input device 130 may include any other kind of interface instead of, or in addition to, the keyboard 133 and the mouse 134. By operating the input device, the operator can input text or sound piece string information to the voice message creation device.

The output device 140 corresponds to the output unit 2 in FIG. 1. The output device 140 includes a digital/analog (D/A) converter 141, a speaker 142, and a display 143. The D/A converter 141 converts audio data into an analog electrical signal. The speaker 142 converts the analog electrical signal output from the D/A converter 141 into sound. The display 143 is an interface that presents various information to the operator.

As shown in FIG. 2, the voice message creation unit 4, the speech synthesis unit 5, the voice quality difference amount calculation unit 6, the voice quality conversion sound piece determination unit 7, and the voice quality conversion unit 8 of this embodiment are realized by the CPU 111 executing programs stored in the memory 112. However, they may instead be realized by dedicated hardware (for example, a dedicated processor) provided in the voice message creation device.

FIG. 3 is a flowchart showing the overall operation of the voice message creation device according to the embodiment of the present invention.

In the voice message creation device of FIG. 1, the voice message creation unit 4 first determines, based on the text or sound piece string information input to the input unit 1, which sound pieces in the database will be used to create the voice message (step S201). FIG. 4(a) shows an example of the result. Text 302 is the input text. Sound piece ID 301 is the ID assigned to the real voice sound piece corresponding to the input text; it identifies the real voice sound piece data within the speech database 126. If the speech database 126 contains no real voice sound piece data corresponding to an input text 302, the sound piece ID 301 for that text 302 is left blank. The sound piece for such a text 302 must be generated as synthesized speech.

FIG. 4(a) shows, as an example, the result determined in step S201 when the text "Soon, turn right at the Nakano-Sakaue intersection, then follow the road for a while" is input. In this example, real voice sound piece data corresponding to the texts "Soon", "right", "after that", and "follow the road for a while" is stored in the speech database 126. Real voice sound piece data corresponding to the text "Nakano-Sakaue intersection" is not stored in the speech database 126, so the sound piece ID 301 for that text is blank.
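The lookup in step S201 can be sketched as a simple table lookup. This is a minimal illustration, not the device's actual implementation: the dictionary standing in for the speech database 126, the function name, and the numeric ID values are all hypothetical (the source only states that real voice sound pieces have IDs and that a missing piece leaves the ID blank).

```python
from typing import Optional

# Hypothetical stand-in for speech database 126: text -> real voice sound piece ID.
# The ID values are invented for illustration only.
SPEECH_DB: dict[str, str] = {
    "soon": "0001",
    "right": "0002",
    "after that": "0003",
    "follow the road for a while": "0004",
}


def select_sound_pieces(segments: list[str]) -> list[tuple[Optional[str], str]]:
    """Pair each text segment with its real voice piece ID, or None (blank) if absent."""
    return [(SPEECH_DB.get(segment), segment) for segment in segments]


segments = [
    "soon",
    "Nakano-Sakaue intersection",
    "right",
    "after that",
    "follow the road for a while",
]
plan = select_sound_pieces(segments)
```

Here the entry for "Nakano-Sakaue intersection" has no ID, which marks it as a segment that still needs a sound piece.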

Next, the speech synthesis unit 5 generates synthesized speech for each text that has no real voice sound piece, that is, synthesized speech that reads the text aloud (step S202). Any conventionally known method may be used to generate this synthesized speech. FIG. 4(b) shows an example of the sound piece determination result after the synthesized speech has been generated. In the example of FIG. 4(b), a sound piece ID 301 that begins with a letter identifies a sound piece generated by speech synthesis.

Specifically, in FIG. 4(b), a sound piece has been generated by speech synthesis for the text "Nakano-Sakaue intersection", for which no real voice sound piece data existed, and the sound piece ID "A001" has been assigned to it. The rest is the same as in FIG. 4(a).
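The ID assignment in step S202 can be sketched as below. This is an assumption-laden illustration: the function name and the sequential numbering scheme are hypothetical, the real voice ID "0001" is invented, and the synthesis itself is omitted. Only the letter-prefixed ID form (such as "A001") comes from the description.

```python
from typing import Optional


def assign_synthetic_ids(
    plan: list[tuple[Optional[str], str]], prefix: str = "A"
) -> list[tuple[str, str]]:
    """Give each segment with a blank ID a letter-prefixed ID for its synthesized piece.

    Actual speech synthesis is out of scope here; only the bookkeeping is shown.
    """
    filled = []
    counter = 0
    for piece_id, text in plan:
        if piece_id is None:
            counter += 1
            piece_id = f"{prefix}{counter:03d}"  # e.g. A001, A002, ...
        filled.append((piece_id, text))
    return filled


# Hypothetical result of step S201: the second segment has no real voice piece.
example_plan = [("0001", "soon"), (None, "Nakano-Sakaue intersection")]
filled_plan = assign_synthetic_ids(example_plan)
```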

Next, the voice quality difference amount calculation unit 6 performs the voice quality difference amount calculation process (A) for each real voice sound piece that the voice message creation unit 4 has decided to use (step S204).

FIG. 5 is a flowchart showing the voice quality difference amount calculation process (A) executed in the embodiment of the present invention.

The voice quality difference amount calculation unit 6 first creates, for a sound piece determined in step S201 that exists in the speech database 126 as a real voice sound piece, real-voice-prosody synthesized speech whose prosody equals that of the real voice sound piece (step S401). The technique described in Patent Document 2, cited as background art, may be used, for example, to create the real-voice-prosody synthesized speech.

Next, the voice quality difference amount calculation unit 6 calculates the difference between the acoustic feature quantities of the created real-voice-prosody synthesized speech and of the real voice sound piece as the voice quality difference amount (step S402). Because the real-voice-prosody synthesized speech and the real voice sound piece have the same prosody, the difference in acoustic feature quantities can be treated as a difference in voice quality. The spectrum, cepstrum, delta cepstrum, mel cepstrum, or the like may be used as the acoustic feature quantity.

When the voice quality difference amount calculation for a real voice sound piece is finished, the voice quality difference amount calculation unit 6 determines whether the calculation has been completed for all the real voice sound pieces determined by the voice message creation unit 4 (step S403). If any real voice sound piece has not yet been processed, the voice quality difference amount calculation unit 6 repeats steps S401 and S402 until the calculation has been completed for all real voice sound pieces.

For example, if the real voice sound pieces corresponding to the texts "Soon", "right", "after that", and "follow the road for a while" were determined in step S201, steps S401 and S402 are executed for each of them.

For example, the voice quality difference amount calculation unit 6 creates a synthesized sound piece for the text "Soon" so that its prosody equals that of the real voice sound piece "Soon" (step S401). It then calculates the difference between the acoustic feature quantity of the synthesized sound piece "Soon" created in step S401 and that of the real voice sound piece "Soon" (step S402). Steps S401 and S402 are repeated until the same processing has been completed for the texts "right", "after that", and "follow the road for a while" (step S403).
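The loop of steps S401 to S403 can be sketched as follows. The description does not state how the difference between acoustic feature quantities is measured, so this sketch assumes, purely for illustration, a mean per-frame Euclidean distance between cepstrum-like feature vectors. The function names and the feature values are hypothetical, and the prosody-matched synthesis of step S401 is represented only by precomputed frames.

```python
import math

Frames = list[list[float]]  # one feature vector (e.g. cepstral coefficients) per frame


def voice_quality_difference(real_frames: Frames, synth_frames: Frames) -> float:
    """Mean per-frame Euclidean distance (an assumed metric, not from the source).

    Equal prosody implies equal duration, so both inputs must have the same frame count.
    """
    if not real_frames or len(real_frames) != len(synth_frames):
        raise ValueError("inputs must be non-empty and have the same number of frames")
    total = 0.0
    for real, synth in zip(real_frames, synth_frames):
        total += math.sqrt(sum((r - s) ** 2 for r, s in zip(real, synth)))
    return total / len(real_frames)


def compute_all_differences(pairs: dict[str, tuple[Frames, Frames]]) -> dict[str, float]:
    """Repeat the per-piece calculation (S402) until every real voice piece is done (S403)."""
    return {pid: voice_quality_difference(real, synth) for pid, (real, synth) in pairs.items()}


# Hypothetical features: (real voice frames, prosody-matched synthesized frames).
example_pairs = {"0001": ([[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0], [1.0, 1.0]])}
differences = compute_all_differences(example_pairs)
```

Any other distance on spectral or cepstral features could be substituted without changing the structure of the loop.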

FIG. 6 is an explanatory diagram showing the result of the voice quality difference amount calculation process in the embodiment of the present invention.

FIG. 6 shows an example of the voice quality difference amounts calculated for the real voice sound pieces determined as in FIG. 4. In FIG. 6, the sound piece ID 501 and the text 502 correspond to the sound piece ID 301 and the text 302 in FIG. 4, respectively.

The voice quality difference amount 503 is the voice quality difference amount calculated for each real voice sound piece. In the example of FIG. 4, "1.2345", "0.5467", "3.210", and "0.3322" have been calculated as the voice quality difference amounts 503 for the texts "Soon", "right", "after that", and "follow the road for a while", respectively. Because no real voice sound piece corresponding to the text "Nakano-Sakaue intersection" exists in the speech database 126, no voice quality difference amount 503 is calculated for that text.

With conventional speech synthesis techniques (for example, the one described in Patent Document 2), the voice quality of the synthesized speech is substantially constant. The value of the voice quality difference amount 503 calculated in step S204 for each real voice sound piece can therefore be regarded as equivalent to the voice quality difference between that real voice sound piece and the synthesized sound piece included in the voice message to be created (in the example of FIG. 6, the synthesized sound piece corresponding to the text "Nakano-Sakaue intersection"). Consequently, converting the voice quality of each real voice sound piece so that its voice quality difference amount 503 becomes smaller (ideally 0) can be expected to eliminate the voice quality difference between each real voice sound piece and the synthesized sound piece, and also the voice quality differences among the real voice sound pieces.

In general, however, converting the voice quality of a real voice sound piece lowers its sound quality, that is, how easily the sound piece can be heard and understood. The larger the amount of voice quality conversion (that is, the larger the voice quality difference amount 503 to be eliminated), the larger the loss of sound quality. The lower the sound quality of a sound piece, the more likely it is that the listener will fail to catch it. Whether to actually convert the voice quality of a real voice sound piece therefore has to be decided for each real voice sound piece on the basis of various factors.

For this reason, the voice quality conversion sound piece determination unit 7 next performs the voice quality conversion sound piece determination process (B) in step S205 of the overall flowchart (FIG. 3).

FIG. 7 is a flowchart showing the voice quality conversion sound piece determination process (B) executed in the embodiment of the present invention.

The voice quality conversion sound piece determination unit 7 first calculates the sound piece information of each real voice sound piece (step S601). FIG. 8 shows an example of the calculated sound piece information.

The table in FIG. 8 consists of the columns sound piece ID 701, text 702, voice quality difference amount 703, sound piece information 704, voice quality conversion cost 705, and voice quality conversion availability 706.

The sound piece ID 701, text 702, and voice quality difference amount 703 are the same as the sound piece ID 501, text 502, and voice quality difference amount 503, respectively.

The sound piece information 704 is determined based on the importance of each real voice sound piece. The importance of a real voice sound piece is an index of how much harm results if the listener fails to catch it, and may also be described as its necessity or usefulness. In the example of FIG. 8, a predetermined arbitrarily set importance and the length of the sound piece are used as indices of the importance of each real voice sound piece.

The arbitrarily set importance is an index, set in advance and at discretion for each real voice sound piece, that represents the importance of that sound piece in the voice message to be created. For example, in a car navigation voice message such as the one in FIG. 8, real voice sound pieces that indicate the destination, distance, direction, and the like have relatively high importance, while other real voice sound pieces have relatively low importance. The arbitrarily set importance may be set in advance for each sound piece by the manufacturer or user of the voice message creation device, or it may be calculated according to a predetermined criterion when the voice message creation unit 4 determines the sound pieces to use.

図８の例では、任意設定重要度の逆数（カラム７０４Ａ）が音片情報７０４の一部として計算される。具体的には、図８は、テキスト「右方向」に対応する音片が最も重要である（すなわち、それを聞き取れなかった場合の不利益が最も大きい）と判定された例を示す。このため、テキスト「右方向」に対応するカラム７０４Ａには、他のテキストに対応するものより小さい値「０．０５」が格納される。一方、図８では、テキスト「まもなく」及び「そのあと」に対応する音片の重要度が比較的低いと判定されている。このため、それらのテキストに対応するカラム７０４Ａには、比較的大きい値（それぞれ「３．００」及び「２．５０」）が格納されている。 In the example of FIG. 8, the reciprocal of the arbitrarily set importance (column 704 A) is calculated as part of the sound piece information 704. Specifically, FIG. 8 shows an example in which it is determined that the sound piece corresponding to the text “rightward” is the most important (that is, the disadvantage is the greatest when it cannot be heard). Therefore, a value “0.05” smaller than that corresponding to the other text is stored in the column 704A corresponding to the text “rightward”. On the other hand, in FIG. 8, it is determined that the importance level of the sound pieces corresponding to the text “soon” and “after” is relatively low. For this reason, relatively large values (“3.00” and “2.50”, respectively) are stored in the column 704A corresponding to the texts.

In addition, a longer sound piece is considered more likely to contain an important message, that is, to have higher importance. In the example of FIG. 8, the number of syllables in the text corresponding to a sound piece is defined as the length of that sound piece, and its reciprocal (column 704B) is calculated as part of the sound piece information 704. For example, the text “まもなく” (“soon”) has 4 syllables, so column 704B for that text stores “0.25”, the reciprocal of 4.
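The two reciprocals described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the names `PieceInfo` and `piece_info` are hypothetical, and the importance value 1/3 is an assumed input chosen only so that column 704A comes out near the “3.00” shown for “まもなく”.

```python
from dataclasses import dataclass


@dataclass
class PieceInfo:
    """Sound piece information 704 for one real voice sound piece."""
    inv_importance: float  # column 704A: reciprocal of the arbitrarily set importance
    inv_length: float      # column 704B: reciprocal of the syllable count


def piece_info(importance: float, syllable_count: int) -> PieceInfo:
    """Compute both reciprocals (hypothetical helper, not named in the source)."""
    # Both inputs must be positive, otherwise the reciprocals are undefined.
    if importance <= 0 or syllable_count <= 0:
        raise ValueError("importance and syllable_count must be positive")
    return PieceInfo(1.0 / importance, 1.0 / syllable_count)


# "まもなく" has 4 syllables, so column 704B holds 1/4.
# The importance 1/3 is an assumed value, not taken from the source.
info = piece_info(importance=1.0 / 3.0, syllable_count=4)
```

Storing reciprocals means that a more important or longer sound piece contributes a smaller value, which matters later when these values are combined into the conversion cost.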

The sound piece information 704 may further be calculated based on the parts of speech of the words included in the text corresponding to the real voice sound piece. For example, the arbitrarily set importance described above may be determined based on the part of speech. In a car navigation voice message such as the one shown in FIG. 8, a proper noun may be the name of the destination or of a point on the route to the destination, so it is considered more important than other parts of speech. For this reason, the arbitrarily set importance assigned to proper nouns may be larger than that assigned to other parts of speech.

When the text corresponding to one real voice sound piece includes a plurality of words whose parts of speech differ from one another, the sound piece information 704 may be, for example, the reciprocal of the sum of the arbitrarily set importance values of those parts of speech, or the reciprocal of the largest of those values.
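The two alternatives for a sound piece containing several parts of speech can be sketched as follows. The table `POS_IMPORTANCE`, its numeric values, and the function name are all hypothetical; the source only states that proper nouns may receive a larger importance than other parts of speech.

```python
# Hypothetical per-part-of-speech importance values; the source gives no numbers.
POS_IMPORTANCE = {"proper_noun": 5.0, "noun": 2.0, "verb": 1.0, "particle": 0.5}


def inv_importance_from_pos(pos_tags, mode="sum"):
    """Return the reciprocal of the combined importance for one sound piece.

    mode="sum": reciprocal of the sum of the importance values.
    mode="max": reciprocal of the largest importance value.
    """
    values = [POS_IMPORTANCE[tag] for tag in pos_tags]
    if not values:
        raise ValueError("pos_tags must not be empty")
    if mode == "sum":
        combined = sum(values)
    elif mode == "max":
        combined = max(values)
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    return 1.0 / combined
```

With these assumed values, a piece made of a proper noun and a particle yields 1/5.5 in `"sum"` mode and 1/5.0 in `"max"` mode; either way a proper noun pulls the value down, marking the piece as more important.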

Next, the voice quality conversion sound piece determination unit 7 calculates the voice quality conversion cost 705 based on the sound piece information 704 calculated above and the voice quality difference amount calculated by the voice quality difference amount calculation process of FIG. 5 (step S602). Specifically, the voice quality conversion sound piece determination unit 7 calculates the voice quality conversion cost 705 by weighting the voice quality difference amount based on the sound piece information 704.

In the example of FIG. 8, the voice quality conversion cost 705 is calculated as the sum of the values of the sound piece information 704 and the voice quality difference amount. For example, as shown in FIG. 8, when the voice quality difference amount 703 for the text “まもなく” (“soon”) is “1.234” and the corresponding sound piece information 704 is “3.00” and “0.25”, the voice quality conversion cost 705 is their total, “4.484”. The voice quality conversion cost 705 may, however, be calculated by another method, for example by multiplying the voice quality difference amount by the value of the sound piece information 704.
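The cost calculation of FIG. 8 can be sketched as follows. The function name and the `mode` parameter are hypothetical. In the `"product"` variant, multiplying by both columns 704A and 704B is an assumption, since the source only mentions multiplying the difference amount by “the value of the sound piece information 704”.

```python
def conversion_cost(diff, inv_importance, inv_length, mode="sum"):
    """Voice quality conversion cost 705 for one real voice sound piece.

    diff:           voice quality difference amount 703
    inv_importance: column 704A (reciprocal of the arbitrarily set importance)
    inv_length:     column 704B (reciprocal of the syllable count)
    """
    if mode == "sum":
        # Method used in the example of FIG. 8.
        return diff + inv_importance + inv_length
    if mode == "product":
        # Alternative mentioned in the text; using both columns is an assumption.
        return diff * inv_importance * inv_length
    raise ValueError("mode must be 'sum' or 'product'")


# Worked example for the text "まもなく": 1.234 + 3.00 + 0.25
cost = conversion_cost(1.234, 3.00, 0.25)
```

In both variants the cost grows with the difference amount and shrinks as importance or length grows, which is the property the determination step relies on.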

Next, the voice quality conversion sound piece determination unit 7 determines, for each real voice sound piece, whether the value of the voice quality conversion cost 705 exceeds a predetermined threshold (step S603).

In the example of FIG. 7, the threshold is set to 1.0, so in FIG. 8 the voice quality conversion costs 705 of the sound piece whose sound piece ID 701 is “0001” and the sound piece whose sound piece ID 701 is “0015” (“4.484” and “5.960”, respectively) exceed the threshold. For each sound piece whose cost exceeds the threshold, information indicating that the sound piece is a target of voice quality conversion (“1” in the example of FIG. 8) is stored as the voice quality conversion availability 706 (step S604).

For each sound piece whose cost does not exceed the threshold, information indicating that the sound piece is not a target of voice quality conversion (“0” in the example of FIG. 8) is stored as the voice quality conversion availability 706 (step S605).

In FIG. 7, a sound piece determined to be a target of voice quality conversion is labeled a voice quality conversion sound piece, and a sound piece determined not to be a target is labeled a voice quality non-conversion sound piece.
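Steps S603 to S605 amount to a strict threshold comparison per sound piece, which can be sketched as follows. The function name is hypothetical, and the entry for ID “0002” with cost 0.80 is an invented value added only to show the non-conversion case; the costs for “0001” and “0015” are those quoted from FIG. 8.

```python
def conversion_flags(costs, threshold=1.0):
    """Map each sound piece ID to its voice quality conversion availability 706.

    1 means the piece is a voice quality conversion sound piece (step S604),
    0 means it is a voice quality non-conversion sound piece (step S605).
    A cost exactly equal to the threshold does not exceed it, so it maps to 0.
    """
    return {piece_id: int(cost > threshold) for piece_id, cost in costs.items()}


costs = {
    "0001": 4.484,  # from FIG. 8
    "0015": 5.960,  # from FIG. 8
    "0002": 0.80,   # hypothetical low-cost piece, not from the source
}
flags = conversion_flags(costs, threshold=1.0)
```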

With the above processing, the voice quality conversion cost 705 becomes larger as the voice quality difference amount calculated for a real voice sound piece becomes larger and as the importance of that sound piece becomes lower. The larger the voice quality conversion cost 705 of a real voice sound piece, the more likely that piece is to be selected for voice quality conversion.

The larger the voice quality difference amount calculated for a real voice sound piece, the further the voice quality of that piece deviates from the voice quality of the uniform voice message to be created. In other words, to create a voice message with uniform voice quality, the need for voice quality conversion is greater for real voice sound pieces with a larger voice quality difference amount. Accordingly, with the above processing, among sound pieces of equal importance, those with a larger voice quality difference amount are more likely to be selected for voice quality conversion.

However, voice quality conversion degrades the sound quality of a sound piece, so it is better not to convert highly important sound pieces if they are to remain easy to hear. Accordingly, with the above processing, among sound pieces with the same voice quality difference amount, those with higher importance are less likely to be selected for voice quality conversion.

Next, in step S206 of the overall processing flowchart (FIG. 3), the voice quality conversion unit 8 performs voice quality conversion on the real voice sound pieces determined in step S205 to be voice quality conversion sound pieces. This conversion may be performed using the techniques disclosed in Non-Patent Document 1 or Non-Patent Document 2, cited as background art of the present invention. The conversion takes as its target the difference in acoustic feature quantities between the real voice sound piece and the real-voice-prosody synthesized speech, as calculated in step S402 of the voice quality difference amount calculation process of FIG. 5. In other words, after the conversion, the difference in acoustic feature quantities between the converted real voice sound piece and the corresponding real-voice-prosody synthesized speech is smaller than the difference calculated in step S402 of FIG. 5 (preferably 0).
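The property stated above, that the conversion reduces the feature difference toward 0, can be illustrated with a deliberately simple stand-in. The sketch below is not the conversion technique of Non-Patent Document 1 or 2. It assumes acoustic features are plain numeric vectors and moves each component linearly toward the reference; the function names and the `strength` parameter are hypothetical.

```python
def shift_toward_reference(features, reference, strength=1.0):
    """Move a feature vector toward the reference by a fraction `strength`.

    strength=0.0 leaves the features unchanged; strength=1.0 makes them
    equal to the reference, i.e. the remaining difference becomes 0.
    """
    if len(features) != len(reference):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return [f + strength * (r - f) for f, r in zip(features, reference)]


def distance(a, b):
    """Euclidean distance, used here as a stand-in difference measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


piece = [1.0, 2.0, 4.0]      # assumed features of a real voice sound piece
reference = [2.0, 2.0, 0.0]  # assumed features of the synthesized reference
converted = shift_toward_reference(piece, reference, strength=0.5)
```

Any `strength` greater than 0 leaves `distance(converted, reference)` smaller than `distance(piece, reference)`, which mirrors the requirement that the post-conversion difference be smaller than the one calculated in step S402.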

Next, the voice message creation unit 4 creates a voice message by connecting the sound pieces and outputs the created voice message (step S207).

As described above, according to the present invention, a voice quality conversion cost is calculated for each real voice sound piece to be used, and voice quality conversion is applied only to those pieces whose cost exceeds the threshold. This makes it possible to create a voice message with uniform voice quality while avoiding unnecessary degradation of sound quality.

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments and can be implemented with various modifications without departing from its gist.

FIG. 1 is a block diagram showing the schematic configuration of the voice message creation device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the voice message creation device according to the embodiment of the present invention.
FIG. 3 is a flowchart showing the overall operation of the voice message creation device according to the embodiment of the present invention.
FIG. 4 is an explanatory diagram showing an example of the result of determining the sound pieces to be used in the embodiment of the present invention.
FIG. 5 is a flowchart showing the voice quality difference amount calculation process executed in the embodiment of the present invention.
FIG. 6 is an explanatory diagram showing the result of the voice quality difference amount calculation process in the embodiment of the present invention.
FIG. 7 is a flowchart showing the voice quality conversion sound piece determination process executed in the embodiment of the present invention.
FIG. 8 is an explanatory diagram showing an example of the result of determining the voice quality conversion sound pieces in the embodiment of the present invention.

Explanation of symbols

1 Input unit
2 Output unit
3 Storage unit
4 Voice message creation unit
5 Speech synthesis unit
6 Voice quality difference amount calculation unit
7 Voice quality conversion sound piece determination unit
8 Voice quality conversion unit

Claims

A voice message creation device that creates a voice message using real voice data, the device comprising:
a storage device in which the real voice data is stored in advance, a processor connected to the storage device, and an input device and an output device connected to the processor,
wherein, when information specifying a text is input, the device identifies the real voice data corresponding to the text specified by the input information,
calculates the difference between an acoustic feature quantity of the identified real voice data and a reference acoustic feature quantity defined for the identified real voice data,
determines, based on the calculated difference and a value indicating the importance of the identified real voice data, whether or not to convert the acoustic feature quantity of the identified real voice data,
when it is determined that the acoustic feature quantity of the identified real voice data is to be converted, converts that acoustic feature quantity and creates a voice message based on the real voice data whose acoustic feature quantity has been converted, and
when it is determined that the acoustic feature quantity of the identified real voice data is not to be converted, creates a voice message based on the identified real voice data.

The voice message creation device according to claim 1, wherein the device creates synthesized voice data corresponding to the specified text such that the prosody of the synthesized voice data is equal to the prosody of the identified real voice data,
and uses an acoustic feature quantity of the created synthesized voice data as the reference acoustic feature quantity.

The voice message creation device according to claim 2, wherein the prosody of the voice data includes, among the acoustic feature quantities of the voice data, those corresponding to voice pitch, loudness, and speaking rate.

The voice message creation device according to claim 1, wherein the acoustic feature quantity of the voice data includes at least one of a spectrum, a cepstrum, a Δ cepstrum, and a mel cepstrum calculated from the voice data.

The voice message creation device according to claim 1, wherein the device
calculates a conversion cost by weighting the calculated difference using the value indicating the importance of the identified real voice data,
and determines that the acoustic feature quantity of the identified real voice data is to be converted when the calculated conversion cost exceeds a predetermined threshold.

The voice message creation device according to claim 5, wherein the device calculates the conversion cost such that the conversion cost increases as the calculated difference increases and as the value indicating the importance of the identified real voice data indicates lower importance.

The voice message creation device according to claim 6, wherein the value indicating the importance of the real voice data is determined based on at least one of: a value given in advance to the real voice data; the length of the voice corresponding to the real voice data; and the part of speech of a word included in the text corresponding to the real voice data.

The voice message creation device according to claim 7, wherein the value indicating the importance of the real voice data is determined so as to indicate higher importance as the voice corresponding to the real voice data becomes longer.

The voice message creation device according to claim 7, wherein the value indicating the importance of the real voice data when the text corresponding to the real voice data includes a proper noun is determined so as to indicate higher importance than the value indicating the importance of the real voice data when that text does not include a proper noun.

The voice message creation device according to claim 1, wherein, when it is determined that the acoustic feature quantity of the identified real voice data is to be converted, the device converts that acoustic feature quantity such that the difference between the acoustic feature quantity of the identified real voice data and the reference acoustic feature quantity becomes smaller.

A method by which a voice message creation device creates a voice message using real voice data,
wherein the voice message creation device includes a storage device in which the real voice data is stored in advance, a processor connected to the storage device, and an input device and an output device connected to the processor,
the method comprising:
a first procedure of identifying, when information specifying a text is input, the real voice data corresponding to the text specified by the input information;
a second procedure of calculating the difference between an acoustic feature quantity of the identified real voice data and a reference acoustic feature quantity defined for the identified real voice data;
a third procedure of determining, based on the calculated difference and a value indicating the importance of the identified real voice data, whether or not to convert the acoustic feature quantity of the identified real voice data;
a fourth procedure of, when it is determined that the acoustic feature quantity of the identified real voice data is to be converted, converting that acoustic feature quantity and creating a voice message based on the real voice data whose acoustic feature quantity has been converted; and
a fifth procedure of, when it is determined that the acoustic feature quantity of the identified real voice data is not to be converted, creating a voice message based on the identified real voice data.

The method according to claim 11, further comprising a sixth procedure of creating synthesized speech data corresponding to the specified text such that the prosody of the synthesized speech data is equal to the prosody of the identified real voice data,
wherein an acoustic feature quantity of the synthesized speech data created in the sixth procedure is used as the reference acoustic feature quantity in the second procedure.

The method according to claim 12, wherein the prosody of the speech data includes, among the acoustic feature quantities of the speech data, those corresponding to voice pitch, loudness, and speaking rate.

The method according to claim 11, wherein the acoustic feature amount of the speech data includes at least one of a spectrum, a cepstrum, a Δ cepstrum, and a mel cepstrum calculated from the speech data.

The method according to claim 11, wherein the third procedure includes:
a procedure of calculating a conversion cost by weighting the calculated difference using the value indicating the importance of the identified real voice data; and
a procedure of determining that the acoustic feature quantity of the identified real voice data is to be converted when the calculated conversion cost exceeds a predetermined threshold.

The method according to claim 15, wherein the procedure of calculating the conversion cost includes a procedure of calculating the conversion cost such that the conversion cost increases as the calculated difference increases and as the value indicating the importance of the identified real voice data indicates lower importance.

The method according to claim 16, wherein the value indicating the importance of the real voice data is determined based on at least one of: a value given in advance to the real voice data; the length of the voice corresponding to the real voice data; and the part of speech of a word included in the text corresponding to the real voice data.

The method according to claim 17, wherein the value indicating the importance of the real voice data is determined so as to indicate higher importance as the voice corresponding to the real voice data becomes longer.

The method according to claim 17, wherein the value indicating the importance of the real voice data when the text corresponding to the real voice data includes a proper noun is determined so as to indicate higher importance than the value indicating the importance of the real voice data when that text does not include a proper noun.

The method according to claim 11, wherein the fourth procedure includes a procedure of converting the acoustic feature quantity of the identified real voice data such that the difference between the acoustic feature quantity of the identified real voice data and the reference acoustic feature quantity becomes smaller.