JP4758931B2

JP4758931B2 - Speech synthesis apparatus, method, program, and recording medium thereof

Info

Publication number: JP4758931B2
Application number: JP2007079426A
Authority: JP
Inventors: 孝中村; 昇宮崎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-03-26
Filing date: 2007-03-26
Publication date: 2011-08-31
Anticipated expiration: 2027-03-26
Also published as: JP2008241898A

Abstract

PROBLEM TO BE SOLVED: To provide a technique that reduces the sense of incongruity a user feels when the speech switches from an operator's voice to synthesized speech in a voice guidance system. SOLUTION: A speaker identification unit 42 uses a feature quantity extracted from the speech of an operator 16 to select, from among a plurality of identification models prepared beforehand, the model most similar to the operator's speech. A synthesized speech generation unit 43 reads the speech data corresponding to the selected identification model and generates synthesized speech that is close in quality to the operator's voice and that corresponds to the information, determined by the operator, about the speech to be provided to the user. COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to an apparatus, a method, a program, and a recording medium for synthesizing speech.

Conventionally, there has been a voice guidance system such as the following (see, for example, Non-Patent Document 1).

A user talks with a call center operator via a communication network. Through this dialogue, the operator selects, from answer speech candidates registered in advance, the answer speech suited to the user. A speech synthesizer then plays back the selected answer speech in place of the operator. Once the answer speech has been selected, the operator is released from that user, talks with another user, and selects an appropriate answer speech in the same way.

In this way, the operator only talks with the user and selects the answer speech, and the speech synthesizer plays the answer speech back, which reduces the time the operator spends per user. In other words, because the number of cases each operator can handle increases, the number of operators needed per case decreases. The prior-art voice guidance system could thereby reduce call center costs.

However, in the prior-art voice guidance system, the speech played back by the speech synthesizer was created without regard to the voice of the operator who had talked with the user. The user therefore heard an announcement in synthesized speech whose speaker characteristics differed from those of the operator with whom the user had been talking until just before.
NTT IT Corporation, [online], November 30, 2004, [retrieved March 16, 2007], Internet <URL: http://www.ntt-it.co.jp/press/2004/041130/041130vcj.html>

In the prior-art voice guidance system, the speaker characteristics of the voice of the operator who first talked with the user differ from those of the answer speech played back by the speech synthesizer, which makes the exchange feel unnatural or jarring to the user.

According to a first aspect of the present invention, there is provided a speech synthesizer for a voice guidance system in which synthesized speech is generated by the speech synthesizer and information is provided to a user by the synthesized speech in place of a call center operator. The speech synthesizer comprises speaker identification means and synthesized speech generation means. The speaker identification means comprises: identification model storage means for storing a plurality of identification models; first feature quantity extraction means for extracting a first feature quantity from the input operator's speech; likelihood calculation means for inputting the first feature quantity to each identification model read from the identification model storage means and calculating a likelihood for each identification model; and selection means for selecting, from among the plurality of identification models prepared in advance, the identification model that maximizes the calculated likelihood. The synthesized speech generation means comprises: speech data storage means for storing speech data corresponding to each of the plurality of identification models; speech data selection means for reading, from the speech data storage means, the speech data corresponding to the selected identification model; and synthesis means for generating, using the read speech data, synthesized speech that corresponds to the input information, determined by the operator, about the speech to be provided to the user and that has a sound quality close to that of the operator's speech. The speech synthesizer further comprises: second feature quantity extraction means for extracting, from the operator's speech, a speaker characteristic parameter representing the individuality of the speaker; deformation means for deforming the synthesized speech, using the speaker characteristic parameter, so that its sound quality approaches that of the operator's speech, thereby generating deformed speech; and control means for performing control so that the deformed speech is output if the maximum value of the calculated likelihoods is greater than, or greater than or equal to, a predetermined threshold α, or if that maximum value is smaller than, or smaller than or equal to, a predetermined threshold β that is larger than the threshold α, and so that the synthesized speech is output otherwise.
According to a second aspect of the present invention, there is provided a speech synthesizer for a voice guidance system in which synthesized speech is generated by the speech synthesizer and information is provided to a user by the synthesized speech in place of a call center operator. The speech synthesizer comprises speaker identification means and synthesized speech generation means. The speaker identification means comprises: identification model storage means for storing a plurality of identification models; first feature quantity extraction means for extracting a first feature quantity from the input operator's speech; likelihood calculation means for inputting the first feature quantity to each identification model read from the identification model storage means and calculating a likelihood for each identification model; and selection means for selecting, from among the plurality of identification models prepared in advance, the identification model that maximizes the calculated likelihood. The synthesized speech generation means comprises: speech data storage means for storing speech data corresponding to each of the plurality of identification models; speech data selection means for reading, from the speech data storage means, the speech data corresponding to the selected identification model; and synthesis means for generating, using the read speech data, synthesized speech that corresponds to the input information, determined by the operator, about the speech to be provided to the user and that has a sound quality close to that of the operator's speech. The speech synthesizer further comprises: second feature quantity extraction means for extracting, from the operator's speech, a speaker characteristic parameter representing the individuality of the speaker; deformation means for deforming the synthesized speech, using the speaker characteristic parameter, so that its sound quality approaches that of the operator's speech, thereby generating deformed speech; and control means for performing control so that the deformed speech is output if the maximum value of the calculated likelihoods is greater than, or greater than or equal to, a predetermined threshold α and is smaller than, or smaller than or equal to, a predetermined threshold β that is larger than the threshold α, and so that the synthesized speech is output otherwise.
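The control rule of the second aspect can be sketched as a small decision function. This is a minimal illustration, not the patent's implementation: the function name `choose_output`, the string return values, and the `inclusive` flag are assumptions. The flag exists because the claim leaves open whether the comparisons are strict ("greater than") or non-strict ("greater than or equal to").

```python
def choose_output(max_likelihood, alpha, beta, inclusive=True):
    """Pick which speech to output under the second-aspect rule.

    Returns "deformed" when the maximum likelihood lies between the
    thresholds alpha and beta, and "synthesized" otherwise.
    All names here are hypothetical; only the rule comes from the text.
    """
    if not beta > alpha:
        # The text requires beta to be the larger threshold.
        raise ValueError("beta must be larger than alpha")
    if inclusive:
        in_band = alpha <= max_likelihood <= beta
    else:
        in_band = alpha < max_likelihood < beta
    return "deformed" if in_band else "synthesized"
```

For example, with `alpha=-60.0` and `beta=-40.0`, a maximum log-likelihood of `-50.0` selects the deformed speech, while `-30.0` selects the unmodified synthesized speech.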

Bringing the speaker characteristics of the speech played back by the speech synthesizer close to those of the voice of the operator who first talked with the user keeps the user from perceiving unnaturalness or incongruity. This raises user satisfaction.

[First embodiment]
The functional configuration of the voice guidance system 1 according to the first embodiment will be described. As shown in FIG. 1, in the voice guidance system 1, an input/output unit 17 with which the user 10 conducts a dialogue and an input/output unit 15 with which the operator 16 conducts a dialogue are connected via a communication network 13. The communication network 13 is any communication network, such as a telephone network, a mobile phone network, an IP communication network, or the Internet. As long as a dialogue can be conducted over it, it may be wired or wireless, and a public network or a dedicated line. The input/output units 15 and 17 are telephone transmitter-receivers, mobile phones, handsets, or the like.

Between the communication network 13 and the input/output unit 15, a switching unit 12 is provided that, on instruction from the operator 16, switches the connection destination of the input/output unit 17 to either the input/output unit 15 or the speech synthesizer 11. In the initial state, the switching unit 12 is connected to the input/output unit 15.

In addition, for example, a first acquisition unit 14a that acquires the information, determined by the operator 16, about the speech to be provided to the user and a second acquisition unit 14b that acquires the voice of the operator 16 are provided. The first acquisition unit 14a passes the acquired information, and the second acquisition unit 14b passes the acquired operator's voice, to the speech synthesizer 11. The first acquisition unit 14a is, for example, an input means such as a keyboard, a pointing device, or a microphone. The second acquisition unit 14b is, for example, an input means such as a microphone. Using the passed information and the voice of the operator 16, the speech synthesizer 11 generates synthesized speech whose speaker characteristics are close to those of the voice of the operator 16 and provides it to the user 10.

The processing flow of the voice guidance system 1 will be described with reference to FIG. 7. FIG. 7 is a flowchart illustrating the processing flow of the voice guidance system 1.

<Step S1>
The user 10 talks with the operator 16 via the communication network 13 using the input/output unit 17. During the dialogue, the user 10 conveys the characteristics of the information he or she wants to obtain. The operator 16 identifies the information that the user 10 wants to obtain.

<Step S2>
Through the dialogue, the operator 16 determines the speech to be provided to the user. The operator 16 inputs information about the determined speech into the first acquisition unit 14a. The information about the speech to be provided to the user is thereby acquired by the first acquisition unit 14a and output to the speech synthesizer 11. The determined information is, for example, text information or identification information.

For example, when the user is asking the operator for a telephone number, the determined information is text information consisting of ten digits, such as "03-××××-××××", where each × is a digit from 0 to 9.

Alternatively, the speech to be provided to the user 10 may be prepared as text information, with different identification information assigned in advance to each piece of text information. The operator 16 may then determine the identification information and input it into the first acquisition unit 14a. In this case, the identification information is passed to the speech synthesizer 11, which stores the text information corresponding to each piece of identification information in a storage unit (not shown). The speech synthesizer 11 reads the text information corresponding to the passed identification information from this storage unit and performs the speech synthesis processing described later.
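The two kinds of input described for step S2 (text information passed directly, or identification information resolved through a stored table) can be sketched as follows. The table contents, the ID strings, and the function name `resolve_text` are made up for illustration; the patent does not specify a data format.

```python
# Hypothetical stand-in for the storage unit (not shown) that maps
# identification information to prepared text information.
PREPARED_TEXTS = {
    "A001": "Thank you for calling.",
    "A002": "Our office hours are nine to five.",
}


def resolve_text(kind, value, table=PREPARED_TEXTS):
    """Return the text to synthesize for the operator's input.

    kind  -- "text" when the operator entered text information directly,
             "id" when the operator entered identification information.
    value -- the text itself, or the identification information.
    """
    if kind == "text":
        return value
    if kind == "id":
        # Raises KeyError for identification information that was never registered.
        return table[value]
    raise ValueError("kind must be 'text' or 'id'")
```

Either path ends with a text string, so the later synthesis steps do not need to know which kind of input the operator used.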

<Step S3>
Switching command information for switching the switching unit 12 is input by the operator 16 from an input means (not shown) and passed to the switching unit 12. On receiving the switching command information, the switching unit 12 switches so that the speech output from the speech synthesizer 11 is output to the user 10. That is, the speech heard by the user 10 is switched from the operator's voice to the speech output by the speech synthesizer 11.

<Step S4>
The voice of the operator 16 is input to the second acquisition unit 14b and passed to the speech synthesizer 11. Here, the voice of the operator 16 may include not only the operator 16's own voice but also sounds arising from the operator 16's surroundings that are picked up by the second acquisition unit 14b or the input/output unit 15.

<Step S5>
Using the input information and the operator's voice, the speech synthesizer 11 generates synthesized speech whose sound quality is brought close to that of the operator's voice and provides it to the user 10.

The functional configuration of the speech synthesizer 11 and the details of step S5 are described below with reference to FIGS. 2 and 8. FIG. 2 illustrates the functional configuration of the speech synthesizer 11, and FIG. 8 is a flowchart illustrating the flow of the processes that make up step S5. Note that, because FIG. 2 focuses on the speech synthesizer 11, the input/output unit 15, the switching unit 12, the communication network 13, and so on are omitted from it.

As illustrated in FIG. 2, the speech synthesizer 11 includes a speaker identification unit 42 and a synthesized speech generation unit 43. The speaker identification unit 42 uses the input operator's voice to select a well-fitting identification model from among identification models prepared in advance. The synthesized speech generation unit 43 receives the information selected by the operator and information about the identification model selected by the speaker identification unit 42 (for example, the ID of the identification model) and generates synthesized speech close to the speaker characteristics of the operator's voice.

The speaker identification unit 42 includes, for example: a first feature quantity extraction unit 421 that extracts a first feature quantity from the operator's voice; an identification model storage unit 422 that stores identification models 4221 to 422n, which are prepared in advance and given different IDs; a likelihood calculation unit 420 that inputs the extracted feature quantity to each identification model read from the identification model storage unit 422, calculates a likelihood for each identification model, and outputs it to a selection unit 423; and the selection unit 423, which selects the ID of the identification model that maximizes the calculated likelihood and outputs it to the synthesized speech generation unit 43.

The synthesized speech generation unit 43 includes, for example: a speech data storage unit 433 that stores speech data 4331 to 433n corresponding to the identification models 4221 to 422n; a speech data selection unit 432 that reads, from the speech data storage unit 433, the speech data corresponding to the ID of the identification model selected by the selection unit 423 and outputs it to a synthesis unit 431; and the synthesis unit 431, which uses the information acquired by the first acquisition unit 14a and the speech data selected by the speech data selection unit 432 to generate synthesized speech that corresponds to the information, selected by the operator 16, about the speech to be provided to the user 10 and that has a sound quality close to that of the operator's voice.

<<Step S51 (FIG. 8)>>
The first feature quantity extraction unit 421 extracts a feature quantity from the input voice of the operator 16 and sends it to the likelihood calculation unit 420. Any acoustic or speech feature quantity, such as a spectral envelope or MFCC (Mel-Frequency Cepstrum Coefficient), can be used as the feature quantity, and the method of calculating it is also arbitrary. This embodiment is described using, as an example, the case where the feature quantity is the MFCC, which is commonly used in speech recognition, speaker identification, and the like. When the MFCC is used as the feature quantity, it is calculated for each frame t obtained by dividing the speech into segments of a predetermined time length.

<<Step S52>>
The likelihood calculation unit 420 calculates a likelihood for each identification model, using the feature quantity calculated by the first feature quantity extraction unit 421 and the identification models read from the identification model storage unit 422, and sends the likelihoods to the selection unit 423. In this embodiment, the ID assigned to each identification model is sent together with that model's likelihood.

The processing of the likelihood calculation unit 420 is described using, as an example, the case where the identification model is expressed by a Gaussian mixture model (hereinafter, GMM).

Let X_t be the MFCC of frame t, M the number of mixture components, λ the model parameter, b_i(·) a d-dimensional Gaussian function, c_i the mixture weights (with Σ_i c_i = 1), U_i the mean vectors, σ_i the covariance matrices, and ·^T the transpose of ·. The first likelihood P(X_t|λ) in frame t is then expressed as follows.
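The equation itself does not survive in this text. Under the definitions just given, the standard Gaussian mixture form would read as below. This is a reconstruction from those definitions, using the same symbols, and is not a copy of the patent's own equation.

```latex
P(X_t \mid \lambda) = \sum_{i=1}^{M} c_i \, b_i(X_t),
\qquad
b_i(X_t) = \frac{1}{(2\pi)^{d/2} \, \lvert \sigma_i \rvert^{1/2}}
\exp\!\left( -\frac{1}{2} \, (X_t - U_i)^{T} \, \sigma_i^{-1} \, (X_t - U_i) \right)
```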

The model parameter λ consists of M model parameters λ_i (i = 1, ..., M), and each model parameter λ_i consists of a mixture weight c_i, a mean vector U_i, and a covariance matrix σ_i.

The mixture weights c_i, mean vectors U_i, and covariance matrices σ_i that make up the model parameters are calculated in advance from the speech data corresponding to each identification model. That is, the model parameter λ is calculated for each set of speech data. The model parameters λ calculated for the identification models are given different IDs and stored in advance in the identification model storage unit 422. Reading an identification model means reading the corresponding model parameter λ.

In this embodiment, the likelihood calculation unit 420 calculates the first likelihood P(X_t|λ) for each of a plurality of frames (for example, each of the first frame 1 through the T-th frame T) and takes its logarithm. It then sums these logarithms to obtain the second likelihood P(X|λ) and sends it to the selection unit 423.
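The per-frame likelihood and its summed logarithm can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patent's implementation: each covariance matrix σ_i is assumed to be diagonal so that it can be stored as a list of variances, a model is represented as a list of `(c_i, U_i, variances_i)` tuples, and all function names are hypothetical.

```python
import math


def gaussian_diag(x, mean, var):
    """d-dimensional Gaussian density b_i(x), assuming a diagonal covariance."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return math.exp(-0.5 * (d * math.log(2.0 * math.pi) + log_det + quad))


def frame_likelihood(x, model):
    """First likelihood P(X_t | lambda) = sum_i c_i * b_i(X_t)."""
    return sum(c * gaussian_diag(x, mean, var) for c, mean, var in model)


def utterance_log_likelihood(frames, model):
    """Second likelihood: sum over frames t = 1..T of log P(X_t | lambda).

    math.log raises ValueError if a frame likelihood underflows to 0.0;
    a production system would work in the log domain throughout.
    """
    return sum(math.log(frame_likelihood(x, model)) for x in frames)
```

With a single one-dimensional component of mean 0 and variance 1, two frames at `x = 0` give a total of `-log(2π)`, since each frame contributes `-0.5 * log(2π)`.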

<<Step S53>>
The selection unit 423 selects the identification model that maximizes the likelihood calculated by the likelihood calculation unit 420 for each identification model (in this embodiment, the second likelihood P(X|λ)). It then sends information about that likelihood-maximizing identification model to the synthesized speech generation unit 43. Here, the information about the identification model is the ID assigned to the speech data corresponding to the identification model.

<<Step S54>>
The speech data selection unit 432 of the synthesized speech generation unit 43 reads the speech data corresponding to the identification model selected by the selection unit 423 from the speech data storage unit 433 and sends it to the synthesis unit 431. Specifically, the synthesized speech generation unit 43 reads, from the speech data storage unit 433, the speech data having the same ID as the identification model ID sent from the selection unit 423.
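Steps S53 and S54 amount to an argmax over per-model likelihoods followed by a lookup by the shared ID. A minimal sketch follows, assuming the likelihoods and the speech data are held in dictionaries keyed by ID. The dictionary layout, the ID strings, and the function names are illustrative assumptions.

```python
def select_model_id(likelihood_by_id):
    """Return the ID of the identification model with the largest likelihood."""
    if not likelihood_by_id:
        raise ValueError("no identification models were scored")
    return max(likelihood_by_id, key=likelihood_by_id.get)


def select_speech_data(likelihood_by_id, speech_data_by_id):
    """Read the speech data that carries the same ID as the selected model."""
    return speech_data_by_id[select_model_id(likelihood_by_id)]
```

For example, with scores `{"m1": -120.5, "m2": -98.2, "m3": -110.0}`, the model `"m2"` is selected, and the speech data stored under `"m2"` is passed on to synthesis.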

<<Step S55>>
Using the speech data sent from the speech data selection unit 432, the synthesis unit 431 generates synthesized speech that corresponds to the information, determined by the operator, about the speech to be provided to the user and that has a sound quality close to that of the operator's voice. The generated synthesized speech is provided to the user 10.

Any well-known technique may be used to create the synthesized speech. The following description takes as an example the case where waveform-concatenation speech synthesis is used as the synthesis method and text information is used as the information, determined by the operator 16, about the speech to be provided to the user 10.

As illustrated in FIG. 3, the synthesis unit 431 includes a text analysis unit 4311, a prosody generation unit 4312, a synthesis processing unit 4313, a reading/accent dictionary storage unit 4314, and a prosody generation rule storage unit 4315.

A processing example of the synthesis unit 431 is described with reference to FIG. 9. FIG. 9 illustrates the processing flow of the synthesis unit 431.

<<Step S551>>
The text analysis unit 4311 performs morphological analysis on the input text information and, from the resulting morpheme information, determines the reading and accent of the input text by referring to the reading/accent dictionary read from the reading/accent dictionary storage unit 4314. Information on the analyzed morphemes, the determined reading, and the determined accent is sent to the prosody generation unit 4312 and the synthesis processing unit 4313.

For example, when the text information "お電話しますか？" ("Do you want to call?") is input, the text analysis unit 4311 performs morphological analysis to obtain "Odenwa [02] Shimasuka? [02]" and thereby determines the reading and accent. Here, the underlined portions (the reading strings "Odenwa" and "Shimasuka?") are the readings, and each [02] represents the accent.

<<Step S552>>
From the analyzed morphemes, the determined reading, and the determined accent, the prosody generation unit 4312 generates the prosody of the synthesized speech by referring to the prosody generation rules read from the prosody generation rule storage unit 4315. The generated prosody is sent to the synthesis processing unit 4313.

For example, the prosody can be represented by a curve on a graph whose horizontal axis is time and whose vertical axis is fundamental frequency, as illustrated to the right of the prosody generation unit 4312 in FIG. 3. The curve in that graph is the prosody corresponding to the information "Odenwa [02] Shimasuka? [02]".

<<Step S553>>
The synthesis processing unit 4313 selects, from the speech data 4316, speech segments that match the analyzed morphemes, the determined reading, the determined accent, the determined prosody, and so on, and concatenates them to generate synthesized speech. The speech data 4316 is the speech data sent from the speech data selection unit 432, that is, the one set of speech data that the speech data selection unit 432 selected from the speech data storage unit 433 as corresponding to the identification model selected by the selection unit 423. As illustrated in FIG. 3, the speech data 4316 stores a plurality of speech segments, each associated with its own phoneme, the phonemes preceding and following it, and its prosody. Using the phoneme of a segment, the phonemes before and after it, and the prosody as an index, the synthesis processing unit 4313 refers to the speech data 4316, selects segments from it, and concatenates them to generate synthesized speech (see, for example, Reference 1).
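The index-based selection and concatenation in step S553 can be illustrated with a deliberately simplified sketch. It assumes each stored segment carries its phoneme, its neighboring phonemes, a single fundamental-frequency value standing in for the prosody, and a list of waveform samples. The `Unit` record, the fallback to phoneme-only matches, and the nearest-F0 rule are assumptions made for illustration. Practical corpus-based systems use more elaborate target and concatenation costs, and nothing here is taken from the patent or from Reference 1.

```python
from collections import namedtuple

# Hypothetical segment record: phoneme, previous/next phoneme, F0 in Hz, samples.
Unit = namedtuple("Unit", "phoneme prev next f0 samples")


def select_units(targets, inventory):
    """Pick one stored unit per target (phoneme, prev, next, target_f0)."""
    chosen = []
    for phoneme, prev, nxt, target_f0 in targets:
        # Prefer units whose phoneme and both neighbors match the target.
        candidates = [u for u in inventory
                      if (u.phoneme, u.prev, u.next) == (phoneme, prev, nxt)]
        if not candidates:
            # Fall back to any unit of the right phoneme.
            candidates = [u for u in inventory if u.phoneme == phoneme]
        if not candidates:
            raise LookupError("no unit for phoneme %r" % (phoneme,))
        # Among the candidates, take the one closest to the target prosody.
        chosen.append(min(candidates, key=lambda u: abs(u.f0 - target_f0)))
    return chosen


def concatenate(units):
    """Join the selected units' samples into one waveform."""
    waveform = []
    for unit in units:
        waveform.extend(unit.samples)
    return waveform
```

Because the inventory passed in is the one set of speech data chosen for the likelihood-maximizing identification model, every selected unit comes from the same stored voice.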

[Reference 1] Toshio Hirai and 7 others, "Corpus-based Speech Synthesis System XIMERA", IEICE Technical Report, The Institute of Electronics, Information and Communication Engineers, SP2005-18 (2005-5), pp. 37-42
Generating synthesized speech from the speech data corresponding to the likelihood-maximizing identification model in this way yields synthesized speech that is close to the sound quality or voice quality of the operator 16's voice, in other words, synthesized speech that reflects the operator's speaker characteristics. This reduces the incongruity or gap perceived when the speech heard by the user 10 switches from the voice of the operator 16 to synthesized speech.

[Second Embodiment]
The voice guidance system according to the second embodiment includes the speech synthesizer 21 illustrated in FIG. 4 in place of the speech synthesizer 11. It differs from the voice guidance system 1 according to the first embodiment in that the speech synthesizer 21 deforms the generated synthesized speech to produce modified speech that is even closer to the sound quality of the voice of the operator 16. That is, the speech synthesizer 21 additionally performs the process of step S56a, indicated by the broken line in FIG. 8. The other functional configurations and processes are the same as in the first embodiment, and their description is omitted. FIG. 4 illustrates the functional configuration of the speech synthesizer 21, and FIG. 10 is a flowchart illustrating the flow of the process of step S56a.

As illustrated in FIG. 4, the speech synthesizer 21 includes a speaker identification unit 42, a synthesized speech generation unit 43, a second feature amount extraction unit 64, and a deformation unit 65. The speaker identification unit 42 and the synthesized speech generation unit 43 are the same as those in the first embodiment.

<< Step S56a1 >>
The second feature amount extraction unit 64 extracts a second feature amount (speaker characteristic parameter) from the operator's voice and sends it to the deformation unit 65. The speaker characteristic parameter is a parameter that represents the individuality of the person producing the speech (the speaker). Any acoustic feature amount, such as the fundamental frequency, speech rate, or power, can be used as the speaker characteristic parameter, and any well-known technique may be used to extract it. For example, the second feature amount extraction unit 64 divides the operator's voice into frames of a fixed time length, extracts the fundamental frequency of each frame, and sends the result to the deformation unit 65.
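As an illustration of frame-wise fundamental frequency extraction, the sketch below splits a signal into fixed-length frames and estimates one value per frame. The text leaves the extraction technique open, so picking the peak of a plain autocorrelation is an assumption made here for brevity, as are the function names `split_frames` and `estimate_f0` and the default search range of 60 to 400 Hz.

```python
import math
from typing import List, Optional, Sequence


def split_frames(samples: Sequence[float], frame_len: int) -> List[Sequence[float]]:
    """Split a signal into consecutive non-overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def estimate_f0(frame: Sequence[float], sample_rate: int,
                f_min: float = 60.0, f_max: float = 400.0) -> Optional[float]:
    """Estimate the fundamental frequency of one frame in Hz.

    Assumed technique: pick the lag with the largest positive autocorrelation
    inside the search range. Returns None for silent or too-short frames.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    energy = sum(x * x for x in frame)
    if energy == 0.0 or lag_max <= lag_min:
        return None
    best_lag, best_corr = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return None if best_lag is None else sample_rate / best_lag


# Synthetic check signal: 80 ms of a 200 Hz sine sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(640)]
f0_per_frame = [estimate_f0(f, sample_rate) for f in split_frames(signal, 320)]
```

Real speech would additionally need a voiced/unvoiced decision and protection against octave errors, which this sketch omits.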

<< Step S56a2 >>
Using the speaker characteristic parameter, the deformation unit 65 deforms the synthesized speech generated by the synthesized speech generation unit 43 so that its sound quality approaches that of the voice of the operator 16, thereby generating modified speech. The generated modified speech is provided to the user 10. Any well-known technique may be used to deform the speech.

The following describes, as an example, the case where the fundamental frequency is used as the speaker characteristic parameter. In this case, as illustrated in FIG. 5, the deformation unit 65 includes a fundamental frequency extraction unit 651, an average value calculation unit 652, a translation unit 653, and a re-synthesizing unit 654.

The fundamental frequency extraction unit 651 extracts the fundamental frequency of the synthesized speech generated by the synthesized speech generation unit 43 and sends it to the average value calculation unit 652. Specifically, it divides the synthesized speech into frames of a fixed time length and extracts the fundamental frequency of each frame. Any method, such as the cepstrum method, the lag method, or the TEMPO method, can be used to obtain the fundamental frequency.

The average value calculation unit 652 calculates the average of the fundamental frequencies extracted by the fundamental frequency extraction unit 651 and the average of the fundamental frequencies extracted by the second feature amount extraction unit 64, and sends both to the translation unit 653. Here, the average fundamental frequency is, for example, the average over speech in which the operator reads aloud one sentence of a predetermined length, the average over a fixed length of time, or the average over a fixed number of frames. The longer the sentence or time span, or the larger the number of frames, the more accurately the average fundamental frequency can be obtained. That is, the speaker's inherent pitch becomes easier to capture, and the effect of this embodiment increases accordingly.

The translation unit 653 shifts the fundamental frequency contour of the synthesized speech (a curve like the one illustrated to the right of the prosody generation unit 4312 in FIG. 3) by parallel translation so that the average fundamental frequency obtained from the fundamental frequency extraction unit 651 matches the average fundamental frequency obtained from the second feature amount extraction unit 64.
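The averaging and parallel translation just described amount to adding one constant offset to every frame of the synthesized contour. The sketch below shows this under stated assumptions: the shift is applied on a linear Hz scale (the text does not say whether a linear or logarithmic scale is intended), frames without a fundamental frequency are represented by `None`, and the function name `shift_f0_contour` is hypothetical.

```python
from statistics import fmean
from typing import List, Optional, Sequence


def shift_f0_contour(
    synth_f0: Sequence[Optional[float]],
    operator_f0: Sequence[Optional[float]],
) -> List[Optional[float]]:
    """Translate synth_f0 so that its mean equals the mean of operator_f0.

    None marks frames without a fundamental frequency (e.g. unvoiced frames);
    they are ignored in the averages and passed through unchanged.
    """
    synth_voiced = [f for f in synth_f0 if f is not None]
    operator_voiced = [f for f in operator_f0 if f is not None]
    if not synth_voiced or not operator_voiced:
        return list(synth_f0)  # nothing to align
    # One constant offset moves the whole contour without changing its shape.
    offset = fmean(operator_voiced) - fmean(synth_voiced)
    return [None if f is None else f + offset for f in synth_f0]


# Synthesized contour averages 120 Hz, operator averages 210 Hz: offset is +90 Hz.
shifted = shift_f0_contour([100.0, 120.0, None, 140.0], [200.0, 220.0])
# shifted == [190.0, 210.0, None, 230.0]
```

Because only a constant is added, the intonation pattern produced by the prosody generation is preserved while the overall pitch level moves toward the operator's.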

The re-synthesizing unit 654 re-synthesizes the speech in accordance with the shifted fundamental frequency contour to generate the modified speech. A signal processing method such as PSOLA can be used for the re-synthesis (see, for example, Reference 2).

[Reference 2] Ryo Mochizuki and Tetsunori Kobayashi, "A low-band spectrum envelope modeling for high quality pitch modification", Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Vol. 1, pp. 645-648 (2004-5).
In this way, by using the second feature amount extracted from the voice of the operator 16 to deform the synthesized speech so that it approaches the sound quality of the operator's voice, modified speech even closer to the sound quality of the voice of the operator 16 can be generated. This further reduces the sense of incongruity or gap when the voice is switched.

Furthermore, according to this embodiment, modified speech that sufficiently reflects the operator's speaker characteristics can be generated even without preparing a pair of identification model and speech data that is sufficiently close to the operator's voice. This has the advantage that the sense of incongruity or gap when switching voices can be kept small even if fewer pairs of identification models and speech data are prepared in advance.

[Third Embodiment]
The voice guidance system according to the third embodiment includes the speech synthesizer 31 illustrated in FIG. 6 in place of the speech synthesizer 21. It differs from the voice guidance system according to the second embodiment at least in that the speech synthesizer 31 performs control so that the deformation is not carried out when the amount of deformation needed to make the synthesized speech resemble the operator's voice would be large. That is, the speech synthesizer 31 performs the process of step S56b, indicated by the alternate long and short dash line in FIG. 8. The other functional configurations and processes are basically the same as in the second embodiment, and their description is omitted. FIG. 6 illustrates the functional configuration of the speech synthesizer 31, and FIG. 11 is a flowchart illustrating the flow of the process of step S56b.

As illustrated in FIG. 6, the speech synthesizer 31 includes, for example, a speaker identification unit 42, a synthesized speech generation unit 43, a second feature amount extraction unit 64, a deformation unit 65, and a control unit 80. The control unit 80 includes, for example, a determination unit 83 that determines whether synthesized speech or modified speech is to be generated and output, and switching units 81a to 81c that switch their respective connections according to the determination result of the determination unit 83.

<< Step S56b1 (FIG. 8) >>
The determination unit 83 of the control unit 80 determines whether the maximum likelihood is at least a predetermined threshold α and at most a predetermined threshold β, that is, whether α ≦ maximum likelihood ≦ β. The maximum likelihood is the likelihood of the identification model selected by the selection unit 423 (FIG. 2) of the speaker identification unit 42, namely the model that maximizes the likelihood. In the third embodiment, this maximum likelihood is passed from the speaker identification unit 42 to the determination unit 83.

In this example, the determination unit 83 determines whether α ≦ maximum likelihood ≦ β. Alternatively, it may determine whether α < maximum likelihood < β, α < maximum likelihood ≦ β, or α ≦ maximum likelihood < β.

The appropriate values of the thresholds α and β depend on the definition of the likelihood, the method of generating synthesized speech, the method of generating modified speech, the required speech quality, and so on; they are therefore set as appropriate so that suitable results are obtained.

The reason why the control unit 80 uses a condition such as α ≦ maximum likelihood ≦ β as its criterion is described later.

<< Step S56b2 >>
If α ≦ maximum likelihood ≦ β, the control unit 80 performs control so that the synthesized speech is deformed and the resulting modified speech is output. Specifically, the determination unit 83 sends each of the switching units 81a to 81c a signal indicating that modified speech is to be generated and output, and each of the switching units 81a to 81c sets its switch on receiving that signal. In this example, each of the switching units 81a to 81c connects its switch to the terminal labeled Yes in FIG. 6.

The speech synthesizer 31 then performs steps S56b3 (generation of synthesized speech), S56b4 (extraction of the second acoustic feature amount), and S56b5 (generation of modified speech) in FIG. 11. These steps are the same as step S55 (FIG. 8), step S56a1 (FIG. 10), and step S56a2, respectively, described in the first and second embodiments, and their description is omitted.

<< Step S56b6 >>
If α ≦ maximum likelihood ≦ β does not hold, the control unit 80 performs control so that the synthesized speech is output without generating modified speech. Specifically, the determination unit 83 sends each of the switching units 81a to 81c a signal indicating that modified speech is not to be generated or output, and each of the switching units 81a to 81c sets its switch on receiving that signal. In this example, each of the switching units 81a to 81c connects its switch to the terminal labeled No in FIG. 6.

The speech synthesizer 31 then performs step S56b7 (generation of synthesized speech) in FIG. 11. Step S56b7 is the same as step S55 (FIG. 8) described in the first embodiment, and its description is omitted.

When the maximum likelihood exceeds the predetermined threshold β, the sound quality of the synthesized speech is already sufficiently close to that of the operator's voice. Generating modified speech from this synthesized speech may therefore fail to bring the sound quality any closer to the operator's voice, and may even move it further away.

On the other hand, when the maximum likelihood is below the predetermined threshold α, a large amount of deformation is needed to approach the sound quality of the operator's voice. Modified speech generated from this synthesized speech may come closer to the operator's sound quality, but it may also be of low quality, with degraded naturalness and intelligibility.

For this reason, in this embodiment the control unit 80 uses a condition such as α ≦ maximum likelihood ≦ β as the criterion for deciding whether to generate and output synthesized speech or modified speech.
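The control described for steps S56b1 to S56b7 can be sketched as a single routing function. This is an illustration under stated assumptions rather than the structure of the control unit 80: the switching units 81a to 81c are replaced by an ordinary branch, the callables `synthesize` and `deform` are hypothetical stand-ins for the synthesized speech generation unit 43 and the deformation unit 65, and the check that α does not exceed β is an added safeguard.

```python
from typing import Callable, TypeVar

Speech = TypeVar("Speech")


def generate_output(
    max_likelihood: float,
    alpha: float,
    beta: float,
    synthesize: Callable[[], Speech],
    deform: Callable[[Speech], Speech],
) -> Speech:
    """Return modified speech if alpha <= max_likelihood <= beta, else synthesized speech."""
    if alpha > beta:
        raise ValueError("alpha must not exceed beta")
    # Generation of synthesized speech (needed on both branches).
    synthesized = synthesize()
    # Determination corresponding to step S56b1.
    if alpha <= max_likelihood <= beta:
        # "Yes" branch: deform the synthesized speech and output the result.
        return deform(synthesized)
    # "No" branch: output the synthesized speech unchanged.
    return synthesized


# Toy stand-ins: strings play the role of speech signals.
make_speech = lambda: "synth"
make_modified = lambda s: s + "+modified"
in_range = generate_output(0.5, 0.2, 0.8, make_speech, make_modified)
too_high = generate_output(0.9, 0.2, 0.8, make_speech, make_modified)
too_low = generate_output(0.1, 0.2, 0.8, make_speech, make_modified)
```

Here `in_range` takes the deformation path, while `too_high` and `too_low` fall back to the plain synthesized speech.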

In this way, when the amount of deformation needed to generate modified speech from synthesized speech would be large and naturalness or intelligibility might degrade, control is performed so that synthesized speech is generated and output instead of modified speech. This makes it possible both to reduce the sense of incongruity or gap caused by switching voices and to preserve the naturalness and intelligibility of the speech provided to the user 10.

[Modifications, etc.]
The order of the processes in steps S2 to S4 illustrated in FIG. 7 is arbitrary. That is, any of steps S2 to S4 may be performed first, and these steps may also be performed simultaneously.

Instead of the two acquisition means, the first acquisition unit 14a (FIG. 1) and the second acquisition unit 14b, a single acquisition means may be provided: a third acquisition unit 14c consisting of an electroacoustic transducer such as a microphone. In this case, the operator 16 inputs the information to be provided to the user into the third acquisition unit 14c in his or her own voice. The information input in the voice of the operator 16 is then output to the speech synthesizer 11.

Instead of performing the process of step S4 (FIG. 7), the voice of the operator 16 picked up by the input/output unit 15 during the dialogue between the user 10 and the operator 16 in step S1 may be input to the speech synthesizer 11, as indicated by the broken line in FIG. 1.

As indicated by the broken line in FIG. 11, step S560, which generates synthesized speech, may be performed before step S56b1. In this case, steps S56b3 and S56b7 are not performed. The other processes are the same as described above.

In step S56b1 (FIG. 11), the determination unit 83 (FIG. 6) may instead evaluate just one of the conditions α ≦ maximum likelihood, α < maximum likelihood, maximum likelihood ≦ β, or maximum likelihood < β to decide whether to generate synthesized speech or modified speech.
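The inclusive, strict, and one-sided comparisons mentioned for step S56b1 can be covered by one configurable predicate. The sketch below is only one possible arrangement: the function name `should_deform`, the use of `None` to drop a bound, and the two inclusiveness flags are assumptions introduced for illustration.

```python
from typing import Optional


def should_deform(
    max_likelihood: float,
    alpha: Optional[float] = None,
    beta: Optional[float] = None,
    alpha_inclusive: bool = True,
    beta_inclusive: bool = True,
) -> bool:
    """Decide whether modified speech should be generated.

    Passing alpha=None or beta=None drops that bound, which gives the
    one-sided tests; the *_inclusive flags choose between <= and <.
    """
    if alpha is not None:
        lower_ok = (max_likelihood >= alpha if alpha_inclusive
                    else max_likelihood > alpha)
        if not lower_ok:
            return False
    if beta is not None:
        upper_ok = (max_likelihood <= beta if beta_inclusive
                    else max_likelihood < beta)
        if not upper_ok:
            return False
    return True


two_sided = should_deform(0.5, alpha=0.2, beta=0.8)                  # alpha <= L <= beta
strict_lower = should_deform(0.2, alpha=0.2, alpha_inclusive=False)  # alpha < L only
upper_only = should_deform(0.9, beta=0.8)                            # L <= beta only
```

Keeping the choice of bounds in parameters rather than in separate functions makes it easy to tune the criterion together with the thresholds.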

In step S56b6 (FIG. 11), when α ≦ maximum likelihood ≦ β does not hold, only the switching unit 81a (FIG. 6) and the switching unit 81c may be switched to the No terminal. That is, the control unit 80 may allow modified speech to be generated, as long as it performs control so that synthesized speech is what is finally output.

When the speech synthesizers 11, 21, and 31 are implemented by a computer, the processing content of the functions that each device should have is described by a program. The processing functions are then realized on the computer by executing this program on the computer illustrated in FIG. 12.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

As a form of execution different from the embodiments described above, the computer may read the program directly from a portable recording medium and execute processing according to the program. Alternatively, each time the program is transferred from the server computer to this computer, the computer may execute processing according to the received program. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and retrieval of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (for example, data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, each device is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

The various processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. It goes without saying that other modifications can be made as appropriate without departing from the spirit of the present invention.

FIG. 1 is a diagram illustrating the functional configuration of the voice guidance system 1.
FIG. 2 is a diagram illustrating the functional configuration of the speech synthesizer 11 according to the first embodiment.
FIG. 3 is a diagram illustrating the functional configuration of the synthesis unit 431.
FIG. 4 is a diagram illustrating the functional configuration of the speech synthesizer 21 according to the second embodiment.
FIG. 5 is a diagram illustrating the functional configuration of the deformation unit 65.
FIG. 6 is a diagram illustrating the functional configuration of the speech synthesizer 31 according to the third embodiment.
FIG. 7 is a diagram illustrating the processing flow of the voice guidance system 1.
FIG. 8 is a diagram illustrating the flow of the process of step S5.
FIG. 9 is a diagram illustrating the flow of the process of step S55.
FIG. 10 is a diagram illustrating the flow of the process of step S56a.
FIG. 11 is a diagram illustrating the flow of the process of step S56b.
FIG. 12 is a diagram illustrating the functional configuration when the speech synthesizer is implemented by a computer.

Explanation of symbols

1 Voice guidance system
10 User
11 Speech synthesizer
12 Switching unit
13 Communication network
14a First acquisition unit
14b Second acquisition unit
14c Third acquisition unit
15 Input/output unit
16 Operator
17 Input/output unit
21 Speech synthesizer
31 Speech synthesizer
42 Speaker identification unit
43 Synthesized speech generation unit
64 Second feature amount extraction unit
65 Deformation unit
80 Control unit
83 Determination unit
420 Likelihood calculation unit
421 First feature amount extraction unit
422 Identification model storage unit
423 Selection unit
431 Synthesis unit
432 Speech data selection unit
433 Speech data storage unit
422n Identification model
433n Speech data

Claims

A speech synthesizer for use in a voice guidance system that generates synthesized speech with the speech synthesizer and provides information to a user with the synthesized speech in place of a call center operator,
the speech synthesizer comprising speaker identification means and synthesized speech generation means,
wherein the speaker identification means comprises:
identification model storage means for storing a plurality of identification models;
first feature amount extraction means for extracting a first feature amount from input speech of the operator;
likelihood calculation means for inputting the first feature amount into each identification model read from the identification model storage means and calculating a likelihood for each identification model; and
selection means for selecting, from the plurality of identification models prepared in advance, the identification model that maximizes the calculated likelihood,
wherein the synthesized speech generation means comprises:
speech data storage means for storing speech data corresponding to each of the plurality of identification models;
speech data selection means for reading, from the speech data storage means, the speech data corresponding to the selected identification model; and
synthesis means for using the read speech data to generate synthesized speech that corresponds to input information, determined by the operator, about the speech to be provided to the user and that has a sound quality close to that of the operator's voice,
and wherein the speech synthesizer further comprises:
second feature amount extraction means for extracting, from the operator's voice, a speaker characteristic parameter representing the individuality of the speaker;
deformation means for using the speaker characteristic parameter to deform the synthesized speech so that its sound quality approaches that of the operator's voice, thereby generating modified speech; and
control means for performing control so that the modified speech is output when the maximum value of the calculated likelihood is greater than, or greater than or equal to, a predetermined threshold α, or when the maximum value of the calculated likelihood is less than, or less than or equal to, a predetermined threshold β that is larger than the threshold α, and so that the synthesized speech is output otherwise.

A speech synthesizer for use in a voice guidance system that generates synthesized speech with the speech synthesizer and provides information to a user with the synthesized speech in place of a call center operator,
the speech synthesizer comprising speaker identification means and synthesized speech generation means,
wherein the speaker identification means comprises:
identification model storage means for storing a plurality of identification models;
first feature amount extraction means for extracting a first feature amount from input speech of the operator;
likelihood calculation means for inputting the first feature amount into each identification model read from the identification model storage means and calculating a likelihood for each identification model; and
selection means for selecting, from the plurality of identification models prepared in advance, the identification model that maximizes the calculated likelihood,
wherein the synthesized speech generation means comprises:
speech data storage means for storing speech data corresponding to each of the plurality of identification models;
speech data selection means for reading, from the speech data storage means, the speech data corresponding to the selected identification model; and
synthesis means for using the read speech data to generate synthesized speech that corresponds to input information, determined by the operator, about the speech to be provided to the user and that has a sound quality close to that of the operator's voice,
and wherein the speech synthesizer further comprises:
second feature amount extraction means for extracting, from the operator's voice, a speaker characteristic parameter representing the individuality of the speaker;
deformation means for using the speaker characteristic parameter to deform the synthesized speech so that its sound quality approaches that of the operator's voice, thereby generating modified speech; and
control means for performing control so that the modified speech is output when the maximum value of the calculated likelihood is greater than, or greater than or equal to, a predetermined threshold α and is less than, or less than or equal to, a predetermined threshold β that is larger than the threshold α, and so that the synthesized speech is output otherwise.

  In a voice synthesis method in a voice guidance method for generating synthesized voice by a voice synthesis method and providing information to a user with synthesized voice instead of a call center operator,
  The speech synthesis method includes a speaker identification step and a synthesized speech generation step,
  The speaker identification step is
  A first feature amount extraction step for extracting a first feature amount from the input voice of the operator;
  A likelihood calculation step of inputting the first feature amount into each identification model read from an identification model storage means for storing a plurality of identification models, and calculating a likelihood for each identification model;
  A selection step of selecting an identification model that maximizes the calculated likelihood from a plurality of identification models prepared in advance;
  Have
  The synthetic speech generation step includes
  A voice data selection step of reading voice data corresponding to the selected identification model from voice data storage means for storing voice data corresponding to each of the plurality of identification models;
  Using the read voice data to generate a synthesized voice having a sound quality close to that of the operator's voice corresponding to the information related to the voice provided to the user determined by the operator; ,
  Have
  The speech synthesis method further comprises:
  A second feature extraction step of extracting, from the operator's voice, a speaker characteristic parameter representing the individuality of the speaker;
  A deformation step of deforming the synthesized speech, using the speaker characteristic parameter, to generate modified speech so that the sound quality of the synthesized speech approaches the sound quality of the operator's speech; and
  A control step of controlling output so that the modified speech is output if the maximum value of the calculated likelihood is greater than, or greater than or equal to, a predetermined threshold α, or if the maximum value of the calculated likelihood is smaller than, or less than or equal to, a predetermined threshold β that is larger than the threshold α, and so that the synthesized speech is output otherwise.
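
As an illustration only, the likelihood calculation step, the selection step, and the voice data selection step recited above might be sketched as follows. The claims do not fix the form of the identification models; the sketch assumes each model is a single diagonal-covariance Gaussian scored by average log-likelihood, and the names `log_likelihood`, `select_model`, and `select_voice_data` are hypothetical.

```python
import math


def log_likelihood(features, model):
    """Average per-frame log-likelihood of the feature frames.

    Assumption: the identification model is one diagonal Gaussian given as
    {"mean": [...], "var": [...]}; the claims do not specify the model type.
    """
    if not features:
        raise ValueError("at least one feature frame is required")
    total = 0.0
    for frame in features:
        for x, mu, var in zip(frame, model["mean"], model["var"]):
            total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total / len(features)


def select_model(features, models):
    """Score every identification model and return (model_id, max_score)."""
    scores = {model_id: log_likelihood(features, m) for model_id, m in models.items()}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


def select_voice_data(model_id, voice_data_store):
    """Read the voice data stored for the selected identification model."""
    return voice_data_store[model_id]


# Hypothetical data: two stored models and two frames of operator features.
models = {
    "speaker_A": {"mean": [0.0, 0.0], "var": [1.0, 1.0]},
    "speaker_B": {"mean": [5.0, 5.0], "var": [1.0, 1.0]},
}
voice_data_store = {"speaker_A": "voice_data_A", "speaker_B": "voice_data_B"}
operator_features = [[4.8, 5.1], [5.2, 4.9]]

best_id, max_score = select_model(operator_features, models)
voice_data = select_voice_data(best_id, voice_data_store)
```

The maximum score returned by `select_model` is the quantity that the control step compares against the thresholds α and β.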

  A speech synthesis method used in a voice guidance method in which synthesized speech is generated by the speech synthesis method and information is provided to a user by the synthesized speech in place of a call center operator, wherein
  The speech synthesis method includes a speaker identification step and a synthesized speech generation step,
  The speaker identification step comprises:
  A first feature amount extraction step of extracting a first feature amount from the input voice of the operator;
  A likelihood calculation step of inputting the first feature amount into each identification model read from identification model storage means that stores a plurality of identification models, and calculating a likelihood for each identification model; and
  A selection step of selecting, from among the plurality of identification models prepared in advance, the identification model that maximizes the calculated likelihood,
  The synthesized speech generation step comprises:
  A voice data selection step of reading the voice data corresponding to the selected identification model from voice data storage means that stores voice data corresponding to each of the plurality of identification models; and
  A step of using the read voice data to generate synthesized speech that corresponds to the information, determined by the operator, about the speech to be provided to the user, and that has a sound quality close to that of the operator's voice,
  The speech synthesis method further comprises:
  A second feature extraction step of extracting, from the operator's voice, a speaker characteristic parameter representing the individuality of the speaker;
  A deformation step of deforming the synthesized speech, using the speaker characteristic parameter, to generate modified speech so that the sound quality of the synthesized speech approaches the sound quality of the operator's speech; and
  A control step of controlling output so that the modified speech is output if the maximum value of the calculated likelihood is greater than, or greater than or equal to, a predetermined threshold α and smaller than, or less than or equal to, a predetermined threshold β that is larger than the threshold α, and so that the synthesized speech is output otherwise.
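
The control step of the method above, in which the modified speech is output only while the maximum likelihood lies between α and β, might be sketched as below. This is a minimal sketch under stated assumptions: the function name `choose_output` and the `inclusive` flag are hypothetical, the flag standing in for the claim's alternative wording ("greater than" versus "greater than or equal to"), and the threshold values are arbitrary.

```python
def choose_output(max_likelihood, alpha, beta, synthesized, modified, inclusive=True):
    """Return the modified speech when alpha..beta contains max_likelihood.

    Assumption: `inclusive` selects between the two comparison variants the
    claim allows (>= and <= when True, > and < when False).
    """
    if not alpha < beta:
        raise ValueError("threshold beta must be larger than threshold alpha")
    if inclusive:
        in_band = alpha <= max_likelihood <= beta
    else:
        in_band = alpha < max_likelihood < beta
    return modified if in_band else synthesized


# Hypothetical log-likelihood thresholds and placeholder outputs.
ALPHA, BETA = -5.0, -1.0
inside = choose_output(-3.0, ALPHA, BETA, "synthesized", "modified")
above = choose_output(-0.5, ALPHA, BETA, "synthesized", "modified")
below = choose_output(-6.0, ALPHA, BETA, "synthesized", "modified")
```

With these values, only the first call falls inside the band and selects the modified speech; the other two fall back to the unmodified synthesized speech.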

A speech synthesis program for causing a computer to function as the speech synthesizer according to claim 1 or 2.

A computer-readable recording medium on which the speech synthesis program according to claim 5 is recorded.