JP6507867B2

JP6507867B2 - Voice generation device, voice generation method, and program

Info

Publication number: JP6507867B2
Application number: JP2015117697A
Authority: JP
Inventors: 淳哉斎藤
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2015-06-10
Filing date: 2015-06-10
Publication date: 2019-05-08
Anticipated expiration: 2035-06-10
Also published as: JP2017003774A

Description

The present invention relates to a voice generation device, a voice generation method, and a program.

Some voice generation devices change voice quality and similar properties in real time in response to changes in a predetermined input value. Examples of such an input value include the position of a slider used to adjust the voice quality and the noise level around the device.

To generate the voice for one output target (message), this type of voice generation device holds a plurality of voice data items that differ in their combination of voice quality and related properties, and selects and plays back one of them according to the input value. When the input value changes during playback, the device switches to the voice data corresponding to the changed input value (see, for example, Patent Document 1). Such a device has a lighter processing load at voice generation time than a method that selects the voice quality and related properties of the generated voice according to the spectrum of the ambient noise (see, for example, Patent Document 2).

The plurality of voice data items held by such a voice generation device are created by morphing. Morphing is a method of generating synthesized speech with an intermediate voice quality by mixing two voice data items that differ in voice quality and related properties at a desired ratio (the morphing rate). When two voice data items are morphed, the whole of the data can be morphed at one constant morphing rate, or a morphing rate can be specified per phoneme or per syllable (see, for example, Patent Document 3).
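As a rough illustration of mixing two voice data items at a morphing rate, the following minimal sketch blends two equal-length parameter sequences by linear interpolation. The linear blend, the function names `morph` and `morph_per_unit`, and the convention that the rate is the share of the first sequence are assumptions made for illustration; the patent does not specify the interpolation actually used.

```python
from typing import List, Sequence


def morph(first: Sequence[float], second: Sequence[float], rate: float) -> List[float]:
    """Blend two equal-length parameter sequences.

    `rate` is taken here as the share of `first` (assumed convention):
    1.0 returns `first` unchanged, 0.0 returns `second` unchanged.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be within [0, 1]")
    if len(first) != len(second):
        raise ValueError("sequences must have the same length")
    return [rate * a + (1.0 - rate) * b for a, b in zip(first, second)]


def morph_per_unit(
    first_units: Sequence[Sequence[float]],
    second_units: Sequence[Sequence[float]],
    rates: Sequence[float],
) -> List[List[float]]:
    """Apply a separate morphing rate to each unit (e.g. phoneme or syllable)."""
    return [morph(a, b, r) for a, b, r in zip(first_units, second_units, rates)]
```

For example, `morph([100.0, 200.0], [200.0, 400.0], 0.5)` yields the midpoint of the two sequences, while `morph_per_unit` lets each phoneme or syllable use its own rate.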

Such a voice generation device can be applied to voice guidance in environments where the noise level is not constant, for example to give workers instructions for operating a facility in a factory or to notify them of the operating status of the facility.

Patent Document 1: Japanese Laid-open Patent Publication No. 2006-178052
Patent Document 2: Japanese Laid-open Patent Publication No. H02-210497
Patent Document 3: Japanese Laid-open Patent Publication No. 2006-227589

When such a voice generation device is applied to an environment where the noise level is not constant, the noise level around the facility is used as the predetermined input value. The plurality of voice data items for one message are created by morphing first voice data, created under conditions that make it easy to hear when the noise level is low, with second voice data, created under conditions that make it easy to hear when the noise level is high. To select voice data according to the noise level, the device uses a conversion table representing the correspondence between noise level and morphing rate, and selects the voice data whose morphing rate corresponds to the noise level around the device. This keeps the message from becoming hard to hear when the noise level around the facility rises during playback of the voice data (message), and so prevents the message from being missed.
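The conversion table can be pictured as a simple step lookup from noise level to morphing rate. The sketch below is only an illustration: the decibel thresholds, the rate values, and the name `rate_for_noise` are hypothetical, and it assumes the rate is the share of the first (quiet-condition) voice data, so the rate falls as the noise rises.

```python
from bisect import bisect_right

# Hypothetical conversion table (values are NOT from the patent).
# Noise levels below THRESHOLDS_DB[i] map to RATES[i]; the last rate
# covers everything at or above the final threshold.
THRESHOLDS_DB = [50.0, 60.0, 70.0, 80.0]
RATES = [1.0, 0.7, 0.5, 0.2, 0.0]


def rate_for_noise(noise_db: float) -> float:
    """Return the morphing rate assigned to a noise level by the table."""
    return RATES[bisect_right(THRESHOLDS_DB, noise_db)]
```

A step table keeps the run-time cost to a single lookup, which fits the stated goal of a light processing load during playback.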

When the noise level around the facility is low during playback of the voice data (message), voice data created under conditions that make it easy to hear at a low noise level is output. This also prevents the listening fatigue that results from hearing, in a low-noise environment, a voice designed to be easy to hear at a high noise level.

However, the plurality of voice data items created by morphing the first voice data and the second voice data also differ in fundamental frequency, which is correlated with accent. Consequently, if the device switches to voice data with a different morphing rate in response to a change in noise level while one accent phrase of the voice data is being played back, the fundamental frequency changes and the accent changes with it. When the accent of an accent phrase changes in this way, the worker hears the message with the wrong accent, which may feel unnatural and make the content of the message harder to understand.

In one aspect, an object of the present invention is to keep the played-back voice easy to hear, and to keep the accent unchanged, even when the device switches to voice data with a different morphing rate in response to a change in the input value.

A voice generation device according to one aspect of the present invention includes a morphing rate determination unit and a voice playback unit. The morphing rate determination unit determines two or more morphing rates, including a voice quality morphing rate and a fundamental frequency morphing rate, based on an input value from an input device. The voice playback unit selects and plays back voice data based on the morphing rates. The morphing rate determination unit includes a first morphing rate determination unit and a second morphing rate determination unit. The first morphing rate determination unit determines the voice quality morphing rate based on the input value at the current playback position of the voice data. The second morphing rate determination unit determines the fundamental frequency morphing rate based on the input value at the time the beginning of the accent phrase containing the playback position was played back.

According to the above aspect, the played-back voice remains easy to hear and the accent does not change even when the device switches to voice data with a different morphing rate in response to a change in the input value.

Functional block diagram of the voice generation device according to the first embodiment.
Functional block diagram of the morphing rate determination unit in the first embodiment.
Diagram showing the configuration of the voice database.
Diagram showing the correspondence between playback positions.
Diagram showing a configuration example of one voice data set.
Flowchart (part 1) showing the voice generation process according to the first embodiment.
Flowchart (part 2) showing the voice generation process according to the first embodiment.
Graphs illustrating the relationship between the playback position of voice data and the morphing rates.
Graphs illustrating the accent when the noise level changes during playback of an accent phrase.
Hardware configuration diagram of a computer.
Diagram showing another application example of the voice generation device according to the first embodiment.
Diagram showing yet another application example of the voice generation device according to the first embodiment.
Diagram showing a configuration example of the e-learning system according to the second embodiment.
Diagram showing a configuration example of the work window displayed on the display device.
Functional block diagram of the voice generation device according to the second embodiment.
Functional block diagram of the synthesized speech creation unit in the second embodiment.
Flowchart (part 1) showing the voice generation process according to the second embodiment.
Flowchart (part 2) showing the voice generation process according to the second embodiment.

[First Embodiment]
This embodiment describes the configuration of a voice generation device, a voice generation method, and related matters for the case where the present invention is applied to a voice generation device that notifies workers of instructions for operating a facility in a factory and of the operating status of the facility.

FIG. 1 is a functional block diagram of the voice generation device according to the first embodiment.
As shown in FIG. 1, the voice generation device 1 according to this embodiment includes an input value processing unit 100, a morphing rate determination unit 101, a conversion table 102, a voice playback unit 103, and a voice database 104. The voice playback unit 103 includes a voice data selection unit 103a and a playback control unit 103b. The voice generation device 1 plays back voice suited to the surrounding noise level by repeatedly acquiring the noise level at regular intervals and generating and playing back the corresponding voice frame by frame.

The input value processing unit 100 calculates the noise level around the facility 3 based on the audio signal (input value) input from the microphone 2. On receiving a control signal from the control unit 300 of the facility 3 that instructs playback of voice data, the input value processing unit 100 starts acquiring the audio signal. On receiving a signal from the playback control unit 103b indicating that playback of the voice data has finished, it stops acquiring the audio signal from the microphone 2 and calculating the noise level.

The morphing rate determination unit 101 determines the morphing rates based on the noise level calculated by the input value processing unit 100, the conversion table 102, and information indicating accent phrase boundaries from the playback control unit 103b. The conversion table 102 is a table showing the correspondence between noise level and morphing rate. The information indicating accent phrase boundaries indicates whether the frame currently being processed is at an accent phrase boundary.

The voice playback unit 103 reads voice data from the voice database 104 and outputs it to the speaker 4, based on information from the control unit 300 of the facility 3 that specifies the voice data to be output and on the morphing rates determined by the morphing rate determination unit 101. The voice database 104 stores voice data morphed in advance at various morphing rates. The information specifying the voice data and the morphing rates are received by the voice data selection unit 103a, which searches the voice database 104 using them as keys and identifies the matching voice data. Having identified the voice data, the voice data selection unit 103a notifies the playback control unit 103b of its ID information. The playback control unit 103b reads the voice data from the voice database 104 based on the notified ID information, determines the frame to be played back, and outputs it to the speaker 4. The playback control unit 103b also sends the information indicating accent phrase boundaries to the morphing rate determination unit 101. When output (playback) of the voice data finishes, the playback control unit 103b notifies the input value processing unit 100 that playback has finished.

FIG. 2 is a functional block diagram of the morphing rate determination unit in the first embodiment.
As shown in FIG. 2, the morphing rate determination unit 101 in this embodiment includes an instantaneous morphing rate determination unit 101a and an accent phrase morphing rate determination unit 101b. It further includes a voice quality morphing rate determination unit 101c, a fundamental frequency morphing rate determination unit 101d, and a duration morphing rate determination unit 101e.

The instantaneous morphing rate determination unit 101a determines the morphing rate corresponding to the current noise level, based on the noise level calculated by the input value processing unit 100 and the conversion table 102.

The accent phrase morphing rate determination unit 101b determines a morphing rate based on the conversion table 102 and the noise level at the time the first frame of the accent phrase containing the frame currently being processed was played back. Hereinafter, this noise level is also called the noise level at the beginning of the accent phrase. It is obtained by holding the noise level at each accent phrase boundary, using the noise level received from the input value processing unit 100 and the information indicating accent phrase boundaries received from the playback control unit 103b. Each time a noise level is received from the input value processing unit 100, the unit determines the morphing rate based on the noise level at the beginning of the accent phrase containing the frame currently being processed and on the conversion table 102.
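The difference between the two determination units can be sketched as follows: the instantaneous rate follows every noise reading, while the accent phrase rate is latched at the start of each accent phrase. The class and function names, the `starts_phrase` flag, and the idea of passing the conversion table as a callable are illustrative assumptions, not details taken from the patent.

```python
from typing import Callable, Optional

ConversionTable = Callable[[float], float]  # noise level -> morphing rate


def instantaneous_rate(table: ConversionTable, noise: float) -> float:
    """Rate that tracks the current noise level (role of unit 101a)."""
    return table(noise)


class AccentPhraseRateDeterminer:
    """Rate latched to the noise level at the accent phrase start (role of unit 101b)."""

    def __init__(self, table: ConversionTable) -> None:
        self._table = table
        self._head_noise: Optional[float] = None  # not yet set

    def rate(self, noise: float, starts_phrase: bool) -> float:
        # Latch the noise level only when a new accent phrase begins
        # (or when nothing has been latched yet); otherwise keep the old value.
        if self._head_noise is None or starts_phrase:
            self._head_noise = noise
        return self._table(self._head_noise)
```

With a table that returns 1.0 below some threshold and 0.0 above it, a noise jump in the middle of an accent phrase changes `instantaneous_rate` immediately, but `AccentPhraseRateDeterminer.rate` keeps returning the latched value until the next phrase starts.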

The voice quality morphing rate determination unit 101c determines the voice quality morphing rate for the voice data to be played back. Here, the voice quality of voice data is a parameter represented by the mel-cepstrum, the mel-generalized cepstrum, or the spectrum. In this embodiment, the voice quality morphing rate determination unit 101c uses the morphing rate determined by the instantaneous morphing rate determination unit 101a as the voice quality morphing rate.

The fundamental frequency morphing rate determination unit 101d determines the fundamental frequency morphing rate for the voice data to be played back. Here, the fundamental frequency, also called F0, is a parameter representing the pitch of the voice. In this embodiment, the fundamental frequency morphing rate determination unit 101d uses the morphing rate determined by the accent phrase morphing rate determination unit 101b as the fundamental frequency morphing rate.

The duration morphing rate determination unit 101e determines the duration morphing rate for the voice data to be played back. Here, duration is a parameter representing the length of a phoneme. In this embodiment, the duration morphing rate determination unit 101e uses the morphing rate determined by the accent phrase morphing rate determination unit 101b as the duration morphing rate. Alternatively, the duration morphing rate determination unit 101e may use the morphing rate determined by the instantaneous morphing rate determination unit 101a as the duration morphing rate.

FIG. 3A is a diagram showing the configuration of the voice database. FIG. 3B is a diagram showing the correspondence between playback positions. FIG. 3C is a diagram showing a configuration example of one voice data set.

As shown in FIG. 3A, the voice database 104 of the voice generation device 1 according to this embodiment consists of a plurality of voice data sets, including a first voice data set 104-1 and a second voice data set 104-2. One voice data set is a collection of voice data items for a single message that differ in their combination of voice quality morphing rate and fundamental frequency (and duration) morphing rate. For example, the voice data items in the first voice data set 104-1 are all voice data for the message 「ハンドルを右に回してください」 ("Please turn the handle to the right"), but each has a different combination of voice quality morphing rate and fundamental frequency morphing rate. Each voice data item also holds accent phrase boundary information in advance. For voice data uttering this message, for example, it holds the information that the positions in the voice data corresponding to each "｜" in "｜ハンドルを｜右に｜回してください" are accent phrase boundaries. Text information is not required; it is enough to hold, at minimum, whether each position in the voice data is an accent phrase boundary. In addition, so that playback positions can be matched between voice data items, each item holds in advance a reference time that advances by 1.0 per mora, as shown for example in FIG. 3B.
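One way to picture the per-item metadata is a record holding the sample offset of each mora together with the accent phrase boundaries. Because duration can differ between voice data items, the shared reference time (1.0 per mora) is what allows the same playback position to be located in each of them. The field names and the linear interpolation inside a mora in this sketch are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoiceData:
    # Sample index at which each mora starts, plus one final entry for the end.
    mora_starts: List[int]
    # Reference times (in moras) that coincide with accent phrase boundaries.
    phrase_boundaries: List[float]

    def sample_at(self, ref_time: float) -> int:
        """Map a reference time (1.0 per mora) to a sample index in this item."""
        last_mora = len(self.mora_starts) - 2
        if ref_time < 0.0 or ref_time > last_mora + 1:
            raise ValueError("reference time outside the voice data")
        i = min(int(ref_time), last_mora)
        frac = ref_time - i
        start, end = self.mora_starts[i], self.mora_starts[i + 1]
        # Assumed: position advances linearly within a mora.
        return round(start + frac * (end - start))
```

Two items with different mora lengths then return different sample indices for the same reference time, so playback can continue from the equivalent position after a switch.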

In this embodiment, as in the first voice data set 104-1 shown in FIG. 3C, the voice quality morphing rate MP and the fundamental frequency morphing rate MA in one voice data set are each varied from 0 to 1 in steps of 0.1. The morphing rates MP and MA represent the proportion of the first voice data when the first voice data, created under conditions that make it easy to hear when the noise level is low, is morphed with the second voice data, created under conditions that make it easy to hear when the noise level is high.

In FIG. 3C, voice data MD(MP, MA) {MP = 0 to 1, MA = 0 to 1} denotes the voice data whose voice quality morphing rate is MP and whose fundamental frequency morphing rate is MA. MDGn {n = 0 to 10} denotes the group of voice data MD(MP, MA) {MP = 0 to 1} that share the same fundamental frequency morphing rate MA and differ in voice quality morphing rate MP.
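With both rates stepped from 0 to 1 in increments of 0.1, one voice data set forms an 11 x 11 grid of 121 items. The sketch below indexes that grid with integer step numbers to avoid floating-point keys. The file naming scheme and the function names are hypothetical, and rounding an arbitrary rate to the nearest step is an assumption.

```python
from typing import Dict, Tuple

STEPS = 10  # rates 0.0, 0.1, ..., 1.0 give 11 values per axis


def grid_index(rate: float) -> int:
    """Convert a morphing rate in [0, 1] to the nearest 0.1-step index."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be within [0, 1]")
    return round(rate * STEPS)


def build_data_set(message_id: str) -> Dict[Tuple[int, int], str]:
    """Map (MP index, MA index) to a hypothetical file name for MD(MP, MA)."""
    return {
        (p, a): f"{message_id}_MP{p:02d}_MA{a:02d}.wav"
        for p in range(STEPS + 1)
        for a in range(STEPS + 1)
    }


def select(data_set: Dict[Tuple[int, int], str], mp: float, ma: float) -> str:
    """Pick the item MD(MP, MA) for the given pair of rates."""
    return data_set[(grid_index(mp), grid_index(ma))]
```

In this layout, a group MDGn corresponds to the 11 entries that share one MA index `n`.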

The first voice data and the second voice data used to create a voice data set may be voice data converted from text data by speech synthesis, or voice data obtained by recording a human voice.

Next, the voice generation process in the voice generation device 1 of this embodiment is described.
FIG. 4A is a flowchart (part 1) showing the voice generation process according to the first embodiment. FIG. 4B is a flowchart (part 2) showing the voice generation process according to the first embodiment.

On receiving a control signal from the control unit 300 of the facility 3 requesting output of a voice (message), the voice generation device 1 according to this embodiment generates and outputs, frame by frame, voice data whose morphing rates correspond to the noise level around the facility 3. As shown in FIG. 4A, the voice generation device 1 first initializes the noise level at the time the beginning of the accent phrase containing the playback position was played back (step S1). Step S1 is performed by the accent phrase morphing rate determination unit 101b.

Next, the voice generation device 1 acquires the audio signal (input value) from the microphone 2 and calculates the current noise level (step S2). Step S2 is performed by the input value processing unit 100. The input value processing unit 100 calculates the noise level based on, for example, a table prepared in advance that maps the input power of the audio signal to a noise level. The input value processing unit 100 then passes the calculated noise level to the instantaneous morphing rate determination unit 101a and the accent phrase morphing rate determination unit 101b of the morphing rate determination unit 101.
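Step S2 can be sketched as computing the power of one frame of microphone samples and looking it up in a power-to-noise-level table. The mean-square power measure, the decibel thresholds, and the integer noise levels below are all hypothetical; the patent states only that a prepared correspondence table is used.

```python
import math
from bisect import bisect_right
from typing import Sequence

# Hypothetical table: frame power below POWER_THRESHOLDS_DB[i] maps to
# NOISE_LEVELS[i]; the last level covers everything above the final threshold.
POWER_THRESHOLDS_DB = [-40.0, -30.0, -20.0]
NOISE_LEVELS = [0, 1, 2, 3]


def frame_power_db(samples: Sequence[float]) -> float:
    """Mean-square power of one frame in dB relative to full scale 1.0."""
    if not samples:
        raise ValueError("frame must contain at least one sample")
    mean_square = sum(s * s for s in samples) / len(samples)
    # Floor the value so that a silent frame does not produce log10(0).
    return 10.0 * math.log10(max(mean_square, 1e-12))


def noise_level(samples: Sequence[float]) -> int:
    """Look up the noise level for one frame of microphone samples."""
    return NOISE_LEVELS[bisect_right(POWER_THRESHOLDS_DB, frame_power_db(samples))]
```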

On receiving the current noise level from the input value processing unit 100, the instantaneous morphing rate determination unit 101a obtains the instantaneous morphing rate based on the current noise level and the conversion table 102, as shown in FIG. 4A (step S3a). It then passes the obtained instantaneous morphing rate to the voice quality morphing rate determination unit 101c, which sets the received instantaneous morphing rate as the voice quality morphing rate MP (step S3b).

Meanwhile, on receiving the current noise level from the input value processing unit 100, the accent phrase morphing rate determination unit 101b holds the current noise level, as shown in FIG. 4A (step S4a). It then checks whether the noise level at the beginning of the accent phrase, that is, the noise level at the time the beginning of the accent phrase containing the playback position was played back, has already been set (step S4b). If it has not been set (step S4b; No), the accent phrase morphing rate determination unit 101b sets the current noise level as the noise level at the beginning of the accent phrase (step S4c). It then obtains the accent phrase morphing rate based on the noise level at the beginning of the accent phrase and the conversion table 102 (step S4d). If the noise level has already been set (step S4b; Yes), the accent phrase morphing rate determination unit 101b skips step S4c and proceeds to obtain the accent phrase morphing rate (step S4d).
After step S4d, the accent phrase morphing rate determination unit 101b passes the obtained accent phrase morphing rate to the fundamental frequency morphing rate determination unit 101d and the duration morphing rate determination unit 101e. The fundamental frequency morphing rate determination unit 101d sets the received accent phrase morphing rate as the fundamental frequency morphing rate MA (step S4e). Similarly, the duration morphing rate determination unit 101e sets the received accent phrase morphing rate as the duration morphing rate (step S4e).

Once the morphing rates for voice quality, fundamental frequency, and duration have been determined in this way, the morphing rate determination unit 101 passes the determined morphing rates MP and MA to the voice data selection unit 103a of the voice playback unit 103. As shown in FIG. 4B, the voice data selection unit 103a determines the voice data to output from the voice database 104, based on the information from the control unit 300 of the facility 3 that specifies the voice data to be output, the voice quality morphing rate MP, and the fundamental frequency morphing rate MA (step S5). In doing so, the voice data selection unit 103a first identifies the voice data set in the voice database 104 based on the information specifying the voice data. It then determines the voice data MD(MP, MA) within the identified voice data set based on the voice quality morphing rate MP and the fundamental frequency morphing rate MA. The voice data selection unit 103a then passes information about the voice data MD(MP, MA) to the playback control unit 103b.

On receiving the information about the voice data MD(MP, MA), the playback control unit 103b reads the voice data MD(MP, MA) from the voice database 104 and outputs it to the speaker 4, starting from the playback position given by the current reference time (step S6).

After outputting the voice data to the speaker 4, the playback control unit 103b checks whether the playback position has reached the end position of the voice data (step S7). If it has not (step S7; No), the playback control unit 103b next checks whether the playback position coincides with an accent phrase boundary (step S8). If it does (step S8; Yes), the playback control unit 103b, working with the accent phrase morphing rate determination unit 101b, updates the noise level at the beginning of the accent phrase to the current noise level (step S9). The playback control unit 103b then moves the playback position to the start of the next frame (step S10) and has the input value processing unit 100 perform step S2. The voice generation device 1 repeats steps S2 to S10 until the playback position reaches the end position of the voice data.

When the reproduction position reaches the end position of the voice data (step S7; Yes), the reproduction control unit 103b ends the output processing once the end position has been output. The voice generation device 1 then enters a standby state. When the voice generation device 1 in the standby state receives a new control signal from the control unit 300 of the facility 3, it generates and outputs voice data according to that control signal.
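The frame-by-frame loop of steps S2 to S10 can be illustrated with the simplified sketch below. It assumes that playback proceeds in whole frames, that the accent phrase boundaries are given as the indices of the frames that begin a phrase, and that a single conversion function maps a noise level to a morphing rate. The names `schedule_rates`, `noise_at`, and `to_rate` are hypothetical. The actual output to the speaker (step S6) is omitted, so the function only returns the rates that would be used for each frame.

```python
def schedule_rates(num_frames, phrase_heads, noise_at, to_rate):
    """Return a list of (frame, MP, MA) tuples for one utterance.

    num_frames:   number of frames in the voice data
    phrase_heads: set of frame indices at which an accent phrase begins
    noise_at:     function giving the noise level measured for a frame
    to_rate:      function converting a noise level into a morphing rate
    """
    schedule = []
    head_noise = None
    for frame in range(num_frames):
        current = noise_at(frame)              # input value for this frame
        if head_noise is None or frame in phrase_heads:
            head_noise = current               # latch at an accent phrase head
        mp = to_rate(current)                  # voice quality: follows current noise
        ma = to_rate(head_noise)               # F0 and duration: fixed within a phrase
        schedule.append((frame, mp, ma))
    return schedule
```

In this sketch the noise level is latched when a phrase-initial frame is about to be played, whereas the flow described above updates it after a frame has been output and the reproduction position has reached a boundary. The effect on the rates is the same: MA changes only at accent phrase boundaries, while MP may change at every frame.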

FIG. 5 is a set of graphs illustrating the relationship between the reproduction position of voice data and the morphing rates. FIG. 5 shows, stacked vertically, a graph plotting the noise level L at each reproduction position of the voice data, a graph plotting the voice quality morphing rate MP, and a graph plotting the fundamental frequency morphing rate MA.

The noise level L while a given piece of voice data is being reproduced fluctuates, for example, within the range L1 ≤ L ≤ L2, as shown in FIG. 5. In the example of FIG. 5, the noise level L during reproduction of the n-th accent phrase is L = L1 at the reproduction position P1 at the head of the accent phrase (accent phrase boundary Bn), but it rises partway through the phrase and changes to L = L2.

In the voice generation processing according to the present embodiment, the voice quality morphing rate MP is determined based on the current noise level L. Therefore, if the noise level L is L = L2 at the time the reproduction position P4, which lies within the n-th accent phrase, is reproduced, the voice quality morphing rate MP for the reproduction position P4 is the value MP(L2) corresponding to the noise level L2.

On the other hand, in the voice generation processing according to the present embodiment, the morphing rate MA for the fundamental frequency and duration is determined based on the noise level at the time the head of the accent phrase containing the reproduction position was reproduced. Therefore, even if the noise level L is L = L2 at the time the reproduction position P4 is reproduced, the morphing rate MA of the fundamental frequency and duration for the reproduction position P4 is the value MA(L1) corresponding to the noise level L1 at the time the head of the n-th accent phrase was reproduced.

Thus, in the voice generation processing according to the present embodiment, when the noise level L changes greatly during reproduction of one accent phrase, only the voice quality morphing rate changes with the noise level; the morphing rate of the fundamental frequency and duration does not change. In other words, while one accent phrase is being reproduced, the only quantity that changes with the noise level is the voice quality morphing rate, which correlates with ease of hearing, and the fundamental frequency morphing rate, which correlates with accent, does not change. This prevents the accent of an accent phrase from changing while the voice data is being reproduced.

FIG. 6 is a set of graphs illustrating the accent when the noise level changes during reproduction of an accent phrase. FIG. 6 shows the noise level and frequency when the accent phrase 「ハンドルを」 (ha-n-do-ru-o, "the handle") is reproduced, together with the accent of the reproduced voice.

In FIG. 6, the curve F(L1) shows the relationship between the reproduction position and the fundamental frequency in voice data created under conditions that make it easy to hear when the noise level L is L = L1. The curve F(L2) shows the relationship between the reproduction position and the fundamental frequency in voice data created under conditions that make it easy to hear when the noise level L is L = L2. The curve Fout shows the relationship between the reproduction position and the fundamental frequency when voice data is generated according to the fundamental frequency morphing rate determined based on the noise level L.

In conventional voice generation processing, when the noise level L changes during reproduction of an accent phrase, the fundamental frequency morphing rate changes as well. In that case, as shown in the upper graph of FIG. 6, the voice in the section at noise level L2, from reproduction position 0 (the head of the accent phrase) to P1 (between "n" (ン) and "do" (ド)), is reproduced at the fundamental frequency of the curve F(L2). Likewise, the voice in the section at noise level L1, from reproduction position P2 (between "do" (ド) and "ru" (ル)) to P3 (the end position of the accent phrase), is reproduced at the fundamental frequency of the curve F(L1). In a section such as that from P1 to P2, where the noise level L gradually decreases from L = L2 to L = L1, the voice is reproduced while its frequency is varied at the fundamental frequency morphing rate MA corresponding to the noise level.

Consequently, when the accent phrase 「ハンドルを」 is reproduced by conventional voice generation processing, the fundamental frequency follows the curve Fout shown in the upper graph of FIG. 6. That is, from the reproduction position P1 onward, where the noise level L drops, the voice is reproduced at a frequency lower than the fundamental frequency F(L2) of the voice data used at the start of reproduction. As a result, the reproduced accent phrase sounds as if only the "n" (ン) portion were stressed, as shown in the upper part of FIG. 6. However, when the accent phrase 「ハンドルを」 is pronounced with the standard accent, the four sounds "n-do-ru-o" (ンドルを) are heard more strongly than "ha" (ハ) and at roughly equal strength, as shown in the lower graph of FIG. 6. Thus, when the fundamental frequency morphing rate changes within an accent phrase in response to a change in noise level, as in conventional voice generation processing, the accent changes, which may strike workers as unnatural. Moreover, when a message contains an accent phrase that has a homophone distinguished only by accent, the content of the message may become difficult to understand.

In contrast, in the voice generation processing according to the present invention (the present embodiment), as described above, the fundamental frequency morphing rate MA during reproduction of one accent phrase remains the morphing rate used when the head of the accent phrase was reproduced, even if the noise level changes greatly partway through. That is, as shown in the middle and lower graphs of FIG. 6, if the noise level L was L = L2 when the head of the accent phrase was reproduced, the fundamental frequency from the reproduction position P1 onward, where the noise level L changes, also remains that for the noise level L2. The curve Fout representing the frequency of the reproduced accent phrase 「ハンドルを」 therefore coincides with the curve F(L2). Accordingly, in the reproduced accent phrase, the four sounds "n-do-ru-o" (ンドルを) are heard more strongly than "ha" (ハ) and at roughly equal strength, as shown in the lower graph of FIG. 6. The phrase can thus be heard with the standard accent even if the noise level changes partway through, which prevents workers from finding the voice unnatural or having difficulty understanding the content.
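The difference between the two behaviors can be reproduced numerically with the small sketch below. It assumes that morphing the fundamental frequency is a linear interpolation between the two contours F(L1) and F(L2), and it uses invented per-sound F0 values in Hz for the five sounds "ha", "n", "do", "ru", "o". The function `morph` and all numbers are illustrative assumptions, not values taken from FIG. 6.

```python
def morph(f_low, f_high, rate):
    """Linear interpolation between two F0 values (rate 0 gives f_low, 1 gives f_high)."""
    return (1.0 - rate) * f_low + rate * f_high


# Hypothetical F0 values per sound for "ha", "n", "do", "ru", "o".
f_l1 = [100, 120, 120, 120, 120]   # contour F(L1), tuned for noise level L1
f_l2 = [150, 180, 180, 180, 180]   # contour F(L2), tuned for noise level L2

# Rate implied by the noise at each sound: L2 up to P1, falling, then L1 from P2.
noise_rate = [1.0, 1.0, 0.5, 0.0, 0.0]

# Conventional processing: the F0 morphing rate follows the current noise level.
conventional = [morph(a, b, r) for a, b, r in zip(f_l1, f_l2, noise_rate)]

# Present embodiment: the rate is fixed at its value for the head of the phrase.
proposed = [morph(a, b, noise_rate[0]) for a, b in zip(f_l1, f_l2)]

print(conventional)  # F0 peaks at "n" and then falls
print(proposed)      # identical to f_l2 over the whole phrase
```

With these numbers the conventional contour peaks at "n" and drops afterwards, matching the description of the upper graph, while the contour with the latched rate coincides with F(L2) throughout.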

Thus, according to the first embodiment, determining the voice quality morphing rate based on the noise level at the current time (the current reproduction position) prevents the voice from becoming difficult to hear because of a change in noise level. In addition, determining the fundamental frequency morphing rate at the current reproduction position based on the noise level at the time the head of the accent phrase containing that position was reproduced prevents the accent from changing and the content from becoming difficult to understand.

At sites such as factories, it is desirable to report the operating status of, and operating guidance for, each facility accurately and in real time so that workers can operate the facility safely and correctly. As described above, the voice generation device 1 of the present embodiment changes the voice quality in real time according to the noise level around the facility 3, while changing the fundamental frequency only in units of accent phrases. Voice messages therefore remain easy to hear even in an environment where the noise level is not constant, and their content does not become difficult to understand because of an incorrect accent. The voice generation device 1 of the present embodiment can thus be said to be optimal for supporting safe and accurate work at sites such as factories.

In the voice generation device 1 according to the first embodiment, the instantaneous morphing rate determination unit 101a and the voice quality morphing rate determination unit 101c shown in FIG. 2 may be integrated into a single determination unit. Similarly, the accent phrase morphing rate determination unit 101b, the fundamental frequency morphing rate determination unit 101d, and the duration morphing rate determination unit 101e shown in FIG. 2 may be a single integrated determination unit. Alternatively, the duration morphing rate determination unit 101e may use the morphing rate determined by the instantaneous morphing rate determination unit 101a as the duration morphing rate.

The voice generation device 1 according to the first embodiment can be realized by, for example, a computer and a program that causes the computer to execute the processing shown in FIGS. 4A and 4B. The voice generation device 1 realized by such a computer and program is described below with reference to FIG. 7.

FIG. 7 is a hardware configuration diagram of a computer.
As shown in FIG. 7, the computer 5 operated as the voice generation device includes a processor 50, a main storage device 51, an auxiliary storage device 52, an input device 53, an output device 54, and a communication interface device 55. These elements 50 to 55 of the computer 5 are interconnected by a bus 59 so that data can be exchanged among them.

The processor 50 is an arithmetic processing unit such as a Central Processing Unit (CPU) or a Micro Processing Unit (MPU), and controls the overall operation of the computer 5 by executing various programs, including an operating system.

The main storage device 51 includes a Read Only Memory (ROM) 51a and a Random Access Memory (RAM) 51b. The ROM 51a stores in advance, for example, a predetermined basic control program that the processor 50 reads when the computer 5 starts up. The RAM 51b is used as a working storage area as needed when the processor 50 executes various programs. In the present embodiment, the RAM 51b can be used, for example, to temporarily hold the noise level at the time the head of an accent phrase was reproduced and information indicating the voice data set that contains the voice data to be reproduced.

The auxiliary storage device 52 is a storage device, such as a Hard Disk Drive (HDD) or a Solid State Drive (SSD), with a larger capacity than the main storage device 51. The auxiliary storage device 52 can store the various programs executed by the processor 50 and various data, including the conversion table 102 and the voice database 104.

The input device 53 is, for example, various buttons and switches, and the microphone 2. The buttons and switches are used for purposes such as configuring the operation of the computer 5 (the voice generation device 1). When the operator of the computer 5 operates a button or switch, input information associated with that operation is sent to the processor 50. The microphone 2 is used to obtain the noise level around the facility 3.

The output device 54 is, for example, a liquid crystal display and the speaker 4. The liquid crystal display shows operation guidance, setting values, and the like according to display data sent from the processor 50 or another source. The speaker 4 outputs voice data sent from the processor 50 or another source.

The communication interface device 55 is a device for communicably connecting the computer 5 to the control unit 300 of the facility 3. When the computer 5 receives a control signal from the control unit 300 of the facility 3 through the communication interface device 55, it generates and outputs a message (voice data) corresponding to the control signal.

In the computer 5, the processor 50 reads the program for the voice generation processing described above from the auxiliary storage device 52 and executes it. When, during execution of the program, the processor 50 receives a control signal from the control unit 300 of the facility 3 via the communication interface device 55, it obtains the noise level around the facility 3 using the microphone 2. The processor 50 then determines the instantaneous morphing rate MP and the accent phrase morphing rate MA based on the current noise level, the noise level at the time the head of the accent phrase was reproduced, the conversion table 102 stored in the auxiliary storage device 52 or the RAM 51b, and so on. The processor 50 sets the voice quality morphing rate to the instantaneous morphing rate MP, and sets the morphing rate of the fundamental frequency and duration to the accent phrase morphing rate MA. Finally, the processor 50 reads the voice data to be reproduced from the voice database 104 in the auxiliary storage device 52 based on the combination of the morphing rates set for voice quality, fundamental frequency, and duration, and outputs it to the speaker 4.

[Application Example of Voice Generation Device 1]
As an application example of the voice generation device 1 according to the present embodiment, FIG. 1 shows a case in which the voice generation device 1 is provided separately from the facility 3. However, the voice generation device 1 according to the present embodiment is not limited to this arrangement and may instead be built into the facility 3 as a voice generation unit. Furthermore, when the device is applied to a site where the operating status of a plurality of facilities is centrally managed by one management server, voice can also be output based on a control signal from the management server rather than from the facility 3.

FIG. 8A is a diagram showing another application example of the voice generation device according to the first embodiment.
A site such as a factory to which the voice generation device 1 according to the present embodiment can be applied often has a plurality of facilities 3 (3A, 3B), as shown in FIG. 8A, whose operating status is centrally managed by one management server 6. The management server 6 is communicably connected to each facility 3 and acquires, for example, information such as the temperature and pressure inside the facility 3, the distance from the facility 3 to a worker, and the presence or absence of a worker from the various sensors provided in each facility 3. Based on the information acquired from each facility 3, the management server 6 monitors the operating status of each facility 3 and manages the facilities so that each operates normally. When the management server 6 centrally manages the operating status of the plurality of facilities 3 in this way, the operation of the plurality of voice generation devices 1 (1A, 1B) individually applied to the facilities 3 can also be controlled and managed by the management server 6. If the operation of the plurality of voice generation devices 1 is centrally managed by the management server 6, voice data reporting that some abnormality has occurred in the facility 3A can, for example, be output toward the surroundings of the facility 3B. Therefore, when voice data reporting the abnormality has been output toward the surroundings of the facility 3A but the abnormality has not been dealt with for a certain period, workers and others around the other facility 3B can be notified of the abnormality in the facility 3A. This makes it possible to prevent a failure or similar problem in the facility 3A caused by a delay in dealing with its abnormality. In addition, when the plurality of facilities 3A and 3B operate in conjunction (cooperate) with each other, an abnormality that has occurred in one facility can, for example, be reported promptly to workers around the other facility, preventing a chain of facility abnormalities.
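The escalation behavior described for the management server 6 could be sketched as below. The description only states that other facilities can be notified when an abnormality is not dealt with "for a certain period"; the timeout value, the neighbor table, and the name `announcement_targets` are hypothetical choices made for illustration.

```python
def announcement_targets(origin, elapsed_s, handled, neighbors, timeout_s=60.0):
    """Return the facilities around which an abnormality message should be played.

    origin:    id of the facility where the abnormality occurred (e.g. "3A")
    elapsed_s: seconds since the abnormality was first announced
    handled:   True once a worker has dealt with the abnormality
    neighbors: dict mapping a facility id to the ids of facilities to escalate to
    timeout_s: assumed escalation delay; not specified in the description
    """
    if handled:
        return []                                  # nothing left to announce
    targets = [origin]                             # always announce at the origin
    if elapsed_s >= timeout_s:
        targets += neighbors.get(origin, [])       # escalate to other facilities
    return targets
```

For example, with `neighbors = {"3A": ["3B"]}`, an unhandled abnormality at 3A is announced only around 3A at first, and around both 3A and 3B once the assumed timeout has elapsed.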

FIG. 8B is a diagram showing yet another application example of the voice generation device according to the first embodiment.
When the voice generation device 1 is applied to a site where the operating status of a plurality of facilities 3 (3A, 3B) is centrally managed by one management server 6, a voice generation unit 600 corresponding to the voice generation device 1 may be provided in the management server 6, for example, as shown in FIG. 8B. This reduces the introduction and maintenance costs of voice generation compared with applying a voice generation device 1 to each of the plurality of facilities 3.

The voice generation device 1 according to the present embodiment is of course not limited to factories and can also be applied to, for example, devices that provide voice guidance in station premises, busy shopping districts, and the like.

[Second Embodiment]
This embodiment describes the configuration of a voice generation device, a voice generation method, and the like for the case where the present invention is applied to an e-learning system.

FIG. 9 is a diagram showing a configuration example of an e-learning system according to the second embodiment.
As shown in FIG. 9, in the e-learning system according to the present embodiment, a host computer 8 and a plurality of terminals (clients) 9 are connected by a communication network 10 such as the Internet. The host computer 8 is a computer used to create and provide teaching materials. Each of the plurality of terminals 9 is a computer that a learner uses to study with the teaching materials.

The host computer 8 operates as a voice generation device when creating or reproducing voice data used as teaching material. The host computer 8 includes a computer main body 80, a keyboard 81, a mouse 82, a display device 83, and a speaker 84. The computer main body 80 includes the processor 50, the main storage device 51, the auxiliary storage device 52, the communication interface device 55, and other elements of the computer hardware configuration shown in FIG. 7. The keyboard 81 and the mouse 82 correspond to the input device 53 in the hardware configuration shown in FIG. 7, and the display device 83 and the speaker 84 correspond to the output device 54.

To operate the host computer 8 as the voice generation device 1, the computer main body 80 is made to execute a voice data creation program. The voice data creation program creates voice data from character information (text data) that the operator enters using the keyboard 81 or the like. While the voice data creation program is running, the display device 83 displays, for example, a work window 85 such as that shown in FIGS. 9 and 10.

FIG. 10 is a diagram showing a configuration example of the work window displayed on the display device.
The work window 85 displayed on the display device 83 when voice data is created includes, for example, an input area 85a, a play button 85b, a save button 85c, a slider 85d, and a groove 85e, as shown in FIG. 10.

The input area 85a is an area that accepts character information entered using the keyboard 81 shown in FIG. 9 or the like as the character information for voice data creation, and displays it.

The play button 85b is used to convert the character information displayed in the input area 85a into voice data and reproduce it. The save button 85c is used to convert the character information displayed in the input area 85a into voice data and save it, that is, store it in a storage device as an electronic file.

The slider 85d is used to specify the degree of emphasis of the voice when the character information displayed in the input area 85a is converted into voice data and reproduced. The slider 85d can be moved left and right along the groove 85e. In the example shown in FIG. 10, the degree of emphasis is lowest when the slider 85d is moved to the left end of the groove 85e (calm) and increases as the slider approaches the right end (emphasis). When the slider 85d is moved along the groove 85e, the slider value changes according to the distance from the left end of the groove 85e. When the computer main body 80 creates voice data, it morphs the calm voice parameters and the emphasized voice parameters based on the slider value so that the degree of emphasis corresponds to the position of the slider 85d.
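As a rough sketch of this slider-driven morphing, the code below normalizes a slider position into a rate between 0 (calm) and 1 (emphasis) and blends two parameter vectors accordingly. It assumes that the slider value is proportional to the distance from the left end of the groove and that morphing is an element-wise linear blend. The names `slider_to_rate` and `morph_parameters`, the pixel coordinates, and the parameter values are hypothetical.

```python
def slider_to_rate(position, left, right):
    """Map a slider position along the groove to a rate in [0, 1].

    left and right are the coordinates of the groove's ends (calm and emphasis).
    """
    rate = (position - left) / (right - left)
    return min(1.0, max(0.0, rate))     # clamp positions outside the groove


def morph_parameters(calm, emphasis, rate):
    """Blend calm and emphasized voice parameters element by element."""
    return [(1.0 - rate) * c + rate * e for c, e in zip(calm, emphasis)]


# Example: slider one quarter of the way from "calm" to "emphasis".
rate = slider_to_rate(150, left=100, right=300)
params = morph_parameters([100.0, 0.5], [140.0, 1.0], rate)
```

In the device described here, the rate applied to voice quality and the rate applied to fundamental frequency and duration are determined separately, so a blend like this would be applied per parameter group rather than with one common rate.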

Next, a configuration example of the functional blocks when the computer main body 80 is operated as the voice generation device 1 is described with reference to FIGS. 11 and 12.

FIG. 11 is a functional block diagram of the voice generation device according to the second embodiment. FIG. 12 is a functional block diagram of the synthesized speech creation unit in the second embodiment.

As shown in FIG. 11, the voice generation device 1 (computer main body 80) according to the second embodiment includes an input data processing unit 120, a morphing rate determination unit 121, a conversion table 122, a synthesized speech creation unit 123, and a voice database 124. The voice generation device 1 further includes a display control unit 125 and a text database 126.

The input data processing unit 120 accepts text data entered from the input device (keyboard) 81 and position information of the slider 85d entered from the input device (mouse) 82. The input data processing unit 120 passes the entered text data to the display control unit 125 and stores it in the text database 126. It also passes the entered position information of the slider 85d (the slider value) to the display control unit 125. Furthermore, on receiving a signal from the mouse 82 or the like corresponding to an operation of pressing the play button 85b or the save button 85c, the input data processing unit 120 passes the slider value to the morphing rate determination unit 121 and passes the text data to the synthesized speech creation unit 123.

The display control unit 125 controls the display of the display device 83. For example, based on the text data and slider value received from the input data processing unit 120, the display control unit 125 changes what is displayed in the input area 85a and the position of the slider 85d in the work window 85 displayed on the display device 83.

The morphing rate determination unit 121 determines the morphing rates based on the slider value received from the input data processing unit 120, the conversion table 122, and information indicating accent phrase boundaries from the synthesized speech creation unit 123. The conversion table 122 is a table showing the correspondence between slider values and morphing rates. The information indicating accent phrase boundaries indicates whether the reproduction position of the voice data currently output to the speaker 84 is an accent phrase boundary.
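A conversion table of this kind might be looked up as in the sketch below. The description says only that the table relates slider values to morphing rates; the table entries, the use of linear interpolation between entries, and the names `CONVERSION_TABLE` and `lookup_rate` are assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical conversion table: (slider value, morphing rate), sorted by slider value.
CONVERSION_TABLE = [(0, 0.0), (50, 0.25), (100, 1.0)]


def lookup_rate(table, value):
    """Return the morphing rate for a slider value.

    Values between table entries are linearly interpolated, and values
    outside the table are clamped to the first or last entry.
    """
    xs = [x for x, _ in table]
    if value <= xs[0]:
        return table[0][1]
    if value >= xs[-1]:
        return table[-1][1]
    i = bisect_right(xs, value)          # index of the first entry above value
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (value - x0) / (x1 - x0)
```

To obtain the behavior described for this unit, the rate for voice quality would be looked up from the current slider value at every frame, whereas the rate for fundamental frequency and duration would be looked up only when the accent phrase boundary information indicates the start of a new accent phrase.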

The synthetic speech creation unit 123 creates synthetic speech based on the text data received from the input data processing unit 120 and the morphing rates determined by the morphing rate determination unit 121, and outputs the created synthetic speech to the speaker 84. When the synthetic speech is created in response to an operation of pressing the save button 85c, the synthetic speech creation unit 123 also stores the created synthetic speech in the voice database 124. In that case, the synthetic speech creation unit 123 stores the synthetic speech data in association with the text data stored in the text database 126.

The morphing rate determination unit 121 in the voice generation device 1 of the present embodiment determines the morphing rates for voice quality, fundamental frequency, and duration in the same manner as in the first embodiment. That is, like the morphing rate determination unit 101 shown in FIG. 2, the morphing rate determination unit 121 includes an instantaneous morphing rate determination unit, an accent phrase morphing rate determination unit, a voice quality morphing rate determination unit, a fundamental frequency morphing rate determination unit, and a duration morphing rate determination unit.

The synthetic speech creation unit 123 in the voice generation device 1 of the present embodiment, on the other hand, creates the voice data of the synthetic speech based on the text data and the morphing rates. The synthetic speech creation unit 123 of the present embodiment creates the voice data by a synthesis method based on a hidden Markov model (HMM), which is one of the known speech synthesis methods. As shown in FIG. 12, the synthetic speech creation unit 123 includes a language processing unit 123a, a calm speech parameter creation unit 123b, an emphasized speech parameter creation unit 123c, a morphing processing unit 123d, and an analysis and synthesis unit 123e. The synthetic speech creation unit 123 further includes calm speech HMM parameters 123f and emphasized speech HMM parameters 123g.

The language processing unit 123a converts the text data into phonetic text representing readings and accents.

The calm speech parameter creation unit 123b creates speech parameters for calm speech based on the phonetic text and the calm speech HMM parameters 123f. The emphasized speech parameter creation unit 123c creates speech parameters for emphasized speech based on the phonetic text and the emphasized speech HMM parameters 123g.

The morphing processing unit 123d morphs the speech parameters for calm speech and the speech parameters for emphasized speech at the morphing rates determined by the morphing rate determination unit 121, and thereby creates the speech parameters for the current frame.
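
As a rough illustration of this morphing step, the sketch below blends one frame of calm and emphasized speech parameters. The dictionary keys, the use of plain linear interpolation, and the convention that a rate of 0 means fully calm and 1 means fully emphasized are assumptions; the description leaves the morphing method to known techniques.

```python
def morph_frame(calm, emphasized, rates):
    """Blend one frame of calm and emphasized speech parameters.

    `calm` and `emphasized` are dicts with hypothetical keys "spectrum"
    (list of floats), "f0" (Hz) and "duration" (seconds). `rates` maps
    "voice_quality", "f0" and "duration" to morphing rates in [0, 1], where
    0 yields the calm parameters and 1 the emphasized ones (an assumption).
    """
    def lerp(a, b, r):
        # Linear interpolation between a (rate 0) and b (rate 1).
        return (1.0 - r) * a + r * b

    return {
        # Voice quality is carried by the spectral parameters.
        "spectrum": [
            lerp(a, b, rates["voice_quality"])
            for a, b in zip(calm["spectrum"], emphasized["spectrum"])
        ],
        "f0": lerp(calm["f0"], emphasized["f0"], rates["f0"]),
        "duration": lerp(calm["duration"], emphasized["duration"], rates["duration"]),
    }
```

Keeping a separate rate for each parameter type is what allows voice quality to follow the slider immediately while the fundamental frequency stays fixed within an accent phrase.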

The analysis and synthesis unit 123e converts the speech parameters for the current frame into a speech waveform by analysis-synthesis. When the current frame contains the head of an accent phrase, the analysis and synthesis unit 123e notifies the morphing rate determination unit 121 of information indicating that fact.

Next, the voice generation processing in the voice generation device 1 (host computer 8) of the present embodiment will be described.

FIG. 13A is a flowchart (part 1) illustrating the voice generation processing according to the second embodiment. FIG. 13B is a flowchart (part 2) illustrating the voice generation processing according to the second embodiment.

When the play button 85b or the save button 85c of the work window 85 shown in FIG. 10 is pressed, the voice generation device 1 according to the present embodiment converts the text displayed in the input area 85a into voice data and reproduces it. At this time, as shown in FIG. 13A, the voice generation device 1 first initializes the reproduction position in the text data and the association between slider values and reproduction positions (step S21). Step S21 is performed by the accent phrase morphing rate determination unit (not shown) of the morphing rate determination unit 121.

Next, the voice generation device 1 acquires the current slider value and passes it to the morphing rate determination unit 121, and passes the text data to the synthetic speech creation unit 123 (step S22). Step S22 is performed by the input data processing unit 120. The input data processing unit 120 passes the acquired slider value to the instantaneous morphing rate determination unit and the accent phrase morphing rate determination unit (neither shown) of the morphing rate determination unit 121, and passes the text data to the language processing unit 123a of the synthetic speech creation unit 123.

After step S22, the voice generation device 1 performs morphing rate determination processing (step S23) and speech parameter creation processing. The morphing rate determination processing (step S23) is performed by the morphing rate determination unit 121. The morphing rate determination unit 121 determines the morphing rates for voice quality, fundamental frequency, and duration by the same processing as steps S3a and S3b and steps S4a to S4e shown in FIG. 4A. In step S23 of the present embodiment, however, the slider value is used in place of the noise level. The morphing rate determination unit 121 passes the determined morphing rates for voice quality, fundamental frequency, and duration to the morphing processing unit 123d of the synthetic speech creation unit 123.
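
The determination in step S23 might be sketched as follows. The class and method names are hypothetical, the details of steps S3a to S4e are not reproduced, and the `duration_follows` switch reflects the two options in Supplementary Notes 2 and 3, where the duration morphing rate equals either the fundamental frequency rate or the voice quality rate.

```python
class MorphingRateDeterminer:
    """Sketch of morphing rate determination unit 121 (names are hypothetical)."""

    def __init__(self, to_rate, duration_follows="f0"):
        self.to_rate = to_rate                    # slider value -> morphing rate
        self.duration_follows = duration_follows  # "f0" or "voice_quality"
        self.phrase_head_slider = None            # slider value latched at phrase head

    def start_accent_phrase(self, slider_value):
        """Latch the slider value when the head of an accent phrase is reproduced."""
        self.phrase_head_slider = slider_value

    def determine(self, current_slider):
        """Return the three morphing rates for the frame about to be reproduced."""
        if self.phrase_head_slider is None:
            # First frame of playback: treat it as the head of an accent phrase.
            self.phrase_head_slider = current_slider
        voice_quality = self.to_rate(current_slider)  # follows the slider instantly
        f0 = self.to_rate(self.phrase_head_slider)    # fixed within the accent phrase
        duration = f0 if self.duration_follows == "f0" else voice_quality
        return {"voice_quality": voice_quality, "f0": f0, "duration": duration}
```

For instance, with `to_rate = lambda s: s / 100`, calling `determine(20)` and then `determine(80)` within the same accent phrase changes the voice quality rate from 0.2 to 0.8 while the fundamental frequency rate stays at 0.2.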

The speech parameter creation processing, on the other hand, is performed by the synthetic speech creation unit 123. In parallel with the morphing rate determination processing of step S23, the synthetic speech creation unit 123 first performs processing for converting the text data into phonetic text representing readings and accents (step S24a) and processing for creating speech parameters for calm speech and emphasized speech (step S24b).

Step S24a is performed by the language processing unit 123a, which converts the text data into phonetic text by any known conversion method.

Step S24b is performed by the calm speech parameter creation unit 123b and the emphasized speech parameter creation unit 123c. The calm speech parameter creation unit 123b creates speech parameters for calm speech, which has the lowest degree of emphasis, based on the phonetic text and the calm speech HMM parameters 123f. The emphasized speech parameter creation unit 123c creates speech parameters for emphasized speech, which has the highest degree of emphasis, based on the phonetic text and the emphasized speech HMM parameters 123g. The processing of step S24b is performed by any known method of creating speech parameters based on a hidden Markov model. The calm speech parameter creation unit 123b and the emphasized speech parameter creation unit 123c each pass the created speech parameters to the morphing processing unit 123d.

Upon receiving the speech parameters and the morphing rates, the morphing processing unit 123d creates the speech parameters for a frame based on them, as shown in FIG. 13B (step S25). The morphing processing unit 123d creates the speech parameters for the frame by any morphing process used in known speech synthesis processing, and passes the created speech parameters to the analysis and synthesis unit 123e.

The analysis and synthesis unit 123e converts the speech parameters for the frame into the voice data (speech waveform) of the frame by analysis-synthesis (step S26). The conversion uses any conversion method employed in known speech synthesis processing.

The analysis and synthesis unit 123e then outputs the obtained voice data (step S27). The analysis and synthesis unit 123e outputs the obtained voice data to the speaker 84. When the speech synthesis processing was started by pressing the save button 85c of the work window 85, the analysis and synthesis unit 123e also stores the voice data in the voice database 124.

After outputting the voice data, the analysis and synthesis unit 123e checks whether the frame has reached the end position of the text data (step S28). If the frame has not reached the end position (step S28; No), the analysis and synthesis unit 123e next checks whether the frame contains an accent phrase boundary (step S29). If it does (step S29; Yes), the analysis and synthesis unit 123e, in cooperation with the morphing rate determination unit 121, updates the slider value recorded for the time at which the head of the accent phrase containing the reproduction position was reproduced to the current slider value (step S30). The analysis and synthesis unit 123e then advances to the next frame (step S31) and causes the input data processing unit 120 to perform step S22. Thereafter, the voice generation device 1 repeats steps S22 to S31 until the frame reaches the end position of the text data.
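
The control flow of steps S21 to S31 can be summarized in a short sketch. The function name, the frame representation, and the injected callables are hypothetical stand-ins for the units described above, and the slider value read in step S22 is reused as the "current" value in step S30 for simplicity.

```python
def run_generation(frames, read_slider, to_rate, synthesize, output):
    """Frame loop corresponding to steps S21 to S31 (a sketch).

    `frames` is a list of dicts with a hypothetical boolean key
    "has_phrase_boundary". The other arguments are caller-supplied callables
    standing in for the slider input, the conversion table, the morphing and
    analysis-synthesis steps, and the speaker output.
    """
    phrase_head_slider = None                  # S21: initialize
    for frame in frames:                       # S28/S31: advance until the end
        slider = read_slider()                 # S22: current slider value
        if phrase_head_slider is None:
            phrase_head_slider = slider        # first frame starts a phrase
        rates = {                              # S23: determine morphing rates
            "voice_quality": to_rate(slider),
            "f0": to_rate(phrase_head_slider),
        }
        output(synthesize(frame, rates))       # S25 to S27
        if frame["has_phrase_boundary"]:       # S29
            phrase_head_slider = slider        # S30: latch the slider value
```

Because the latch is updated only after a frame containing an accent phrase boundary has been output, a slider movement in the middle of a phrase affects the fundamental frequency only from the next phrase onward.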

When the frame has reached the end position of the text data (step S28; Yes), the analysis and synthesis unit 123e outputs the voice data of the last frame and ends the processing. The voice generation device 1 then enters a standby state. When the voice generation device 1 in the standby state receives a signal corresponding to an operation of pressing the play button 85b or the save button 85c of the work window 85, it again performs the generation and output processing for the text data.

As described above, in the voice generation processing of the present embodiment, when text data is converted into voice data (a speech waveform), the morphing rate for voice quality, which correlates with the degree of emphasis of the speech such as how strongly the voice is projected, is determined based on the current slider value. A desired section of the voice data can therefore be emphasized easily. For example, as shown in FIG. 10, when the sentence 「Ｃ言語ではポインタが重要です。」 ("Pointers are important in the C language.") is entered in the input area 85a of the work window 85 and reproduced, the sentence is output from the speaker 84 with a degree of emphasis corresponding to the position of the slider 85d. If the slider 85d is moved to the right at the head of the accent phrase 「ポインタが」 ("pointers") and returned to the left at its end, the portion 「ポインタが」 is emphasized.

In addition, in the voice generation processing of the present embodiment, the morphing rate for the fundamental frequency, which correlates with accent, is determined based on the slider value at the time the head of the accent phrase was reproduced. The accent therefore does not change even if the slider value changes while the accent phrase is being reproduced, which prevents the accent of the phrase from shifting and making the reproduced speech hard to understand. For example, when the text data 「Ｃ言語ではポインタが重要です。」 is reproduced so that the accent phrase 「ポインタが」 is emphasized, the accent does not change even if the position of the slider 85d changes while 「ポインタが」 is being reproduced.

Although the present embodiment has been described using an example in which the speech parameters for calm and emphasized speech are created based on a hidden Markov model, the two sets of speech parameters used for morphing are not limited to this and may be created by other methods.

Regarding the embodiments described above, including each example, the following supplementary notes are further disclosed.
(Supplementary Note 1)
A voice generation device comprising:
a morphing rate determination unit that determines two or more morphing rates, including a voice quality morphing rate and a fundamental frequency morphing rate, based on an input value from an input device; and
a voice reproduction unit that reproduces voice data based on the morphing rates,
wherein the morphing rate determination unit includes:
a first morphing rate determination unit that determines the voice quality morphing rate based on the input value at the time each frame of the voice data is reproduced; and
a second morphing rate determination unit that determines the fundamental frequency morphing rate based on the input value at the time the first frame of the accent phrase containing the frame was reproduced.
(Supplementary Note 2)
The voice generation device according to Supplementary Note 1, wherein
the morphing rate determination unit sets the duration morphing rate to the fundamental frequency morphing rate determined by the second morphing rate determination unit.
(Supplementary Note 3)
The voice generation device according to Supplementary Note 1, wherein
the morphing rate determination unit sets the duration morphing rate to the voice quality morphing rate determined by the first morphing rate determination unit.
(Supplementary Note 4)
The voice generation device according to Supplementary Note 1, wherein
the input value includes the value of a noise level in a predetermined area outside the device.
(Supplementary Note 5)
The voice generation device according to Supplementary Note 1, wherein
the input value is a value indicating the position, within a predetermined range, of a slider that is movable within that range.
(Supplementary Note 6)
The voice generation device according to Supplementary Note 1, further comprising
a storage unit that stores a plurality of pieces of voice data that differ in their combination of the morphing rates,
wherein the voice reproduction unit includes
a reproduction control unit that, for each reproduction position of the voice data, reads voice data from the storage unit based on the combination of the morphing rates determined by the morphing rate determination unit and outputs it from that reproduction position.
(Supplementary Note 7)
The voice generation device according to Supplementary Note 1, further comprising
a synthetic speech creation unit that creates synthetic speech based on text data in a predetermined language,
wherein the synthetic speech creation unit includes:
a language processing unit that converts the text data in the predetermined language into phonetic text;
a speech parameter creation unit that creates two or more sets of speech parameters for the phonetic text based on two or more sets of conversion parameters for different voice types;
a morphing processing unit that morphs the two or more sets of speech parameters based on the morphing rates to create speech parameters for a synthesis position; and
an output unit that converts the speech parameters created by the morphing processing unit into voice data and outputs the voice data.
(Supplementary Note 8)
A voice generation method in which a computer executes processing comprising:
determining a voice quality morphing rate based on the current reproduction position in voice data to be output and an input value corresponding to that reproduction position;
determining a fundamental frequency morphing rate based on the input value at the time the head of the accent phrase containing the reproduction position was reproduced; and
generating the voice data based on two or more morphing rates including the determined morphing rates.
(Supplementary Note 9)
The voice generation method according to Supplementary Note 8, wherein
the duration morphing rate is set to the same morphing rate as the fundamental frequency morphing rate, and
the voice data is generated based on the morphing rates for the voice quality, the fundamental frequency, and the duration.
(Supplementary Note 10)
The voice generation method according to Supplementary Note 8, wherein
the duration morphing rate is set to the same morphing rate as the voice quality morphing rate, and
the voice data is generated based on the morphing rates for the voice quality, the fundamental frequency, and the duration.
(Supplementary Note 11)
The voice generation method according to Supplementary Note 8, wherein,
for each reproduction position of the voice data, the voice data is generated by selecting, based on the voice quality and fundamental frequency morphing rates, one of a plurality of pieces of voice data prepared in advance that differ in their combination of voice quality and fundamental frequency morphing rates.
(Supplementary Note 12)
A program that causes a computer to execute processing comprising:
determining a voice quality morphing rate based on the current reproduction position in voice data to be output and an input value corresponding to that reproduction position;
determining a fundamental frequency morphing rate based on the input value at the time the head of the accent phrase containing the reproduction position was reproduced; and
generating the voice data based on two or more morphing rates including the determined morphing rates.

Description of Symbols
1 Voice generation device
100 Input value processing unit
101, 121 Morphing rate determination unit
102, 122 Conversion table
103 Voice reproduction unit
123 Synthetic speech creation unit
104, 124 Voice database
120 Input data processing unit
125 Display control unit
126 Text database
101a Instantaneous morphing rate determination unit
101b Accent phrase morphing rate determination unit
101c Voice quality morphing rate determination unit
101d Fundamental frequency morphing rate determination unit
101e Duration morphing rate determination unit
103a Voice data selection unit
103b Reproduction control unit
123a Language processing unit
123b Calm speech parameter creation unit
123c Emphasized speech parameter creation unit
123d Morphing processing unit
123e Analysis and synthesis unit
123f Calm speech HMM parameters
123g Emphasized speech HMM parameters
2 Microphone
3, 3A, 3B Equipment
4, 84 Speaker
5 Computer
50 Processor
51 Main storage device
52 Auxiliary storage device
53 Input device
54 Output device
55 Communication interface device
6 Management server
8 Host computer
80 Computer main body
81 Keyboard
82 Mouse
83 Display device
85 Work window
85a Input area
85b Play button
85c Save button
85d Slider
85e Groove
9 Client
10 Communication network

Claims

A voice generation device comprising:
a morphing rate determination unit that determines two or more morphing rates, including a voice quality morphing rate and a fundamental frequency morphing rate, based on an input value from an input device; and
a voice reproduction unit that reproduces voice data based on the morphing rates,
wherein the morphing rate determination unit includes:
a first morphing rate determination unit that determines the voice quality morphing rate based on the input value at the time each frame of the voice data is reproduced; and
a second morphing rate determination unit that determines the fundamental frequency morphing rate based on the input value at the time the first frame of the accent phrase containing the frame was reproduced.

The voice generation device according to claim 1, wherein
the morphing rate determination unit sets the duration morphing rate to the fundamental frequency morphing rate determined by the second morphing rate determination unit.

The voice generation device according to claim 1, wherein
the morphing rate determination unit sets the duration morphing rate to the voice quality morphing rate determined by the first morphing rate determination unit.

The voice generation device according to claim 1, wherein
the input value includes the value of a noise level in a predetermined area outside the device.

The voice generation device according to claim 1, further comprising
a storage unit that stores a plurality of pieces of voice data that differ in their combination of the morphing rates,
wherein the voice reproduction unit includes
a reproduction control unit that, for each reproduction position of the voice data, reads voice data from the storage unit based on the combination of the morphing rates determined by the morphing rate determination unit and outputs it from that reproduction position.

The voice generation device according to claim 1, further comprising
a synthetic speech creation unit that creates synthetic speech based on text data in a predetermined language,
wherein the synthetic speech creation unit includes:
a language processing unit that converts the text data in the predetermined language into phonetic text;
a speech parameter creation unit that creates two or more sets of speech parameters for the phonetic text based on two or more sets of conversion parameters for different voice types;
a morphing processing unit that morphs the two or more sets of speech parameters based on the morphing rates to create speech parameters for a synthesis position; and
an output unit that converts the speech parameters created by the morphing processing unit into voice data and outputs the voice data.

A voice generation method in which a computer executes processing comprising:
determining a voice quality morphing rate based on the current reproduction position in voice data to be output and an input value corresponding to that reproduction position;
determining a fundamental frequency morphing rate based on the input value at the time the head of the accent phrase containing the reproduction position was reproduced; and
generating the voice data based on two or more morphing rates including the determined morphing rates.

A program that causes a computer to execute processing comprising:
determining a voice quality morphing rate based on the current reproduction position in voice data to be output and an input value corresponding to that reproduction position;
determining a fundamental frequency morphing rate based on the input value at the time the head of the accent phrase containing the reproduction position was reproduced; and
generating the voice data based on two or more morphing rates including the determined morphing rates.