JP6291808B2 - Speech synthesis apparatus and method

Info

Publication number: JP6291808B2
Application number: JP2013244525A
Authority: JP
Inventors: 充伸神沼; 健太南
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2013-11-27
Filing date: 2013-11-27
Publication date: 2018-03-14
Anticipated expiration: 2033-11-27
Also published as: JP2015102773A

Description

The present invention relates to a speech generation device capable of reproducing speech in which the degree of familiarity felt by a listener is increased or decreased, and to a speech synthesis device and method capable of increasing or decreasing the degree of familiarity felt by a listener.

In recent years, voice guidance that explains to an operator by voice how to operate an electronic device has become widespread. The voice used for voice guidance often has a flat prosody and carries no emotion. Patent Document 1 describes a speech synthesis device that adds emotion to emotionless speech.

Patent Document 1: JP-A-7-72900

The conventional speech synthesis device described in Patent Document 1 adds emotion to emotionless speech by training a neural network to convert the parameters of emotionless speech into emotional parameters. The speech synthesis device of Patent Document 1 therefore has the problem that it requires a complicated configuration and procedure.

The present invention has been made in view of this problem, and its object is to provide a speech synthesis device and method capable of effectively increasing the degree of familiarity of speech with a simple configuration and procedure.

The present invention sets, as a start position, any position within the last phrase of a sentence that is later than the first phoneme of that phrase, and produces speech containing prosodic information in which the frequency from the start position onward is raised by a constant frequency. This increases the degree of familiarity that a listener feels in the speech.

According to the speech synthesis device and method of the present invention, the degree of familiarity of speech can be effectively increased with a simple configuration and procedure.

FIG. 1 is a block diagram showing the speech generation device and speech synthesis device of the first embodiment.
FIG. 2 is a block diagram showing the speech generation device and speech synthesis device of the second embodiment.
FIG. 3 is a block diagram showing the speech generation device and speech synthesis device of the third embodiment.
FIG. 4 is a diagram showing the amplitude waveform of a first example sentence, and the phrases and phonemes that constitute it, for explaining the speech synthesis method of the embodiments.
FIG. 5 is a diagram showing the amplitude waveform of a second example sentence, and the phrases and phonemes that constitute it, for explaining the speech synthesis method of the embodiments.
FIG. 6 is a diagram showing the amplitude and frequency characteristics when the frequency of the last phoneme of the last phrase in a sentence is raised.
FIG. 7 is a diagram showing an example in which the frequency is raised with the last phoneme of the last phrase in a sentence as the start position.
FIG. 8 is a diagram showing an example in which the frequency is raised with the phoneme immediately before the last phoneme of the last phrase in a sentence as the start position.
FIG. 9 is a diagram showing an example in which the frequency is raised with the vowel of the last phoneme of the last phrase in a sentence as the start position.
FIG. 10 is a diagram for explaining the case where, in the frequency characteristic indicated by the prosody of a sentence, the extreme point or inflection point closest to the last phoneme of the last phrase is used as the start position for raising the frequency.

Hereinafter, the speech generation device, speech synthesis device, and method of each embodiment will be described with reference to the accompanying drawings. The speech generation device, speech synthesis device, and method of each embodiment can add familiarity to speech and thereby increase its degree of familiarity. They can also decrease the degree of familiarity of speech. The following description focuses on the operation of increasing the degree of familiarity.

<Speech generation device and speech synthesis device of the first embodiment>
The speech generation device and speech synthesis device of the first embodiment, shown in FIG. 1, are a configuration example that increases the degree of familiarity of speech when speech data is generated by speech synthesis. They constitute a speech synthesis device that generates speech data based on text data representing a sentence.

In FIG. 1, text data representing a predetermined sentence is input to the prosody information generation unit 11, the prosody information correction unit 13, and the synthesis unit 14. The text data is, for example, ASCII code.

The prosody dictionary 12 holds a plurality of prosody information patterns. Prosody information is the part of speech other than voice quality, that is, the part that forms accent, rhythm, and the like. The prosody information generation unit 11 reads from the prosody dictionary 12 the prosody information pattern suited to each phrase of the sentence in the input text data and generates the prosody information of the sentence. The prosody information is input to the prosody information correction unit 13.

For example, suppose that the sentence represented by the text data is "…を設定いたします" ("I will set …"), and that a short interval corresponding to a breath is to be placed between the phrase "…を" and the phrase "設定", and between the phrase "設定" and the phrase "いたします". In this case, the interval may be placed in the text data itself, or the prosody information generation unit 11 may generate prosody information that already includes the interval.

The prosody information correction unit 13 corrects the prosody information so as to increase the degree of familiarity of the speech. The specific way in which the prosody information correction unit 13 corrects the prosody information is described in detail later. The corrected prosody information is input to the synthesis unit 14.

The sound path dictionary 15 holds a plurality of sound path information patterns. Sound path information is the voice quality part of speech. The sound path dictionary 15 may hold the sound path information patterns in units of sentences, in units of words, or in units of phonemes.

The synthesis unit 14 reads the sound path information pattern suited to the sentence of the input text data and combines the corrected prosody information with the sound path information to generate speech data as a digital signal. The speech data is converted into an analog signal by the D/A converter 16 and output as sound from the speaker 17.

In the speech generation device constituted by the speech synthesis device shown in FIG. 1, the portion from the prosody information generation unit 11 through the sound path dictionary 15 can be implemented by a microcomputer that includes an arithmetic processing unit (microprocessor) and a storage device.

<Speech generation device and speech synthesis device of the second embodiment>
The speech generation device and speech synthesis device of the second embodiment, shown in FIG. 2, are a configuration example that increases the degree of familiarity of speech when the speech data already exists as an audio file. They constitute a speech processing device that corrects the prosody information of the speech data based on an audio file containing the speech data of a sentence, text data representing the sentence, and timing data for the text data.

In FIG. 2, an audio file containing the speech data of a sentence is input to the prosody/vocal tract separation unit 21. The audio file is, for example, in WAV format, but is not limited to WAV format.

The prosody/vocal tract separation unit 21 separates the speech data of the audio file into prosody information and vocal tract information. The vocal tract information holding unit 22 holds the vocal tract information. The prosody information holding unit 23 holds the prosody information.

Text data representing the speech data of the audio file, together with timing data, is input to the correction position detection unit 26. The timing data indicates time positions in the speech data. With the timing data, the utterance start position of each phoneme can be set, and an interval corresponding to a breath can be set between phrases. Based on the text data and the timing data, the correction position detection unit 26 detects the position from which the prosody information correction unit 24 is to correct the prosody information.

The prosody information correction unit 24 increases the degree of familiarity of the speech by correcting the prosody information from the correction position detected by the correction position detection unit 26. The corrected prosody information is input to the synthesis unit 14. How the correction position detection unit 26 detects the correction position, and how the prosody information correction unit 24 performs the correction, are described in detail later.

The synthesis unit 14 combines the corrected prosody information with the vocal tract information held in the vocal tract information holding unit 22 to generate speech data as a digital signal. The speech data is converted into an analog signal by the D/A converter 27 and output as sound from the speaker 28.

In the speech generation device and speech synthesis device constituted by the speech processing device shown in FIG. 2, the portion from the prosody/vocal tract separation unit 21 through the correction position detection unit 26 can be implemented by a microcomputer that includes an arithmetic processing unit and a storage device.

<Speech generation device and speech synthesis device of the third embodiment>
The speech generation device and speech synthesis device of the third embodiment, shown in FIG. 3, are a configuration example that increases the degree of familiarity of speech spoken by a person. In FIG. 3, parts identical to those in FIG. 2 are given the same reference numerals, and their description is omitted where appropriate.

The speech generation device and speech synthesis device of the third embodiment constitute a speech processing device that corrects the prosody information of speech data based on speech data obtained by picking up, with a microphone, a sentence spoken by a person, and on text data representing the sentence, generated by speech recognition of that speech data.

In FIG. 3, the microphone 31 picks up speech uttered by a person and outputs an analog speech signal. The A/D converter 32 converts the analog speech signal into digital speech data. The speech data is input to the prosody/vocal tract separation unit 21 and the speech recognition unit 33.

The speech recognition unit 33 recognizes the speech in the input speech data and outputs text data. The text data is input to the correction position detection unit 34. The correction position detection unit 34 detects the position from which the prosody information is to be corrected, using, for example, morphological analysis. The prosody information correction unit 24 increases the degree of familiarity of the speech by correcting the prosody information from the correction position detected by the correction position detection unit 34.

In the speech generation device and speech synthesis device constituted by the speech processing device shown in FIG. 3, the parts other than the microphone 31, the A/D converter 32, the D/A converter 27, and the speaker 28 can be implemented by a microcomputer that includes an arithmetic processing unit and a storage device.

<Speech synthesis method of the embodiments>
This section describes how the prosody information correction unit 13 in FIG. 1 and the prosody information correction unit 24 in FIGS. 2 and 3 correct the prosody information, and how the correction position detection unit 26 in FIG. 2 and the correction position detection unit 34 in FIG. 3 detect the correction position.

FIG. 4(a) shows the amplitude waveform when the speech "経由地にします" ("I will set it as a waypoint") is generated as a first example sentence. As shown in FIG. 4(b), in the romanized form "KeIYuChiNiShiMaSu", Ke, I, Yu, Chi, Ni, Shi, Ma, and Su are the phonemes with phoneme numbers 1 to 8, respectively. These phonemes are located at their respective time positions between, for example, 2.22 seconds and 2.85 seconds.

"KeIYuChi" is phrase Ph1, "Ni" is phrase Ph2, and "ShiMaSu" is phrase Ph3. In the speech synthesis method of the embodiments, when speech is generated for a sentence having a plurality of phrases, the start position is set to any position within the last phrase that is later than the first phoneme of that phrase, and the frequency from the start position onward is raised by a constant frequency. This increases the degree of familiarity of the speech.

In the example of FIG. 4(b), the prosody information correction unit 13 in FIG. 1 and the prosody information correction unit 24 in FIGS. 2 and 3 set the start position to any position within phrase Ph3, the last phrase, that is later than its first phoneme "Shi". The prosody information correction units 13 and 24 raise the frequency from that start position onward by a constant frequency. The correction position detection unit 26 in FIG. 2 and the correction position detection unit 34 in FIG. 3 detect phrase Ph3 as the last phrase.
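The correction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prosody is assumed to be available as a sampled fundamental-frequency contour (a list of times and a list of values in Hz), and the function name `shift_f0_after` and the sample values are hypothetical, not taken from the patent. A negative `delta_hz` gives the lowering variant.

```python
from typing import List


def shift_f0_after(times: List[float], f0_hz: List[float],
                   start_time: float, delta_hz: float) -> List[float]:
    """Return a copy of the contour with delta_hz added at and after start_time.

    Samples before start_time are left untouched, so only the tail of the
    last phrase is raised (or lowered when delta_hz is negative).
    """
    if len(times) != len(f0_hz):
        raise ValueError("times and f0_hz must have the same length")
    return [f + delta_hz if t >= start_time else f
            for t, f in zip(times, f0_hz)]


# Invented contour for illustration only; the start position 3.49 s and the
# 40 Hz shift mirror the kind of values used in the description.
times = [3.30, 3.35, 3.40, 3.45, 3.49, 3.55, 3.60]
f0 = [220.0, 215.0, 210.0, 205.0, 200.0, 195.0, 190.0]
raised = shift_f0_after(times, f0, start_time=3.49, delta_hz=40.0)
lowered = shift_f0_after(times, f0, start_time=3.49, delta_hz=-40.0)
```

Only the frequency values change; the amplitude envelope is not part of the contour and is left as it was.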

FIGS. 5(a) and 5(b) show another example. FIG. 5(a) shows the amplitude waveform when the speech "ゆっくり楽しんできて下さいね" ("Please take your time and enjoy yourself") is generated as a second example sentence.

As shown in FIG. 5(b), in the romanized form "YuKkuRiTaNoShiNDeKiTeKuDaSaINe" of "ゆっくり楽しんできて下さいね", Yu, Kku, Ri, Ta, No, Shi, N, De, Ki, Te, Ku, Da, Sa, I, and Ne are the phonemes with phoneme numbers 1 to 15, respectively. These phonemes are located at their respective time positions between, for example, 2.22 seconds and 3.49 seconds.

"YuKkuRi" is phrase Ph1, "TaNoShiNDe" is phrase Ph2, "KiTe" is phrase Ph3, and "KuDaSaINe" is phrase Ph4. In the example of FIG. 5(b), the prosody information correction unit 13 in FIG. 1 and the prosody information correction unit 24 in FIGS. 2 and 3 set the start position to any position within the last phrase Ph4 that is later than its first phoneme "Ku". The prosody information correction units 13 and 24 increase the degree of familiarity of the speech by raising the frequency from that start position onward by a constant frequency.
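As a small illustration of which phonemes qualify as a start position, the following Python sketch lists the candidates for both example sentences. The representation (a sentence as a list of phrases, each a list of romanized phoneme labels) and the function name `start_candidates` are assumptions made for this example.

```python
from typing import List


def start_candidates(phrases: List[List[str]]) -> List[str]:
    """Return the phonemes of the last phrase that may serve as start position.

    The first phoneme of the last phrase is excluded. A sentence with a
    single phrase is handled the same way, treating it as the last phrase.
    """
    if not phrases:
        raise ValueError("sentence has no phrases")
    return phrases[-1][1:]


# Phrase segmentation of the two example sentences from the description.
first_example = [["Ke", "I", "Yu", "Chi"], ["Ni"], ["Shi", "Ma", "Su"]]
second_example = [["Yu", "Kku", "Ri"], ["Ta", "No", "Shi", "N", "De"],
                  ["Ki", "Te"], ["Ku", "Da", "Sa", "I", "Ne"]]

print(start_candidates(first_example))   # candidates within Ph3
print(start_candidates(second_example))  # candidates within Ph4
```

If the last phrase had only one phoneme, the list would be empty; the description does not cover that case, so the sketch leaves it to the caller.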

FIGS. 6(a) and 6(b) show the amplitude and frequency characteristics when the frequency of the phoneme "Ne", the ending of phrase Ph4 in FIG. 5(b), is raised. The black dots in FIG. 6(b) indicate extreme points (local maxima or minima) or inflection points of the frequency characteristic. The positions of the black dots do not necessarily coincide with phoneme positions, although a phoneme position is often an extreme point or inflection point.

FIGS. 6(a) and 6(b) show the case where the frequency is raised from the consonant N of the phoneme "Ne" onward. In FIG. 6(b), the broken line shows the characteristic without the frequency raised, and the solid line shows the characteristic with the frequency raised. In this example, the prosodic frequency of the phoneme "Ne" portion is raised by 40 Hz.

Raising the frequency does not affect the amplitude characteristic shown in FIG. 6(a). The amplitude characteristic is therefore the same whether or not the frequency is raised.

FIG. 7 corresponds to FIGS. 6(a) and 6(b) and shows an example in which the last phoneme "Ne" of the last phrase Ph4 is the start position. In FIG. 7, the time position of the phoneme "Ne" in phrase Ph4 is set to 3.49. This time position 3.49 is the position of the consonant N, so the frequency is raised from the consonant N of phrase Ph4 onward.

FIG. 8 shows an example in which the phoneme "I", immediately before the last phoneme of the last phrase Ph4, is the start position. As a start position later than the first phoneme "Ku" of phrase Ph4, the last phoneme "Ne", which is the ending, is suitable, as shown in FIG. 7. However, if the ending is pronounced weakly, using the ending as the start position has little effect. In that case, the phoneme "I" immediately before the ending should be used as the start position, as shown in FIG. 8.
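The fallback from the ending to the preceding phoneme can be sketched as follows. The description does not say how a weakly pronounced ending is detected, so the criterion used here (peak amplitude of the last phoneme below a threshold) is purely an assumption for illustration, as are the function name `choose_start_index` and the sample values.

```python
from typing import List


def choose_start_index(peak_amplitudes: List[float],
                       weak_threshold: float) -> int:
    """Pick the index, within the last phrase, of the start phoneme.

    peak_amplitudes[i] is the peak amplitude of the i-th phoneme of the last
    phrase. The last phoneme is preferred; if it is weaker than
    weak_threshold (hypothetical criterion), the phoneme just before it is
    used instead, provided that this is not the first phoneme of the phrase.
    """
    last = len(peak_amplitudes) - 1
    if last < 1:
        raise ValueError("the last phrase needs at least two phonemes")
    ending_is_weak = peak_amplitudes[last] < weak_threshold
    if ending_is_weak and last - 1 >= 1:
        return last - 1
    return last


# Invented amplitudes for the five phonemes Ku, Da, Sa, I, Ne of phrase Ph4.
strong_ending = choose_start_index([0.8, 0.7, 0.6, 0.5, 0.4], weak_threshold=0.2)
weak_ending = choose_start_index([0.8, 0.7, 0.6, 0.5, 0.1], weak_threshold=0.2)
```

With the strong ending the index of "Ne" is chosen; with the weak ending the index of "I" is chosen, matching the two cases of FIG. 7 and FIG. 8.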

Although not illustrated, phrase Ph4 also contains the phonemes "Da" and "Sa" between the first phoneme "Ku" and the ending. The phoneme "Sa" or "Da" can also be used as the start position.

Note that when the start position is set at the phoneme "Da" or later, rather than at the first phoneme "Ku" of phrase Ph4, the speech often gives a soft impression.

The example in FIG. 9, like FIG. 4(b), shows "経由地にします" ("KeIYuChiNiShiMaSu"). In FIG. 9, the time position of the consonant S of the phoneme "Su" is set to 2.85 and the time position of the vowel u is set separately to 2.90. In such a case, the frequency may be raised from the vowel u of the last phoneme "Su" onward.

The same applies when the start position is a phoneme other than the first phoneme of the last phrase and earlier than the last phoneme: if the time position of the phoneme's consonant and that of its vowel are set separately, either the consonant or the vowel may be used as the start position.

With reference to FIG. 10, a more detailed and preferable speech synthesis method is described for raising the frequency from a start position that lies within the last phrase of a sentence and later than the first phoneme of that phrase. The "…いね" ("…iNe") portion of the sentence "ゆっくり楽しんできて下さいね" is used as an example. Assume that the prosody of the speech in the "…いね" portion has the frequency characteristic shown in FIG. 10(a). The frequency characteristic is drawn schematically for simplicity.

In FIGS. 10(a) to 10(c), the black dots p1 to p6 indicate extreme points or inflection points, as in FIG. 6. The positions of points p1 to p6 do not necessarily coincide with phoneme positions, although a phoneme position is often an extreme point or inflection point.

When the last phoneme "ね" ("Ne") of the last phrase is selected, the prosody information correction unit 13 in FIG. 1 and the prosody information correction unit 24 in FIGS. 2 and 3 can use, as the start position from which the frequency is raised by a constant frequency, the extreme point or inflection point of the frequency characteristic of the sentence prosody that is closest to the phoneme "ね".

In the example of FIG. 10(a), the extreme point or inflection point closest to the last phoneme "ね" of "…いね" is point p6. The prosody information correction unit 13 in FIG. 1 and the prosody information correction unit 24 in FIGS. 2 and 3 take point p6 as the start position of the frequency increase and raise the frequency from point p6 onward.
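One way to locate such a point on a sampled contour is sketched below. The patent does not state how extreme points and inflection points are detected; this sketch assumes a simple finite-difference test (sign change of the first difference for extreme points, sign change of the second difference for inflection points), and the names `critical_points` and `nearest_critical_point` and the sample contour are hypothetical. A real contour would probably need smoothing first.

```python
from typing import List, Optional


def critical_points(f0: List[float]) -> List[int]:
    """Indices of local extrema and (approximate) inflection points."""
    d1 = [b - a for a, b in zip(f0, f0[1:])]   # first differences
    d2 = [b - a for a, b in zip(d1, d1[1:])]   # second differences
    points = set()
    for i in range(1, len(d1)):
        if d1[i - 1] * d1[i] < 0:       # slope changes sign at sample i
            points.add(i)
    for i in range(1, len(d2)):
        if d2[i - 1] * d2[i] < 0:       # curvature changes sign near sample i + 1
            points.add(i + 1)
    return sorted(points)


def nearest_critical_point(times: List[float], f0: List[float],
                           target_time: float) -> Optional[int]:
    """Index of the extreme or inflection point closest to target_time."""
    points = critical_points(f0)
    if not points:
        return None
    return min(points, key=lambda i: abs(times[i] - target_time))


# Invented contour sampled every 10 ms; target_time stands for the time
# position of the selected phoneme.
times = [i * 0.01 for i in range(8)]
f0 = [200, 210, 215, 212, 205, 203, 206, 214]
points = critical_points(f0)
start_idx = nearest_critical_point(times, f0, target_time=0.052)
```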

FIG. 10(b) shows the state in which the frequency from point p6 onward has been raised. With the increase in frequency, point p6 moves to point p6'.

In a frequency characteristic such as that of FIG. 10(b), the frequency changes abruptly. It is therefore preferable to change the frequency continuously, up to the start position, from a position a predetermined time before the start position from which the frequency is raised by a constant frequency (here, point p6 (p6')). The position the predetermined time before the start position is preferably also an extreme point or inflection point.

In the example of FIG. 10(c), the position the predetermined time earlier is point p5, an extreme point or inflection point located before the start position. The frequency may be changed linearly so that it rises continuously from point p5 to point p6', or it may follow an upward-convex or downward-convex curve.
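The linear variant of this transition can be sketched as follows, again on a sampled contour. The function name `shift_with_linear_ramp`, the index-based interface, and the sample values are assumptions for illustration; `ramp_idx` plays the role of point p5 and `start_idx` the role of point p6.

```python
from typing import List


def shift_with_linear_ramp(times: List[float], f0: List[float],
                           ramp_idx: int, start_idx: int,
                           delta_hz: float) -> List[float]:
    """Raise f0 by delta_hz from start_idx on, with a linear lead-in.

    Samples strictly between ramp_idx and start_idx are replaced by a
    straight line from the unmodified value at ramp_idx to the shifted
    value at start_idx, so the contour has no jump at the start position.
    """
    if not 0 <= ramp_idx < start_idx < len(f0):
        raise ValueError("need 0 <= ramp_idx < start_idx < len(f0)")
    out = list(f0)
    for i in range(start_idx, len(f0)):
        out[i] = f0[i] + delta_hz            # constant shift after the start
    t0, t1 = times[ramp_idx], times[start_idx]
    y0, y1 = f0[ramp_idx], out[start_idx]
    for i in range(ramp_idx + 1, start_idx):
        frac = (times[i] - t0) / (t1 - t0)   # 0 at ramp start, 1 at start position
        out[i] = y0 + (y1 - y0) * frac
    return out


# Invented contour for illustration only.
times = [0.00, 0.02, 0.04, 0.06, 0.08]
f0 = [210.0, 205.0, 200.0, 195.0, 190.0]
ramped = shift_with_linear_ramp(times, f0, ramp_idx=0, start_idx=2, delta_hz=40.0)
```

An upward-convex or downward-convex transition could be obtained by replacing `frac` with a curved function of `frac`, for example its square root or its square.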

The speech tends to sound natural when the time from point p5 to point p6' is 0.05 seconds or more. Therefore, when the time interval between the extreme point or inflection point serving as the start position and the extreme point or inflection point immediately before it is less than 0.05 seconds, it is preferable to select an earlier extreme point or inflection point that is at least 0.05 seconds before the start position.
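This selection rule can be sketched as below. The description only requires an interval of at least 0.05 seconds; choosing the latest point that satisfies it is an assumption of this sketch, and the name `choose_ramp_start` and the time values are hypothetical.

```python
from typing import List, Optional

MIN_GAP_S = 0.05  # minimum lead-in time stated in the description


def choose_ramp_start(critical_times: List[float], start_time: float,
                      min_gap: float = MIN_GAP_S) -> Optional[float]:
    """Pick the extreme or inflection point where the lead-in begins.

    Among the points lying at least min_gap seconds before start_time, the
    latest one is returned (assumption: keep the lead-in as short as the
    rule allows). Returns None if no point qualifies.
    """
    candidates = [t for t in critical_times if start_time - t >= min_gap]
    return max(candidates) if candidates else None


# Invented time positions: the point at 3.46 s is only 0.03 s before the
# start position, so the earlier point at 3.38 s is selected instead.
ramp_start = choose_ramp_start([3.30, 3.38, 3.46], start_time=3.49)
```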

A position "within the last phrase and later than the first phoneme of the last phrase" may be the extreme point or inflection point closest to any phoneme in the last phrase other than its first phoneme.

A sentence consisting of only one phrase, such as "はい" ("Yes") or "すみません" ("Excuse me"), is also subject to the frequency increase. The phrase in such a single-phrase sentence is also referred to as the last phrase.

A plurality of people listened to speech in which the entire frequency from the above start position onward was raised by a constant frequency and evaluated the change in the degree of familiarity. The results confirmed that this increases the degree of familiarity of the speech.

It has also been confirmed that lowering the frequency from the above start position onward by a constant frequency decreases the degree of familiarity of the speech. The speech generation device, speech synthesis device, and method of each embodiment can therefore also lower the frequency from the start position onward by a constant frequency in order to intentionally decrease the degree of familiarity a person feels when hearing the speech.

As described above, the speech generation device and speech synthesis device of each embodiment include the prosody information correction units 13 and 24 and the synthesis units 14 and 25. When a sentence consisting of a plurality of phrases is expressed as speech, the prosody information correction units 13 and 24 set the start position to any position within the last phrase that is later than the first phoneme of that phrase. The prosody information correction units 13 and 24 correct the prosody information so that the frequency from the start position onward is raised or lowered by a constant frequency.

The synthesis units 14 and 25 generate the speech data of the sentence by combining the prosody information corrected by the prosody information correction units 13 and 24 with the sound path information.

According to the speech generation device and speech synthesis device of each embodiment, the degree of familiarity of speech can be effectively increased or decreased with a simple configuration.

The prosody information correction units 13 and 24 use the position of the consonant or vowel of any phoneme in the last phrase other than its first phoneme as the start position from which the frequency is raised or lowered by a constant frequency. The prosody information can thereby be corrected from a start position later than the first phoneme of the last phrase.

The prosody information correction units 13 and 24 can take the last phoneme of the last phrase as the above phoneme and correct the prosody information for only the final character of the ending. For example, raising the frequency only at the ending increases the degree of familiarity of the speech while giving a soft impression.

Alternatively, the prosodic information correction unit 13, 24 may set as the start position the extremum or inflection point of the frequency characteristic indicated by the prosody of the sentence that is closest to any phoneme in the last phrase other than the first phoneme. In this way too, the prosodic information can be corrected from a start position that is after the first phoneme of the last phrase.
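
One possible way to locate such points on a sampled contour is to look for sign changes in the first and second finite differences. The sketch below assumes a fully voiced, uniformly sampled contour and uses hypothetical function names; it is not taken from the embodiments, and flat segments (zero differences) are not handled.

```python
from typing import List, Optional


def extrema_and_inflections(f0: List[float]) -> List[int]:
    """Return sample indices of extrema and inflection points of f0.

    An extremum is reported where the first difference changes sign.
    An inflection point is reported at the first sample whose second
    difference has the new sign.
    """
    d1 = [b - a for a, b in zip(f0, f0[1:])]  # d1[i] = f0[i + 1] - f0[i]
    d2 = [b - a for a, b in zip(d1, d1[1:])]  # d2[i] is centered on sample i + 1
    points = set()
    for i in range(1, len(f0) - 1):
        if d1[i - 1] * d1[i] < 0:
            points.add(i)  # slope changes sign at sample i
    for i in range(1, len(d2)):
        if d2[i - 1] * d2[i] < 0:
            points.add(i + 1)  # curvature changes sign between samples i and i + 1
    return sorted(points)


def nearest_feature_index(
    times: List[float],
    points: List[int],
    target_time: float,
    earliest_time: float,
) -> Optional[int]:
    """Pick the feature point closest to target_time that lies after earliest_time.

    earliest_time stands for the end of the first phoneme of the last phrase,
    so that the chosen start position stays after that phoneme.
    """
    allowed = [i for i in points if times[i] > earliest_time]
    if not allowed:
        return None
    return min(allowed, key=lambda i: abs(times[i] - target_time))


if __name__ == "__main__":
    times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    contour = [100.0, 110.0, 115.0, 112.0, 104.0, 100.0, 99.0]
    points = extrema_and_inflections(contour)
    print(points, nearest_feature_index(times, points, 0.45, 0.0))
```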

In this case, the prosodic information correction unit 13, 24 preferably changes the frequency continuously from a position a predetermined time before the start position up to the start position. This makes it possible to increase or decrease the degree of familiarity of the speech with almost no sense of unnaturalness.

The prosodic information correction unit 13, 24 preferably sets the position the predetermined time before the start position to an extremum or inflection point located before the start position. The frequency can then be changed continuously in step with the change in the frequency characteristic.
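
The continuous transition can be illustrated with a simple ramp. The text only requires the change to be continuous; the linear interpolation used here, the `None` marker for unvoiced frames, and the name `shift_with_ramp` are assumptions made for this sketch. The `ramp_start` argument would be, for example, the time of an extremum or inflection point located before the start position.

```python
from typing import List, Optional


def shift_with_ramp(
    times: List[float],
    f0: List[Optional[float]],
    ramp_start: float,
    start_time: float,
    delta_hz: float,
) -> List[Optional[float]]:
    """Shift f0 by delta_hz from start_time onward, ramping in from ramp_start.

    Between ramp_start and start_time the offset grows linearly from 0 to
    delta_hz (an assumed shape), so the corrected contour has no jump.
    """
    if not ramp_start < start_time:
        raise ValueError("ramp_start must be earlier than start_time")
    out: List[Optional[float]] = []
    for t, hz in zip(times, f0):
        if hz is None or t <= ramp_start:
            out.append(hz)  # unvoiced frame, or before the ramp: unchanged
        elif t >= start_time:
            out.append(hz + delta_hz)  # full constant shift
        else:
            fraction = (t - ramp_start) / (start_time - ramp_start)
            out.append(hz + delta_hz * fraction)  # partial shift inside the ramp
    return out


if __name__ == "__main__":
    times = [0.0, 1.0, 2.0, 3.0, 4.0]
    contour = [100.0, 100.0, 100.0, 100.0, 100.0]
    print(shift_with_ramp(times, contour, ramp_start=1.0, start_time=3.0, delta_hz=40.0))
```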

The speech generation device and the speech synthesizer of the embodiments may be a speech synthesizer that generates speech data based on text data representing the sentence. Such a speech synthesizer can generate, by speech synthesis, speech data in which the degree of familiarity of the speech has been increased or decreased.

The speech generation device and the speech synthesizer of the embodiments may be a speech processing device that corrects the prosodic information of speech data based on a speech file containing the speech data of a sentence, text data representing the sentence, and timing data of the text data. Configuring the speech generation device and the speech synthesizer as a speech processing device that operates in this way makes it possible to increase or decrease the degree of familiarity of speech data recorded as a speech file.

The speech generation device and the speech synthesizer of the embodiments may be a speech processing device that corrects the prosodic information of speech data based on speech data obtained by picking up, with a microphone, the speech of a sentence uttered by a person, and on text data representing the sentence generated by speech recognition of that speech data. Configuring the speech generation device and the speech synthesizer as a speech processing device that operates in this way makes it possible to increase or decrease the degree of familiarity of speech uttered by a person.

The speech synthesis method of the embodiments includes a prosodic information correction step and a synthesis step. Of the prosodic information and vocal tract information that constitute the speech data of a sentence composed of a plurality of phrases, the prosodic information correction step sets as the start position, in the prosodic information, any position within the last phrase of the sentence that is after the first phoneme of that phrase. The prosodic information correction step then corrects the frequency from the start position onward so that it is raised or lowered by a constant frequency.

The synthesis step synthesizes the prosodic information corrected in the prosodic information correction step with the vocal tract information to generate speech data in which the degree of familiarity of the speech produced when the speech data of the sentence is voiced has been changed.

According to the speech synthesis method of the embodiments, the degree of familiarity of the speech can be effectively increased or decreased with a simple procedure.

The present invention is not limited to the speech generation device, speech synthesizer, and method of the embodiments described above, and can be modified in various ways without departing from the gist of the invention.

The speech generation devices shown in FIGS. 1 to 3 are configured to include a speech synthesizer. The speech generation device need not include the speech synthesizer; the speech synthesizer may instead be provided outside the speech generation device. The speech generation device may be configured to include a storage unit and an audio reproduction unit. The storage unit holds speech data generated so as to contain prosodic information in which any position within the last phrase of a sentence that is after the first phoneme of that phrase is taken as the start position and the frequency from the start position onward is raised or lowered by a constant frequency. The audio reproduction unit reproduces the speech data read from the storage unit.

The D/A converter 16 and the speaker 17 in FIG. 1, and the D/A converter 27 and the speaker 28 in FIGS. 2 and 3, constitute at least part of the audio reproduction unit. When the speech generation device includes a storage unit that holds speech data, a reading unit that reads the speech data from the storage unit can also be regarded as part of the audio reproduction unit.

Thus, it suffices for the speech generation device to include an audio reproduction unit that reproduces speech data generated so as to contain prosodic information in which any position within the last phrase of a sentence that is after the first phoneme of that phrase is taken as the start position and the frequency from the start position onward is raised or lowered by a constant frequency.

The speech generation device and the speech synthesizer may be implemented in hardware, in software, or in a combination of both.

When speech is separated into prosodic information and vocal tract information and then synthesized, a general speech analysis-synthesis system known as a vocoder can be used, for example. As software, the speech analysis software Praat can be used. The T-SOLA algorithm used in Praat is a suitable choice.

The present invention can also be realized as a speech synthesis program that causes a computer to execute a prosodic information correction step and a synthesis step equivalent to the prosodic information correction step and the synthesis step of the speech synthesis method.

13, 24 Prosodic information correction unit
14, 25 Synthesis unit
16, 27 D/A converter (audio reproduction unit)
17, 28 Speaker (audio reproduction unit)

Claims

1. A speech synthesizer comprising:
a prosodic information correction unit that, when a sentence is expressed as speech, sets as a start position any position within the last phrase of the sentence that is after the first phoneme of that phrase, and corrects prosodic information so that the frequency from the start position onward is raised by a constant frequency; and
a synthesis unit that generates speech data of the sentence by synthesizing the prosodic information corrected by the prosodic information correction unit with vocal tract information,
wherein the prosodic information correction unit sets as the start position the extremum or inflection point of the frequency characteristic indicated by the prosody of the sentence that is closest to any phoneme in the last phrase other than the first phoneme.

2. The speech synthesizer according to claim 1, wherein the prosodic information correction unit continuously changes the frequency from a position a predetermined time before the start position up to the start position.

3. The speech synthesizer according to claim 2, wherein the prosodic information correction unit sets the position the predetermined time before the start position to an extremum or inflection point located before the start position.

4. The speech synthesizer according to any one of claims 1 to 3, wherein the speech synthesizer generates the speech data based on text data representing the sentence.

5. The speech synthesizer according to any one of claims 1 to 3, wherein the speech synthesizer is a speech processing device that corrects the prosodic information of the speech data based on a speech file containing the speech data of the sentence, text data representing the sentence, and timing data of the text data.

6. The speech synthesizer according to claim 1, wherein the speech synthesizer is a speech processing device that corrects the prosodic information of speech data based on speech data obtained by picking up, with a microphone, the speech of a sentence uttered by a person, and on text data representing the sentence generated by speech recognition of the speech data picked up by the microphone.

7. A speech synthesis method comprising:
a prosodic information correction step of setting as a start position, in the prosodic information out of the prosodic information and vocal tract information that constitute speech data of a sentence, any position within the last phrase of the sentence that is after the first phoneme of that phrase, and correcting the frequency from the start position onward so that it is raised by a constant frequency; and
a synthesis step of synthesizing the prosodic information corrected in the prosodic information correction step with the vocal tract information to generate speech data in which the degree of familiarity of the speech produced when the speech data of the sentence is voiced has been increased.

8. The speech synthesis method according to claim 7, wherein, in the prosodic information correction step, a position of a consonant or vowel of any phoneme excluding the first phoneme in the last phrase is set as the start position.

9. The speech synthesis method according to claim 8, wherein, in the prosodic information correction step, said phoneme is the last phoneme of the last phrase.

10. The speech synthesis method according to claim 7, further comprising:
a speech data input step of inputting input speech data;
a speech recognition step of recognizing speech from the input speech data and outputting text data; and
a start position detection step of detecting the start position from the text data.

11. The speech synthesis method according to claim 10, wherein, in the start position detection step, the phrases contained in the text data are detected.

12. The speech synthesis method according to claim 7, wherein the constant frequency is 40 Hz.