JP2010224392A

JP2010224392A - Utterance support device, method, and program

Info

Publication number: JP2010224392A
Application number: JP2009073796A
Authority: JP
Inventors: Takeshi Iwaki; 健岩木
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2009-03-25
Filing date: 2009-03-25
Publication date: 2010-10-07

Abstract

PROBLEM TO BE SOLVED: To reduce the psychological loss of confidence felt by a user of a speech synthesis device by letting the user directly feel vibration corresponding to the synthesized speech, timed to the speech output.
SOLUTION: The utterance support device includes: a speech waveform generation means that generates a speech waveform based on text entered through a text input means; a waveform power calculation means that calculates a waveform power value for each prescribed time section of the speech waveform from the speech waveform generation means; a physical vibration conversion means that obtains a physical vibration quantity corresponding to the waveform power value of each time section from the waveform power calculation means; and a vibration conduction means that imparts to the utterer vibration based on the physical vibration quantity corresponding to the waveform power value of each time section.
COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an utterance assisting device, method, and program, and can be applied, for example, to a device that helps a user whose vocal cords have been removed to overcome a psychological loss of confidence when speaking with a speech synthesizer.

For example, there are speech synthesizers that convert input text into speech and output synthesized speech, and such speech synthesizers are used in a wide range of fields.

One conventional speech synthesis method is the technique described in Patent Document 1. As described there, in the conventional method a language processing unit performs morphological analysis and syntactic analysis on the input Japanese text and generates a kana character string with accent positions. A prosody pattern generation unit then calculates prosodic information (for example, the duration of each phoneme, the voice pitch in each time interval (frame), and the duration of any silent interval) from the accent-annotated kana character string supplied by the language processing unit. Finally, a phoneme waveform generation unit generates and outputs a synthesized speech waveform based on the prosodic information from the prosody pattern generation unit.

Patent Document 1: JP 2003-208188 A

One use of a speech synthesizer is by a person whose vocal cords have been removed because of illness, an accident, or the like.

For example, the person records their own voice before the vocal cords are removed and the recordings are compiled into a database, so that text the person enters is output as synthesized speech in their own voice (voice quality) and their own manner of speaking. This makes it possible to reproduce the user's identity: individual characteristics, emotional expression, and the naturalness of their speech.

However, the following problems can arise in this kind of use.

An able-bodied speaker normally perceives their own speech not only through the ears but also through vibration transmitted directly from the throat, cheekbones, and so on, and hears their own voice through both.

However, when a person whose vocal cords have been removed uses a speech synthesizer as described above to output synthesized speech that reproduces their identity, they can hear the speech through their ears but cannot feel the vibration transmitted directly from the throat, cheekbones, and so on.

One's own natural voice is normally conducted through bone, and the voice that reaches the ear in this way, by vibrating the eardrum, takes precedence over the voice heard through the air. Synthesized speech, by contrast, is like another person's voice: it is heard through the ears after traveling through the air, so a gap can arise between it and the speaker's own image of "my voice". A gap from that image can also arise from imperfections in the synthesized speech itself. For example, when synthesis quality degrades, the voice quality may not sound like the speaker, or the manner of speaking and the pauses may differ.

In such cases, the speaker often feels uncomfortable and becomes unable to speak with confidence.

An effective remedy for the psychological loss of confidence caused by imperfect synthesized speech is to compensate for the missing information on the basis of the human speech and perception mechanisms.

That is, if the vibration that is transmitted directly from the throat, cheekbones, and so on during speech is generated artificially from the synthesized speech, in step with its generation, and the user can feel it at the moment the synthesized speech is output, the user gains a strong sense of speaking at that moment, and the loss of confidence caused by imperfect synthesized speech is reduced.

There is therefore a need for an utterance assisting device, method, and program that let a user of a speech synthesizer directly feel vibration corresponding to the synthesized speech, timed to the utterance, and thereby reduce the psychological loss of confidence.

The utterance assisting device of the first aspect of the present invention comprises: (1) speech waveform generation means for generating a speech waveform based on text input through text input means; (2) waveform power calculation means for calculating a waveform power value for each predetermined time interval based on the speech waveform from the speech waveform generation means; (3) physical vibration conversion means for obtaining a physical vibration amount corresponding to the waveform power value of each time interval from the waveform power calculation means; and (4) vibration conduction means for applying to the speaker vibration based on the physical vibration amount corresponding to the waveform power value of each time interval.

The utterance assisting method of the second aspect of the present invention is a method for an utterance assisting device comprising text input means and vibration conduction means, and comprises: (1) a speech waveform generation step in which speech waveform generation means generates a speech waveform based on text input through the text input means; (2) a waveform power calculation step in which waveform power calculation means calculates a waveform power value for each predetermined time interval based on the speech waveform from the speech waveform generation means; and (3) a physical vibration conversion step in which physical vibration conversion means obtains a physical vibration amount corresponding to the waveform power value of each time interval from the waveform power calculation means and supplies the physical vibration amount to the vibration conduction means.

The utterance assisting program of the third aspect of the present invention causes an utterance assisting device comprising text input means and vibration conduction means to function as: (1) speech waveform generation means for generating a speech waveform based on text input through the text input means; (2) waveform power calculation means for calculating a waveform power value for each predetermined time interval based on the speech waveform from the speech waveform generation means; and (3) physical vibration conversion means for obtaining a physical vibration amount corresponding to the waveform power value of each time interval from the waveform power calculation means and supplying the physical vibration amount to the vibration conduction means.

According to the present invention, a user of a speech synthesizer can directly feel vibration corresponding to the synthesized speech, timed to the utterance, which reduces the psychological loss of confidence.

FIG. 1 is a configuration diagram showing the configuration of the utterance assisting device of the first embodiment.
FIG. 2 is a block diagram explaining the functions of the speech synthesis unit of the first embodiment.
FIG. 3 is a block diagram explaining the functions of the power calculation unit of the first embodiment.
FIG. 4 is an explanatory diagram of how the power calculation unit of the first embodiment calculates the waveform power of a frame.
FIG. 5 is an explanatory diagram of how the physical vibration amount is generated from the waveform power of each frame in the first embodiment.
FIG. 6 is a flowchart explaining the processing performed by the utterance assisting device of the first embodiment.

(A) First Embodiment
A first embodiment of the utterance assisting device, method, and program of the present invention is described below with reference to the drawings.

(A-1) Configuration of the First Embodiment
FIG. 1 is a configuration diagram of the utterance assisting device of the first embodiment. As shown in FIG. 1, the utterance assisting device of the first embodiment includes at least a text input unit 101, a speech synthesis unit 102, a power calculation unit 103, a vibration generation unit 104, a vibration conduction unit 105, and an output unit 106.

The speech synthesis unit 102, the power calculation unit 103, and the vibration generation unit 104 are functions realized by an information processing apparatus such as a personal computer. They can be implemented in software: for example, in an information processing apparatus with hardware such as a CPU, ROM, RAM, and EEPROM, the CPU reads and executes a processing program stored in the ROM.

The text input unit 101 accepts the desired text through user operation and passes the input text data to the speech synthesis unit 102. It is, for example, a keyboard or a touch panel.

Based on the text data from the text input unit 101, the speech synthesis unit 102 generates a synthesized speech waveform frame by frame (a frame being a predetermined processing time interval) or breath group by breath group, and supplies the waveform to the power calculation unit 103 and the output unit 106.

Existing methods, such as the one described in Patent Document 1, can be widely applied as the speech synthesis method of the speech synthesis unit 102. Here, one such method is described with reference to FIG. 2.

FIG. 2 shows the internal configuration of the speech synthesis unit 102. The content described in JP 2007-233216 A can be applied to this internal configuration. As shown in FIG. 2, the speech synthesis unit 102 includes at least a language processing unit 201, a prosody generation unit 202, a waveform generation unit 203, a temporary storage unit 204, a word dictionary 205, and a speech segment database 206.

The language processing unit 201 performs morphological and syntactic analysis on the input text character string, determines accent, intonation, and the like, and supplies a kana character string with accent positions (an intermediate language) to the prosody generation unit 202.

Referring to the word dictionary 205, the language processing unit 201 applies text processing such as morphological and syntactic analysis to the input text character string, divides it into phonemes, which are the units of speech synthesis, attaches the prosodic information obtained by the analysis, and outputs the result as a synthesis target. The word dictionary 205 registers the reading (kana), grammatical information, accent type, accent combination rules, and so on for each word.

The language processing unit 201 also assigns accent positions to each grammatically or semantically coherent group in the word sequence, referring to the accent combination rules of the word dictionary 205.

Furthermore, the language processing unit 201 determines dependency relations among the grammatically or semantically coherent groups to which accent positions have been assigned (accent phrases), and groups accent phrases linked by dependency into a breath group.

For the target phoneme sequence that makes up the synthesis target output by the language processing unit 201, the prosody generation unit 202 generates acoustic feature parameters (target parameters) corresponding to the prosody of the speech to be synthesized, attaches them to each phoneme, and outputs the result as a target phoneme sequence of target phonemes.

Here, prosodic information means the parameters needed for speech synthesis, for example the speech segments, the duration of each phoneme, the pitch (the voice pitch in each frame), and, if there is a silent interval, its duration.

The speech segment database 206 stores speech segment data obtained by analyzing and processing speech recorded from the user, for example before the user's vocal cords are removed.

Given the target phoneme sequence from the prosody generation unit 202, the waveform generation unit 203 refers to the speech segment database 206, retrieves speech segment data, concatenates the segment waveforms according to the synthesis target, and generates and outputs a speech waveform. Various waveform generation methods can be used, for example a waveform superposition method in which speech segments are shifted by one pitch period at a time and overlapped.
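The waveform superposition idea can be illustrated with a deliberately simplified sketch. It assumes a fixed pitch period in samples and segments that are already windowed; the function name `overlap_add` and both of these simplifications are assumptions for illustration and are not taken from the source, which does not specify the method in this detail.

```python
def overlap_add(segments, pitch_period):
    """Place segment k at offset k * pitch_period and sum overlapping samples.

    Simplified illustration: a real system would vary the pitch period
    along the utterance and taper each segment before adding it.
    """
    if pitch_period <= 0:
        raise ValueError("pitch_period must be positive")
    # Total length is the furthest sample reached by any shifted segment.
    length = max(
        (k * pitch_period + len(seg) for k, seg in enumerate(segments)),
        default=0,
    )
    out = [0.0] * length
    for k, seg in enumerate(segments):
        offset = k * pitch_period
        for n, sample in enumerate(seg):
            out[offset + n] += sample
    return out
```

With two three-sample segments and a pitch period of two samples, the middle sample receives a contribution from both segments.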

The temporary storage unit 204 temporarily stores the synthesized speech waveform generated by the waveform generation unit 203 and then outputs it. It is provided because the time efficiency of synthesized waveform generation (that is, the bit rate at which waveform data is produced) may not be constant; for example, the CPU load needed to generate the waveform can differ greatly between long and short input text strings. Storing the synthesized speech waveform, generated per frame or per breath group, in the temporary storage unit 204 allows subsequent processing to run at a constant bit rate.
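One way to picture the role of the temporary storage unit is a first-in, first-out buffer that accepts waveform chunks of any size and hands out fixed-size frames. The sketch below is only an illustration under that assumption; the class name `WaveformBuffer` and its methods are hypothetical and do not come from the source.

```python
from collections import deque


class WaveformBuffer:
    """FIFO buffer: variable-size chunks in, fixed-size frames out."""

    def __init__(self, frame_len):
        if frame_len <= 0:
            raise ValueError("frame_len must be positive")
        self.frame_len = frame_len
        self._samples = deque()

    def push(self, chunk):
        """Store a chunk of synthesized samples (any length)."""
        self._samples.extend(chunk)

    def pop_frame(self):
        """Return the next full frame, or None if not enough samples yet."""
        if len(self._samples) < self.frame_len:
            return None
        return [self._samples.popleft() for _ in range(self.frame_len)]
```

A consumer that calls `pop_frame` on a fixed schedule sees a steady frame rate even when the producer delivers samples in bursts.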

The power calculation unit 103 receives the synthesized speech waveform from the speech synthesis unit 102, calculates the waveform power value of each predetermined time interval, and thereby obtains the time series of waveform power values. It supplies the calculated waveform power value of each time interval to the vibration generation unit 104.

FIG. 3 shows the internal configuration of the power calculation unit 103. As shown in FIG. 3, the power calculation unit 103 includes at least a windowing unit 301 and a waveform power calculation unit 302.

The windowing unit 301 cuts a predetermined time interval out of the synthesized speech waveform from the speech synthesis unit 102 by applying a window function such as a Hanning window or a Hamming window.

FIG. 4 illustrates the windowing of a predetermined time interval of the synthesized speech waveform.

Here, to calculate the signal strength of one frame interval, waveform data representative of that frame is cut out. As shown in FIG. 4, in the first embodiment the cut is made with a Hanning window twice the frame interval in length, centered on the center time of the frame. The window function may be changed as appropriate, for example to a Hamming, Gaussian, or tapered window. The window length sets the time resolution of the moving average of the power; twice the frame period is the baseline, and it may be changed to suit the application.
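The cut-out described above can be sketched as follows. This is a minimal illustration, assuming the waveform is a list of samples, the frame interval is given in samples, and samples outside the signal are treated as zero; the function names `hann_window` and `cut_frame` and the zero padding are assumptions, not details from the source.

```python
import math


def hann_window(length):
    """Symmetric Hann (Hanning) window of the given length."""
    if length == 1:
        return [1.0]
    return [
        0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def cut_frame(samples, frame_index, frame_len):
    """Cut 2 * frame_len samples centered on the center of the given frame.

    Samples that fall outside the signal are taken as zero (an assumption).
    """
    window_len = 2 * frame_len
    center = frame_index * frame_len + frame_len // 2
    start = center - frame_len
    window = hann_window(window_len)
    out = []
    for n in range(window_len):
        i = start + n
        x = samples[i] if 0 <= i < len(samples) else 0.0
        out.append(x * window[n])
    return out
```

Because the window is twice the frame interval, neighboring frames overlap by half a window, which is what makes the resulting power series behave like a moving average.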

In the first embodiment, the sum of the squared sample values of the cut-out waveform is taken as the waveform power value of the frame interval. The time series of waveform power values calculated in this way is a moving average of the power of the generated synthesized speech. The calculation may be changed to any method that quantifies the amplitude of the waveform, such as summing absolute values.
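Written as a formula, the waveform power of frame k might be expressed as below. The notation is introduced here for illustration only and does not appear in the source: x[n] is the synthesized speech sample sequence, w[n] the Hanning window, T the frame interval in samples, and c_k the center sample of frame k. The second line is the absolute-value alternative mentioned in the text.

```latex
P_k = \sum_{n=0}^{2T-1} \bigl( w[n]\, x[c_k - T + n] \bigr)^2,
\qquad
w[n] = \tfrac{1}{2}\Bigl( 1 - \cos\frac{2\pi n}{2T - 1} \Bigr)

P_k^{\mathrm{abs}} = \sum_{n=0}^{2T-1} \bigl| w[n]\, x[c_k - T + n] \bigr|
```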

As described later, the vibration generation unit 104 converts the waveform power into a physical vibration amount and the vibration conduction unit 105 outputs vibration according to that amount. To keep the vibration responsive to the waveform power, the time interval is made relatively short.

The waveform power calculation unit 302 calculates the waveform power value of the waveform data of the predetermined time interval cut out by the windowing unit 301 (the windowed waveform data).

As one way of calculating the waveform power value, the waveform power calculation unit 302 samples the windowed waveform data at a predetermined sampling frequency to obtain sample values, squares each sample value, and accumulates the squares over one frame interval; the accumulated value is the waveform power value of that frame interval.

The calculation method is not limited to this example; any method that yields a moving average over the sample values of the windowed waveform data will do. For example, the accumulated absolute values of the sample values may be used as the waveform power value of the frame interval.
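The two accumulation rules described above, squared values and absolute values, can be sketched as a single function. The function name `frame_power` and the `method` parameter are hypothetical names chosen for this illustration; the input is assumed to be the already windowed samples of one frame.

```python
def frame_power(windowed, method="square"):
    """Waveform power value of one frame from its windowed samples.

    method="square": accumulate squared sample values.
    method="abs":    accumulate absolute sample values (the alternative).
    """
    if method == "square":
        return sum(x * x for x in windowed)
    if method == "abs":
        return sum(abs(x) for x in windowed)
    raise ValueError("method must be 'square' or 'abs'")
```

Both rules grow with the amplitude of the frame; the squared form weights loud samples more heavily than the absolute-value form.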

The vibration generation unit 104 obtains a physical vibration amount from the waveform power value of each frame interval calculated by the power calculation unit 103 and supplies it to the vibration conduction unit 105.

As a method of generating the physical vibration amount from the waveform power value of each frame interval, the vibration generation unit 104 can, for example, generate a physical vibration amount proportional to the waveform power value.

FIG. 5 shows an example of generating the physical vibration amount according to the magnitude of the waveform power value. The vibration generation unit 104 holds relation information expressing a proportional relation such as that shown in FIG. 5. On receiving the waveform power value of each frame from the power calculation unit 103, it refers to this relation information and obtains the corresponding physical vibration amount. In other words, it obtains the physical vibration amount by multiplying the waveform power value of each frame by a predetermined constant.
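The proportional mapping amounts to a single multiplication. In the sketch below the function name `power_to_vibration`, the default `gain`, and the upper limit `max_amount` are all assumptions made for illustration; in particular, the clipping to an upper limit is not described in the source and is included only because a physical actuator has a finite range.

```python
def power_to_vibration(power, gain=0.5, max_amount=1.0):
    """Physical vibration amount proportional to the waveform power value.

    gain is the predetermined constant; max_amount is an assumed actuator
    limit (clipping is not part of the source description).
    """
    amount = max(power, 0.0) * gain
    return min(amount, max_amount)
```

For example, with the default gain a power value of 0.4 maps to a vibration amount of 0.2, while very large power values saturate at `max_amount`.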

Alternatively, the vibration generation unit 104 may quantize the waveform power value of each frame interval and obtain a physical vibration amount corresponding to the quantized value. In that case it may hold a preset physical vibration amount for each quantized value and select the corresponding one, or it may multiply the quantized value by a predetermined constant.
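Both quantization variants can be sketched with a threshold list. The names `quantize_power`, `vibration_from_table`, and `vibration_from_level`, as well as the idea of defining quantization levels by sorted thresholds, are assumptions for this illustration; the source does not say how the quantization is carried out.

```python
import bisect


def quantize_power(power, thresholds):
    """Quantization level: how many sorted thresholds the power has reached."""
    return bisect.bisect_right(thresholds, power)


def vibration_from_table(power, thresholds, amounts):
    """Variant 1: select a preset vibration amount for the quantized level."""
    if len(amounts) != len(thresholds) + 1:
        raise ValueError("need one amount per quantization level")
    return amounts[quantize_power(power, thresholds)]


def vibration_from_level(power, thresholds, step):
    """Variant 2: multiply the quantized level by a predetermined constant."""
    return quantize_power(power, thresholds) * step
```

With thresholds `[1.0, 4.0, 9.0]` there are four levels (0 to 3), so the table variant needs four preset amounts.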

The vibration conduction unit 105 vibrates physically according to the physical vibration amount generated by the vibration generation unit 104 and transmits the vibration to the user's throat, cheekbones, neck, or the like. It is, for example, a vibrating device (such as a vibrator) or a bone conduction output device (such as a bone conduction speaker).

In this way, vibration generated artificially from the synthesized speech is transmitted directly to the user in step with the generation of the synthesized speech. The user can therefore feel that they are the one speaking, which reduces the loss of confidence caused by imperfect synthesized speech.

So that it can be attached to the skin, the vibration conduction unit 105 is fitted with a mounting member such as a soft cloth pad enclosing the vibrator or skin-adhesive tape, by which it is worn on the user's throat, neck, or the like.

The output unit 106 outputs synthesized speech based on the synthesized speech waveform produced by the speech synthesis unit 102. It is, for example, a loudspeaker.

(A-2) Operation of the First Embodiment
Next, the operation of the utterance assisting method using the utterance assisting device 100 of the first embodiment is described with reference to FIG. 6.

First, the user of the utterance assisting device 100 enters the desired text string using the text input unit 101, for example a keyboard (step S101).

The text string entered at the text input unit 101 is passed to the speech synthesis unit 102, which generates a synthesized speech waveform from it (step S102).

The synthesized speech waveform is passed to the power calculation unit 103, which calculates, frame by frame, the waveform power value of each short frame interval (step S103).

The power calculation unit 103 cuts a one-frame time interval out of the synthesized speech waveform and applies a window such as a Hanning or Hamming window. It then accumulates the squared sample values of the cut-out windowed waveform data to obtain a mean-square value, which it takes as the waveform power value of that frame.

By repeating this calculation at every frame interval, the power calculation unit 103 obtains the time series of the waveform power of each frame.

Next, the waveform power of each frame calculated by the power calculation unit 103 is passed to the vibration generation unit 104, which obtains a physical vibration amount corresponding to the waveform power value of each frame (step S104).

The vibration generation unit 104 need only obtain a physical vibration amount that drives the vibration conduction unit 105 in proportion to the waveform power value of each frame. This can be done, for example, by multiplying the waveform power value of each frame by a predetermined constant, or by quantizing the waveform power value of each frame and obtaining the physical vibration amount corresponding to the quantized value.

Because the power calculation unit 103 obtains the waveform power values as a time series over the frames, the physical vibration amounts obtained by the vibration generation unit 104 also form a time series over the frames.

The vibration conduction unit 105 transmits vibration to the user according to the physical vibration amount generated by the vibration generation unit 104 (step S105).
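Steps S101 to S105 can be sketched end to end as below. This is an illustration under stated assumptions, not the device's implementation: `synthesize` is a placeholder that emits a fixed tone per character instead of real speech synthesis, and the function names, the 8000 Hz sample rate, the 80-sample frame length, and the gain are all hypothetical values chosen for the example.

```python
import math


def synthesize(text, sample_rate=8000):
    """Placeholder for step S102 (NOT a real synthesizer):
    50 ms of a 200 Hz tone per input character."""
    total = (sample_rate // 20) * len(text)
    return [
        math.sin(2.0 * math.pi * 200.0 * n / sample_rate)
        for n in range(total)
    ]


def frame_powers(samples, frame_len):
    """Step S103: Hann-windowed sum of squares per frame.
    The window spans 2 * frame_len samples centered on each frame."""
    window_len = 2 * frame_len
    window = [
        0.5 - 0.5 * math.cos(2.0 * math.pi * n / (window_len - 1))
        for n in range(window_len)
    ]
    powers = []
    for k in range(len(samples) // frame_len):
        start = k * frame_len + frame_len // 2 - frame_len
        total = 0.0
        for n in range(window_len):
            i = start + n
            # Samples outside the signal count as zero (an assumption).
            x = samples[i] if 0 <= i < len(samples) else 0.0
            total += (x * window[n]) ** 2
        powers.append(total)
    return powers


def vibration_amounts(powers, gain):
    """Step S104: vibration amount proportional to each frame's power."""
    return [p * gain for p in powers]


def assist_utterance(text, frame_len=80, gain=0.01):
    """Steps S101 to S104: returns the waveform for the loudspeaker and the
    per-frame vibration amounts that step S105 would send to the vibrator."""
    waveform = synthesize(text)
    powers = frame_powers(waveform, frame_len)
    return waveform, vibration_amounts(powers, gain)
```

In a real device the waveform and the vibration series would be played out in lockstep, one frame at a time, so that the vibration is felt at the moment the corresponding speech is heard.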

(A-3) Effects of the First Embodiment
As described above, according to the first embodiment, when the user enters text, synthesized speech is output from an output unit such as a loudspeaker, and at the same time vibration corresponding to the loudness of the synthesized speech is fed back to the user's neck or the like. This gives the user a stronger sense of psychological reassurance.

(B) Other Embodiments
The first embodiment described feeding back the short-time power of the generated speech waveform to the neck or the like through a vibration conduction unit such as a vibrator; a bone conduction speaker or the like can be used instead.

DESCRIPTION OF SYMBOLS: 100: utterance assisting device; 101: text input unit; 102: speech synthesis unit; 103: power calculation unit; 104: vibration generation unit; 105: vibration conduction unit; 106: output unit.

Claims

An utterance assisting device comprising:
speech waveform generation means for generating a speech waveform based on text input through text input means;
waveform power calculation means for calculating a waveform power value for each predetermined time interval based on the speech waveform from the speech waveform generation means;
physical vibration conversion means for obtaining a physical vibration amount corresponding to the waveform power value of each time interval from the waveform power calculation means; and
vibration conduction means for applying to a speaker vibration based on the physical vibration amount corresponding to the waveform power value of each time interval.

The utterance assisting device according to claim 1, further comprising voice output means for outputting speech corresponding to the speech waveform from the speech waveform generation means,
wherein the vibration conduction means can be brought into contact with the throat, cheekbone, or neck of the speaker and, at the timing when speech is output from the voice output means, transmits to the speaker vibration corresponding to the waveform power value of the time interval corresponding to the output speech.

The utterance assisting device according to claim 1 or 2, wherein the speech waveform generation means includes a database storing a plurality of speech segment data and prosodic information, and generates synthesized speech corresponding to the input text with reference to the database.

The utterance assisting device according to any one of claims 1 to 3, wherein the vibration conduction means is a vibrator or a bone conduction speaker.

An utterance assisting method for an utterance assisting device comprising text input means and vibration conduction means, the method comprising:
a speech waveform generation step in which speech waveform generation means generates a speech waveform based on text input through the text input means;
a waveform power calculation step in which waveform power calculation means calculates a waveform power value for each predetermined time interval based on the speech waveform from the speech waveform generation means; and
a physical vibration conversion step in which physical vibration conversion means obtains a physical vibration amount corresponding to the waveform power value of each time interval from the waveform power calculation means and supplies the physical vibration amount to the vibration conduction means.

An utterance assisting program that causes an utterance assisting device comprising text input means and vibration conduction means to function as:
speech waveform generation means for generating a speech waveform based on text input through the text input means;
waveform power calculation means for calculating a waveform power value for each predetermined time interval based on the speech waveform from the speech waveform generation means; and
physical vibration conversion means for obtaining a physical vibration amount corresponding to the waveform power value of each time interval from the waveform power calculation means and supplying the physical vibration amount to the vibration conduction means.