JP5962925B2

JP5962925B2 - Speech synthesis device, music playback device, speech synthesis program, and music playback program

Info

Publication number: JP5962925B2
Application number: JP2013243830A
Authority: JP
Inventors: 翔平永井
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2013-11-26
Filing date: 2013-11-26
Publication date: 2016-08-03
Anticipated expiration: 2033-11-26
Also published as: JP2015102726A

Description

The present invention relates to a speech synthesizer that performs speech synthesis based on input information, and to a music playback device that performs both music playback and speech synthesis. It also relates to a speech synthesis program that causes various information processing apparatuses to execute speech synthesis processing, and to a music playback program that causes them to execute music playback processing and speech synthesis processing.

Speech synthesizers that take text information or the like as input and synthesize speech are known. Such devices are expected to produce a natural speaking voice, as a person would, and various developments have been made to that end. In recent years, speech synthesis has been used not only for spoken voices but also for singing voices, and work on realizing natural singing is also under way. Furthermore, Patent Document 1 discloses forming audio information that allows a learner to reliably and efficiently memorize accented words, such as English words, together with corresponding words, such as their Japanese translations, in time with the rhythm of music or the like.

Speech synthesis that is uttered in time with the rhythm of music or the like, as disclosed in Patent Document 1, could be used not only for learning as in Patent Document 1 but also for instruction in various kinds of exercise, such as fitness. Instructions recorded in advance in synchronization with the music could be used, but with speech synthesis it also becomes possible to deliver instructions suited to the user's exercise state, just as a human instructor has conventionally done.

Various fitness software titles are currently available for game consoles and similar devices. Such software gives spoken instructions according to the user's state as captured by a camera. On smartphones, spoken instructions are also given according to the user's running state (speed and pulse) obtained from various sensors such as GPS.

Patent Document 1: JP 2011-13295 A

Exercise such as fitness or running is sometimes performed while listening to music. In that case, when the instructions described above are output as speech, it is important not only to convey their content but also to preserve the flow and tempo of the music. One conceivable approach is to fit the length of the synthesized speech to the tempo of the music, but this can result in unnatural pronunciation. The speech synthesis for learning disclosed in Patent Document 1 is likewise required to follow the rhythm of the music more closely.

The present invention aims to solve these problems. Its object is to provide a speech synthesizer that enables natural speech synthesis matched to the rhythm of a piece of music, a music playback device that performs speech synthesis and music playback simultaneously, and programs for them.

To solve the problems described above, the present invention adopts the following configuration.
A speech synthesizer capable of executing speech synthesis processing based on text information, wherein the speech synthesis processing executes:
a morphological analysis process that divides the text information into accent phrases and sets an accent position on a mora within each accent phrase;
a placement process that places each accent phrase so that the mora corresponding to its accent position overlaps a strong beat timing of input time signature information, and shortens accent phrases so that adjacent accent phrases do not overlap; and
a speech waveform synthesis process that synthesizes a speech waveform based on the accent phrases placed by the placement process.

Further, in the speech synthesizer according to the present invention, the placement process places the accent phrases so that adjacent accent phrases are separated by at least a predetermined time.

Further, in the speech synthesizer according to the present invention, the placement process places an accent phrase that has no accent position by joining it to an adjacent accent phrase.

A music playback device according to the present invention is a music playback device capable of executing music playback processing based on music information and speech synthesis processing based on text information, wherein the speech synthesis processing executes:
a morphological analysis process that divides the text information into accent phrases and sets an accent position on a mora within each accent phrase;
a placement process that places each accent phrase so that the mora corresponding to its accent position overlaps a strong beat timing of the time signature information output along with the music playback processing, and shortens accent phrases so that adjacent accent phrases do not overlap; and
a speech waveform synthesis process that synthesizes a speech waveform based on the accent phrases placed by the placement process.

A speech synthesis program according to the present invention is a speech synthesis program that causes an information processing apparatus to execute speech synthesis processing based on text information, wherein the speech synthesis processing executes:
a morphological analysis process that divides the input text into accent phrases and sets an accent position on a mora within each accent phrase;
a placement process that places each accent phrase so that the mora corresponding to its accent position overlaps a strong beat timing of input time signature information, and shortens accent phrases so that adjacent accent phrases do not overlap; and
a speech waveform synthesis process that synthesizes a speech waveform based on the accent phrases placed by the placement process.

According to the speech synthesizer, music playback device, speech synthesis program, and music playback program of the present invention, the mora corresponding to the accent position of each accent phrase is uttered so that it overlaps a strong beat timing of the time signature information, which makes it possible to synthesize speech that fits the metrical rhythm of music playback or the like. Shortening accent phrases so that they do not overlap keeps the speech output easy to understand. With this configuration, speech with a proper sense of rhythm can be obtained when synthesizing speech in time with a metrical rhythm, such as that of music, for fitness instruction and the like.

FIG. 1 is a block diagram showing the configuration of an information processing apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart showing speech synthesis processing according to the embodiment of the present invention.
FIG. 3 is a flowchart showing the calculation of utterance timing (process A) according to the embodiment of the present invention.
FIG. 4 is a flowchart showing the duration adjustment of an accent phrase (process B) according to the embodiment of the present invention.
FIG. 5 is a diagram schematically showing an example of speech synthesis processing according to the embodiment of the present invention.
FIG. 6 is a diagram schematically showing an example of speech synthesis processing according to another embodiment of the present invention.

FIG. 1 is a block diagram showing the configuration of an information processing apparatus 1 according to an embodiment of the present invention. The information processing apparatus 1 may take various forms, such as a personal computer or a portable information terminal such as a smartphone, a tablet terminal, or a music playback device. The information processing apparatus 1 of this embodiment can execute performance processing based on music information and speech synthesis processing based on text information. Alternatively, the information processing apparatus 1 according to the present invention may perform only speech synthesis processing, based on information about strong beat timing input from an external device.

The information processing apparatus 1 of this embodiment includes a control unit 10 and a storage unit 13. The control unit 10 includes a CPU, ROM, RAM, and the like. The storage unit 13 stores a morphological analysis dictionary 13a and phoneme information 13b required for speech synthesis processing. The control unit 10 can execute a performance process 11, which outputs performance sound based on input music information, and a speech synthesis process 12, which synthesizes a speech waveform based on input text information. The performance sound and the speech waveform output from the control unit 10 are mixed by a mixing unit 14 and emitted from a speaker 15.

The music information used by the performance process 11 and the text information used by the speech synthesis process 12 may be stored in the storage unit 13, or may be obtained from an external storage medium or via communication such as the Internet.

In the performance process 11 executed by the control unit 10, time signature information is output to the speech synthesis process 12 in synchronization with the performance of the music information. The time signature information includes strong beat timings. A strong beat timing corresponds, for example, to the start of the first beat in quadruple time, which consists of beats 1 to 4. The time signature information including the strong beat timings may be contained in the music information in advance, or may be derived from the volume of the performance sound as it is played.
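As a rough illustration of what the strong beat timings could look like, the Python sketch below lists the start time of beat 1 in each bar for a constant tempo. The function name `strong_beat_times` and the parameters `bpm`, `beats_per_bar`, `n_bars`, and `offset_s` are assumptions made for this example; the embodiment only states that the timings are contained in the music information or derived from the volume of the performance sound.

```python
def strong_beat_times(bpm: float, beats_per_bar: int, n_bars: int,
                      offset_s: float = 0.0) -> list[float]:
    """Return the start time in seconds of beat 1 of each bar (hypothetical helper)."""
    beat_s = 60.0 / bpm  # duration of one beat at a constant tempo (assumption)
    return [offset_s + bar * beats_per_bar * beat_s for bar in range(n_bars)]


# Quadruple time at 120 BPM: one strong beat every 2 seconds.
print(strong_beat_times(bpm=120, beats_per_bar=4, n_bars=3))
```

A real piece with tempo changes would need the timings supplied bar by bar rather than computed from a single tempo.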

The speech synthesis process 12 executed by the control unit 10 comprises a morphological analysis process 12a, which converts text information into the speech parameters required for speech synthesis, and a speech waveform synthesis process 12c, which synthesizes a speech waveform based on the speech parameters. In particular, to realize speech synthesis synchronized with the strong beat timings of the time signature information, this embodiment also performs an adjustment process 12b that adjusts the speech parameters generated by the morphological analysis process 12a.

The speech synthesis processing executed by the information processing apparatus 1 of this embodiment is now described in detail. It is executed on input text information and outputs a synthesized speech waveform as its result. The input text information undergoes morphological analysis (S10) using the morphological analysis dictionary 13a, yielding the various pieces of information (speech parameters) needed to synthesize the speech waveform (S11). FIG. 5 illustrates the speech synthesis processing of this embodiment. FIG. 5(A) shows an example sentence of text information.

FIG. 5(B) shows the result of applying morphological analysis to the text information of FIG. 5(A). In this embodiment, the input text information is divided into accent phrases A1 to A4 with reference to the morphological analysis dictionary 13a. Each accent phrase consists of morae, the phonological units. Each mora is assigned a duration for the output speech waveform. For example, accent phrase A1 consists of three morae, "u", "de", and "o", which are assigned durations B1, B2, and B3, respectively. The duration of accent phrase A1 is therefore the sum of the durations of its morae, C1 = B1 + B2 + B3.
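The relationship between morae and accent phrase duration can be pictured with a small data structure. The following Python sketch is only illustrative: the class names `Mora` and `AccentPhrase` and the millisecond values standing in for B1, B2, and B3 are invented for the example and do not come from the embodiment.

```python
from dataclasses import dataclass


@dataclass
class Mora:
    sound: str        # romanized mora, e.g. "de"
    duration_ms: int  # duration assigned to this mora (hypothetical value)


@dataclass
class AccentPhrase:
    morae: list[Mora]

    @property
    def duration_ms(self) -> int:
        # C1 = B1 + B2 + B3: the phrase duration is the sum of its mora durations
        return sum(m.duration_ms for m in self.morae)


# Accent phrase A1 with made-up durations for B1, B2, B3.
a1 = AccentPhrase([Mora("u", 120), Mora("de", 150), Mora("o", 130)])
print(a1.duration_ms)
```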

Further, the morphological analysis process sets an accent position for each of the accent phrases A1 to A4. The accent position is information set on a mora within an accent phrase. During speech synthesis, an accent is applied to that mora by adjusting at least one of its pitch and volume, which makes the speech sound natural to the listener. In the example sentence of FIG. 5 every accent phrase has an accent position, but accent phrases without an accent position also exist.

Ordinarily, speech synthesis can be carried out by reading out, based on the information obtained by morphological analysis, the various pieces of information (sound source information) needed to form the synthesized speech waveform, such as phoneme information, and outputting the result. In this embodiment, because the speech synthesis is executed in synchronization with the strong beat timings that accompany the performance process, various adjustments are made to the information on which morphological analysis has been completed, as in FIG. 5(B). Specifically, the utterance timing and the duration of the accent phrases are adjusted.

In the speech synthesis processing of FIG. 2, an adjustment process can be executed for each accent phrase obtained by morphological analysis. It comprises calculating the utterance timing (S20: process A) and, when the interval to the preceding accent phrase is below a fixed value, adjusting the accent phrase (S30: process B). FIG. 3 is a flowchart of process A, which calculates the utterance timing of an accent phrase. FIG. 5(C) shows how the accent phrases from the morphological analysis result of FIG. 5(B) are placed. In FIG. 5(C), T1 to T5 denote the strong beat timings defined by the time signature information. For accent phrase A1 ("udeo"), the next (here, the first) strong beat timing is set as A = T1.

Next, the total duration of the morae that precede the accent position within the accent phrase is calculated (S22). In accent phrase A1 the accent position is on the second mora ("de"), so the duration B is that of the single mora before it ("u"). Shifting the utterance timing of the accent phrase to A - B, based on the strong beat timing A and the duration B, synchronizes the start of the mora at the accent position with the strong beat timing. The utterance timing of the next accent phrase A2 ("ōkiku futte") is determined in the same way.
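The calculation in process A can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `utterance_start`, the 0-based `accent_index`, and the millisecond values are all hypothetical. Here A is the strong beat timing and B is the summed duration of the morae before the accent position, as defined for S22.

```python
def utterance_start(mora_ms: list[int], accent_index: int, strong_beat_ms: int) -> int:
    """Start time of an accent phrase so that its accented mora begins on the strong beat.

    mora_ms        -- duration of each mora in ms (hypothetical values)
    accent_index   -- 0-based index of the mora carrying the accent position
    strong_beat_ms -- strong beat timing A
    """
    lead_in = sum(mora_ms[:accent_index])  # B: morae before the accent position
    return strong_beat_ms - lead_in        # utterance timing A - B


# Three morae with the accent on the second one, strong beat at 2000 ms.
start = utterance_start([120, 150, 130], accent_index=1, strong_beat_ms=2000)
print(start)
```

When the accent position is on the first mora, B is zero and the phrase simply starts on the strong beat.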

In this embodiment, when accent phrases A1 to A4 are placed in order, each accent phrase is placed using, as its next strong beat timing A, a strong beat timing position that is not occupied by the accent phrases placed before it. Accordingly, strong beat timing T2 is used for accent phrase A2 ("ōkiku futte"), T4 for accent phrase A3 ("ashibumi"), and T5 for accent phrase A4 ("shimashō").
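The beat assignment described above can be sketched as a single pass over the phrases. In this Python sketch, a strong beat counts as occupied when it falls before the end of the previously placed phrase; that reading, the function name `place_phrases`, the mora durations, and the accent indices are all assumptions chosen so that the result uses T1, T2, T4, and T5 as in the text.

```python
def place_phrases(phrases, strong_beats_ms):
    """Assign each phrase to the next strong beat not occupied by earlier phrases.

    phrases         -- list of (mora durations in ms, 0-based accent index)
    strong_beats_ms -- ascending strong beat timings
    """
    placements = []
    prev_end = float("-inf")
    i = 0
    for morae, accent_index in phrases:
        # Skip strong beats that the previous phrase still covers (assumed rule).
        while i < len(strong_beats_ms) and strong_beats_ms[i] < prev_end:
            i += 1
        if i == len(strong_beats_ms):
            raise ValueError("ran out of strong beats")
        beat = strong_beats_ms[i]
        start = beat - sum(morae[:accent_index])  # process A: A - B
        prev_end = start + sum(morae)
        placements.append({"beat": beat, "start": start, "end": prev_end})
        i += 1  # a used strong beat is not reused
    return placements


beats = [800, 1600, 2400, 3200, 4000]  # T1..T5, hypothetical
phrases = [([150] * 3, 1), ([150] * 7, 0), ([150] * 4, 2), ([150] * 4, 2)]  # A1..A4
for p in place_phrases(phrases, beats):
    print(p)
```

With these made-up values the second phrase runs past T3, so the third phrase moves on to T4. Note that placement alone does not prevent a phrase from starting before the previous one has ended.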

As described, in this embodiment the utterance timing (start position) of the mora at the accent position in each of accent phrases A1 to A4 is made to coincide with a strong beat timing. The utterance timing of an accent phrase is not limited to this form, however. As long as the strong beat timing falls within the period during which the mora at the accent position is uttered, that is, as long as the mora corresponding to the accent position of the accent phrase is placed so that it overlaps the strong beat timing, speech synthesis that audibly fits the metrical rhythm can be achieved.
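The looser condition, in which the strong beat only has to fall somewhere inside the accented mora, can be expressed as a simple interval check. The Python sketch below is illustrative; the function name `accent_overlaps_beat`, the half-open interval, and the sample values are assumptions.

```python
def accent_overlaps_beat(start_ms: int, mora_ms: list[int],
                         accent_index: int, strong_beat_ms: int) -> bool:
    """True if the strong beat falls within the utterance period of the accented mora."""
    accent_start = start_ms + sum(mora_ms[:accent_index])
    accent_end = accent_start + mora_ms[accent_index]
    # Half-open interval [accent_start, accent_end) is an assumption of this sketch.
    return accent_start <= strong_beat_ms < accent_end


morae = [120, 150, 130]  # hypothetical durations, accent on the second mora
print(accent_overlaps_beat(1880, morae, 1, 2000))  # accented mora starts exactly on the beat
print(accent_overlaps_beat(1800, morae, 1, 2000))  # beat falls inside the accented mora
print(accent_overlaps_beat(1700, morae, 1, 2000))  # accented mora ends before the beat
```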

As the relationship between accent phrases A2 and A3 in FIG. 5(C) shows, placing accent phrases with their accent positions as the reference, as described above, can cause accent phrases to overlap. Speech uttered in that state sounds unnatural, and in some cases its meaning cannot be understood. To eliminate this state, this embodiment performs a process (process B) that shortens the duration of the accent phrase causing the conflict.

FIG. 4 is a flowchart of process B, which shortens the duration of an accent phrase that has caused a conflict. In this embodiment, as described for the flowchart of the speech synthesis processing in FIG. 2, process B (S30) is executed on the preceding accent phrase when the interval (time length) to the preceding accent phrase is less than a predetermined time length. The threshold used to judge whether the interval between accent phrases is less than the predetermined time length is hereinafter called the interval threshold. In this process, the duration of each mora in the preceding accent phrase is adjusted using two parameters: A, the interval threshold minus the interval to the preceding accent phrase, and B, the number of morae in the preceding accent phrase. Setting the duration of each mora to its original duration minus A/B (S34) shortens the whole preceding accent phrase by A and secures the time length specified by the interval threshold between the preceding accent phrase and the current accent phrase. The interval threshold can be set to any time length of 0 or more.
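Process B amounts to spreading the shortfall evenly over the morae of the preceding phrase. The Python sketch below is a minimal illustration, assuming the preceding phrase keeps its start time and that an overlap is treated as a negative interval; the function name `shorten_preceding` and all numeric values are hypothetical.

```python
def shorten_preceding(prev_start_ms: float, prev_mora_ms: list[float],
                      next_start_ms: float, interval_threshold_ms: float) -> list[float]:
    """Return adjusted mora durations for the preceding accent phrase (process B sketch)."""
    prev_end = prev_start_ms + sum(prev_mora_ms)
    interval = next_start_ms - prev_end          # negative when the phrases overlap
    if interval >= interval_threshold_ms:
        return list(prev_mora_ms)                # enough room, nothing to adjust
    a = interval_threshold_ms - interval         # A: time that must be removed
    b = len(prev_mora_ms)                        # B: number of morae in the preceding phrase
    return [d - a / b for d in prev_mora_ms]     # S34: each mora shortened by A/B


# Preceding phrase: 7 morae of 200 ms starting at 2000 ms (ends at 3400 ms).
# The next phrase starts at 3300 ms, so the phrases overlap by 100 ms.
adjusted = shorten_preceding(2000, [200] * 7, 3300, interval_threshold_ms=40)
print(adjusted)
```

The source does not describe a lower bound on mora duration, and this sketch has none, so a very large shortfall would produce unusably short or negative durations.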

FIG. 5(D) shows the result of adjusting durations by applying process B to the accent phrases placed according to the strong beat timings in FIG. 5(C). In this embodiment, the durations of accent phrase A2 ("ōkiku futte") and accent phrase A3 ("ashibumi") have been adjusted. Synthesizing the speech waveform (S15 in FIG. 2) from the speech parameters determined by this placement of accent phrases and morae enables natural speech synthesis that fits the rhythm of the music being played. In the speech waveform synthesis process of this embodiment, the phoneme information 13b in the storage unit 13 is read out based on the speech parameters determined through the morphological analysis process 12a and the adjustment (placement) process 12b, and the speech waveform is formed.

In the morphological analysis process, accent phrases without an accent position can occur. In the example sentence of FIG. 5 every accent phrase has an accent position, but in the speech synthesis processing of this embodiment an accent phrase without an accent position cannot be placed in synchronization with a strong beat timing. Therefore, when such an accent phrase occurs, it is joined to the following accent phrase (which has an accent position) and synchronized with the strong beat timing based on the accent position of that following accent phrase.

FIG. 6 shows the front part of a case in which an accent phrase A0 ("mazuwa") without an accent position is placed before the example sentence of FIG. 5. Here, accent phrase A0 is joined to the following accent phrase A1, and, as in the example of FIG. 5, the accent position of accent phrase A1 is placed in synchronization with strong beat timing T1. This allows accent phrase A0, which has no accent position, to be placed as well, and yields synthesized speech that sounds natural. For the sake of how it sounds, an accent phrase without an accent position is preferably joined to the following accent phrase in this way. When that is not possible, for example when the accent phrase without an accent position is at the end of the sentence, it may instead be joined to the immediately preceding accent phrase. Thus, in the present invention, an accent phrase without an accent position is joined to an adjacent (immediately preceding or following) accent phrase.
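Joining accent phrases that lack an accent position can be sketched as a merge pass before placement. In the Python sketch below, a phrase is a pair of mora durations and a 0-based accent index, with `None` meaning no accent position. The function name `merge_unaccented`, the representation, the numeric values, and the handling of several consecutive unaccented phrases (chained together) are assumptions of this example.

```python
from typing import Optional

Phrase = tuple[list[int], Optional[int]]  # (mora durations in ms, accent index or None)


def merge_unaccented(phrases: list[Phrase]) -> list[Phrase]:
    """Join each phrase without an accent position to the following accented phrase,
    or to the preceding one when no accented phrase follows."""
    merged: list[Phrase] = []
    pending: list[int] = []  # morae waiting for a following accented phrase
    for morae, accent in phrases:
        if accent is None:
            pending.extend(morae)
        else:
            # Prepend the waiting morae and shift the accent index accordingly.
            merged.append((pending + list(morae), accent + len(pending)))
            pending = []
    if pending:  # unaccented phrase at the end of the sentence
        if not merged:
            raise ValueError("no accented phrase to join to")
        last_morae, last_accent = merged[-1]
        merged[-1] = (last_morae + pending, last_accent)
    return merged


a0 = ([150, 150, 150], None)  # e.g. "mazuwa", no accent position
a1 = ([120, 150, 130], 1)     # e.g. "udeo", accent on the second mora
print(merge_unaccented([a0, a1]))
```

After merging, the combined phrase can be placed like any other phrase, because its accent index already accounts for the prepended morae.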

The present invention is not limited to these embodiments. Embodiments obtained by appropriately combining the configurations of the respective embodiments also fall within the scope of the present invention.

Description of symbols
1 ... information processing apparatus
10 ... control unit
11 ... performance process
12 ... speech synthesis process
12a ... morphological analysis process
12b ... adjustment process
12c ... speech waveform synthesis process
13 ... storage unit
13a ... morphological analysis dictionary
13b ... phoneme information
14 ... mixing unit
15 ... speaker

Claims

A speech synthesizer capable of executing speech synthesis processing based on text information, wherein the speech synthesis processing executes:
a morphological analysis process that divides the text information into accent phrases and sets an accent position on a mora within each accent phrase;
an adjustment process that places each accent phrase so that the mora corresponding to its accent position overlaps a strong beat timing of input time signature information, and shortens accent phrases so that adjacent accent phrases do not overlap; and
a speech waveform synthesis process that synthesizes a speech waveform based on the accent phrases placed by the adjustment process.

The speech synthesizer according to claim 1, wherein the adjustment process places the accent phrases so that adjacent accent phrases are separated by at least a predetermined time.

The speech synthesizer according to claim 1, wherein the adjustment process places an accent phrase that has no accent position by joining it to an adjacent accent phrase.

A music playback device capable of executing music playback processing based on music information and speech synthesis processing based on text information, wherein the speech synthesis processing executes:
a morphological analysis process that divides the text information into accent phrases and sets an accent position on a mora within each accent phrase;
an adjustment process that places each accent phrase so that the mora corresponding to its accent position overlaps a strong beat timing of the time signature information output along with the music playback processing, and shortens accent phrases so that adjacent accent phrases do not overlap; and
a speech waveform synthesis process that synthesizes a speech waveform based on the accent phrases placed by the adjustment process.

A speech synthesis program that causes an information processing apparatus to execute speech synthesis processing based on text information, wherein the speech synthesis processing executes:
a morphological analysis process that divides the input text into accent phrases and sets an accent position on a mora within each accent phrase;
an adjustment process that places each accent phrase so that the mora corresponding to its accent position overlaps a strong beat timing of input time signature information, and shortens accent phrases so that adjacent accent phrases do not overlap; and
a speech waveform synthesis process that synthesizes a speech waveform based on the accent phrases placed by the adjustment process.