JP2010026223A

JP2010026223A - Target parameter determination device, synthesis voice correction device and computer program

Info

Publication number: JP2010026223A
Application number: JP2008187035A
Authority: JP
Inventors: Reiko Tako; 礼子田高; Toru Tsugi; 徹都木; Hiroyuki Segi; 寛之世木; Nobumasa Seiyama; 信正清山
Original assignee: Nippon Hoso Kyokai NHK; NHK Engineering Services Inc; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2008-07-18
Filing date: 2008-07-18
Publication date: 2010-02-04

Abstract

PROBLEM TO BE SOLVED: To easily determine a target parameter which is used for processing in which prosody of synthesis voice is corrected to target prosody.

SOLUTION: A target parameter determination device is provided with a voice data storage section; a prosody selection section in which model voice data are obtained including time variation information of a basic frequency of voice, and time information of a phoneme, tone model data, pitch model data and time model data are selected according to a category of the model voice data, and time variation information of each basic frequency of the selected pitch model data and the tone model data, and time information of each phoneme of the selected tone model data and the time model data are obtained; and a target parameter determination section in which the time variation information of the basic frequency included in the tone model data is changed according to the time variation information of the basic frequency included in the pitch model data, and further the time variation information of the basic frequency which is the target parameter is determined, according to time information of the phoneme included in the time model data.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a target parameter determination device that determines the prosody of synthesized speech, a synthesized speech correction device that corrects synthesized speech according to the determined prosody, and computer programs for these devices.

The prosody of synthesized speech is often predicted and generated from linguistic information such as the utterance content and the surrounding context. When the predicted prosody is unnatural, it must be corrected to a natural prosody (hereinafter referred to as the “target prosody”).
Patent Document 1 discloses a technique relating to a user interface that assists a corrector who corrects prosody. In that technique, the corrector first determines the values of the parameters used in the process of correcting the prosody of the synthesized speech to the target prosody (hereinafter referred to as “target parameters”). The user then operates sliders displayed on the screen according to the determined target parameter values to correct the prosody of each syllable.

Patent Document 2 discloses a technique for recording uttered speech and extracting feature parameters from the recording. Patent Document 3 discloses a technique for prosody conversion, in which speech conversion including prosody is performed on speech segments. Non-Patent Document 1 also discloses a technique for prosody conversion.
Patent Document 1: JP 2003-36100 A
Patent Document 2: JP 2007-140002 A
Patent Document 3: Japanese Patent No. 3913770
Non-Patent Document 1: Toru Tsugi and Tetsuo Umeda, “Voice quality conversion method that corrects distortion caused by pitch modification in the spectral domain, and psychological evaluation of its quality”, IEICE Transactions A (Japanese edition), vol. J73-A, no. 3, pp. 387-396, March 1990

However, determining specific values for the target parameters requires specialized knowledge and experience. Even when the corrector has a clear image of how the synthesized speech to be corrected (hereinafter referred to as the “correction target speech”) should sound after correction, it is difficult for the corrector to determine specific target parameter values. Moreover, even a corrector with such specialized knowledge and experience needs a great deal of time to determine the target parameters.

The present invention has been made in view of the above circumstances, and its object is to provide a target parameter determination device, a synthesized speech correction device, and a computer program that make it possible to determine target parameters easily.

[1] To solve the above problem, a target parameter determination device according to one aspect of the present invention comprises: a speech data storage unit that stores, in association with one another, speech, the utterance content of the speech, time change information of the fundamental frequency of the speech, and phoneme time information representing the timing of the phonemes contained in the speech; a prosody selection unit that acquires sample speech data having time change information of the fundamental frequency of speech and phoneme time information representing the timing of the phonemes contained in that speech, selects, according to the type of the sample speech data, tone sample data serving as a sample of tone, pitch sample data serving as a sample of pitch, and time sample data serving as a sample of phoneme timing, and acquires the time change information of the fundamental frequency of each of the selected pitch sample data and tone sample data and the phoneme time information of each of the selected tone sample data and time sample data; and a target parameter determination unit that determines the time change information of the fundamental frequency serving as a target parameter by changing the time change information of the fundamental frequency of the tone sample data according to the time change information of the fundamental frequency of the pitch sample data and then matching it to the phoneme time information of the time sample data, and that determines the phoneme time information serving as a target parameter by matching the phoneme time information of the tone sample data to the phoneme time information of the time sample data.
Here, the time change information of the fundamental frequency is data in which fundamental frequency values are arranged by elapsed time. Each fundamental frequency value corresponds to an absolute pitch. The time change information of the fundamental frequency therefore includes the pitch of the sound.
According to this configuration, the tone sample data, the pitch sample data, and the time sample data are each selected according to the type of the sample speech data. The time change information of the fundamental frequency of the tone sample is changed according to that of the pitch sample and then matched to the phoneme time information of the time sample data, which yields the target time change information of the fundamental frequency. In addition, the target phoneme time information is determined by matching the phoneme time information of the tone sample data to that of the time sample data.
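The two-step determination in aspect [1] can be sketched as follows. This is only an illustration under stated assumptions: the text here does not say how the tone sample's contour is "changed according to" the pitch sample, so the sketch assumes a simple scaling by the ratio of mean fundamental frequencies, and it assumes that "matching" to the time sample means a piecewise-linear time warp per phoneme. The function names `scale_to_pitch`, `warp_time`, and `determine_target` are hypothetical.

```python
from statistics import fmean


def scale_to_pitch(tone_f0, pitch_f0):
    """Assumed rule: scale the tone contour so its mean F0 equals the pitch sample's mean F0."""
    ratio = fmean(pitch_f0) / fmean(tone_f0)
    return [f * ratio for f in tone_f0]


def warp_time(t, src_bounds, dst_bounds):
    """Map time t from the source phoneme intervals onto the destination
    intervals, linearly within each phoneme (assumed matching rule)."""
    for (s0, s1), (d0, d1) in zip(src_bounds, dst_bounds):
        if s1 > s0 and s0 <= t <= s1:
            return d0 + (t - s0) / (s1 - s0) * (d1 - d0)
    raise ValueError("time lies outside the phoneme intervals")


def determine_target(tone_times, tone_f0, pitch_f0, tone_bounds, time_bounds):
    """Return (target F0 contour as (time, Hz) points, target phoneme intervals).

    tone_bounds and time_bounds are lists of (start, end) pairs in seconds and
    are assumed to contain the same number of phonemes in the same order.
    """
    scaled = scale_to_pitch(tone_f0, pitch_f0)
    warped = [warp_time(t, tone_bounds, time_bounds) for t in tone_times]
    return list(zip(warped, scaled)), list(time_bounds)


# Illustrative values only (not taken from the patent).
contour, intervals = determine_target(
    tone_times=[0.0, 0.25, 0.5],
    tone_f0=[100.0, 120.0, 110.0],
    pitch_f0=[220.0, 220.0],
    tone_bounds=[(0.0, 0.25), (0.25, 0.5)],
    time_bounds=[(0.0, 0.5), (0.5, 0.75)],
)
```

In this sketch the target phoneme time information is simply the time sample's intervals, which is one reading of "matching the phoneme time information of the tone sample data to that of the time sample data".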

[2] In one aspect of the present invention, the above target parameter determination device further comprises: a speech input unit that accepts input of speech; an utterance content acquisition unit that acquires the utterance content corresponding to the speech; and a speech analysis unit that calculates, from the speech accepted by the speech input unit and the utterance content, the time change information of the fundamental frequency of the speech and the phoneme time information of the speech, wherein the prosody selection unit acquires, as the sample speech data, the speech having the time change information of the fundamental frequency and the phoneme time information calculated by the speech analysis unit.
According to this configuration, speech and its utterance content are accepted as input, and the time change information of the fundamental frequency and the phoneme time information are derived from them to obtain sample speech data. This makes it possible to use input speech as sample speech data.

[3] In one aspect of the present invention, the above target parameter determination device further comprises: a sample speech designation unit that accepts input of an instruction to select sample speech data; and a sample speech search unit that obtains the sample speech data by searching the speech data storage unit based on the instruction accepted by the sample speech designation unit, wherein the prosody selection unit acquires the sample speech data obtained by the sample speech search unit.
According to this configuration, speech data selected from the speech data stored in the speech data storage unit in response to a selection instruction can be used as the sample speech data.

[4] In one aspect of the present invention, the above target parameter determination device further comprises: a correction target speech designation unit that accepts input of an instruction to select the correction target speech; and a sample speech search unit that obtains sample speech having the same notation as the correction target speech by searching the speech data storage unit, wherein the prosody selection unit acquires the sample speech data obtained by the sample speech search unit.
According to this configuration, designating the speech to be corrected yields sample speech data whose utterance content corresponds to that speech.

[5] In one aspect of the present invention, the above target parameter determination device further comprises: a correction target speech storage unit that stores the speech to be corrected; and a sample speech search unit that obtains, from the speech data storage unit, sample speech data whose notation differs from that of the speech to be corrected and whose number of phonemes or number of morae matches it, wherein the prosody selection unit acquires the sample speech data obtained by the sample speech search unit.

[6] One aspect of the present invention is a synthesized speech correction device comprising: the above target parameter determination device; a correction target speech storage unit (synthesized speech storage unit) that stores the speech to be corrected; and a correction unit that reads out the correction target speech and corrects it based on the time change information of the fundamental frequency and the phoneme time information determined by the target parameter determination device.
According to this configuration, the target parameter determination device determines the target parameters of the synthesized speech to be corrected simply by having sample speech data designated. The prosody of the speech data to be corrected is then corrected based on the determined target parameters.

[7] A computer program according to one aspect of the present invention causes a computer having a speech data storage unit that stores, in association with one another, speech, the utterance content of the speech, time change information of the fundamental frequency of the speech, and phoneme time information representing the timing of the phonemes contained in the speech, to function as: prosody selection means that acquires sample speech data having time change information of the fundamental frequency of speech and phoneme time information representing the timing of the phonemes contained in that speech, selects, according to the type of the sample speech data, tone sample data serving as a sample of tone, pitch sample data serving as a sample of pitch, and time sample data serving as a sample of phoneme timing, and acquires the time change information of the fundamental frequency of each of the selected pitch sample data and tone sample data and the phoneme time information of each of the selected tone sample data and time sample data; and target parameter determination means that determines the time change information of the fundamental frequency serving as a target parameter by changing the time change information of the fundamental frequency of the tone sample data according to the time change information of the fundamental frequency of the pitch sample data and then matching it to the phoneme time information of the time sample data, and that determines the phoneme time information serving as a target parameter by matching the phoneme time information of the tone sample data to the phoneme time information of the time sample data.

According to the present invention, the target parameters used in the process of correcting the prosody of synthesized speech to the target prosody can be determined easily, without the corrector having to work out specific target parameter values.

[First Embodiment]
Hereinafter, a plurality of embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the functional configuration of the target parameter determination device according to the first embodiment. As shown in the figure, the target parameter determination device 1 includes a speech database 11, a correction target speech designation unit 12, a synthesized speech storage unit 13, a speech input unit 14, a speech text input unit 15 (utterance content acquisition unit), a speech analysis unit 16, a sample speech designation unit 17, a sample speech search unit 18, a prosody selection unit 19, and a target parameter determination unit 20.

The speech database 11 stores speech data that includes speech signal data (time-series data of the speech signal amplitude), prosodic information of the speech, and speaker identification information. The prosodic information consists of time change information of the fundamental frequency and phoneme time information. The speaker identification information identifies the speaker who utters the speech.
The speech database 11 is implemented with a hard disk device, a magneto-optical disk device, a semiconductor memory, a recording medium such as a CD-ROM, or a combination of these.

The correction target speech designation unit 12 accepts input of an instruction to select, from the synthesized speech stored in the synthesized speech storage unit 13, the synthesized speech to be corrected. The input designating the synthesized speech data is accepted, for example, as text data representing the utterance content of the synthesized speech data to be corrected. An input device such as a keyboard or a mouse is used for the correction target speech designation unit 12, for example.
The synthesized speech storage unit 13 stores synthesized speech data that includes speech signal data, prosodic information of the speech, utterance content (notation), and speaker identification information.

The speech input unit 14 accepts input of sample speech, that is, speech that serves as a sample, using a microphone or the like.
The speech text input unit 15 accepts input of text data representing the utterance content of the sample speech accepted by the speech input unit 14. An input device such as a keyboard or a mouse is used for the speech text input unit 15, for example.

The speech analysis unit 16 analyzes the sample speech input from the speech input unit 14 based on the text data accepted by the speech text input unit 15, and generates prosodic information for the sample speech. Specifically, the speech analysis unit 16 performs forced alignment, a speech recognition technique, using acoustic models that have speech signal data and phoneme labels: it fits the acoustic models corresponding to the phoneme labels obtained from the text data to the sample speech, detects the boundaries between adjacent phoneme labels along the time axis, determines the start time and end time of each phoneme, and thereby obtains the phoneme time information.
The speech analysis unit 16 also generates time change information of the fundamental frequency of the sample speech in voiced sections (sections of voiced speech, that is, speech accompanied by vocal cord vibration). It smooths the generated fundamental frequency values with a spline function or the like, using the analysis values of the voiced sections, to produce time change information in which the fundamental frequency changes smoothly. For unvoiced sections (sections of unvoiced speech, that is, speech without vocal cord vibration), where no fundamental frequency value can be obtained, the speech analysis unit 16 fills in the fundamental frequency with values interpolated from the fundamental frequency values of the voiced sections before and after the unvoiced section.
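The interpolation over unvoiced sections described above can be illustrated with a small sketch. It assumes a frame-based contour in which unvoiced frames are marked `None`, and it uses plain linear interpolation between the nearest voiced frames; the spline smoothing mentioned in the text is not reproduced here. The function name `fill_unvoiced` and the frame representation are assumptions for illustration.

```python
from typing import List, Optional


def fill_unvoiced(f0: List[Optional[float]]) -> List[float]:
    """Fill unvoiced frames (None) by linear interpolation between the
    nearest voiced frames; leading and trailing gaps copy the nearest value."""
    voiced = [i for i, v in enumerate(f0) if v is not None]
    if not voiced:
        raise ValueError("no voiced frames to interpolate from")
    filled = []
    for i, v in enumerate(f0):
        if v is not None:
            filled.append(v)
            continue
        prev = max((j for j in voiced if j < i), default=None)
        nxt = min((j for j in voiced if j > i), default=None)
        if prev is None:        # gap before the first voiced frame
            filled.append(f0[nxt])
        elif nxt is None:       # gap after the last voiced frame
            filled.append(f0[prev])
        else:                   # gap between two voiced frames
            w = (i - prev) / (nxt - prev)
            filled.append(f0[prev] + w * (f0[nxt] - f0[prev]))
    return filled


# Illustrative values only: one unvoiced frame between 100 Hz and 120 Hz.
contour = fill_unvoiced([None, 100.0, None, 120.0, None])
```

How the device treats gaps at the very start or end of the speech is not stated in the text; copying the nearest voiced value is a choice made only for this sketch.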

The sample speech designation unit 17 accepts the text of the utterance content of the speech that serves as the sample for correction (the sample speech), chosen from the speech data stored in the speech database 11, in kana notation, kanji notation, or phoneme label notation, for example. Specifically, the sample speech designation unit 17 accepts designation of one of three kinds of speech: different-word same-speaker speech, different-word different-speaker speech, and same-word different-speaker speech. Different-word same-speaker speech has the same speaker identification information as the correction target speech but different utterance content. Different-word different-speaker speech differs from the correction target speech in both speaker identification information and utterance content. Same-word different-speaker speech has different speaker identification information from the correction target speech but the same utterance content. An input device such as a keyboard or a mouse is used for the sample speech designation unit 17, for example.
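The three kinds of sample speech can be expressed as a small classification over speaker identity and utterance content. The sketch below is only an illustration of those definitions; the names `SampleCategory` and `classify_sample` are hypothetical, and comparing utterance content by string equality is an assumption.

```python
from enum import Enum
from typing import Optional


class SampleCategory(Enum):
    DIFFERENT_WORD_SAME_SPEAKER = "different-word same-speaker"
    DIFFERENT_WORD_DIFFERENT_SPEAKER = "different-word different-speaker"
    SAME_WORD_DIFFERENT_SPEAKER = "same-word different-speaker"


def classify_sample(target_speaker: str, target_text: str,
                    sample_speaker: str, sample_text: str) -> Optional[SampleCategory]:
    """Classify a candidate sample relative to the correction target speech."""
    same_speaker = target_speaker == sample_speaker
    same_text = target_text == sample_text
    if same_speaker and not same_text:
        return SampleCategory.DIFFERENT_WORD_SAME_SPEAKER
    if not same_speaker and not same_text:
        return SampleCategory.DIFFERENT_WORD_DIFFERENT_SPEAKER
    if not same_speaker and same_text:
        return SampleCategory.SAME_WORD_DIFFERENT_SPEAKER
    # Same speaker and same content is not one of the three kinds named in the text.
    return None
```

For example, with speaker "A01" as the target and a hypothetical second speaker "B02", the same utterance from "B02" falls into the same-word different-speaker category.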

The sample speech search unit 18 reads out from the speech database 11 the prosodic information of at least one of the different-word same-speaker speech, the different-word different-speaker speech, and the same-word different-speaker speech.
The sample speech search unit 18 also searches the speech database 11 using, as keys, the text data representing the utterance content and the speaker identification information of the correction target speech accepted by the sample speech designation unit 17, and thereby obtains the sample speech.
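A keyed search of this kind might look like the following sketch, which filters an in-memory list of records by utterance text and speaker identification information. The record layout, the function name `search_samples`, and the third record are assumptions for illustration; the first two records reuse the values shown for rows 1 and 4 of FIG. 2.

```python
def search_samples(database, text=None, speaker_id=None):
    """Return the records that match the given utterance text (in any of the
    three notations) and/or the given speaker identification information."""
    hits = []
    for rec in database:
        notations = (rec["kanji"], rec["kana"], rec["phonemes"])
        if text is not None and text not in notations:
            continue
        if speaker_id is not None and rec["speaker_id"] != speaker_id:
            continue
        hits.append(rec)
    return hits


database = [
    {"signal": "WAVE1", "speaker_id": "A01",
     "kanji": "北海道", "kana": "ほっかいどー", "phonemes": "hoQkaido:"},
    {"signal": "WAVE4", "speaker_id": "A01",
     "kanji": None, "kana": "あ", "phonemes": "a"},
    # Hypothetical record (not in FIG. 2): the same word from another speaker.
    {"signal": "WAVE_X", "speaker_id": "B02",
     "kanji": "北海道", "kana": "ほっかいどー", "phonemes": "hoQkaido:"},
]

same_word = search_samples(database, text="北海道")
```

A search for same-word different-speaker speech would additionally exclude the correction target's own speaker; that step is left out to keep the sketch short.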

The prosody selection unit 19 performs prosody selection processing. From among the prosodic information of the sample speech generated by the speech analysis unit 16 (at least one of different-word same-speaker speech, different-word different-speaker speech, and same-word different-speaker speech), the prosodic information of the sample speech read out by the sample speech search unit 18 (at least one of the same three kinds), and the prosodic information of the synthesized speech to be corrected, it selects the tone sample (tone sample data, a sample of tone), the pitch sample (pitch sample data, a sample of pitch), and the time information sample (time sample data, a sample of phoneme timing) used in the target parameter determination processing. Here, pitch is the height of a sound and is represented, for example, by the fundamental frequency of the speech. For example, the maximum, minimum, or average of the fundamental frequency values contained in the time change information of the fundamental frequency is used as the pitch. The fundamental frequency is the lowest frequency among the harmonic components of speech. Tone is the arrangement of voice heights (pitch accent) and is represented by a time series of relative or absolute changes in the fundamental frequency value.
Details of the prosody selection processing will be described later with reference to the drawings.
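To make the distinction between pitch and tone concrete, the sketch below derives both from one fundamental frequency contour. The pitch summary follows the text (maximum, minimum, or average of the values). Expressing the tone as each value divided by the mean is only one possible reading of "relative change" and is an assumption, as are the function names.

```python
from statistics import fmean


def pitch_summary(f0_hz):
    """Summarize pitch as the maximum, minimum, and mean of the F0 values."""
    return {"max": max(f0_hz), "min": min(f0_hz), "mean": fmean(f0_hz)}


def relative_tone(f0_hz):
    """Assumed representation of tone: each F0 value relative to the mean,
    so the contour shape is kept while the absolute height is removed."""
    mean = fmean(f0_hz)
    return [f / mean for f in f0_hz]


# Illustrative contour in Hz (not taken from the patent).
contour = [100.0, 200.0, 150.0]
summary = pitch_summary(contour)
shape = relative_tone(contour)
```

Under this reading, two utterances with the same accent pattern but different voice heights give the same `relative_tone` output and different `pitch_summary` values.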

The target parameter determination unit 20 performs target parameter determination processing based on the prosodic information of the tone sample, the pitch sample, and the time information sample selected by the prosody selection unit 19, and determines the target parameters. The target parameters are information (prosodic information) consisting of time change information of the fundamental frequency having the target prosody and phoneme time information having the target prosody.

The speech data stored in the speech database 11 will now be described in more detail with reference to FIG. 2. FIG. 2 is a schematic diagram showing the structure of the speech data stored in the speech database 11.
The speech data holds, for each item of speech, speech signal data, prosodic information, speaker identification information, kanji notation, kana notation, and phoneme label notation in association with one another. The speech signal data corresponds to the waveform of the speech and is, for example, an array of amplitude values in time series. The prosodic information consists of time change information of the fundamental frequency and phoneme time information. The time change information of the fundamental frequency is data in which fundamental frequency values are arranged by elapsed time. The phoneme time information gives the start time and end time of each phoneme, measured from the start of the speech. Phonemes are described later.
The speaker identification information identifies the speaker of the speech. The kanji notation is information in which the kanji representing the utterance content are arranged in utterance order. The kana notation is information in which the kana representing the utterance content are arranged in utterance order. The phoneme label notation is information in which the phoneme labels representing the utterance content are arranged in utterance order.

For example, the first row of the data in FIG. 2 shows that the time change information of the fundamental frequency “FRQ1” and the phoneme time information “TIME1” are associated, as prosodic information, with the speech signal data “WAVE1”, and that for this speech the speaker identification information is “A01”, the kanji notation of the utterance content is “北海道” (Hokkaido), the kana notation is “ほっかいどー”, and the phoneme label notation is “hoQkaido:”.
The fourth row of the data represents speech data whose utterance content is one mora. It shows that the time change information of the fundamental frequency “FRQ4” and the phoneme time information “TIME4” are associated, as prosodic information, with the speech signal data “WAVE4”, and that for this speech the speaker identification information is “A01”, there is no kanji notation (shown as “−” in FIG. 2), the kana notation is “あ”, and the phoneme label notation is “a”.
When new speech data is created, the speech database 11 can store it as a new entry. It can also obtain speech data created by another speech-producing device and store that data as a new entry.
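The record layout of FIG. 2 can be sketched as a simple data class. The class and field names are hypothetical; the field values are the ones quoted above for rows 1 and 4, with the data names ("WAVE1", "FRQ1", "TIME1") standing in for the actual signal, contour, and timing data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechRecord:
    signal: str            # name of the speech signal data, e.g. "WAVE1"
    f0_contour: str        # name of the F0 time change information, e.g. "FRQ1"
    phoneme_times: str     # name of the phoneme time information, e.g. "TIME1"
    speaker_id: str        # speaker identification information
    kanji: Optional[str]   # kanji notation, or None when shown as "−" in FIG. 2
    kana: str              # kana notation
    phoneme_labels: str    # phoneme label notation


row1 = SpeechRecord("WAVE1", "FRQ1", "TIME1", "A01", "北海道", "ほっかいどー", "hoQkaido:")
row4 = SpeechRecord("WAVE4", "FRQ4", "TIME4", "A01", None, "あ", "a")
```

Representing the missing kanji notation as `None` is a choice made for this sketch; the figure itself uses a dash.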

FIG. 3 is a waveform diagram of a speech signal. The vertical axis represents amplitude, and the horizontal axis represents the time elapsed since the start of the speech. The speech signal data is the data in which the amplitude values of the speech signal are arranged by elapsed time. Specifically, FIG. 3 shows the waveform of WAVE1 in the first row of the data in FIG. 2, that is, the utterance “ほっかいどー” (Hokkaido).

FIG. 4 is a graph showing how the fundamental frequency changes over time. The time change information of the fundamental frequency is data in which fundamental frequency values are arranged by elapsed time. In FIG. 4, the vertical axis represents the fundamental frequency, and the horizontal axis represents elapsed time. Specifically, FIG. 4 plots FRQ1 in the first row of the data in FIG. 2.

FIG. 5 is a schematic diagram showing text data that represents phoneme time information. In FIG. 5, for example, TIME1 on the first line is the name of the phoneme time information data, and the second to eleventh lines give the time information of each phoneme. A phoneme is the smallest phonological unit of sound; each vowel and each consonant corresponds to one phoneme. The moraic nasal, the long vowel, and the geminate consonant also each correspond to one phoneme.
In the data on the second to eleventh lines, the first column gives the time from the start of the speech to the start of each phoneme, the second column gives the time from the start of the speech to the end of each phoneme, both in units of 1/10,000 second, and the third column gives the phoneme label. For example, "0 4750 sil" in FIG. 5 indicates that the interval from the start of the speech until 0.475 seconds have elapsed is a silent section. Likewise, "4750 5100 h" indicates that the phoneme between 0.475 seconds and 0.51 seconds after the start of the speech is "h". The phoneme label sil indicates that there is no phoneme, the phoneme label Q indicates a geminate consonant, and the phoneme label o: indicates the long vowel of "o". Although times are expressed here in units of 1/10,000 second as an example, other units such as 1/1,000 second (milliseconds) may be used.
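The text format described for FIG. 5 can be read with a few lines of code. The following is a minimal sketch, not part of the disclosed device: the names `PhonemeSpan` and `parse_phoneme_times` are hypothetical, and it assumes whitespace-separated columns with the data name on the first line.

```python
from typing import NamedTuple


class PhonemeSpan(NamedTuple):
    start: float  # seconds from the start of the speech
    end: float    # seconds from the start of the speech
    label: str    # phoneme label such as "sil", "h", "Q", "o:"


def parse_phoneme_times(text, ticks_per_second=10_000):
    """Parse phoneme time information laid out as in FIG. 5.

    The first non-empty line is the data name (e.g. "TIME1"); every
    following line is "<start> <end> <label>" in ticks of
    1/ticks_per_second second (1/10,000 second by default).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    name, rows = lines[0], lines[1:]
    spans = []
    for row in rows:
        start, end, label = row.split()
        spans.append(
            PhonemeSpan(int(start) / ticks_per_second,
                        int(end) / ticks_per_second,
                        label)
        )
    return name, spans


example = """TIME1
0 4750 sil
4750 5100 h
"""
name, spans = parse_phoneme_times(example)
```

Passing `ticks_per_second=1000` would cover the millisecond variant mentioned in the text.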

FIG. 6 is a conceptual diagram explaining the condition that applies when the user chooses other-word speech (speech whose utterance content differs from that of the speech to be corrected), for which the sample speech designation unit 17 accepts input.
There are two kinds of other-word speech: other-word speech by another speaker and other-word speech by the same speaker. Whether or not a speech is other-word speech changes the priority given to its prosodic information when the tone sample, the pitch sample, and the time information sample are each selected.
When designating other-word speech through the sample speech designation unit 17, the user selects speech whose utterance content has the same number of morae as that of the speech to be corrected. A mora is a phonological unit of sound length. In Japanese, roughly speaking, two kana characters correspond to one mora for a contracted sound (yōon), and one kana character corresponds to one mora otherwise. One mora consists of one or more phonemes.
FIG. 6(a) shows the mora boundaries and phoneme boundaries of a specific example of the speech to be corrected, "あおいいえ" (aoi ie), and FIG. 6(b) shows the mora boundaries and phoneme boundaries of a specific example of other-word speech, "しろいいえ" (shiroi ie).
In FIG. 6, "しろいいえ" has five morae, which matches the number of morae of the speech to be corrected, "あおいいえ". "しろいいえ" therefore satisfies the condition of having the same number of morae as the speech to be corrected, and the user can select it as other-word speech.

The user can also select, as other-word speech, speech that has the same number of morae as the speech to be corrected but a different number of phonemes. In FIG. 6, for example, "しろいいえ" in FIG. 6(b) has seven phonemes, whereas the speech to be corrected, "あおいいえ", in FIG. 6(a) has five, so the phoneme counts do not match. As described above, however, "しろいいえ" satisfies the condition because its mora count matches. The user can therefore select speech with the same number of morae as the speech to be corrected and a different number of phonemes.
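The mora-count condition can be checked mechanically from the kana notation. The sketch below is illustrative only: `count_morae` and `satisfies_mora_condition` are hypothetical names, the rule is simplified to "every kana is one mora except small ゃ/ゅ/ょ", and the phoneme segmentations are an assumption chosen to agree with the counts of five and seven given in the text.

```python
# Small kana that merge with the preceding kana into one mora (contracted sounds).
SMALL_YOON_KANA = set("ゃゅょャュョ")


def count_morae(kana: str) -> int:
    """Count morae in a kana string under the simplified rule above."""
    return sum(1 for ch in kana if ch not in SMALL_YOON_KANA)


def satisfies_mora_condition(target_kana: str, candidate_kana: str) -> bool:
    """True when the candidate has the same number of morae as the target."""
    return count_morae(target_kana) == count_morae(candidate_kana)


# Assumed phoneme segmentations for the FIG. 6 examples (5 and 7 phonemes).
target_phonemes = ["a", "o", "i", "i", "e"]                 # あおいいえ
candidate_phonemes = ["sh", "i", "r", "o", "i", "i", "e"]   # しろいいえ

ok = satisfies_mora_condition("あおいいえ", "しろいいえ")
```

Here `ok` is true even though the two phoneme lists differ in length, which mirrors the case described for FIG. 6.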

FIG. 7 is a schematic diagram showing an example of the priority order in which the prosody selection unit 19 selects sample speech. Specifically, in FIG. 7, each combination of a tone sample, a pitch sample, and a time information sample is associated with a priority.
The candidate speech includes, for example, the following types of speech.
(1) Speech whose input was received by the speech input unit
(a) Same utterance content as the speech to be corrected
(b) Utterance content different from the speech to be corrected, same number of phonemes
(c) Utterance content different from the speech to be corrected, same number of morae
(2) Speech stored in the speech database
(a) Utterance content different from the speech to be corrected, same number of phonemes, same speaker
(b) Utterance content different from the speech to be corrected, same number of morae, same speaker
(c) Utterance content different from the speech to be corrected, same number of phonemes, other speaker
(d) Utterance content different from the speech to be corrected, same number of morae, other speaker
(e) Same utterance content as the speech to be corrected, other speaker
(3) The speech to be corrected

For each of the tone sample, the pitch sample, and the time information sample, one of the types (1)(a) to (1)(c), (2)(a) to (2)(e), and (3) described above is chosen.
For example, the first row of FIG. 7 associates priority 1 with the combination in which (2)(a) is the tone sample, (3) is the pitch sample, and (3) is the time information sample.
The second row of FIG. 7 associates priority 2 with the combination in which (2)(a) is the tone sample, (2)(b) is the pitch sample, and (3) is the time information sample.
Here, 1 is the highest priority.

Specifically, for the tone sample, the prosody selection unit 19 selects one speech from the sample speech received by the speech input unit 14 and the sample speech that the sample speech search unit 18 retrieved from the speech database 11. For the pitch sample, the prosody selection unit 19 selects one speech from the sample speech that the sample speech search unit 18 obtained from the speech database 11 (speech by the same speaker as the speech to be corrected) and the speech to be corrected obtained from the synthesized speech storage unit 13. For the phoneme time information, the prosody selection unit 19 selects one speech from the sample speech received by the speech input unit 14, the sample speech that the sample speech search unit 18 obtained from the speech database 11, and the speech to be corrected obtained from the synthesized speech storage unit 13.

Among the possible combinations of the sample speech selected for the tone sample, the sample speech selected for the pitch sample, and the sample speech selected for the phoneme time information, the prosody selection unit 19 selects the combination with the highest priority according to the priority of each combination.

For example, suppose that no sample speech is received from the speech input unit 14, that sample speech of types (2)(a) and (2)(b) is received from the sample speech designation unit 17, and that (3) is received from the synthesized speech storage unit 13, which stores the speech to be corrected. The combination in which the tone sample is (2)(a), the pitch sample is (3), and the time information sample is (3) has priority 1, and the other combinations of these sample speeches have priorities 2, 3, and 7, so the prosody selection unit 19 selects the combination with priority 1.
As another example, suppose that sample speech of type (1)(a) is received from the speech input unit 14, that sample speech of type (2)(b) is received from the sample speech designation unit 17, and that (3) is received from the synthesized speech storage unit 13, which stores the speech to be corrected. The combination in which the tone sample is (1)(a), the pitch sample is (3), and the time information sample is (3) has priority 4, and the combination in which the tone sample is (1)(a), the pitch sample is (2)(b), and the time information sample is (3) has priority 5, so the prosody selection unit 19 selects the combination with priority 4.
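The selection rule in these two examples amounts to a table lookup followed by taking the smallest priority number. The following sketch is an illustration under stated assumptions: the table holds only the four rows of FIG. 7 that the text spells out, the type codes such as `"2a"` are shorthand invented here for (2)(a), and `select_combination` and its candidate lists are hypothetical.

```python
from itertools import product

# Partial priority table: only the rows of FIG. 7 quoted in the text.
# Keys are (tone sample, pitch sample, time information sample).
PRIORITY = {
    ("2a", "3", "3"): 1,
    ("2a", "2b", "3"): 2,
    ("1a", "3", "3"): 4,
    ("1a", "2b", "3"): 5,
}


def select_combination(tone_candidates, pitch_candidates, time_candidates,
                       table=PRIORITY):
    """Return the listed combination with the highest priority (1 is highest).

    Combinations absent from the table are ignored; None is returned when
    no candidate combination is listed.
    """
    ranked = [
        (table[combo], combo)
        for combo in product(tone_candidates, pitch_candidates, time_candidates)
        if combo in table
    ]
    if not ranked:
        return None
    return min(ranked)[1]


# Second example in the text: (1)(a) from the speech input unit, (2)(b) from
# the designation unit, and (3), the speech to be corrected.
chosen = select_combination(["1a"], ["2b", "3"], ["3"])
```

Storing the table as data, as the text suggests for a hard disk, means the combinations and priorities can be changed without touching the selection logic.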

The combinations are not limited to those illustrated in FIG. 7; the combinations or the priorities may be changed.
In addition, the information shown in FIG. 7 may, for example, be stored in advance in a storage device such as a hard disk. When performing the prosody selection processing, the prosody selection unit 19 can then refer to this storage device and select the prosodic information to serve as the sample according to the priority order.

Next, the target parameter determination unit 20 is described in detail.
FIG. 8 is an explanatory diagram of a case where the time change information of the fundamental frequency of the tone sample is calculated using the average value of the fundamental frequency in the time direction (hereinafter referred to as the "time average value"). In FIG. 8, the vertical axis represents the fundamental frequency, and the horizontal axis represents time.

FIGS. 8(a) and 8(b) show graph (ア), which represents the time change of the fundamental frequency of the tone sample, and graph (イ), which represents the time change of the fundamental frequency of the pitch sample. FIG. 8(b) additionally shows graph (ウ), which represents the time change of the fundamental frequency that the target parameter determination unit 20 calculates from the average fundamental frequency of the pitch sample and the average fundamental frequency of the tone sample.
When the time change information of the fundamental frequency of the tone sample is calculated using the time average value, the target parameter determination unit 20 first calculates the time average value of the fundamental frequency of the tone sample (ア in FIG. 8(a)) and the time average value of the fundamental frequency of the pitch sample (イ in FIG. 8(a)). It then calculates time change information of the fundamental frequency of the tone sample such that its time average value equals that of the pitch sample. Specifically, the target parameter determination unit 20 calculates the difference between the time average value of the pitch sample and the time average value of the tone sample, and adds this difference to the fundamental frequency of the tone sample at each time point to obtain time-series data. The resulting time change information of the fundamental frequency of the tone sample is shown as graph (ウ) in FIG. 8(b).
By calculating a fundamental frequency of the tone sample whose average value equals the average fundamental frequency of the pitch sample in this way, time change information of the fundamental frequency of the tone sample is obtained that is close to the pitch of the pitch sample.
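The average-based adjustment is a constant offset applied to the whole contour. The following is a minimal sketch, assuming each contour is a plain list of fundamental frequency values in Hz sampled over time; the function name `shift_to_match_mean` is hypothetical, and the treatment of unvoiced frames is not covered because the text does not describe it.

```python
def shift_to_match_mean(tone_f0, pitch_f0):
    """Shift the tone sample contour so its time average equals the pitch sample's.

    The offset is (mean of pitch sample) - (mean of tone sample), added to the
    tone sample at every time point, so the contour shape is unchanged.
    """
    tone_mean = sum(tone_f0) / len(tone_f0)
    pitch_mean = sum(pitch_f0) / len(pitch_f0)
    offset = pitch_mean - tone_mean
    return [f + offset for f in tone_f0]


tone = [100.0, 120.0, 110.0]   # illustrative values only
pitch = [200.0, 220.0]
shifted = shift_to_match_mean(tone, pitch)
```

The two contours need not have the same number of samples, because only their averages are compared.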

Next, with reference to FIG. 9, a case is described in which the overall length of the tone sample is changed without changing the ratio of each speech unit to the overall length. Here, the tone sample overall length is the sum, over the speech units of the tone sample, of the time from the start to the end of utterance of each unit, and the ratio of a speech unit to the overall length is the ratio of the time from the start to the end of utterance of that speech unit to the tone sample overall length. A speech unit is speech waveform data that makes up synthesized speech. The example of FIG. 9 assumes that the speech unit is a mora.
FIG. 9 is a conceptual diagram of the case where the tone sample overall length is changed without changing the ratio of each speech unit to the overall length. The upper part of FIG. 9 shows the phoneme time information of the time information sample, the middle part shows the phoneme time information of the tone sample, and the lower part shows the phoneme time information of the target parameter. The horizontal axis of FIG. 9 indicates time.
In this case, the target parameter determination unit 20 first calculates the time information sample overall length, which is the sum of the times from the start to the end of utterance of the speech units of the time information sample, and the tone sample overall length. It then changes the phoneme time information of the tone sample so that the tone sample overall length matches the time information sample overall length, without changing the ratio of each speech unit to the overall length. Specifically, the target parameter determination unit 20 multiplies the time information of each phoneme of the tone sample by the ratio of the time information sample overall length to the tone sample overall length, which yields phoneme time information of the tone sample whose overall length matches the time information sample overall length. The target parameter determination unit 20 determines the phoneme time information obtained in this way as the phoneme time information of the target parameter.
As a result, even when the time information sample overall length and the tone sample overall length differ, the tone sample overall length can be adjusted to match the time information sample overall length.
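Keeping each unit's share of the whole while changing the total is a uniform scaling of the durations. The sketch below assumes durations are given per speech unit in ticks of 1/10,000 second and that the scale factor is the time information sample total divided by the tone sample total, which is the direction needed for the totals to coincide; `rescale_durations` is a hypothetical name.

```python
def rescale_durations(tone_durations, time_durations):
    """Scale tone sample durations so their total equals the time sample total.

    Every duration is multiplied by the same factor, so the ratio of each
    unit to the overall length is preserved.
    """
    tone_total = sum(tone_durations)
    if tone_total <= 0:
        raise ValueError("tone sample overall length must be positive")
    scale = sum(time_durations) / tone_total
    return [d * scale for d in tone_durations]


tone_units = [1000, 2000, 1000]   # illustrative mora lengths, total 4000 ticks
time_units = [1500, 2500, 2000]   # total 6000 ticks
target_units = rescale_durations(tone_units, time_units)
```

With these values the scale factor is 1.5, so each unit keeps its share of one quarter, one half, and one quarter of the total.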

FIG. 10 is an explanatory diagram of a case where the time change information of the fundamental frequency of the tone sample, whose pitch has already been changed to match the pitch of the pitch sample, is further changed according to the time information sample. In this case, the target parameter determination unit 20 takes the time change information of the fundamental frequency of the tone sample that has been brought close to the pitch of the pitch sample ((ウ) in FIG. 10) and changes it according to the time information sample so that the overall length of this time change information matches the overall length of the phoneme time information. This generates the time change information of the fundamental frequency of the target parameter.
Specifically, the target parameter determination unit 20 calculates the ratio of the overall length of the time information of the time information sample to the overall length of the time information of the pitch-changed tone sample, and updates the time information of the pitch-changed tone sample by multiplying it by this ratio. For example, (エ) in FIG. 10 shows the time change information of the fundamental frequency obtained by updating the time change information of the tone sample that was brought close to the pitch of the pitch sample ((ウ) in FIG. 10) so that it matches the overall length of the phoneme time information.
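Stretching a fundamental frequency contour to a new overall length changes only its time axis. The following sketch assumes the contour is a list of (time, frequency) pairs, that its overall length is the span between the first and last time stamps, and that the frequency values are left untouched; `stretch_contour` is a hypothetical name and the representation is not taken from the source.

```python
def stretch_contour(contour, target_length):
    """Rescale the time stamps of a (time, f0) contour to a new overall length.

    Times are scaled by target_length / current_length relative to the first
    time stamp; the fundamental frequency values are not modified.
    """
    t0 = contour[0][0]
    current_length = contour[-1][0] - t0
    if current_length <= 0:
        raise ValueError("contour must span a positive duration")
    scale = target_length / current_length
    return [(t0 + (t - t0) * scale, f0) for t, f0 in contour]


# Pitch-adjusted tone sample contour spanning 2000 ticks, stretched to 3000.
contour = [(0, 200.0), (1000, 220.0), (2000, 210.0)]
stretched = stretch_contour(contour, 3000)
```

A frame-based representation with a fixed sampling period would need resampling instead of rescaled time stamps; that variant is not shown here.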

Instead of the average fundamental frequency of the pitch sample, the target parameter determination unit 20 may change the fundamental frequency of the tone sample based on the range (the difference between the highest and lowest values), the maximum value, or the minimum value of the fundamental frequency of the pitch sample. The target parameter determination unit 20 may also change the fundamental frequency of the tone sample based on the range of the fundamental frequency of the pitch sample together with any one of its average value, maximum value, and minimum value.

The cases where the fundamental frequency of the tone sample is changed based on the maximum value, the minimum value, or the range of the fundamental frequency of the pitch sample are described next. FIG. 11 is an explanatory diagram of a case where the target parameter determination unit 20 calculates the time change information of the fundamental frequency of the tone sample based on the maximum value of the fundamental frequency of the pitch sample. In this case, the target parameter determination unit 20 first calculates the maximum value of the fundamental frequency of the tone sample ((ア) in FIG. 11) and the maximum value of the fundamental frequency of the pitch sample ((イ) in FIG. 11). It then calculates a fundamental frequency of the tone sample whose maximum value equals the maximum value of the fundamental frequency of the pitch sample. Specifically, the target parameter determination unit 20 calculates the difference between the maximum value of the fundamental frequency of the pitch sample and the maximum value of the fundamental frequency of the tone sample, and generates time-series data by adding this difference to the fundamental frequency of the tone sample at each time point.
In FIG. 11, (ア) is a graph of the fundamental frequency of the tone sample, (イ) is a graph of the fundamental frequency of the pitch sample, and (ウ) is a graph of the sum obtained by adding the difference between the maximum value of the fundamental frequency of the pitch sample and the maximum value of the fundamental frequency of the tone sample to the fundamental frequency of the tone sample at each time point.

Next, a case is described in which the target parameter determination unit 20 calculates the time change information of the fundamental frequency of the tone sample based on the minimum value of the fundamental frequency of the pitch sample. In this case, the target parameter determination unit 20 first calculates the minimum value of the fundamental frequency of the pitch sample and the minimum value of the fundamental frequency of the tone sample. It then calculates a fundamental frequency of the tone sample whose minimum value equals the minimum value of the fundamental frequency of the pitch sample. Specifically, the target parameter determination unit 20 calculates the difference between the minimum value of the fundamental frequency of the pitch sample and the minimum value of the fundamental frequency of the tone sample, and generates time-series data by adding this difference to the fundamental frequency of the tone sample at each time point.
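The maximum-based and minimum-based variants differ only in which statistic of the two contours is aligned. A minimal sketch under that reading, with contours as plain lists of values in Hz; `shift_to_match` and its `statistic` parameter are hypothetical names introduced for illustration.

```python
def shift_to_match(tone_f0, pitch_f0, statistic=max):
    """Shift the tone sample contour so that statistic(tone) == statistic(pitch).

    `statistic` is any function of a list of values, for example the built-in
    max or min. The same offset is added at every time point.
    """
    offset = statistic(pitch_f0) - statistic(tone_f0)
    return [f + offset for f in tone_f0]


tone = [100.0, 120.0, 110.0]   # illustrative values only
pitch = [180.0, 230.0]
by_max = shift_to_match(tone, pitch, statistic=max)
by_min = shift_to_match(tone, pitch, statistic=min)
```

Because the offset is constant, the two results have the same shape as the tone sample and differ from each other only by a vertical shift.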

FIG. 12 is an explanatory diagram of a case where the target parameter determination unit 20 changes the fundamental frequency of the tone sample based on the range and the average value of the fundamental frequency of the pitch sample. In this case, the target parameter determination unit 20 first calculates the range (the difference between the maximum and minimum values) of the fundamental frequency of the pitch sample and the range of the fundamental frequency of the tone sample. Next, it generates time change information of the fundamental frequency of the tone sample such that the range of the tone sample equals the previously calculated range of the pitch sample. Specifically, the target parameter determination unit 20 calculates the ratio of the range of the fundamental frequency of the pitch sample to the range of the fundamental frequency of the tone sample, and generates time-series data by multiplying the fundamental frequency of the tone sample at each time point by this ratio.
In FIG. 12(a), graph (ア) shows the fundamental frequency of the tone sample, and graph (エ) shows the fundamental frequency of the tone sample after its range has been made equal to the range of the fundamental frequency of the pitch sample.
The target parameter determination unit 20 then calculates time-series data by adding the average value of the fundamental frequency of the pitch sample to the fundamental frequency at each time point of the tone sample obtained above ((エ) in FIG. 12(a)). The resulting fundamental frequency of the tone sample is shown as graph (ウ) in FIG. 12(b).
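The range-and-average variant can be sketched as a scaling step followed by a shift. The code below rests on two assumptions that the text does not state explicitly: the scale factor is the pitch sample range divided by the tone sample range, and the final step shifts the scaled contour so that its average coincides with the pitch sample average (rather than adding the raw average value). The function name `match_range_and_mean` is hypothetical.

```python
def match_range_and_mean(tone_f0, pitch_f0):
    """Scale the tone contour to the pitch sample's range, then align the means.

    Assumption: 'range' is max - min, and the mean is aligned by adding
    (pitch mean - scaled tone mean) at every time point.
    """
    tone_range = max(tone_f0) - min(tone_f0)
    if tone_range == 0:
        raise ValueError("tone sample contour is flat; its range cannot be scaled")
    ratio = (max(pitch_f0) - min(pitch_f0)) / tone_range
    scaled = [f * ratio for f in tone_f0]
    offset = sum(pitch_f0) / len(pitch_f0) - sum(scaled) / len(scaled)
    return [f + offset for f in scaled]


tone = [100.0, 120.0, 110.0]    # range 20, mean 110 (illustrative values)
pitch = [150.0, 190.0, 170.0]   # range 40, mean 170
adjusted = match_range_and_mean(tone, pitch)
```

Scaling before shifting matters: multiplying by the ratio also moves the average, so the shift has to be computed from the scaled contour.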

FIG. 13 is a conceptual diagram of the processing in which the target parameter determination unit 20 changes the phoneme time information of the tone sample so that the length of each speech unit of the tone sample matches the length of the corresponding speech unit of the time information sample. The length of a speech unit is the time from the start to the end of utterance of one speech unit. Corresponding speech units are speech units whose position, counted from the head of the sequence of speech units, is the same in the time information sample and the tone sample.
The upper part of FIG. 13 shows the mora lengths of the tone sample, the middle part shows the mora lengths of the target parameter, and the lower part shows the mora lengths of the time information sample. The length of a mora is the sum of the lengths (durations) of the phonemes it contains. The horizontal axis of FIG. 13 indicates time.
When changing the phoneme time information of the tone sample so that the length of each mora of the tone sample matches the length of the corresponding mora of the time information sample, the target parameter determination unit 20 first calculates the length of each mora of the time information sample (lower part). It then changes the length of each mora of the tone sample (upper part) to match the length of the corresponding mora of the time information sample (lower part). For example, the target parameter determination unit 20 achieves the match by replacing the length of each mora of the tone sample with the length of the corresponding mora of the time information sample. Based on the mora lengths obtained by this processing, the target parameter determination unit 20 changes the phoneme time information of the tone sample. In other words, it obtains the phoneme time information of the target parameter (middle part).
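Matching lengths mora by mora can be sketched as follows. The text specifies that each mora length is replaced by the corresponding time information sample length but does not say how the phonemes inside a mora share the new length; the sketch assumes they are scaled proportionally. `fit_morae` and the nested-list layout are hypothetical.

```python
def fit_morae(tone_morae, time_mora_lengths):
    """Give each tone sample mora the length of the corresponding time sample mora.

    tone_morae: list of morae, each a list of phoneme durations.
    time_mora_lengths: target length of each mora, in the same order.
    Assumption: phonemes inside a mora keep their relative proportions.
    """
    if len(tone_morae) != len(time_mora_lengths):
        raise ValueError("both samples must have the same number of morae")
    result = []
    for phonemes, target in zip(tone_morae, time_mora_lengths):
        scale = target / sum(phonemes)
        result.append([d * scale for d in phonemes])
    return result


# Two morae: a consonant plus vowel (400 + 600 ticks) and a lone vowel (1000).
tone_morae = [[400, 600], [1000]]
time_lengths = [1500, 2000]
target_morae = fit_morae(tone_morae, time_lengths)
```

Unlike uniform scaling of the whole utterance, this approach lets each mora be stretched or compressed by a different factor, which is why the mora counts of the two samples must agree.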

With reference to FIG. 8, FIG. 11, and FIG. 12, the following processes for changing the pitch of the tone sample according to the pitch sample have been described: (1) changing the pitch of the tone sample based on the time average value of the fundamental frequency; (2) changing it based on the maximum value of the fundamental frequency; (3) changing it based on the minimum value of the fundamental frequency; and (4) changing it based on the difference between the maximum and minimum values of the fundamental frequency. Any one of (1) to (4) may be applied as the pitch-changing process in accordance with the user's instruction.
With reference to FIG. 9 and FIG. 13, the following processes for changing the overall length have been described: (5) changing the overall length without changing the ratio of each speech unit; and (6) changing the overall length by changing the time information of each phoneme so that it matches the time information of each phoneme of the time information sample between corresponding speech units. Either (5) or (6) may be applied as the length-changing process in accordance with the user's instruction.
With reference to FIG. 10, process (7) has been described, in which the overall length of the time change information of the fundamental frequency of the tone sample, whose pitch has been changed to match the pitch of the pitch sample based on the time average value of the fundamental frequency, is changed without changing the ratio of each speech unit. In (7), the pitch of the tone sample may instead be changed by any of (2), (3), and (4) above rather than based on the time average value of the fundamental frequency. Likewise, although (7) was described for the case where the overall length of the pitch-changed tone sample is changed by (5), the overall length may instead be changed by (6), that is, by changing the time information of each phoneme so that it matches the time information of each phoneme of the time information sample between corresponding speech units. In this way, after the pitch of the tone sample has been changed according to the pitch of the pitch sample by any of (1) to (4), the overall length of the pitch-changed tone sample can be changed by (5) or (6). The user may select which of the processes (1) to (4) and which of the processes (5) and (6) to apply.

Next, the overall processing procedure of the target parameter determination device 1 will be described.
FIG. 14 is a flowchart showing the overall processing procedure of the target parameter determination device 1.
As shown in the figure, in step S01, the correction target speech designation unit 12 first accepts an input designating the correction target speech. Specifically, the correction target speech designation unit 12 accepts the designation of an arbitrary correction target speech by accepting input of the utterance content and the speaker identification information of the correction target speech.

Next, in step S03, it is determined whether the designation of sample speech has been completed. If it has been completed, the process proceeds to step S09; if not, the process proceeds to step S04. This determination is made based on, for example, either the number of times the loop of steps S04 to S08 has been repeated (for example, a number of times specified by the user) or whether a designation end command has been input.

Next, in step S04, it is determined whether the speech input unit 14 has accepted an input of sample speech. If it has, the process proceeds to step S05, where the speech text input unit 15 accepts input of text data representing the utterance content of the sample speech accepted by the speech input unit 14. Next, in step S06, the speech analysis unit 16 performs speech analysis on the input sample speech and, based on the text data input through the speech text input unit 15, generates prosody information of the input speech (same-word speech or other-word speech). The process then returns to step S03.
On the other hand, if the speech input unit 14 has not accepted an input of sample speech in step S04, it is determined in step S07 whether the sample speech designation unit 17 has accepted input of text data designating a sample speech. If it has not, the process returns to step S03. If it has, in step S08 the sample speech search unit 18 reads the speech data of the designated sample speech (other-word same-speaker speech, other-word other-speaker speech, or same-word other-speaker speech) from the speech database 11, and the process returns to step S03.
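The loop of steps S03 to S08 can be sketched as a small event loop. Everything below is a hypothetical stand-in chosen for illustration: the event tuples replace real user input, and the `analyze` and `search` callables replace the speech analysis unit 16 and the sample speech search unit 18, whose actual interfaces are not given in the text.

```python
def collect_samples(events, analyze, search, max_loops=None):
    """Gather sample prosody following the S03-S08 loop.

    events: iterable of ("speech", audio, text), ("designate", text),
            ("idle",) or ("end",) tuples standing in for user input.
    analyze(audio, text): stand-in for the speech analysis unit 16.
    search(text): stand-in for the sample speech search unit 18.
    """
    samples = []
    loops = 0
    for event in events:
        # S03: stop on an end command or after the allowed number of loops.
        if event[0] == "end" or (max_loops is not None and loops >= max_loops):
            break
        loops += 1
        if event[0] == "speech":                   # S04: speech input received
            _, audio, text = event                 # S05: text of the utterance
            samples.append(analyze(audio, text))   # S06: speech analysis
        elif event[0] == "designate":              # S07: text designation received
            samples.append(search(event[1]))       # S08: read from the database
        # Otherwise nothing was received; control returns to S03.
    return samples


demo_events = [("speech", [0.1, 0.2], "aoi"), ("idle",), ("designate", "ie"), ("end",)]
print(collect_samples(demo_events,
                      analyze=lambda audio, text: ("analyzed", text),
                      search=lambda text: ("from_db", text)))
```

Both termination conditions named for step S03 are modeled: the explicit end command and the optional loop count.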

If the designation of sample speech has been completed in step S03 (step S03: YES), then in step S09 the prosody selection unit 19 reads, from the synthesized speech storage unit 13, the prosody information of the correction target speech whose designation was accepted by the correction target speech designation unit 12.

Next, in step S10, the prosody selection unit 19 selects, in accordance with the priority order, the speech suited to each of the tone sample, the pitch sample, and the time information sample from among the sample speech whose prosody information was read or calculated between the start of the overall processing of the target parameter determination device 1 and step S10.
Next, in step S11, the target parameter determination unit 20 calculates, based on the time variation information of the fundamental frequency of the tone sample and that of the pitch sample, the time variation information of the fundamental frequency of the tone sample adjusted to the pitch of the pitch sample.
For example, the target parameter determination unit 20 calculates the time average value of the fundamental frequency of the tone sample and that of the pitch sample, calculates the difference between the time average of the pitch sample and the time average of the tone sample, and adds this difference to the fundamental frequency of the tone sample to produce time-series data. The time variation information of the fundamental frequency of the tone sample adjusted to the pitch of the pitch sample is thereby obtained.
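The mean-difference shift of step S11 can be illustrated with a short sketch. It assumes, for illustration only, that each fundamental frequency (F0) contour is a list of values in Hz sampled at uniform intervals over voiced frames, so that the arithmetic mean equals the time average; the function name `shift_to_pitch_sample` is hypothetical.

```python
def shift_to_pitch_sample(tone_f0, pitch_f0):
    """Shift the tone-sample F0 contour so its mean matches the pitch sample.

    tone_f0, pitch_f0: F0 values in Hz, uniformly sampled (assumption).
    Returns the shifted tone-sample contour as a new list.
    """
    if not tone_f0 or not pitch_f0:
        raise ValueError("F0 contours must not be empty")
    tone_mean = sum(tone_f0) / len(tone_f0)
    pitch_mean = sum(pitch_f0) / len(pitch_f0)
    # Difference between the two time averages, as described for step S11.
    offset = pitch_mean - tone_mean
    # Adding a constant keeps the shape (tone) and moves only the level (pitch).
    return [f + offset for f in tone_f0]


print(shift_to_pitch_sample([100.0, 120.0, 140.0], [200.0, 220.0]))
# prints [190.0, 210.0, 230.0]
```

Because the offset is constant, the contour keeps the intonation pattern of the tone sample while its average moves to that of the pitch sample.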

Next, in step S12, the target parameter determination unit 20 calculates the phoneme time information that serves as the target parameter, based on the phoneme time information of the tone sample and the phoneme time information of the time information sample.
For example, the target parameter determination unit 20 calculates the overall length of the tone sample from its phoneme time information and the overall length of the time information sample from its phoneme time information. The target parameter determination unit 20 then calculates the ratio of each speech unit to the overall length and changes the phoneme time information of the tone sample by multiplying the time information of each phoneme of the tone sample by the ratio between the overall length of the time information sample and the overall length of the tone sample. Phoneme time information of the tone sample whose overall length matches that of the time information sample is thereby obtained. The target parameter determination unit 20 determines the obtained phoneme time information as the phoneme time information of the target parameter.
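A minimal sketch of the proportional length change in step S12 follows. The function name `scale_durations` and the duration-list layout are assumptions. The scale factor is taken here as the overall length of the time information sample divided by the overall length of the tone sample, because that is the factor that makes the two overall lengths coincide, which is the stated goal of the step.

```python
def scale_durations(tone_durations, ref_durations):
    """Rescale tone-sample phoneme durations to the reference overall length.

    tone_durations: phoneme durations (s) of the tone sample.
    ref_durations: phoneme durations (s) of the time information sample.
    Every phoneme is multiplied by the same factor, so the share of each
    speech unit in the overall length is unchanged.
    """
    tone_total = sum(tone_durations)
    ref_total = sum(ref_durations)
    if tone_total <= 0:
        raise ValueError("tone sample must have a positive overall length")
    # Assumption: factor = reference overall length / tone overall length.
    scale = ref_total / tone_total
    return [d * scale for d in tone_durations]


tone = [0.10, 0.30, 0.20, 0.40]   # overall length 1.0 s
reference = [0.25, 0.25, 0.30]    # overall length 0.8 s
print(scale_durations(tone, reference))
```

The two samples need not have the same number of phonemes, since only their overall lengths enter the calculation.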

Next, in step S13, the target parameter determination unit 20 calculates the time variation information of the fundamental frequency that serves as the target parameter, based on the time variation information of the fundamental frequency of the tone sample adjusted to the pitch of the pitch sample and on the phoneme time information of the target parameter.
For example, the target parameter determination unit 20 calculates the ratio between the overall length of the time information of the time information sample and the overall length of the time information of the pitch-changed tone sample, and multiplies the overall length of the time information of the pitch-changed tone sample by this ratio. The time information of the pitch-changed tone sample is thereby updated and obtained as the target parameter.
When the target parameter has been obtained by updating the time information of the pitch-changed tone sample, the processing of this flowchart ends.
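Step S13 can be pictured as stretching the time axis of the pitch-changed contour. The sketch below assumes a contour stored as (time, F0) pairs and a hypothetical function name `rescale_contour_time`; the factor is again chosen so that the resulting overall length equals the target overall length, which is an interpretation of the ratio described in the text.

```python
def rescale_contour_time(contour, target_length):
    """Stretch the time axis of an F0 contour to a target overall length.

    contour: list of (time_s, f0_hz) pairs in increasing time order.
    target_length: desired overall length in seconds.
    F0 values are left untouched; only the time stamps change.
    """
    if len(contour) < 2:
        raise ValueError("need at least two points to define a length")
    start = contour[0][0]
    original_length = contour[-1][0] - start
    if original_length <= 0:
        raise ValueError("contour must span a positive duration")
    # Assumption: factor = target overall length / current overall length.
    scale = target_length / original_length
    return [(start + (t - start) * scale, f0) for t, f0 in contour]


shifted = [(0.0, 190.0), (0.5, 210.0), (1.0, 230.0)]
print(rescale_contour_time(shifted, 0.8))
```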

In the target parameter determination device 1 configured as described above, the target parameter is determined when a user correcting the correction target speech utters a sample speech having the target prosody into the speech input unit 14 or inputs an instruction to the sample speech designation unit 17. The user can therefore determine the target parameter easily, simply by uttering a sample speech or inputting an instruction that selects one, without having to work out specific values of the target parameter.

In addition, the target parameter determination device 1 can determine the target parameter by using, as the tone sample, the pitch sample, and the time information sample, even speech whose utterance content differs from that of the correction target speech, such as other-word same-speaker speech and other-word other-speaker speech. Accordingly, the speech uttered by the corrector or designated from the speech database 11 need not have the same utterance content as the correction target speech. This increases the freedom of the corrector's utterance and of designation from the speech database 11, making the target parameter easier to determine.

<Modification>
In the first embodiment described above, the prosody selection unit 19 selects phoneme time information without normalizing the overall time length of the speech section of each speech (same-word speech, other-word speech, other-word other-speaker speech, other-word same-speaker speech, and same-word other-speaker speech, whether obtained from the speech database 11 or accepted by the speech input unit 14) by the overall time length of the speech section of the correction target speech. Alternatively, the overall time length of each speech section may first be normalized to match the overall time length of the speech section of the correction target speech, and the selection may be made from the normalized phoneme time information. In this normalization, for example, the target parameter determination unit 20 calculates the overall length of the time information of the correction target speech and the overall length of the speech to be normalized. The target parameter determination unit 20 then calculates the ratio of each speech unit to the overall length of the correction target speech and changes the phoneme time information of the speech to be normalized by multiplying it by the calculated ratio.
Normalizing in this way makes it possible to obtain the target parameter without changing the overall length of the correction target speech.

[Second Embodiment]
Next, a second embodiment of the present invention will be described.
FIG. 15 is a block diagram showing the functional configuration of the synthesized speech correction device 2 according to this embodiment. As shown in the figure, the synthesized speech correction device 2 includes the functional units of the target parameter determination device 1 of the first embodiment shown in FIG. 1 and a correction unit 21. Parts corresponding to the functional units of the target parameter determination device 1 in FIG. 1 are given the same reference numerals, and their description is omitted.

The correction unit 21 selects a sample speech unit in accordance with the target parameter obtained by the target parameter determination unit 20, selects the speech signal data of that speech unit, and replaces the speech signal data of a speech unit of the correction target speech with the selected speech signal data. The prosody of the correction target speech is thereby corrected, and synthesized speech having a prosody close to the target prosody is generated.
Specifically, the correction unit 21 calculates, for each speech unit, the difference between the fundamental frequency of the target parameter and the fundamental frequency of the correction target speech, and detects speech units for which this difference exceeds a predetermined threshold (hereinafter, "speech units to be corrected").
In detecting a speech unit to be corrected, the correction unit 21 determines whether any of the following exceeds the predetermined threshold: (a) the difference in fundamental frequency at the start of the speech unit, (b) the difference in fundamental frequency at the end of the speech unit, (c) the difference in fundamental frequency at an intermediate point between the start and the end of the speech unit, (d) the difference between the average values of the fundamental frequency over the range from the start to the end of the speech unit, and (e) the definite integral of the absolute value of the fundamental frequency difference over the time range from the start to the end of the speech unit.
Next, the correction unit 21 reads from the speech database 11 the speech signal data of a speech unit that (1) has the same phoneme label as the detected speech unit to be corrected and (2) has the fundamental frequency closest to the fundamental frequency of the correction target section (the section corresponding to the time from the start to the end of the speech unit to be corrected) in the time variation information of the fundamental frequency of the target parameter. The correction unit 21 then replaces the speech signal data of the correction target section with the speech signal data of the read speech unit, and newly registers the resulting speech signal data and the corresponding prosody information (time variation information of the fundamental frequency and phoneme time information) in the synthesized speech storage unit 13 as synthesized speech.
The correction unit 21 obtains the time variation information of the fundamental frequency by performing frequency analysis again on the whole of the newly registered synthesized speech.
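The threshold test with criteria (a) to (e) can be sketched as below. Several details are assumptions made for illustration and are not stated in the text: both fundamental frequency (F0) contours are taken to cover the same speech unit with the same number of uniformly spaced frames, the frame period defaults to 5 ms, differences are compared as absolute values, and the definite integral in (e) is approximated by a rectangle rule. The function name and the criterion keywords are hypothetical.

```python
def unit_needs_correction(target_f0, subject_f0, criterion, threshold,
                          frame_s=0.005):
    """Return True if the chosen F0 difference measure exceeds the threshold.

    target_f0: F0 values (Hz) of the target parameter over one speech unit.
    subject_f0: F0 values (Hz) of the correction target speech, same frames.
    criterion: "start", "end", "middle", "mean" or "integral",
               corresponding to criteria (a) to (e).
    """
    if not target_f0 or len(target_f0) != len(subject_f0):
        raise ValueError("contours must be non-empty and equally long")
    diffs = [abs(t - s) for t, s in zip(target_f0, subject_f0)]
    if criterion == "start":        # (a) difference at the start point
        value = diffs[0]
    elif criterion == "end":        # (b) difference at the end point
        value = diffs[-1]
    elif criterion == "middle":     # (c) difference at an intermediate point
        value = diffs[len(diffs) // 2]
    elif criterion == "mean":       # (d) difference between the averages
        value = abs(sum(target_f0) / len(target_f0)
                    - sum(subject_f0) / len(subject_f0))
    elif criterion == "integral":   # (e) integral of |difference| over time
        value = sum(diffs) * frame_s
    else:
        raise ValueError("unknown criterion: " + criterion)
    return value > threshold


target = [200.0, 210.0, 220.0]
subject = [200.0, 190.0, 220.0]
print(unit_needs_correction(target, subject, "middle", threshold=10.0))
```

Note that criterion (e) has units of Hz times seconds, so its threshold is not directly comparable with the thresholds used for (a) to (d).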

More specifically, the correction unit 21 determines whether condition (2) above is satisfied as follows. First, the correction unit 21 uses a cost function for waveform-concatenation speech synthesis that includes a term evaluating error minimization with respect to one or more preset indices chosen from among the average value of the fundamental frequency, its start value, its end value, and the phoneme time information of the corresponding part of the target prosody, and searches the speech database 11 for the speech unit of the same type that minimizes the error. The correction unit 21 then determines that the speech signal data of the speech unit thus obtained is speech signal data satisfying condition (2).
The correction unit 21 then corrects the prosody of the correction target speech by rewriting the speech signal data of the speech unit to be corrected with the selected speech signal data.
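The search for the speech unit that minimizes the error can be illustrated with a simplified target-cost sketch. The form of the cost used here, a weighted sum of absolute errors over the selected indices, is an assumption; the text only says that the cost function contains a term evaluating minimum error, and a full waveform-concatenation cost function would normally contain other terms as well. The dictionary keys, the `id` field, and the function name `select_unit` are hypothetical.

```python
def select_unit(target, candidates, weights):
    """Pick the candidate with the same label and the lowest weighted error.

    target: dict with "label" and numeric prosodic indices.
    candidates: list of dicts with the same keys (extra keys are ignored).
    weights: dict mapping index name to weight; only listed indices count.
    Returns the best candidate, or None if no label matches.
    """
    # Condition (1): the phoneme label must match.
    same_label = [c for c in candidates if c["label"] == target["label"]]
    if not same_label:
        return None

    def cost(candidate):
        # Assumed error term: weighted sum of absolute differences.
        return sum(w * abs(candidate[k] - target[k])
                   for k, w in weights.items())

    # Condition (2): closest to the target prosody under the chosen indices.
    return min(same_label, key=cost)


target = {"label": "i", "mean_f0": 220.0, "start_f0": 210.0,
          "end_f0": 230.0, "duration": 0.10}
database = [
    {"id": 1, "label": "i", "mean_f0": 180.0, "start_f0": 175.0,
     "end_f0": 185.0, "duration": 0.10},
    {"id": 2, "label": "i", "mean_f0": 218.0, "start_f0": 212.0,
     "end_f0": 226.0, "duration": 0.12},
    {"id": 3, "label": "a", "mean_f0": 220.0, "start_f0": 210.0,
     "end_f0": 230.0, "duration": 0.10},
]
weights = {"mean_f0": 1.0, "start_f0": 0.5, "end_f0": 0.5}
print(select_unit(target, database, weights)["id"])  # prints 2
```

Passing a different `weights` dictionary changes which indices are evaluated and how strongly each one counts.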

FIG. 16 is an explanatory diagram showing how the time variation information of the fundamental frequency of the correction target speech changes before and after correction. The upper part of FIG. 16(a) shows the time variation information of the fundamental frequency, and the lower part of FIG. 16(a) shows the phoneme time information. The time variation information in the upper part and the phoneme time information in the lower part of FIG. 16(a) are information on the same speech. Graph A (labeled ア) in FIG. 16(a) shows the time variation of the fundamental frequency of the correction target speech, and graph B (labeled イ) in FIG. 16(a) shows the time variation of the fundamental frequency of the target parameter.
In FIG. 16(a), the correction unit 21 detects a speech unit for which the difference between the time variation information of the fundamental frequency of the correction target speech and that of the target parameter is equal to or greater than a predetermined value, and determines that the detected speech unit (the fourth speech unit, "i" (い), counting from the first speech unit, "a" (あ)) is the speech unit to be corrected. Next, the correction unit 21 reads from the speech database 11 the speech signal data and the time variation information of the fundamental frequency of a speech unit satisfying the conditions described above. The correction unit 21 then rewrites the speech signal data of the speech unit to be corrected with that of the speech unit satisfying the conditions, and regenerates the prosody information by performing speech analysis on the rewritten speech signal data.
Graph A (labeled ア) in FIG. 16(b) shows the time variation of the fundamental frequency of the synthesized speech after correction, and graph B (labeled イ) in FIG. 16(b) shows the time variation of the fundamental frequency of the target parameter. Through this correction processing, the time variation information of the fundamental frequency of the correction target speech is corrected so as to approach that of the target parameter.
When the phoneme time information of the synthesized speech is to be corrected, the correction unit 21 does so by rewriting the time information of the phonemes to be corrected in the correction target speech with phoneme time information obtained from the speech database 11.

In the synthesized speech correction device 2 configured as described above, the target parameter is determined when the user designates speech having the prosody that is the correction target, and the synthesized speech is corrected based on the determined target parameter. The user can therefore correct the synthesized speech easily without having to work out the target parameter in detail.

<Modification>
In the second embodiment described above, the correction unit 21 performs correction by updating the speech signal data for sections in which the difference between the fundamental frequency of the correction target speech and that of the target parameter exceeds a predetermined threshold. However, correction may be applied to all sections rather than only to sections exceeding the threshold, or the user may designate some sections (for example, the speech unit corresponding to the section labeled ウ in FIG. 16(a)) for correction. The weights of the cost function described above may also be made arbitrarily changeable in accordance with instructions input by the user.
When a speech unit satisfying conditions (1) and (2) above is selected, speech data of the same speaker may be selected, or speech data of another speaker may be selected.
Note that other devices, such as a device that performs speech synthesis without using the target prosody created by the target parameter determination unit 20, may perform speech unit selection or the like based on accent information obtained from linguistic analysis of the utterance content.

[Third Embodiment]
Next, a third embodiment of the present invention will be described.
FIG. 17 is a block diagram showing the functional configuration of the synthesized speech correction device 3 according to this embodiment. As shown in the figure, the synthesized speech correction device 3 includes the functional units of the target parameter determination device 1 of the first embodiment shown in FIG. 1 and a correction unit 31. Parts corresponding to the functional units of the target parameter determination device 1 in FIG. 1 are given the same reference numerals, and their description is omitted.

The correction unit 31 corrects the synthesized speech to be corrected using the target parameter determined by the target parameter determination unit 20. Two methods are available: (1) applying prosody conversion to, and replacing, only the speech unit to be corrected, and (2) applying prosody conversion to, and replacing, the entire synthesized speech to be corrected.

In method (1), in which only the speech unit to be corrected is prosody-converted and replaced, the correction unit 31 uses, based on the time variation information of the fundamental frequency given as the target parameter, one of the following for the section corresponding to the speech unit to be corrected: (a) the time variation information of the fundamental frequency, (b) the average value (time average) of the fundamental frequency, or (c) the maximum value of the fundamental frequency. As the power of the speech signal, the correction unit 31 uses one of the following: (d) the power of the speech signal of the speech unit to be corrected, (e) the average value of (d), or (f) the maximum value of (d). In addition, based on the phoneme time information given as the target parameter, the correction unit 31 extracts from that time information the data for the speech unit to be corrected and uses it. The correction unit 31 then converts the speech signal using these values. Existing technology is used for the speech signal conversion itself. Which of (a), (b), and (c) is used is determined by a setting value stored in advance, as is which of (d), (e), and (f) is used.

In method (2), in which the entire synthesized speech to be corrected is prosody-converted and replaced, the correction unit 31 uses, based on the time variation information of the fundamental frequency given as the target parameter, either (a) that time variation information itself or (b) the average value (time average) of the fundamental frequency calculated from it. As the power of the corrected speech signal, the correction unit 31 uses the power of the correction target speech. The correction unit 31 uses the phoneme time information given as the target parameter as is. The correction unit 31 then converts the speech signal using these values. As in method (1), existing technology is used for the speech signal conversion itself. Which of (a) and (b) is used is determined by a setting value stored in advance.

The correction unit 31 newly registers, in the synthesized speech storage unit 13 as synthesized speech, the speech signal data obtained by prosody conversion using either method (1) or (2) above, together with the corresponding prosody information (time variation information of the fundamental frequency and phoneme time information).

FIG. 18 is a conceptual diagram illustrating the case where a speech unit to be corrected is prosody-converted and replaced. For example, when the speech unit "i" (い) in "aoiie" (あおいいえ) is the correction target and prosody conversion is performed by method (1) above, the time variation information of the fundamental frequency corresponding to the speech unit "i" changes, for example, from the graph shown in FIG. 18(a) to the graph shown in FIG. 18(b).

In the synthesized speech correction device 3 configured as described above, as in the synthesized speech correction device 2 of the second embodiment, the target parameter is determined when the user designates speech having the prosody that is the correction target, and the synthesized speech is corrected based on the determined target parameter. The user can therefore correct the synthesized speech easily without having to work out the target parameter in detail.

When some or all of the functions of the target parameter determination device 1, the synthesized speech correction device 2, and the synthesized speech correction device 3 of the embodiments described above are implemented by a computer, a program for implementing those functions may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into and executed by a computer system. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into the computer system. The "computer-readable recording medium" may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as volatile memory inside a computer system serving as the server or client in that case. The program may implement only part of the functions described above, or may implement them in combination with a program already recorded in the computer system.
Embodiments of the present invention have been described in detail above with reference to the drawings, but the specific configuration is not limited to these embodiments and includes designs and the like within a scope that does not depart from the gist of the invention.

A block diagram showing the functional configuration of the target parameter determination device according to the first embodiment.
A schematic diagram showing the structure of the voice data stored in the voice database.
A waveform diagram showing changes in the amplitude of a voice signal.
A graph showing how the fundamental frequency changes over time.
A schematic diagram outlining the phoneme time information.
A conceptual diagram explaining the conditions under which a user selects the different-word voice whose input the sample voice designation unit 17 accepts.
A schematic diagram showing an example of the priority order in which voice data serving as a sample is selected when the prosody selection unit 19 performs the prosody selection process.
An explanatory diagram for the case where the time change information of the fundamental frequency of the tone sample is calculated using the time-average value of the fundamental frequency.
A conceptual diagram for the case where the overall length of the tone sample is changed without changing the ratio of each speech segment to the overall length.
An explanatory diagram for the case where the time change information of the fundamental frequency of the tone sample, whose pitch has already been changed to match the pitch of the pitch sample, is changed according to the time information sample.
An explanatory diagram for the case where the target parameter determination unit 20 calculates the time change information of the fundamental frequency of the tone sample based on the maximum value of the fundamental frequency of the pitch sample.
An explanatory diagram for the case where the target parameter determination unit 20 changes the fundamental frequency of the tone sample based on the range and the average value of the fundamental frequency of the pitch sample.
A conceptual diagram of the processing in which the target parameter determination unit 20 changes the phoneme time information of the tone sample so that the length of each speech segment of the tone sample matches the length of the corresponding speech segment of the time information sample.
A flowchart showing the processing procedure of the target parameter determination device as a whole.
A block diagram showing the functional configuration of the synthesized speech correction device 2 according to the second embodiment.
An explanatory diagram showing how the time change information of the fundamental frequency of the correction target voice changes before and after correction.
A block diagram showing the functional configuration of the synthesized speech correction device 3 according to the third embodiment.
A conceptual diagram explaining the case where a speech segment to be corrected is converted and replaced.
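
The three ways of fitting the pitch of the tone sample to the pitch sample that the drawings above mention (time-average value, maximum value, and range plus average value) might be sketched as follows. This is an illustrative sketch only, not the method of the embodiments: the function names, the use of additive offsets in Hz, the linear scaling of the range, and the representation of a contour as a list of voiced-frame fundamental-frequency values are all assumptions introduced here.

```python
from statistics import mean


def shift_to_mean(tone_f0, pitch_f0):
    """Shift the tone contour so its time average equals that of the pitch sample.

    Assumption: an additive offset in Hz; the text does not say whether the
    change is additive or multiplicative.
    """
    offset = mean(pitch_f0) - mean(tone_f0)
    return [f + offset for f in tone_f0]


def shift_to_max(tone_f0, pitch_f0):
    """Shift the tone contour so its maximum equals that of the pitch sample."""
    offset = max(pitch_f0) - max(tone_f0)
    return [f + offset for f in tone_f0]


def fit_range_and_mean(tone_f0, pitch_f0):
    """Rescale the tone contour so both its range and its average match the pitch sample.

    Assumption: the contour is scaled linearly around its own average.
    """
    tone_range = max(tone_f0) - min(tone_f0)
    pitch_range = max(pitch_f0) - min(pitch_f0)
    # A flat contour has no range to scale, so leave its shape unchanged.
    scale = pitch_range / tone_range if tone_range else 1.0
    tone_mean = mean(tone_f0)
    pitch_mean = mean(pitch_f0)
    return [pitch_mean + (f - tone_mean) * scale for f in tone_f0]
```

In every variant the shape of the tone sample's contour is preserved and only its level (and, in the last variant, its width) is taken from the pitch sample.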

Explanation of symbols

1 Target parameter determination device
11 Voice database
12 Correction target voice designation unit
13 Synthesized speech storage unit (correction target voice storage unit)
14 Voice input unit
15 Voice text input unit (utterance content acquisition unit)
16 Voice analysis unit
17 Sample voice designation unit
18 Sample voice search unit
19 Prosody selection unit
20 Target parameter determination unit
2, 3 Synthesized speech correction device
21, 31 Correction unit

Claims

A target parameter determination device comprising:
a voice data storage unit that stores a voice, the utterance content of the voice, time change information of the fundamental frequency of the voice, and phoneme time information representing the timing of the phonemes included in the voice;
a prosody selection unit that acquires sample voice data having time change information of the fundamental frequency of a voice and phoneme time information representing the timing of the phonemes included in the voice, selects, according to the type of the sample voice data, tone sample data serving as a sample of tone, pitch sample data serving as a sample of pitch, and time sample data serving as a sample of phoneme timing, and acquires the time change information of the fundamental frequency of each of the selected pitch sample data and tone sample data and the phoneme time information of each of the selected tone sample data and time sample data; and
a target parameter determination unit that determines the time change information of the fundamental frequency serving as a target parameter by changing the time change information of the fundamental frequency included in the tone sample data according to the time change information of the fundamental frequency included in the pitch sample data and further matching it to the phoneme time information included in the time sample data, and that determines the phoneme time information serving as a target parameter by matching the phoneme time information included in the tone sample data to the phoneme time information included in the time sample data.

The target parameter determination device according to claim 1, further comprising:
a voice input unit that accepts input of a voice;
an utterance content acquisition unit that acquires the utterance content corresponding to the voice; and
a voice analysis unit that calculates the time change information of the fundamental frequency of the voice and the phoneme time information of the voice based on the voice accepted by the voice input unit and the utterance content,
wherein the prosody selection unit acquires, as the sample voice data, the voice having the time change information of the fundamental frequency and the phoneme time information calculated by the voice analysis unit.

The target parameter determination device according to claim 1, further comprising:
a sample voice designation unit that accepts input of an instruction to select sample voice data; and
a sample voice search unit that obtains the sample voice data by searching the voice data storage unit based on the instruction accepted by the sample voice designation unit,
wherein the prosody selection unit acquires the sample voice data obtained by the sample voice search unit.

The target parameter determination device according to claim 1, further comprising:
a correction target voice designation unit that accepts input of an instruction to select a correction target voice to be corrected; and
a sample voice search unit that obtains sample voice data having the same notation as the correction target voice by searching the voice data storage unit,
wherein the prosody selection unit acquires the sample voice data obtained by the sample voice search unit.

The target parameter determination device according to claim 1, further comprising:
a correction target voice storage unit that stores a correction target voice; and
a sample voice search unit that obtains, from the voice data storage unit, sample voice data whose notation differs from that of the correction target voice and which has the same number of phonemes or morae as the correction target voice,
wherein the prosody selection unit acquires the sample voice data obtained by the sample voice search unit.

A synthesized speech correction device comprising:
the target parameter determination device according to claim 1;
a correction target voice storage unit that stores a correction target voice; and
a correction unit that reads out the correction target voice and corrects it based on the time change information of the fundamental frequency and the phoneme time information determined by the target parameter determination device.

A computer program for causing a computer, which has a voice data storage unit that stores a voice, the utterance content of the voice, time change information of the fundamental frequency of the voice, and phoneme time information representing the timing of the phonemes included in the voice in association with one another, to function as:
prosody selection means for acquiring sample voice data having time change information of the fundamental frequency of a voice and phoneme time information representing the timing of the phonemes included in the voice, selecting, according to the type of the sample voice data, tone sample data serving as a sample of tone, pitch sample data serving as a sample of pitch, and time sample data serving as a sample of phoneme timing, and acquiring the time change information of the fundamental frequency of each of the selected pitch sample data and tone sample data and the phoneme time information of each of the selected tone sample data and time sample data; and
target parameter determination means for determining the time change information of the fundamental frequency serving as a target parameter by changing the time change information of the fundamental frequency included in the tone sample data according to the time change information of the fundamental frequency included in the pitch sample data and further matching it to the phoneme time information included in the time sample data, and for determining the phoneme time information serving as a target parameter by matching the phoneme time information included in the tone sample data to the phoneme time information included in the time sample data.
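
As an illustration of the timing step recited above, in which the fundamental-frequency contour of the tone sample is matched to the phoneme time information of the time sample, the following Python sketch warps a contour segment by segment. It is a sketch under stated assumptions rather than the claimed implementation: phoneme time information is represented as a list of boundary times in milliseconds, the contour as frames at a fixed period, the mapping inside each segment is linear, and all function and parameter names are hypothetical.

```python
from bisect import bisect_right


def warp_time(t, src_bounds, dst_bounds):
    """Map time t on the time-sample axis (dst) onto the tone-sample axis (src).

    Both lists hold the same number of phoneme boundaries; the mapping is
    linear inside each phoneme segment (an assumption of this sketch).
    """
    i = min(max(bisect_right(dst_bounds, t) - 1, 0), len(dst_bounds) - 2)
    d0, d1 = dst_bounds[i], dst_bounds[i + 1]
    s0, s1 = src_bounds[i], src_bounds[i + 1]
    return s0 + (t - d0) * (s1 - s0) / (d1 - d0)


def sample_contour(f0, frame_period, t):
    """Linearly interpolate the contour at time t (frame k sits at k * frame_period).

    Requires at least two frames.
    """
    pos = min(max(t / frame_period, 0.0), len(f0) - 1)
    k = min(int(pos), len(f0) - 2)
    frac = pos - k
    return f0[k] * (1 - frac) + f0[k + 1] * frac


def match_phoneme_timing(tone_f0, tone_bounds, time_bounds, frame_period):
    """Resample the tone contour so each phoneme lasts as long as in the time sample."""
    n_frames = int(time_bounds[-1] // frame_period) + 1
    return [
        sample_contour(
            tone_f0, frame_period,
            warp_time(k * frame_period, tone_bounds, time_bounds),
        )
        for k in range(n_frames)
    ]


# Two phonemes: the first is stretched from 20 ms to 40 ms, the second keeps 20 ms.
tone_f0 = [100.0, 110.0, 120.0, 130.0, 140.0]  # one value every 10 ms
target_f0 = match_phoneme_timing(tone_f0, [0, 20, 40], [0, 40, 60], 10)
```

Here `tone_f0` stands for a contour whose pitch has already been adjusted to the pitch sample; the returned list would then serve as the fundamental-frequency target parameter, and the boundaries of the time sample as the phoneme-time target parameter.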