JP5075865B2 - Speech processing apparatus, method, and program

Info

Publication number: JP5075865B2
Application number: JP2009074957A
Authority: JP
Inventors: 眞弘森田; 岳彦籠嶋
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2009-03-25
Filing date: 2009-03-25
Publication date: 2012-11-21
Anticipated expiration: 2029-03-25
Also published as: JP2010230704A

Abstract

PROBLEM TO BE SOLVED: To provide a speech processing apparatus, method, and program that fuse speech units without destroying the characteristics of the sound source and the vocal tract filter inherent in a speech waveform. SOLUTION: A phoneme/prosody input receiving unit 41 receives input of a plurality of segments, obtained by dividing a phoneme sequence corresponding to target speech into synthesis units, and of prosodic information corresponding to each segment. An acquisition unit 43 acquires, for each segment, a plurality of speech units associated with the segment and its prosodic information. A vocal tract filter component fusion unit 45 fuses the vocal tract filter components of the acquired speech units for each segment. A sound source component fusion unit 46 expands or contracts the sound source components of the periodic components of the acquired speech units, based on the fundamental frequency or the shape of the sound source component waveform, and fuses them for each segment. A unit fusion unit 44 fuses the acquired speech units for each segment by filtering the fused sound source component with the fused vocal tract filter. COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a speech processing apparatus, method, and program.

In recent years, there has been demand for higher sound quality in speech synthesizers, which artificially generate speech signals from arbitrary text.

For example, Patent Document 1 addresses the problem that the quality of synthesized speech deteriorates locally, for instance when no appropriate speech unit exists. It discloses a method (the multiple unit selection and fusion method) that selects a plurality of speech units per synthesis unit, fuses them for each synthesis unit to generate a new speech unit, and synthesizes speech from the result.

Non-Patent Document 1 discloses that, when a plurality of speech units are fused, they are separated into a periodic component and an aperiodic component, which are fused separately. The aperiodic component is further separated into features relating to the sound source and features relating to the vocal tract filter, and fusion is performed on each of these features.

Patent Document 1: Japanese Patent Laid-Open No. 2005-164749

Non-Patent Document 1: Masahiro Morita and Takehiko Kagoshima, "Speech synthesis by the multiple unit selection and fusion method considering aperiodic components in voiced speech," Proceedings of the Acoustical Society of Japan Spring Meeting, pp. 295-296, 2008

However, the speech unit fusion method disclosed in Patent Document 1 is basically a method of averaging a plurality of speech waveforms, and it fuses waveforms in which the various features involved in speech production, such as those of the sound source and the vocal tract filter, remain mixed together. It is therefore unclear which features the fusion affects and how. As a result, the fusion may destroy the characteristics of the individual components inherent in the speech waveform and may actually degrade sound quality.

Non-Patent Document 1 discloses a method of separating the aperiodic component into sound source features and vocal tract filter features and fusing each separately. For the aperiodic component, however, which is driven by a noise-like source, it suffices to represent the temporal change of the source power and the frequency characteristics appropriately, so the shape of the source waveform itself need not be considered. For the periodic component, by contrast, whose source is the periodic vibration of the vocal cords, the shape and timing of the source waveform are very important and must be represented accurately to achieve good sound quality.

The present invention has been made in view of the above circumstances, and its object is to provide a speech processing apparatus, method, and program capable of fusing speech units without destroying the characteristics of the sound source and the vocal tract filter inherent in the speech waveform.

To solve the problems described above and achieve the object, a speech processing apparatus according to one aspect of the present invention comprises: a phoneme/prosody input receiving unit that receives input of a plurality of segments, obtained by dividing a phoneme sequence corresponding to target speech into synthesis units, and of prosodic information corresponding to each of the segments; an acquisition unit that acquires, for each of the segments, a plurality of speech units associated with the segment and with the prosodic information corresponding to the segment; a vocal tract filter component fusion unit that fuses, for each segment, the vocal tract filter components of the acquired speech units; a sound source component fusion unit that expands or contracts the sound source components of the periodic components of the acquired speech units, based on the fundamental frequency or on the shape of the sound source component waveform, and fuses them for each segment; and a unit fusion unit that fuses the speech units acquired by the acquisition unit for each segment by filtering the fused sound source component produced by the sound source component fusion unit with a vocal tract filter whose characteristic is the fused vocal tract filter component produced by the vocal tract filter component fusion unit.

A speech processing method according to another aspect of the present invention includes: an input receiving step in which a phoneme/prosody input receiving unit receives input of a plurality of segments, obtained by dividing a phoneme sequence corresponding to target speech into synthesis units, and of prosodic information corresponding to each of the segments; an acquisition step in which an acquisition unit acquires, for each of the segments, a plurality of speech units associated with the segment and with the prosodic information corresponding to the segment; a vocal tract filter component fusion step in which a vocal tract filter component fusion unit fuses, for each segment, the vocal tract filter components of the acquired speech units; a sound source component fusion step in which a sound source component fusion unit expands or contracts the sound source components of the periodic components of the acquired speech units, based on the fundamental frequency or on the shape of the sound source component waveform, and fuses them for each segment; and a unit fusion step in which a unit fusion unit fuses the speech units acquired in the acquisition step for each segment by filtering the fused sound source component produced in the sound source component fusion step with a vocal tract filter whose characteristic is the fused vocal tract filter component produced in the vocal tract filter component fusion step.

A speech processing program according to another aspect of the present invention causes a computer to execute: an input receiving step in which a phoneme/prosody input receiving unit receives input of a plurality of segments, obtained by dividing a phoneme sequence corresponding to target speech into synthesis units, and of prosodic information corresponding to each of the segments; an acquisition step in which an acquisition unit acquires, for each of the segments, a plurality of speech units associated with the segment and with the prosodic information corresponding to the segment; a vocal tract filter component fusion step in which a vocal tract filter component fusion unit fuses, for each segment, the vocal tract filter components of the acquired speech units; a sound source component fusion step in which a sound source component fusion unit expands or contracts the sound source components of the periodic components of the acquired speech units, based on the fundamental frequency or on the shape of the sound source component waveform, and fuses them for each segment; and a unit fusion step in which a unit fusion unit fuses the speech units acquired in the acquisition step for each segment by filtering the fused sound source component produced in the sound source component fusion step with a vocal tract filter whose characteristic is the fused vocal tract filter component produced in the vocal tract filter component fusion step.

According to the present invention, speech units can be fused without destroying the characteristics of the sound source and the vocal tract filter inherent in the speech waveform.

Block diagram showing an example of the configuration of the speech processing apparatus of the present embodiment.
Diagram showing an example of the information stored in the speech unit storage unit of the present embodiment.
Block diagram showing an example of the detailed configuration of the unit fusion unit of the present embodiment.
Diagram showing an example of the processing of the fusion unit extraction unit of the present embodiment.
Block diagram showing an example of the detailed configuration of the vocal tract filter component fusion unit of the present embodiment.
Block diagram showing an example of the detailed configuration of the sound source component fusion unit of the present embodiment.
Diagram illustrating an example of how the sound source waveform deformation unit of the present embodiment deforms a sound source waveform.
Diagram illustrating an example of the speech waveform generation processing performed by the generation unit of the present embodiment.
Flowchart showing an example of the speech synthesis procedure performed by the speech processing apparatus of the present embodiment.
Flowchart showing an example of the speech unit acquisition procedure of the present embodiment.
Flowchart showing an example of the procedure for creating fused speech units performed by the speech processing apparatus of Modification 2.
Block diagram showing an example of the configuration of the speech processing apparatus of Modification 3.

Hereinafter, preferred embodiments of the speech processing apparatus, method, and program according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing an example of the configuration of the speech processing apparatus 1 of the present embodiment. As shown in FIG. 1, the speech processing apparatus 1 includes a text input unit 10, a language processing unit 20, a prosody control unit 30, and a speech synthesis unit 40.

The text input unit 10 inputs the text to be subjected to speech processing.

The language processing unit 20 performs language analysis, such as morphological analysis and syntactic analysis, on the text input from the text input unit 10.

The prosody control unit 30 processes accent and intonation based on the language analysis result of the language processing unit 20 and generates a phoneme sequence and prosodic information.

The speech synthesis unit 40 generates a speech waveform from the phoneme sequence and prosodic information generated by the prosody control unit 30. The speech synthesis unit 40 includes a phoneme/prosody input receiving unit 41, a speech unit storage unit 42, an acquisition unit 43, a unit fusion unit 44, a vocal tract filter component fusion unit 45, a sound source component fusion unit 46, a generation unit 47, and an output unit 48.

The phoneme/prosody input receiving unit 41 receives, from the prosody control unit 30, input of a plurality of segments obtained by dividing the phoneme sequence corresponding to the target speech into synthesis units, and of the prosodic information corresponding to each segment. Specifically, it divides the phoneme sequence input from the prosody control unit 30 into segments, each being a synthesis unit, and receives them together with the prosodic information corresponding to each segment.

The "target speech" is the (virtual) speech targeted in synthesis, that is, ideally natural speech that realizes the input phoneme sequence and prosody. The "phoneme sequence" is, for example, a sequence of phoneme symbols, and the "prosodic information" is, for example, the fundamental frequency, phoneme duration, and power. A "synthesis unit" is the unit of speech used to generate synthesized speech, and is a combination of phonemes or of subdivisions of phonemes (for example, half-phonemes). Examples include half-phonemes, phonemes (C, V), diphones (CV, VC, VV), triphones (CVC, VCV), and syllables (CV, V), where V denotes a vowel and C a consonant. The units may also be of variable length, for example a mixture of these.

The speech unit storage unit 42 stores a plurality of speech units in association with the environment information of each, and can be realized by an existing storage medium such as an HDD (Hard Disk Drive), an optical disc, a memory card, or RAM (Random Access Memory). In the present embodiment the speech processing apparatus 1 includes the speech unit storage unit 42. However, when the speech unit storage unit 42 is realized by an external storage medium or the like (for example, a storage medium detachable from the speech processing apparatus 1), it may be omitted from the apparatus.

A "speech unit" is the waveform of the speech signal corresponding to a synthesis unit, or a parameter sequence representing its characteristics. "Environment information" is information indicating the phonemic and prosodic environment of the associated speech unit, for example the phoneme name of the unit, the preceding phoneme, the following phoneme, the second following phoneme, the fundamental frequency, the phoneme duration, the power, the presence or absence of stress, the position relative to the accent nucleus, the time since the last breath pause, the speaking rate, and the emotion. In addition, acoustic features useful for selecting speech units, such as cepstrum coefficients at the start and end of the unit, may be included in the environment information.

FIG. 2 shows an example of the information stored in the speech unit storage unit 42. In the example of FIG. 2, each speech unit is a speech waveform for the case where the synthesis unit is the phoneme. These speech units were obtained from a large amount of speech data labeled phoneme by phoneme, by cutting out the waveform of each phoneme according to the labels.

In the example of FIG. 2, the environment information comprises the phoneme corresponding to the speech unit (phoneme name), the adjacent phonemes (here, two phonemes on each side), the fundamental frequency, the phoneme duration, and, as acoustic features, the cepstrum coefficients at the start and end of the speech unit. This environment information is obtained by analyzing the speech data from which the speech units were cut out.
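The record layout described for FIG. 2 can be pictured as a small data structure. The following Python sketch is illustrative only: the class name, the field names, the units (Hz, ms), and the lookup helper are assumptions introduced here, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpeechUnit:
    """One stored entry: a waveform plus its environment information (hypothetical layout)."""
    waveform: List[float]            # speech waveform samples cut out for one phoneme
    phoneme: str                     # phoneme name of the unit
    left_context: Tuple[str, str]    # the two preceding phonemes
    right_context: Tuple[str, str]   # the two following phonemes
    f0_hz: float                     # fundamental frequency (unit assumed)
    duration_ms: float               # phoneme duration (unit assumed)
    cepstrum_start: List[float]      # cepstrum coefficients at the start of the unit
    cepstrum_end: List[float]        # cepstrum coefficients at the end of the unit


def candidates_for(units: List[SpeechUnit], phoneme: str) -> List[SpeechUnit]:
    """Return every stored unit whose phoneme name matches the requested segment."""
    return [u for u in units if u.phoneme == phoneme]
```

A store of this shape would let an acquisition step first narrow the candidates by phoneme name and then rank them using the remaining fields.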

Although FIG. 2 shows an example in which the synthesis unit is the phoneme, the synthesis unit may be a half-phoneme, diphone, triphone, syllable, or a combination of these, and may be of variable length.

Returning to FIG. 1, the acquisition unit 43 acquires, for each of the segments produced by the phoneme/prosody input receiving unit 41, a plurality of speech units associated with the segment and with the prosodic information corresponding to the segment. Specifically, the acquisition unit 43 acquires from the speech unit storage unit 42 a plurality of speech units associated with the segment and with the prosodic information (environment information) corresponding to the segment.

In doing so, as in existing unit-selection speech synthesis methods and multiple unit selection and fusion speech synthesis methods, the acquisition unit 43 uses as its selection criterion a cost that indirectly represents the magnitude of the distortion between the target speech and the speech that would be synthesized from each candidate speech unit, and acquires the combination of speech units to be fused so that this cost is as small as possible.

The cost used as the selection criterion consists of a target cost and a connection cost. The target cost represents the degree of distortion of the synthesized speech relative to the target speech caused by using a speech unit in the target phonemic and prosodic environment. The connection cost represents the degree of distortion of the synthesized speech relative to the target speech caused by connecting a speech unit to its adjacent speech units.

Target costs include: the distortion caused by the difference between the fundamental frequency of the speech unit and the target fundamental frequency (fundamental frequency cost); the distortion caused by the difference between the phoneme duration of the speech unit and the target phoneme duration (duration cost); the distortion caused by the difference between the phonemic environment to which the speech unit belonged and the target phonemic environment (phonemic environment cost); and the distortion caused by the difference between the unit's original position within the word, breath group, or sentence and its position at synthesis time (positional environment cost).

Connection costs include the distortion caused by spectral differences at speech unit boundaries (spectral connection cost) and the distortion caused by fundamental frequency differences at speech unit boundaries (fundamental frequency connection cost).
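The text names the cost terms but does not give formulas for them. The following Python sketch shows one way a subset of these terms could be combined for a sequence of candidate units. The absolute differences, the Euclidean cepstral distance, the unit weights, and the dictionary keys are all assumptions made for illustration.

```python
import math


def target_cost(unit, target, w_f0=1.0, w_dur=1.0):
    """Weighted sum of fundamental frequency cost and duration cost (assumed form)."""
    f0_cost = abs(unit["f0"] - target["f0"])
    dur_cost = abs(unit["dur"] - target["dur"])
    return w_f0 * f0_cost + w_dur * dur_cost


def connection_cost(left, right, w_spec=1.0, w_f0=1.0):
    """Spectral and fundamental frequency mismatch at the boundary (assumed form)."""
    spec_cost = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(left["cep_end"], right["cep_start"])))
    f0_cost = abs(left["f0"] - right["f0"])
    return w_spec * spec_cost + w_f0 * f0_cost


def total_cost(units, targets):
    """Total cost of one candidate sequence: target costs plus connection costs."""
    cost = sum(target_cost(u, t) for u, t in zip(units, targets))
    cost += sum(connection_cost(a, b) for a, b in zip(units, units[1:]))
    return cost
```

A selection procedure would evaluate such a total for competing unit sequences and prefer the smaller value. The phonemic environment and positional environment costs are omitted from this sketch.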

The unit fusion unit 44 fuses the plurality of speech units acquired by the acquisition unit 43 to generate a new speech unit. Specifically, the unit fusion unit 44 fuses the speech units acquired by the acquisition unit 43 for each segment by filtering the fused sound source component, produced by the sound source component fusion unit 46 described later, with a vocal tract filter whose characteristic is the fused vocal tract filter component produced by the vocal tract filter component fusion unit 45 described later.

FIG. 3 is a block diagram showing an example of the detailed configuration of the unit fusion unit 44 of the present embodiment. As shown in FIG. 3, the unit fusion unit 44 includes a multiple unit input receiving unit 441, a fusion unit extraction unit 442, a pre-emphasis unit 443, a linear prediction analysis unit 444, a de-emphasis unit 445, a target power calculation unit 446, a linear prediction filter unit 447, a power correction unit 448, and a fused speech unit output unit 449.

The multiple unit input receiving unit 441 receives input of the speech units acquired by the acquisition unit 43, a plurality for each segment.

The fusion unit extraction unit 442 extracts, from each of the speech units input for a segment, waveforms in a fusion unit suitable for fusing, and equalizes the number of waveforms across the speech units for each segment.

In the present embodiment, the fusion unit is a pitch waveform. A "pitch waveform" is a relatively short waveform whose length is up to about several times the fundamental period of the speech and which itself has no fundamental period.

One method of extracting such pitch waveforms uses a pitch-synchronous window. In this method, marks (pitch marks) are placed in advance on the speech waveform of each speech unit at intervals of the fundamental period, and a pitch waveform is cut out by applying a Hanning window whose length is twice the fundamental period, centered on each pitch mark.
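The windowing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the fundamental period is taken as a constant number of samples (in practice it varies from mark to mark), samples outside the signal are treated as zero, and the function names are hypothetical.

```python
import math


def hanning(length):
    """Symmetric Hanning window with the given number of samples."""
    if length == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def extract_pitch_waveforms(signal, pitch_marks, period):
    """Cut one Hanning-windowed waveform around each pitch mark.

    The window spans one period on each side of the mark (2 * period + 1
    samples, so that it is symmetric about the mark).
    """
    window = hanning(2 * period + 1)
    waveforms = []
    for mark in pitch_marks:
        frame = []
        for i, w in enumerate(window):
            idx = mark - period + i
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0  # zero padding (assumed)
            frame.append(sample * w)
        waveforms.append(frame)
    return waveforms
```

Because adjacent Hanning windows spaced one period apart sum to one, overlap-adding the extracted waveforms at their original marks approximately reconstructs the interior of the signal, which is one reason this window length is a common choice.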

FIG. 4 shows an example of the processing of the fusion unit extraction unit 442 when the fusion unit is a pitch waveform.

In the example of FIG. 4, the three speech waveforms enclosed by dotted line 60 are the waveforms of the speech units acquired for a certain segment. The three waveform sequences enclosed by dotted line 61 are the pitch waveform sequences extracted from those three speech waveforms with the pitch-synchronous window. As FIG. 4 shows, the number of pitch waveforms extracted from a speech waveform usually differs from one speech unit to another.

The fusion unit extraction unit 442 therefore equalizes, for each segment, the number of pitch waveforms across the speech units. Specifically, for a sequence with few pitch waveforms it increases the count by duplicating some of the pitch waveforms in the sequence, and for a sequence with many pitch waveforms it reduces the count by thinning out some of them.

In the present embodiment, the common count is the number of pitch waveforms needed to generate synthesized speech of the target phoneme duration, but the counts may instead be matched, for example, to that of the sequence with the most pitch waveforms.

As a result, the number of pitch waveforms of each speech unit is made the same for each segment, as in the three sequences enclosed by dotted line 62 in FIG. 4, which show an example in which the number of pitch waveforms is equalized to seven.
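The text does not say which pitch waveforms are duplicated or thinned out. One simple possibility, assumed here purely for illustration, is to map the target positions uniformly onto the original sequence, which duplicates entries when the sequence is too short and skips entries when it is too long.

```python
def equalize_count(waveforms, target_count):
    """Resample a non-empty sequence of pitch waveforms to exactly target_count entries.

    Uniform index mapping (an assumption): output position i takes the
    waveform at index floor(i * n / target_count) of the original sequence.
    """
    n = len(waveforms)
    return [waveforms[(i * n) // target_count] for i in range(target_count)]


# A short sequence is padded by duplication, a long one is thinned out.
short = equalize_count(["a", "b", "c", "d"], 7)
long_ = equalize_count(list(range(9)), 7)
```

Applying this function with the same target count to every speech unit of a segment yields sequences of equal length, so that waveforms at the same position can be fused with one another.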

Returning to FIG. 3, the pre-emphasis unit 443 filters each speech waveform to be fused so as to remove the negative slope generally seen in the spectral envelope of speech (called tilt, in which power falls from low to high frequencies). Specifically, the pre-emphasis unit 443 applies to each pitch waveform extracted by the fusion unit extraction unit 442 a filter that emphasizes high-frequency components, so as to remove the negative slope (tilt) of the overall spectral envelope.

The sound source of voiced speech is the oscillation of the expiratory airflow (glottal volume flow) produced by the periodic opening and closing of the vocal cords. Because the frequency characteristic of this source waveform is strongly low-pass, the spectrum of a voiced speech waveform generally shows the tilt described above. To analyze the spectral characteristics of the vocal tract filter accurately, it is therefore preferable to remove this tilt beforehand.

In the present embodiment, the pre-emphasis unit 443 therefore filters with a pre-emphasis filter of the kind generally used in speech analysis, that is, a filter whose transfer function is expressed, for example, by Equation (1). The coefficient a in Equation (1) is usually set to a value from 0.98 to 1.0. The value of a may be kept constant for all pitch waveforms, or may be changed for each pitch waveform based on, for example, the position in the sentence where the speech unit originally occurred. For example, at the end of a sentence the vocal cords tend to relax and the tilt tends to strengthen, so a may be set larger there than elsewhere.
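Equation (1) itself is not reproduced in this text. As a point of reference only, the first-order pre-emphasis filter commonly used in speech analysis has the following form, which is consistent with the coefficient range stated above. Treating it as the form of Equation (1) is an assumption.

```latex
H(z) = 1 - a\,z^{-1}, \qquad y(n) = x(n) - a\,x(n-1)
```

With a close to 1, the gain at zero frequency is |1 - a|, which is near 0, while the gain at the Nyquist frequency is 1 + a, which is near 2, so a filter of this form raises high frequencies relative to low ones and thereby counteracts the tilt.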

The linear prediction analysis unit 444 performs linear prediction analysis on each speech waveform filtered by the pre-emphasis unit 443 and calculates linear prediction coefficients and a linear prediction residual. Letting s(n) be the speech waveform under analysis, α_k (k = 1, ..., p, where p is the analysis order) the linear prediction coefficients, and e(n) the linear prediction residual, these are related as in Equation (2).

そして、線形予測分析では、数式（２）において、線形予測残差ｅ（ｎ）の二乗平均を最小にするように線形予測係数を求める。 In the linear prediction analysis, a linear prediction coefficient is obtained so as to minimize the mean square of the linear prediction residual e (n) in Equation (2).

なお、数式（２）は全極型のフィルタであるが、音声生成モデルにおいて声道のシステム関数が全極型フィルタでうまく近似できるとされているため、本実施の形態においては、この線形予測フィルタを声道フィルタとみなす。即ち、線形予測分析によって得られる線形予測係数は声道フィルタのスペクトル特性を表し、線形予測残差は音源波形の近似であるとみなす。 Equation (2) describes an all-pole filter. Because the system function of the vocal tract is considered to be well approximated by an all-pole filter in speech production models, this linear prediction filter is regarded as the vocal tract filter in the present embodiment. That is, the linear prediction coefficients obtained by linear prediction analysis are taken to represent the spectral characteristics of the vocal tract filter, and the linear prediction residual is regarded as an approximation of the sound source waveform.

また、線形予測分析の方法としては、自己相関法、共分散法などの既存の方法を用いるようにしてもよい。また、本実施形態では、例えば元の音声波形が２２ｋＨｚサンプリングの場合、分析次数ｐを２０程度の値とする。 As a method of linear prediction analysis, an existing method such as an autocorrelation method or a covariance method may be used. In this embodiment, for example, when the original speech waveform is 22 kHz sampling, the analysis order p is set to a value of about 20.
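The autocorrelation method mentioned above can be illustrated with a minimal sketch. Equation (2) is not reproduced in this text, so the sign convention below (s(n) predicted as the sum of a[k]·s(n−k), with the residual e(n) as the prediction error) is an assumption, and the function names `autocorrelation`, `levinson_durbin`, and `residual` are hypothetical. The toy signal is the impulse response of a known second-order all-pole filter, not speech data.

```python
def autocorrelation(s, max_lag):
    """Return r[0..max_lag] for the sequence s (autocorrelation method)."""
    n = len(s)
    return [sum(s[i] * s[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a[1..order].

    Assumed convention: s(n) is predicted as sum_k a[k] * s(n - k).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]               # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err        # reflection coefficient
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err


def residual(s, coeffs):
    """Prediction residual e(n) = s(n) - sum_k a[k] * s(n - k)."""
    out = []
    for n in range(len(s)):
        pred = sum(c * s[n - k] for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(s[n] - pred)
    return out


# Toy signal: impulse response of s(n) = 1.3 s(n-1) - 0.6 s(n-2) + e(n).
s = [1.0, 1.3]
for n in range(2, 200):
    s.append(1.3 * s[n - 1] - 0.6 * s[n - 2])

coeffs, err = levinson_durbin(autocorrelation(s, 2), 2)
e = residual(s, coeffs)  # close to a single impulse at n = 0
```

For real pitch waveforms the order would be about 20, as stated above, and the recovered residual would approximate the sound source waveform rather than a single impulse.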

そして、線形予測分析部４４４は、上記のように、線形予測分析によって、セグメントに対する複数個のピッチ波形の各々に対して、線形予測係数と線形予測残差を算出し、線形予測係数を声道フィルタ成分融合部４５に出力し、線形予測残差をディエンファシス部４４５に出力する。 Then, as described above, the linear prediction analysis unit 444 calculates, by linear prediction analysis, linear prediction coefficients and a linear prediction residual for each of the plurality of pitch waveforms for the segment, outputs the linear prediction coefficients to the vocal tract filter component fusion unit 45, and outputs the linear prediction residual to the de-emphasis unit 445.

ディエンファシス部４４５は、線形予測分析部４４４により算出された線形予測残差波形の各々に対して、プリエンファシス部４４３で適用したフィルタリングの逆フィルタリングを行い、ディエンファシスした線形予測残差を音源成分融合部４６に出力する。 The de-emphasis unit 445 applies, to each of the linear prediction residual waveforms calculated by the linear prediction analysis unit 444, the inverse of the filtering applied by the pre-emphasis unit 443, and outputs the de-emphasized linear prediction residual to the sound source component fusion unit 46.

即ち、ディエンファシス部４４５は、伝達関数が例えば数式（３）のように表されるフィルタを用いて、プリエンファシス部４４３による高域の強調を元に戻すフィルタリングを行なう。なお、ａの値は、プリエンファシス部４４３で用いたのと同じ値を用いる。 That is, the de-emphasis unit 445 performs filtering that restores the high-frequency emphasis by the pre-emphasis unit 443 using a filter whose transfer function is expressed by, for example, Equation (3). Note that the same value as that used in the pre-emphasis unit 443 is used as the value of a.
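Equations (1) and (3) are not reproduced in this text. The sketch below assumes the first-order forms commonly used in speech analysis, H(z) = 1 − a·z⁻¹ for pre-emphasis and 1 / (1 − a·z⁻¹) for de-emphasis; the function names are hypothetical. It shows only that the two filters are inverses of each other when the same value of a is used.

```python
def pre_emphasis(x, a=0.98):
    """y(n) = x(n) - a * x(n - 1): emphasizes high frequencies (assumed form)."""
    return [x[n] - (a * x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def de_emphasis(y, a=0.98):
    """x(n) = y(n) + a * x(n - 1): inverse of pre_emphasis for the same a."""
    out = []
    prev = 0.0
    for v in y:
        prev = v + a * prev
        out.append(prev)
    return out


signal = [0.0, 1.0, 0.5, -0.25, -1.0, 0.3]
restored = de_emphasis(pre_emphasis(signal, a=0.98), a=0.98)
```

In the embodiment the de-emphasis is applied to the linear prediction residual rather than directly to the pre-emphasized waveform; the round trip here is only a check of the inverse relationship.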

目標パワー算出部４４６は、融合単位抽出部４４２から入力された音声波形のパワーを基に、融合によって生成される新たな音声素片の目標となるパワーである目標パワーを算出する。具体的には、目標パワー算出部４４６は、融合単位抽出部４４２により抽出されたセグメントの複数個のピッチ波形から、融合によって生成される新たなピッチ波形の目標パワーを算出する。なお、本実施の形態では、目標パワー算出部４４６は、ピッチ波形の各々に対してパワーを算出し、これらを平均化することによって目標パワーを求める。 The target power calculation unit 446 calculates a target power that is a target power of a new speech unit generated by the fusion based on the power of the speech waveform input from the fusion unit extraction unit 442. Specifically, the target power calculation unit 446 calculates the target power of a new pitch waveform generated by the fusion from the plurality of pitch waveforms of the segments extracted by the fusion unit extraction unit 442. In the present embodiment, the target power calculation unit 446 calculates the power for each of the pitch waveforms and averages them to obtain the target power.

線形予測フィルタ部４４７は、声道フィルタ成分融合部４５で融合された融合声道フィルタ成分を特性とする声道フィルタを用いて、音源成分融合部４６で融合された融合音源成分をフィルタリングすることにより、融合音声素片を生成する。具体的には、線形予測フィルタ部４４７は、セグメント毎に、声道フィルタ成分融合部４５で融合された融合済みの線形予測係数を用いて、音源成分融合部４６で融合された融合音源波形をフィルタリングすることにより、融合音声素片のピッチ波形を生成する。 The linear prediction filter unit 447 generates a fused speech unit by filtering the fused sound source component produced by the sound source component fusion unit 46 with a vocal tract filter whose characteristics are given by the fused vocal tract filter component produced by the vocal tract filter component fusion unit 45. Specifically, for each segment, the linear prediction filter unit 447 generates the pitch waveform of the fused speech unit by filtering the fused sound source waveform from the sound source component fusion unit 46 using the fused linear prediction coefficients from the vocal tract filter component fusion unit 45.

なお、線形予測フィルタは数式（２）で表され、αkには声道フィルタ成分融合部４５で融合された融合済みの線形予測係数を用いる。また、音源成分融合部４６で融合された融合音源波形を、数式（２）のｅ（ｎ）に代入することにより、融合音源波形のピッチ波形がｓ（ｎ）として生成される。 The linear prediction filter is expressed by Equation (2), and a linear prediction coefficient that has been merged by the vocal tract filter component fusion unit 45 is used for αk. Also, by substituting the fused sound source waveform fused by the sound source component merging unit 46 into e (n) of Equation (2), a pitch waveform of the fused sound source waveform is generated as s (n).
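The filtering described above can be sketched as a direct-form all-pole recursion. The sign convention (s(n) = e(n) + sum of a[k]·s(n−k)) is an assumption, since Equation (2) is not reproduced here, and `lpc_synthesis` is a hypothetical name.

```python
def lpc_synthesis(excitation, coeffs):
    """Drive an all-pole filter with the excitation e(n).

    Assumed convention: s(n) = e(n) + sum_k coeffs[k-1] * s(n - k).
    """
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for k, c in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += c * s[n - k]
        s.append(acc)
    return s


# A unit impulse as a stand-in for the fused sound source waveform.
fused_source = [1.0, 0.0, 0.0, 0.0]
fused_coeffs = [1.3, -0.6]  # illustrative values only
pitch_waveform = lpc_synthesis(fused_source, fused_coeffs)
```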

パワー補正部４４８は、線形予測フィルタ部４４７により生成された融合音声素片の音声波形に対し、目標パワー算出部４４６で算出された目標パワーに合うようにパワーを増幅または減幅する。 The power correction unit 448 amplifies or reduces the power of the speech waveform of the fused speech unit generated by the linear prediction filter unit 447 so as to match the target power calculated by the target power calculation unit 446.
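A minimal sketch of the target power calculation and the power correction follows. It assumes that "power" means the mean squared amplitude of a waveform, which the text does not specify; all function names are hypothetical.

```python
def mean_power(w):
    """Mean squared amplitude (assumed definition of power)."""
    return sum(v * v for v in w) / len(w)


def target_power(pitch_waveforms):
    """Average of the per-waveform powers, as in the target power calculation."""
    return sum(mean_power(w) for w in pitch_waveforms) / len(pitch_waveforms)


def correct_power(w, target):
    """Scale w so that its mean power equals the target power."""
    p = mean_power(w)
    if p == 0.0:
        return list(w)  # a silent waveform cannot be rescaled
    gain = (target / p) ** 0.5
    return [gain * v for v in w]


originals = [[1.0, -1.0, 1.0, -1.0], [3.0, -3.0, 3.0, -3.0]]
target = target_power(originals)                      # (1 + 9) / 2
corrected = correct_power([2.0, -2.0, 2.0, -2.0], target)
```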

融合音声素片出力部４４９は、パワー補正部４４８により目標パワーに合うように補正されたセグメント毎の融合音声素片を、生成部４７に出力する。 The fused speech unit output unit 449 outputs the fused speech unit for each segment corrected by the power correcting unit 448 so as to match the target power, to the generating unit 47.

なお、本実施の形態では、素片融合部４４（線形予測分析部４４４）は、音声素片を声道フィルタ成分と音源成分に分離する方法に線形予測分析法を用いているが、例えば、声道フィルタが極零型フィルタで近似されるＡＲＸ(AutoRegressive with eXogenous input)音声生成モデルを用いた方法など既存の分離方法を用いてもよい。 In the present embodiment, the unit fusion unit 44 (linear prediction analysis unit 444) uses linear prediction analysis to separate a speech unit into a vocal tract filter component and a sound source component. However, another existing separation method may be used, such as a method based on the ARX (AutoRegressive with eXogenous input) speech production model, in which the vocal tract filter is approximated by a pole-zero filter.

また、音声の生成過程において有声音の周期成分は、（１）声帯の周期的な開閉によって生じる呼気流（声帯体積流）の振動が、（２）舌や唇、口蓋で形を調整(調音)された声道を通過し、（３）唇または鼻腔で放射されることによって生成される。 In the speech production process, the periodic component of voiced sound is generated when (1) the vibration of the expiratory airflow (glottal volume flow) produced by the periodic opening and closing of the vocal cords (2) passes through the vocal tract, whose shape is adjusted (articulated) by the tongue, lips, and palate, and (3) is radiated from the lips or nasal cavity.

ところで、線形予測分析法やＡＲＸモデルを用いた方法では一般的に、（２）を声道フィルタで近似する一方、（１）に（３）の放射による効果を含めたもの、即ち（１）に（３）を畳み込んだものを音源として扱っている。 In methods using linear prediction analysis or the ARX model, (2) is generally approximated by the vocal tract filter, while (1) together with the radiation effect of (3), that is, (1) convolved with (3), is treated as the sound source.

しかしながら、本実施の形態においては、音源成分には必ずしも放射の効果を含む必要はなく、（１）だけの成分を近似したものであっても良い。なお、放射特性は微分で良く近似できるため、（１）だけの成分を近似した音源成分は、線形予測分析法やＡＲＸモデルを用いた方法で求めた音源成分を積分することにより求めることができる。 In the present embodiment, however, the sound source component need not include the radiation effect and may approximate component (1) alone. Because the radiation characteristic is well approximated by differentiation, a sound source component approximating only (1) can be obtained by integrating the sound source component obtained by linear prediction analysis or by the ARX-model-based method.

図１に戻り、声道フィルタ成分融合部４５は、取得部４３により取得された複数の音声素片の声道フィルタの特徴を表す声道フィルタ成分をセグメント毎に融合する。 Returning to FIG. 1, the vocal tract filter component fusion unit 45 merges the vocal tract filter components representing the characteristics of the vocal tract filter of the plurality of speech segments acquired by the acquisition unit 43 for each segment.

図５は、本実施の形態の声道フィルタ成分融合部４５の詳細な構成の一例を示すブロック図である。図５に示すように、声道フィルタ成分融合部４５は、複数線形予測係数入力受付部４５１と、ＬＳＰ変換部４５２と、ＬＳＰ平均化部４５３と、ＬＰＣ変換部４５４と、融合線形予測係数出力部４５５とを含む。 FIG. 5 is a block diagram showing an example of a detailed configuration of the vocal tract filter component fusion unit 45 of the present embodiment. As shown in FIG. 5, the vocal tract filter component fusion unit 45 includes a multiple linear prediction coefficient input reception unit 451, an LSP conversion unit 452, an LSP averaging unit 453, an LPC conversion unit 454, and a fused linear prediction coefficient output unit 455.

複数線形予測係数入力受付部４５１は、線形予測分析部４４４（図３参照）から、セグメントに対する複数個のピッチ波形の各々に対して算出された線形予測係数の入力を受け付ける。 The multiple linear prediction coefficient input reception unit 451 receives input of linear prediction coefficients calculated for each of the plurality of pitch waveforms for the segment from the linear prediction analysis unit 444 (see FIG. 3).

ＬＳＰ変換部４５２は、複数線形予測係数入力受付部４５１に受け付けられた複数個の線形予測係数（ＬＰＣ：Linear Prediction Coefficient）の各々を、線スペクトル対（ＬＳＰ：Line Spectrum Pair）に変換する。なお、「線スペクトル対」は、線形予測係数と相互に変換が可能な周波数領域のパラメータであり、既存の方法によって、線形予測係数からの変換が可能である。 The LSP conversion unit 452 converts each of a plurality of linear prediction coefficients (LPC: Linear Prediction Coefficient) received by the multiple linear prediction coefficient input reception unit 451 into a line spectrum pair (LSP). The “line spectrum pair” is a frequency domain parameter that can be mutually converted with the linear prediction coefficient, and can be converted from the linear prediction coefficient by an existing method.

ＬＳＰ平均化部４５３は、ＬＳＰ変換部４５２により変換された複数個の線スペクトル対を、ｉ番目の係数毎（例えば、２０次の線形予測係数に対する線スペクトル対は、０より大かつπ未満の２０個の角周波数を表す係数で構成）に平均化する。 The LSP averaging unit 453 averages the plurality of line spectrum pairs converted by the LSP conversion unit 452, separately for each i-th coefficient (for example, the line spectrum pair for 20th-order linear prediction coefficients consists of 20 coefficients representing angular frequencies greater than 0 and less than π).
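The coefficient-wise averaging can be sketched as follows. The conversion between linear prediction coefficients and line spectrum pairs is omitted; the vectors below are illustrative 4th-order values, not data from the embodiment, and `average_lsp` is a hypothetical name.

```python
import math


def average_lsp(lsp_vectors):
    """Average several LSP vectors coefficient by coefficient."""
    order = len(lsp_vectors[0])
    if any(len(v) != order for v in lsp_vectors):
        raise ValueError("all LSP vectors must have the same order")
    count = len(lsp_vectors)
    return [sum(v[i] for v in lsp_vectors) / count for i in range(order)]


# Angular frequencies in (0, pi), each vector in increasing order.
lsp_a = [0.30, 0.55, 1.20, 1.60]
lsp_b = [0.40, 0.65, 1.00, 1.80]
fused_lsp = average_lsp([lsp_a, lsp_b])

# The average of increasing vectors is itself increasing and stays in (0, pi).
is_ordered = all(x < y for x, y in zip(fused_lsp, fused_lsp[1:]))
in_range = all(0.0 < x < math.pi for x in fused_lsp)
```

Because the averaged coefficients remain strictly increasing within (0, π), the ordering property associated with a stable all-pole filter is preserved, which is one practical reason for averaging in this domain.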

ＬＰＣ変換部４５４は、ＬＳＰ平均化部４５３により平均化された線スペクトル対を、線形予測係数に変換する。 The LPC conversion unit 454 converts the line spectrum pair averaged by the LSP averaging unit 453 into a linear prediction coefficient.

融合線形予測係数出力部４５５は、ＬＰＣ変換部４５４により変換された線形予測係数を、融合線形予測係数として線形予測フィルタ部４４７（図３参照）に出力する。 The fused linear prediction coefficient output unit 455 outputs the linear prediction coefficient converted by the LPC conversion unit 454 to the linear prediction filter unit 447 (see FIG. 3) as a fused linear prediction coefficient.

なお、本実施の形態では、線スペクトル対が一般的にホルマント周波数との対応に優れており、線スペクトル対領域での平均化によって複数の線形予測係数に共通する平均的なスペクトル特徴を比較的良好に得ることができることから、声道フィルタ成分融合部４５は、以上のような融合方法を用いたが、線形予測係数の融合方法はこの方法に限定されるものではない。 In the present embodiment, the vocal tract filter component fusion unit 45 uses the fusion method described above because line spectrum pairs generally correspond well to formant frequencies, so that averaging in the line spectrum pair domain yields, relatively well, the average spectral characteristics common to the plurality of linear prediction coefficients. However, the method of fusing linear prediction coefficients is not limited to this method.

例えば、声道フィルタ成分融合部４５は、線形予測係数から線形予測極を算出した後、複数の線形予測極を補間して平均的な線形予測極を得る方法や、線形予測係数をＬＰＣメルケプストラムに変換してからメルケプストラム領域で平均化を行い線形予測係数に戻す方法などを用いるようにしてもよい。 For example, the vocal tract filter component fusion unit 45 may calculate linear prediction poles from the linear prediction coefficients and interpolate the plurality of linear prediction poles to obtain average linear prediction poles, or may convert the linear prediction coefficients into an LPC mel-cepstrum, average in the mel-cepstrum domain, and convert the result back into linear prediction coefficients.

また、声道フィルタ成分融合部４５は、複数個の線形予測係数の代わりに元のピッチ波形（複数個）の入力を受け付け、これらのピッチ波形を時間方向に接続したものを線形予測分析することによって、複数個のピッチ波形の特徴を平均的に表す線形予測係数を求めることにより、声道フィルタ成分を融合するようにしてもよい。 Further, the vocal tract filter component fusion unit 45 receives the input of the original pitch waveform (plurality) instead of the plurality of linear prediction coefficients, and performs linear prediction analysis of those pitch waveforms connected in the time direction. Thus, the vocal tract filter components may be fused by obtaining a linear prediction coefficient that averages the characteristics of a plurality of pitch waveforms.

なお、本実施形態では、声道フィルタ成分融合部４５で融合する声道フィルタ成分が線形予測係数の場合を例にとり説明したが、声道フィルタ成分融合部４５で融合する声道フィルタ成分は線形予測係数に限定されるものではなく、声道フィルタの特性を表すものであれば、いかなるパラメータを融合するようにしてもよい。例えば、声道フィルタにＡＲＸモデルを用いる場合、声道フィルタ成分融合部４５は、ＡＲＸモデルの各フィルタ係数を融合する。 In this embodiment, the case where the vocal tract filter component fused by the vocal tract filter component fusion unit 45 is a linear prediction coefficient has been described as an example. However, the vocal tract filter component fused by the vocal tract filter component fusion unit 45 is linear. The parameters are not limited to prediction coefficients, and any parameters may be used as long as they represent the characteristics of the vocal tract filter. For example, when an ARX model is used for the vocal tract filter, the vocal tract filter component fusion unit 45 fuses each filter coefficient of the ARX model.

図１に戻り、音源成分融合部４６は、取得部４３により取得された複数の音声素片の周期成分の音源成分を、基本周波数又は音源成分波形の形状に基づいて伸縮して、セグメント毎に融合する。なお、音声素片の周期成分と非周期成分への分離には、例えば、ＰＳＨＦ（Pitch-scaled harmonic filter）などの方法を用いることができる。 Returning to FIG. 1, the sound source component fusion unit 46 expands or contracts the sound source components of the periodic components of the plurality of speech units acquired by the acquisition unit 43, based on the fundamental frequency or the shape of the sound source component waveform, and fuses them for each segment. A method such as PSHF (pitch-scaled harmonic filter) can be used, for example, to separate a speech unit into a periodic component and an aperiodic component.

図６は、本実施の形態の音源成分融合部４６の詳細な構成の一例を示すブロック図である。図６に示すように、複数音源波形入力受付部４６１と、不良音源波形除去部４６２と、音源波形アラインメント部４６３と、音源波形変形部４６４と、音源波形平均化部４６５と、融合音源波形出力部４６６とを含む。 FIG. 6 is a block diagram illustrating an example of a detailed configuration of the sound source component fusion unit 46 of the present embodiment. As shown in FIG. 6, the sound source component fusion unit 46 includes a multiple sound source waveform input receiving unit 461, a defective sound source waveform removing unit 462, a sound source waveform alignment unit 463, a sound source waveform deforming unit 464, a sound source waveform averaging unit 465, and a fused sound source waveform output unit 466.

複数音源波形入力受付部４６１は、ディエンファシス部４４５（図３参照）から、セグメントに対する複数個のピッチ波形の各々に対応した線形予測残差の入力を音源波形の入力として受け付ける。 The multiple sound source waveform input receiving unit 461 receives, from the de-emphasis unit 445 (see FIG. 3), an input of a linear prediction residual corresponding to each of a plurality of pitch waveforms for a segment as an input of a sound source waveform.

不良音源波形除去部４６２は、複数音源波形入力受付部４６１に入力された複数個の音源波形の各々をチェックし、所定の除去条件に該当する音源成分を除去する。 The defective sound source waveform removing unit 462 checks each of the plurality of sound source waveforms input to the multiple sound source waveform input receiving unit 461 and removes the sound source component corresponding to a predetermined removal condition.

なお、「所定の除去条件」には、例えば、声門閉鎖点と考えられるパルス的な成分が複数個見られる場合（文末以外）が該当する（ガラガラ声など発声に問題がある箇所に相当するため）。また例えば、線形予測残差波形中のパルス的な成分の位置が線形予測残差波形の中心位置から大きくずれている場合が該当する（元のピッチ波形の切り出し位置に問題があるため）。また例えば、線形予測残差波形の形状が他の波形の形状と大きく異なる場合が該当する（発声のスタイルなどが大きく異なる場合に相当するため）。 The "predetermined removal condition" is met, for example, when a plurality of pulse-like components that appear to be glottal closing points are observed (other than at the end of a sentence), since this corresponds to a portion with problematic phonation such as a hoarse, rough voice. It is also met, for example, when the position of the pulse-like component in the linear prediction residual waveform deviates greatly from the center of that waveform, since this indicates a problem in the position at which the original pitch waveform was cut out. It is also met, for example, when the shape of the linear prediction residual waveform differs greatly from the shapes of the other waveforms, since this corresponds to a large difference in, for example, speaking style.

音源波形アラインメント部４６３は、不良音源波形除去部４６２により所定の除去条件に該当する音源成分が除去された複数個の音源波形の各々を、当該音源成分の特徴点の位置の誤差が閾値以下になるように時間方向にアラインメントする。具体的には、音源波形アラインメント部４６３は、不良音源波形除去部４６２から入力された複数個の線形予測残差波形の各々を、音源波形中の最も重要な位置が線形予測残差波形間で一致するように、時間方向にアラインメントする。 The sound source waveform alignment unit 463 aligns, in the time direction, the plurality of sound source waveforms remaining after the defective sound source waveform removing unit 462 has removed those meeting the predetermined removal condition, so that the error in the position of the feature point of each sound source component is at or below a threshold. Specifically, the sound source waveform alignment unit 463 aligns the plurality of linear prediction residual waveforms input from the defective sound source waveform removing unit 462 in the time direction so that the most important position in the sound source waveform coincides across the linear prediction residual waveforms.

本実施の形態では、音源波形中の最も重要な位置を声門閉鎖点であると考え、この声門閉鎖点に対応する位置を、複数個の線形予測残差波形の間で時間方向に揃える。なお、「声門閉鎖点」は、声帯振動のサイクルの中で開いていた声門が急激に閉じるタイミングを表し、線形予測残差波形においては、１基本周期内でのローカルピークがそのタイミングに対応する。 In the present embodiment, the most important position in the sound source waveform is taken to be the glottal closing point, and the positions corresponding to this glottal closing point are aligned in the time direction across the plurality of linear prediction residual waveforms. The "glottal closing point" is the instant at which the glottis, having been open during the vocal cord vibration cycle, closes abruptly; in the linear prediction residual waveform, the local peak within one fundamental period corresponds to that instant.

そこで、音源波形アラインメント部４６３は、複数個の線形予測残差波形の各々について振幅最大の位置を求め、これらの位置が複数個の線形予測残差波形間で一致するように時間方向にアラインメントする。 Therefore, the sound source waveform alignment unit 463 obtains the position of the maximum amplitude for each of the plurality of linear prediction residual waveforms, and aligns them in the time direction so that these positions coincide among the plurality of linear prediction residual waveforms. .

但し、線形予測残差波形の中には、声門閉鎖点が明確でない、即ち顕著なピークが存在しないものも存在し、これらに対して求めた振幅最大の位置が声門閉鎖点に対応していない場合もあり得る。 However, some of the linear prediction residual waveforms do not have a clear glottal closure point, i.e., there is no significant peak, and the maximum amplitude position obtained for these does not correspond to the glottal closure point. There may be cases.

このため、他の線形予測残差波形との間の相互相関など、他の指標も一緒に考慮すると、よりロバストなアラインメントが可能である。例えば、線形予測残差波形間での振幅最大の位置のずれの二乗を、線形予測残差波形間の相互相関で割った値のようなものをコスト関数として、コスト関数が最小になるようにアラインメントするようにすればよい。 Therefore, more robust alignment is possible when other indices such as cross-correlation with other linear prediction residual waveforms are taken into consideration. For example, the cost function is minimized with the cost function as the value obtained by dividing the square of the position shift of the maximum amplitude between the linear prediction residual waveforms by the cross-correlation between the linear prediction residual waveforms. Alignment should be done.
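One way to read the cost function suggested above is sketched below: for each candidate time shift, the squared offset between the maximum-amplitude positions is divided by the cross-correlation of the two waveforms, and the shift with the lowest cost is chosen. The handling of non-positive correlation (treated as infinite cost), the search range `max_lag`, and all function names are assumptions made for this illustration.

```python
def peak_index(w):
    """Index of the sample with the largest absolute amplitude."""
    return max(range(len(w)), key=lambda i: abs(w[i]))


def shifted(w, lag):
    """Delay w by lag samples (negative lag advances it), zero-padding the ends."""
    n = len(w)
    return [w[i - lag] if 0 <= i - lag < n else 0.0 for i in range(n)]


def alignment_cost(ref, w, lag):
    """Squared peak offset divided by cross-correlation at the given lag."""
    moved = shifted(w, lag)
    corr = sum(r * m for r, m in zip(ref, moved))
    if corr <= 0.0:
        return float("inf")  # assumption: reject uncorrelated alignments
    offset = peak_index(ref) - (peak_index(w) + lag)
    return offset * offset / corr


def best_lag(ref, w, max_lag):
    """Lag in [-max_lag, max_lag] that minimizes the alignment cost."""
    return min(range(-max_lag, max_lag + 1),
               key=lambda lag: alignment_cost(ref, w, lag))


reference = [0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 0.0]
other = [0.0, 0.0, 0.0, 0.0, 1.0, 5.0, 1.0, 0.0]
lag = best_lag(reference, other, max_lag=3)
aligned = shifted(other, lag)
```

When a residual waveform has no pronounced peak, the correlation term keeps an unreliable maximum-amplitude position from dominating the choice of shift.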

なお本実施形態においては、声門閉鎖点の求め方として線形予測残差波形の振幅最大の位置を求める方法を説明したが、ウェーブレット変換を用いる方法など、声門閉鎖点が適切に抽出できる方法であれば、いかなる方法を用いても良い。また、線形予測残差波形間での時間方向のアラインメントの方法についても、上記の方法に限定する必要はなく、線形予測残差波形間での声門閉鎖点のずれが、所望の範囲内に収まる方法であれば、いかなる方法を用いても良い。 In the present embodiment, the method for obtaining the position of the maximum amplitude of the linear prediction residual waveform has been described as a method for obtaining the glottal closing point, but any method that can appropriately extract the glottal closing point, such as a method using wavelet transform. Any method may be used. Further, the alignment method in the time direction between the linear prediction residual waveforms is not limited to the above method, and the gap of the glottal closing point between the linear prediction residual waveforms falls within a desired range. Any method may be used as long as it is a method.

音源波形変形部４６４は、音源波形アラインメント部４６３によりアライメントされた複数個の線形予測残差波形の各々に対して、時間方向の伸縮などの変形を加える。 The sound source waveform deforming unit 464 applies deformation, such as expansion and contraction in the time direction, to each of the plurality of linear prediction residual waveforms aligned by the sound source waveform alignment unit 463.

図７は、音源波形変形部４６４による音源波形の変形方法の一例を説明するための図であり、音源波形変形部４６４に入力される線形予測残差波形の一例を示している。 FIG. 7 is a diagram for explaining an example of a method for deforming a sound source waveform by the sound source waveform deforming unit 464, and shows an example of a linear prediction residual waveform input to the sound source waveform deforming unit 464.

図７に示す線形予測残差波形では、全区間(Ｄall)の長さは、融合単位であるピッチ波形一つ分に対応し、当該ピッチ波形が元々あった位置での音声波形の基本周期の約２倍となっている。 In the linear prediction residual waveform shown in FIG. 7, the length of the entire section (Dall) corresponds to one pitch waveform, which is the unit of fusion, and is approximately twice the fundamental period of the speech waveform at the position where that pitch waveform originally occurred.

また、Ｄ１〜Ｄ４は、線形予測残差波形の中での特徴点を基に４区間に区切ったときの各区間を表している。具体的には、振幅最大の位置すなわち声門閉鎖点に対応する位置とその周辺を含む区間がＤ３、Ｄ３の直前でかつ負の振幅を持つ区間をＤ２、Ｄ２の前方の区間をＤ１、Ｄ３の後方の区間をＤ４となっている。 D1 to D4 denote the four sections obtained by dividing the linear prediction residual waveform at its feature points. Specifically, D3 is the section containing the position of maximum amplitude, that is, the position corresponding to the glottal closing point, and its vicinity; D2 is the section immediately before D3 that has negative amplitude; D1 is the section preceding D2; and D4 is the section following D3.

本実施の形態では、音源波形変形部４６４による音源波形の変形は、目標音声の基本周波数に基づいて時間方向に伸縮することで行なう。 In the present embodiment, the sound source waveform is deformed by the sound source waveform deforming unit 464 by expanding and contracting in the time direction based on the fundamental frequency of the target sound.

音源波形は、理想的には目標音声の基本周波数に合った長さになっていることが好ましいが、音声素片が元々属していた音声の基本周波数と目標音声の基本周波数が異なる場合、音源波形の長さが目標音声にとって不適切な可能性が高い。 The sound source waveform should ideally have a length that matches the fundamental frequency of the target speech, but if the fundamental frequency of the speech to which the speech unit originally belonged differs from the fundamental frequency of the target speech, The waveform length is likely to be inappropriate for the target speech.

そこで、音源波形変形部４６４は、音源波形の全区間Ｄallの長さが目標音声の基本周期(１秒を基本周波数で割った長さ)の２倍の長さになるよう、時間方向に伸縮する。 Therefore, the sound source waveform deforming unit 464 expands or contracts the sound source waveform in the time direction so that the length of the entire section Dall becomes twice the fundamental period of the target speech (the fundamental period being one second divided by the fundamental frequency).

但し、音源波形においては、声門閉鎖点周辺の区間Ｄ３の形状は非常に重要で、この区間を変形すると音質に大きな悪影響が出る可能性が高い。 However, in the sound source waveform, the shape of the section D3 around the glottal closure point is very important, and if this section is deformed, there is a high possibility that the sound quality will be greatly affected.

そこで、本実施の形態では、音源波形変形部４６４は、区間Ｄ３は変形せず、元の形状を保持する。即ち、全区間Ｄallの長さが目標音声の基本周期の２倍になるように、区間Ｄ１、Ｄ２、およびＤ４を伸縮する。 Therefore, in the present embodiment, the sound source waveform deforming unit 464 leaves section D3 undeformed and preserves its original shape. That is, sections D1, D2, and D4 are expanded or contracted so that the length of the entire section Dall becomes twice the fundamental period of the target speech.
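The expansion and contraction described above can be sketched as follows, assuming that D1, D2, and D4 share one common stretch ratio and are resampled by linear interpolation; neither choice is specified in the text. The section boundaries, the sampling rate, the target fundamental frequency, and the function names are hypothetical.

```python
def resample(seg, new_len):
    """Linearly interpolate seg onto new_len evenly spaced points."""
    if new_len <= 0 or not seg:
        return []
    if new_len == 1 or len(seg) == 1:
        return [seg[0]] * new_len
    step = (len(seg) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * step
        lo = min(int(pos), len(seg) - 2)
        frac = pos - lo
        out.append(seg[lo] * (1.0 - frac) + seg[lo + 1] * frac)
    return out


def stretch_keeping_d3(w, d2_start, d3_start, d4_start, target_len):
    """Stretch D1, D2 and D4 so the total length is target_len; copy D3 as is.

    Returns the deformed waveform and the new start index of D3.
    """
    d1, d2 = w[:d2_start], w[d2_start:d3_start]
    d3, d4 = w[d3_start:d4_start], w[d4_start:]
    flexible_old = len(d1) + len(d2) + len(d4)
    flexible_new = target_len - len(d3)
    if flexible_old == 0 or flexible_new < 0:
        raise ValueError("target length too short to keep D3 intact")
    ratio = flexible_new / flexible_old
    n1 = round(len(d1) * ratio)
    n2 = round(len(d2) * ratio)
    n4 = flexible_new - n1 - n2  # D4 absorbs the rounding remainder
    if n4 < 0:
        raise ValueError("rounding left no room for D4")
    out = resample(d1, n1) + resample(d2, n2) + list(d3) + resample(d4, n4)
    return out, n1 + n2


fs = 22050.0                       # sampling rate in Hz (illustrative)
f0 = 220.5                         # target fundamental frequency in Hz (illustrative)
target_len = round(2.0 * fs / f0)  # twice the target fundamental period, in samples

source = [float(i % 7) - 3.0 for i in range(160)]
deformed, new_d3_start = stretch_keeping_d3(source, 50, 70, 90, target_len)
```

Choosing the stretch ratio of D2 separately, so that the start of D2 coincides across sound source waveforms, would be a variation on this sketch.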

また、音源波形において、区間Ｄ２の形状も比較的重要であるが、複数個の音源波形間で区間Ｄ２の位置が異なったまま音源波形平均化部４６５で音源波形を平均化してしまうと、区間Ｄ２の形状が壊れ、合成音の音質が劣化する要因になりうる。 In the sound source waveform, the shape of section D2 is also relatively important. If the sound source waveform averaging unit 465 averages the sound source waveforms while the position of section D2 still differs among them, the shape of section D2 is destroyed, which can degrade the quality of the synthesized speech.

そこで、音源波形変形部４６４による区間Ｄ２の伸縮においては、区間Ｄ２の開始点が複数の音源波形間で揃うように伸縮率を決めても良い。 Therefore, in the expansion / contraction of the section D2 by the sound source waveform deforming unit 464, the expansion / contraction ratio may be determined so that the start points of the section D2 are aligned among a plurality of sound source waveforms.

なお、上述した伸縮方法は一例であり、いかなる伸縮方法も適用することができる。 In addition, the expansion / contraction method mentioned above is an example, and any expansion / contraction method is applicable.

音源波形平均化部４６５は、音源波形変形部４６４により伸縮された複数の音源波形を平均化して、融合音源波形を生成する。 The sound source waveform averaging unit 465 averages the plurality of sound source waveforms expanded and contracted by the sound source waveform deforming unit 464 to generate a fused sound source waveform.

なお、本実施の形態では、単純に線形予測残差波形を平均化することにするが、不良音源波形除去部４６２などから出力される音源波形の不良度合いなどの情報を用いて、融合する線形予測残差波形間で何らかの重み付けをして平均化しても良い。また、音源波形を複数の周波数帯域に分割した後、各周波数帯域下でさらなる時間方向のアラインメントを行なった後で音源波形の平均化を行い、平均化した各帯域の音源波形を帯域間で足し合わせることによって融合する方法などでも良い。 In this embodiment, the linear prediction residual waveform is simply averaged. However, the linearity to be fused using information such as the degree of failure of the sound source waveform output from the defective sound source waveform removing unit 462 or the like. You may weight and average between prediction residual waveforms. Also, after dividing the sound source waveform into multiple frequency bands, after further alignment in the time direction under each frequency band, the sound source waveforms are averaged, and the averaged sound source waveforms of each band are added between the bands. A method of fusing by combining them may be used.
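The simple averaging and its weighted variant can be sketched as follows. The weights are hypothetical; how a degree of defectiveness would be mapped to a weight is not specified in the text.

```python
def average_waveforms(waveforms, weights=None):
    """Sample-wise (optionally weighted) average of equal-length waveforms."""
    if weights is None:
        weights = [1.0] * len(waveforms)
    length = len(waveforms[0])
    if any(len(w) != length for w in waveforms):
        raise ValueError("waveforms must be deformed to a common length first")
    total = sum(weights)
    return [sum(wt * w[n] for wt, w in zip(weights, waveforms)) / total
            for n in range(length)]


residuals = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
simple = average_waveforms(residuals)
weighted = average_waveforms(residuals, weights=[3.0, 1.0])
```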

融合音源波形出力部４６６は、音源波形平均化部４６５で生成された融合音源波形を、線形予測フィルタ部４４７（図３参照）に出力する。 The fused sound source waveform output unit 466 outputs the fused sound source waveform generated by the sound source waveform averaging unit 465 to the linear prediction filter unit 447 (see FIG. 3).

なお、本実施の形態では、音源成分融合部４６は、音源波形の融合を波形自体の平均化によって行なっているが、音源波形をまず声帯音源波モデルで近似して、モデルのパラメータの領域で平均化し、平均化したパラメータを用いて声帯音源波モデルで音源波形を合成することによって音源波形を融合するようにしてもよい。 In the present embodiment, the sound source component fusion unit 46 performs sound source waveform fusion by averaging the waveforms themselves. However, the sound source waveform is first approximated by a vocal cord sound source wave model, and is used in the parameter region of the model. The sound source waveforms may be merged by averaging and synthesizing the sound source waveform with a vocal cord sound source wave model using the averaged parameters.

声帯音源波モデルとしては、例えば、ＬＦ（Liljencrants and Fant）モデルがあり、ＬＦモデルにおいては、５つのパラメータを用いて、音源波形の特徴を、高い自由度かつ良好に表すことができる。各音源波形をＬＦモデルで近似し、これらの５つのパラメータのそれぞれを平均化することによって、音源波形の特徴を壊すことなく融合することができる。 An example of a glottal source model is the LF (Liljencrants and Fant) model, which uses five parameters to represent the characteristics of a sound source waveform well and with a high degree of freedom. By approximating each sound source waveform with the LF model and averaging each of the five parameters, the sound source waveforms can be fused without destroying their characteristics.

このような、声帯音源波モデルのパラメータ領域で音源波形を融合する場合、ＬＦモデルに限らず、Rosenbergモデルなど他の声帯音源波モデルを用いるようにしてもよい。また、前述したように、音源成分融合部４６で融合する音源成分は、放射による効果を含むものであっても良いし、含まないものであっても良い。 When the sound source waveforms are fused in the parameter region of the vocal cord sound wave model, other vocal cord sound wave models such as the Rosenberg model may be used instead of the LF model. Further, as described above, the sound source component fused by the sound source component fusion unit 46 may or may not include an effect due to radiation.

図１に戻り、生成部４７は、素片融合部４４により融合された融合音声素片を変形および接続して、合成音声の音声波形を生成する。具体的には、生成部４７は、素片融合部４４で生成されたセグメント毎の融合音声素片を、音韻・韻律入力受付部４１に入力された韻律情報に従って韻律変形しながら、セグメント間で接続することによって、音声波形を生成する。 Returning to FIG. 1, the generation unit 47 deforms and connects the fusion speech units fused by the unit fusion unit 44 to generate a speech waveform of synthesized speech. Specifically, the generation unit 47 changes the fusion speech unit for each segment generated by the unit fusion unit 44 between segments while changing the prosody according to the prosodic information input to the phoneme / prosody input reception unit 41. By connecting, an audio waveform is generated.

図８は、生成部４７による音声波形の生成処理の一例を説明するための図である。図８では、素片融合部４４で生成された、音素「ａ」「Ｎ」「ｓ」「a」「a」の各セグメントに対する音声素片を、変形・接続して、「ａＮｓａａ」という音声波形を生成する例を示している。 FIG. 8 is a diagram for explaining an example of a voice waveform generation process by the generation unit 47. In FIG. 8, the speech unit “a”, “N”, “s”, “a”, “a” generated by the segment fusion unit 44 is transformed and connected to the speech “aNsaa”. An example of generating a waveform is shown.

なお図８に示す例では、有声音の音声素片はピッチ波形の系列で表現されている。一方、無声音の音声素片は、フレーム毎の波形として表現されている。また、図８の点線は、目標の音韻継続時間長に従って分割した音素毎のセグメントの境界を表し、白い三角は、目標の基本周波数に従って配置した各ピッチ波形を重畳する位置（ピッチマーク）を示している。 In the example shown in FIG. 8, a voiced speech segment is represented by a series of pitch waveforms. On the other hand, an unvoiced speech segment is represented as a waveform for each frame. In addition, the dotted line in FIG. 8 represents the boundary of the segment for each phoneme divided according to the target phoneme duration, and the white triangle indicates the position (pitch mark) where each pitch waveform arranged according to the target fundamental frequency is superimposed. ing.

生成部４７は、図８に示すように、有声音については音声素片のそれぞれのピッチ波形を対応するピッチマーク上の重畳し、無声音については各フレームの波形をセグメント中の各フレームに対応する部分に貼り付けることによって、所望の韻律（ここでは、基本周波数、音韻継続時間長）を持った音声波形を生成する。 As shown in FIG. 8, the generation unit 47 superimposes the pitch waveform of each speech segment on the corresponding pitch mark for voiced sound, and corresponds the waveform of each frame to each frame in the segment for unvoiced sound. By pasting on the part, a speech waveform having a desired prosody (here, fundamental frequency, phoneme duration) is generated.
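The superposition of pitch waveforms on pitch marks can be sketched as a plain overlap-add. The sketch assumes that each pitch waveform has already been windowed when it was extracted and that its center sample is placed on the pitch mark; the sample values and mark positions are illustrative, and `overlap_add` is a hypothetical name.

```python
def overlap_add(pitch_waveforms, pitch_marks, length):
    """Add each pitch waveform into the output, centered on its pitch mark."""
    out = [0.0] * length
    for w, mark in zip(pitch_waveforms, pitch_marks):
        start = mark - len(w) // 2
        for i, v in enumerate(w):
            n = start + i
            if 0 <= n < length:  # drop samples that fall outside the output
                out[n] += v
    return out


waveforms = [[1.0, 2.0, 1.0], [1.0, 2.0, 1.0]]
marks = [2, 4]  # pitch mark positions in samples
speech = overlap_add(waveforms, marks, length=7)
```

Placing the marks closer together or farther apart changes the fundamental frequency of the output, which is how the target prosody is imposed on the voiced segments.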

出力部４８は、生成部４７で生成した音声波形を出力する。 The output unit 48 outputs the voice waveform generated by the generation unit 47.

次に、本実施の形態の音声処理装置１の動作について説明する。図９は、本実施の形態の音声処理装置１で行われる音声合成の処理手順の流れの一例を示すフローチャートである。 Next, the operation of the speech processing apparatus 1 according to this embodiment will be described. FIG. 9 is a flowchart showing an example of the flow of a speech synthesis process performed by the speech processing apparatus 1 according to the present embodiment.

ステップＳ１０では、テキスト入力部１０は、音声処理の対象となるテキストを入力する。 In step S 10, the text input unit 10 inputs text to be subjected to speech processing.

ステップＳ１１では、言語処理部２０は、テキスト入力部１０から入力されるテキストの形態素解析や構文解析などの言語解析を行う。 In step S 11, the language processing unit 20 performs language analysis such as morphological analysis and syntax analysis of the text input from the text input unit 10.

ステップＳ１２では、韻律制御部３０は、言語処理部２０の言語解析結果からアクセントやイントネーションを処理し、音韻系列及び韻律情報を生成する。 In step S12, the prosody control unit 30 processes accents and intonations from the language analysis result of the language processing unit 20, and generates phoneme sequences and prosody information.

ステップＳ１３では、音韻・韻律入力受付部４１は、韻律制御部３０から目標音声に対応する音韻系列を合成単位で分割した複数のセグメントと、複数の前記セグメントの各々に対応する韻律情報の入力を受け付ける。 In step S13, the phonological / prosodic input receiving unit 41 receives a plurality of segments obtained by dividing the phonological sequence corresponding to the target speech from the prosody control unit 30 in synthesis units and the prosodic information corresponding to each of the plurality of segments. Accept.

ステップＳ１４では、取得部４３は、音韻・韻律入力受付部４１により分割された複数のセグメントの各々に対して、セグメント及びセグメントに対応する韻律情報に関連付けられた複数の音声素片を音声素片記憶部４２から取得する音声素片取得処理を行う。なお、音声素片取得処理の詳細については、後述する。 In step S 14, for each of the plurality of segments divided by the phoneme / prosody input receiving unit 41, the acquisition unit 43 converts a plurality of speech units associated with the segment and the prosodic information corresponding to the segment into speech units. A speech segment acquisition process acquired from the storage unit 42 is performed. Details of the speech segment acquisition process will be described later.

ステップＳ１５では、声道フィルタ成分融合部４５は、取得部４３により取得された複数の音声素片の声道フィルタ成分をセグメント毎に融合する。 In step S15, the vocal tract filter component fusion unit 45 fuses, for each segment, the vocal tract filter components of the plurality of speech units acquired by the acquisition unit 43.

ステップＳ１６では、音源成分融合部４６は、取得部４３により取得された複数の音声素片の周期成分の音源成分を、基本周波数又は音源成分波形の形状に基づいて伸縮して、セグメント毎に融合する。 In step S16, the sound source component fusion unit 46 expands or contracts the sound source components of the periodic components of the plurality of speech units acquired by the acquisition unit 43, based on the fundamental frequency or the shape of the sound source component waveform, and fuses them for each segment.

ステップＳ１７では、素片融合部４４は、声道フィルタ成分融合部４５で融合された融合声道フィルタを用いて、音源成分融合部４６で融合された融合音源成分をフィルタリングすることにより、取得部４３により取得された複数の音声素片をセグメント毎に融合する。 In step S17, the unit fusion unit 44 filters the fused sound source component produced by the sound source component fusion unit 46 with the fused vocal tract filter produced by the vocal tract filter component fusion unit 45, thereby fusing, for each segment, the plurality of speech units acquired by the acquisition unit 43.

ステップＳ１８では、生成部４７は、素片融合部４４により融合された融合音声素片を変形および接続して、合成音声の音声波形を生成する。 In step S18, the generation unit 47 deforms and connects the fusion speech units fused by the unit fusion unit 44, and generates a speech waveform of synthesized speech.

ステップＳ１９では、出力部４８は、生成部４７で生成した音声波形を出力する。 In step S19, the output unit 48 outputs the speech waveform generated by the generation unit 47.
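
Steps S15 to S17 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of this embodiment: the vocal tract filter is assumed to be an all-pole filter given by its coefficients, filters are fused by plain coefficient averaging as a placeholder (which does not guarantee a stable filter), and each sound source waveform is assumed to be one pitch cycle that is linearly resampled to a common length before averaging. All function names are hypothetical.

```python
def stretch(waveform, target_len):
    """Linearly resample one pitch-cycle source waveform to target_len samples."""
    if target_len == 1 or len(waveform) == 1:
        return [waveform[0]] * target_len
    scale = (len(waveform) - 1) / (target_len - 1)
    out = []
    for n in range(target_len):
        pos = n * scale
        lo = int(pos)
        hi = min(lo + 1, len(waveform) - 1)
        frac = pos - lo
        out.append((1 - frac) * waveform[lo] + frac * waveform[hi])
    return out


def fuse_sources(waveforms, target_len):
    """Step S16 (assumed form): stretch every waveform to a common length, then average."""
    stretched = [stretch(w, target_len) for w in waveforms]
    return [sum(col) / len(col) for col in zip(*stretched)]


def fuse_filters(coeff_sets):
    """Step S15 (placeholder): average filter coefficients element by element."""
    return [sum(col) / len(col) for col in zip(*coeff_sets)]


def all_pole_filter(source, a):
    """Step S17 (assumed form): y[n] = x[n] - sum_k a[k] * y[n-1-k]."""
    y = []
    for n, x in enumerate(source):
        acc = x
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y


def fuse_units(source_waveforms, coeff_sets, target_len):
    """Fuse several speech units of one segment into one fused unit."""
    fused_source = fuse_sources(source_waveforms, target_len)
    fused_filter = fuse_filters(coeff_sets)
    return all_pole_filter(fused_source, fused_filter)
```

In this sketch `target_len` stands for the pitch period in samples implied by the target fundamental frequency; for example, at a sampling rate of 16 kHz and a target of 200 Hz it would be 80 samples.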

次に、図１０を参照しながら、図９のステップＳ１４の音声素片取得処理について説明する。図１０は、図９のステップＳ１４の音声素片取得処理の処理手順の流れの一例を示すフローチャートである。なお図１０では、Ｎ（Ｎ≧２）個のセグメント各々に対してＭ（Ｍ≧２）個ずつの音声素片を選ぶ例について説明する。 Next, the speech unit acquisition process in step S14 of FIG. 9 will be described with reference to FIG. 10. FIG. 10 is a flowchart showing an example of the flow of the speech unit acquisition process in step S14 of FIG. 9. FIG. 10 illustrates an example in which M (M ≧ 2) speech units are selected for each of N (N ≧ 2) segments.

ステップＳ１０１では、取得部４３は、音声素片記憶部４２に記憶されている音声素片群の中から、各セグメント１つずつ音声素片の系列を選択する。具体的には、取得部４３は、目標の音韻系列・韻律情報と、音声素片記憶部４２に記憶された環境情報を基に、系列としてのコストの総和（トータルコスト）が最小となる音声素片の系列である最適素片系列を求め、選択する。なお、最適素片系列の探索には、動的計画法（ＤＰ：dynamic programming）を用いることで効率的に行うことができる。 In step S101, the acquisition unit 43 selects, from the group of speech units stored in the speech unit storage unit 42, a sequence of speech units containing one unit per segment. Specifically, based on the target phoneme sequence and prosody information and on the environment information stored in the speech unit storage unit 42, the acquisition unit 43 finds and selects the optimum unit sequence, that is, the sequence of speech units whose total cost over the sequence is smallest. The search for the optimum unit sequence can be performed efficiently by using dynamic programming (DP).

ステップＳ１０２では、取得部４３は、セグメント番号を表すカウンターｉに初期値「１」を代入する。 In step S102, the acquisition unit 43 assigns an initial value “1” to a counter i that represents a segment number.

ステップＳ１０３では、取得部４３は、セグメントｉに対する各音声素片候補に対してコストを算出する。この際に用いるコストは、音声素片候補の目標コストと、前後のセグメントの最適音声素片（最適素片系列に含まれる音声素片）及び音声素片候補の接続コストと、の和である。 In step S103, the acquisition unit 43 calculates a cost for each speech unit candidate for segment i. The cost used here is the sum of the target cost of the speech unit candidate and the connection costs between the candidate and the optimum speech units (the speech units included in the optimum unit sequence) of the preceding and following segments.

ステップＳ１０４では、取得部４３は、算出したコストを用いて、コストの小さい上位Ｍ個の音声素片を選択する。 In step S104, the acquisition unit 43 uses the calculated cost to select the top M speech units with the lowest cost.

ステップＳ１０５では、取得部４３は、カウンターｉがセグメント数Ｎ以下であるか否かを判定する。 In step S105, the acquisition unit 43 determines whether the counter i is equal to or less than the number of segments N.

セグメント数Ｎ以下である場合には（ステップＳ１０５でＹｅｓ）、ステップＳ１０６へ進み、セグメント数Ｎ以下でない場合には（ステップＳ１０５でＮｏ）、取得部４３は、音声素片取得処理を終了する。 If the counter i is less than or equal to the number of segments N (Yes in step S105), the process proceeds to step S106. Otherwise (No in step S105), the acquisition unit 43 ends the speech unit acquisition process.

ステップＳ１０６では、取得部４３は、カウンターｉをインクリメントして、セグメントｉに対する各音声素片候補に対してコストを算出する。 In step S106, the acquisition unit 43 increments the counter i and calculates a cost for each speech element candidate for the segment i.
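
The acquisition procedure of steps S101 to S106 can be sketched in Python as follows. This is a minimal sketch under assumptions: `target_cost` and `join_cost` are hypothetical caller-supplied functions standing in for the target cost and the connection cost, speech units are treated as opaque values, and the loop over segments is written as a plain `for` loop rather than with the counter i of the flowchart.

```python
def optimal_sequence(candidates, target_cost, join_cost):
    """Step S101: dynamic-programming search for the minimum-total-cost sequence.

    candidates[i] is the list of candidate units for segment i.
    """
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j], back-pointer)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, u), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(i, u), prev_j))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path


def select_top_m(candidates, target_cost, join_cost, m):
    """Steps S102 to S106: for each segment, keep the M lowest-cost candidates."""
    optimal = optimal_sequence(candidates, target_cost, join_cost)
    selected = []
    for i, units in enumerate(candidates):
        def cost(u, i=i):
            # Target cost plus connection costs to the optimum neighbours.
            c = target_cost(i, u)
            if i > 0:
                c += join_cost(optimal[i - 1], u)
            if i + 1 < len(candidates):
                c += join_cost(u, optimal[i + 1])
            return c
        selected.append(sorted(units, key=cost)[:m])
    return selected
```

Fixing the neighbours to the optimum unit sequence keeps the per-segment ranking independent of the other segments' candidates, which is why the second pass needs no further dynamic programming.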

このように本実施の形態では、音声素片の特徴を音源と声道フィルタの特徴に分離して融合することによって、音源やスペクトルの構造を壊すことなく融合することが可能であり、従来の複数素片選択融合方式以上に高い音質の合成音声が生成できる。 As described above, in the present embodiment, the features of the speech units are separated into sound source features and vocal tract filter features before fusion, so the units can be fused without destroying the structure of the sound source or the spectrum, and synthesized speech with higher sound quality than the conventional multiple-unit selection and fusion method can be generated.

特に本実施の形態では、目標音声の基本周波数や音源波形の形状に基づいて音源波形を変形することによって、音源波形間の形状の違いのうち、制御可能な違いをできるだけ取り除いたうえで音源波形の平均化を行なえるので、音源波形の形状における特徴をよく保った融合が可能となり、より高い音質が実現できる。 In particular, in the present embodiment, the sound source waveforms are deformed based on the fundamental frequency of the target speech and the shapes of the sound source waveforms, so that the controllable differences in shape among the sound source waveforms are removed as far as possible before the waveforms are averaged. This allows fusion that preserves the characteristic shape of the sound source waveform well, and higher sound quality can be achieved.

また、本実施の形態では、不適切な形状を持つ音源波形を検出し、これらの音源波形を融合の対象から除去しているため、元の発声に問題があったり、途中の分析で何らかの失敗をした音声素片が含まれていても、音質の劣化が起こりにくい。 Furthermore, in the present embodiment, sound source waveforms with inappropriate shapes are detected and excluded from fusion, so sound quality is unlikely to deteriorate even when the acquired units include a speech unit whose original utterance was problematic or whose analysis failed at some stage.

（変形例） (Modification)
なお、本発明は、上記実施の形態そのままに限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で構成要素を変形して具体化することができる。また、上記実施の形態に開示されている複数の構成要素の適宜な組み合わせにより、種々の発明を形成することができる。例えば、実施の形態に示される全構成要素からいくつかの構成要素を削除してもよい。さらに、異なる実施の形態にわたる構成要素を適宜組み合わせても良い。 The present invention is not limited to the above embodiment as it is, and in the implementation stage the constituent elements can be modified and embodied without departing from the gist of the invention. In addition, various inventions can be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment. Furthermore, constituent elements of different embodiments may be combined as appropriate.

（変形例１） (Modification 1)
上記実施の形態では、取得部４３が、声道フィルタ成分と音源成分で共通の音声素片を取得する例について説明したが、それぞれの成分に適した音声素片を別々に取得するようにしてもよい。 In the above embodiment, an example was described in which the acquisition unit 43 acquires speech units common to the vocal tract filter component and the sound source component; however, speech units suited to each component may be acquired separately.

この場合、音声素片の取得の尺度となるコストの計算におけるサブコスト間の重み付けの仕方を、声道フィルタ成分と音源成分の各々で変えることにより、取得部４３は、各成分に適した音声素片（両成分間で異なる音声素片）を取得する。 In this case, the acquisition unit 43 acquires speech units suited to each component (speech units that differ between the two components) by changing, between the vocal tract filter component and the sound source component, how the sub-costs are weighted in the calculation of the cost that serves as the criterion for acquiring speech units.

具体的には、声道フィルタ成分については、前後の音素などの音韻環境による影響を特に大きく受けやすく、また前後の音声素片とスペクトルを滑らかに接続することが合成音の音質にとって重要であることから、音韻環境コストやスペクトル接続コストの重みを重めに設定する。 Specifically, the vocal tract filter component is particularly strongly affected by the phonetic environment, such as the preceding and following phonemes, and a smooth spectral connection with the preceding and following speech units is important for the quality of the synthesized speech. The weights of the phonetic environment cost and the spectral connection cost are therefore set relatively high.

一方、音源成分については、呼気段落内や文内での位置による影響や（例えば、文頭と文末では声帯の緊張度が変わるなど）基本周波数による影響（例えば、声の高いところの方が声帯の緊張度が高いなど）を特に強く受けやすいため、位置的環境コストや基本周波数コストの重みを重めに設定する。 The sound source component, on the other hand, is particularly strongly affected by position within the breath group or the sentence (for example, the tension of the vocal cords differs between the beginning and the end of a sentence) and by the fundamental frequency (for example, the vocal cords are more tense where the voice is higher). The weights of the positional environment cost and the fundamental frequency cost are therefore set relatively high.

このようにすると、声道フィルタ成分と音源成分の各々の融合に用いる音声素片を、各成分に合った方法で取得するので、上記実施の形態よりも高い音質が実現できる。特に、音声素片のバリエーションが限られている場合に、有効性が高い。この場合には、声道フィルタ成分と音源成分で音声素片の取得数が異なる場合がある。 In this way, since the speech element used for the fusion of each of the vocal tract filter component and the sound source component is acquired by a method suitable for each component, higher sound quality than the above embodiment can be realized. This is particularly effective when there are limited variations of speech segments. In this case, the number of acquired speech segments may differ between the vocal tract filter component and the sound source component.

なお、音声素片の取得方法は、声道フィルタ成分と音源成分のそれぞれで全く異なる方法を用いるようにしてもよい。 Note that the speech segment acquisition method may be a completely different method for each of the vocal tract filter component and the sound source component.
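
The component-specific weighting of Modification 1 can be sketched as follows. The sub-cost names and all weight values are hypothetical and chosen only to show the idea; the source states only that the phonetic environment and spectral connection costs are weighted more heavily for the vocal tract filter component, and the positional environment and fundamental frequency costs more heavily for the sound source component.

```python
# Hypothetical weights: heavier on context and spectral join for the vocal tract filter.
VOCAL_TRACT_WEIGHTS = {
    "phonetic_context": 2.0,
    "spectral_join": 2.0,
    "position": 0.5,
    "f0": 0.5,
}

# Hypothetical weights: heavier on position and fundamental frequency for the source.
SOURCE_WEIGHTS = {
    "phonetic_context": 0.5,
    "spectral_join": 0.5,
    "position": 2.0,
    "f0": 2.0,
}


def weighted_cost(sub_costs, weights):
    """Total cost as the weighted sum of the named sub-costs."""
    return sum(weights[name] * value for name, value in sub_costs.items())


def pick_units(candidates, weights, m):
    """Return the ids of the m candidates with the lowest weighted cost.

    candidates maps a unit id to its dictionary of sub-costs.
    """
    ranked = sorted(candidates, key=lambda uid: weighted_cost(candidates[uid], weights))
    return ranked[:m]
```

Calling `pick_units` once with each weight set yields two unit lists for the same segment, which may differ in content and, if a different `m` is passed for each component, in length.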

（変形例２） (Modification 2)
上記実施の形態では、音声処理装置１が、融合音声素片から音声波形を生成して出力する例について説明したが、音声処理装置１は、融合音声素片を作成する装置としてもよい。この場合、音声処理装置１の音声合成部４０は、生成部４７及び出力部４８を含まなくてもよい。 In the above embodiment, an example was described in which the speech processing apparatus 1 generates and outputs a speech waveform from the fused speech units; however, the speech processing apparatus 1 may instead be an apparatus that creates fused speech units. In this case, the speech synthesis unit 40 of the speech processing apparatus 1 need not include the generation unit 47 and the output unit 48.

変形例２の音声処理装置の動作について説明する。図１１は、変形例２の音声処理装置で行われる融合音声素片の作成手順の流れの一例を示すフローチャートである。 The operation of the sound processing apparatus according to the second modification will be described. FIG. 11 is a flowchart illustrating an example of a procedure for creating a fusion speech unit performed by the speech processing apparatus according to the second modification.

まず、テキストの入力から音声素片の融合までは（ステップＳ２０１〜ステップＳ２０８）、図９のフローチャートのステップＳ１０〜ステップＳ１７までの処理と同様であるため、説明を省略する。なお、ステップＳ２０１では、テキスト入力部により数千、数万文といった大量のテキストが入力される。このため、ステップＳ２０８では、素片融合部により大量の融合音声素片が生成される。 First, the processing from text input to speech unit fusion (steps S201 to S208) is the same as steps S10 to S17 in the flowchart of FIG. 9, so its description is omitted. In step S201, however, a large amount of text, such as thousands or tens of thousands of sentences, is input by the text input unit. Consequently, in step S208, a large number of fused speech units are generated by the unit fusion unit.

ステップＳ２０９では、素片融合部４４は、生成した大量の融合音声素片の中から、融合音声素片の素片種別毎に融合音声素片をいくつずつ抽出するかを決定する。 In step S209, the unit fusion unit 44 determines how many fusion speech units are to be extracted for each unit type of the fusion speech unit from the generated large amount of fusion speech units.

ここで、素片種別とは、素片の音韻環境などで分類された種別を指す。例えば、素片種別／ａ／は、音素／ａ／に対応する素片のこととする。各素片種別に何個ずつ素片を配分するかは、各素片種別の素片の出現頻度などに応じて決める。例えば、素片種別／ａ／の素片が素片種別／ｕ／の素片よりも出現頻度が高い場合は、素片種別／ａ／に多めの素片を配分することとする。素片種別ｉに配分する素片の個数をＮ_ｉ（Ｎ_ｉ≧１）とする。 Here, the unit type refers to a type into which units are classified by, for example, their phonetic environment. For example, the unit type /a/ denotes units corresponding to the phoneme /a/. How many units are allocated to each unit type is determined according to, for example, the appearance frequency of units of that type. For example, when units of type /a/ appear more frequently than units of type /u/, more units are allocated to type /a/. Let N_i (N_i ≧ 1) be the number of units allocated to unit type i.

ステップＳ２１０では、素片融合部４４は、素片種別を表すカウンターｉに初期値「１」を代入する。 In step S210, the segment fusion unit 44 substitutes an initial value “1” for the counter i representing the segment type.

ステップＳ２１１では、素片融合部４４は、素片種別ｉの融合済み周期成分素片及び融合済み非周期成分素片を、素片融合部４４により融合された素片種別ｉの融合音声素片の中から、出現頻度が上位のものをＮ_ｉずつ抽出する。 In step S211, the unit fusion unit 44 extracts, from the fused speech units of unit type i fused by the unit fusion unit 44, the N_i fused periodic component units and the N_i fused aperiodic component units of unit type i with the highest appearance frequencies.

ステップＳ２１２では、素片融合部４４は、カウンターｉが素片種別数Ｎ（Ｎ≧１）以下であるか否かを判定する。 In step S212, the segment fusion unit 44 determines whether or not the counter i is equal to or less than the segment type number N (N ≧ 1).

素片種別数Ｎ以下である場合には（ステップＳ２１２でＹｅｓ）、ステップＳ２１３へ進み、素片種別数Ｎ以下でない場合には（ステップＳ２１２でＮｏ）、素片融合部４４は、融合音声素片の作成を終了する。 If the counter i is less than or equal to the number of unit types N (Yes in step S212), the process proceeds to step S213. Otherwise (No in step S212), the unit fusion unit 44 ends the creation of fused speech units.

ステップＳ２１３では、素片融合部４４は、カウンターｉをインクリメントして、素片融合部４４により融合された素片種別ｉの融合音声素片の中から、出現頻度が上位のものをＮ_ｉずつ抽出する。 In step S213, the unit fusion unit 44 increments the counter i and extracts, from the fused speech units of unit type i fused by the unit fusion unit 44, the N_i units with the highest appearance frequencies.
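
The extraction in steps S209 to S213 can be sketched in Python as follows. This is a sketch under assumptions: the source says only that N_i is decided according to, for example, appearance frequency, so the proportional split of a fixed budget used here is one hypothetical policy, and the data layout (each fused unit as an id paired with its appearance frequency) and the function names are also hypothetical.

```python
def allocate_counts(type_frequency, budget):
    """Step S209 (assumed policy): split the budget in proportion to frequency.

    Every unit type receives at least one unit, matching N_i >= 1.
    """
    total = sum(type_frequency.values())
    return {t: max(1, int(budget * f / total)) for t, f in type_frequency.items()}


def extract_units(fused_units, counts):
    """Steps S210 to S213: per unit type, keep the N_i most frequent fused units.

    fused_units maps a unit type to a list of (unit_id, appearance_frequency).
    """
    result = {}
    for unit_type, units in fused_units.items():
        ranked = sorted(units, key=lambda u: u[1], reverse=True)
        result[unit_type] = [uid for uid, _ in ranked[:counts[unit_type]]]
    return result
```

Because of the rounding down and the minimum of one unit per type, the allocated counts in this sketch need not add up exactly to the budget.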

このようにすると、後述する変形例３のように、音声素片の融合機能を有していない音声処理装置であっても、変形例２の音声処理装置により作成された融合音声素片を格納することで、音声波形に内在する音源および声道フィルタの特徴を壊すことなく融合された音声素片を用いた音声合成を行うことができる。従って、従来の複数素片選択融合方式以上に高い音質の合成音声を生成できる。 In this way, even in a speech processing device that does not have a speech unit fusion function, as in Modification 3 described later, the fusion speech unit created by the speech processing apparatus in Modification 2 is stored. By doing so, it is possible to perform speech synthesis using speech segments fused without destroying the features of the sound source and vocal tract filter inherent in the speech waveform. Therefore, it is possible to generate synthesized speech with higher sound quality than the conventional multiple unit selection fusion method.

（変形例３） (Modification 3)
上記実施の形態では、音声処理装置１が、融合音声素片を生成する例について説明したが、変形例３では、例えば変形例２の音声処理装置などにより作成された融合音声素片を予め格納している音声処理装置について説明する。 In the above embodiment, an example was described in which the speech processing apparatus 1 generates fused speech units. Modification 3 describes a speech processing apparatus that stores in advance fused speech units created by, for example, the speech processing apparatus of Modification 2.

なお、以下では、上記実施の形態との相違点の説明を主に行い、上記実施の形態と同様の機能を有する構成要素については、上記実施の形態と同様の名称・符号を付し、その説明を省略する。 In the following, mainly the differences from the above embodiment are described. Constituent elements having the same functions as in the above embodiment are given the same names and reference numerals, and their description is omitted.

図１２は、変形例３の音声処理装置１００１の構成の一例を示すブロック図である。音声処理装置１００１の合成部１０４０は、素片融合部４４、声道フィルタ成分融合部４５、及び音源成分融合部４６を備えていない点で、上記実施の形態の音声処理装置１と相違する。また合成部１０４０は、融合音声素を記憶する融合音声素片記憶部１０４２を備えている点で上記実施の形態の音声処理装置１と相違する。また取得部１０４３は、融合音声素片記憶部１０４２から融合音声素を取得する点で上記実施の形態の音声処理装置１と相違する。 FIG. 12 is a block diagram illustrating an example of the configuration of the speech processing apparatus 1001 according to Modification 3. The synthesis unit 1040 of the speech processing apparatus 1001 differs from the speech processing apparatus 1 of the above embodiment in that it does not include the unit fusion unit 44, the vocal tract filter component fusion unit 45, or the sound source component fusion unit 46. The synthesis unit 1040 also differs in that it includes a fused speech unit storage unit 1042 that stores fused speech units. In addition, the acquisition unit 1043 differs in that it acquires fused speech units from the fused speech unit storage unit 1042.

融合音声素片記憶部１０４２は、前述の変形例２の音声処理装置により生成された融合済音声素片の中から、出現頻度の高い音声素片を抽出したものを記憶する。 The fused speech unit storage unit 1042 stores a speech unit having a high appearance frequency extracted from the fused speech units generated by the speech processing apparatus according to the second modification.

なお、融合音声素片記憶部１０４２に記憶するために選択する音声素片の個数は、融合音声素片記憶部１０４２のサイズと合成音声の音質とのトレードオフで、任意に決めることができる。例えば、より多くの音声素片を選択して記憶すれば、融合音声素片記憶部１０４２のサイズは大きくなるが、合成音声の音質を高くすることができる。また例えば、音声素片の数を減らせば、合成音声の音質は犠牲になるが、融合音声素片記憶部１０４２のサイズを小さくすることができる。 Note that the number of speech units to be selected for storage in the fused speech unit storage unit 1042 can be arbitrarily determined by a trade-off between the size of the fused speech unit storage unit 1042 and the sound quality of the synthesized speech. For example, if more speech units are selected and stored, the size of the fused speech unit storage unit 1042 increases, but the quality of the synthesized speech can be improved. For example, if the number of speech units is reduced, the quality of the synthesized speech is sacrificed, but the size of the fused speech unit storage unit 1042 can be reduced.

このように変形例３の音声処理装置１００１によれば、音声素片の融合処理が不要となるため、ＣＰＵスペックが非常に低いローエンドのミドルウェア向けにも対応することができる。 As described above, the speech processing apparatus 1001 of Modification 3 does not require the speech unit fusion process, so it can also be used in low-end middleware with very low CPU specifications.

（変形例４） (Modification 4)
なお、上記実施の形態では、出現頻度の高い素片を抽出する方法を説明したが、素片の両端で算出したメルケプストラムなどの素片の特徴量を用いて抽出しても良い。 In the above embodiment, a method of extracting units with a high appearance frequency was described; however, units may instead be extracted using feature values of the units, such as mel-cepstra calculated at both ends of each unit.

この場合、各素片種別に対して出力された融合済み周期成分素片及び融合済み非周期成分素片をそれぞれ、素片の特徴量を用いてクラスタリングし、分割された各クラスタの中心（セントロイド）に最も近い素片を抽出する。クラスタリングにおけるクラスタ数は、各素片種別に配分する素片数に応じて決める。 In this case, the fused periodic component units and the fused aperiodic component units output for each unit type are each clustered using the feature values of the units, and the unit closest to the center (centroid) of each resulting cluster is extracted. The number of clusters is determined according to the number of units allocated to each unit type.

出現頻度に基づいて素片を抽出する場合は、出現頻度が低いコンテキストに対して適切な素片が抽出されない可能性があり、入力テキストによっては音質が大きく劣化してしまう可能性があるが、本方法によって素片を抽出した場合、特徴量空間をできるだけ広く覆うような素片のセットが抽出できるため、出現頻度に基づいて抽出した場合より安定した合成音が生成できる。 When extracting a segment based on the appearance frequency, an appropriate segment may not be extracted for a context with a low appearance frequency, and depending on the input text, the sound quality may be greatly degraded. When a segment is extracted by this method, a set of segments that covers the feature amount space as much as possible can be extracted, so that a more stable synthesized sound can be generated than when extracted based on the appearance frequency.
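
The centroid-based extraction of Modification 4 can be sketched in Python as follows. The source does not name a clustering algorithm, so k-means is used here as an assumption, with a deterministic initialisation from the first k feature vectors to keep the sketch reproducible. The feature vectors stand in for features such as mel-cepstra, and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(features, k, iterations=20):
    """Plain k-means; centroids start at the first k vectors (an assumption)."""
    centroids = [list(f) for f in features[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k), key=lambda c: sq_dist(f, centroids[c]))
            clusters[nearest].append(f)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


def representatives(units, k):
    """Return, for each of k clusters, the id of the unit closest to its centroid.

    units maps a unit id to its feature vector.
    """
    ids = list(units)
    centroids = kmeans([units[i] for i in ids], k)
    return [min(ids, key=lambda i: sq_dist(units[i], c)) for c in centroids]
```

Here `k` plays the role of the number of units allocated to the unit type, so one representative is kept per cluster.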

なお、上記実施の形態の音声処理装置１、１００１は、ＣＰＵ（Central Processing Unit）などの制御装置、ＲＯＭ（Read Only Memory）やＲＡＭ（Random Access Memory）、ＨＤＤ、光ディスク、メモリカードなどの記憶装置、タッチパネルや操作ボタンなどの入力装置、スピーカなどの音声出力装置等を備えたハードウェア構成となっている。 Note that the audio processing apparatuses 1 and 1001 of the above embodiment include a control device such as a CPU (Central Processing Unit), a storage device such as a ROM (Read Only Memory) and a RAM (Random Access Memory), an HDD, an optical disk, and a memory card. The hardware configuration includes an input device such as a touch panel and operation buttons, an audio output device such as a speaker, and the like.

また、上記実施の形態の音声処理装置１、１００１で実行される音声処理プログラムは、インストール可能な形式又は実行可能な形式のファイルでＣＤ−ＲＯＭ、フレキシブルディスク（FD）、ＣＤ−Ｒ、ＤＶＤ（Digital Versatile Disk）等のコンピュータで読み取り可能な記録媒体に記録されて提供される。 The speech processing program executed by the speech processing apparatuses 1 and 1001 of the above embodiment is provided as a file in an installable or executable format, recorded on a computer-readable recording medium such as a CD-ROM, a flexible disk (FD), a CD-R, or a DVD (Digital Versatile Disk).

また、上記実施の形態の音声処理装置１、１００１で実行される音声処理プログラムを、インターネット等のネットワークに接続されたコンピュータ上に格納し、ネットワーク経由でダウンロードさせることにより提供するように構成してもよい。また、上記実施の形態の音声処理装置１、１００１で実行される音声処理プログラムをインターネット等のネットワーク経由で提供または配布するように構成してもよい。 In addition, the voice processing program executed by the voice processing apparatus 1 or 1001 of the above embodiment is stored on a computer connected to a network such as the Internet, and is provided by being downloaded via the network. Also good. Further, the voice processing program executed by the voice processing apparatuses 1 and 1001 according to the above-described embodiment may be provided or distributed via a network such as the Internet.

また、上記実施の形態の音声処理装置１、１００１で実行される音声処理プログラムを、ＲＯＭ等に予め組み込んで提供するように構成してもよい。 Further, the voice processing program executed by the voice processing apparatuses 1 and 1001 of the above-described embodiments may be provided by being incorporated in advance in a ROM or the like.

また、上記実施の形態の音声処理装置１、１００１で実行される音声処理プログラムは、上述した各部（音韻・韻律入力受付部、取得部、素片融合部、声道フィルタ成分融合部、音源成分融合部等）を含むモジュール構成となっている。そして、実際のハードウェアとしてはＣＰＵ（プロセッサ）が上記記憶媒体から翻訳プログラムを読み出して実行することにより上記各部が主記憶装置上にロードされ、音韻・韻律入力受付部、取得部、素片融合部、声道フィルタ成分融合部、音源成分融合部等が主記憶装置上に生成されるようになっている。 The speech processing program executed by the speech processing apparatuses 1 and 1001 of the above embodiment has a module configuration including the units described above (the phoneme/prosody input receiving unit, acquisition unit, unit fusion unit, vocal tract filter component fusion unit, sound source component fusion unit, and so on). As actual hardware, the CPU (processor) reads the translation program from the storage medium and executes it, whereby the above units are loaded onto the main storage device, and the phoneme/prosody input receiving unit, acquisition unit, unit fusion unit, vocal tract filter component fusion unit, sound source component fusion unit, and so on are generated on the main storage device.

DESCRIPTION OF SYMBOLS
１、１００１ 音声処理装置 1, 1001 Speech processing apparatus
４１ 音韻・韻律入力受付部 41 Phoneme/prosody input receiving unit
４３、１０４３ 取得部 43, 1043 Acquisition unit
４４ 素片融合部 44 Unit fusion unit
４５ 声道フィルタ成分融合部 45 Vocal tract filter component fusion unit
４６ 音源成分融合部 46 Sound source component fusion unit

Claims

1. A speech processing apparatus comprising:
a phoneme/prosody input receiving unit that receives input of a plurality of segments obtained by dividing a phoneme sequence corresponding to target speech into synthesis units, and of prosody information corresponding to each of the plurality of segments;
an acquisition unit that acquires, for each of the plurality of segments, a plurality of speech units associated with the segment and with the prosody information corresponding to the segment;
a vocal tract filter component fusion unit that fuses, for each segment, the vocal tract filter components of the acquired plurality of speech units;
a sound source component fusion unit that expands or contracts the sound source components of the periodic components of the acquired plurality of speech units based on the fundamental frequency or the shape of the sound source component waveform, and fuses them for each segment; and
a unit fusion unit that fuses, for each segment, the plurality of speech units acquired by the acquisition unit by filtering the fused sound source component fused by the sound source component fusion unit using a vocal tract filter characterized by the fused vocal tract filter component fused by the vocal tract filter component fusion unit.

2. The speech processing apparatus according to claim 1, wherein the sound source component fusion unit aligns the sound source components of the plurality of speech units acquired by the acquisition unit in the time direction so that the error in the positions of feature points of the sound source components is equal to or less than a threshold value, and fuses them for each segment.

3. The speech processing apparatus according to claim 1 or 2, wherein the sound source component fusion unit removes, from the sound source components of the plurality of speech units acquired by the acquisition unit, any sound source component that meets a predetermined removal condition, and fuses the remaining sound source components for each segment.

4. The speech processing apparatus according to claim 1, wherein the sound source component fusion unit averages parameters obtained by approximating the sound source components of the plurality of speech units acquired by the acquisition unit with a model of the vocal cord sound source waveform, and fuses the sound source components for each segment using the averaged parameters.

5. The speech processing apparatus according to claim 1, wherein the acquisition unit acquires, as the plurality of speech units, a plurality of speech units for sound source component fusion, which are used for fusing the sound source components, and a plurality of speech units for vocal tract filter component fusion, which differ from the speech units for sound source component fusion and are used for fusing the vocal tract filter components.

6. The speech processing apparatus according to claim 5, wherein the number of acquired speech elements for sound source component fusion differs from the number of acquired speech elements for vocal tract filter component fusion.

7. The speech processing apparatus according to claim 1, further comprising:
a generation unit that generates a speech waveform by connecting the fused speech units fused for each segment by the unit fusion unit; and
an output unit that outputs the speech waveform.

8. A speech processing method comprising:
an input receiving step in which a phoneme/prosody input receiving unit receives input of a plurality of segments obtained by dividing a phoneme sequence corresponding to target speech into synthesis units, and of prosody information corresponding to each of the plurality of segments;
an acquisition step in which an acquisition unit acquires, for each of the plurality of segments, a plurality of speech units associated with the segment and with the prosody information corresponding to the segment;
a vocal tract filter component fusion step in which a vocal tract filter component fusion unit fuses, for each segment, the vocal tract filter components of the acquired plurality of speech units;
a sound source component fusion step in which a sound source component fusion unit expands or contracts the sound source components of the periodic components of the acquired plurality of speech units based on the fundamental frequency or the shape of the sound source component waveform, and fuses them for each segment; and
a unit fusion step in which a unit fusion unit fuses, for each segment, the plurality of speech units acquired in the acquisition step by filtering the fused sound source component fused in the sound source component fusion step using a vocal tract filter characterized by the fused vocal tract filter component fused in the vocal tract filter component fusion step.

9. A speech processing program for causing a computer to execute:
an input receiving step in which a phoneme/prosody input receiving unit receives input of a plurality of segments obtained by dividing a phoneme sequence corresponding to target speech into synthesis units, and of prosody information corresponding to each of the plurality of segments;
an acquisition step in which an acquisition unit acquires, for each of the plurality of segments, a plurality of speech units associated with the segment and with the prosody information corresponding to the segment;
a vocal tract filter component fusion step in which a vocal tract filter component fusion unit fuses, for each segment, the vocal tract filter components of the acquired plurality of speech units;
a sound source component fusion step in which a sound source component fusion unit expands or contracts the sound source components of the periodic components of the acquired plurality of speech units based on the fundamental frequency or the shape of the sound source component waveform, and fuses them for each segment; and
a unit fusion step in which a unit fusion unit fuses, for each segment, the plurality of speech units acquired in the acquisition step by filtering the fused sound source component fused in the sound source component fusion step using a vocal tract filter characterized by the fused vocal tract filter component fused in the vocal tract filter component fusion step.