JP2008292587A

JP2008292587A - Rhythm creating device, rhythm creating method and rhythm creating program

Info

Publication number: JP2008292587A
Application number: JP2007135847A
Authority: JP
Inventors: Nobuyuki Katae; 伸之片江; Kentaro Murase; 健太郎村瀬
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2007-05-22
Filing date: 2007-05-22
Publication date: 2008-12-04
Anticipated expiration: 2027-05-22
Also published as: JP5029884B2

Abstract

PROBLEM TO BE SOLVED: To provide a rhythm creating device, a rhythm creating method, and a rhythm creating program capable of creating a corrected rhythm pattern by correcting extraction errors in a voice rhythm pattern extracted from a human utterance, without spoiling the naturalness and expressiveness of the human utterance and without spending time and effort.

SOLUTION: The rhythm creating device 2 comprises a pitch pattern extraction section 26b that extracts, from voice data, a voice pitch pattern showing the rhythm of a human voice, and a correction rhythm creation section 27 that creates a corrected pitch pattern based on the portions of the voice pitch pattern extracted with high reliability by the pitch pattern extraction section 26b and on a regular pitch pattern created by a pitch pattern creation section 24b, in place of the portions of the voice pitch pattern extracted with low reliability.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a prosody generation device, a prosody generation method, and a prosody generation program that accept an arbitrary text and a human voice reading the text aloud, and generate a prosody pattern based on the accepted text and voice.

In recent years, speech synthesis technology that converts text into speech and outputs it has been used in a variety of systems and devices. Examples include IVR (Interactive Voice Response) systems, in-vehicle information terminals, operating guidance and e-mail reading on mobile phones, and support systems for people with visual or speech impairments. At present, it is difficult for such speech synthesis technology to generate synthesized speech that is as natural and expressive as human speech.

Specifically, the prosody of synthesized speech is generally determined by setting accents, intonation, pauses, speech rate, and so on, based on linguistic analysis such as morphological analysis, which analyzes the readings and parts of speech of the words in the text, and analysis of phrases (bunsetsu) and dependencies. With current processing technology, however, it is difficult to analyze a sentence as accurately as a human does, taking into account its meaning and the surrounding context, and the analysis results may contain errors. As a result, compared with human speech, synthesized speech generated by speech synthesis technology may contain portions where the prosody, which determines the manner of speaking such as voice pitch, intonation, and rhythm, is unnatural.

Therefore, as a method of improving the prosodic quality of synthesized speech when the text to be synthesized is determined in advance, it is known to extract a speech prosody pattern from a human utterance and to generate synthesized speech using the extracted speech prosody pattern as it is (see, for example, Patent Documents 1 to 4). This method requires the human utterance to be recorded and its speech prosody pattern to be extracted beforehand, but because the synthesized speech is generated using a speech prosody pattern extracted from a human utterance, the result can be as natural and expressive as human speech.
JP-A-10-153998; JP-A-9-292897; JP-A-11-143483; JP-A-7-140996

However, with the conventional method described above, when the speech prosody pattern is extracted from the human utterance with low accuracy, that is, when extraction errors occur in the speech prosody pattern, the resulting synthesized speech has unnatural prosody.

Specifically, extracting a speech prosody pattern from a human utterance requires techniques such as phoneme labeling, which detects the start and end points of each phoneme in the utterance, and pitch extraction, which detects the pitch at each point in time in the utterance. Various excellent methods have been developed for these techniques, but because human utterances are highly diverse and irregular, it is impossible to extract speech prosody patterns with 100% accuracy. The user therefore has to correct extraction errors in the speech prosody pattern using a GUI device or the like. This work requires specialized knowledge of speech and takes considerable time and effort.

The present invention has been made in view of the above problems. Its object is to provide a prosody generation device, a prosody generation method, and a prosody generation program capable of generating a modified prosody pattern by correcting extraction errors in a speech prosody pattern extracted from a human utterance, without impairing the naturalness and expressiveness of the human utterance and without requiring time and effort.

To achieve the above object, a prosody generation device according to the present invention comprises: a text input unit to which an arbitrary text is input; a language processing unit that generates phonetic character string data indicating the reading of the text by performing linguistic analysis on the text; a regular prosody generation unit that generates a regular prosody pattern indicating the prosody of the text based on the phonetic character string data and prosody generation rules; a speech input unit that converts a human voice reading the text aloud into speech data; a speech prosody extraction unit that extracts, from the speech data, a speech prosody pattern indicating the prosody of the human voice; a reliability determination unit that acquires the reliability of the extraction performed when the speech prosody extraction unit extracts the speech prosody pattern from the speech data, determines that patterns within the speech prosody pattern whose reliability is equal to or greater than a threshold are patterns extracted with high reliability by the speech prosody extraction unit, and determines that patterns whose reliability is less than the threshold are patterns extracted with low reliability; and a modified prosody generation unit that generates a modified prosody pattern based on the high-reliability patterns within the speech prosody pattern and the regular prosody pattern, in place of the low-reliability patterns within the speech prosody pattern. The regular prosody pattern, the speech prosody pattern, and the modified prosody pattern are, for example, pitch patterns representing changes in voice pitch.
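The reliability determination described above can be sketched as follows. This is a minimal illustration, assuming that extraction yields one confidence score per pitch frame in the range 0 to 1; the function names, the per-frame granularity, and the threshold value are assumptions made here, not details given in the text.

```python
from typing import List, Tuple


def judge_reliability(confidences: List[float], threshold: float) -> List[bool]:
    """Mark each frame True when its confidence is at or above the threshold."""
    return [c >= threshold for c in confidences]


def low_reliability_spans(reliable: List[bool]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges of consecutive unreliable frames."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, ok in enumerate(reliable):
        if not ok and start is None:
            start = i                      # an unreliable run begins
        elif ok and start is not None:
            spans.append((start, i))       # the run ended at the previous frame
            start = None
    if start is not None:
        spans.append((start, len(reliable)))
    return spans
```

The spans returned by `low_reliability_spans` are the regions that the modified prosody generation unit would fill from the regular prosody pattern instead of from the extracted one.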

According to the prosody generation device of the present invention, the modified prosody pattern generated by the modified prosody generation unit is generated based on the patterns extracted with high reliability by the speech prosody extraction unit and on the regular prosody pattern, in place of the patterns extracted with low reliability. In other words, the modified prosody pattern is generated from the high-reliability patterns and an appropriate regular prosody pattern, without using the low-reliability patterns. This makes it possible to generate a modified prosody pattern by correcting extraction errors in the speech prosody pattern extracted from a human utterance, without impairing the naturalness and expressiveness of the human utterance and without requiring time and effort.

In the prosody generation device according to the present invention, the modified prosody generation unit preferably includes a prosody complementing unit that deforms the regular prosody pattern so as to approximate the patterns, within the speech prosody pattern, that were extracted with high reliability by the speech prosody extraction unit, and generates the modified prosody pattern by connecting the deformed regular prosody pattern to those high-reliability patterns.

With this configuration, the modified prosody pattern generated by the prosody complementing unit is obtained by deforming an appropriate regular prosody pattern so as to approximate the patterns extracted with high reliability by the speech prosody extraction unit, and connecting the deformed regular prosody pattern to those high-reliability patterns. This makes it possible to generate a modified prosody pattern by correcting extraction errors in the speech prosody pattern extracted from a human utterance, without impairing the naturalness and expressiveness of the human utterance and without requiring time and effort.

In the prosody generation device according to the present invention, the modified prosody generation unit preferably includes a prosody modification unit that deforms the regular prosody pattern so as to approximate the patterns, within the speech prosody pattern, that were extracted with high reliability by the speech prosody extraction unit, and generates the modified prosody pattern by using the deformed regular prosody pattern, without using the high-reliability patterns themselves.

With this configuration, the modified prosody pattern generated by the prosody modification unit is obtained by deforming an appropriate regular prosody pattern so as to approximate the patterns extracted with high reliability by the speech prosody extraction unit, and using the deformed regular prosody pattern without using the high-reliability patterns themselves. This makes it possible to generate a modified prosody pattern by correcting extraction errors in the speech prosody pattern extracted from a human utterance, without impairing the naturalness and expressiveness of the human utterance and without requiring time and effort.
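The two preferred variants above (the prosody complementing unit and the prosody modification unit) can be sketched as follows. The text at this point does not specify how the regular prosody pattern is deformed; this sketch assumes a linear transform `a * rule + b` fitted by least squares on the reliable frames only. All function names and the per-frame representation are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_linear(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of ys ~ a * xs + b (the assumed deformation model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        return 1.0, mean_y - mean_x        # flat input: fall back to a pure shift
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def deform_rule_pattern(voice: Sequence[float], rule: Sequence[float],
                        reliable: Sequence[bool]) -> List[float]:
    """Deform the rule pattern so that it approximates the reliable voice frames."""
    xs = [r for r, ok in zip(rule, reliable) if ok]
    ys = [v for v, ok in zip(voice, reliable) if ok]
    if not xs:
        return list(rule)                  # nothing reliable to fit against
    a, b = fit_linear(xs, ys)
    return [a * r + b for r in rule]


def complement(voice: Sequence[float], rule: Sequence[float],
               reliable: Sequence[bool]) -> List[float]:
    """Complementing variant: keep reliable voice frames, fill the rest."""
    deformed = deform_rule_pattern(voice, rule, reliable)
    return [v if ok else d for v, d, ok in zip(voice, deformed, reliable)]


def modify(voice: Sequence[float], rule: Sequence[float],
           reliable: Sequence[bool]) -> List[float]:
    """Modification variant: use the deformed rule pattern for every frame."""
    return deform_rule_pattern(voice, rule, reliable)
```

In both variants the unreliable voice frames never influence the result: they are excluded from the fit and are replaced in the output.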

To achieve the above object, a prosody editing system according to the present invention comprises the prosody generation device described above and a GUI device that allows editing of at least one of the phonetic character string data and the modified prosody pattern generated by the prosody generation device.

According to the prosody editing system of the present invention, the GUI device allows editing of at least one of the phonetic character string data and the modified prosody pattern generated by the prosody generation device, so the user can make fine adjustments to at least one of them.

To achieve the above object, a speech synthesis system according to the present invention comprises the prosody generation device described above and a speech synthesis device that generates and outputs synthesized speech based on the modified prosody pattern generated by the prosody generation device.

According to the speech synthesis system of the present invention, the speech synthesis device generates and outputs synthesized speech based on the modified prosody pattern generated by the prosody generation device, so the output synthesized speech has the naturalness and expressiveness of human speech.

To achieve the above object, a speech synthesis system according to the present invention comprises: the prosody generation device described above; a GUI device that allows editing of at least one of the phonetic character string data and the modified prosody pattern generated by the prosody generation device; and a speech synthesis device that generates and outputs synthesized speech based on at least one of the modified prosody pattern generated by the prosody generation device and the modified prosody pattern edited with the GUI device.

According to the speech synthesis system of the present invention, the speech synthesis device generates and outputs synthesized speech based on at least one of the modified prosody pattern generated by the prosody generation device and the modified prosody pattern edited with the GUI device, so the output synthesized speech has the naturalness and expressiveness of human speech.

To achieve the above object, a prosody generation method according to the present invention includes: a text input step in which an arbitrary text is input to a text input unit of a computer; a language processing step in which a language processing unit of the computer generates phonetic character string data indicating the reading of the text by performing linguistic analysis on the text; a regular prosody generation step in which a regular prosody generation unit of the computer generates a regular prosody pattern indicating the prosody of the text based on the phonetic character string data and statistical data on prosody; a speech input step in which a speech input unit of the computer converts a human voice reading the text aloud into speech data; a speech prosody extraction step in which a speech prosody extraction unit of the computer extracts, from the speech data, a speech prosody pattern indicating the prosody of the human voice; a reliability determination step in which a reliability determination unit of the computer acquires the reliability of the extraction performed when the speech prosody pattern was extracted from the speech data in the speech prosody extraction step, determines that patterns within the speech prosody pattern whose reliability is equal to or greater than a threshold were extracted with high reliability in the speech prosody extraction step, and determines that patterns whose reliability is less than the threshold were extracted with low reliability; and a modified prosody generation step in which a modified prosody generation unit of the computer generates a modified prosody pattern based on the high-reliability patterns within the speech prosody pattern and the regular prosody pattern, in place of the low-reliability patterns within the speech prosody pattern.

To achieve the above object, a prosody generation program according to the present invention causes a computer to execute: text input processing in which an arbitrary text is input; language processing that generates phonetic character string data indicating the reading of the text by performing linguistic analysis on the text; regular prosody generation processing that generates a regular prosody pattern indicating the prosody of the text based on the phonetic character string data and statistical data on prosody; speech input processing that converts a human voice reading the text aloud into speech data; speech prosody extraction processing that extracts, from the speech data, a speech prosody pattern indicating the prosody of the human voice; reliability determination processing that acquires the reliability of the extraction performed when the speech prosody pattern was extracted from the speech data in the speech prosody extraction processing, determines that patterns within the speech prosody pattern whose reliability is equal to or greater than a threshold were extracted with high reliability, and determines that patterns whose reliability is less than the threshold were extracted with low reliability; and modified prosody generation processing that generates a modified prosody pattern based on the high-reliability patterns within the speech prosody pattern and the regular prosody pattern, in place of the low-reliability patterns within the speech prosody pattern.

Note that the prosody generation method and the prosody generation program according to the present invention provide the same effects as the prosody generation device described above.

As described above, the prosody generation device, the prosody generation method, and the prosody generation program of the present invention have the effect of making it possible to generate a modified prosody pattern by correcting extraction errors in a speech prosody pattern extracted from a human utterance, without impairing the naturalness and expressiveness of the human utterance and without requiring time and effort.

Hereinafter, more specific embodiments of the present invention will be described in detail with reference to the drawings.

[Embodiment 1]
FIG. 1 is a block diagram showing a schematic configuration of a speech synthesis system 1 according to the present embodiment. The speech synthesis system 1 includes a prosody generation device 2 and a speech synthesis device 3, which are connected to each other by wire or wirelessly. The prosody generation device 2 accepts an arbitrary text and a human voice reading the text aloud, and generates a modified prosody pattern based on the accepted text and voice. The speech synthesis device 3 receives the modified prosody pattern generated by the prosody generation device 2, and generates and outputs synthesized speech based on it. The prosody generation device 2 and the speech synthesis device 3 are each implemented by, for example, a general-purpose computer such as a personal computer or a server machine. They may also be implemented by a computer embedded in an electronic device such as an in-vehicle information terminal, a mobile phone, or a home appliance. The prosody generation device 2 and the speech synthesis device 3 may reside in the same hardware or in different hardware.

(Configuration of the prosody generation device)
The prosody generation device 2 includes a text input unit 21, a word dictionary 22, a language processing unit 23, a regular prosody generation unit 24, a speech input unit 25, a speech prosody extraction unit 26, and a modified prosody generation unit 27.

An arbitrary text is input to the text input unit 21. In the present embodiment, it is assumed that the text 「音声ガイダンスに従ってプッシュボタンを押してください。」 ("Please press the push buttons according to the voice guidance.") is input to the text input unit 21. The text input unit 21 may accept text entered by the user via an input device such as a keyboard or a mouse, or may accept text by reading data recorded in a memory or the like of the computer. The text input unit 21 outputs the input text to the language processing unit 23.

The word dictionary 22 stores the notation, reading, part of speech, and accent information of a plurality of words. The accent information is, for example, data indicating an accent type. These entries are stored in the word dictionary 22 when, for example, the prosody generation device 2 reads a recording medium on which word data is recorded.

The language processing unit 23 performs morphological analysis on the text output from the text input unit 21 using the word dictionary 22. Through this analysis, the text is divided into a plurality of words. FIG. 2 is a conceptual diagram showing the result of the morphological analysis performed on the text by the language processing unit 23 according to the present embodiment. As shown in FIG. 2, the language processing unit 23 uses the word dictionary 22 to generate a part of speech and a reading for each of the divided words. The parts of speech include common nouns, continuative forms of verbs, adjectives, adjectival verbs, case particles, conjunctive particles, and so on. Common nouns, continuative forms of verbs, adjectives, adjectival verbs, and the like are classified as independent words, while case particles, conjunctive particles, and the like are classified as ancillary words. The reading indicates how the word is read and includes an accent nucleus, which is the position where the accent shifts from "high" to "low". In the present embodiment, the accent nucleus is represented by the symbol 「’」, as in 「オ’ンセー」. Morphological analysis methods include, for example, the Viterbi algorithm and the longest match method, but the method used in the present embodiment is not limited to any specific one.
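As an illustration of dictionary-based word segmentation, the longest match method mentioned above can be sketched as follows. This is only a toy sketch: the dictionary holds just the three entries whose readings appear in the description, the data layout and function name are assumptions, and a real analyzer would also resolve ambiguities that greedy matching cannot.

```python
from typing import Dict, List, Optional, Tuple

Entry = Tuple[str, str]  # (part of speech, reading with accent nucleus mark)

# Toy stand-in for the word dictionary 22 (layout is an assumption).
WORD_DICTIONARY: Dict[str, Entry] = {
    "音声": ("common noun", "オ’ンセー"),
    "ガイダンス": ("common noun", "ガ’イダンス"),
    "に": ("case particle", "ニ"),
}


def longest_match(text: str,
                  dictionary: Dict[str, Entry]) -> List[Tuple[str, Optional[Entry]]]:
    """Greedy left-to-right segmentation that always takes the longest entry."""
    max_len = max((len(word) for word in dictionary), default=1)
    result: List[Tuple[str, Optional[Entry]]] = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                result.append((candidate, dictionary[candidate]))
                i += length
                break
        else:
            # No entry starts here: emit the character as an unknown word.
            result.append((text[i], None))
            i += 1
    return result
```

Applied to 「音声ガイダンスに」, this yields the three words 「音声」, 「ガイダンス」, and 「に」 together with their dictionary entries.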

The language processing unit 23 also generates a plurality of phrases (bunsetsu) and their readings based on the result of the morphological analysis performed on the text output from the text input unit 21. FIG. 3 is a conceptual diagram showing the phrases generated by the language processing unit 23 according to the present embodiment and their readings. As shown in FIG. 3, the language processing unit 23 generates four phrases: 「音声ガイダンスに」, 「従って」, 「プッシュボタンを」, and 「押してください。」. A phrase consists of an independent word followed by ancillary words. For example, in the phrase 「音声ガイダンスに」, the compound noun 「音声ガイダンス」, formed from the common nouns 「音声」 and 「ガイダンス」, is treated as a single independent word, and the case particle (ancillary word) 「に」 is attached after it. The language processing unit 23 also generates the reading of each phrase by setting a new accent nucleus as appropriate according to an arbitrary accent combination rule. For example, the readings 「オ’ンセー」, 「ガ’イダンス」, and 「ニ」 of the words 「音声」, 「ガイダンス」, and 「に」 are combined into the phrase reading 「オンセーガ’イダンスニ」.
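The phrase formation described above (an independent word followed by ancillary words, with consecutive nouns merged into a compound) can be sketched as follows. Because FIG. 2 is not reproduced here, the word segmentation and part-of-speech tags in the usage example are assumptions, as are the tag names and the simple grouping rule; the accent combination step is not covered.

```python
from typing import List, Tuple

# Tags treated as independent words in this sketch (assumed tag names).
INDEPENDENT = {"common noun", "verb continuative form", "adjective", "adjectival verb"}


def group_into_phrases(morphemes: List[Tuple[str, str]]) -> List[str]:
    """Start a new phrase when an independent word follows an ancillary word."""
    phrases: List[str] = []
    current = ""
    seen_ancillary = False
    for surface, pos in morphemes:
        if pos in INDEPENDENT and seen_ancillary:
            phrases.append(current)        # close the phrase before this word
            current = ""
            seen_ancillary = False
        if pos not in INDEPENDENT:
            seen_ancillary = True
        current += surface
    if current:
        phrases.append(current)
    return phrases


# Assumed segmentation and tags for the example sentence.
EXAMPLE = [
    ("音声", "common noun"), ("ガイダンス", "common noun"), ("に", "case particle"),
    ("従っ", "verb continuative form"), ("て", "conjunctive particle"),
    ("プッシュ", "common noun"), ("ボタン", "common noun"), ("を", "case particle"),
    ("押し", "verb continuative form"), ("て", "conjunctive particle"),
    ("ください", "auxiliary verb"), ("。", "symbol"),
]
```

With these assumed tags, `group_into_phrases(EXAMPLE)` reproduces the four phrases listed in the description. The rule is deliberately simple: it merges any run of independent words, which suits compound nouns but would be too coarse for general text.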

Furthermore, the language processing unit 23 analyzes the dependency (modification) relationships among the generated phrases according to an arbitrary rule. In the present embodiment, the language processing unit 23 identifies the dependency relationships 「音声ガイダンスに→従って」, 「従って→押してください。」, and 「プッシュボタンを→押してください。」.

The language processing unit 23 generates phonetic character string data based on the results of the linguistic analysis described above, such as the morphological analysis and the dependency analysis. The phonetic character string data indicates the reading of the text. In the present embodiment, the language processing unit 23 generates phonetic character string data representing 「オンセーガ’イダンスニ＿シタガッテ，プッシュボ’タンオ＿オシテクダサ’イ．」. Here, 「＿」 is a symbol representing an accent phrase boundary. An accent phrase is a unit that constitutes an accent and roughly corresponds to the phrase (bunsetsu) described above. 「，」 is a symbol representing both an accent phrase boundary and a phrase boundary. A phrase in this sense is a unit obtained when a sentence or clause is analyzed syntactically, and consists of a plurality of words. In the present embodiment, 「オンセーガ’イダンスニ＿シタガッテ」 and 「プッシュボ’タンオ＿オシテクダサ’イ．」 are each one phrase. 「’」 is a symbol representing an accent nucleus. This format of the phonetic character string data is merely an example, and the representation is not limited to it. The language processing unit 23 outputs the generated phonetic character string data to the regular prosody generation unit 24 and the speech prosody extraction unit 26.
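The example format above can be read mechanically. The following sketch splits a phonetic character string into phrases and accent phrases and records the position of the accent nucleus mark. The function name and the returned structure are assumptions; the position is counted in characters before the mark, not in morae.

```python
from typing import List, Optional, Tuple

ACCENT_NUCLEUS = "\u2019"     # the mark written as a right single quotation mark
ACCENT_PHRASE_SEP = "\uff3f"  # fullwidth low line: accent phrase boundary
PHRASE_SEP = "\uff0c"         # fullwidth comma: phrase and accent phrase boundary
SENTENCE_END = "\uff0e"       # fullwidth full stop at the end of the string

AccentPhrase = Tuple[str, Optional[int]]  # (reading, nucleus position or None)


def parse_phonetic_string(text: str) -> List[List[AccentPhrase]]:
    """Return a list of phrases, each a list of (reading, nucleus position)."""
    body = text.rstrip(SENTENCE_END)
    phrases: List[List[AccentPhrase]] = []
    for phrase in body.split(PHRASE_SEP):
        accent_phrases: List[AccentPhrase] = []
        for chunk in phrase.split(ACCENT_PHRASE_SEP):
            pos = chunk.find(ACCENT_NUCLEUS)
            reading = chunk.replace(ACCENT_NUCLEUS, "")
            accent_phrases.append((reading, pos if pos >= 0 else None))
        phrases.append(accent_phrases)
    return phrases
```

For the example string, this produces two phrases with two accent phrases each; the accent phrase 「シタガッテ」 has no nucleus mark, so its position is `None`.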

The regular prosody generation unit 24 converts the phonetic character string data output from the language processing unit 23 into a phoneme symbol string. In this embodiment, the regular prosody generation unit 24 converts the phonetic character string data 「オンセーガ’イダンスニ＿シタガッテ，プッシュボ’タンオ＿オシテクダサ’イ．」 into the phoneme symbol string "oNse-gaidaNsunishitagaqteQpuqshbotaNooshitekudasaiQ". Here, "Q" is a symbol representing a pause. "N" is a symbol representing 「ン」 (the moraic nasal) and is written in upper case to distinguish it from "ni", the symbol representing 「ニ」. The regular prosody generation unit 24 generates a regular prosody pattern based on the converted phoneme symbol string. The regular prosody pattern includes a phoneme time length pattern, a regular pitch pattern, and a power pattern. For this purpose, the regular prosody generation unit 24 includes a phoneme time length generation unit 24a, a pitch pattern generation unit 24b, and a power generation unit 24c.

The phoneme time length generation unit 24a has a phoneme time length table in which data indicating statistical phoneme time lengths in human speech is recorded. The phoneme time length generation unit 24a extracts data from the phoneme time length table for each phoneme in the phoneme symbol string and concatenates the extracted data to generate a phoneme time length pattern. The phoneme time length table records, in order, for example, data indicating the phoneme time length of the phoneme "a", data indicating the phoneme time length of the phoneme "i", data indicating the phoneme time length of the phoneme "u", and so on.

The pitch pattern generation unit 24b generates a regular pitch pattern by superimposing accent phrase components, generated from the accent phrases, on phrase components generated from the phrases. FIG. 4 is a conceptual diagram showing a state in which accent phrase components are superimposed on phrase components. As shown in FIG. 4, accent phrase components A₁ and A₂ are superimposed on phrase component F₁, and accent phrase components A₃ and A₄ are superimposed on phrase component F₂. Here, the phrase components F₁ and F₂ are represented by a triangular model that slopes downward to the right. In general, a human voice is high at the start of an utterance, and its pitch then gradually falls, for example because of a decrease in subglottal pressure. That is, the phrase components F₁ and F₂ are voice-onset components expressing the characteristic that pitch falls over time. The downward-sloping triangular model is data relating to a statistical regular pitch pattern and is recorded in advance in a memory (not shown) of the pitch pattern generation unit 24b.

The accent phrase components A₁ to A₄ are represented by a trapezoidal model. Consider, for example, the accent phrase component A₁. The phoneme symbol string "oNse-gaidaNsuni" corresponding to the accent phrase component A₁ corresponds to the phonetic character string data 「オンセーガ’イダンスニ」. In general, in human speech the part before the accent nucleus, 「オンセーガ」, is pronounced at a higher pitch, and the part after the accent nucleus, 「イダンスニ」, is pronounced at a lower pitch. That is, the accent phrase component A₁ is a component expressing the characteristic that the phoneme symbol string "oNse-ga" is high. Similarly, the accent phrase component A₂ expresses the characteristic that the phoneme symbol string "shitagaqte" is high, the accent phrase component A₃ expresses the characteristic that the phoneme symbol string "puqshbo" is high, and the accent phrase component A₄ expresses the characteristic that the phoneme symbol string "oshitekudasa" is high. The trapezoidal model is data relating to a statistical regular pitch pattern and is recorded in advance in a memory (not shown) of the pitch pattern generation unit 24b.

The pitch pattern generation unit 24b takes as the regular pitch pattern the outline obtained when the accent phrase components are superimposed on the phrase components. FIG. 5 is a conceptual diagram showing an example of a regular pitch pattern generated by the pitch pattern generation unit 24b according to this embodiment. As shown in FIG. 5, the regular pitch pattern is the outline obtained when the accent phrase components A₁ and A₂ are superimposed on the phrase component F₁ and the accent phrase components A₃ and A₄ are superimposed on the phrase component F₂.
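The superposition described above can be sketched as follows. This is only an illustration under stated assumptions: pitch is in arbitrary units, the phrase component is a straight declining line, the accent phrase component is a trapezoid defined by frame indices, and all function names, parameters, and numeric values are hypothetical rather than taken from the original.

```python
def phrase_component(n_frames, start, end):
    """Straight line falling from `start` to `end` (the 'triangle' model)."""
    if n_frames == 1:
        return [start]
    step = (end - start) / (n_frames - 1)
    return [start + step * i for i in range(n_frames)]

def accent_component(n_frames, rise_end, fall_start, fall_end, height):
    """Trapezoid: ramp up to `height`, hold, then ramp back down to zero.

    Requires 0 < rise_end <= fall_start < fall_end.
    """
    out = []
    for i in range(n_frames):
        if i < rise_end:
            out.append(height * i / rise_end)
        elif i < fall_start:
            out.append(height)
        elif i < fall_end:
            out.append(height * (fall_end - i) / (fall_end - fall_start))
        else:
            out.append(0.0)
    return out

def regular_pitch(phrase, accents):
    """Add each accent component (frame offset, values) onto the phrase component."""
    pitch = list(phrase)
    for offset, values in accents:
        for i, v in enumerate(values):
            pitch[offset + i] += v
    return pitch

# One phrase F1 carrying two accent phrases A1 and A2 (made-up values).
F1 = phrase_component(20, 1.0, 0.0)
A1 = accent_component(10, 2, 6, 8, 1.0)
A2 = accent_component(10, 2, 8, 10, 0.5)
pattern = regular_pitch(F1, [(0, A1), (10, A2)])
```

The resulting `pattern` is the outline of the summed components, which is the role the regular pitch pattern plays in FIG. 5.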

The power generation unit 24c has a power value table in which a power value specific to each phoneme is recorded. A power value is data relating to statistical power and represents the loudness of the voice. The power generation unit 24c extracts a power value from the power value table for each phoneme in the phoneme symbol string. In general, even for the same phoneme, the power value is larger where the regular pitch pattern is higher and smaller where the regular pitch pattern is lower. The power generation unit 24c therefore generates a power pattern by correcting the power values extracted from the power value table according to the height of the regular pitch pattern.

The regular prosody pattern generated by the above method, which includes the phoneme time length pattern, the regular pitch pattern, and the power pattern, is statistically valid, but because it is an average prosody pattern it is somewhat lacking in expressiveness. The regular prosody generation unit 24 outputs the regular prosody pattern, including the phoneme time length pattern, the regular pitch pattern, and the power pattern, to the modified prosody generation unit 27. The methods of generating the phoneme time length pattern, the regular pitch pattern, and the power pattern are not limited to those described above. Although the above example uses statistical data to generate these patterns, they may also be generated based on heuristically created prosody generation rules.

The voice input unit 25 has a function of receiving a human voice reading aloud the text received by the text input unit 21. For this purpose, the voice input unit 25 is composed of, for example, a microphone. In this embodiment, the voice input unit 25 receives a human voice reading aloud 「音声ガイダンスに従ってプッシュボタンを押してください。」 ("Please press the push button in accordance with the voice guidance."). The voice input unit 25 converts the received human voice into digital voice data that can be processed by a computer, and outputs the converted voice data to the speech prosody extraction unit 26. In addition to analog voice obtained by playing back a human utterance recorded in advance on a recording device, the voice input unit 25 may directly receive digital voice data recorded on a recording medium such as a CD (Compact Disc) or MD (Mini Disc), or digital voice data transmitted over a wired or wireless communication network. The voice input unit 25 may also have a function of decompressing the received voice data when that data is compressed.

Like the regular prosody generation unit 24, the speech prosody extraction unit 26 converts the phonetic character string data output from the language processing unit 23 into a phoneme symbol string. In this embodiment, the speech prosody extraction unit 26 converts the phonetic character string data 「オンセーガ’イダンスニ＿シタガッテ，プッシュボ’タンオ＿オシテクダサ’イ．」 into the phoneme symbol string "oNse-gaidaNsunishitagaqteQpuqshbotaNooshitekudasaiQ". Based on the converted phoneme symbol string, the speech prosody extraction unit 26 extracts a speech prosody pattern from the voice data output from the voice input unit 25. The speech prosody pattern includes a phoneme time length pattern, a voice pitch pattern, and a power pattern. For this purpose, the speech prosody extraction unit 26 includes a phoneme time length extraction unit 26a, a pitch pattern extraction unit 26b, a reliability determination unit 26c, and a power extraction unit 26d.

The phoneme time length extraction unit 26a has a phoneme model that records data statistically modeling which feature values each phoneme tends to take. The phoneme time length extraction unit 26a extracts the modeled data from the phoneme model for each phoneme in the phoneme symbol string. By matching the extracted data against the voice data, it identifies the section of the voice data that is most similar to the extracted data. It then extracts a phoneme time length pattern from the voice data by setting phoneme boundaries at the identified sections. This extraction method is generally called phoneme labeling. The phoneme model is expressed using parameters such as MFCC (Mel Frequency Cepstral Coefficients). It is also common to convert the voice data output from the voice input unit 25 into parameters such as MFCC and then perform the matching using a method such as HMM (Hidden Markov Model) or DP (Dynamic Programming).

The pitch pattern extraction unit 26b extracts a voice pitch pattern from the voice data by using a correlation processing method. A correlation processing method exploits the fact that correlation processing is robust to phase distortion of the waveform. This embodiment describes the case where the autocorrelation function (ACF) is used as an example of a correlation processing method, but the method is not limited to this. For example, other correlation processing methods such as modified correlation, the SIFT algorithm, or the average magnitude difference function (AMDF) may be used instead of the autocorrelation function. Other approaches, such as waveform processing methods or spectrum processing methods, may also be used instead of a correlation processing method.

The autocorrelation function is a function that expresses how similar the voice data is to itself. The autocorrelation function is defined by (Equation 1). In (Equation 1), φ(m) denotes the correlation value, x(n) denotes the time series of the voice data, N denotes the number of samples of voice data cut out and used for analysis, and m = 0, 1, 2, ..., N−1.
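(Equation 1) itself is not reproduced in this text. A commonly used short-time autocorrelation definition that is consistent with the symbols defined above is given here for orientation only; it is an assumption and may differ from the exact form in the original, for example in its normalization.

```latex
\phi(m) = \frac{1}{N} \sum_{n=0}^{N-1-m} x(n)\, x(n+m), \qquad m = 0, 1, 2, \ldots, N-1
```

With this form, φ(0) is the mean power of the analysis frame, and a periodic x(n) with period T samples produces maxima of φ(m) near m = T, 2T, and so on.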

That is, the pitch pattern extraction unit 26b calculates the correlation value φ(m) by applying the time series x(n) of the voice data to (Equation 1). The pitch pattern extraction unit 26b extracts a local maximum (peak value) from the calculated correlation value φ(m) and extracts the voice pitch pattern from the voice data by calculating the reciprocal of the period at which that maximum occurs. At this time, the reliability determination unit 26c obtains the reliability of the extraction performed when the pitch pattern extraction unit 26b extracts the voice pitch pattern from the voice data. In this embodiment, the reliability determination unit 26c uses the correlation value φ(m) calculated by the pitch pattern extraction unit 26b directly as the reliability. The reliability determination unit 26c determines that the parts of the voice pitch pattern whose reliability is equal to or greater than a threshold are patterns for which extraction by the pitch pattern extraction unit 26b is highly reliable, and that the parts whose reliability is less than the threshold are patterns for which extraction by the pitch pattern extraction unit 26b has low reliability.

The extraction of the voice pitch pattern by the pitch pattern extraction unit 26b and the reliability determination by the reliability determination unit 26c are described concretely below with reference to FIGS. 6 and 7. FIG. 6 is a conceptual diagram showing the time series x(n) of voice data for an arbitrary vowel. Applying the time series x(n) shown in FIG. 6 to (Equation 1) yields the correlation value φ(m). FIG. 7 is a conceptual diagram showing the correlation value φ(m) obtained when the time series x(n) shown in FIG. 6 is applied to (Equation 1). As shown in FIG. 7, the correlation value φ(m) has local maxima at time points A, B, and C, and the pitch pattern extraction unit 26b selects the maximum value M at time point C, where the value is largest. The pitch pattern extraction unit 26b extracts the voice pitch pattern from the voice data by calculating the reciprocal of the period T corresponding to the maximum value M at time point C.

Here, the reliability determination unit 26c determines whether the maximum value M at time point C is equal to or greater than the threshold S. If the maximum value M is equal to or greater than the threshold S, the reliability determination unit 26c determines that the reliability of extraction by the pitch pattern extraction unit 26b is high. If the maximum value M is less than the threshold S, it determines that the reliability of extraction by the pitch pattern extraction unit 26b is low. In the example shown in FIG. 6, the maximum value M at time point C is equal to or greater than the threshold S, so the reliability determination unit 26c determines that the reliability of extraction is high. In general, the time series of voice data for the vowels a, i, u, e, o, the moraic nasal N, the semivowels y, w, and the nasals n, m has clear periodicity (see, for example, FIG. 6), so the maximum of the correlation value φ(m) tends to be equal to or greater than the threshold S, and the reliability of extraction by the pitch pattern extraction unit 26b is high. By contrast, the time series of voice data for voiced plosives and fricatives such as b, d, g, j, z has ambiguous periodicity, so the maximum of the correlation value φ(m) tends to be less than the threshold S, and the reliability of extraction by the pitch pattern extraction unit 26b is low. For voiceless plosives and fricatives such as p, t, k, s, sh, h, the geminate consonant q, and the pause Q, the time series has no periodicity, so no maximum is observed and, as a result, no pitch is extracted. Even for vowels, the moraic nasal, semivowels, and nasals, if a voiced plosive or fricative, a pause Q, or the like immediately precedes or follows, the maximum of the correlation value φ(m) tends to be less than the threshold S, and the reliability of extraction by the pitch pattern extraction unit 26b is low. The threshold S is recorded in advance in a memory (not shown) of the reliability determination unit 26c.
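The extraction and reliability check described above can be sketched for a single analysis frame as follows. This is a minimal illustration, not the original implementation: the autocorrelation is normalized by the frame energy so that the peak value lies between 0 and 1 (an assumption made so that a fixed threshold is meaningful), and the function names, the lag search range, and the threshold value are hypothetical.

```python
import math

def autocorr(x):
    """Short-time autocorrelation, normalised so that phi[0] == 1."""
    n = len(x)
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return [0.0] * n
    return [sum(x[i] * x[i + m] for i in range(n - m)) / energy
            for m in range(n)]

def extract_pitch(x, sample_rate, min_lag, max_lag, threshold):
    """Return (pitch_hz or None, reliability, is_reliable) for one frame."""
    phi = autocorr(x)
    # Local maxima of phi inside the lag search range.
    peaks = [m for m in range(max(min_lag, 1), min(max_lag, len(phi) - 2) + 1)
             if phi[m] > phi[m - 1] and phi[m] >= phi[m + 1]]
    if not peaks:
        return None, 0.0, False            # no periodicity: no pitch extracted
    best = max(peaks, key=lambda m: phi[m])  # largest peak, like M in FIG. 7
    return sample_rate / best, phi[best], phi[best] >= threshold

# A 100 Hz sine sampled at 8 kHz stands in for a clearly periodic vowel frame.
rate = 8000
frame = [math.sin(2 * math.pi * 100 * i / rate) for i in range(400)]
pitch, reliability, ok = extract_pitch(frame, rate, 20, 200, 0.5)
silence = extract_pitch([0.0] * 400, rate, 20, 200, 0.5)
```

Running this per frame and keeping the `is_reliable` flag alongside each pitch value gives the solid/dotted distinction that the later correction step relies on.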

FIG. 8 is a conceptual diagram showing an example of a voice pitch pattern extracted by the pitch pattern extraction unit 26b according to this embodiment. In FIG. 8, pitch for which the reliability determination unit 26c determined the extraction reliability to be high is drawn as a solid line, and pitch for which it determined the extraction reliability to be low is drawn as a dotted line. Because the dotted-line parts in FIG. 8 are the parts judged to have low extraction reliability, they are likely to be parts where the pitch pattern extraction unit 26b made an error in extracting the voice pitch pattern. Consequently, if synthesized speech is generated using the voice pitch pattern shown in FIG. 8 as is, the prosody is likely to sound unnatural at the phonemes corresponding to the dotted-line parts.

The power extraction unit 26d extracts a power pattern from the voice data output from the voice input unit 25. The power pattern is calculated by setting a fixed window length of, for example, about 20 msec on the voice data and taking the sum of squares of the voice data within that window.
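The windowed sum of squares can be sketched as below. The use of consecutive non-overlapping windows is an assumption, since the text specifies only the window length; the function name and parameters are hypothetical.

```python
def power_pattern(samples, sample_rate, window_sec=0.020):
    """Sum of squared samples in consecutive, non-overlapping windows."""
    win = max(1, int(round(sample_rate * window_sec)))
    return [sum(v * v for v in samples[i:i + win])
            for i in range(0, len(samples), win)]

# 8 kHz data: 40 ms at amplitude 1.0 followed by 20 ms at amplitude 0.5.
powers = power_pattern([1.0] * 320 + [0.5] * 160, 8000)
# → [160.0, 160.0, 40.0]
```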

The speech prosody extraction unit 26 outputs the speech prosody pattern extracted by the above method, which includes the phoneme time length pattern, the voice pitch pattern, and the power pattern, to the modified prosody generation unit 27. The methods of extracting the phoneme time length pattern, the voice pitch pattern, and the power pattern are not limited to those described above.

In place of the parts of the voice pitch pattern for which extraction by the pitch pattern extraction unit 26b has low reliability, the modified prosody generation unit 27 generates a corrected pitch pattern based on the parts of the voice pitch pattern for which extraction by the pitch pattern extraction unit 26b is highly reliable and on the regular pitch pattern. For this purpose, the modified prosody generation unit 27 includes a prosody complementing unit 27a.

The prosody complementing unit 27a extracts, from the voice pitch pattern output from the speech prosody extraction unit 26, the parts for which extraction by the pitch pattern extraction unit 26b is highly reliable. FIG. 9 is a conceptual diagram showing an example of the parts of the voice pitch pattern shown in FIG. 8 for which extraction by the pitch pattern extraction unit 26b is highly reliable. That is, the pattern shown in FIG. 9 consists of only the solid-line parts of the voice pitch pattern shown in FIG. 8.

The prosody complementing unit 27a also deforms the regular pitch pattern output from the regular prosody generation unit 24 so that it approximates the parts of the voice pitch pattern for which extraction by the pitch pattern extraction unit 26b is highly reliable (see FIG. 9). Let P(n) be the time series within an accent phrase of the pattern shown in FIG. 9, Q(n) the time series within an accent phrase of the regular pitch pattern shown in FIG. 5, and Q′(n) the time series within an accent phrase of the deformed regular pitch pattern. In this embodiment, the prosody complementing unit 27a transforms the time series Q(n) into the time series Q′(n) by using (Equation 2) and (Equation 3). In (Equation 2), P_d denotes the amount of change in the slope of Q(n), T_s denotes the time scaling ratio of Q(n), T_m denotes the time shift of Q(n), F_s denotes the pitch scaling ratio of Q(n), and F_m denotes the pitch shift of Q(n). In (Equation 3), D denotes the error between P(n) and Q′(n). That is, the prosody complementing unit 27a according to this embodiment calculates the values of P_d, T_s, T_m, F_s, and F_m in (Equation 2) that minimize the error D in (Equation 3), and transforms the time series Q(n) into the time series Q′(n) based on the calculated values. The prosody complementing unit 27a does this for each accent phrase. The method of transforming the time series Q(n) into the time series Q′(n) is not limited to this. For example, the prosody complementing unit 27a may perform the processing for each phrase, or may use any known formulas instead of (Equation 2) and (Equation 3).
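(Equation 2) and (Equation 3) are not reproduced in this text, so the following is only a simplified sketch of the idea of fitting Q(n) to P(n). It assumes a reduced deformation Q′(n) = F_s·Q(n) + F_m with no slope change or time warping (P_d, T_s, and T_m are left out), and it assumes D is the sum of squared differences over the reliable frames. Both assumptions, and the function names, are illustrative and are not the original formulas.

```python
def fit_scale_shift(p, q):
    """Least-squares F_s, F_m so that F_s*q[n] + F_m approximates p[n].

    `p` holds voice pitch values, with None where extraction was judged
    unreliable; `q` is the regular pitch pattern on the same frames.
    """
    pairs = [(qv, pv) for pv, qv in zip(p, q) if pv is not None]
    n = len(pairs)
    if n == 0:
        return 1.0, 0.0                    # nothing to fit: leave q unchanged
    mean_q = sum(qv for qv, _ in pairs) / n
    mean_p = sum(pv for _, pv in pairs) / n
    var_q = sum((qv - mean_q) ** 2 for qv, _ in pairs)
    if var_q == 0.0:
        return 1.0, mean_p - mean_q        # flat q: apply a shift only
    cov = sum((qv - mean_q) * (pv - mean_p) for qv, pv in pairs)
    f_s = cov / var_q
    return f_s, mean_p - f_s * mean_q

def deform(q, f_s, f_m):
    """Apply the fitted pitch scaling and shift to every frame of q."""
    return [f_s * v + f_m for v in q]

# One accent phrase: reliable pitch is known on frames 0, 2 and 4 only.
q = [1.0, 2.0, 3.0, 4.0, 5.0]
p = [3.0, None, 7.0, None, 11.0]
f_s, f_m = fit_scale_shift(p, q)
q_deformed = deform(q, f_s, f_m)
```

Because the fit uses only the reliable frames, the deformed pattern also supplies values on the unreliable frames, which is what makes it usable as a replacement there.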

FIG. 10 is a conceptual diagram showing an example of a regular pitch pattern deformed so as to approximate the parts of the voice pitch pattern determined to have high extraction reliability by the pitch pattern extraction unit 26b (see FIG. 9). In FIG. 10, the deformed regular pitch pattern is drawn as a dotted line. The solid-line pattern in FIG. 10 is the pattern shown in FIG. 9.

The prosody complementing unit 27a generates a corrected pitch pattern by connecting the regular pitch pattern deformed as described above with the parts of the voice pitch pattern for which extraction by the pitch pattern extraction unit 26b is highly reliable. That is, the prosody complementing unit 27a uses the solid-line pattern shown in FIG. 10 as is and connects it to the dotted-line pattern. In addition, to make the joints between the solid-line pattern and the dotted-line pattern smooth, the prosody complementing unit 27a performs smoothing according to any known technique. FIG. 11 is a conceptual diagram showing an example of a pitch pattern smoothed by the prosody complementing unit 27a. The circles in FIG. 11 mark the joints between the solid-line pattern and the dotted-line pattern, which are the places that were smoothed. The corrected pitch pattern is generated by this processing. FIG. 12 is a conceptual diagram showing an example of a corrected pitch pattern generated by the prosody complementing unit 27a.
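The connection and smoothing steps can be sketched as follows. The text leaves the smoothing method open ("any known technique"), so the moving average applied around each joint here is just one possible choice, and the function names and the `radius` parameter are hypothetical.

```python
def splice(voice, reliable, deformed):
    """Use the voice pitch where reliable, the deformed regular pitch elsewhere."""
    return [v if ok else d for v, ok, d in zip(voice, reliable, deformed)]

def smooth_joints(pitch, reliable, radius=1):
    """Moving-average smoothing applied only around reliable/unreliable joints."""
    joints = [i for i in range(1, len(pitch)) if reliable[i] != reliable[i - 1]]
    out = list(pitch)
    for j in joints:
        for i in range(max(0, j - radius), min(len(pitch), j + radius)):
            lo, hi = max(0, i - radius), min(len(pitch), i + radius + 1)
            out[i] = sum(pitch[lo:hi]) / (hi - lo)   # average of the unsmoothed input
    return out

voice = [100.0, 100.0, 180.0, 60.0, 100.0, 100.0]    # frames 2 and 3 are extraction errors
reliable = [True, True, False, False, True, True]
deformed = [100.0, 100.0, 110.0, 110.0, 100.0, 100.0]
spliced = splice(voice, reliable, deformed)
corrected = smooth_joints(spliced, reliable)
```

Frames far from a joint are left untouched, so the reliable parts of the human pitch contour are preserved except in the immediate neighborhood of each connection.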

The prosody complementing unit 27a also corrects extraction errors in the phoneme time length pattern and the power pattern output from the speech prosody extraction unit 26. For the phoneme time length pattern, for example, the reliability determination unit 26c first calculates the reliability of the extraction performed when the phoneme time length extraction unit 26a extracts the phoneme time length pattern from the voice data. For example, the reliability determination unit 26c uses as the reliability the similarity calculated by matching the modeled data extracted from the phoneme model against each section of the voice data. The reliability determination unit 26c determines that the parts of the phoneme time length pattern whose reliability is equal to or greater than a threshold are patterns for which extraction by the phoneme time length extraction unit 26a is highly reliable, and that the parts whose reliability is less than the threshold are patterns for which extraction by the phoneme time length extraction unit 26a has low reliability. Then, in place of the parts of the phoneme time length pattern for which extraction by the phoneme time length extraction unit 26a has low reliability, the prosody complementing unit 27a generates a corrected phoneme time length pattern based on the parts of the phoneme time length pattern for which extraction is highly reliable and on the phoneme time length pattern generated by the phoneme time length generation unit 24a. For the power pattern, for example, the prosody complementing unit 27a corrects extraction errors according to any known technique and generates a corrected power pattern.

The prosody complementing unit 27a outputs the corrected prosody pattern generated by the above method, which includes the corrected phoneme time length pattern, the corrected pitch pattern, and the corrected power pattern, to the speech synthesizer 3.

The prosody generation device 2 described above can also be realized by installing a program on an arbitrary computer such as a personal computer. That is, the text input unit 21, the language processing unit 23, the regular prosody generation unit 24, the voice input unit 25, the speech prosody extraction unit 26, and the modified prosody generation unit 27 are embodied by the CPU of the computer operating in accordance with a program that realizes these functions. Accordingly, a program for realizing the functions of the text input unit 21, the language processing unit 23, the regular prosody generation unit 24, the voice input unit 25, the speech prosody extraction unit 26, and the modified prosody generation unit 27, or a recording medium on which that program is recorded, is also an embodiment of the present invention. The word dictionary 22 is embodied by a storage device built into the computer or a storage device accessible from the computer.

（音声合成装置の構成）
音声合成装置３は、波形辞書３１、波形生成部３２、および、合成音声出力部３３を備えている。
(Configuration of speech synthesizer)
The speech synthesizer 3 includes a waveform dictionary 31, a waveform generation unit 32, and a synthesized speech output unit 33.

波形辞書３１は、複数の波形データを格納する。例えば、音声合成装置３が波形データを記録した記録媒体を読み取ることによって、波形辞書３１には、上記の波形データが格納される。 The waveform dictionary 31 stores a plurality of waveform data. For example, the waveform data is stored in the waveform dictionary 31 when the speech synthesizer 3 reads a recording medium on which the waveform data is recorded.

波形生成部３２は、韻律生成装置２から出力された修正韻律パターンに基づいて、波形辞書３１を用いて合成音声の波形を生成する。波形生成部３２は、生成した合成音声の波形を合成音声出力部３３に出力する。 The waveform generation unit 32 generates a waveform of synthesized speech using the waveform dictionary 31 based on the modified prosodic pattern output from the prosody generation device 2. The waveform generation unit 32 outputs the generated synthesized speech waveform to the synthesized speech output unit 33.

合成音声出力部３３は、波形生成部３２から出力された合成音声の波形に基づいて、合成音声を生成する。合成音声出力部３３は、生成した合成音声を音声合成装置３の外部に出力する。すなわち、合成音声出力部３３により出力された合成音声は、韻律生成装置２により生成された修正韻律パターンを用いているので、人間の発声が有する自然性・表現力を備えた合成音声となる。 The synthesized speech output unit 33 generates synthesized speech based on the synthesized speech waveform output from the waveform generating unit 32. The synthesized speech output unit 33 outputs the generated synthesized speech to the outside of the speech synthesizer 3. That is, since the synthesized speech output by the synthesized speech output unit 33 uses the modified prosodic pattern generated by the prosody generating device 2, it becomes a synthesized speech having the naturalness and expressiveness that human speech has.

ところで、上記の音声合成装置３は、パーソナルコンピュータなどの任意のコンピュータにプログラムをインストールすることによっても実現される。すなわち、上記の波形生成部３２および合成音声出力部３３は、コンピュータのＣＰＵがこれらの機能を実現するプログラムに従って動作することによって具現化される。したがって、波形生成部３２および合成音声出力部３３の機能を実現するためのプログラムまたはそれを記録した記録媒体も、本発明の一実施形態である。また、波形辞書３１は、コンピュータの内蔵記憶装置またはこのコンピュータからアクセス可能な記憶装置によって具現化される。 By the way, the speech synthesizer 3 described above can also be realized by installing a program in an arbitrary computer such as a personal computer. In other words, the waveform generation unit 32 and the synthesized speech output unit 33 are realized by the CPU of the computer operating according to a program that realizes these functions. Therefore, a program for realizing the functions of the waveform generation unit 32 and the synthesized speech output unit 33 or a recording medium on which the program is recorded is also an embodiment of the present invention. The waveform dictionary 31 is embodied by a built-in storage device of a computer or a storage device accessible from this computer.

以上、音声合成システム１の構成について説明したが、音声合成システム１の構成は、図１に示す構成に限定されない。例えば、韻律生成装置２におけるテキスト入力部２１の代わりに、音声認識部を備えるようにしてもよい。 The configuration of the speech synthesis system 1 has been described above, but it is not limited to the configuration illustrated in FIG. 1. For example, a speech recognition unit may be provided instead of the text input unit 21 in the prosody generation device 2.

図１３は、本実施形態の変形例に係る音声合成システム１ａの概略構成を示すブロック図である。図１３において、図１と同様の機能を有する構成については、同じ参照符号を付記している。韻律生成装置２は、図１に示すテキスト入力部２１の代わりに、音声認識部２８を備えている。音声認識部２８は、人間の音声を認識する機能を有している。このため、音声認識部２８は、音声入力部２５から出力された音声データを特徴量に変換する。音声認識部２８は、変換した特徴量を用いて、音響モデルおよび言語モデル（共に図示せず）を参照しながら、人間の音声を表すのに最も確率的に高い語彙や文字並びを認識結果として出力する。つまり、音声認識部２８は、認識結果を言語処理部２３に出力する。これにより、ユーザが、韻律生成装置２にテキストを入力する必要がないので、ユーザによる手間を削減することが可能となる。 FIG. 13 is a block diagram illustrating a schematic configuration of a speech synthesis system 1a according to a modification of the present embodiment. In FIG. 13, components having the same functions as those in FIG. 1 are denoted by the same reference numerals. The prosody generation device 2 includes a speech recognition unit 28 instead of the text input unit 21 shown in FIG. 1. The speech recognition unit 28 has a function of recognizing human speech. To this end, the speech recognition unit 28 converts the voice data output from the voice input unit 25 into feature amounts. Using the converted feature amounts and referring to an acoustic model and a language model (neither shown), the speech recognition unit 28 outputs, as the recognition result, the vocabulary or character sequence that most probably represents the human speech. That is, the speech recognition unit 28 outputs the recognition result to the language processing unit 23. Since the user then does not need to input text into the prosody generation device 2, the user's effort can be reduced.

（音声合成システムの動作）
次に、上記の構成に係る音声合成システム１の動作について、図１４を参照しながら説明する。
(Operation of speech synthesis system)
Next, the operation of the speech synthesis system 1 according to the above configuration will be described with reference to FIG. 14.

図１４は、音声合成システム１の動作の一例を示すフローチャートである。すなわち、図１４に示すように、テキスト入力部２１は、任意のテキストが入力される（工程Ｏｐ１）。言語処理部２３は、単語辞書２２を用いて、工程Ｏｐ１にて入力されたテキストに対して言語解析を行う（工程Ｏｐ２）。なお、言語解析は、上記の形態素解析、係り受け解析などである。言語処理部２３は、工程Ｏｐ２の言語解析の結果に基づいて、テキストの読みを示す表音文字列データを生成する（工程Ｏｐ３）。規則韻律生成部２４は、工程Ｏｐ３にて生成された表音文字列データを音素記号列に変換し、変換した音素記号列に基づいて、規則韻律パターンを生成する（工程Ｏｐ４）。なお、規則韻律パターンは、音素時間長パターン、規則ピッチパターン、および、パワーパターンを含む。 FIG. 14 is a flowchart showing an example of the operation of the speech synthesis system 1. That is, as shown in FIG. 14, the text input unit 21 receives an arbitrary text (step Op1). The language processing unit 23 performs language analysis on the text input in step Op1 using the word dictionary 22 (step Op2). The language analysis includes the morphological analysis and dependency analysis described above. The language processing unit 23 generates phonetic character string data indicating the reading of the text based on the result of the language analysis in step Op2 (step Op3). The regular prosody generation unit 24 converts the phonetic character string data generated in step Op3 into a phoneme symbol string, and generates a regular prosody pattern based on the converted phoneme symbol string (step Op4). The regular prosody pattern includes a phoneme time length pattern, a regular pitch pattern, and a power pattern.

音声入力部２５は、工程Ｏｐ１にて入力されたテキストを読み上げた人間の音声を受け付け、受け付けた人間の音声を音声データに変換する（工程Ｏｐ５）。音声韻律抽出部２６は、工程Ｏｐ３にて生成された表音文字列データを音素記号列に変換し、変換した音素記号列に基づいて、工程Ｏｐ５にて変換された音声データから音声韻律パターンを抽出する（工程Ｏｐ６）。なお、音声韻律パターンは、音素時間長パターン、音声ピッチパターン、および、パワーパターンを含む。ここで、例えば、音声韻律抽出部２６のピッチパターン抽出部２６ｂは、上記の（数１）にて定義される自己相関関数を用いることにより、工程Ｏｐ５にて変換された音声データから音声ピッチパターンを抽出する。 The voice input unit 25 receives the voice of a human reading aloud the text input in step Op1, and converts the received voice into voice data (step Op5). The speech prosody extraction unit 26 converts the phonetic character string data generated in step Op3 into a phoneme symbol string and, based on the converted phoneme symbol string, extracts a speech prosody pattern from the voice data converted in step Op5 (step Op6). Note that the speech prosody pattern includes a phoneme time length pattern, a voice pitch pattern, and a power pattern. Here, for example, the pitch pattern extraction unit 26b of the speech prosody extraction unit 26 extracts the voice pitch pattern from the voice data converted in step Op5 by using the autocorrelation function defined in (Equation 1) above.

信頼度判定部２６ｃは、ピッチパターン抽出部２６ｂが音声データから音声ピッチパターンを抽出する際におけるこの抽出の信頼度を算出する（工程Ｏｐ７）。本実施形態においては、信頼度判定部２６ｃは、ピッチパターン抽出部２６ｂが算出した相関値φ（ｍ）をそのまま信頼度として利用する。また、信頼度判定部２６ｃは、音声ピッチパターンのうち信頼度が閾値以上のパターンをピッチパターン抽出部２６ｂによる抽出の信頼性が高いパターンと判定し、音声ピッチパターンのうち信頼度が閾値未満のパターンをピッチパターン抽出部２６ｂによる抽出の信頼性が低いパターンと判定する（工程Ｏｐ８）。 The reliability determination unit 26c calculates the reliability of the extraction performed when the pitch pattern extraction unit 26b extracts the voice pitch pattern from the voice data (step Op7). In the present embodiment, the reliability determination unit 26c uses the correlation value φ(m) calculated by the pitch pattern extraction unit 26b directly as the reliability. Further, the reliability determination unit 26c determines that a pattern of the voice pitch pattern whose reliability is greater than or equal to a threshold is a pattern with high extraction reliability by the pitch pattern extraction unit 26b, and that a pattern of the voice pitch pattern whose reliability is less than the threshold is a pattern with low extraction reliability by the pitch pattern extraction unit 26b (step Op8).
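The thresholding of steps Op7 and Op8 can be sketched in Python as follows. This is only an illustration under stated assumptions: (Equation 1) is not reproduced in this part of the description, so the sketch assumes an energy-normalized autocorrelation whose peak value serves as the reliability. The function names, the lag range, the helper sine_frame, and the threshold 0.5 are all hypothetical.

```python
import math
from typing import List, Tuple


def sine_frame(period: int, length: int) -> List[float]:
    """Hypothetical test signal: a pure tone with the given period in samples."""
    return [math.sin(2.0 * math.pi * i / period) for i in range(length)]


def autocorr_peak(frame: List[float], min_lag: int, max_lag: int) -> Tuple[int, float]:
    """Return (best_lag, peak) of the energy-normalized autocorrelation of one frame.

    The peak value plays the role of the correlation value used as reliability.
    The frame must be longer than max_lag.
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return min_lag, 0.0  # silent frame: no periodicity, reliability 0
    best_lag, best_val = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        num = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        val = num / energy
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag, best_val


def judge_reliability(peaks: List[float], threshold: float = 0.5) -> List[bool]:
    """Step Op8: a frame is 'high reliability' when its value is >= threshold."""
    return [p >= threshold for p in peaks]
```

For a frame built with sine_frame(20, 200) and a lag range of 10 to 40, the best lag found is the tone's period, and the frame would be judged reliable under the assumed threshold.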

韻律補完部２７ａは、工程Ｏｐ８にて信頼性が高いと判定されたパターン（図９参照）に近似するように、工程Ｏｐ４にて生成された規則ピッチパターンを変形する（工程Ｏｐ９）。例えば、韻律補完部２７ａは、上記の（数２）および（数３）を用いることにより、規則ピッチパターンを変形する。そして、韻律補完部２７ａは、工程Ｏｐ８にて信頼性が高いと判定されたパターンをそのまま用い、工程Ｏｐ８にて信頼性が高いと判定されたパターンと、工程Ｏｐ９にて変形された規則ピッチパターンとを接続する（工程Ｏｐ１０）。韻律補完部２７ａは、工程Ｏｐ１０にて接続された接続部分を滑らかにするために、任意の公知の手法に従ってスムージングを行い、修正ピッチパターンを生成する（工程Ｏｐ１１）。そして、韻律補完部２７ａは、工程Ｏｐ１１にて生成された修正ピッチパターンを含む修正韻律パターンを音声合成装置３に出力する（工程Ｏｐ１２）。 The prosody complementing unit 27a deforms the regular pitch pattern generated in step Op4 so that it approximates the pattern determined to be highly reliable in step Op8 (see FIG. 9) (step Op9). For example, the prosody complementing unit 27a deforms the regular pitch pattern by using (Equation 2) and (Equation 3) above. The prosody complementing unit 27a then uses the pattern determined to be highly reliable in step Op8 as it is, and connects that pattern to the regular pitch pattern deformed in step Op9 (step Op10). To smooth the junctions created in step Op10, the prosody complementing unit 27a performs smoothing according to any known method and generates a corrected pitch pattern (step Op11). The prosody complementing unit 27a then outputs a modified prosody pattern including the corrected pitch pattern generated in step Op11 to the speech synthesizer 3 (step Op12).
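Steps Op9 to Op11 can be sketched in Python as below. (Equation 2) and (Equation 3) are not reproduced in this part of the description, so the sketch substitutes an assumed deformation: a linear transform a * rule + b fitted by least squares on the reliable frames, applied to the whole utterance at once. The 3-point moving average at the junctions stands in for "any known method" of smoothing. All names are hypothetical.

```python
from typing import List, Tuple


def fit_linear(rule: List[float], voice: List[float],
               reliable: List[bool]) -> Tuple[float, float]:
    """Least-squares (a, b) so that a * rule + b approximates voice on reliable frames."""
    xs = [r for r, ok in zip(rule, reliable) if ok]
    ys = [v for v, ok in zip(voice, reliable) if ok]
    n = len(xs)
    if n == 0:
        return 1.0, 0.0  # nothing to fit against: leave the rule pattern unchanged
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0.0:
        return 1.0, my - mx  # flat reference: shift only
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


def corrected_pitch(rule: List[float], voice: List[float],
                    reliable: List[bool]) -> List[float]:
    """Deform the rule pattern, join it with the reliable voice frames, smooth the joins."""
    a, b = fit_linear(rule, voice, reliable)
    deformed = [a * r + b for r in rule]                       # step Op9 (assumed form)
    joined = [v if ok else d
              for v, d, ok in zip(voice, deformed, reliable)]  # step Op10
    out = list(joined)
    for i in range(1, len(joined) - 1):                        # step Op11
        # Smooth only frames adjacent to a reliable/unreliable transition.
        if reliable[i] != reliable[i - 1] or reliable[i] != reliable[i + 1]:
            out[i] = (joined[i - 1] + joined[i] + joined[i + 1]) / 3.0
    return out
```

A per-accent-phrase fit would be a natural refinement, but a single fit keeps the sketch short.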

次に、音声合成装置３の波形生成部３２は、工程Ｏｐ１２にて出力された修正韻律パターンに基づいて、波形辞書３１を用いて合成音声の波形を生成する（工程Ｏｐ１３）。合成音声出力部３３は、工程Ｏｐ１３にて生成された合成音声の波形に基づいて、合成音声を生成する（工程Ｏｐ１４）。合成音声出力部３３は、工程Ｏｐ１４にて生成された合成音声を音声合成装置３の外部に出力する（工程Ｏｐ１５）。 Next, the waveform generation unit 32 of the speech synthesizer 3 generates a synthesized speech waveform using the waveform dictionary 31 based on the modified prosodic pattern output in step Op12 (step Op13). The synthesized speech output unit 33 generates synthesized speech based on the waveform of the synthesized speech generated in step Op13 (step Op14). The synthesized speech output unit 33 outputs the synthesized speech generated in step Op14 to the outside of the speech synthesizer 3 (step Op15).

以上のように、本実施形態に係る韻律生成装置２によれば、韻律補完部２７ａにより生成された修正ピッチパターンは、ピッチパターン抽出部２６ｂによる抽出の信頼性が高いパターンに近似するように適切な規則ピッチパターンを変形し、変形した規則ピッチパターンと、ピッチパターン抽出部２６ｂによる抽出の信頼性が高いパターンとを接続することにより生成されたパターンである。これにより、人間の発声から抽出された音声ピッチパターンの抽出誤りを、人間の発声が有する自然性・表現力を損なうことなく、しかも、手間と時間をかけずに修正することにより、修正ピッチパターンを生成することが可能となる。 As described above, according to the prosody generation device 2 of the present embodiment, the corrected pitch pattern generated by the prosody complementing unit 27a is a pattern generated by deforming an appropriate regular pitch pattern so that it approximates the pattern with high extraction reliability by the pitch pattern extraction unit 26b, and by connecting the deformed regular pitch pattern to that high-reliability pattern. This makes it possible to generate a corrected pitch pattern in which extraction errors in the voice pitch pattern extracted from a human utterance are corrected without impairing the naturalness and expressiveness of the utterance and without spending time and effort.

[実施の形態２]
図１５は、本実施形態に係る音声合成システム１０の概略構成を示すブロック図である。すなわち、本実施形態に係る音声合成システム１０は、図１に示す韻律生成装置２の代わりに、韻律生成装置４を備えている。なお、図１５において、図１と同様の機能を有する構成については、同じ参照符号を付記し、その詳細な説明を省略する。
[Embodiment 2]
FIG. 15 is a block diagram illustrating a schematic configuration of the speech synthesis system 10 according to the present embodiment. That is, the speech synthesis system 10 according to the present embodiment includes a prosody generation device 4 instead of the prosody generation device 2 shown in FIG. 1. In FIG. 15, components having the same functions as those in FIG. 1 are given the same reference numerals, and detailed descriptions thereof are omitted.

韻律生成装置４は、図１に示す修正韻律生成部２７の代わりに、修正韻律生成部４１を備えている。なお、上記の修正韻律生成部４１は、コンピュータのＣＰＵがこの機能を実現するプログラムに従って動作することによっても具現化される。 The prosody generation device 4 includes a modified prosody generation unit 41 instead of the modified prosody generation unit 27 shown in FIG. The modified prosody generation unit 41 is also realized by the computer CPU operating according to a program that realizes this function.

修正韻律生成部４１は、音声ピッチパターンのうちピッチパターン抽出部２６ｂによる抽出の信頼性が低いパターンの代わりに、音声ピッチパターンのうちピッチパターン抽出部２６ｂによる抽出の信頼性が高いパターン、および、規則ピッチパターンに基づいて修正ピッチパターンを生成する。このため、修正韻律生成部４１は、韻律修正部４１ａを有している。 Instead of the pattern of the voice pitch pattern with low extraction reliability by the pitch pattern extraction unit 26b, the modified prosody generation unit 41 generates a corrected pitch pattern based on the pattern of the voice pitch pattern with high extraction reliability by the pitch pattern extraction unit 26b and on the regular pitch pattern. For this purpose, the modified prosody generation unit 41 includes a prosody modification unit 41a.

韻律修正部４１ａは、音声韻律抽出部２６から出力された音声ピッチパターンのうち、ピッチパターン抽出部２６ｂによる抽出の信頼性が高いパターンを抽出する（図９参照）。また、韻律修正部４１ａは、音声韻律抽出部２６から出力された音声ピッチパターンのうち、ピッチパターン抽出部２６ｂによる抽出の信頼性が高いパターン（図９参照）に近似するように、規則韻律生成部２４から出力された規則ピッチパターンを変形する（図１０参照）。ここまでは図１に示す韻律補完部２７ａの処理と同様である。 The prosody modification unit 41a extracts, from the voice pitch pattern output from the speech prosody extraction unit 26, the pattern with high extraction reliability by the pitch pattern extraction unit 26b (see FIG. 9). The prosody modification unit 41a also deforms the regular pitch pattern output from the regular prosody generation unit 24 so that it approximates that high-reliability pattern (see FIG. 9) of the voice pitch pattern output from the speech prosody extraction unit 26 (see FIG. 10). The processing up to this point is the same as that of the prosody complementing unit 27a shown in FIG. 1.

図１６は、図１０に示す太線のパターンを除去し、変形された規則ピッチパターンのみを示した概念図である。韻律修正部４１ａは、変形された規則ピッチパターンにおけるアクセント句の境界部分を滑らかにするために、任意の公知の手法に従ってスムージングを行う。図１７は、韻律修正部４１ａによりスムージングされたピッチパターンの一例を示す概念図である。図１７に示す○印は、変形された規則ピッチパターンにおけるアクセント句の境界部分であって、スムージングされた箇所を示す。このような処理を行うことにより、修正ピッチパターンが生成される。図１８は、韻律修正部４１ａにより生成された修正ピッチパターンの一例を示す概念図である。韻律修正部４１ａは、図１８に示す修正ピッチパターンを音声合成装置３に出力する。 FIG. 16 is a conceptual diagram showing only a modified regular pitch pattern by removing the thick line pattern shown in FIG. The prosody modification unit 41a performs smoothing according to any known method in order to smooth the boundary portion of the accent phrase in the modified regular pitch pattern. FIG. 17 is a conceptual diagram showing an example of a pitch pattern smoothed by the prosody modification unit 41a. A circle mark shown in FIG. 17 indicates a smoothed portion that is a boundary portion of an accent phrase in the modified regular pitch pattern. By performing such processing, a corrected pitch pattern is generated. FIG. 18 is a conceptual diagram illustrating an example of a corrected pitch pattern generated by the prosody correcting unit 41a. The prosody modification unit 41 a outputs the modified pitch pattern shown in FIG. 18 to the speech synthesizer 3.
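The boundary smoothing performed by the prosody modification unit 41a can be illustrated with the Python sketch below. The description only says "any known method", so the choice made here, replacing a short window around each accent phrase boundary with a straight line between the window's end points, is an assumption, as are the function name smooth_boundaries and the parameter half_width.

```python
from typing import List


def smooth_boundaries(pitch: List[float], boundaries: List[int],
                      half_width: int = 2) -> List[float]:
    """Linearly interpolate across a window centered on each boundary frame index.

    pitch      : deformed regular pitch pattern, one value per frame
    boundaries : frame indices where one accent phrase ends and the next begins
    half_width : number of frames on each side of a boundary to re-draw
    """
    out = list(pitch)
    for b in boundaries:
        lo = max(0, b - half_width)
        hi = min(len(pitch) - 1, b + half_width)
        if hi <= lo:
            continue  # window too small to interpolate
        for i in range(lo, hi + 1):
            t = (i - lo) / (hi - lo)
            # End points are read from the original pattern, so windows do not interact.
            out[i] = (1.0 - t) * pitch[lo] + t * pitch[hi]
    return out
```

A step from 100 to 200 at a boundary thus becomes a ramp spread over the window, which removes the discontinuity at the accent phrase boundary.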

以上のように、本実施形態に係る韻律生成装置４によれば、韻律修正部４１ａにより生成された修正ピッチパターンは、ピッチパターン抽出部２６ｂによる抽出の信頼性が高いパターンに近似するように適切な規則ピッチパターンを変形し、ピッチパターン抽出部２６ｂによる抽出の信頼性が高いパターンを用いることなく、変形した規則ピッチパターンを用いることにより生成されたパターンである。これにより、人間の発声から抽出された音声ピッチパターンの抽出誤りを、人間の発声が有する自然性・表現力を損なうことなく、しかも、手間と時間をかけずに修正することにより、修正ピッチパターンを生成することが可能となる。 As described above, according to the prosody generation device 4 of the present embodiment, the corrected pitch pattern generated by the prosody modification unit 41a is a pattern generated by deforming an appropriate regular pitch pattern so that it approximates the pattern with high extraction reliability by the pitch pattern extraction unit 26b, and by using the deformed regular pitch pattern without using that high-reliability pattern itself. This makes it possible to generate a corrected pitch pattern in which extraction errors in the voice pitch pattern extracted from a human utterance are corrected without impairing the naturalness and expressiveness of the utterance and without spending time and effort.

[実施の形態３]
図１９は、本実施形態に係る音声合成システム（韻律編集システム）１１の概略構成を示すブロック図である。すなわち、本実施形態に係る音声合成システム１１は、図１に示す音声合成システム１に加えて、ＧＵＩ（Graphical User Interface）装置５を備えている。ＧＵＩ装置５と韻律生成装置２とは有線または無線により互いに接続されている。また、ＧＵＩ装置５と音声合成装置３とは有線または無線により互いに接続されている。なお、図１９において、図１と同様の機能を有する構成については、同じ参照符号を付記し、その詳細な説明を省略する。また、図１９において、韻律生成装置２の各構成部材２１〜２７、および、音声合成装置３の各構成部材３１〜３３の図示を省略している。さらに、上記のＧＵＩ装置５は、図１３に示す音声合成システム１ａ、および、図１５に示す音声合成システム１０に備えられていてもよい。
[Embodiment 3]
FIG. 19 is a block diagram showing a schematic configuration of the speech synthesis system (prosody editing system) 11 according to the present embodiment. That is, the speech synthesis system 11 according to the present embodiment includes a GUI (Graphical User Interface) device 5 in addition to the speech synthesis system 1 shown in FIG. 1. The GUI device 5 and the prosody generation device 2 are connected to each other by wire or wirelessly. The GUI device 5 and the speech synthesizer 3 are likewise connected to each other by wire or wirelessly. In FIG. 19, components having the same functions as those in FIG. 1 are given the same reference numerals, and detailed descriptions thereof are omitted. Further, in FIG. 19, the constituent members 21 to 27 of the prosody generation device 2 and the constituent members 31 to 33 of the speech synthesizer 3 are not illustrated. The GUI device 5 may also be provided in the speech synthesis system 1a shown in FIG. 13 and in the speech synthesis system 10 shown in FIG. 15.

ＧＵＩ装置５は、韻律生成装置２により生成された表音文字列データおよび修正韻律パターンをユーザに編集させる装置である。このため、ＧＵＩ装置５は、ユーザに対して表音文字列データおよび修正韻律パターンを提示し、入力デバイスを用いて提示された表音文字列データおよび修正韻律パターンを編集可能なユーザインターフェース機能を提供する。それゆえ、ＧＵＩ装置５は、表示部５１、および、編集部５２を備えている。なお、上記の表示部５１および編集部５２は、コンピュータのＣＰＵがこの機能を実現するプログラムに従って動作することによっても具現化される。 The GUI device 5 is a device that allows the user to edit the phonetic character string data and the modified prosody pattern generated by the prosody generation device 2. To this end, the GUI device 5 provides a user interface function that presents the phonetic character string data and the modified prosody pattern to the user and allows the presented data and pattern to be edited using an input device. The GUI device 5 therefore includes a display unit 51 and an editing unit 52. The display unit 51 and the editing unit 52 can also be realized by the CPU of a computer operating according to a program that realizes these functions.

表示部５１は、液晶ディスプレイ、有機ＥＬディスプレイ、プラズマディスプレイ、ＣＲＴディスプレイなどの任意の表示デバイスから構成される。編集部５２は、キーボード、マウス、テンキー、タッチパネルなどの任意の入力デバイスから構成される。 The display unit 51 includes an arbitrary display device such as a liquid crystal display, an organic EL display, a plasma display, or a CRT display. The editing unit 52 includes an arbitrary input device such as a keyboard, a mouse, a numeric keypad, and a touch panel.

図２０は、表示部５１に表示される表示画面の一例を示す概念図である。図２０に示すように、表示部５１の表示画面は、テキスト編集部５１ａ、言語処理ボタン５１ｂ、言語処理結果編集部５１ｃ、規則韻律生成ボタン５１ｄ、規則韻律パターン表示部５１ｅ、音声入力ボタン５１ｆ、音声韻律抽出ボタン５１ｇ、音声韻律パターン表示部５１ｈ、自動修正ボタン５１ｉ、修正韻律パターン表示部５１ｊ、および、波形生成ボタン５１ｋを有している。 FIG. 20 is a conceptual diagram illustrating an example of a display screen displayed on the display unit 51. As shown in FIG. 20, the display screen of the display unit 51 includes a text editing unit 51a, a language processing button 51b, a language processing result editing unit 51c, a regular prosody generation button 51d, a regular prosody pattern display unit 51e, a voice input button 51f, a speech prosody extraction button 51g, a speech prosody pattern display unit 51h, an automatic correction button 51i, a modified prosody pattern display unit 51j, and a waveform generation button 51k.

テキスト編集部５１ａは、任意のテキストをユーザに入力させる。図２０に示す例では、テキスト編集部５１ａには、「音声ガイダンスに従ってプッシュボタンを押してください。」を表すテキストがユーザにより入力されている。なお、ＧＵＩ装置５に予め用意されているテキストファイルをユーザが指定し、指定したテキストファイルを開くことにより、テキスト編集部５１ａにテキストが入力されるようにしてもよい。 The text editing unit 51a allows the user to input arbitrary text. In the example illustrated in FIG. 20, text representing “please push the push button according to the voice guidance” is input to the text editing unit 51 a by the user. The text may be input to the text editing unit 51a by the user specifying a text file prepared in advance in the GUI device 5 and opening the specified text file.

言語処理ボタン５１ｂは、韻律生成装置２の言語処理部２３に対して、テキスト編集部５１ａに入力されたテキストの言語解析を指示するためのボタンである。 The language processing button 51b is a button for instructing the language processing unit 23 of the prosody generation device 2 to perform language analysis of the text input to the text editing unit 51a.

言語処理結果編集部５１ｃは、言語処理部２３による言語解析の結果に基づいて生成された表音文字列データを表示する。図２０に示す例では、言語処理結果編集部５１ｃには、表音文字列データ「オンセーガ’イダンスニ＿シタガッテ，プッシュボ’タンオ＿オシテクダサ’イ．」が表示されている。また、言語処理結果編集部５１ｃは、表示された表音文字列データをユーザに編集させる機能を有している。これにより、言語処理部２３による言語解析が誤っている場合、すなわち、表示された表音文字列データが誤っている場合、例えば、ユーザは、アクセント核の位置を変更し、あるいは、アクセント句やフレーズの境界を変更することにより、正しい表音文字列データに変更することが可能となる。 The language processing result editing unit 51c displays the phonetic character string data generated based on the result of language analysis by the language processing unit 23. In the example shown in FIG. 20, the phonetic character string data 「オンセーガ’イダンスニ＿シタガッテ，プッシュボ’タンオ＿オシテクダサ’イ．」 is displayed in the language processing result editing unit 51c. The language processing result editing unit 51c also has a function of allowing the user to edit the displayed phonetic character string data. Thus, when the language analysis by the language processing unit 23 is wrong, that is, when the displayed phonetic character string data is wrong, the user can correct it, for example, by changing the position of an accent nucleus or by changing the boundaries of accent phrases or phrases.

規則韻律生成ボタン５１ｄは、韻律生成装置２の規則韻律生成部２４に対して、言語処理結果編集部５１ｃに表示された表音文字列データに基づいて規則韻律パターンを生成するように指示するボタンである。 The regular prosody generation button 51d is a button for instructing the regular prosody generation unit 24 of the prosody generation device 2 to generate a regular prosody pattern based on the phonetic character string data displayed in the language processing result editing unit 51c.

規則韻律パターン表示部５１ｅは、規則韻律生成部２４により生成された規則韻律パターンを表示する。図２０に示す例では、規則韻律パターン表示部５１ｅには、規則韻律パターンのうち、規則ピッチパターンおよび音素時間長パターンが表示されている。なお、規則韻律パターン表示部５１ｅには、パワーパターンが表示されていてもよい。 The regular prosody pattern display unit 51 e displays the regular prosody pattern generated by the regular prosody generation unit 24. In the example shown in FIG. 20, the regular prosody pattern display unit 51e displays a regular pitch pattern and a phoneme duration pattern among the regular prosodic patterns. The regular prosodic pattern display unit 51e may display a power pattern.

音声入力ボタン５１ｆは、テキスト編集部５１ａに入力されたテキストを読み上げた人間の音声をＧＵＩ装置５に入力させるためのボタンである。例えば、ユーザが、音声入力ボタン５１ｆを指示し、テキストを読み上げると、テキストを読み上げた人間の音声がＧＵＩ装置５に録音される。このため、ＧＵＩ装置５にはマイクロフォンが内蔵または接続されている。なお、ユーザが、音声入力ボタン５１ｆを指示すると、音声データファイルが表示され、表示された音声データファイルを指示することにより、人間の音声をＧＵＩ装置５に入力させるようにしてもよい。 The voice input button 51f is a button for causing the GUI device 5 to input a human voice read out from the text input to the text editing unit 51a. For example, when the user designates the voice input button 51 f and reads out the text, the voice of the human who reads the text is recorded in the GUI device 5. For this reason, a microphone is built in or connected to the GUI device 5. Note that when the user instructs the voice input button 51f, a voice data file is displayed, and a human voice may be input to the GUI device 5 by pointing to the displayed voice data file.

音声韻律抽出ボタン５１ｇは、韻律生成装置２の音声韻律抽出部２６に対して、音声入力ボタン５１ｆにより入力された人間の音声から音声韻律パターンを抽出するように指示するボタンである。 The speech prosody extraction button 51g is a button for instructing the speech prosody extraction unit 26 of the prosody generation device 2 to extract a speech prosody pattern from the human speech input by the speech input button 51f.

音声韻律パターン表示部５１ｈは、音声韻律抽出部２６により抽出された音声韻律パターンを表示する。図２０に示す例では、音声韻律パターン表示部５１ｈには、音声韻律パターンのうち、音声ピッチパターンおよび音素時間長パターンが表示されている。音声ピッチパターンは、信頼度判定部２６ｃにより抽出の信頼性が高いと判定されたピッチについては実線のパターンにて表し、信頼度判定部２６ｃにより抽出の信頼性が低いと判定されたピッチについては点線のパターンにて表している。なお、音声韻律パターン表示部５１ｈには、パワーパターンが表示されていてもよい。 The speech prosody pattern display unit 51 h displays the speech prosody pattern extracted by the speech prosody extraction unit 26. In the example shown in FIG. 20, the speech prosody pattern display unit 51 h displays a speech pitch pattern and a phoneme time length pattern among speech prosodic patterns. In the voice pitch pattern, the pitch determined to be high in extraction reliability by the reliability determination unit 26c is represented by a solid line pattern, and the pitch determined to be low in extraction reliability by the reliability determination unit 26c. It is represented by a dotted line pattern. Note that a power pattern may be displayed in the speech prosody pattern display portion 51h.
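Drawing the voice pitch pattern with solid and dotted strokes amounts to grouping consecutive frames that share the same reliability judgment. The Python sketch below shows one way to derive such segments; the function name reliability_runs and the tuple layout (start, end, style) are assumptions for illustration, and no particular drawing library is implied.

```python
from itertools import groupby
from typing import List, Tuple


def reliability_runs(reliable: List[bool]) -> List[Tuple[int, int, str]]:
    """Split per-frame reliability flags into runs of (start, end, style).

    'end' is exclusive. High-reliability runs are styled 'solid',
    low-reliability runs 'dotted', matching the display convention above.
    """
    runs: List[Tuple[int, int, str]] = []
    start = 0
    for flag, group in groupby(reliable):
        length = len(list(group))
        runs.append((start, start + length, "solid" if flag else "dotted"))
        start += length
    return runs
```

Each returned run could then be passed to whatever plotting routine the display unit uses, one stroke per run.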

自動修正ボタン５１ｉは、韻律生成装置２の韻律補完部２７ａに対して、音声韻律パターン表示部５１ｈに表示された抽出の信頼性が高い音声ピッチパターン、および、規則韻律パターン表示部５１ｅに表示された規則ピッチパターンに基づいて修正ピッチパターンを生成するように指示するボタンである。なお、自動修正ボタン５１ｉは、修正ピッチパターンを生成することに加えて、修正音素時間長パターンの生成を指示するボタンでもある。 The automatic correction button 51i is a button for instructing the prosody complementing unit 27a of the prosody generation device 2 to generate a corrected pitch pattern based on the voice pitch pattern with high extraction reliability displayed on the speech prosody pattern display unit 51h and on the regular pitch pattern displayed on the regular prosody pattern display unit 51e. In addition to the generation of the corrected pitch pattern, the automatic correction button 51i also instructs the generation of a modified phoneme time length pattern.

修正韻律パターン表示部５１ｊは、韻律補完部２７ａにより生成された修正韻律パターンを表示する。図２０に示す例では、修正韻律パターン表示部５１ｊには、修正韻律パターンのうち、修正ピッチパターンおよび修正音素時間長パターンが表示されている。なお、修正韻律パターン表示部５１ｅには、修正パワーパターンが表示されていてもよい。ここで、本実施形態においては、修正韻律パターン表示部５１ｊは、表示された修正ピッチパターンを、ユーザが入力デバイスを用いて操作することにより移動させ、修正ピッチパターンを新たに再設定させることができる。一例として、ユーザは、マウスのポインタを移動させたい修正ピッチパターンに触れた状態でその触れた位置（指示位置）を上方向または下方向に移動（ドラッグ）させ、所望の位置でドロップすると、修正ピッチパターンは、移動された所望の位置に配置される。なお、修正韻律パターン表示部５１ｊは、修正ピッチパターンを、スペクトログラムに重ねて表示することが好ましい。 The modified prosody pattern display unit 51j displays the modified prosody pattern generated by the prosody complementing unit 27a. In the example shown in FIG. 20, the modified prosody pattern display unit 51j displays, of the modified prosody pattern, the corrected pitch pattern and the modified phoneme time length pattern. The modified prosody pattern display unit 51e may also display a corrected power pattern. In the present embodiment, the modified prosody pattern display unit 51j allows the user to move the displayed corrected pitch pattern by operating the input device and thereby set the corrected pitch pattern anew. As an example, when the user places the mouse pointer on the corrected pitch pattern to be moved, moves (drags) the pointed position upward or downward, and drops it at a desired position, the corrected pitch pattern is placed at that position. The modified prosody pattern display unit 51j preferably displays the corrected pitch pattern superimposed on a spectrogram.

波形生成ボタン５１ｋは、音声合成装置３の波形生成部３２に対して、修正韻律パターン表示部５１ｅに表示された修正韻律パターンに基づいて合成音声の波形を生成するように指示するボタンである。これにより、音声合成装置３は、波形生成ボタン５１ｋにより生成された合成音声の波形に基づいて、合成音声を出力することが可能となる。それゆえ、ユーザは、音声合成装置３から出力された合成音声に基づいて、修正韻律パターン表示部５１ｊに表示された修正ピッチパターンを変更することが可能となる。 The waveform generation button 51k is a button for instructing the waveform generation unit 32 of the speech synthesizer 3 to generate a synthesized speech waveform based on the modified prosodic pattern displayed on the modified prosodic pattern display unit 51e. Thereby, the speech synthesizer 3 can output the synthesized speech based on the waveform of the synthesized speech generated by the waveform generation button 51k. Therefore, the user can change the modified pitch pattern displayed on the modified prosodic pattern display unit 51j based on the synthesized speech output from the speech synthesizer 3.

以上のように、本実施形態に係る音声合成システム１１によれば、ＧＵＩ装置５は、韻律生成装置２により生成された表音文字列データおよび修正韻律パターンの少なくとも１つを編集させるので、韻律生成装置２により生成された表音文字列データおよび修正韻律パターンの少なくとも１つに対して、ユーザは、木目細かい調整を行うことが可能となる。 As described above, according to the speech synthesis system 11 according to the present embodiment, the GUI device 5 edits at least one of the phonogram character string data and the modified prosody pattern generated by the prosody generation device 2. The user can make fine adjustments to at least one of the phonetic character string data and the modified prosodic pattern generated by the generation device 2.

なお、第１〜第３の実施形態において、韻律生成装置またはＧＵＩ装置から出力された修正韻律パターンを音声合成装置に出力し、音声合成装置が、修正韻律パターンに基づいて合成音声を生成し出力する例について説明したが、これに限定されない。例えば、韻律生成装置またはＧＵＩ装置から出力された修正韻律パターンを用いて、音声合成用の韻律辞書、音声合成用の波形辞書、音声認識用の音響モデルなどを生成するようにしてもよい。 In the first to third embodiments, examples have been described in which the modified prosody pattern output from the prosody generation device or the GUI device is output to the speech synthesizer, and the speech synthesizer generates and outputs synthesized speech based on the modified prosody pattern. However, the present invention is not limited to this. For example, a prosody dictionary for speech synthesis, a waveform dictionary for speech synthesis, an acoustic model for speech recognition, and the like may be generated using the modified prosody pattern output from the prosody generation device or the GUI device.

すなわち、本発明は上述した第１〜第３の実施形態に限定されるものではなく、請求項に示した範囲で種々の変更が可能である。すなわち、請求項に示した範囲で適宜変更した技術的手段を組み合わせて得られる実施形態についても本発明の技術的範囲に含まれる。 That is, the present invention is not limited to the first to third embodiments described above, and various modifications can be made within the scope of the claims. That is, embodiments obtained by combining technical means appropriately changed within the scope of the claims are also included in the technical scope of the present invention.

以上の実施の形態に関し、更に以下の付記を開示する。 Regarding the above embodiment, the following additional notes are disclosed.

（付記１）
任意のテキストが入力されるテキスト入力部と、
前記テキストを言語解析することにより、前記テキストの読みを示す表音文字列データを生成する言語処理部と、
前記表音文字列データ、および、韻律生成規則に基づいて、前記テキストの韻律を示す規則韻律パターンを生成する規則韻律生成部と、
前記テキストを読み上げた人間の音声を音声データに変換する音声入力部と、
前記音声データから前記人間の音声の韻律を示す音声韻律パターンを抽出する音声韻律抽出部と、
前記音声韻律抽出部が前記音声データから前記音声韻律パターンを抽出する際における、当該抽出の信頼度を取得し、前記音声韻律パターンのうち前記信頼度が閾値以上のパターンを前記音声韻律抽出部による抽出の信頼性が高いパターンと判定し、前記音声韻律パターンのうち前記信頼度が閾値未満のパターンを前記音声韻律抽出部による抽出の信頼性が低いパターンと判定する信頼度判定部と、
前記音声韻律パターンのうち前記音声韻律抽出部による抽出の信頼性が低いパターンの代わりに、前記音声韻律パターンのうち前記音声韻律抽出部による抽出の信頼性が高いパターン、および、前記規則韻律パターンに基づいて修正韻律パターンを生成する修正韻律生成部とを備えたことを特徴とする韻律生成装置。
(Appendix 1)
A prosody generation device comprising:
a text input unit to which arbitrary text is input;
a language processing unit that generates phonetic character string data indicating the reading of the text by performing language analysis on the text;
a regular prosody generation unit that generates a regular prosody pattern indicating the prosody of the text based on the phonetic character string data and prosody generation rules;
a voice input unit that converts the voice of a human who has read the text aloud into voice data;
a speech prosody extraction unit that extracts, from the voice data, a speech prosody pattern indicating the prosody of the human voice;
a reliability determination unit that acquires the reliability of the extraction performed when the speech prosody extraction unit extracts the speech prosody pattern from the voice data, determines that a pattern of the speech prosody pattern whose reliability is greater than or equal to a threshold is a pattern with high extraction reliability by the speech prosody extraction unit, and determines that a pattern of the speech prosody pattern whose reliability is less than the threshold is a pattern with low extraction reliability by the speech prosody extraction unit; and
a modified prosody generation unit that, instead of the pattern of the speech prosody pattern with low extraction reliability by the speech prosody extraction unit, generates a modified prosody pattern based on the pattern of the speech prosody pattern with high extraction reliability by the speech prosody extraction unit and on the regular prosody pattern.

（付記２）
前記修正韻律生成部は、
前記音声韻律パターンのうち前記音声韻律抽出部による抽出の信頼性が高いパターンに近似するように前記規則韻律パターンを変形し、変形した規則韻律パターンと、前記音声韻律パターンのうち前記音声韻律抽出部による抽出の信頼性が高いパターンとを接続することにより、修正韻律パターンを生成する韻律補完部を含む、請求項１に記載の韻律生成装置。 (Appendix 2)
The prosody generation device according to claim 1, wherein the modified prosody generation unit includes a prosody complementing unit that deforms the regular prosody pattern so that it approximates the part of the speech prosody pattern extracted with high reliability by the speech prosody extraction unit, and generates the modified prosody pattern by connecting the deformed regular prosody pattern to that highly reliable part.
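
The complementing step in Appendix 2 might be sketched as follows. The deformation used here, a least-squares scale and offset fitted on the reliable frames, is an assumption chosen for brevity, since this passage does not say how the regular pattern is deformed. The function names and the frame-aligned list representation are likewise hypothetical, and any smoothing at the joins is omitted.

```python
from typing import List, Tuple


def fit_scale_offset(rule: List[float], speech: List[float],
                     mask: List[bool]) -> Tuple[float, float]:
    """Least-squares (a, b) so that a * rule + b approximates speech on reliable frames."""
    xs = [r for r, m in zip(rule, mask) if m]
    ys = [s for s, m in zip(speech, mask) if m]
    n = len(xs)
    if n == 0:
        return 1.0, 0.0              # nothing reliable: leave the rule pattern as is
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        return 1.0, mean_y - mean_x  # flat rule pattern: offset only
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x


def complement(rule: List[float], speech: List[float],
               mask: List[bool]) -> List[float]:
    """Keep reliable speech frames; fill unreliable frames from the deformed rule pattern."""
    a, b = fit_scale_offset(rule, speech, mask)
    deformed = [a * r + b for r in rule]
    return [s if m else d for s, d, m in zip(speech, deformed, mask)]
```

The output keeps the speaker's own pitch wherever extraction was trusted, which is what distinguishes this variant from the one in Appendix 3.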

（付記３）
前記修正韻律生成部は、
前記音声韻律パターンのうち前記音声韻律抽出部による抽出の信頼性が高いパターンに近似するように前記規則韻律パターンを変形し、前記音声韻律パターンのうち前記音声韻律抽出部による抽出の信頼性が高いパターンを用いることなく、変形した規則韻律パターンを用いることにより、修正韻律パターンを生成する韻律修正部を含む、請求項１に記載の韻律生成装置。 (Appendix 3)
The prosody generation device according to claim 1, wherein the modified prosody generation unit includes a prosody modification unit that deforms the regular prosody pattern so that it approximates the part of the speech prosody pattern extracted with high reliability by the speech prosody extraction unit, and generates the modified prosody pattern from the deformed regular prosody pattern alone, without using the highly reliable part itself.
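
For contrast with Appendix 2, here is a minimal sketch of the Appendix 3 variant, in which the output is built only from the deformed regular pattern. The deformation shown, a uniform shift in the log-frequency domain so that the mean log-pitch matches on reliable frames, is an assumption for illustration, and the function name and list representation are hypothetical.

```python
import math
from typing import List


def modify(rule_hz: List[float], speech_hz: List[float],
           mask: List[bool]) -> List[float]:
    """Shift the rule pattern toward the reliable speech frames and return it alone.

    The extracted speech values are used only to estimate the shift; none of
    them is copied into the result.
    """
    diffs = [math.log(s) - math.log(r)
             for r, s, m in zip(rule_hz, speech_hz, mask) if m]
    if not diffs:
        return list(rule_hz)  # no reliable frames: rule pattern unchanged
    ratio = math.exp(sum(diffs) / len(diffs))
    return [r * ratio for r in rule_hz]
```

Because every output value comes from the rule pattern, extraction errors that slip past the threshold cannot appear in the result directly. They can only bias the estimated shift.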

（付記４）
前記規則韻律パターン、前記音声韻律パターン、および、前記修正韻律パターンは、声の高さの変化パターンを表すピッチパターンである、付記１〜３のいずれか一項に記載の韻律生成装置。 (Appendix 4)
The prosody generation device according to any one of appendices 1 to 3, wherein the regular prosody pattern, the speech prosody pattern, and the modified prosody pattern are pitch patterns, each representing a pattern of change in voice pitch.

（付記５）
付記１〜４のいずれか一項に記載の韻律生成装置と、
前記韻律生成装置により生成された表音文字列データおよび修正韻律パターンの少なくとも１つを編集させるＧＵＩ装置とを備えたことを特徴とする韻律編集システム。 (Appendix 5)
A prosody editing system comprising:
the prosody generation device according to any one of appendices 1 to 4; and
a GUI device for editing at least one of the phonetic character string data and the modified prosody pattern generated by the prosody generation device.

（付記６）
付記１〜４のいずれか一項に記載の韻律生成装置と、
前記韻律生成装置により生成された修正韻律パターンに基づいて、合成音声を生成し出力する音声合成装置とを備えたことを特徴とする音声合成システム。 (Appendix 6)
A speech synthesis system comprising:
the prosody generation device according to any one of appendices 1 to 4; and
a speech synthesizer that generates and outputs synthesized speech based on the modified prosody pattern generated by the prosody generation device.

（付記７）
付記１〜４のいずれか一項に記載の韻律生成装置と、
前記韻律生成装置により生成された表音文字列データおよび修正韻律パターンの少なくとも１つを編集させるＧＵＩ装置と、
前記韻律生成装置により生成された修正韻律パターン、および、前記ＧＵＩ装置により編集された修正韻律パターンの少なくとも１つに基づいて、合成音声を生成し出力する音声合成装置とを備えたことを特徴とする音声合成システム。 (Appendix 7)
A speech synthesis system comprising:
the prosody generation device according to any one of appendices 1 to 4;
a GUI device for editing at least one of the phonetic character string data and the modified prosody pattern generated by the prosody generation device; and
a speech synthesizer that generates and outputs synthesized speech based on at least one of the modified prosody pattern generated by the prosody generation device and the modified prosody pattern edited by the GUI device.

（付記８）
コンピュータが備えるテキスト入力部が、任意のテキストが入力されるテキスト入力工程と、
前記コンピュータが備える言語処理部が、前記テキストを言語解析することにより、前記テキストの読みを示す表音文字列データを生成する言語処理工程と、
前記コンピュータが備える規則韻律生成部が、前記表音文字列データ、および、韻律生成規則に基づいて、前記テキストの韻律を示す規則韻律パターンを生成する規則韻律生成工程と、
前記コンピュータが備える音声入力部が、前記テキストを読み上げた人間の音声を音声データに変換する音声入力工程と、
前記コンピュータが備える音声韻律抽出部が、前記音声データから前記人間の音声の韻律を示す音声韻律パターンを抽出する音声韻律抽出工程と、
前記コンピュータが備える信頼度判定部が、前記音声韻律抽出工程にて前記音声データから前記音声韻律パターンが抽出された際における、当該抽出の信頼度を取得し、前記音声韻律パターンのうち前記信頼度が閾値以上のパターンを前記音声韻律抽出工程による抽出の信頼性が高いパターンと判定し、前記音声韻律パターンのうち前記信頼度が閾値未満のパターンを前記音声韻律抽出工程による抽出の信頼性が低いパターンと判定する信頼度判定工程と、
前記コンピュータが備える修正韻律生成部が、前記音声韻律パターンのうち前記音声韻律抽出工程による抽出の信頼性が低いパターンの代わりに、前記音声韻律パターンのうち前記音声韻律抽出工程による抽出の信頼性が高いパターン、および、前記規則韻律パターンに基づいて修正韻律パターンを生成する修正韻律生成工程とを含むことを特徴とする韻律生成方法。 (Appendix 8)
A prosody generation method comprising:
a text input step in which a text input unit of a computer receives arbitrary text;
a language processing step in which a language processing unit of the computer performs language analysis on the text to generate phonetic character string data indicating the reading of the text;
a regular prosody generation step in which a regular prosody generation unit of the computer generates a regular prosody pattern indicating the prosody of the text based on the phonetic character string data and prosody generation rules;
a voice input step in which a voice input unit of the computer converts the voice of a human reading the text aloud into speech data;
a speech prosody extraction step in which a speech prosody extraction unit of the computer extracts, from the speech data, a speech prosody pattern indicating the prosody of the human voice;
a reliability determination step in which a reliability determination unit of the computer acquires the reliability of the extraction performed when the speech prosody pattern is extracted from the speech data in the speech prosody extraction step, determines that a part of the speech prosody pattern whose reliability is greater than or equal to a threshold is a pattern extracted with high reliability in the speech prosody extraction step, and determines that a part of the speech prosody pattern whose reliability is less than the threshold is a pattern extracted with low reliability in the speech prosody extraction step; and
a modified prosody generation step in which a modified prosody generation unit of the computer generates a modified prosody pattern based on the regular prosody pattern and the part of the speech prosody pattern extracted with high reliability, in place of the part of the speech prosody pattern extracted with low reliability.
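
The speech prosody extraction and reliability determination steps can be pictured with a small autocorrelation pitch estimator. This is only a sketch under assumptions: the figure descriptions mention applying an autocorrelation function to speech data, but taking the normalized autocorrelation peak as the reliability value is a guess made for illustration here. The function name, the search range `fmin`/`fmax`, and the frame representation are also hypothetical.

```python
from typing import List, Tuple


def estimate_pitch(frame: List[float], sample_rate: int,
                   fmin: float = 60.0, fmax: float = 400.0) -> Tuple[float, float]:
    """Return (pitch_hz, reliability) for one frame of samples.

    reliability is the autocorrelation at the best lag divided by the
    frame energy, clipped below at 0 (an assumed definition).
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0, 0.0                      # silent frame: nothing to extract
    lag_min = max(1, int(sample_rate / fmax))
    lag_max = min(len(frame) - 1, int(sample_rate / fmin))
    best_lag, best_corr = 0, -1.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(frame[i] * frame[i + lag]
                   for i in range(len(frame) - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    if best_lag == 0:
        return 0.0, 0.0                      # frame too short for the search range
    return sample_rate / best_lag, max(0.0, best_corr)
```

A clean periodic frame yields a peak close to 1, while noisy or unvoiced frames yield a low peak and would fall below the threshold in the reliability determination step.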

（付記９）
任意のテキストが入力されるテキスト入力処理と、
前記テキストを言語解析することにより、前記テキストの読みを示す表音文字列データを生成する言語処理と、
前記表音文字列データ、および、韻律生成規則に基づいて、前記テキストの韻律を示す規則韻律パターンを生成する規則韻律生成処理と、
前記テキストを読み上げた人間の音声を音声データに変換する音声入力処理と、
前記音声データから前記人間の音声の韻律を示す音声韻律パターンを抽出する音声韻律抽出処理と、
前記音声韻律抽出処理にて前記音声データから前記音声韻律パターンが抽出された際における、当該抽出の信頼度を取得し、前記音声韻律パターンのうち前記信頼度が閾値以上のパターンを前記音声韻律抽出処理による抽出の信頼性が高いパターンと判定し、前記音声韻律パターンのうち前記信頼度が閾値未満のパターンを前記音声韻律抽出処理による抽出の信頼性が低いパターンと判定する信頼性判定処理と、
前記音声韻律パターンのうち前記音声韻律抽出処理による抽出の信頼性が低いパターンの代わりに、前記音声韻律パターンのうち前記音声韻律抽出処理による抽出の信頼性が高いパターン、および、前記規則韻律パターンに基づいて修正韻律パターンを生成する修正韻律生成処理とをコンピュータに実行させることを特徴とする韻律生成プログラム。 (Appendix 9)
A prosody generation program that causes a computer to execute:
text input processing in which arbitrary text is input;
language processing that performs language analysis on the text to generate phonetic character string data indicating the reading of the text;
regular prosody generation processing that generates a regular prosody pattern indicating the prosody of the text based on the phonetic character string data and prosody generation rules;
voice input processing that converts the voice of a human reading the text aloud into speech data;
speech prosody extraction processing that extracts, from the speech data, a speech prosody pattern indicating the prosody of the human voice;
reliability determination processing that acquires the reliability of the extraction performed when the speech prosody pattern is extracted from the speech data in the speech prosody extraction processing, determines that a part of the speech prosody pattern whose reliability is greater than or equal to a threshold is a pattern extracted with high reliability by the speech prosody extraction processing, and determines that a part of the speech prosody pattern whose reliability is less than the threshold is a pattern extracted with low reliability by the speech prosody extraction processing; and
modified prosody generation processing that generates a modified prosody pattern based on the regular prosody pattern and the part of the speech prosody pattern extracted with high reliability, in place of the part of the speech prosody pattern extracted with low reliability.

以上のように、本発明は、任意のテキストと、このテキストを読み上げた人間の音声とを受け付け、受け付けた任意のテキストおよび人間の音声に基づいて、韻律パターンを生成する韻律生成装置、韻律生成方法、または、韻律生成プログラムとして有用である。 As described above, the present invention is useful as a prosody generation device, a prosody generation method, or a prosody generation program that accepts arbitrary text and the voice of a human reading that text aloud, and generates a prosody pattern based on the accepted text and voice.

本発明の第１の実施形態に係る音声合成システムの概略構成を示すブロック図である。 FIG. 1 is a block diagram showing the schematic configuration of a speech synthesis system according to a first embodiment of the present invention.
上記音声合成システムの韻律生成装置における言語処理部が文字列データに対して形態素解析を行った結果を示す概念図である。 FIG. 2 is a conceptual diagram showing the result of morphological analysis performed on character string data by the language processing unit in the prosody generation device of the speech synthesis system.
上記言語処理部により生成された複数の文節とその読みを示す概念図である。 FIG. 3 is a conceptual diagram showing a plurality of clauses generated by the language processing unit and their readings.
フレーズ成分にアクセント句成分が重畳された状態を示す概念図である。 FIG. 4 is a conceptual diagram showing a state in which accent phrase components are superimposed on a phrase component.
上記韻律生成装置におけるピッチパターン生成部により生成された規則ピッチパターンの一例を示す概念図である。 FIG. 5 is a conceptual diagram showing an example of a regular pitch pattern generated by the pitch pattern generation unit in the prosody generation device.
任意の母音の音声データの時系列を示す概念図である。 FIG. 6 is a conceptual diagram showing a time series of speech data for an arbitrary vowel.
図６に示す音声データの時系列を自己相関関数に適用した場合における相関値を示す概念図である。 FIG. 7 is a conceptual diagram showing the correlation values obtained when the autocorrelation function is applied to the time series of speech data shown in FIG. 6.
上記韻律生成装置におけるピッチパターン抽出部により抽出された音声ピッチパターンの一例を示す概念図である。 FIG. 8 is a conceptual diagram showing an example of a speech pitch pattern extracted by the pitch pattern extraction unit in the prosody generation device.
図８に示す音声ピッチパターンのうち、上記韻律生成装置における信頼度判定部により抽出の信頼性が高いと判定されたパターンの一例を示す概念図である。 FIG. 9 is a conceptual diagram showing an example of the part of the speech pitch pattern shown in FIG. 8 that the reliability determination unit in the prosody generation device judged to have been extracted with high reliability.
図９に示すパターンに近似するように変形された規則ピッチパターンの一例を示す概念図である。 FIG. 10 is a conceptual diagram showing an example of a regular pitch pattern deformed so as to approximate the pattern shown in FIG. 9.
上記韻律生成装置における韻律補完部によりスムージングされたピッチパターンの一例を示す概念図である。 FIG. 11 is a conceptual diagram showing an example of a pitch pattern smoothed by the prosody complementing unit in the prosody generation device.
上記韻律補完部により生成された修正ピッチパターンの一例を示す概念図である。 FIG. 12 is a conceptual diagram showing an example of a modified pitch pattern generated by the prosody complementing unit.
本発明の第１の実施形態の変形例に係る音声合成システムの概略構成を示すブロック図である。 FIG. 13 is a block diagram showing the schematic configuration of a speech synthesis system according to a variation of the first embodiment of the present invention.
上記音声合成システムの動作の一例を示すフローチャートである。 FIG. 14 is a flowchart showing an example of the operation of the speech synthesis system.
本発明の第２の実施形態に係る音声合成システムの概略構成を示すブロック図である。 FIG. 15 is a block diagram showing the schematic configuration of a speech synthesis system according to a second embodiment of the present invention.
図１０に示す太線のパターンを除去し、変形された規則ピッチパターンのみを示した概念図である。 FIG. 16 is a conceptual diagram showing only the deformed regular pitch pattern, with the thick-line pattern shown in FIG. 10 removed.
上記音声合成システムの韻律生成装置における韻律修正部によりスムージングされたピッチパターンの一例を示す概念図である。 FIG. 17 is a conceptual diagram showing an example of a pitch pattern smoothed by the prosody modification unit in the prosody generation device of the speech synthesis system.
上記韻律修正部により生成された修正ピッチパターンの一例を示す概念図である。 FIG. 18 is a conceptual diagram showing an example of a modified pitch pattern generated by the prosody modification unit.
本発明の第３の実施形態に係る音声合成システムの概略構成を示すブロック図である。 FIG. 19 is a block diagram showing the schematic configuration of a speech synthesis system according to a third embodiment of the present invention.
上記音声合成システムのＧＵＩ装置における表示部に表示された表示画面の一例を示す概念図である。 FIG. 20 is a conceptual diagram showing an example of a display screen shown on the display unit of the GUI device of the speech synthesis system.

Explanation of symbols

1, 1a, 10, 11: Speech synthesis system (音声合成システム)
2, 4: Prosody generation device (韻律生成装置)
3: Speech synthesizer (音声合成装置)
5: GUI device (ＧＵＩ装置)
21: Text input unit (テキスト入力部)
23: Language processing unit (言語処理部)
24: Regular prosody generation unit (規則韻律生成部)
24a: Phoneme time length generation unit (音素時間長生成部)
24b: Pitch pattern generation unit (ピッチパターン生成部)
24c: Power generation unit (パワー生成部)
25: Voice input unit (音声入力部)
26: Speech prosody extraction unit (音声韻律抽出部)
26a: Phoneme time length extraction unit (音素時間長抽出部)
26b: Pitch pattern extraction unit (ピッチパターン抽出部)
26c: Reliability determination unit (信頼度判定部)
26d: Power extraction unit (パワー抽出部)
27, 41: Modified prosody generation unit (修正韻律生成部)
27a: Prosody complementing unit (韻律補完部)
41a: Prosody modification unit (韻律修正部)

Claims

A prosody generation device comprising:
a text input unit to which arbitrary text is input;
a language processing unit that performs language analysis on the text to generate phonetic character string data indicating the reading of the text;
a regular prosody generation unit that generates a regular prosody pattern indicating the prosody of the text based on the phonetic character string data and prosody generation rules;
a voice input unit that converts the voice of a human reading the text aloud into speech data;
a speech prosody extraction unit that extracts, from the speech data, a speech prosody pattern indicating the prosody of the human voice;
a reliability determination unit that acquires the reliability of the extraction performed when the speech prosody extraction unit extracts the speech prosody pattern from the speech data, determines that a part of the speech prosody pattern whose reliability is greater than or equal to a threshold is a pattern extracted with high reliability by the speech prosody extraction unit, and determines that a part of the speech prosody pattern whose reliability is less than the threshold is a pattern extracted with low reliability by the speech prosody extraction unit; and
a modified prosody generation unit that generates a modified prosody pattern based on the regular prosody pattern and the part of the speech prosody pattern extracted with high reliability, in place of the part of the speech prosody pattern extracted with low reliability.

The prosody generation device according to claim 1, wherein the modified prosody generation unit includes a prosody complementing unit that deforms the regular prosody pattern so that it approximates the part of the speech prosody pattern extracted with high reliability by the speech prosody extraction unit, and generates the modified prosody pattern by connecting the deformed regular prosody pattern to that highly reliable part.

The prosody generation device according to claim 1, wherein the modified prosody generation unit includes a prosody modification unit that deforms the regular prosody pattern so that it approximates the part of the speech prosody pattern extracted with high reliability by the speech prosody extraction unit, and generates the modified prosody pattern from the deformed regular prosody pattern alone, without using the highly reliable part itself.

The prosody generation device according to any one of claims 1 to 3, wherein the regular prosody pattern, the speech prosody pattern, and the modified prosody pattern are pitch patterns representing a change pattern of voice pitch.

A prosody editing system comprising:
the prosody generation device according to any one of claims 1 to 4; and
a GUI device for editing at least one of the phonetic character string data and the modified prosody pattern generated by the prosody generation device.

A prosody generation method comprising:
a text input step in which a text input unit of a computer receives arbitrary text;
a language processing step in which a language processing unit of the computer performs language analysis on the text to generate phonetic character string data indicating the reading of the text;
a regular prosody generation step in which a regular prosody generation unit of the computer generates a regular prosody pattern indicating the prosody of the text based on the phonetic character string data and prosody generation rules;
a voice input step in which a voice input unit of the computer converts the voice of a human reading the text aloud into speech data;
a speech prosody extraction step in which a speech prosody extraction unit of the computer extracts, from the speech data, a speech prosody pattern indicating the prosody of the human voice;
a reliability determination step in which a reliability determination unit of the computer acquires the reliability of the extraction performed when the speech prosody pattern is extracted from the speech data in the speech prosody extraction step, determines that a part of the speech prosody pattern whose reliability is greater than or equal to a threshold is a pattern extracted with high reliability in the speech prosody extraction step, and determines that a part of the speech prosody pattern whose reliability is less than the threshold is a pattern extracted with low reliability in the speech prosody extraction step; and
a modified prosody generation step in which a modified prosody generation unit of the computer generates a modified prosody pattern based on the regular prosody pattern and the part of the speech prosody pattern extracted with high reliability, in place of the part of the speech prosody pattern extracted with low reliability.

A prosody generation program that causes a computer to execute:
text input processing in which arbitrary text is input;
language processing that performs language analysis on the text to generate phonetic character string data indicating the reading of the text;
regular prosody generation processing that generates a regular prosody pattern indicating the prosody of the text based on the phonetic character string data and prosody generation rules;
voice input processing that converts the voice of a human reading the text aloud into speech data;
speech prosody extraction processing that extracts, from the speech data, a speech prosody pattern indicating the prosody of the human voice;
reliability determination processing that acquires the reliability of the extraction performed when the speech prosody pattern is extracted from the speech data in the speech prosody extraction processing, determines that a part of the speech prosody pattern whose reliability is greater than or equal to a threshold is a pattern extracted with high reliability by the speech prosody extraction processing, and determines that a part of the speech prosody pattern whose reliability is less than the threshold is a pattern extracted with low reliability by the speech prosody extraction processing; and
modified prosody generation processing that generates a modified prosody pattern based on the regular prosody pattern and the part of the speech prosody pattern extracted with high reliability, in place of the part of the speech prosody pattern extracted with low reliability.