JP2005309164A

JP2005309164A - Device for encoding data for read-aloud and program for encoding data for read-aloud

Info

Publication number: JP2005309164A
Application number: JP2004127475A
Authority: JP
Inventors: Tomoyasu Komori; 智康小森; Hiroyuki Segi; 寛之世木; Yoshiaki Shishikui; 善明鹿喰; Kazuhisa Iguchi; 和久井口; Shuichi Aoki; 秀一青木
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2004-04-23
Filing date: 2004-04-23
Publication date: 2005-11-04

Abstract

PROBLEM TO BE SOLVED: To provide a device for encoding data for read-aloud which can encode a composition mixed with Chinese characters and the Japanese syllabary (text data) while suppressing the amount of data of a dictionary file for decoding and enabling the dictionary file to be changed, without necessitating a dedicated decoding device (decoder) at the time of decoding, and to provide a program for encoding data for read-aloud.

SOLUTION: The device for encoding data for read-aloud 1 is a device which inputs the composition mixed with Chinese characters and the Japanese syllabary as data for read-aloud and encodes the data into voice streams, and is equipped with: a word dictionary accumulating means 7; a morpheme analyzing means 5; a rhythm predicting means 9; a phonemic sequence selection means 11; a stream shaping means 13; and a stream connecting means 15.

COPYRIGHT: (C)2006, JPO & NCIPI

Description

The present invention relates to a read-aloud data encoding device and a read-aloud data encoding program that encode a kanji-kana mixed sentence, input as read-aloud data, into an audio stream.

In recent years, there have been many attempts to improve the quality of synthesized speech that reproduces how text data, such as a kanji-kana mixed sentence, sounds when spoken. These attempts draw on large amounts of speech data (for example, recordings of many speakers uttering a variety of sentences) and apply various techniques (see, for example, Non-Patent Documents 1 and 2 and Patent Document 1).

For example, speech synthesis has been performed by recording news read aloud by announcers, building a speech waveform database from the recordings, and using that database (a large amount of speech data) (for example, Japanese Patent Application No. 2003-296584, "Speech synthesis method, speech synthesis device, and speech synthesis program").
Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Osamu Imaichi, Tomoaki Imamura, "Japanese Morphological Analysis System ChaSen version 1.5 Manual", 1997, all pages
"Synthesis of sentence speech based on a model of the fundamental frequency pattern generation process", IEICE Transactions A, Vol. J72-A, No. 1, pp. 32-40, January 1989
JP 2003-271199 A (paragraphs 0029 to 0033, FIG. 1)

However, the speech synthesis techniques attempted in recent years (methods, devices, or programs) all convert a kanji-kana mixed sentence directly into a speech signal. When that speech signal is transmitted, the receiving side needs a special standalone device to decode it. In other words, when an audio stream in which the speech signal has been TTS (text-to-speech synthesis) encoded is received for playback, it cannot be converted back into the original speech signal and played unless a dedicated decoding device is available.

In addition, when a high-quality encoded audio stream is decoded on the receiving side, the decoding dictionary file that the receiving side must hold becomes enormous.

Furthermore, when the dedicated decoding device is a standalone device, the decoding dictionary file is built in and is not intended to be changed (updated), so it is difficult to switch to a different dictionary file.

The present invention therefore aims to solve these problems by providing a read-aloud data encoding device and a read-aloud data encoding program that can encode a kanji-kana mixed sentence (text data) in such a way that decoding requires no dedicated decoding device, the data size of the decoding dictionary file is kept small, and the dictionary file can be changed.

To solve the above problems, the read-aloud data encoding device according to claim 1 is a device that inputs a kanji-kana mixed sentence as read-aloud data and encodes the read-aloud data into an audio stream, and it comprises a phoneme string selection means, a stream shaping means, and a stream connection means.

With this configuration, the read-aloud data encoding device uses the phoneme string selection means to output a phoneme string selection signal to a speech encoding device, which is an external device that stores phoneme strings and encodes them into audio streams. The signal selects the phoneme strings to be encoded, based on the kanji-kana mixed sentence, the result of morphological analysis of that sentence, and the phoneme symbols obtained by transcribing the sentence into phonemes. A kanji-kana mixed sentence is Japanese text, that is, text data. Because the audio stream that the device encodes and connects (the connected audio stream) is used for speech synthesis on the receiving side, the kanji-kana mixed sentence, the analysis result, and the phoneme symbols can be regarded as data specialized for speech synthesis. The phoneme string selection means also outputs a control signal that controls how the audio streams encoded by the speech encoding device are to be connected. The speech encoding device encodes phoneme strings into audio streams (encoded speech information; for example, AAC audio streams) and outputs them.

The read-aloud data encoding device then uses the stream shaping means to cut out (extract) the necessary portions from the audio streams output by the speech encoding device, thereby shaping them. The stream connection means connects the shaped audio streams according to the control signal output from the phoneme string selection means and outputs the result as a connected audio stream. The necessary portions are those used by the stream connection means, and they are determined by the control signal input to the stream connection means. Accordingly, when the stream shaping means cuts out the necessary portions, the control signal reaches it either directly from the phoneme string selection means or via the stream connection means.

The read-aloud data encoding device according to claim 2 is a device that inputs a kanji-kana mixed sentence as read-aloud data and encodes the read-aloud data into an audio stream, and it comprises a word dictionary storage means, a morphological analysis means, a prosody prediction means, a phoneme string selection means, a stream shaping means, and a stream connection means.

With this configuration, the read-aloud data encoding device uses the morphological analysis means to analyze the kanji-kana mixed sentence morphologically, referring to the word dictionary held in the word dictionary storage means, and to output the sentence as phoneme symbols, that is, in phoneme notation. The word dictionary contains at least data on the readings of words. Besides readings, it may contain, for example, data on each word's part of speech and accent type, on connection probabilities between words, and on connection probabilities between parts of speech. Next, the prosody prediction means predicts prosodic symbols representing the prosody of the kanji-kana mixed sentence, based on the analysis result from the morphological analysis means and on the phoneme symbols. Prosodic symbols carry information about accent and intonation.
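The dictionary lookup described above can be illustrated with a small sketch. It is not the analysis method of the patent: it uses greedy longest-match segmentation as a stand-in for a real analyzer (which would also use the connection probabilities mentioned above), and the dictionary entries, romanized readings, and accent-type values are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    reading: str      # phoneme notation (romanized here for illustration)
    pos: str          # part of speech
    accent_type: int  # accent type (illustrative value)


# Hypothetical toy word dictionary; not taken from the patent.
WORD_DICT = {
    "今日": Entry("kyo:", "noun", 1),
    "は": Entry("wa", "particle", 0),
    "晴れ": Entry("hare", "noun", 2),
    "です": Entry("desu", "auxiliary", 1),
}


def analyze(text, dictionary=WORD_DICT):
    """Segment text by greedy longest match; return (surface, Entry) pairs."""
    result = []
    max_len = max(len(surface) for surface in dictionary)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            surface = text[i:i + n]
            if surface in dictionary:
                result.append((surface, dictionary[surface]))
                i += n
                break
        else:
            raise KeyError(f"no dictionary entry at position {i}: {text[i]!r}")
    return result


def phoneme_symbols(morphemes):
    """Extract the phoneme notation of each morpheme, in order."""
    return [entry.reading for _, entry in morphemes]
```

A prosody predictor in this sketch would take the list returned by `analyze` (parts of speech and accent types) together with the output of `phoneme_symbols` as its input.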

The read-aloud data encoding device then uses the phoneme string selection means to output a phoneme string selection signal to a speech encoding device, which is an external device that stores phoneme strings and encodes them into audio streams. The signal selects the phoneme strings to be encoded, based on the kanji-kana mixed sentence, the analysis result, the phoneme symbols, and the prosodic symbols. The phoneme string selection means also outputs a control signal that controls how the audio streams encoded by the speech encoding device are to be connected. The speech encoding device encodes phoneme strings into audio streams (encoded speech information; for example, AAC audio streams) and outputs them.

The read-aloud data encoding device then uses the stream shaping means to cut out (extract) the necessary portions from the audio streams output by the speech encoding device, thereby shaping them. The stream connection means connects the shaped audio streams according to the control signal output from the phoneme string selection means and outputs the result as a connected audio stream.

The read-aloud data encoding device according to claim 3 is a device that inputs a kanji-kana mixed sentence as read-aloud data and encodes the read-aloud data into an audio stream, and it comprises a phoneme-string encoded data dictionary storage means, a phoneme string selection means, an audio stream output means, a stream shaping means, and a stream connection means.

With this configuration, the read-aloud data encoding device uses the phoneme string selection means to select phoneme strings held in the phoneme-string encoded data dictionary storage means, based on the kanji-kana mixed sentence, the result of morphological analysis of that sentence, and the phoneme symbols obtained by transcribing the sentence into phonemes. The phoneme-string encoded data dictionary storage means holds a dictionary of phoneme-string encoded data, which is obtained by breaking the words that make up kanji-kana mixed sentences into phonemes and encoding the resulting strings of consecutive phonemes. The phoneme string selection means also outputs a control signal that controls how the phoneme-string encoded data (audio streams) are to be connected.

The read-aloud data encoding device then uses the audio stream output means to output, as an audio stream, the phoneme-string encoded data corresponding to each phoneme string selected by the phoneme string selection means. The stream shaping means cuts out the necessary portions from the output audio stream, thereby shaping it.

After that, the read-aloud data encoding device uses the stream connection means to connect the audio streams shaped by the stream shaping means, according to the control signal output from the phoneme string selection means, and outputs the result as a connected audio stream.
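The flow of claim 3 (select a phoneme string, output its encoded data, cut out the needed portion, connect according to the control signal) can be sketched as follows. The text does not specify the format of the control signal, so this sketch assumes it is an ordered list of segments, each naming a dictionary key and a frame range. The dictionary contents and all names (`Segment`, `CODED_DICT`, and so on) are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One entry of the (assumed) control signal."""
    key: str    # phoneme string to look up in the dictionary
    start: int  # index of the first encoded frame to keep
    stop: int   # one past the last encoded frame to keep


# Hypothetical phoneme-string encoded data dictionary:
# phoneme string -> list of already-encoded frames (placeholder bytes).
CODED_DICT = {
    "kyo:wa": [b"k0", b"k1", b"k2", b"k3"],
    "haredesu": [b"h0", b"h1", b"h2"],
}


def output_stream(key, dictionary):
    """Audio stream output: return the encoded frames for a phoneme string."""
    return dictionary[key]


def shape(frames, segment):
    """Stream shaping: keep only the frames the connection step will use."""
    return frames[segment.start:segment.stop]


def connect(control, dictionary):
    """Stream connection: join shaped streams in control-signal order."""
    connected = []
    for segment in control:
        frames = output_stream(segment.key, dictionary)
        connected.extend(shape(frames, segment))
    return b"".join(connected)
```

For example, `connect([Segment("kyo:wa", 1, 3), Segment("haredesu", 0, 2)], CODED_DICT)` joins the two middle frames of the first entry with the first two frames of the second. No audio encoding takes place at this stage, because the dictionary already holds encoded data.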

The read-aloud data encoding device according to claim 4 is a device that inputs a kanji-kana mixed sentence as read-aloud data and encodes the read-aloud data into an audio stream, and it comprises a word dictionary storage means, a morphological analysis means, a prosody prediction means, a phoneme-string encoded data dictionary storage means, a phoneme string selection means, an audio stream output means, a stream shaping means, and a stream connection means.

With this configuration, the read-aloud data encoding device uses the morphological analysis means to analyze the kanji-kana mixed sentence morphologically, referring to the word dictionary held in the word dictionary storage means, and to output the sentence as phoneme symbols. Next, the prosody prediction means predicts prosodic symbols representing the prosody of the kanji-kana mixed sentence, based on the analysis result from the morphological analysis means and on the phoneme symbols.

The read-aloud data encoding device then uses the phoneme string selection means to select phoneme strings held in the phoneme-string encoded data dictionary storage means, based on the kanji-kana mixed sentence, the analysis result, the phoneme symbols, and the prosodic symbols. The phoneme-string encoded data dictionary storage means holds a dictionary of phoneme-string encoded data, which is obtained by breaking the words that make up kanji-kana mixed sentences into phonemes and encoding the resulting strings of consecutive phonemes. The phoneme string selection means also outputs a control signal that controls how the phoneme-string encoded data (audio streams) are to be connected.

The read-aloud data encoding device according to claim 5 is the device according to claim 3 or claim 4, characterized in that the phoneme-string encoded data dictionary storage means holds a plurality of phoneme-string encoded data dictionaries that differ in at least one of speaker and speech rate, and in that the audio stream output means includes an output switching means that switches between the phoneme-string encoded data held in those dictionaries when outputting.

With this configuration, the read-aloud data encoding device holds, in the phoneme-string encoded data dictionary storage means, a plurality of phoneme-string encoded data dictionaries that differ in at least one of speaker and speech rate. By switching between them with the output switching means, it can output the phoneme-string encoded data best suited to the kanji-kana mixed sentence.
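A minimal sketch of such switching is given below, assuming each dictionary is keyed by a (speaker, speech rate) pair. The class name `OutputSwitch`, the keys, and the placeholder byte strings are hypothetical; the text does not describe how the dictionaries are indexed or what criterion picks the "best suited" one.

```python
class OutputSwitch:
    """Holds several phoneme-string encoded data dictionaries and selects one."""

    def __init__(self, dictionaries, speaker, rate):
        self._dictionaries = dictionaries
        self._current = None
        self.switch(speaker, rate)

    def switch(self, speaker=None, rate=None):
        """Change speaker, speech rate, or both; omitted values are kept."""
        old_speaker, old_rate = self._current or (None, None)
        key = (speaker or old_speaker, rate or old_rate)
        if key not in self._dictionaries:
            raise KeyError(f"no dictionary for {key}")
        self._current = key

    def output(self, phoneme_string):
        """Return encoded data for a phoneme string from the active dictionary."""
        return self._dictionaries[self._current][phoneme_string]


# Hypothetical dictionaries; the byte strings stand in for encoded audio.
DICTIONARIES = {
    ("announcer_a", "normal"): {"kyo:wa": b"A-normal"},
    ("announcer_a", "fast"): {"kyo:wa": b"A-fast"},
    ("announcer_b", "normal"): {"kyo:wa": b"B-normal"},
}
```

Calling `switch(rate="fast")` keeps the current speaker and changes only the speech rate, which mirrors the "speaker, speech rate, or both" wording of the claim.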

The read-aloud data encoding device according to claim 6 is the device according to any one of claims 1 to 5, characterized in that the stream shaping means shapes the audio stream into minimum decodable units, and the stream connection means connects those units to form the connected audio stream.

With this configuration, the stream shaping means shapes the audio stream into the smallest units that can be decoded, so the stream connection means can connect these unit-sized audio streams easily. Once the connected audio stream formed from these units has been transmitted and received, decoding it reproduces the kanji-kana mixed sentence as a speech signal.
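As one possible illustration of shaping into units, the sketch below splits an AAC stream into frames, assuming the stream is carried in ADTS framing. ADTS is an assumption made here for the example; the text only gives AAC as an example codec and does not name a container or define the unit. The frame length is read from the 13-bit field in the ADTS header, and `make_test_frame` is a hypothetical helper that builds placeholder frames (not real AAC payloads).

```python
ADTS_HEADER_LEN = 7  # ADTS header length in bytes when no CRC is present


def split_adts_frames(data: bytes) -> list:
    """Split an ADTS byte stream into individual frames (header + payload)."""
    frames = []
    pos = 0
    while pos < len(data):
        header = data[pos:pos + ADTS_HEADER_LEN]
        # Every frame starts with the 12-bit syncword 0xFFF.
        if (len(header) < ADTS_HEADER_LEN or header[0] != 0xFF
                or (header[1] & 0xF0) != 0xF0):
            raise ValueError(f"no ADTS syncword at byte {pos}")
        # 13-bit frame length (header included), spread over bytes 3 to 5.
        length = ((header[3] & 0x03) << 11) | (header[4] << 3) | (header[5] >> 5)
        if length < ADTS_HEADER_LEN or pos + length > len(data):
            raise ValueError(f"bad frame length {length} at byte {pos}")
        frames.append(data[pos:pos + length])
        pos += length
    return frames


def make_test_frame(payload: bytes) -> bytes:
    """Build a placeholder ADTS frame around an arbitrary payload."""
    length = ADTS_HEADER_LEN + len(payload)
    header = bytes([
        0xFF, 0xF1,                      # syncword, MPEG-4, no CRC
        0x50,                            # AAC LC, 44.1 kHz
        0x80 | (length >> 11),           # 2 channels + length bits 12..11
        (length >> 3) & 0xFF,            # length bits 10..3
        ((length & 0x07) << 5) | 0x1F,   # length bits 2..0 + buffer fullness
        0xFC,                            # buffer fullness, one raw data block
    ])
    return header + payload
```

Once a stream is held as a list of such frames, selecting a portion and joining portions reduce to list slicing and concatenation.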

The read-aloud data encoding device according to claim 7 is the device according to any one of claims 1 to 6, characterized in that the stream connection means connects the audio streams so that they overlap on the time axis by a preset amount of time.

With this configuration, the stream connection means connects the audio streams to one another so that they overlap on the time axis by a preset amount of time, and outputs the result as the connected audio stream. The preset amount of time is, for example, 50% of the length of one unit of the audio stream.
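The arithmetic of such an overlap can be sketched as follows. The function names are hypothetical, and the unit length of 2048 samples in the usage note is only an illustrative value; the text specifies the 50% figure as an example but not the unit length.

```python
def unit_offsets(n_units, unit_len, overlap_ratio=0.5):
    """Start position of each unit on the time axis, in samples."""
    # Each unit starts before the previous one ends, by the overlap amount.
    hop = unit_len - int(unit_len * overlap_ratio)
    return [i * hop for i in range(n_units)]


def total_length(n_units, unit_len, overlap_ratio=0.5):
    """Length of the connected stream, in samples."""
    if n_units == 0:
        return 0
    return unit_offsets(n_units, unit_len, overlap_ratio)[-1] + unit_len
```

With three units of 2048 samples and a 50% overlap, the units start at samples 0, 1024, and 2048, and the connected stream spans 4096 samples instead of the 6144 that end-to-end connection would give.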

The read-aloud data encoding device according to claim 8 is the device according to any one of claims 1 to 7, characterized in that the stream connection means includes a quantization precision changing means that changes the quantization precision near the connection point when connecting the audio streams.

With this configuration, the quantization precision changing means changes the quantization precision near the connection point when audio streams are connected, so audio streams that carry different quantization precision information can be connected.

The read-aloud data encoding program according to claim 9 causes a computer to function as a phoneme string selection means, a stream shaping means, and a stream connection means in order to input a kanji-kana mixed sentence as read-aloud data and encode the read-aloud data into an audio stream.

With this configuration, the read-aloud data encoding program uses the phoneme string selection means to select the phoneme strings to be encoded, based on the kanji-kana mixed sentence, the result of morphological analysis of that sentence, and the phoneme symbols obtained by transcribing the sentence into phonemes. The program then uses the stream shaping means to cut out the necessary portions from the audio streams output by the speech encoding device, and uses the stream connection means to connect the audio streams and output them as a connected audio stream.

According to the inventions of claims 1 and 9, the phoneme strings to be encoded are selected based on the kanji-kana mixed sentence, the result of morphological analysis of that sentence, and the phoneme symbols obtained by transcribing the sentence into phonemes. The necessary portions are then cut out from the audio streams output by the speech encoding device, and the audio streams are connected and output as a connected audio stream. As a result, the kanji-kana mixed sentence can be encoded so that decoding the connected audio stream requires no dedicated decoding device, and an existing decoding device can reproduce the speech of the sentence being spoken.

According to the invention of claim 2, prosodic symbols representing the prosody of the kanji-kana mixed sentence when spoken are predicted from the analysis result, obtained by morphological analysis of the input sentence with reference to the word dictionary, and from the phoneme symbols obtained by transcribing the sentence into phonemes. The phoneme strings to be encoded are selected based on the kanji-kana mixed sentence, the analysis result, the phoneme symbols, and the prosodic symbols. The necessary portions are then cut out from the audio streams output by the speech encoding device, and the audio streams are connected and output as a connected audio stream. As a result, the kanji-kana mixed sentence can be encoded so that decoding the connected audio stream requires no dedicated decoding device, and an existing decoding device can reproduce the speech of the sentence being spoken.

According to the invention of claim 3, phoneme strings contained in the phoneme-string encoded data dictionary are selected based on the kanji-kana mixed sentence, the result of morphological analysis of that sentence, and the phoneme symbols obtained by transcribing the sentence into phonemes. The phoneme-string encoded data corresponding to each selected phoneme string is output as an audio stream, and the necessary portions are cut out from that audio stream to shape it. The shaped audio streams are then connected according to the control signal and output as a connected audio stream. As a result, the kanji-kana mixed sentence can be encoded so that decoding the connected audio stream requires no dedicated decoding device, and an existing decoding device can reproduce the speech of the sentence being spoken. In addition, because the device has the phoneme-string encoded data dictionary, the kanji-kana mixed sentence (text data) can be encoded while keeping down the data size of the dictionary file used for decoding.

According to the invention of claim 4, the kanji-kana mixed sentence is analyzed morphologically with reference to the word dictionary, and prosodic symbols representing the prosody of the sentence are predicted from the analysis result and from the phoneme symbols obtained by transcribing the sentence into phonemes. Phoneme strings contained in the phoneme-string encoded data dictionary are then selected based on the kanji-kana mixed sentence, the analysis result, the phoneme symbols, and the prosodic symbols. The phoneme-string encoded data corresponding to each selected phoneme string is output as an audio stream, and the necessary portions are cut out from that audio stream to shape it. The shaped audio streams are then connected according to the control signal and output as a connected audio stream. As a result, the kanji-kana mixed sentence can be encoded so that decoding the connected audio stream requires no dedicated decoding device, and an existing decoding device can reproduce the speech of the sentence being spoken. In addition, because the device has the phoneme-string encoded data dictionary, the kanji-kana mixed sentence (text data) can be encoded while keeping down the data size of the dictionary file used for decoding.

According to the invention of claim 5, a plurality of phoneme-string encoded data dictionaries that differ in at least one of speaker and speech rate are held and switched between. This makes it possible to change the speaker, the speech rate, or both, and to output the phoneme-string encoded data best suited to the kanji-kana mixed sentence.

According to the invention of claim 6, the audio stream is shaped into minimum decodable units, and those units can be connected easily.

According to the invention of claim 7, the stream connection means overlaps the audio streams on the time axis by a preset amount of time. In the case of AAC, for example, this allows time-domain aliasing cancellation to take place, which reduces aliasing noise at the connection points between audio streams. In addition, because the unit-sized audio streams overlap, the speech signals decoded from the connected audio stream are also superimposed, which softens abrupt changes in the speech signal. Softening these abrupt changes in turn makes it easy to improve the quality of the speech signal.
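The principle of time-domain aliasing cancellation can be checked numerically with a small sketch. It uses a plain, unwindowed MDCT with frames of length 2N and a hop of N, which is a simplification chosen for clarity: AAC applies analysis and synthesis windows and quantizes the coefficients, neither of which is modeled here. The function names and the tiny frame size are illustrative only.

```python
import math


def mdct(frame):
    """Unwindowed MDCT: 2N time samples -> N coefficients."""
    n = len(frame) // 2
    return [
        sum(frame[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
            for t in range(2 * n))
        for k in range(n)
    ]


def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N time samples (with aliasing)."""
    n = len(coeffs)
    return [
        sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
            for k in range(n)) / n
        for t in range(2 * n)
    ]


def overlap_add(signal, n):
    """Transform 50%-overlapping frames and add the inverse transforms."""
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n + 1, n):
        decoded = imdct(mdct(signal[start:start + 2 * n]))
        for t, value in enumerate(decoded):
            out[start + t] += value
    return out
```

For a 12-sample signal and N = 4, samples 4 to 7 are covered by two frames and come back exactly, because the aliasing terms of the two frames cancel. Samples 0 to 3 and 8 to 11 are covered by only one frame and remain distorted, which is the kind of artifact that arises at a connection point when the neighboring frame does not match.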

According to the invention of claim 8, the quantization precision near the connection point is changed when audio streams are connected, so audio streams that carry different quantization information can be connected.

Next, embodiments of the present invention are described in detail, with reference to the drawings as appropriate.
For the read-aloud data encoding devices of two embodiments (the first embodiment and the second embodiment), the configuration of each device is described first, followed by its operation.

<Configuration of the Read-Aloud Data Encoding Device [First Embodiment]>
FIG. 1 is a block diagram of the read-aloud data encoding device (first embodiment). As shown in FIG. 1, the read-aloud data encoding device 1 inputs text data, such as a kanji-kana mixed sentence, as read-aloud data and encodes that read-aloud data. It comprises a morphological analysis means 5, a word dictionary storage means 7, a prosody prediction means 9, a phoneme string selection means 11, a stream shaping means 13, and a stream connection means 15.

A speech encoding device 3 and a stream distribution device 2 are connected to the read-aloud data encoding device 1. These two devices are described before the configuration of the read-aloud data encoding device 1.

The speech encoding apparatus 3 sequentially encodes the phoneme strings selected on the basis of the phoneme string selection signal output from the reading data encoding apparatus 1, and outputs the result to the apparatus 1. It comprises phoneme string storage means 17 and speech encoding means 19.

The phoneme string storage means 17 consists of a storage medium such as a hard disk and a control function that controls data input and output. It stores speech/analysis data, which pairs speech (speech strings) uttered in advance by a specific speaker with analysis data (including phoneme strings) obtained by morphological analysis of that speech. The speech/analysis data is used to create, from the kanji-kana mixed sentence input to the reading data encoding apparatus 1, an audio stream that the decoding side can reproduce as high-quality synthesized speech. It can be regarded as dictionary data containing a large number of words and a variety of phrasings.

When the phoneme string storage means 17 receives the phoneme string selection signal output from the reading data encoding apparatus 1, its control function operates and sequentially outputs the speech/analysis data containing the corresponding phoneme strings to the speech encoding means 19.

In this embodiment, the phoneme strings stored in the phoneme string storage means 17 come from the speech waveform database described in Japanese Patent Application No. 2003-296584 (speech synthesis method, speech synthesis apparatus, and speech synthesis program), which contains recordings of news read aloud by announcers.

The speech encoding means 19 sequentially encodes (speech-encodes) the speech/analysis data output from the phoneme string storage means 17 and outputs it to the reading data encoding apparatus 1 as an audio stream.

In this embodiment the speech encoding apparatus 3 is separate from the reading data encoding apparatus 1, but it may instead be incorporated into the reading data encoding apparatus 1.

The stream distribution apparatus 2 distributes the connected audio stream output from the reading data encoding apparatus 1 to a single device or to many devices on the receiving (decoding) side over a communication line (network) or the like.

In this embodiment, when the audio stream output from the reading data encoding apparatus 1 is an AAC (Advanced Audio Coding; the MPEG-2 transform coding scheme) audio stream (hereinafter, AAC audio stream) and the stream distribution apparatus 2 connects to an AAC-compatible receiving device (not shown), the stream distribution apparatus 2 and the receiving device are connected by an optical cable, and the AAC audio stream is multiplexed into the band that the optical cable can carry and then distributed (for details, see JEITA CPX-4141, AAC digital interface).

When the audio stream output from the reading data encoding apparatus 1 is an MP3 audio stream and the stream distribution apparatus 2 connects to an MP3-compatible receiving device (not shown), the stream is distributed by a known (existing) method, for example one that uses RTP.

Incidentally, RTP (Real-time Transport Protocol) is a data communication protocol designed to transfer video and audio signals in a form suited to real-time delivery. It divides the data (video signals, audio signals) into packets by time unit and adds the timing information of the data to each packet before transfer.
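
The packetization described above can be sketched as follows. This is a minimal illustration, not the distribution method of the stream distribution apparatus: it prepends only the 12-byte RTP fixed header, and the payload type 96, the SSRC value, and the helper names `make_rtp_packet` and `packetize` are assumptions chosen for the example.

```python
import struct

def make_rtp_packet(payload: bytes, seq: int, timestamp: int,
                    ssrc: int = 0x12345678, payload_type: int = 96) -> bytes:
    """Prepend a 12-byte RTP fixed header (version 2, no padding, extension or CSRC)."""
    first_byte = 2 << 6                 # V=2, P=0, X=0, CC=0
    second_byte = payload_type & 0x7F   # M=0, 7-bit payload type (96 is an assumed dynamic type)
    header = struct.pack("!BBHII", first_byte, second_byte,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

def packetize(chunks, ticks_per_chunk: int):
    """Put each time-unit chunk in its own packet and advance the timestamp per chunk."""
    return [make_rtp_packet(chunk, seq=i, timestamp=i * ticks_per_chunk)
            for i, chunk in enumerate(chunks)]
```

Each chunk carries its own timestamp, so a receiver can place the audio on the time axis even if packets arrive out of order.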

Each component of the reading data encoding apparatus 1 is now described.
The morphological analysis means 5 performs morphological analysis on the input kanji-kana mixed sentence (an ordinary Japanese text, that is, text data) by referring to the word dictionary stored in the word dictionary storage means 7. Various methods have been proposed for morphological analysis; this embodiment uses the method given in the user manual for version 1.5 of the Japanese morphological analysis system "ChaSen" (茶筌), cited as Non-Patent Document 1.

The analysis result produced by the morphological analysis means 5 contains at least one item of information on each word in the kanji-kana mixed sentence, such as its part of speech, accent type, reading, or dependency. In a kanji-kana mixed sentence (ordinary Japanese), sentence structure is built from relations in which one phrase modifies, that is, depends on, another. "Dependency" here refers to a dependency grammar that defines the relation between the modifying phrase and the phrase that is modified.

The morphological analysis means 5 also generates phoneme symbols (a phoneme symbol string), which are a phonemic transcription of the input kanji-kana mixed sentence.

The analysis result (information on each word's part of speech, accent type, reading, dependency, and so on) and the phoneme symbols are output to the prosody prediction means 9 and the phoneme string selection means 11.

The word dictionary storage means 7 consists of a storage medium such as a hard disk and stores the word dictionary that the morphological analysis means 5 consults during morphological analysis. The word dictionary contains at least data on the readings of the words used in the kanji-kana mixed sentences input to the reading data encoding apparatus 1. In this embodiment it additionally contains data on each word's part of speech and accent type, on the connection probabilities between words, and on the connection probabilities between parts of speech.

"Part of speech" means noun, verb, adverb, adjective, and so on; identification information indicating the part of speech is attached to each word.

"Accent type" classifies words by whether they have an accent nucleus and by the number of accent nuclei. Japanese accent is usually perceived as having two levels, high and low, and the pitch moves from high to low immediately after the accented mora (a mora corresponds roughly to one kana character). The point where this fall occurs is called the accent nucleus. An n-mora word either has no accent nucleus or has one of n accent types (n patterns) determined by the position of the nucleus, where n is a natural number. A word whose accent falls on the k-th mora is called type k, and a word with no accent nucleus is called type 0.
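
As a rough illustration of the type-k naming, the sketch below maps an accent type to a high/low pattern over the morae of a word. It is a deliberate simplification that models only the fall after the accent nucleus described above; the function name `pitch_pattern` and the "H"/"L" notation are assumptions made for the example.

```python
def pitch_pattern(n_morae: int, accent_type: int) -> str:
    """Return an H/L string for an n-mora word of the given accent type.

    Simplified model: morae up to and including the nucleus are high (H),
    morae after it are low (L), and a type-0 word (no nucleus) never falls.
    """
    if n_morae < 1:
        raise ValueError("a word needs at least one mora")
    if not 0 <= accent_type <= n_morae:
        raise ValueError("accent type must be between 0 and n_morae")
    if accent_type == 0:
        return "H" * n_morae
    return "H" * accent_type + "L" * (n_morae - accent_type)
```

For a 3-mora word, type 1 falls after the first mora and type 2 after the second, while type 0 shows no fall inside the word.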

"Reading" is the pronunciation of a word written in romaji. "Connection probability between words" is the probability that two words connect to each other. "Connection probability between parts of speech" is the probability that two parts of speech connect to each other.

The prosody prediction means 9 predicts the duration of each phoneme, the fundamental frequency pattern, the pause length (length of silence), the intonation, and so on, on the basis of the analysis result from the morphological analysis means 5 and the phoneme symbol string (phonetic symbol string) that transcribes the kanji-kana mixed sentence. It outputs the resulting prosodic symbols to the phoneme string selection means 11.

The prosody prediction means 9 uses the prosody prediction method (the method for outputting prosodic symbols) disclosed in "Synthesis of sentence speech based on a fundamental frequency pattern generation process model" (IEICE Transactions A, Vol. J72-A, No. 1, pp. 32-40, January 1989).

The phoneme string selection means 11 outputs a phoneme string selection control signal, which selects the phoneme strings to be encoded, to the phoneme string storage means 17 of the speech encoding apparatus 3. The signal is based on the input kanji-kana mixed sentence, the analysis result and phoneme symbols output from the morphological analysis means 5, and the prosodic symbols output from the prosody prediction means 9.

The phoneme string selection control signal selects, from the phoneme string storage means 17 of the speech encoding apparatus 3, each of the phoneme strings, each made up of several morae, into which the kanji-kana mixed sentence has been divided. For example, if the sentence is 「今日はいい天気です。」 ("The weather is nice today.") and it is divided into 「今日」, 「は」, 「いい」, 「天気」, and 「です」, the phoneme strings selected are 「今日」 (kyou), 「は」 (ha), 「いい」 (ii), 「天気」 (tennki), and 「です」 (desu).
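
The selection in this example can be pictured as a lookup per segment. The sketch below is only an illustration: the in-memory table `PHONEME_STORE` stands in for the phoneme string storage means 17, the function name is hypothetical, and real selection would also use the analysis result and the prosodic symbols.

```python
# Hypothetical stand-in for the phoneme string storage means 17.
PHONEME_STORE = {
    "今日": "kyou",
    "は": "ha",
    "いい": "ii",
    "天気": "tennki",
    "です": "desu",
}

def select_phoneme_strings(segments, store=PHONEME_STORE):
    """Return (segment, phoneme string) pairs in sentence order."""
    missing = [s for s in segments if s not in store]
    if missing:
        raise KeyError(f"no stored phoneme string for: {missing}")
    return [(s, store[s]) for s in segments]
```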

The phoneme string selection means 11 also outputs a stream connection control signal (corresponding to the control signal recited in the claims), which controls how audio streams are connected to each other, to the stream connection means 15.

The stream connection control signal contains information on the appropriate length (duration) of speech pauses, the time (timing) at which to connect, and instructions on the connection method. It controls the connection timing and connection method for the audio streams connected by the stream connection means 15.

The connection method instruction is, for example, as follows. Suppose streams are connected by requantizing the frequency coefficients (DCT components) of the ADTS frames to match the gain G with the larger scale factor value S, adding the audio streams of the two windows after, for example, multiplying their frequency coefficients by 1/2, Huffman-coding the result, and converting it to an audio stream at a predetermined bit rate. In that case the instruction specifies the frame numbers of the connection points of the audio streams to be connected and the gains at those connection points.

For example, when each frame has 256 frequency coefficients, and the gain GainA at the connection point of one audio stream A and the gain GainB at the connection point of the other audio stream B are given in advance, the connection method information only needs to specify frame number k of audio stream A and frame number t of audio stream B for the stream connection means 15 to connect audio stream A and audio stream B.
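
One way to picture the contents of the stream connection control signal is as a small record. The sketch below is an assumption about a possible in-memory form, not a format given in the source: the class name, field names, units, and the helper `frames_to_keep` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamConnectionControl:
    pause_ms: int          # length of the speech pause
    connect_at_ms: int     # time (timing) at which to connect
    frame_a: int           # frame number k of audio stream A at the connection point
    frame_b: int           # frame number t of audio stream B at the connection point
    gain_a: float = 0.5    # GainA, assumed to be given in advance
    gain_b: float = 0.5    # GainB, assumed to be given in advance

def frames_to_keep(ctrl: StreamConnectionControl, n_frames_b: int):
    """Frames of A up to k and frames of B from t onward (one possible reading)."""
    return range(0, ctrl.frame_a + 1), range(ctrl.frame_b, n_frames_b)
```

With the gains fixed in advance, only `frame_a` and `frame_b` change from one connection to the next, which matches the case described above.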

Instead of outputting the stream connection control signal from the phoneme string selection means 11 to the stream connection means 15, it is also possible to output to the speech encoding apparatus 3 a phoneme string selection control signal that selects a silent phoneme and a phoneme string matching the duration of that phoneme (a duration corresponding to the length of the speech pause).

The stream shaping means 13 cuts the required portion out of the audio stream output from the speech encoding apparatus 3, shapes it, and outputs the shaped audio stream to the stream connection means 15. The required portion is the portion used by the stream connection means 15 and is determined by the stream connection control signal input to the stream connection means 15. Accordingly, when the stream shaping means 13 cuts out the required portion, the stream connection control signal is input to the stream shaping means 13 either directly from the phoneme string selection means 11 or via the stream connection means 15.

In this embodiment the stream shaping means 13 uses 8-bit ADPCM (Adaptive Differential Pulse Code Modulation). The input audio stream is cut at a point where the ADPCM output of the stream crosses zero and becomes zero, and is then output to the stream connection means 15. This allows the audio streams to be joined without audible discontinuity (a smooth link between phonemes) even if the stream connection means 15 applies no special processing.

ADPCM is a method that compresses the amplitude information of each sample of linear PCM data (the audio signal before it is made into a stream). The point where the ADPCM output of the audio stream crosses zero and becomes zero is the point where the ADPCM output waveform intersects the time axis (horizontal axis), that is, where the amplitude information becomes 0.
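
Cutting at a zero point can be sketched as below. The sketch assumes the amplitude values are already available as a list of integers (it does not decode ADPCM), and the function names are hypothetical.

```python
def find_cut_point(samples, start=0):
    """Index of the first sample at or after `start` whose amplitude is 0, else None."""
    for i in range(start, len(samples)):
        if samples[i] == 0:
            return i
    return None

def cut_at_zero(samples, start=0):
    """Split the samples just before the zero-amplitude point found from `start`."""
    i = find_cut_point(samples, start)
    if i is None:
        return samples, []          # no zero point: leave the stream uncut
    return samples[:i], samples[i:]
```

Because the first piece ends where the waveform reaches zero, joining it to another piece that starts at zero avoids a jump in amplitude at the boundary.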

The stream shaping means 13 also shapes the audio stream output from the speech encoding apparatus 3 into the smallest standalone units that can be decoded. The smallest decodable standalone unit is a unit that is meaningful as an audio stream; one example is the AAC ADTS (Audio Data Transport Stream) frame (see ISO/IEC 13818-7:2003, Chapter 6, Syntax). A single ADTS frame on its own can be interpreted by a decoding device (decoder).

adts_fixed_header, the header part of an ADTS frame, contains the 12-bit syncword "1111 1111 1111". Cutting the audio stream immediately before this syncword divides it into units that are meaningful as ADTS frames.
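
A minimal sketch of splitting a byte stream into ADTS frames follows. It matches on the 12-bit syncword defined in ISO/IEC 13818-7 (0xFFF) and, as an added safeguard that the text above does not describe, uses the 13-bit frame length field in the header to jump to the next frame so that syncword-like bytes inside the payload are not mistaken for frame boundaries. The function names are hypothetical, and `make_adts_frame` exists only to build synthetic input with illustrative field values.

```python
ADTS_HEADER_LEN = 7  # header size when protection_absent = 1 (no CRC)

def make_adts_frame(payload: bytes) -> bytes:
    """Build a synthetic ADTS frame (test helper; field values are illustrative)."""
    length = ADTS_HEADER_LEN + len(payload)
    header = bytearray(ADTS_HEADER_LEN)
    header[0] = 0xFF                                # syncword, upper 8 bits
    header[1] = 0xF9                                # syncword low 4 bits, ID=1, layer=00, protection_absent=1
    header[2] = (1 << 6) | (3 << 2)                 # profile=1, sampling_frequency_index=3
    header[3] = (2 << 6) | ((length >> 11) & 0x03)  # channel_configuration=2, frame length bits 12..11
    header[4] = (length >> 3) & 0xFF                # frame length bits 10..3
    header[5] = ((length & 0x07) << 5) | 0x1F       # frame length bits 2..0, buffer fullness high bits
    header[6] = 0xFC                                # buffer fullness low bits, one raw_data_block
    return bytes(header) + payload

def split_adts_frames(data: bytes):
    """Cut the stream immediately before each syncword and return whole frames."""
    frames, pos = [], 0
    while pos + ADTS_HEADER_LEN <= len(data):
        if data[pos] != 0xFF or (data[pos + 1] & 0xF0) != 0xF0:
            pos += 1                                # not at a syncword: resynchronize byte by byte
            continue
        length = ((data[pos + 3] & 0x03) << 11) | (data[pos + 4] << 3) | (data[pos + 5] >> 5)
        if length < ADTS_HEADER_LEN or pos + length > len(data):
            break                                   # truncated or implausible frame: stop
        frames.append(data[pos:pos + length])
        pos += length
    return frames
```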

However, for the stream connection means 15 to connect AAC audio streams to each other, information such as the profile must be the same in both. Conversely, if the profile is the same, even part of the data of an ADTS frame, for example the data held in the channel_pair_element of the raw_data_block that stores the audio data, can be made interpretable by a decoder by adding a header part and so on to form an ADTS frame. In other words, the stream shaping means 13 treats the ADTS frame as the meaningful unit when shaping.

The stream connection means 15 connects the audio streams shaped by the stream shaping means 13 on the basis of the stream connection control signal output from the phoneme string selection means 11: it first converts the streams into frequency coefficients in the frequency domain and then adds them at the connection point position. It includes quantization accuracy changing means 15a.

When the appropriate speech pause length (duration), the connection time (timing), and the connection method instruction are input to the stream connection means 15 as the stream connection control signal, then with ADPCM the audio streams can be connected (producing a connected audio stream) by joining Null data corresponding to the pause length to the audio stream shaped by the stream shaping means 13.
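
The ADPCM case can be sketched as plain byte concatenation. The sketch assumes one byte per sample (8-bit ADPCM) and that zero-valued bytes serve as the Null data; both are assumptions for illustration, and the function names are hypothetical.

```python
def null_data_for_pause(pause_ms: int, sample_rate_hz: int) -> bytes:
    """Zero-filled bytes covering the pause, assuming one byte per sample."""
    n_samples = sample_rate_hz * pause_ms // 1000
    return bytes(n_samples)

def connect_with_pause(stream_a: bytes, stream_b: bytes,
                       pause_ms: int, sample_rate_hz: int) -> bytes:
    """Join two shaped streams with Null data for the speech pause in between."""
    return stream_a + null_data_for_pause(pause_ms, sample_rate_hz) + stream_b
```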

When audio streams are connected, the quantization accuracy changing means 15a changes the precision of the quantized values at the time the frequency coefficients (DCT coefficients) into which the audio streams were converted are added. A frequency coefficient can be expressed as the product of a quantized value and a gain, and the gain is a value that has the scale factor value as its exponent. With K as the frequency coefficient, R as the quantized value, G as the gain, and S as the scale factor value, this is written K = R × G^S.

That is, changing the precision of the quantized value means changing the quantized value R by changing (adjusting) the scale factor value S.
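
A minimal sketch of this adjustment follows. It assumes a linear model K = R × G^S with base G = 2^(1/4), chosen to mirror the scale-factor step used in AAC; the source does not state the base, and AAC's nonlinear 4/3 power law is ignored here. The function names are hypothetical.

```python
def gain(scale_factor: int) -> float:
    """Gain for a scale factor value S, assuming base G = 2 ** (1/4)."""
    return 2.0 ** (scale_factor / 4)

def dequantize(r: int, scale_factor: int) -> float:
    """Frequency coefficient K from quantized value R and scale factor S."""
    return r * gain(scale_factor)

def requantize(r: int, s_old: int, s_new: int) -> int:
    """Re-express a quantized value under a different scale factor."""
    return round(dequantize(r, s_old) / gain(s_new))
```

Raising S from 4 to 8 doubles the gain, so the quantized value 8 becomes 4 while the coefficient it represents stays at 16. Two streams requantized to the same S can then have their quantized values added directly.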

If the phoneme string selection means 11 has output to the speech encoding apparatus 3 a phoneme string selection control signal that selects a silent phoneme and a phoneme string matching the duration of that phoneme (a duration corresponding to the length of the speech pause), no stream connection control signal is needed. In that case the speech encoding apparatus 3 searches for and selects phoneme strings on the basis of the phoneme string selection control signal, taking into account the times at which the audio streams are connected.

When connecting the audio streams shaped by the stream shaping means 13, the stream connection means 15 overlaps them on the time axis by a preset time (the set time). In this embodiment the set time is about half (50%) of one audio stream.

FIG. 5 shows the output of the stream connection means 15 in this case. FIG. 5 illustrates, using AAC audio streams as an example, how audio streams are connected with overlap on the time axis.

As shown in FIG. 5, the output of the stream connection means 15 can be represented schematically as a sequence of AAC audio streams, each produced by applying a window function (here a sine window of 2048 or 256 samples) to the time-domain audio signal and then performing DCT transformation and Huffman coding, arranged so that consecutive streams overlap by half.
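
The half-overlap arrangement can be illustrated in the time domain. The sketch below does not perform the transform or the coding; it only shows that when a sine window is applied twice (once before coding and once after decoding, an assumption about the surrounding codec) and blocks are overlapped by half, the window weights at each sample sum to 1. The function names are hypothetical.

```python
import math

def sine_window(n: int):
    """Sine window of length n."""
    return [math.sin(math.pi * (i + 0.5) / n) for i in range(n)]

def overlap_add_half(blocks):
    """Overlap-add equal-length blocks with a hop of half a block."""
    n = len(blocks[0])
    hop = n // 2
    out = [0.0] * (hop * (len(blocks) + 1))
    for b, block in enumerate(blocks):
        for i, value in enumerate(block):
            out[b * hop + i] += value
    return out

def windowed_constant_blocks(n: int, count: int, level: float = 1.0):
    """Blocks of a constant signal weighted by the squared sine window."""
    w = sine_window(n)
    return [[level * wi * wi for wi in w] for _ in range(count)]
```

In the region covered by two blocks the output equals the input level, because sin² and cos² of the same angle sum to 1; only the first and last half blocks, which have no partner, are attenuated.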

The method by which the stream connection means 15 connects audio streams is described below for the case where several phonemes are connected using 256-sample windows (for details of this connection method, see Japanese Patent Application No. 2004-118361, "Speech encoded information processing apparatus, speech encoded information processing program, and speech encoded information processing method").

For the frequency coefficients (DCT components) of the ADTS frames at both ends of each phoneme, requantization is performed to match the gain G with the larger scale factor value S. The audio streams of the two windows are then added after, for example, multiplying their frequency coefficients by 1/2; the result is Huffman-coded, converted to an audio stream at a predetermined bit rate, and connected.

Here the frequency coefficients are each multiplied by 1/2 before being added, but a high-quality connected audio stream can be produced by applying gains in suitable ratios over several frames before adding. As described above, the position of the connection point (and its vicinity) at which the stream connection means 15 connects the audio streams, and the frequency coefficients to be added, are determined by the stream connection control signal output from the phoneme string selection means 11.
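
Applying gains over several frames can be sketched as a cross-fade on the frequency coefficients. The linear ramp used here is an assumption (the source only says "suitable ratios"), and the function names are hypothetical; with a single frame the weights reduce to the 1/2 and 1/2 case described above.

```python
def crossfade_weights(n_frames: int):
    """(weight for A, weight for B) per frame; each pair sums to 1."""
    return [(1 - (i + 1) / (n_frames + 1), (i + 1) / (n_frames + 1))
            for i in range(n_frames)]

def mix_frames(frames_a, frames_b):
    """Weighted sum of corresponding frames of frequency coefficients."""
    if len(frames_a) != len(frames_b):
        raise ValueError("both streams must contribute the same number of frames")
    weights = crossfade_weights(len(frames_a))
    return [[wa * a + wb * b for a, b in zip(fa, fb)]
            for (wa, wb), fa, fb in zip(weights, frames_a, frames_b)]
```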

For example, the stream connection means 15 receives several audio streams and, when it multiplies them by the gain G and adds them in the frequency domain, processes them while controlling the quantization accuracy. It comprises frequency coefficient conversion means (not shown), gain multiplication means (not shown), frequency coefficient addition means (not shown), and speech encoded information conversion means (not shown).

The stream connection means 15 converts each input audio stream into frequency coefficients with the frequency coefficient conversion means (not shown) and multiplies the frequency coefficients by the gain G with the gain multiplication means (not shown). It then adds the frequency coefficients with the frequency coefficient addition means (not shown) while controlling the quantization accuracy.

Next, the stream connection means 15 converts the summed frequency coefficients into quantized values with the quantization unit (not shown) of the speech encoded information conversion means (not shown), codes the quantized values into Huffman codewords with the noiseless coding unit (not shown) of the speech encoded information conversion means (not shown), and determines with the rate-distortion controller unit (not shown) whether the bit rate is at or below a predetermined value.

If the rate-distortion controller unit (not shown) does not determine that the bit rate is at or below the predetermined value, the stream connection means 15 controls the magnitude of the quantized values and converts the Huffman codewords again, repeating this control until the bit rate after conversion is at or below the predetermined value. If the rate-distortion controller unit (not shown) determines that the bit rate is at or below the predetermined value, the bitstream multiplexer unit (not shown) of the speech encoded information conversion means (not shown) rearranges the Huffman codewords and converts them into an audio stream, which is output.
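
The repeat-until-it-fits control can be sketched as a loop that coarsens quantization until a bit budget is met. This is only a toy model: `estimate_bits` is a stand-in cost function and not Huffman coding, the step size 2^(S/4) is an assumed AAC-like convention, and the names and the `max_scale_factor` guard are hypothetical.

```python
def estimate_bits(quantized) -> int:
    """Toy cost model (not Huffman coding): one sign bit plus magnitude bits per value."""
    return sum(1 + abs(q).bit_length() for q in quantized)

def fit_to_budget(coeffs, bit_budget: int, scale_factor: int = 0,
                  max_scale_factor: int = 60):
    """Raise the scale factor until the estimated size is within the budget."""
    while True:
        step = 2.0 ** (scale_factor / 4)
        quantized = [round(c / step) for c in coeffs]
        bits = estimate_bits(quantized)
        if bits <= bit_budget or scale_factor >= max_scale_factor:
            return quantized, scale_factor, bits
        scale_factor += 1   # coarser quantization gives smaller values and fewer bits
```

The guard on `max_scale_factor` keeps the loop from running forever when the budget is smaller than the minimum cost of the frame.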

Returning to FIG. 1, the description of the configuration of the reading data encoding apparatus 1 continues.
According to the reading data encoding apparatus 1, the morphological analysis means 5 refers to the word dictionary stored in the word dictionary storage means 7 and performs morphological analysis on the kanji-kana mixed sentence, and the prosody prediction means 9 predicts prosodic symbols indicating the prosody of the sentence from the analysis result and from the phoneme symbols that transcribe the sentence. The phoneme string selection means 11 then outputs a phoneme string selection signal, which selects the phoneme strings to be encoded on the basis of the kanji-kana mixed sentence, the analysis result, the phoneme symbols, and the prosodic symbols, to the speech encoding apparatus 3, an external apparatus that stores phoneme strings and encodes them into audio streams. The phoneme string selection means 11 also outputs a stream connection control signal that controls how the audio streams encoded by the speech encoding apparatus 3 are to be connected. The stream shaping means 13 cuts the required portion out of the audio stream output from the speech encoding apparatus 3 and shapes it, and the stream connection means 15 connects the shaped audio streams on the basis of the stream connection control signal output from the phoneme string selection means 11 and outputs them as a connected audio stream. As a result, the kanji-kana mixed sentence is encoded in such a way that the receiving side needs no dedicated decoder: an existing decoder can decode the connected audio stream and reproduce the speech of the sentence being read aloud.

Also, according to the reading data encoding apparatus 1, the stream shaping means 13 shapes the audio stream into the smallest standalone units (meaningful units), so the stream connection means 15 can connect those units easily.

Furthermore, according to the reading data encoding apparatus 1, the stream connection means 15 connects the audio streams so that they overlap on the time axis by a preset time, producing the connected audio stream. Compared with simply joining the audio streams end to end, this removes the audible discontinuity that would otherwise arise at the joins on decoding and prevents degradation of speech quality.

In addition, according to the reading data encoding apparatus 1, when the stream connection means 15 connects audio streams, the quantization accuracy changing means 15a changes the quantization accuracy near the connection point, so audio streams with different quantization accuracies can be connected.

Although the reading data encoding apparatus 1 receives only the kanji-kana mixed sentence as reading data (reading text), the morphological analysis means 5, the word dictionary storage means 7, and the prosody prediction means 9 can be omitted if the kanji-kana mixed sentence, the phoneme symbols, the analysis result, and the prosodic symbols are input directly to the phoneme string selection means 11.

That is, the reading data may contain text data such as a kanji-kana mixed sentence together with auxiliary information for speech synthesis such as phoneme symbols and prosodic symbols, or it may contain no text data at all and consist only of data specialized for speech synthesis.

<Operation of Reading Data Encoding Apparatus [First Embodiment]>
Next, the operation of the reading data encoding apparatus 1 (first embodiment), including the operations of the speech encoding apparatus 3 and the stream distribution apparatus 2, will be described with reference to the flowchart shown in FIG. 3 (see FIG. 1 as appropriate).
First, the reading data encoding apparatus 1 uses the morphological analysis means 5 to morphologically analyze the input kanji-kana mixed sentence (text data) with reference to the word dictionary stored in the word dictionary storage means 7 (step S1), and outputs the analysis result and the phoneme symbols to the prosody prediction means 9 and the phoneme sequence selection means 11.

Subsequently, the reading data encoding apparatus 1 uses the prosody prediction means 9 to generate prosodic symbols based on the analysis result and the phoneme symbols obtained by the morphological analysis means 5, and outputs them to the phoneme sequence selection means 11 (step S2). The reading data encoding apparatus 1 then uses the phoneme sequence selection means 11 to output a phoneme sequence selection control signal, which selects the phonemes to be encoded, to the phoneme sequence storage means 17 of the speech encoding apparatus 3, and to output a stream connection control signal (control signal), which controls the connection between audio streams, to the stream connection means 15 (step S3).

The speech encoding apparatus 3 then uses the speech encoding means 19 to encode the phoneme sequence output from the phoneme sequence storage means 17 in accordance with the phoneme sequence selection control signal, and outputs the audio stream obtained by encoding this phoneme sequence to the reading data encoding apparatus 1 (step S4).

The reading data encoding apparatus 1 then uses the stream shaping means 13 to cut out the necessary part from the audio stream (step S5), shapes the audio stream, and outputs it to the stream connection means 15. Next, the reading data encoding apparatus 1 uses the stream connection means 15 to connect the audio streams output from the stream shaping means 13 to one another (step S6), and outputs the result to the stream distribution apparatus 2 as a connected audio stream.

The stream distribution apparatus 2 then distributes the connected audio stream via a network (not shown) or the like (step S7).
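The order of steps S1 to S7 can be summarized as a short sketch. Each means is passed in as a plain callable, and the function name `encode_reading_data`, the argument names, and the data shapes exchanged between steps are hypothetical; the source specifies only which means runs at which step, not any programming interface.

```python
def encode_reading_data(text, analyze, predict_prosody, select,
                        encode, shape, connect, distribute):
    """Run steps S1 to S7 in order, using caller-supplied callables.

    All callables are placeholders for the means named in the description.
    """
    # S1: morphological analysis yields an analysis result and phoneme symbols.
    analysis, phonemes = analyze(text)
    # S2: prosody prediction yields prosodic symbols.
    prosody = predict_prosody(analysis, phonemes)
    # S3: phoneme sequence selection yields the sequences to encode and a
    #     control signal for the later connection step.
    selections, control = select(text, analysis, phonemes, prosody)
    # S4: the speech encoding apparatus encodes each selected sequence.
    streams = [encode(sel) for sel in selections]
    # S5: stream shaping cuts out the necessary part of each stream.
    shaped = [shape(s) for s in streams]
    # S6: stream connection joins the shaped streams using the control signal.
    connected = connect(shaped, control)
    # S7: the stream distribution apparatus delivers the connected stream.
    return distribute(connected)


if __name__ == "__main__":
    result = encode_reading_data(
        "ab c",
        analyze=lambda t: (t.split(), list(t.replace(" ", ""))),
        predict_prosody=lambda a, p: ["flat"] * len(p),
        select=lambda t, a, p, pr: (p, {"overlap": 0}),
        encode=lambda sel: sel.encode("utf-8") + b"!",
        shape=lambda s: s[:-1],
        connect=lambda shaped, ctl: b"".join(shaped),
        distribute=lambda s: s,
    )
    print(result)
```

Passing the means as callables keeps the sketch independent of any particular analyzer or codec; the stand-in lambdas in the usage example exist only to make the flow executable.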

<Configuration of Reading Data Encoding Apparatus [Second Embodiment]>
FIG. 2 is a block diagram of the reading data encoding apparatus (second embodiment). As shown in FIG. 2, the reading data encoding apparatus 1A receives text data such as a kanji-kana mixed sentence as reading data and can switch the encoding dictionary that is referred to when the reading data is encoded. It comprises morphological analysis means 5, word dictionary storage means 7, prosody prediction means 9, phoneme sequence selection means 11A, stream shaping means 13, stream connection means 15, encoding dictionary storage means (phoneme sequence encoded data dictionary storage means) 21, and audio stream output means 23. Components identical to those of the reading data encoding apparatus 1 shown in FIG. 1 are given the same reference numerals, and their description is omitted.

The phoneme sequence selection means 11A outputs, to the audio stream output means 23, an encoded phoneme sequence selection control signal that selects a phoneme sequence (an already encoded phoneme sequence, that is, phoneme sequence encoded data). It also outputs a dictionary switching control signal that, according to the speech speed at which the connected audio stream obtained by encoding and connecting the kanji-kana mixed sentence is to be decoded, specifies the switching of the encoding dictionaries stored in the encoding dictionary storage means 21 and the time at which to switch. Like the phoneme sequence selection means 11, the phoneme sequence selection means 11A also outputs an audio stream connection control signal to the stream connection means 15.

The encoded phoneme sequence selection control signal includes the kanji-kana mixed sentence input to the apparatus 1A, the phoneme symbols output from the morphological analysis means 5, and the analysis result and prosodic symbols output from the prosody prediction means 9.

In other words, the phoneme sequence selection means 11A selects the optimum phoneme sequence encoded data from the encoding dictionary that has been switched to.

The dictionary switching control signal includes a speech speed conversion speed parameter for converting the speech speed. This parameter can be entered arbitrarily, or parameters reported in existing speech speed conversion research can be used.

The encoding dictionary storage means 21 consists of a storage medium such as a large-capacity hard disk, and stores a plurality of encoding dictionaries (phoneme sequence encoded data dictionaries) in which speech uttered by a plurality of speakers at various speech speeds has been encoded in advance. In this embodiment, three encoding dictionaries (A, B, and C) are stored. For example, they are preset so that encoding dictionary A has the fastest speech speed, encoding dictionary B the next fastest, and encoding dictionary C the slowest.
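As a minimal sketch of how one of the three encoding dictionaries might be chosen for a requested speech speed, the snippet below picks the dictionary whose nominal speed is closest to the target. The numeric speeds, their unit (morae per second), and the nearest-value rule are assumptions introduced for illustration; the source states only that A is the fastest, B the next fastest, and C the slowest.

```python
# Hypothetical nominal speech speeds in morae per second. Only the ordering
# A > B > C comes from the description; the values are placeholders.
DICTIONARY_SPEEDS = {"A": 9.0, "B": 7.0, "C": 5.0}


def choose_dictionary(target_speed, speeds=DICTIONARY_SPEEDS):
    """Return the name of the dictionary whose nominal speed is nearest
    to `target_speed`. Ties are broken by dictionary name."""
    if not speeds:
        raise ValueError("at least one dictionary is required")
    return min(speeds, key=lambda name: (abs(speeds[name] - target_speed), name))


if __name__ == "__main__":
    for speed in (8.5, 7.2, 3.0):
        print(speed, choose_dictionary(speed))
```

A nearest-value rule is only one possible policy; a threshold table or a per-speaker preference could be substituted without changing the surrounding flow.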

Based on the dictionary switching control signal and the encoded phoneme sequence selection control signal output from the audio stream output means 23, the encoding dictionary storage means 21 outputs phoneme sequence encoded data to the audio stream output means 23.

The audio stream output means 23 receives the dictionary switching control signal and the encoded phoneme sequence selection control signal from the phoneme sequence selection means 11A and outputs these signals to the encoding dictionary storage means 21. It then switches the phoneme sequence encoded data output from the encoding dictionary storage means 21 and outputs the data to the stream shaping means 13. For this purpose it includes output switching means 23a.

The output switching means 23a switches the phoneme sequence encoded data output from the encoding dictionary storage means 21 in accordance with the switching time contained in the dictionary switching control signal, and causes the data to be output to the stream shaping means 13.
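Switching by time can be sketched as a lookup in a sorted schedule. The representation of the dictionary switching control signal as a list of (switching time, dictionary name) pairs, the use of seconds, and the function name `active_dictionary` are assumptions for illustration; the source says only that the signal contains the switching time.

```python
from bisect import bisect_right


def active_dictionary(schedule, t):
    """Return the dictionary in effect at time `t`.

    `schedule` is a list of (switch_time_seconds, dictionary_name) pairs
    sorted by time; each entry takes effect from its switching time onward.
    """
    times = [entry[0] for entry in schedule]
    i = bisect_right(times, t) - 1      # last entry whose time is <= t
    if i < 0:
        raise ValueError("no dictionary is selected before the first switching time")
    return schedule[i][1]


if __name__ == "__main__":
    schedule = [(0.0, "B"), (2.5, "A"), (6.0, "C")]
    for t in (0.0, 2.5, 5.9, 10.0):
        print(t, active_dictionary(schedule, t))
```

Keeping the schedule sorted lets each lookup run as a binary search, which matters little for three dictionaries but keeps the sketch valid for longer switching sequences.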

According to the reading data encoding apparatus 1A, the morphological analysis means 5 refers to the word dictionary stored in the word dictionary storage means 7 and morphologically analyzes the kanji-kana mixed sentence, and the prosody prediction means 9 predicts (generates) prosodic symbols indicating the prosody of the kanji-kana mixed sentence, based on the analysis result of that morphological analysis and on the phoneme symbols, which are a phonemic notation of the kanji-kana mixed sentence. The phoneme sequence selection means 11A then selects phoneme sequence encoded data contained in the encoding dictionaries stored in the encoding dictionary storage means 21, based on the encoded phoneme sequence selection control signal (which includes the kanji-kana mixed sentence, the analysis result, the phoneme symbols, and the prosodic symbols) and the dictionary switching control signal. The audio stream output means 23 outputs, as an audio stream, the phoneme sequence encoded data corresponding to the phoneme sequence selected by the phoneme sequence selection means 11A, and the stream shaping means 13 cuts out the necessary part from this audio stream and shapes it.
Thereafter, the stream connection means 15 connects the audio streams shaped by the stream shaping means 13, based on the audio stream connection control signal output from the phoneme sequence selection means 11A, and outputs the result as a connected audio stream. Consequently, no dedicated decoding device is needed to receive and decode the connected audio stream: the kanji-kana mixed sentence can be encoded so that the speech produced when it is uttered can be reproduced with an existing decoding device.

Further, according to the reading data encoding apparatus 1A, a plurality of encoding dictionaries A, B, and C that differ in at least one of speaker and speech speed are stored in the encoding dictionary storage means 21, and the output switching means 23a switches among them, so that the phoneme sequence encoded data most appropriate for the input kanji-kana mixed sentence can be output.

<Operation of Reading Data Encoding Apparatus [Second Embodiment]>
Next, the operation of the reading data encoding apparatus 1A (second embodiment) will be described with reference to the flowchart shown in FIG. 4 (see FIG. 2 as appropriate).
First, the reading data encoding apparatus 1A uses the morphological analysis means 5 to morphologically analyze the input kanji-kana mixed sentence (text data) with reference to the word dictionary stored in the word dictionary storage means 7 (step S11), and outputs the analysis result and the phoneme symbols to the prosody prediction means 9 and the phoneme sequence selection means 11A.

Subsequently, the reading data encoding apparatus 1A uses the prosody prediction means 9 to generate prosodic symbols based on the analysis result and the phoneme symbols obtained by the morphological analysis means 5, and outputs them to the phoneme sequence selection means 11A (step S12). The reading data encoding apparatus 1A then uses the phoneme sequence selection means 11A to output, to the audio stream output means 23, an encoded phoneme sequence selection control signal that selects the phonemes to be encoded and a dictionary switching control signal that switches the encoding dictionary, and to output a stream connection control signal (control signal), which controls the connection between audio streams, to the stream connection means 15 (step S13).

The reading data encoding apparatus 1A then uses the audio stream output means 23 to output the encoded phoneme sequence selection control signal and the dictionary switching control signal to the encoding dictionary storage means 21. The phoneme sequence encoded data output from one of the encoding dictionaries A to C in the encoding dictionary storage means 21 is switched by the output switching means 23a and output to the stream shaping means 13 (step S14).

The reading data encoding apparatus 1A then uses the stream shaping means 13 to cut out the necessary part from the phoneme sequence encoded data (audio stream) (step S15), shapes the audio stream, and outputs it to the stream connection means 15. Next, the reading data encoding apparatus 1A uses the stream connection means 15 to connect the items of phoneme sequence encoded data (audio streams) output from the stream shaping means 13 to one another (step S16), and outputs the result to the stream distribution apparatus 2 as a connected audio stream.

The stream distribution apparatus 2 then distributes the connected audio stream via a network (not shown) or the like (step S17).

The embodiments of the present invention have been described above, but the present invention is not limited to them. For example, although the embodiments have been described as the reading data encoding apparatus 1, the processing of each component of the apparatus 1 (1A) can also be regarded as a reading data encoding program written in a general-purpose or special-purpose computer language. It can likewise be regarded as a reading data encoding method in which the processing of each component of the apparatus 1 (1A) is treated as a step in generating (encoding) a connected audio stream from a kanji-kana mixed sentence. In either case, the same effects as those of the reading data encoding apparatus 1 (1A) can be obtained.

FIG. 1 is a block diagram of a reading data encoding apparatus (first embodiment) according to an embodiment of the present invention.
FIG. 2 is a block diagram of a reading data encoding apparatus (second embodiment) according to an embodiment of the present invention.
FIG. 3 is a flowchart explaining the operation of the reading data encoding apparatus (first embodiment) shown in FIG. 1.
FIG. 4 is a flowchart explaining the operation of the reading data encoding apparatus (second embodiment) shown in FIG. 2.
FIG. 5 is a diagram explaining the method of connecting audio streams with overlap on the time axis, taking an AAC audio stream as an example.

Explanation of symbols

1, 1A: reading data encoding apparatus
2: stream distribution apparatus
3: speech encoding apparatus
5: morphological analysis means
7: word dictionary storage means
9: prosody prediction means
11: phoneme sequence selection means
13: stream shaping means
15: stream connection means
15a: quantization accuracy changing means
17: phoneme sequence storage means
19: speech encoding means
21: encoding dictionary storage means (phoneme sequence encoded data dictionary storage means)
23: audio stream output means
23a: output switching means

Claims

1. A reading data encoding apparatus that receives a kanji-kana mixed sentence as reading data and encodes the reading data into an audio stream, the apparatus comprising:
phoneme sequence selection means for outputting, to a speech encoding apparatus that stores phoneme sequences and encodes them into audio streams, a phoneme sequence selection signal that selects the phoneme sequence to be encoded based on the kanji-kana mixed sentence, an analysis result obtained by morphological analysis of the kanji-kana mixed sentence, and phoneme symbols that are a phonemic notation of the kanji-kana mixed sentence, and for outputting a control signal that controls the connection between the audio streams;
stream shaping means for cutting out a necessary part from the audio stream output from the speech encoding apparatus and shaping the audio stream; and
stream connection means for connecting the audio streams cut out by the stream shaping means based on the control signal and outputting the result as a connected audio stream.

2. A reading data encoding apparatus that receives a kanji-kana mixed sentence as reading data and encodes the reading data into an audio stream, the apparatus comprising:
word dictionary storage means for storing a word dictionary including at least data relating to the readings of words contained in the kanji-kana mixed sentence;
morphological analysis means for morphologically analyzing the kanji-kana mixed sentence with reference to the word dictionary stored in the word dictionary storage means, and for outputting phoneme symbols that are a phonemic notation of the kanji-kana mixed sentence;
prosody prediction means for predicting prosodic symbols indicating the prosody of the kanji-kana mixed sentence, based on the analysis result of the morphological analysis by the morphological analysis means and on the phoneme symbols;
phoneme sequence selection means for outputting, to a speech encoding apparatus that stores phoneme sequences and encodes them into audio streams, a phoneme sequence selection signal that selects the phoneme sequence to be encoded based on the kanji-kana mixed sentence, the analysis result, the phoneme symbols, and the prosodic symbols, and for outputting a control signal that controls the connection between the audio streams;
stream shaping means for cutting out a necessary part from the audio stream output from the speech encoding apparatus and shaping the audio stream; and
stream connection means for connecting the audio streams cut out by the stream shaping means based on the control signal and outputting the result as a connected audio stream.

3. A reading data encoding apparatus that receives a kanji-kana mixed sentence as reading data and encodes the reading data into an audio stream, the apparatus comprising:
phoneme sequence encoded data dictionary storage means for storing a phoneme sequence encoded data dictionary, which is a dictionary of phoneme sequence encoded data obtained by encoding phoneme sequences;
phoneme sequence selection means for selecting a phoneme sequence stored in the phoneme sequence encoded data dictionary storage means, based on the kanji-kana mixed sentence, an analysis result obtained by morphological analysis of the kanji-kana mixed sentence, and phoneme symbols that are a phonemic notation of the kanji-kana mixed sentence, and for outputting a control signal that controls the connection between items of the phoneme sequence encoded data;
audio stream output means for outputting, as an audio stream, the phoneme sequence encoded data corresponding to the phoneme sequence selected by the phoneme sequence selection means;
stream shaping means for cutting out a necessary part from the audio stream output from the audio stream output means and shaping the audio stream; and
stream connection means for connecting the audio streams cut out by the stream shaping means based on the control signal and outputting the result as a connected audio stream.

4. A reading data encoding apparatus that receives a kanji-kana mixed sentence as reading data and encodes the reading data into an audio stream, the apparatus comprising:
word dictionary storage means for storing a word dictionary including at least data relating to the readings of words contained in the kanji-kana mixed sentence;
morphological analysis means for morphologically analyzing the kanji-kana mixed sentence with reference to the word dictionary stored in the word dictionary storage means, and for outputting phoneme symbols that are a phonemic notation of the kanji-kana mixed sentence;
prosody prediction means for predicting prosodic symbols indicating the prosody of the kanji-kana mixed sentence, based on the analysis result of the morphological analysis by the morphological analysis means and on the phoneme symbols;
phoneme sequence encoded data dictionary storage means for storing a phoneme sequence encoded data dictionary, which is a dictionary of phoneme sequence encoded data obtained by encoding phoneme sequences;
phoneme sequence selection means for selecting a phoneme sequence stored in the phoneme sequence encoded data dictionary storage means, based on the kanji-kana mixed sentence, the analysis result, the phoneme symbols, and the prosodic symbols, and for outputting a control signal that controls the connection between items of the phoneme sequence encoded data;
audio stream output means for outputting, as an audio stream, the phoneme sequence encoded data corresponding to the phoneme sequence selected by the phoneme sequence selection means;
stream shaping means for cutting out a necessary part from the audio stream output from the audio stream output means and shaping the audio stream; and
stream connection means for connecting the audio streams cut out by the stream shaping means based on the control signal and outputting the result as a connected audio stream.

5. The reading data encoding apparatus as described, wherein the phoneme sequence encoded data dictionary storage means stores a plurality of phoneme sequence encoded data dictionaries that differ in at least one of speaker and speech speed, and the audio stream output means comprises output switching means for switching among and outputting the phoneme sequence encoded data stored in the plurality of phoneme sequence encoded data dictionaries.

6. The reading data encoding apparatus according to any one of claims 1 to 5, wherein the stream shaping means shapes the audio stream into single units, each being the smallest unit that can be decoded on decoding, and the stream connection means connects the single units to form the connected audio stream.

7. The reading data encoding apparatus according to claim 1, wherein the stream connection means connects the audio streams so that they overlap by a predetermined time on the time axis.

8. The reading data encoding apparatus as described, wherein the stream connection means comprises quantization accuracy changing means for changing the quantization accuracy in the vicinity of a connection point when connecting the audio streams.

In order to input a kanji-kana mixed sentence as read-out data and encode the read-out data into an audio stream,
phoneme sequence selection means for outputting, to a speech encoding apparatus that stores phoneme sequences and encodes them into audio streams, a phoneme sequence selection signal that selects the phoneme sequence to be encoded based on the kanji-kana mixed sentence, an analysis result obtained by morphological analysis of the kanji-kana mixed sentence, and phoneme symbols that are a phonemic notation of the kanji-kana mixed sentence, and for outputting a control signal that controls the connection between the audio streams;
stream shaping means for cutting out a necessary part from the audio stream output from the speech encoding apparatus and shaping the audio stream;
stream connection means for connecting the audio streams cut out by the stream shaping means based on the control signal and outputting the result as a connected audio stream;
A data encoding program for reading aloud,