JP2008275836A

JP2008275836A - Document processing method and device for reading aloud

Info

Publication number: JP2008275836A
Application number: JP2007118492A
Authority: JP
Inventors: Atsushi Yoshimoto; 淳善本
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2007-04-27
Filing date: 2007-04-27
Publication date: 2008-11-13

Abstract

PROBLEM TO BE SOLVED: To provide a read-aloud document processing method and device that realize a suitable means of modifying an audio signal, so that text information can be read aloud without redundancy.

SOLUTION: In an information processing device comprising at least a data input/output device, a data recording device, and an arithmetic processing device, a document is read that is a read-aloud instruction string composed of control codes and body text (the "ground sentence"). The relation among each control code, the effect that defines how the body text it controls is modified, and the number of such effects is also read. The document is separated into control codes and body text, each control code is paired with the body text it controls, and waveform data that reads the body text aloud is generated. The body-text waveform data is modified according to the control code corresponding to that body text, the unmodified and the modified waveform data are combined, and the resulting waveform data is output.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a read-aloud document processing method, and to a device that carries it out, in which an information processing device such as a personal computer having at least a data input/output device, a data recording device, and an arithmetic processing device receives a read-aloud document consisting of character strings, processes it, and outputs waveform data.

Text information can be presented visually or aurally, and modifications applied to text can likewise be visual or auditory.
For visual modification, body text can be decorated in many ways, such as character size and color.
For example, when a web page written in ordinary HTML is displayed in an ordinary browser, a portion enclosed in anchor tags (<A> to </A>) is shown in a different color from the rest of the body text, with an underline or the like, so it is easy to see that the portion links to other information. The text is thus not mere character information: the color modification carries an additional meaning, and multiple pieces of information accompany the text.

On the other hand, simply reading body text aloud is possible with known techniques, but the modified portions of the text are usually left unprocessed.
This is because visual presentation has many appropriate, well-established modification methods such as color, size, and typeface, whereas no appropriate, well-established method of modifying body text has been established for speech. The only alternative is the redundant approach of reading aloud both the body text and the control codes in the document.

This may seem unnecessary to sighted people, but visually impaired people eagerly want text to be read aloud without redundancy and with its accompanying information fully preserved. Information gaps are sometimes discussed as an issue between urban and rural areas, but it is also a fact that web-based information, which spread with vision as its primary channel, has created a gap between sighted and visually impaired people. A technique for presenting document content by a non-visual method, while retaining the information that goes beyond the body text, is urgently needed.

In recent years, automated voice guidance services using machine-generated speech have also become widespread.
For example, some banks allow balance inquiries and transfers through telephone voice guidance, usually in a spoken menu format such as "Main menu. Press 1 to check your balance, or 9 for anything else." Everything is spoken in the same tone, the stock phrase "Main menu" is read out every time, and every number assigned to a menu item is read out, so the guidance is highly redundant and irritates many people.

Conventional techniques for reading text information aloud include Patent Documents 1 to 6:
JP 2004-198830, "Text reading system and method"
JP 2003-345378, "Text reading system and method"
JP 2003-223181, "Character-to-speech conversion device and portable terminal device using the same"
JP 2002-132282, "Electronic text reading device"
JP 2002-55925, "Speech reading device and information processing device"
JP 2001-188777, "Method for associating speech with text, computer for associating speech with text, method for generating and reading a document with a computer, computer for generating and reading a document, method for performing audio playback of a text document with a computer, computer for performing audio playback of a text document, and method for editing and evaluating text in a document"

However, none of these conventional techniques can read text information aloud intelligibly and without redundancy.

Accordingly, an object of the present invention is to provide a read-aloud document processing method and device that realize an appropriate means of modifying an audio signal, so that text information can be read aloud intelligibly and without redundancy.

To solve the above problem, the read-aloud document processing device of the present invention has the following configuration. In an information processing device comprising at least a data input/output device, a data recording device, and an arithmetic processing device, the device comprises: document input means for reading a document that is a read-aloud instruction string composed of control codes and body text; relation input means for reading the relation among a control code, the effect that defines how the body text controlled by that control code is processed, and the number of such effects; document processing means for separating the document into control codes and body text and obtaining pairs of each control code and the body text it controls; body-text waveform data generation means for generating waveform data that reads the body text aloud; waveform processing means for processing the body-text waveform data according to the control code corresponding to that body text; waveform data synthesis means for combining the body-text waveform data with the processed waveform data; and waveform data output means for outputting the resulting waveform data.

The read-aloud document processing method of the present invention is characterized in that, in an information processing device comprising at least a data input/output device, a data recording device, and an arithmetic processing device: a document that is a read-aloud instruction string composed of control codes and body text is read; the relation among a control code, the effect that defines how the body text controlled by that control code is processed, and the number of such effects is read; the document is separated into control codes and body text, and pairs of each control code and the body text it controls are obtained; waveform data that reads the body text aloud is generated, and the body-text waveform data is processed according to the control code corresponding to that body text; the body-text waveform data and the processed waveform data are combined; and the resulting waveform data is output.

When there is more than one effect, waveform processing corresponding to each of the control codes involved may be applied to the body-text waveform data, so that a variety of effects can be supported.

The body-text waveform data may also be inspected in advance and, where there is no body text to read aloud, the output of that silent portion may be deleted to reduce redundancy.

The processing applied to the body text under a control code may be expressed as a change in volume.

Likewise, the processing applied to the body text under a control code may be expressed as a change in the amplitude of particular frequencies.

The processing applied to the body text under a control code may be expressed as a deformation of the waveform.

The processing applied to the body text under a control code may be expressed as a change in playback speed.

The processing applied to the body text under a control code may be expressed as a shift in playback timing.

The processing applied to the body text under a control code may be expressed as a pitch shift that changes the frequency.

For the pitch shift, changes that follow a classical tuning are suitable.

Likewise, changes that follow an equal temperament, in which an interval is divided into equal frequency ratios, are also suitable for the pitch shift.

Making the pitch shift a halving or doubling of the frequency yields speech that is easy to listen to.

Likewise, making the pitch shift an inversion, which changes the order of the notes of a chord, also yields speech that is easy to listen to.

According to the present invention, a document composed of control codes and body text is read and waveform data processed according to the control codes is output, which helps text information be read aloud intelligibly and without redundancy. The approach can also serve as a standard way of modifying audio signals.

Embodiments of the present invention are described below with reference to the drawings.
There are many kinds of effects that alter sound: effects such as echo that shift playback timing, effects such as overdrive simulators that deform the waveform, effects such as voice changers that change the frequency, effects such as equalizers that change the amplitude of particular frequencies, and effects that change volume or speed.

If an effect is applied to a person's speech and we ask which kinds of effect leave that speech clearly intelligible to other listeners, pitch-shift effects, which change the frequency without sacrificing clarity, appear to be the most suitable.
The description below therefore uses pitch shift as the example, but other effects can be used in the same way.

In everyday life people understand speech well whether the speaker's voice is high or low, so raising or lowering the pitch of a reading voice somewhat should have little effect on how easy it is to follow.
When two or more people say different things at the same time, a listener has difficulty understanding them; but when two or more speakers say the same content at the same speed at the same time, the listener can understand it. This is shown, for example, by the fact that an audience can understand the lyrics when a choir of several parts sings the same words simultaneously in different parts.
This means that people can understand not only a single voice, whether the original sound or a pitch-shifted sound derived from it, but also the original and the pitch-shifted sound played simultaneously, that is, a chord.

Equal temperament is useful for the pitch shift.
An equal temperament is a tuning in which an interval such as one octave is divided into equal frequency ratios. The term usually refers to twelve-tone equal temperament, which is familiar to modern ears.
FIG. 1 is a table of the frequency ratios in twelve-tone equal temperament.
Twelve-tone equal temperament divides one octave into 12 equal steps; the frequency ratio between adjacent notes is the twelfth root of 2 to 1, which corresponds to a semitone in Western music.
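As a rough illustration of the kind of table FIG. 1 describes, the ratios can be computed directly from the definition above. This is a minimal sketch; the function name `tet12_ratio` is chosen here for illustration and does not come from the patent.

```python
def tet12_ratio(semitones: int) -> float:
    """Frequency ratio for a shift of `semitones` in twelve-tone equal
    temperament: each semitone multiplies the frequency by 2 ** (1 / 12)."""
    return 2 ** (semitones / 12)


if __name__ == "__main__":
    # One row per semitone from the original note up to the octave.
    for n in range(13):
        print(f"{n:2d} semitones  ratio {tet12_ratio(n):.6f}")
```

Negative arguments give downward shifts, so `tet12_ratio(-12)` is one octave below the original.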

Because the present invention applies the pitch shift mechanically, the small differences between twelve-tone equal temperament and the classical tunings need not be a concern. Since a classical tuning can actually be preferable for computational reasons, the phrase "a fifth above the original" is used below with a dual meaning: a note 7 semitones above the original in twelve-tone equal temperament, and a note at 1.5 times the frequency of the original in the classical tuning from which it derives. For convenience, a note one semitone higher is written as 100 cents up, a note one semitone lower as 100 cents down, and one octave as 1200 cents.

Pressing C and G together on a piano sounds the chord of do and sol in C major. Sol is a fifth above do; in other words, 7 semitones or 700 cents higher. In a pure classical tuning, sol would have 1.5 times the frequency of do, but on a twelve-tone equal-tempered instrument such as a piano every interval other than the octave carries a small error, so even a well-tuned piano gives not 1.5 but the twelfth root of 2 to the 7th power, about 1.498307.
In terms of computation, multiplying by 1.5 as in the classical tuning takes far less work than multiplying by about 1.498307 as in twelve-tone equal temperament.
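The relation between cents and frequency ratios, and the size of the gap between the two readings of "a fifth", can be checked with a short calculation. This is a sketch; the helper names are illustrative and not part of the patent.

```python
import math


def cents_to_ratio(cents: float) -> float:
    """Frequency ratio for an interval in cents (1200 cents = one octave)."""
    return 2 ** (cents / 1200)


def ratio_to_cents(ratio: float) -> float:
    """Interval in cents for a given frequency ratio."""
    return 1200 * math.log2(ratio)


tempered_fifth = cents_to_ratio(700)           # about 1.498307
just_fifth = 1.5                               # the classical 3:2 fifth
gap_cents = ratio_to_cents(just_fifth) - 700   # about 1.955 cents

if __name__ == "__main__":
    print(f"tempered {tempered_fifth:.6f}, just {just_fifth}, gap {gap_cents:.3f} cents")
```

The gap is under 2 cents, which is consistent with the text treating the two meanings of "a fifth above" as interchangeable.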

Taking do as the original note, the notes that a pitch shift can produce include do# a semitone up (+100 cents), re a major second up (+200 cents), re# a minor third up (+300 cents), mi a major third up (+400 cents), fa a fourth up (+500 cents), fa# a diminished fifth up (+600 cents), and sol a fifth up (+700 cents).

A chord heard by ear produces a sensation different from anything obtained by sight, so it is desirable to exploit this property and synthesize chords selectively. Listening to an unpleasant sound for long is tiring, so in practice the voices should basically be layered so that they sound pleasant together.
In general, a chord sounds pleasant when its two notes stand in a simple integer frequency ratio. Such an interval is called a just (pure) interval, and tuning systems use these intervals to define their scales.
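One way to see which simple integer ratio sits behind a given equal-tempered shift is to round the tempered ratio to the nearest fraction with a small denominator. This is only a sketch: treating a denominator of at most 8 as "simple" is an assumption made here, not something the patent specifies.

```python
from fractions import Fraction


def nearest_simple_ratio(cents: int, max_denominator: int = 8) -> Fraction:
    """Closest fraction with a small denominator to the equal-tempered ratio."""
    tempered = 2 ** (cents / 1200)
    return Fraction(tempered).limit_denominator(max_denominator)


if __name__ == "__main__":
    for cents in (300, 400, 500, 700, 1200):
        print(cents, nearest_simple_ratio(cents))
```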

Twelve-tone equal temperament, which divides the octave into 12 equal parts, is not the only usable equal temperament. Others, such as 19-, 22-, 31-, 53-, and 72-tone equal temperament, can also be used.

Classical temperaments can also be used, including Pythagorean tuning, just intonation (C major), meantone temperament (Aron's temperament), Werckmeister I, II, and III, Kirnberger I, II, and III, Vallotti-Young, and Young II.

The pitch shift is implemented as follows.
FIG. 2 shows an example of a document to be processed: FIG. 2(a) is the string with its control codes shown explicitly, and FIG. 2(b) is the string as rendered visually.
A document consists of control codes and body text. In the illustrated example it is a mixture of body text written in HTML and tags serving as control codes. Following HTML 1.0, bold is marked with the B tag, italics with the I tag, and so on.
After the tags are interpreted, the usual visual display contains bold text, as in FIG. 2(b).

FIG. 3 shows an example of a document serving as a read-aloud instruction string: FIG. 3(a) is the string with its control codes shown explicitly, and FIG. 3(b) is the string annotated with its audio rendering.
In the audio rendering of the present invention, the body text is first split where tags occur, into "すべて国民は" ("all citizens"), "勤労の権利" ("the right to work"), "を有し" ("have"), and "義務を負ふ" ("bear the obligation").
Assume that the B tag denotes the effect of a pitch-shifted voice a third lower. Body text not under a B tag (such as "すべて国民は") is synthesized and output as is. For body text under a B tag (such as "勤労の権利"), sound data such as a PCM file is first produced by synthesizing the body text as is; pitch-shifted sound data a third lower is then produced from that sound data; finally, the body-text sound data and the pitch-shifted sound data are mixed and output. Known techniques are used as appropriate to mix the sounds.
As a result, all of the body text is read aloud, but, as in FIG. 3(b), only the portion under the B tag is read as a chord, with a voice a third lower sounding simultaneously with the plain reading. The listener can tell from the difference between a single voice and a chord whether a tag is present.
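The three steps described for tagged body text (synthesize, pitch-shift, mix) can be sketched on raw sample lists. The pitch shifter below is a deliberately naive stand-in, not the patent's method: it resamples by linear interpolation, which also changes duration, so the result is cut or zero-padded to the input length. A practical system would presumably use a duration-preserving pitch shifter. All function names are illustrative, and the 300-cent value for the B tag follows the correspondence-table example given later in the description.

```python
import math


def cents_to_ratio(cents: float) -> float:
    """Frequency ratio for a shift given in cents (1200 cents = one octave)."""
    return 2 ** (cents / 1200)


def naive_pitch_shift(samples: list, cents: float) -> list:
    """Shift pitch by resampling with linear interpolation.

    Simplification: plain resampling also changes duration, so the result is
    cut or zero-padded to the input length instead of being time-corrected.
    """
    ratio = cents_to_ratio(cents)
    n = len(samples)
    out = []
    for i in range(n):
        pos = i * ratio              # read position in the input
        j = int(pos)
        if j >= n:                   # read past the end (only when shifting up)
            out.append(0.0)
            continue
        frac = pos - j
        nxt = samples[j + 1] if j + 1 < n else 0.0
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out


def mix(a: list, b: list) -> list:
    """Average two equal-length waveforms so the sum cannot clip."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def render_tagged_segment(wz: list, cents: float) -> list:
    """Plain reading plus its pitch-shifted copy, sounded together."""
    return mix(wz, naive_pitch_shift(wz, cents))


if __name__ == "__main__":
    rate = 8000
    # A 220 Hz tone stands in for synthesized speech.
    tone = [math.sin(2 * math.pi * 220 * t / rate) for t in range(rate // 10)]
    chord = render_tagged_segment(tone, -300)   # B tag: 300 cents down
    print(len(chord))
```

Untagged body text would simply skip `render_tagged_segment` and be output unchanged.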

Similarly, FIG. 4 shows an example of a document to be processed: FIG. 4(a) is the string with its control codes shown explicitly, and FIG. 4(b) is the string as rendered visually.
In this example multiple tags are used, and a different pitch shift is applied to each of the different portions.
To illustrate how different tags are rendered differently, a second tag, the I tag, is included.
After the tags are interpreted, the usual visual display contains bold and italic text, as in FIG. 4(b).

FIG. 5 shows an example of a document serving as a read-aloud instruction string: FIG. 5(a) is the string with its control codes shown explicitly, and FIG. 5(b) is the string annotated with its audio rendering.
Assume that the I tag denotes the effect of a pitch-shifted voice a fifth higher. The body text is first split where tags occur, into "すべて国民", "勤労の権利", "を有し", and "義務を負ふ". Body text under neither the B tag nor the I tag ("は", "を有し") is synthesized and output as is. Under the B tag, the body text ("勤労の権利", "義務を負ふ") is first synthesized as is to produce sound data, pitch-shifted sound data a third lower is then produced from it, and finally the two are mixed and output. Under the I tag, the body text ("すべて国民") is first synthesized as is to produce sound data, pitch-shifted sound data a fifth higher is then produced from it, and finally the two are mixed and output.
As a result, all of the body text is read aloud, but, as in FIG. 5(b), the B-tagged portions alone are read as a chord with a simultaneous voice a third lower, and the I-tagged portion alone is read as a chord with a simultaneous voice a fifth higher. The user can tell from the difference between a single voice and a chord whether a tag is present.

Similarly, FIG. 6 shows an example of a document to be processed: FIG. 6(a) is the string with its control codes shown explicitly, and FIG. 6(b) is the string as rendered visually.
In this example both B and I tags are used, and some body text is under the control of more than one tag.
After the tags are interpreted, the usual visual display contains bold and italic text, as in FIG. 6(b).
That is, "すべて国民は" is unchanged; "勤労の", under the B tag, is bold; "権利", under both the B and I tags, is bold italic; "を有し", under the I tag, is italic; "義務", under both the B and I tags, is bold italic; and "を負ふ", under the B tag, is bold. In visual presentation, bold and italic are thus effects that can be applied without interfering with each other. The same is possible in speech.

FIG. 7 shows an example of a document serving as a read-aloud instruction string: FIG. 7(a) is the string with its control codes shown explicitly, and FIG. 7(b) is the string annotated with its audio rendering.
That is, "すべて国民は" is unchanged; "勤労の", under the B tag, is a chord with a pitch-shifted voice a third lower; "権利", under both the B and I tags, is a chord with a voice a third lower and a voice a fifth higher; "を有し", under the I tag, is a chord with a voice a fifth higher; "義務", under both the B and I tags, is a chord with a voice a third lower and a voice a fifth higher; and "を負ふ", under the B tag, is a chord with a voice a third lower.
In audio presentation too, multiple pitch-shifted voices are thus effects that can be applied without interfering with each other.

FIG. 8 is a flowchart of a program implementing the processing described above, and FIG. 9 is a flowchart of the waveform-processing subroutine within it.
The main routine proceeds through the following steps.
Initialization:
Step 1:
Read the correspondence table T, which lists, for each control code E, the number of effects k and the effects X(K) (K = 1 to k).
A control code E corresponds to a tag such as B or I, and an effect X(K) specifies the effect to apply. For example, when effects are limited to pitch shift, 100 means a shift 100 cents above the original sound and -400 means a shift 400 cents below it. The correspondence table T describes the relation between control codes E and effect numbers and can be edited freely by the user: for example, effect -300 if the control code E is B, effect 700 if the control code E is I, and effect 0 for any other control code. Some control codes E apply no effect, and different control codes E may apply the same effect.
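One possible in-memory form of the correspondence table T is a mapping from control code to its list of effects, with k given by the list length. This is a sketch under that assumption: the patent does not specify a storage format, and `TABLE_T` and `effects_for` are hypothetical names. The values -300, 700, and the default 0 are the examples given in the text.

```python
# Hypothetical representation of correspondence table T:
# control code E -> effects X(1..k), here pitch shifts in cents.
TABLE_T = {
    "B": [-300],   # 300 cents below the original
    "I": [700],    # 700 cents above the original
}


def effects_for(code: str):
    """Return (k, X) for a control code; any other code gets effect 0."""
    effects = TABLE_T.get(code.upper(), [0])
    return len(effects), effects


if __name__ == "__main__":
    for code in ("B", "I", "U"):
        print(code, effects_for(code))
```

Because the table is plain data, a user could edit it to remap tags to other effects, as the text notes.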

Step 2:
Read the document D to be read aloud, which consists of body text Z and control codes E.
Document D is a mixture of body text Z and control codes E and always contains at least one piece of body text Z.

Step 3:
Initialize the integer variable i to 1; it serves as the loop counter until all body text Z has been processed.

ドキュメント処理：
ステップ1；
ドキュメントDを制御コードEと地の文Zに分離し、地の文Zを制御コードEが存在していた位置で分割する。分割された個数を整数変数Nに代入する。 Document processing:
step 1;
Document D is divided into control code E and earth sentence Z, and earth sentence Z is divided at the position where control code E was present. The divided number is assigned to the integer variable N.

ステップ２；
地の文Zを文字列配列Z(n)（n＝1〜N）に代入する。 Step 2;
The local sentence Z is assigned to the character string array Z (n) (n = 1 to N).

Step 3;
Assign the control codes E that applied to each ground sentence Z to the string array C(n) (n = 1 to N), each entry being a combination of a delimiter z and control codes E.
An entry of C(n) can be written, for example, in the form "delimiter z, control code E". When several control codes E apply to a ground sentence Z, the entry is "delimiter z, first control code E1, delimiter z, second control code E2". When no control code E applies to a ground sentence Z, null is assigned to C(n).
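The three document-processing steps can be sketched as follows. This is a minimal sketch under several assumptions not stated in the original: control codes are HTML-like tags, the delimiter z is `|`, null is represented by `None`, whitespace-only pieces are dropped, and the arrays are 0-indexed Python lists. The function name `split_document` is hypothetical.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def split_document(doc, delimiter="|"):
    """Split document D into ground sentences Z(n) and code strings C(n)."""
    Z, C, open_codes = [], [], []

    def emit(text):
        if text.strip():                       # skip whitespace-only pieces
            Z.append(text)
            C.append("".join(delimiter + c for c in open_codes) or None)

    pos = 0
    for m in TAG.finditer(doc):
        emit(doc[pos:m.start()])
        closing, name = m.group(1), m.group(2).upper()
        if not closing:
            open_codes.append(name)
        elif name in open_codes:
            # drop the most recently opened matching code
            last = len(open_codes) - 1 - open_codes[::-1].index(name)
            del open_codes[last]
        pos = m.end()
    emit(doc[pos:])
    return Z, C
```

For the input `Plain <B>bold <I>both</I></B> end`, this yields four ground sentences, with the third one carrying both B and I.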

Generation of ground-sentence reading waveform data:
Create the waveform data WZ(i) that reads the ground sentence Z(i) aloud, for example by using a conventionally known speech synthesis routine.

Initialization of effect waveform data:
Initialize the effect waveform data WE as silent data of the same length as the reading waveform data WZ(i). The effect waveform data WE holds the waveform obtained by applying effects to the reading waveform data WZ(i).

Determining whether an effect is specified:
If the string array entry C(i) is null, no effect processing is needed, so processing proceeds to the next step.
If C(i) is not null and the effect means erasure, the effect waveform data WE is assigned to the reading waveform data WZ(i). For any effect other than erasure, processing moves to the waveform-processing subroutine to apply the effect.
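The initialization and the three-way branch above might look like the sketch below. It assumes waveforms are lists of float samples, that C(i) has already been parsed into a list of control codes, and that erasure is marked by the string `"erase"`. These conventions, and the names `prepare_segment`, `effects_of`, and `process`, are hypothetical.

```python
ERASE = "erase"  # hypothetical marker for the erasure effect

def prepare_segment(wz, c_entry, effects_of, process):
    """Return the waveform for one ground sentence after the effect check.

    wz         -- list of samples for WZ(i)
    c_entry    -- C(i): None, or a list of control codes (assumed pre-parsed)
    effects_of -- function mapping a control code to its list of effects
    process    -- stand-in for the waveform-processing subroutine
    """
    we = [0.0] * len(wz)          # WE: silence with the same length as WZ(i)
    if c_entry is None:
        return wz                  # no control code: leave WZ(i) untouched
    if any(ERASE in effects_of(code) for code in c_entry):
        return we                  # erasure: WZ(i) is replaced by silence
    return process(wz, we, c_entry)
```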

Waveform processing:
The processing described later is performed.

Combining ground-sentence reading waveform data and processed waveform data:
Combine the reading waveform data WZ(i) with the effect waveform data WE and assign the result to WZ(i).
The waveforms are combined with a conventionally known technique; sox (SOund eXchange), for example, can be used. To combine inputfile1.wav and inputfile2.wav into outputfile1.wav, write "soxmix inputfile1.wav inputfile2.wav outputfile1.wav"; after combining, rename outputfile1.wav to inputfile1.wav.
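For readers without sox at hand, the combining step can be pictured as a sample-wise sum. The sketch below assumes both waveforms are equal-length lists of float samples in the range -1 to 1; clipping the sum is an assumption of this sketch, and sox may scale mixed signals differently. The name `mix` is hypothetical.

```python
def mix(wz, we):
    """Mix two equal-length sample lists by summing and clipping to [-1, 1]."""
    if len(wz) != len(we):
        raise ValueError("waveforms must have the same length")
    return [max(-1.0, min(1.0, a + b)) for a, b in zip(wz, we)]
```

Mixing with an all-silent WE leaves WZ(i) unchanged, which is why initializing WE as silence is harmless when no effect is applied.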

Determining the end of ground-sentence processing:
While unprocessed ground sentences Z remain (i < N), increment the counter and repeat the loop; once all ground sentences have been processed (i == N), proceed to the next step.

Output of the resulting waveform data:
Outputting the reading waveform data WZ(1) to WZ(N) in order reads aloud every ground sentence Z in document D, each with the effects applied by its control codes E.
The reading waveform data WZ(1) to WZ(N) may be output as a file such as a wav file, or as sound from an attached speaker.
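Taken together, the main routine is a loop over the N ground sentences. The skeleton below is a sketch only: `synthesize` stands in for any speech synthesis routine and `process_segment` for the effect check, waveform processing, and combining described above. Both are hypothetical callables supplied by the caller.

```python
def read_aloud(sentences, codes, synthesize, process_segment):
    """Main loop: one waveform WZ(i) per ground sentence Z(i), in order."""
    output = []
    for z, c in zip(sentences, codes):    # i = 1 .. N
        wz = synthesize(z)                # WZ(i) from a speech synthesizer
        wz = process_segment(wz, c)       # effect check, processing, mixing
        output.append(wz)
    return output
```

The returned list corresponds to WZ(1) to WZ(N) and could then be written to a file or played back in order.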

In a typical HTML document, the body text is contained within the <BODY> to </BODY> tags, and the <HEAD> to </HEAD> tags often contain other descriptions. Reading the information inside the <HEAD> tag aloud as a ground sentence every time can be redundant.
To avoid this, the effect of the <HEAD> control code can be set to silencing (erasure). That is, the ground sentences are checked in advance, and when there is no ground sentence to be read aloud, the output of that silent part is deleted.
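Deleting the silent parts from the output can be sketched as a simple filter. This assumes each waveform is a list of float samples and that an erased segment consists only of zeros; the function name `drop_silent` and the `threshold` parameter are hypothetical.

```python
def drop_silent(segments, threshold=0.0):
    """Keep only waveforms that contain at least one audible sample."""
    return [w for w in segments if any(abs(s) > threshold for s in w)]
```

Applied to WZ(1) to WZ(N) before output, this removes the pause that an erased <HEAD> section would otherwise leave.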

The waveform-processing subroutine shown in FIG. 9 is performed in the following steps.
Initialization:
Step 1;
Count the number of control codes E in the string array entry C(i) and assign it to the integer variable M.

Step 2;
Repeat the following loop until all M control codes E in C(i) have been processed. Initialize the integer variable j to 1 and use it as the loop counter.

Determining the effect from the control code:
Get the j-th control code E in C(i). Then refer to the correspondence table T and obtain, from the control code E and its contents, the number of effects k to apply and the effects X(K).
This means that the number of effects k and the effects X(K) are uniquely determined by the control code E. For example, if table T associates the control code "B" with one effect, "pitch -300", this can be written as E = "B", k = 1, X(1) = "pitch -300".
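Counting M and resolving the j-th control code can be sketched as below, assuming C(i) entries use `|` as the delimiter z and effects are stored as strings such as "pitch -300". The names `parse_codes` and `lookup` are hypothetical.

```python
def parse_codes(c_entry, delimiter="|"):
    """Split a C(i) entry such as '|B|I' into ['B', 'I']; None gives []."""
    if c_entry is None:
        return []
    return [code for code in c_entry.split(delimiter) if code]

def lookup(table, code):
    """Return (k, [X(1), ..., X(k)]) for one control code."""
    effects = table.get(code, [])
    return len(effects), effects
```

Here M is simply `len(parse_codes(c_entry))`, and the loop counter j walks over the returned list.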

Applying effects to temporary effect waveform data:
Create the temporary waveform data WT(K) (K = 1 to k) by applying the effect specified by X(K) to the original reading waveform data WZ(i). WT(K) is a variable that temporarily stores waveform data with an effect applied.
For example, suppose effect X(1) is a pitch shift of 300 cents below the original sound (X(1) = "pitch -300"), the original reading waveform data WZ(i) is inputfile1.wav, and the temporary waveform data WT(1) with the effect applied is outputfile1.wav. Using sox, the command "sox inputfile1.wav outputfile1.wav pitch -300" produces the waveform data with the effect applied.
When the number of effects k is greater than one, each effect is applied separately to the reading waveform data WZ(i), and each resulting waveform becomes its own temporary waveform data WT(K).
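When several effects apply, one sox invocation per effect can be prepared from the same source file. The sketch below only builds the argument lists and does not run sox; the output file names `wt1.wav`, `wt2.wav`, and so on, and the function name `sox_effect_commands`, are hypothetical.

```python
def sox_effect_commands(src, effects, prefix="wt"):
    """Build one sox argument list per effect X(K), all reading the same src."""
    commands = []
    for k, effect in enumerate(effects, start=1):
        out = f"{prefix}{k}.wav"               # hypothetical name for WT(K)
        commands.append(["sox", src, out] + effect.split())
    return commands
```

Each argument list could then be handed to a process launcher on a system where sox is installed. Every command reads the original file, so the effects stay independent of one another.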

At the end of this process, the temporary waveform data WT(K) with effect X(K) applied may also be inspected. If its dominant frequency exceeds a predetermined threshold, for example 3400 Hz, the frequency is halved ("pitch -1200"); if it falls below a predetermined threshold, for example 300 Hz, the frequency is doubled ("pitch 1200"). A pitch-shifted sound that is too high is thus lowered by one octave, and one that is too low is raised by one octave.
This has the same effect as an inversion, which changes the order of the constituent notes of a chord.

The frequency band of a typical telephone is limited to about 300 Hz to 3400 Hz, so this processing avoids generating sounds that cannot be reproduced and is effective when the text is read aloud over an ordinary voice line. Speakers built into telephones may also be designed for this band, so the processing is also effective when some application software reads the text aloud through a telephone without a voice line.
The frequency thresholds that decide whether to shift the pitch are preferably adjustable by the user to some extent, to suit not only the speaker characteristics but also the user's audible range.
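The octave correction reduces to a small decision function. In this sketch the dominant frequency is assumed to have been measured elsewhere, and the thresholds default to the telephone band mentioned above but remain adjustable; the name `octave_correction` is hypothetical.

```python
def octave_correction(dominant_hz, low=300.0, high=3400.0):
    """Return the extra pitch shift in cents for a dominant frequency."""
    if dominant_hz > high:
        return -1200   # too high: one octave down (frequency halved)
    if dominant_hz < low:
        return 1200    # too low: one octave up (frequency doubled)
    return 0
```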

Combining all obtained effects with the effect waveform data:
Combine the effect waveform data WE, which accumulates all effects obtained so far, with the temporary waveform data WT(K), and assign the result to WE.

Determining the end of effect processing for one ground sentence:
While unprocessed effects remain (j < M), increment the counter and repeat the loop; once all effects have been processed (j == M), exit the loop and end this subroutine.
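The subroutine as a whole can be sketched as two nested loops that fold every temporary waveform into WE. The effect itself and the mixing are passed in as callables, since the original leaves them to known tools such as sox; the names `process_waveform`, `apply_effect`, and `mix` are hypothetical.

```python
def process_waveform(wz, codes, table, apply_effect, mix):
    """Accumulate every effect of every control code into WE."""
    we = [0.0] * len(wz)                     # WE starts as silence
    for code in codes:                       # j = 1 .. M
        for effect in table.get(code, []):   # K = 1 .. k
            wt = apply_effect(wz, effect)    # WT(K): effect on the original
            we = mix(we, wt)                 # fold WT(K) into WE
    return we
```

Each effect is always applied to the original WZ(i), never to an already processed waveform, so the order of the control codes does not change the result when mixing is a plain sum.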

FIG. 10 shows an example of a document to be processed: FIG. 10(a) is a character string with the control code shown explicitly, FIG. 10(b) is a character string showing its visual representation, and FIG. 10(c) is a character string showing another visual representation.
In the B-tag and I-tag examples above, bold and italic representations were rendered as processed speech. An A tag (anchor tag), however, may contain a variety of information. In such cases, making the representation depend on both the control code and its contents can convey more information.

For the document illustrated in FIG. 10(a), an ordinary browser displays the characters "Patent Office", which are under the control of the A tag, in a different color and underlined, as in FIG. 10(b). This attaches to "Patent Office" the information that an A tag is present, and clicking it moves to another page. Furthermore, if that page has been visited before, "Patent Office" changes to yet another color, as in FIG. 10(c), which attaches visit-history information. The representation thus combines not only the A tag itself but also the contents of the control code within it and the history managed by the browser.

Effects that depend on the contents of a control code can likewise be expressed in speech.
FIG. 11 shows an example of a document serving as a reading instruction character string: FIG. 11(a) is a character string with the control code shown explicitly, FIG. 11(b) is a character string annotated with its audio representation, and FIG. 11(c) is a character string annotated with another audio representation.
The embodiment above used a chord of the original sound and one pitch-shifted sound, but chords of three or more notes can be built in the same way, and even the same tag can be varied according to variables within the tag. One example is the general URI (Uniform Resource Identifier) given by HREF in an A tag. The algorithm for recognizing the contents of a URI (specified in RFC 2396) uses a conventionally known technique.
In this case, the initialization in the main routine is set up to search the A tag for the description HREF=U, decompose U, and match its contents to select effects. The contents of the control tag are also recorded in the correspondence table T that describes the effect relationships.
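Decomposing the HREF value might be sketched with Python's standard-library `urllib.parse.urlparse`, used here merely as a stand-in for the known URI recognition technique. The default-port table, the set of visited URLs, and the function name `link_features` are assumptions of this sketch.

```python
from urllib.parse import urlparse

DEFAULT_PORTS = {"http": 80, "https": 443}   # assumption for omitted ports

def link_features(href, visited_urls):
    """Decompose an HREF value into the properties used to pick effects."""
    parts = urlparse(href)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return {
        "visited": href in visited_urls,
        "port": port,
        "scheme": parts.scheme,
    }
```

The resulting properties can then be matched against the control-tag contents recorded in the correspondence table T.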

Assume the following pitch-shift effects: unread site (major third), visited site (minor third), port 80 (dominant), and https protocol (minor seventh). If the link destination indicated by the URI is a visited site that uses port 80 and https, a chord of minor third + dominant + minor seventh is combined with the original sound as the tonic, giving the same sonority as a so-called minor seventh chord.
A major third and a minor third differ by a semitone and do not sound pleasant together, but because a site is either unread or visited, that combination never actually occurs, and different audio representations result as in FIG. 11(b) and FIG. 11(c).
Besides depending on the URI indicated by HREF in the A tag, the chord can also be varied according to, for example, the accesskey in the same A tag.
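The chord selection in this example can be written out with intervals in cents of twelve-tone equal temperament (minor third 300, major third 400, fifth 700, minor seventh 1000). The function name `chord_shifts` and its parameters are hypothetical; the mapping from link properties to intervals follows the assumption stated in the text.

```python
MAJOR_THIRD, MINOR_THIRD, DOMINANT, MINOR_SEVENTH = 400, 300, 700, 1000

def chord_shifts(visited, port, scheme):
    """Return the pitch shifts (in cents above the tonic) for one link."""
    shifts = [MINOR_THIRD if visited else MAJOR_THIRD]
    if port == 80:
        shifts.append(DOMINANT)
    if scheme == "https":
        shifts.append(MINOR_SEVENTH)
    return shifts
```

Because the third is chosen by an either-or test, the major and minor third can never sound together, as noted above.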

Example:
As described above, selectively stacking pitch-shifted sounds based on twelve-tone equal temperament or the like makes it possible to present multiple pieces of information in a voice with a pleasant sonority that is unlikely to be uncomfortable.
One practical example is a telephone voice guidance service. Some banks, for instance, allow balance inquiries, transfers, and similar operations by telephone voice guidance, often using a spoken menu.
FIG. 12 shows a voice guidance menu: FIG. 12(a) is the wording spoken by conventional voice guidance, and FIG. 12(b) is the wording spoken by voice guidance according to the present invention.
Conventionally, every word was read aloud as in FIG. 12(a), which was highly redundant. To avoid this, the present invention makes the sounds corresponding to digits such as 1, 2, and 9 known to users in advance, so the digits themselves need not be read aloud. It is also easy to shorten the guidance by skipping formulaic wording, such as です ("is") in メインメニューです ("This is the main menu."), のお取引は ("transactions") in それ以外のお取引は ("For other transactions"), and のボタン ("the button") in のボタンを押して下さい ("Please press the button.").

According to the present invention, character information that is visually decorated or carries additional information can be expressed through audio decoration. Visually impaired people can therefore obtain visual information adequately through speech, and voice guidance services that are concise and easy to understand become possible, so the invention has high industrial value.

FIG. 1: A table showing the frequency ratios in twelve-tone equal temperament.
FIG. 2: An example of a document to be processed; (a) is a character string with the control code shown explicitly, and (b) is a character string showing its visual representation.
FIG. 3: An example of a document serving as a reading instruction character string corresponding to FIG. 2; (a) is a character string with the control code shown explicitly, and (b) is a character string annotated with its audio representation.
FIG. 4: An example of a document to be processed; (a) is a character string with the control code shown explicitly, and (b) is a character string showing its visual representation.
FIG. 5: An example of a document serving as a reading instruction character string corresponding to FIG. 4; (a) is a character string with the control code shown explicitly, and (b) is a character string annotated with its audio representation.
FIG. 6: An example of a document to be processed; (a) is a character string with the control code shown explicitly, and (b) is a character string showing its visual representation.
FIG. 7: An example of a document serving as a reading instruction character string corresponding to FIG. 6; (a) is a character string with the control code shown explicitly, and (b) is a character string annotated with its audio representation.
FIG. 8: A flowchart of the main part of the program implementing the processing.
FIG. 9: A flowchart of the waveform-processing subroutine.
FIG. 10: An example of a document to be processed; (a) is a character string with the control code shown explicitly, (b) is a character string showing its visual representation, and (c) is a character string showing another visual representation.
FIG. 11: An example of a document serving as a reading instruction character string corresponding to FIG. 10; (a) is a character string with the control code shown explicitly, (b) is a character string annotated with its audio representation, and (c) is a character string annotated with another audio representation.
FIG. 12: A voice guidance menu; (a) is the wording spoken by conventional voice guidance, and (b) is the wording spoken by voice guidance according to the present invention.

Claims

A document processing device for reading aloud, in an information processing apparatus including at least a data input/output device, a data recording device, and an arithmetic processing device, the device comprising:
document input means for reading a document that is a reading instruction character string composed of control codes and ground sentences;
relation input means for reading the relationship among a control code, the effect that is the processing content of the ground sentence controlled by that control code, and the number of such effects;
document processing means for separating the document into control codes and ground sentences and obtaining pairs of each control code and the ground sentence it controls;
ground-sentence reading waveform generating means for generating waveform data that reads a ground sentence aloud;
waveform processing means for processing the ground-sentence reading waveform data according to the control code corresponding to the ground sentence;
waveform data synthesizing means for combining the ground-sentence reading waveform data and the processed waveform data; and
waveform data output means for outputting the obtained waveform data.

A document processing method for reading aloud, in an information processing apparatus including at least a data input/output device, a data recording device, and an arithmetic processing device, characterized by:
reading a document that is a reading instruction character string composed of control codes and ground sentences;
reading the relationship among a control code, the effect that is the processing content of the ground sentence controlled by that control code, and the number of such effects;
separating the document into control codes and ground sentences and obtaining pairs of each control code and the ground sentence it controls;
generating waveform data that reads a ground sentence aloud, and processing the ground-sentence reading waveform data according to the control code corresponding to the ground sentence;
combining the ground-sentence reading waveform data and the processed waveform data; and
outputting the obtained waveform data.

The document processing method for reading aloud according to claim 2, wherein, when there are a plurality of effects, waveform processing according to the plurality of control codes is performed on the ground-sentence reading waveform data.

The document processing method for reading aloud according to claim 2 or 3, wherein the ground-sentence reading waveform data is checked in advance and, when there is no ground sentence to be read aloud, the output of the silent part is deleted.

The document processing method for reading aloud according to claim 2, wherein the processing content of the ground sentence controlled by the control code is a change in volume.

The document processing method for reading aloud according to claim 2, wherein the processing content of the ground sentence controlled by the control code is a change in the amplitude of a specific frequency.

The document processing method for reading aloud according to claim 2, wherein the processing content of the ground sentence controlled by the control code is a waveform deformation.

The document processing method for reading aloud according to claim 2, wherein the processing content of the ground sentence controlled by the control code is a change in reproduction speed.

The document processing method for reading aloud according to claim 2, wherein the processing content of the ground sentence controlled by the control code is a temporal change in reproduction timing.

The document processing method for reading aloud according to claim 2, wherein the processing content of the ground sentence controlled by the control code is a pitch shift that changes the frequency.

The document processing method for reading aloud according to claim 10, wherein the pitch shift is a change in accordance with a classical temperament.

The document processing method for reading aloud according to claim 10, wherein the pitch shift is a change in accordance with an equal temperament in which intervals are divided by equal frequency ratios.

The document processing method for reading aloud according to claim 11 or 12, wherein the pitch shift is a change that halves or doubles the frequency.

The document processing method for reading aloud according to claim 11 or 12, wherein the pitch shift is an inversion that changes the order of the constituent notes of a chord.