JP2007086309A

JP2007086309A - Voice synthesizer, voice synthesizing method, and program

Info

Publication number: JP2007086309A
Application number: JP2005273987A
Authority: JP
Inventors: Yoichi Fujii; 洋一藤井; Satoshi Furuta; 訓古田
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2005-09-21
Filing date: 2005-09-21
Publication date: 2007-04-05

Abstract

PROBLEM TO BE SOLVED: When text is read aloud as speech, PINs (personal identification numbers) and ID numbers can be suppressed only by registering them in a prohibited-word list. Because PINs are changed regularly for security, re-registering them each time burdens the user or the system. Read-out control based on XML tags is not simple either, and only the author of the text can create the control information, so the user cannot adjust the control to suit the situation.
SOLUTION: A voice synthesizer that converts input text into a voice signal and outputs it as speech uses a pattern morpheme generating means to generate pattern morphemes for information that the user, depending on the usage situation, does not want read aloud, such as PINs, telephone numbers, and card numbers. A read-out content changing means skips the information for which a pattern morpheme was generated, or changes it into other content such as a beep tone or silence. A read control signal input makes it possible to choose, according to the user's situation, whether the text is read with its content changed or read as is.
COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to text analysis technology applied to information devices such as mobile phones, PDAs (Personal Digital Assistants), and personal computers; in-vehicle equipment such as car navigation systems and ETC (Electronic Toll Collection System) units; and office equipment such as ATMs (automated teller machines) and CDs (cash dispensers). More specifically, it relates to a text-to-speech synthesizer that reads out the result of text analysis, a method therefor, and a program for causing a computer to carry out the method.

Artificially generating a speech signal from arbitrary text is called text-to-speech synthesis. It is generally performed in three stages: a language processing unit (text analysis), a phonological processing unit (prosody setting), and a speech synthesis unit. The input text first undergoes morphological and syntactic analysis in the language processing unit. The phonological processing unit then handles accent and intonation and outputs phoneme environment information such as phonological symbols, pitch length, and duration. Speech units registered in a speech unit dictionary are selected on the basis of this phoneme environment information. Finally, the speech synthesis unit synthesizes speech from the selected speech units and from information such as the phonological symbols, pitch length, and duration.

In a conventional speech synthesizer, terms whose pronunciation is inappropriate (discriminatory terms and the like) are stored in advance in a reading-prohibited term table so that they are not pronounced when they appear in the input text. When text is input, a reading-prohibited term judging means cuts the text into words, searches the reading-prohibited term table, and judges whether each word in the text is a reading-prohibited term. A pronunciation prohibiting means then suppresses the pronunciation of any word judged to be a reading-prohibited term, for example by sounding a beep or producing silence in its place, or by converting it into a substitute expression and pronouncing that instead (disclosed in Patent Document 1).
A method has also been disclosed in which XML (Extensible Markup Language) tags are written into the text to control the read-out, for example to read something different from the original text or to read nothing at all (Non-Patent Document 1). In this method, the author of the text embeds control tags that specify readings and skipped passages so that the text is read out in the way the author intends.

Patent Document 1: JP-A-5-165486 (pages 1 to 5, FIG. 1)
Non-Patent Document 1: Microsoft "SpeechSDK" Version 5.1

The conventional speech synthesizer disclosed in Patent Document 1 is configured as described above and has the following problems. Consider, for example, input text such as "Your PIN is 1234" or "Your ID number is abcdefg". If words such as "PIN" and "ID number" and the values themselves, "1234" and "abcdefg", were read out separately and without any connection to each other, there would be little harm: "PIN" and "ID number" are common nouns that pose no particular problem when read aloud, and "1234" and "abcdefg" are merely strings of digits or letters. When "Your PIN is 1234" is read out, however, the listener can associate the two ("PIN = 1234"), which can be a serious problem.
A conventional speech synthesizer can suppress only the words registered in its reading-prohibited list, so "1234" and "abcdefg" themselves would have to be registered. The number of possible digit and letter combinations is enormous, and PINs and similar values must be changed regularly for security. Registering them each time burdens the user or the system, so the conventional device cannot cope with this problem.
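The limitation of a fixed prohibited-word list can be illustrated with a small sketch. This is not the implementation of Patent Document 1; the function name `censor_words`, the word segmentation, and the `<beep>` placeholder are all assumptions made for illustration.

```python
def censor_words(words, prohibited, replacement="<beep>"):
    """Replace every word that appears in the prohibited-word table.

    Hypothetical stand-in for the conventional list-based approach:
    only exact, pre-registered words are suppressed.
    """
    return [replacement if w in prohibited else w for w in words]


# The list was registered while the PIN was still "1234".
prohibited = {"1234"}

before_change = censor_words(["Your", "PIN", "is", "1234"], prohibited)
# After the user changes the PIN, the stale list no longer protects it.
after_change = censor_words(["Your", "PIN", "is", "5678"], prohibited)
```

Under these assumptions the registered value is suppressed, while the changed PIN passes through until someone updates the list, which is the maintenance burden described above.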

The conventional approach disclosed in Non-Patent Document 1 controls the read-out with XML tags, but it requires information other than the body text to be sent for that control, so it is not simple. In addition, the control information can be set only by the author of the text. The read-out is controlled solely according to the author's intention and cannot be controlled by the user of the speech synthesizer.
For example, an author who wants a telephone number to be heard without fail might mark the telephone number so that it is spoken with emphasis. For the user, however, the number may be personal information that, depending on the situation, should not be read aloud. The conventional approach cannot handle such cases.

The present invention has been made to solve the above problems. Its object is to make it possible, according to the user's usage situation, to skip information that the user does not want read aloud, such as PINs, telephone numbers, and card numbers, or to read it out after changing it to other content such as a beep or silence.

The speech synthesizer according to the present invention comprises:
text analysis means that takes text as input and, using a language dictionary holding headwords, readings, and accent type information, divides the text into analysis results containing reading information and accent information;
prosody control means that generates prosodic information for controlling intonation and rhythm from the reading information and accent information obtained by the text analysis means;
segment selection means that selects phoneme segments from an acoustic dictionary storing phoneme segments, on the basis of the reading information and accent information obtained by the text analysis means and the prosodic information obtained by the prosody control means; and
speech synthesis means that creates synthesized speech by fitting the phoneme segments selected by the segment selection means to the prosodic information obtained by the prosody control means,
and further comprises:
pattern morpheme generation means that, during the processing of the text analysis means, generates morphemes matching the patterns defined in reading control rules for extracting specific patterns;
reading content changing means that changes the read-out content of the input text by changing the reading of the morphemes generated by the pattern morpheme generation means; and
reading control signal input means for inputting a reading control signal that determines whether the reading content changing means performs the change.

According to the speech synthesizer of the present invention, the pattern morpheme generation means generates morphemes for information that the user may not want read aloud depending on the usage situation, such as PINs, telephone numbers, and card numbers. The reading content changing means skips the information for which such morphemes were generated, or changes it to other content such as a beep or silence. The reading control signal input means lets the user choose, according to the usage situation, whether the input text is read with its content changed or unchanged. The synthesizer can therefore be used in a manner suited to the user's situation.

Embodiment 1.
The best mode for carrying out the present invention is described below with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a speech synthesizer that implements the speech synthesis method according to an embodiment of the present invention. In FIG. 1, reference numeral 1 denotes a text input terminal through which text is input. Reference numeral 2 denotes a text analysis unit that analyzes the text input from the text input terminal 1 and generates readings, accent information, and the linguistic information needed for prosody control. The text analysis unit 2 uses a known morphological analysis algorithm such as the longest match method, the minimum phrase (bunsetsu) count method, or the minimum cost method (see, for example, Kimiaki Shudo and Kenji Yoshimura, "The Structure of Japanese and Its Analysis", Information Processing, Vol. 27, No. 8, pp. 947-954, 1986).

Reference numeral 3 denotes a language dictionary that the text analysis unit 2 uses to analyze text and generate readings, accent information, and the linguistic information needed for prosody control. Reference numeral 4 denotes a prosody control unit that generates prosodic information such as intonation and rhythm from the readings, accent information, and linguistic information generated by the text analysis unit 2. The prosody control unit generates this prosodic information using known techniques, of which the Fujisaki model is representative.

Reference numeral 5 denotes an acoustic dictionary storing the phoneme segments used to create synthesized speech. Reference numeral 6 denotes a segment selection unit that selects, from the acoustic dictionary 5, the phoneme segments to be used for synthesis on the basis of the readings and the prosodic information such as intonation and rhythm output from the prosody control unit 4. Reference numeral 7 denotes a speech synthesis unit that creates synthesized speech by fitting the phoneme segments selected by the segment selection unit 6 to the prosodic information generated by the prosody control unit 4. Known methods can be used to change the pitch period and phoneme duration and synthesize the speech, for example the residual-driven LSP method, which synthesizes on LSP (Line Spectral Pair) parameters; the MBE (Multi Band Excitation) method, which synthesizes on spectral parameters; the pitch waveform superposition method, which overlaps and adds waveforms two pitch periods long; and the waveform editing method, which concatenates signal waveforms of phoneme-sized units. Reference numeral 8 denotes an output terminal that outputs the synthesized speech generated by the speech synthesis unit 7.

Reference numeral 9 denotes a reading control signal input terminal. It receives a reading control signal that controls whether the synthesized speech output from the output terminal 8 reads the text input at the text input terminal 1 exactly as written, or whether important keywords are left unread as silence or replaced with another sound such as a beep. Reference numeral 10 denotes reading control rules, which extract character strings of specific patterns from the text and define the read-out content for the extracted patterns. Reference numeral 11 denotes a pattern morpheme generation unit, which is called during the processing of the text analysis unit 2 and adds morpheme information on the basis of the rules described in the reading control rules 10. Reference numeral 12 denotes a reading content changing unit, which is also called during the processing of the text analysis unit 2. When a morpheme in the morphological analysis result was generated by the pattern morpheme generation unit 11, the reading content changing unit 12 changes its read-out content as requested by the signal from the reading control signal input terminal 9.

FIG. 2 is a flowchart showing the processing of the text analysis unit 2. In particular, S14 is the processing performed by the pattern morpheme generation unit 11, and S17 is the processing performed by the reading content changing unit 12.
FIG. 3 is a flowchart showing the processing of S14 in FIG. 2 in detail, that is, the processing of the pattern morpheme generation unit 11. In the following description, N in step S25 is assumed to be 3.
FIG. 4 is a flowchart showing the processing of S17 in FIG. 2 in detail, that is, the processing of the reading content changing unit 12.

FIG. 5 shows an example of the reading control rules 10; 21 to 24 are individual reading control rule examples.
FIG. 6 shows examples of the concrete matching patterns for the rule names specified by the conversion target character pattern rules of the reading control rules 10; 31 to 34 are individual matching pattern examples.
FIG. 7 shows an example of the data processed by the text analysis unit 2; 51 is an input text example, 52 is a matching character string example, 53 is an optimal morpheme example, 54 is an example of a morpheme generated by a reading control rule, and 55 and 56 are text analysis result examples.
FIG. 8 shows an example of morpheme candidate generation in the text analysis unit 2; 61 to 63 are actual morpheme examples.
FIG. 9 shows another example of the data processed by the text analysis unit 2; 71 is an input text example, 72 is a preceding word string example, 73 is a matching character string example, 74 is an optimal morpheme example, 75 is an example of a morpheme generated by a reading control rule, and 76 and 77 are text analysis result examples.

Next, the operation is described.
The following describes the operation when the input text example 51 shown in FIG. 7 is input to the text input terminal 1 of FIG. 1. The reading control signal input terminal 9 accepts a level with several possible values; in this embodiment, one of three levels, 0 to 2, is specified. Here the level is assumed to be set by the user in advance.
The input text example 51 input at the text input terminal 1 is passed to the text analysis unit 2. The text analysis unit 2 extracts sentences from the text one at a time and generates the most plausible analysis result for each. The per-sentence processing of the text analysis unit 2 is described with reference to FIG. 2.

Because the input text example 51 consists of a single sentence, the whole of it is passed to S11, which sets the current position to the head of the sentence and passes control to S12. In S12, the current position is the head of the sentence, so control passes to S13. In S13, words starting at the current position are looked up in the language dictionary 3 and registered as morpheme candidates. With the current position at the head of the sentence, the morphemes of morpheme examples 61 and 62 shown in FIG. 8 are generated.

When morpheme generation by dictionary lookup is finished, control passes to S14, which checks whether any pattern matches the reading control rules of FIG. 5 and the matching patterns of FIG. 6. If one does, it is registered as a morpheme in S19. At the current position (the head of the sentence) no pattern matches the reading control rules and matching patterns, so control passes to S15, which moves the current position one character toward the end of the sentence.
When S15 is finished, control returns to S12, and the processing of S12 to S15 is repeated.
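The loop over S11 to S15 can be sketched as follows. This is a minimal illustration under assumptions, not the patented implementation: the `Morpheme` class, the list-based dictionary, and the matcher callables are hypothetical, and S12 is assumed to be an end-of-sentence check, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    start: int               # character offset in the sentence
    surface: str             # matched text
    from_rule: bool = False  # True if produced by a reading control rule


def generate_candidates(sentence, dictionary, pattern_matchers):
    """Collect morpheme candidates at every character position."""
    candidates = []
    pos = 0                                  # S11: start at the sentence head
    while pos < len(sentence):               # S12: assumed end-of-sentence check
        for word in dictionary:              # S13: dictionary lookup
            if sentence.startswith(word, pos):
                candidates.append(Morpheme(pos, word))
        for matcher in pattern_matchers:     # S14 / S19: pattern morphemes
            surface = matcher(sentence, pos)
            if surface:
                candidates.append(Morpheme(pos, surface, from_rule=True))
        pos += 1                             # S15: advance one character
    return candidates
```

Each matcher is assumed to take the sentence and a position and return the matched substring or `None`. Selecting the best path through these candidates (S16) is a separate step and is not shown.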

The specific processing of S14 is now described for the point at which the current position in the input text example 51 has reached "(045)930-0010まで、連絡・・・". Control passes to S21 in FIG. 3. S21 sets the first reading control rule, so rule example 21 shown in FIG. 5 is selected and control passes to S22. In S22, not all reading control rules have been processed yet, so control passes to S23. In S23, rule example 21 defines the preceding word string "FAX", so control passes to S25. The preceding word string "FAX" does not occur within the preceding N characters (N = 3), so control passes to S27, which selects the next rule, rule example 22 shown in FIG. 5, and returns control to S22.

Rule example 22, like rule example 21, has a preceding word string. Its preceding word string "TEL" does not occur within the N characters preceding the current position in the input text example 51, so after the checks in S23 and S25 control passes to S27, which selects the next rule, rule example 23, and returns control to S22.
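The preceding-word check of S23 and S25 can be sketched as below. The text only says the word must occur "within the preceding N characters"; this sketch assumes that means the word ends at most N characters before the current position. The function name and signature are hypothetical.

```python
def has_preceding_word(text, pos, word, n=3):
    """Return True if `word` ends at most `n` characters before `pos`.

    One possible reading of the S25 check; the exact window
    definition is an assumption, not taken from the source.
    """
    for gap in range(n + 1):
        end = pos - gap
        start = end - len(word)
        if start >= 0 and text[start:end] == word:
            return True
    return False
```

For example, with the text "FAX: 045-930-0010" and the current position at the first digit, the word "FAX" ends two characters earlier, so the check succeeds under this reading; for text that does not contain "FAX" near the position, the rule is skipped.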

In S22, not all reading control rules have been processed, so control passes to S23. In S23, rule example 23 defines no preceding word string, so control passes to S24. S24 judges whether the conversion target character pattern rule of rule example 23 matches a substring of the text starting at the current position of the input text example 51 shown in FIG. 7, "(045)930-0010まで、連絡・・・". In rule example 23 the conversion target character pattern rule is "rule(phone2)", so S24 judges whether the text satisfies the matching pattern description and matching condition of any of matching pattern examples 31 to 33 in FIG. 6.

A matching pattern description in FIG. 6 can specify a character string literally, or specify that characters of a particular type occur consecutively a given number of times. A literal string is written enclosed in double quotation marks. A run of characters of a particular type is written as <character type>(<minimum length>, <maximum length>).

For the matching condition, the character-type elements matched by the matching pattern description are bound to variables in order from the left. "len(<variable>)" gives the length of the matched string, and "[val(<variable>), <minimum>, <maximum>]" constrains a value to a numeric range.
For example, in the matching pattern description of matching pattern example 31, "NUM(1,10)" specifies a run of 1 to 10 digits, followed by the literal "-" (written as "-" in quotation marks), another run of 1 to 10 digits, another "-", and a final run of 1 to 10 digits. The NUM(*,*) elements are bound to the variables $1 to $3 from left to right. The matching condition "[len($1)+len($2)+len($3),10,11]" then requires the total number of digits to be 10 or 11.
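The pattern description and matching condition explained above map naturally onto a regular expression plus a length check. The sketch below uses Python's `re` module rather than the rule notation of FIG. 6. `HYPHENATED` follows matching pattern example 31 as described in the text. `PARENTHESIZED` is a hypothetical pattern, introduced only because the input "(045)930-0010" has that shape; the source does not reproduce the actual description of examples 32 and 33.

```python
import re

# Example 31 as described: NUM(1,10) "-" NUM(1,10) "-" NUM(1,10)
HYPHENATED = re.compile(r"(\d{1,10})-(\d{1,10})-(\d{1,10})")

# Hypothetical variant for inputs such as "(045)930-0010" (an assumption).
PARENTHESIZED = re.compile(r"\((\d{1,10})\)(\d{1,10})-(\d{1,10})")


def match_phone(text, pos=0):
    """Return the substring at `pos` that looks like a phone number, else None."""
    for pattern in (HYPHENATED, PARENTHESIZED):
        m = pattern.match(text, pos)
        if m is None:
            continue
        # Matching condition [len($1)+len($2)+len($3),10,11]
        total_digits = sum(len(group) for group in m.groups())
        if 10 <= total_digits <= 11:
            return m.group(0)
    return None
```

Keeping the length condition outside the regular expression mirrors the split in the text between the matching pattern description and the matching condition.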

Accordingly, for the text "(045)930-0010まで、連絡・・・" starting at the current position of the input text example 51, matching pattern examples 31 and 33 do not match, matching pattern example 32 does, and control passes to S26. In S26, the morpheme 63 of FIG. 8 is generated and registered, and control passes to S27. In addition to general information such as the reading, the level of the reading control rule is set in the morpheme at this point. S27 selects the next reading control rule, and the processing of S22 to S27 is repeated. Eventually S22 determines that all reading control rules have been processed, and the processing of the pattern morpheme generation unit 11 ends.

When morpheme candidate generation has finished as described above, the text analysis unit 2 selects the most plausible morpheme sequence in S16 using a known morphological analysis method such as the minimum cost method or the two-phrase (bunsetsu) longest match method. For the input text example 51, the optimal morpheme example 53 shown in FIG. 7 is generated. Once the optimal morphemes have been selected in S16, control passes to S17, where the reading content changing unit 12 changes the read-out content.

The read-out content change performed in S17 by the reading content changing unit 12 is described following the flow of FIG. 4. First, S31 determines the synthesis level from the value input at the reading control signal input terminal 9. This value is input according to the environment in which the device is used. Here the terminal accepts three levels: 0, read everything; 1, do not read the most important keywords; 2, do not read important keywords. The following description assumes that level 1 has been input at the reading control signal input terminal 9.
In S32, the first morpheme, "御用 [ゴヨ'ー]", is selected from the optimal morpheme example 53 of FIG. 7, and control passes to S33. In S33, not all morphemes have been processed, so control passes to S34. In S34, the first morpheme "御用 [ゴヨ'ー]" is examined. It was not generated by a reading control rule, so control passes to S37.

In S37, the next morpheme, "の [ノ]", is selected and control returns to S33. This processing repeats until the morpheme example 54 shown in FIG. 7 is selected in S37. When morpheme example 54 has been selected in S37 and control has passed to S33, not all morphemes have been processed, so control passes to S34. Morpheme example 54 was generated by a reading control rule, so control passes to S35. In S35, the synthesis level (1) is greater than or equal to the rule level (1), so control passes to S36. S36 rewrites the reading according to a predetermined rule; here the reading is rewritten to silence. When S36 is finished, S37 selects the next morpheme.

The processing of S32 to S37 is repeated in the same way until S17 is finished. Control then passes to S18, which adjusts the accent position within each accent phrase. Accent positions are modified where particles, auxiliary verbs, and the like attach, according to known accent rules such as those of the "NHK Japanese Pronunciation Accent Dictionary". The text analysis result example 56 shown in FIG. 7 is then generated, and the processing of the text analysis unit 2 ends.
When the synthesis level is 0, morpheme example 54 does not satisfy the condition in S35, so control passes directly to S37 and the text analysis result example 55 shown in FIG. 7 is generated instead.
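The level comparison of S31 to S37 can be sketched as follows. The `Morpheme` fields, the use of `None` for dictionary morphemes, and the empty string as a stand-in for silence are assumptions for illustration; the source only specifies the comparison "synthesis level >= rule level" and the rewrite to silence.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

SILENCE = ""  # hypothetical marker for an unread (silent) morpheme


@dataclass(frozen=True)
class Morpheme:
    surface: str
    reading: str
    rule_level: Optional[int] = None  # None: ordinary dictionary morpheme


def change_readings(morphemes: List[Morpheme], synthesis_level: int) -> List[Morpheme]:
    """Silence rule-generated morphemes whose level the synthesis level reaches."""
    result = []
    for m in morphemes:                      # S32 / S33 / S37: walk the morphemes
        generated_by_rule = m.rule_level is not None            # S34
        if generated_by_rule and synthesis_level >= m.rule_level:  # S35
            result.append(replace(m, reading=SILENCE))           # S36
        else:
            result.append(m)
    return result
```

With a rule level of 1, a synthesis level of 1 or 2 silences the morpheme, while level 0 leaves every reading unchanged, which matches the two analysis results described above.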

When text analysis is finished, control passes to the prosody control unit 4, which generates prosodic information such as intonation and rhythm. The segment selection unit 6 then selects segments matching the utterance content from the acoustic dictionary 5. Finally, the speech synthesis unit 7 deforms the segments selected by the segment selection unit 6 to fit the intonation and rhythm generated by the prosody control unit 4, creates the synthesized speech, and outputs it to the output terminal 8. The processing of the prosody control unit 4, the segment selection unit 6, and the speech synthesis unit 7 can be implemented with known speech synthesis methods, so the details are omitted.

Next, an example of the analysis result for another input text is briefly described with reference to FIG. 9.
When input text example 71 in FIG. 9 is input, the preceding word string of reading control rule example 24 in FIG. 5 matches preceding word string example 72, so the matching pattern of matching pattern example 34 in FIG. 6 is searched for after preceding word string example 72.
As a result, matching character string example 73 is found, a morpheme is generated for it, and optimal morpheme 74 is produced. Because optimal morpheme 74 contains reading control rule generated morpheme example 75, text analysis result example 77 is generated when the level at the reading control signal input terminal 9 is 1, and text analysis result example 76 is generated when the level is 0.
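The flow described above (find the preceding word string, search for the matching pattern after it, then let the control signal level decide how the match is read) can be sketched roughly as follows. This is only an illustrative sketch: the rule contents of FIG. 5 and FIG. 6 are not reproduced in this text, so the preceding word, the four-digit pattern, and the names `ReadingControlRule`, `find_controlled_span`, and `analyze` are all assumptions, and regular expressions stand in for the morpheme lattice processing of the text analysis unit.

```python
import re
from dataclasses import dataclass


@dataclass
class ReadingControlRule:
    preceding_word: str  # word string that must appear before the pattern
    pattern: str         # regular expression searched for after that word


# Hypothetical rule: the real contents of FIG. 5 / FIG. 6 are not known here.
RULE = ReadingControlRule(preceding_word="暗証番号", pattern=r"[0-9]{4}")


def find_controlled_span(text, rule):
    """Return (start, end) of the first pattern match after the preceding word, or None."""
    pos = text.find(rule.preceding_word)
    if pos < 0:
        return None
    search_from = pos + len(rule.preceding_word)
    match = re.compile(rule.pattern).search(text, search_from)
    return (match.start(), match.end()) if match else None


def analyze(text, rule, level):
    """Split text into (string, mode) parts.

    Level 0 reads everything as-is; level 1 marks the matched span as "silence".
    """
    span = find_controlled_span(text, rule)
    if span is None or level == 0:
        return [(text, "read")]
    start, end = span
    parts = [
        (text[:start], "read"),
        (text[start:end], "silence"),
        (text[end:], "read"),
    ]
    return [(s, mode) for s, mode in parts if s]
```

With this sketch, `analyze("暗証番号は1234です", RULE, 1)` marks only `1234` for silencing, while level 0 leaves the whole sentence to be read normally.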

In the above embodiment, silencing was shown as an example of the processing in the reading content changing unit 12. Besides silencing, the change may also map the reading to other content so that it becomes meaningless words.
The reading can also be replaced with a specific sound effect, such as an animal cry or a beep tone.
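The three kinds of change mentioned here (silencing, mapping to meaningless words, and replacement with a sound effect) could be organized along the following lines. This is a hedged sketch, not the patent's implementation: the `ChangeMode` enum, the digit-to-syllable table, and the tuple-shaped return values are hypothetical, and the choice of making the pause length equal to the number of characters is an assumption made only for illustration.

```python
from enum import Enum


class ChangeMode(Enum):
    SILENCE = 1       # replace the reading with a pause
    NONSENSE = 2      # map the reading to meaningless syllables
    SOUND_EFFECT = 3  # replace the reading with a fixed sound such as a beep


# Hypothetical mapping from digits 0-9 to meaningless syllables.
NONSENSE_SYLLABLES = ["ra", "ri", "ru", "re", "ro", "ma", "mi", "mu", "me", "mo"]


def change_reading(surface, mode):
    """Return a (kind, value) pair describing what to output instead of `surface`."""
    if mode is ChangeMode.SILENCE:
        # Assumption: one pause unit per character of the hidden string.
        return ("pause", len(surface))
    if mode is ChangeMode.NONSENSE:
        mapped = "".join(
            NONSENSE_SYLLABLES[int(ch)] if ch in "0123456789" else ch
            for ch in surface
        )
        return ("speech", mapped)
    if mode is ChangeMode.SOUND_EFFECT:
        return ("effect", "beep")
    raise ValueError(f"unknown change mode: {mode!r}")
```

A later stage would then turn each `(kind, value)` pair into a pause, synthesized speech, or a prerecorded sound.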

In the above embodiment, the level given at the reading control signal input terminal 9 is set by the user in advance. Alternatively, the user can be asked for confirmation at the time the synthesized speech is output, and the text can then be read at the level the user enters at the reading control signal input terminal 9 in response.

Furthermore, although the above embodiment is illustrated with Japanese text, it is also applicable to other languages such as English. For example, in “Your password number is 1234”, “password” can be treated in the same way as 「暗証番号」 (PIN).

In the above embodiment, reading control is applied to speech output, but it is also applicable to a display device.

Furthermore, in the above embodiment the pattern matching of FIG. 6 is described as patterns at the character string level, but it is easy to extend it to descriptions at the morpheme level. This makes it possible to control the reading of personal names or of addresses.
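One way to picture the morpheme-level extension is a pattern written over part-of-speech labels instead of characters. The sketch below is an assumption for illustration only: the `Morpheme` class, the label names such as `surname` and `given_name`, and the function `match_pos_pattern` do not come from the patent, which does not specify a tag set or a pattern syntax.

```python
from dataclasses import dataclass


@dataclass
class Morpheme:
    surface: str  # written form
    pos: str      # part-of-speech label (hypothetical tag set)


def match_pos_pattern(morphemes, pos_pattern):
    """Return (start, end) index ranges whose POS labels equal pos_pattern in order."""
    n = len(pos_pattern)
    hits = []
    i = 0
    while n > 0 and i + n <= len(morphemes):
        window = [m.pos for m in morphemes[i:i + n]]
        if window == list(pos_pattern):
            hits.append((i, i + n))
            i += n  # continue after the match so ranges do not overlap
        else:
            i += 1
    return hits


sentence = [
    Morpheme("担当", "noun"),
    Morpheme("は", "particle"),
    Morpheme("山田", "surname"),
    Morpheme("太郎", "given_name"),
    Morpheme("です", "auxiliary"),
]
name_spans = match_pos_pattern(sentence, ["surname", "given_name"])
```

Here `name_spans` identifies the morphemes forming a personal name, which could then be passed to the reading content changing step in the same way as a character-level match.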

The present invention is applicable to information devices such as mobile phones, PDAs (Personal Digital Assistants), and personal computers; in-vehicle devices such as car navigation systems and ETC (Electronic Toll Collection System) units; and office equipment such as ATMs (automated teller machines) and CDs (cash dispensers).

FIG. 1 is a block diagram showing the configuration of the speech synthesizer according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the processing of the text analysis unit.
FIG. 3 is a flowchart showing the processing of the pattern morpheme generation unit.
FIG. 4 is a flowchart showing the processing of the reading content changing unit.
FIG. 5 is an explanatory diagram of reading control rule examples.
FIG. 6 is an explanatory diagram of specific matching pattern examples used in the reading control rules.
FIG. 7 is an explanatory diagram of an example of data processed in the text analysis unit.
FIG. 8 is an explanatory diagram of an example of morpheme candidate generation in the text analysis unit.
FIG. 9 is an explanatory diagram of another example of data processed in the text analysis unit.

Explanation of symbols

1: text input terminal; 2: text analysis unit; 3: language dictionary; 4: prosody control unit; 5: acoustic dictionary; 6: segment selection unit; 7: speech synthesis unit; 8: speech output terminal; 9: reading control signal input terminal; 10: reading control rule; 11: pattern morpheme generation unit; 12: reading content changing unit; 21 to 24: reading control rule examples; 31 to 34: matching pattern examples; 51: input text example; 52: matching character string example; 53: optimal morpheme example; 54: reading control rule generated morpheme example; 55 to 56: text analysis result examples; 61 to 63: morpheme examples; 71: input text example; 72: preceding word string example; 73: matching character string example; 74: optimal morpheme example; 75: reading control rule generated morpheme example; 76 to 77: text analysis result examples.

Claims

Text analysis means for taking text as input and, using a language dictionary holding headword, reading, and accent type information, dividing the text into analysis results including reading information and accent information;
Prosody control means for generating prosodic information for controlling intonation and rhythm based on the reading information and accent information obtained by the text analysis means;
Segment selection means for selecting phoneme segments from an acoustic dictionary storing phoneme segments, based on the reading information and accent information obtained by the text analysis means and the prosodic information obtained by the prosody control means;
In a speech synthesizer having speech synthesis means for creating synthesized speech by fitting the phoneme segments selected by the segment selection means to the prosodic information obtained by the prosody control means,
Pattern morpheme generation means for generating, during the processing of the text analysis means and based on a reading control rule for extracting a specific pattern, a morpheme that matches the pattern defined in the reading control rule;
Reading content changing means for changing the reading content of the input text by changing the reading content of the morpheme obtained by the pattern morpheme generation means;
A speech synthesizer characterized by comprising reading control signal input means for inputting a reading control signal indicating whether or not to execute the reading content change of the reading content changing means.

The speech synthesizer according to claim 1, wherein the reading control signal input from the reading control signal input means is input after the user of the speech synthesizer has been asked to confirm, when the reading content would be changed, whether or not to change the reading content.

3. The speech synthesizer according to claim 1, wherein the reading control signal input from the reading control signal input means is configured to be able to set a change in reading contents in several stages.

4. The speech synthesizer according to claim 1, wherein the reading content changing means is configured to silence the changed voice.

4. The speech synthesizer according to claim 1, wherein the reading content changing means is configured to make the changed sound a specific sound effect or a continuous sound effect.

A text analysis step of taking text as input and, using a language dictionary holding headword, reading, and accent type information, dividing the text into analysis results including reading information and accent information;
A prosody control step of generating prosodic information for controlling intonation and rhythm based on the reading information and accent information obtained by the text analysis step;
A segment selection step of selecting phoneme segments from an acoustic dictionary storing phoneme segments, based on the reading information and accent information obtained by the text analysis step and the prosodic information obtained by the prosody control step;
In a speech synthesis method having a speech synthesis step of creating synthesized speech by fitting the phoneme segments selected in the segment selection step to the prosodic information obtained in the prosody control step,
A pattern morpheme generation step of generating, during the processing of the text analysis step and based on a reading control rule for extracting a specific pattern, a morpheme that matches the pattern defined in the reading control rule;
A reading content change step of changing the reading content of the input text by changing the reading content of the morpheme obtained by the pattern morpheme generation step;
A speech synthesis method characterized by comprising a reading control signal input step of inputting a reading control signal indicating whether or not to execute the reading content change in the reading content change step.

A text analysis process of taking text as input and, using a language dictionary holding headword, reading, and accent type information, dividing the text into analysis results including reading information and accent information;
A prosody control process of generating prosodic information for controlling intonation and rhythm based on the reading information and accent information obtained by the text analysis process;
A segment selection process of selecting phoneme segments from an acoustic dictionary storing phoneme segments, based on the reading information and accent information obtained by the text analysis process and the prosodic information obtained by the prosody control process;
In a speech synthesis program for causing a computer to implement speech synthesis means for creating synthesized speech by fitting the phoneme segments selected in the segment selection process to the prosodic information obtained by the prosody control process,
A pattern morpheme generation process of generating, during the text analysis process and based on a reading control rule for extracting a specific pattern, a morpheme that matches the pattern defined in the reading control rule;
A reading content change process of changing the reading content of the input text by changing the reading content of the morpheme obtained by the pattern morpheme generation process;
A speech synthesis program for causing a computer to further implement a reading control signal input process of inputting a reading control signal indicating whether or not to execute the reading content change in the reading content change process.