JP6897132B2

JP6897132B2 - Speech processing methods, audio processors and programs

Info

Publication number: JP6897132B2
Application number: JP2017022418A
Authority: JP
Inventors: 優樹瀬戸
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2017-02-09
Filing date: 2017-02-09
Publication date: 2021-06-30
Anticipated expiration: 2037-02-09
Also published as: JP2018128607A

Description

The present invention relates to a technique for synthesizing speech that pronounces a specific character string.

In transportation facilities such as trains and in commercial facilities such as stores, various voices are emitted to guide users. For example, Patent Document 1 discloses a configuration in which a guidance voice is generated by a known speech synthesis process, such as the concatenative (unit-connection) type, and reproduced from a sound emitting device in the facility.

Patent Document 1: JP-A-2016-76201

A guidance voice expressed in a specific language (hereinafter the "first language") may include speech corresponding to another language (hereinafter the "second language"). For example, a proper noun expressed in Japanese, such as a station name or a place name, may be included in a guidance voice expressed in English. However, when speech for a character string expressed in the second language is synthesized by a speech synthesis process designed for the first language, it is in practice difficult to obtain speech whose phonemes (pronunciation content) and intonation sound natural. For example, when a speech synthesis process for English is used to synthesize speech pronouncing the Japanese place name "Tateyama" (tateyama), speech such as "tateiama" may be generated. In view of the above circumstances, an object of the present invention is to synthesize speech whose phonemes and intonation sound natural even when a specific character string expressed in the first language contains a portion in the second language.

To solve the above problem, a voice processing method according to a preferred aspect of the present invention generates an audio signal representing a voice in which a first portion of a designated character string is pronounced in a first language and a voice in which a second portion of the designated character string, different from the first portion, is pronounced. In generating the audio signal, a speech synthesis process using speech synthesis data for a second language different from the first language is executed for the second portion.
A voice processing device according to a preferred aspect of the present invention includes a speech synthesis unit that generates an audio signal representing a voice in which a first portion of a designated character string is pronounced in a first language and a voice in which a second portion of the designated character string, different from the first portion, is pronounced. The speech synthesis unit executes, for the second portion, a speech synthesis process using speech synthesis data for a second language different from the first language.

FIG. 1 is a configuration diagram of a voice processing device according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram of the relationship among a designated character string, a fixed portion, and an atypical portion.
FIG. 3 is an explanatory diagram of a screen for entering the atypical portion.
FIG. 4 is a configuration diagram focusing on the functions of the control device in the voice processing device.
FIG. 5 is a flowchart of the signal generation process executed by the control device.
FIG. 6 is a configuration diagram of a voice processing device according to a second embodiment.
FIG. 7 is a configuration diagram focusing on the functions of the control device in the second embodiment.
FIG. 8 is a flowchart of the signal generation process executed by the control device of the second embodiment.
FIG. 9 is a configuration diagram of a voice processing device according to a third embodiment.
FIG. 10 is a configuration diagram focusing on the functions of the control device in the third embodiment.
FIG. 11 is a flowchart of the speech synthesis process executed by the control device of the third embodiment.
FIG. 12 is a configuration diagram focusing on the functions of the control device in the voice processing device of a fourth embodiment.
FIG. 13 is a configuration diagram of a terminal device in the fourth embodiment.

<First Embodiment>
FIG. 1 is a configuration diagram of a voice processing device 100 according to the first embodiment of the present invention. As illustrated in FIG. 1, the voice processing device 100 of the first embodiment is a sound system that is installed in a transportation facility such as a railway (for example, inside a station) and emits, to users in the facility, a voice G representing guidance about the facility (hereinafter the "guidance voice").

The guidance voice G is a voice pronouncing a character string Q designated by the administrator of the voice processing device 100 (hereinafter the "designated character string"). FIG. 2 illustrates an English designated character string Q: "We have found a child, who tells us his name is Yuki Suzuki." As illustrated in FIG. 2, the designated character string Q of the first embodiment includes a fixed portion Qa and an atypical portion Qb.

The fixed portion Qa (an example of the first portion) is a stock character string whose content is anticipated in advance, and consists of words of a specific language (hereinafter the "first language"). FIG. 2 shows a fixed portion Qa expressed in English, which is an example of the first language. The atypical portion Qb (an example of the second portion), on the other hand, is a character string that changes according to, for example, the situation in the facility. As illustrated in FIG. 2, a proper noun such as the name of a child lost in the facility is a typical example of the atypical portion Qb. The atypical portion Qb can be a phrase in a language different from the first language (hereinafter the "second language"). In the designated character string Q of FIG. 2, the phrase "Yuki Suzuki", which represents a name, is the atypical portion Qb. That is, the atypical portion Qb is, for example, a proper noun normally used in Japanese (such as a Japanese personal name or a place name in Japan). Put differently, the fixed portion Qa expresses the general, basic content of the guidance, and the atypical portion Qb expresses its individual or specific content. Although FIG. 2 shows a designated character string Q containing one atypical portion Qb, a single designated character string Q may contain a plurality of atypical portions Qb.
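The split of a designated character string Q into fixed portions Qa and atypical portions Qb can be pictured as a small data structure. The following is a minimal sketch under stated assumptions, not the patent's implementation: the `Segment` class, the `build_designated_string` function, the `{}` slot marker, and the language codes `"en"` and `"ja"` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One portion of the designated character string Q (hypothetical model)."""
    text: str
    language: str  # assumed codes: "en" for the first language, "ja" for the second
    fixed: bool    # True for a fixed portion Qa, False for an atypical portion Qb


def build_designated_string(template: str, fillers: List[str],
                            first_lang: str = "en", second_lang: str = "ja",
                            slot: str = "{}") -> List[Segment]:
    """Split a template at each slot marker and interleave the fillers as Qb."""
    parts = template.split(slot)
    if len(parts) - 1 != len(fillers):
        raise ValueError("number of slots and number of fillers differ")
    segments: List[Segment] = []
    for i, part in enumerate(parts):
        if part:  # skip empty text between adjacent slots
            segments.append(Segment(part, first_lang, True))
        if i < len(fillers):
            segments.append(Segment(fillers[i], second_lang, False))
    return segments


q = build_designated_string(
    "We have found a child, who tells us his name is {}.", ["Yuki Suzuki"])
for seg in q:
    print("Qa" if seg.fixed else "Qb", seg.language, repr(seg.text))
```

Because the function accepts a list of fillers, the same sketch also covers the case in which one designated character string contains several atypical portions.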

As illustrated in FIG. 1, the voice processing device 100 is a computer system including a control device 11, a storage device 12, a display device 13, an operation device 14, and a sound emitting device 15. For example, an information terminal such as a tablet terminal or a personal computer can be used as the voice processing device 100. A display terminal for guidance, such as an electric bulletin board installed in a railway operator's facility or an electronic signboard (for example, digital signage) installed in a commercial facility, can also be used as the voice processing device 100. The voice processing device 100 may be realized as a single device or as a plurality of devices configured separately from one another (that is, as a system).

The display device 13 (for example, a liquid crystal display panel) displays various images under the control of the control device 11. The operation device 14 is an input device that receives instructions from the administrator. For example, a plurality of controls operable by the administrator, or a touch panel that detects contact with the display surface of the display device 13, is suitable as the operation device 14. In the first embodiment, by operating the operation device 14 as appropriate, the administrator of the voice processing device 100 can select the fixed portion Qa of the designated character string Q from a plurality of candidates prepared in advance, and can designate an arbitrary character string, such as the name of a lost child, as the atypical portion Qb.

The control device 11 consists of a processing circuit such as a CPU (Central Processing Unit) and controls the elements of the voice processing device 100 as a whole. Specifically, the control device 11 of the first embodiment generates an audio signal X representing a voice pronouncing the designated character string Q. The sound emitting device 15 reproduces the sound corresponding to the audio signal X generated by the control device 11. The D/A converter that converts the audio signal X generated by the control device 11 from digital to analog is omitted from the drawing for convenience. The storage device 12 stores the program executed by the control device 11 and the various data used by the control device 11. A known recording medium, such as a semiconductor recording medium or a magnetic recording medium, can be adopted as the storage device 12.

The storage device 12 of the first embodiment stores a plurality of recorded signals R corresponding to different fixed portions Qa. The recorded signal R corresponding to any one fixed portion Qa is a signal representing a voice pronouncing that fixed portion Qa (that is, a voice expressed in the first language). The recorded signals R are generated by having a specific speaker pronounce each of the fixed portions Qa in turn and recording the utterances with a sound collecting device. The recorded signals R generated in this way are stored in the storage device 12 in advance (that is, before the audio signal X is generated).

Because the atypical portion Qb is a character string needed only on a one-off basis, it is difficult to prepare a recorded signal R for it in advance. It is also conceivable that no recorded signal R has been recorded for a designated character string Q used in a newly opened facility such as a store. In view of these circumstances, the first embodiment generates the voice of the atypical portion Qb by a speech synthesis process.

As illustrated in FIG. 1, the storage device 12 of the first embodiment stores a speech synthesis program P2 and speech synthesis data D2 for the second language. The speech synthesis program P2 is software (a speech synthesis engine) that realizes a speech synthesis process for synthesizing a voice corresponding to an arbitrary character string in the second language. The first embodiment takes as its example a concatenative speech synthesis process, in which a plurality of voice elements are connected to one another on the time axis.

The speech synthesis data D2 is used in the speech synthesis process for the atypical portion Qb. The first embodiment assumes that the voice of the atypical portion Qb is synthesized by a concatenative speech synthesis process. As illustrated in FIG. 1, the speech synthesis data D2 includes pronunciation rule data Da2 and voice element data Db2. The pronunciation rule data Da2 defines the relationship between character strings of the second language and phonetic symbols (that is, the rules for converting a character string into phonetic symbols). The voice element data Db2 is a set of voice elements (a speech synthesis library). Each voice element is, for example, a single phoneme such as a vowel or a consonant, or a phoneme chain in which a plurality of phonemes are connected (for example, a diphone or a triphone). In the first embodiment, a plurality of voice elements collected from voices pronouncing words of the second language are registered in the voice element data Db2.
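To make the notion of a phoneme chain concrete, the sketch below expands a phoneme sequence into the diphone names that a library such as the voice element data Db2 might be indexed by. This is an illustration only: the `to_diphones` function, the `a-b` naming scheme, and the `#` silence marker are assumptions, and the patent does not specify how voice elements are keyed.

```python
from typing import List


def to_diphones(phonemes: List[str], boundary: str = "#") -> List[str]:
    """Expand a phoneme sequence into overlapping diphone names.

    The boundary symbol stands for silence before and after the utterance,
    so the first and last units cover the transitions into and out of speech.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


print(to_diphones(["t", "a", "t", "e"]))
```

Each diphone spans the transition from one phoneme to the next, which is why a sequence of n phonemes yields n + 1 units once the boundaries are included.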

FIG. 4 is a configuration diagram focusing on the functions of the control device 11. As illustrated in FIG. 4, the control device 11 of the first embodiment executes the program stored in the storage device 12 and thereby realizes a plurality of functions for generating the audio signal X of the designated character string Q (a character string setting unit 22 and a speech synthesis unit 24A). A configuration in which some functions of the control device 11 are realized by dedicated electronic circuits, or in which the functions of the control device 11 are distributed over a plurality of devices, may also be adopted.

The character string setting unit 22 sets the designated character string Q, which includes the fixed portion Qa and the atypical portion Qb. Specifically, the character string setting unit 22 of the first embodiment sets the fixed portion Qa and the atypical portion Qb according to the administrator's instructions to the operation device 14. For example, the character string setting unit 22 sets, as the fixed portion Qa, the character string that the administrator selects from a plurality of candidates by operating the operation device 14. The character string setting unit 22 also sets, as the atypical portion Qb, an arbitrary character string that the administrator designates by operating the operation device 14. For example, as illustrated in FIG. 3, the character string that the administrator enters in the input field 132 displayed on the display device 13 is set as the atypical portion Qb. The atypical portion Qb may also be specified in a character type different from that of the fixed portion Qa (for example, katakana). For example, a foreigner's name can be specified as the atypical portion Qb in katakana, written as a Japanese listener hears it.

The speech synthesis unit 24A in FIG. 4 generates the audio signal X representing the guidance voice G that pronounces the designated character string Q set by the character string setting unit 22. As illustrated in FIG. 4, the speech synthesis unit 24A of the first embodiment includes a first processing unit 32A, a second processing unit 34, and a connection processing unit 36.

The first processing unit 32A generates a first signal X1 representing the voice of the fixed portion Qa of the designated character string Q set by the character string setting unit 22. The first processing unit 32A of the first embodiment selects, as the first signal X1, the one recorded signal R that corresponds to the fixed portion Qa from among the recorded signals R stored in the storage device 12. The second processing unit 34 generates a second signal X2 representing the voice corresponding to the atypical portion Qb of the designated character string Q set by the character string setting unit 22. The second processing unit 34 of the first embodiment is realized by the control device 11 executing the speech synthesis program P2, and generates the second signal X2 by a speech synthesis process using the speech synthesis data D2 for the second language stored in the storage device 12. The connection processing unit 36 generates the audio signal X by connecting the first signal X1 generated by the first processing unit 32A and the second signal X2 generated by the second processing unit 34.

FIG. 5 is a flowchart of the process in which the speech synthesis unit 24A of the first embodiment generates the audio signal X (hereinafter the "signal generation process"). The signal generation process is executed each time the character string setting unit 22 sets a designated character string Q.

When the signal generation process starts, the first processing unit 32A selects, from the plurality of recorded signals R, the first signal X1 representing the voice corresponding to the fixed portion Qa of the designated character string Q set by the character string setting unit 22 (Sa1: first process). That is, the one recorded signal R corresponding to the fixed portion Qa of the designated character string Q is selected as the first signal X1. The first signal X1 is a signal representing a first-language voice pronouncing the fixed portion Qa.

The second processing unit 34 generates, by a speech synthesis process, the second signal X2 representing the voice of the atypical portion Qb of the designated character string Q set by the character string setting unit 22 (Sa2: second process). As detailed below, the second processing unit 34 of the first embodiment generates the second signal X2 by a speech synthesis process using the speech synthesis data D2 for the second language stored in the storage device 12 (Sa21 to Sa24).

First, the second processing unit 34 determines the phonetic symbols corresponding to the atypical portion Qb by referring to the pronunciation rule data Da2 of the speech synthesis data D2 (Sa21). The pronunciation rule data Da2 of the first embodiment defines the relationship between character strings of the second language (for example, Japanese) and phonetic symbols. In step Sa21, therefore, the phonetic symbols determined from the atypical portion Qb are those perceived as a natural reading of a second-language phrase.
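As a rough illustration of step Sa21, the sketch below segments a romanized Japanese string into mora-like units, which is the kind of reading that second-language pronunciation rules would favor for "tateyama". It is a toy stand-in for the pronunciation rule data Da2 rather than the patent's rule set: the `to_morae` function is hypothetical, and the rules handle only plain consonant-plus-vowel syllables and the moraic "n" (written `N` here). Long vowels and doubled consonants are not treated specially.

```python
from typing import List

VOWELS = "aiueo"


def to_morae(romaji: str) -> List[str]:
    """Segment romanized Japanese into mora-like units (simplified rules)."""
    s = romaji.lower().replace(" ", "")
    morae: List[str] = []
    i = 0
    while i < len(s):
        # Moraic "n": an "n" that is not followed by a vowel or "y".
        if s[i] == "n" and (i + 1 == len(s) or s[i + 1] not in VOWELS + "y"):
            morae.append("N")
            i += 1
            continue
        # Otherwise consume consonants up to and including the next vowel.
        j = i
        while j < len(s) and s[j] not in VOWELS:
            j += 1
        if j == len(s):
            raise ValueError(f"trailing consonants without a vowel: {s[i:]!r}")
        morae.append(s[i:j + 1])
        i = j + 1
    return morae


print(to_morae("tateyama"))
print(to_morae("Yuki Suzuki"))
```

Under these rules "tateyama" is split as ta, te, ya, ma, so the "te" is kept as a single unit instead of being read with an English-style diphthong.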

Next, the second processing unit 34 selects, from the voice element data Db2, a plurality of voice elements corresponding to the phonetic symbols of the atypical portion Qb (Sa22). The second processing unit 34 then adjusts the characteristics of each selected voice element as appropriate (Sa23). For example, the pitch and volume, which affect the intonation of the guidance voice G, are adjusted. The second processing unit 34 generates the second signal X2 by connecting the adjusted voice elements to one another on the time axis (Sa24). As described above, the voice element data Db2 of the first embodiment contains voice elements collected from voices pronouncing the second language (for example, Japanese). In step Sa24, therefore, a second signal X2 is generated that represents a voice sounding natural as second-language speech. The order of the first process Sa1 and the second process Sa2 may be reversed.
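Steps Sa22 to Sa24 can be sketched with toy waveforms. This is a simplified illustration under several assumptions: sine tones stand in for recorded voice elements, the `DB2` dictionary is a hypothetical stand-in for the voice element data Db2, the adjustment covers volume only (pitch adjustment is left out), and the linear crossfade in `concatenate` is one common way to join units that the patent does not itself prescribe.

```python
import math
from typing import Dict, List

Signal = List[float]


def tone(freq_hz: float, n_samples: int = 400, rate: int = 8000) -> Signal:
    """Toy waveform standing in for a recorded voice element."""
    return [math.sin(2 * math.pi * freq_hz * n / rate) for n in range(n_samples)]


# Hypothetical stand-in for voice element data Db2 (one toy unit per symbol).
DB2: Dict[str, Signal] = {
    "ta": tone(220), "te": tone(247), "ya": tone(262), "ma": tone(294),
}


def select_units(symbols: List[str], db: Dict[str, Signal]) -> List[Signal]:
    """Sa22: pick one unit per phonetic symbol (KeyError if a unit is missing)."""
    return [db[s] for s in symbols]


def adjust(unit: Signal, gain: float) -> Signal:
    """Sa23: scale the volume; pitch adjustment is omitted in this sketch."""
    return [gain * x for x in unit]


def concatenate(units: List[Signal], fade: int = 16) -> Signal:
    """Sa24: join units on the time axis with a short linear crossfade."""
    out: Signal = []
    for unit in units:
        n = min(fade, len(out), len(unit))  # overlap length; 0 for the first unit
        base = len(out) - n
        for k in range(n):
            w = (k + 1) / (n + 1)  # weight of the incoming unit
            out[base + k] = out[base + k] * (1 - w) + unit[k] * w
        out.extend(unit[n:])
    return out


units = select_units(["ta", "te", "ya", "ma"], DB2)
x2 = concatenate([adjust(u, 0.5) for u in units])
print(len(x2))
```

The crossfade shortens the result by `fade` samples per joint, which avoids an abrupt discontinuity where two units meet.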

When the above processing is complete, the connection processing unit 36 generates the audio signal X by connecting the first signal X1 generated in the first process Sa1 and the second signal X2 generated in the second process Sa2 (Sa21 to Sa24) (Sa3: connection process). Specifically, the audio signal X is generated by inserting the second signal X2 into the section of the first signal X1 that corresponds to the atypical portion Qb. That is, the speech synthesis unit 24A of the first embodiment generates an audio signal X representing a voice in which the fixed portion Qa of the designated character string Q is pronounced in the first language and a voice in which the atypical portion Qb of the designated character string Q is pronounced in the second language. The audio signal X generated by the speech synthesis unit 24A is supplied to the sound emitting device 15, and the guidance voice G is thereby reproduced for users in the facility.
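The overall flow Sa1 to Sa3 of the first embodiment might be sketched as follows. The names here are assumptions made for illustration: the `Recording` tuple, its `slot_start` and `slot_end` sample indices marking the section reserved for the atypical portion Qb, and the injected `synthesize_second_language` callable are hypothetical, and the patent does not state how the insertion section is represented.

```python
from typing import Callable, Dict, List, NamedTuple

Signal = List[float]


class Recording(NamedTuple):
    """A recorded signal R plus the assumed position of its Qb section."""
    samples: Signal
    slot_start: int  # first sample of the section reserved for Qb
    slot_end: int    # one past the last sample of that section


def generate_audio_signal(qa_id: str, qb_text: str,
                          recordings: Dict[str, Recording],
                          synthesize_second_language: Callable[[str], Signal]
                          ) -> Signal:
    """Return the audio signal X for one designated character string."""
    rec = recordings[qa_id]                    # Sa1: select recorded signal R as X1
    x2 = synthesize_second_language(qb_text)   # Sa2: synthesize X2 for Qb
    if not 0 <= rec.slot_start <= rec.slot_end <= len(rec.samples):
        raise ValueError("slot lies outside the recorded signal")
    # Sa3: insert X2 into the section of X1 that corresponds to Qb.
    return rec.samples[:rec.slot_start] + x2 + rec.samples[rec.slot_end:]


recordings = {"lost_child": Recording([0.1, 0.1, 0.0, 0.0, 0.2], 2, 4)}
x = generate_audio_signal("lost_child", "Yuki Suzuki", recordings,
                          lambda text: [0.9, 0.9, 0.9])
print(x)
```

Passing the synthesizer in as a callable keeps the sketch independent of any particular engine, and the section may be longer or shorter than the synthesized signal because the slice replaces it entirely.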

As described above, in the first embodiment a speech synthesis process using the speech synthesis data D2 for the second language is executed for the atypical portion Qb of the designated character string Q. It is therefore possible to reproduce a guidance voice G whose phonemes and intonation sound natural for the atypical portion Qb.

<Second Embodiment>
A second embodiment of the present invention will now be described. In each configuration illustrated below, elements whose actions or functions are the same as in the first embodiment are given the reference signs used in the description of the first embodiment, and their detailed descriptions are omitted as appropriate.

FIG. 6 is a configuration diagram of the voice processing device 100 according to the second embodiment. As illustrated in FIG. 6, the storage device 12 of the second embodiment stores, in addition to the speech synthesis program P2 and speech synthesis data D2 for the second language described in the first embodiment, a speech synthesis program P1 and speech synthesis data D1 for the first language. The speech synthesis program P1 is software (a speech synthesis engine) that realizes a speech synthesis process for synthesizing a voice corresponding to an arbitrary character string in the first language.

The speech synthesis data D1 is data used in the speech synthesis process performed by the speech synthesis program P1, and includes pronunciation rule data Da1 and voice element data Db1. The pronunciation rule data Da1 defines the relationship between character strings of the first language and phonetic symbols. The voice element data Db1 is a set of voice elements collected from voices pronouncing words of the first language. The plurality of recorded signals R described in the first embodiment are omitted in the second embodiment.

FIG. 7 is a configuration diagram focusing on the functions of the control device 11. As illustrated in FIG. 7, the control device 11 of the second embodiment functions as the character string setting unit 22 and a speech synthesis unit 24B. As in the first embodiment, the character string setting unit 22 sets the designated character string Q, which includes the fixed portion Qa and the atypical portion Qb.

The speech synthesis unit 24B generates the audio signal X representing the guidance voice G that pronounces the designated character string Q set by the character string setting unit 22. The speech synthesis unit 24B of the second embodiment is the speech synthesis unit 24A of the first embodiment with the first processing unit 32A replaced by a first processing unit 32B. The first processing unit 32B of the second embodiment is realized by the control device 11 executing the speech synthesis program P1 for the first language, and generates the first signal X1 by a speech synthesis process using the speech synthesis data D1 for the first language stored in the storage device 12. The functions of the second processing unit 34 and the connection processing unit 36 are the same as in the first embodiment.

FIG. 8 is a flowchart of the signal generation process in which the speech synthesis unit 24B of the second embodiment generates the audio signal X. When the signal generation process starts, the first processing unit 32B generates, by a speech synthesis process, the first signal X1 representing the voice of the fixed portion Qa of the designated character string Q (Sb1: first process). As detailed below, the first processing unit 32B of the second embodiment generates the first signal X1 by a speech synthesis process using the speech synthesis data D1 for the first language stored in the storage device 12 (Sb11 to Sb14).

Specifically, the first processing unit 32B determines the phonetic symbols corresponding to the fixed portion Qa by referring to the pronunciation rule data Da1 for the first language (Sb11). The phonetic symbols determined from the fixed portion Qa are therefore those perceived as a natural reading of a first-language phrase. The first processing unit 32B then selects, from the voice element data Db1 for the first language, a plurality of voice elements corresponding to the phonetic symbols of the fixed portion Qa (Sb12), and adjusts the characteristics of each voice element as appropriate (Sb13). Finally, the first processing unit 32B generates the first signal X1 by connecting the adjusted voice elements to one another on the time axis (Sb14). As described above, the voice element data Db1 contains voice elements collected from voices pronouncing the first language, so the generated first signal X1 represents a voice that sounds natural as first-language speech.

第２処理部３４は、第１実施形態と同様に、指定文字列Ｑの非定型部分Ｑbの音声を表す第２信号Ｘ2を、第２言語用の音声合成データＤ2を利用した音声合成処理により生成する（Ｓb2：第２処理）。第２処理部３４が第２信号Ｘ2を生成する第２処理Ｓb2の内容は、第１実施形態の第２処理Ｓa2（Ｓa21−Ｓa24）と同様である。すなわち、第２処理部３４は、第２言語用の発音規則データＤa2を利用して非定型部分Ｑbの発音記号を決定する処理（Ｓb21）と、発音記号に対応する複数の音声素片を第２言語用の音声素片データＤb2から選択する処理（Ｓb22）と、各音声素片の調整（Ｓb23）および接続（Ｓb24）により第２信号Ｘ2を生成する処理とを実行する。なお、第１処理Ｓb1と第２処理Ｓb2との先後を逆転することも可能である。 As in the first embodiment, the second processing unit 34 generates the second signal X2, which represents the voice of the atypical portion Qb of the designated character string Q, by voice synthesis processing using the voice synthesis data D2 for the second language (Sb2: second process). The content of the second process Sb2, in which the second processing unit 34 generates the second signal X2, is the same as that of the second process Sa2 (Sa21-Sa24) of the first embodiment. That is, the second processing unit 34 executes a process of determining the phonetic symbols of the atypical portion Qb using the pronunciation rule data Da2 for the second language (Sb21), a process of selecting a plurality of speech elements corresponding to those phonetic symbols from the speech element data Db2 for the second language (Sb22), and a process of generating the second signal X2 by adjusting (Sb23) and connecting (Sb24) the speech elements. The order of the first process Sb1 and the second process Sb2 may also be reversed.

以上の処理が完了すると、接続処理部３６は、第１処理Ｓb1（Ｓb11−Ｓb14）で生成した第１信号Ｘ1と第２処理Ｓb2（Ｓb21−Ｓb24）で生成した第２信号Ｘ2とを接続することで音声信号Ｘを生成する（Ｓb3：接続処理）。すなわち、第２実施形態の音声合成部２４Bは、第１実施形態と同様に、指定文字列Ｑの定型部分Ｑaを第１言語で発音した音声と、指定文字列Ｑの非定型部分Ｑbを第２言語で発音した音声とを表す音声信号Ｘを生成する。音声合成部２４Bが生成した音声信号Ｘが放音装置１５に供給されることで、施設内の利用者に対して案内音声Ｇが再生される。 When the above processing is complete, the connection processing unit 36 generates the voice signal X by connecting the first signal X1 generated in the first process Sb1 (Sb11-Sb14) and the second signal X2 generated in the second process Sb2 (Sb21-Sb24) (Sb3: connection process). That is, as in the first embodiment, the voice synthesis unit 24B of the second embodiment generates a voice signal X representing the voice of the fixed portion Qa of the designated character string Q pronounced in the first language and the voice of the atypical portion Qb of the designated character string Q pronounced in the second language. The voice signal X generated by the voice synthesis unit 24B is supplied to the sound emitting device 15, so that the guidance voice G is reproduced for users in the facility.
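The flow of FIG. 8 can be illustrated with a toy sketch. It assumes that pronunciation rule data can be modeled as a dictionary from text to phonetic symbols and that speech element data can be modeled as a dictionary from symbols to sample lists; the tables `DA1`, `DB1`, `DA2`, `DB2`, the fixed gain, and all sample values are hypothetical placeholders, not data from the patent.

```python
# Toy sketch of the second embodiment: Sb1 (first process), Sb2 (second process),
# Sb3 (connection process). All tables and values are hypothetical placeholders.

def synthesize(text, pronunciation_rules, speech_elements, gain=0.5):
    """Concatenative synthesis for one language."""
    symbols = pronunciation_rules[text]                     # Sb11 / Sb21: phonetic symbols
    selected = [speech_elements[s] for s in symbols]        # Sb12 / Sb22: select elements
    adjusted = [[gain * x for x in e] for e in selected]    # Sb13 / Sb23: adjust (fixed gain here)
    return [x for e in adjusted for x in e]                 # Sb14 / Sb24: connect on the time axis

# First-language data D1 = (Da1, Db1), e.g. English
DA1 = {"next": ["n", "eh", "k", "s", "t"]}
DB1 = {"n": [0.2, 0.4], "eh": [0.6], "k": [0.8], "s": [1.0], "t": [1.2]}
# Second-language data D2 = (Da2, Db2), e.g. Japanese
DA2 = {"tateyama": ["ta", "te", "ya", "ma"]}
DB2 = {"ta": [1.0], "te": [2.0], "ya": [3.0], "ma": [4.0]}

def generate_voice_signal(fixed_part, atypical_part):
    x1 = synthesize(fixed_part, DA1, DB1)       # Sb1: first signal X1
    x2 = synthesize(atypical_part, DA2, DB2)    # Sb2: second signal X2
    return x1 + x2                              # Sb3: connect X1 and X2

signal_x = generate_voice_signal("next", "tateyama")
```

Because `synthesize` receives its rule table and element table as arguments, the same routine serves both languages; only the data differs between the first and second processes.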

以上に説明した通り、第２実施形態では、指定文字列Ｑのうち非定型部分Ｑbについては第２言語用の音声合成データＤ2を利用した音声合成処理が実行される。したがって、第１実施形態と同様に、非定型部分Ｑbについて音韻および抑揚が聴感的に自然な案内音声Ｇを再生することが可能である。 As described above, in the second embodiment, the voice synthesis process using the voice synthesis data D2 for the second language is executed for the atypical portion Qb of the designated character string Q. Therefore, as in the first embodiment, it is possible to reproduce the guidance voice G whose phoneme and intonation are audibly natural for the atypical portion Qb.

また、第２実施形態では、定型部分Ｑaの音声を表す第１信号Ｘ1が、第１言語用の音声合成データＤ1を利用した音声合成処理により生成される。したがって、第１実施形態と比較して、複数の収録信号Ｒを事前に用意して記憶装置１２に格納する必要がないという利点がある。他方、収録信号Ｒの音質は、音声合成処理で生成される第１信号Ｘ1の音質を一般的には上回る。以上の事情を考慮すると、事前に用意された複数の収録信号Ｒを選択的に第１信号Ｘ1として利用する第１実施形態によれば、第２実施形態と比較して、定型部分Ｑaの音質が高い音声信号Ｘを生成できるという利点がある。また、第１実施形態では、第１言語用の音声合成処理（第１処理Ｓb1）が不要であるから、制御装置１１の処理負荷が軽減されるという利点もある。 Further, in the second embodiment, the first signal X1 representing the voice of the fixed portion Qa is generated by voice synthesis processing using the voice synthesis data D1 for the first language. Compared with the first embodiment, this has the advantage that a plurality of recorded signals R need not be prepared in advance and stored in the storage device 12. On the other hand, the sound quality of a recorded signal R is generally higher than that of a first signal X1 generated by voice synthesis processing. The first embodiment, which selectively uses one of a plurality of recorded signals R prepared in advance as the first signal X1, therefore has the advantage over the second embodiment that it can generate a voice signal X in which the fixed portion Qa has higher sound quality. In addition, the first embodiment does not need the voice synthesis processing for the first language (the first process Sb1), which has the further advantage of reducing the processing load on the control device 11.

＜第３実施形態＞ <Third Embodiment>
図９は、第３実施形態に係る音声処理装置１００の構成図である。図９に例示される通り、第３実施形態の記憶装置１２は、第１言語用の音声合成プログラムＰ1および音声合成データＤ1と、第２言語用の発音規則データＤa2（音声合成データＤ2）とを記憶する。第１言語用の音声合成データＤ1は、発音規則データＤa1と音声素片データＤb1とを含んで構成される。第２言語用の発音規則データＤa2は、第１実施形態で前述した通り、第２言語の文字列と発音記号との関係を規定する。 FIG. 9 is a configuration diagram of the voice processing device 100 according to the third embodiment. As illustrated in FIG. 9, the storage device 12 of the third embodiment stores the voice synthesis program P1 and the voice synthesis data D1 for the first language, and the pronunciation rule data Da2 (voice synthesis data D2) for the second language. The voice synthesis data D1 for the first language includes the pronunciation rule data Da1 and the speech element data Db1. As described in the first embodiment, the pronunciation rule data Da2 for the second language defines the relationship between character strings of the second language and phonetic symbols.

図１０は、制御装置１１の機能に着目した構成図である。図１０に例示される通り、第３実施形態の制御装置１１は、文字列設定部２２および音声合成部２４Cとして機能する。文字列設定部２２は、第１実施形態と同様に、定型部分Ｑaと非定型部分Ｑbとを含む指定文字列Ｑを設定する。 FIG. 10 is a configuration diagram focusing on the function of the control device 11. As illustrated in FIG. 10, the control device 11 of the third embodiment functions as a character string setting unit 22 and a voice synthesis unit 24C. The character string setting unit 22 sets the designated character string Q including the standard portion Qa and the non-standard portion Qb, as in the first embodiment.

音声合成部２４Cは、文字列設定部２２が設定した指定文字列Ｑを発音した案内音声Ｇを表す音声信号Ｘを生成する。第３実施形態の音声合成部２４Cは、第１言語の音声合成プログラムＰ1により実現される。音声信号Ｘの生成において、音声合成部２４Cは、第１言語用の発音規則データＤa1により定型部分Ｑaの発音記号を決定し、第２言語用の発音規則データＤa2により非定型部分Ｑbの発音記号を決定する。 The voice synthesis unit 24C generates a voice signal X representing the guidance voice G in which the designated character string Q set by the character string setting unit 22 is pronounced. The voice synthesis unit 24C of the third embodiment is realized by the voice synthesis program P1 for the first language. In generating the voice signal X, the voice synthesis unit 24C determines the phonetic symbols of the fixed portion Qa using the pronunciation rule data Da1 for the first language, and determines the phonetic symbols of the atypical portion Qb using the pronunciation rule data Da2 for the second language.

図１１は、第３実施形態の音声合成部２４Cが音声信号Ｘを生成する処理（音声合成処理）のフローチャートである。文字列設定部２２による指定文字列Ｑの設定毎に音声合成処理が実行される。 FIG. 11 is a flowchart of a process (voice synthesis process) in which the voice synthesis unit 24C of the third embodiment generates a voice signal X. The voice synthesis process is executed for each setting of the designated character string Q by the character string setting unit 22.

音声合成処理を開始すると、音声合成部２４Cは、文字列設定部２２が設定した指定文字列Ｑの定型部分Ｑaに対応する発音記号を、第１言語用の発音規則データＤa1を参照して決定する（Ｓc1）。したがって、第１言語の語句として自然な読み方と認識される発音記号が定型部分Ｑaから決定される。 When the speech synthesis process is started, the speech synthesis unit 24C determines the phonetic symbols corresponding to the fixed portion Qa of the designated character string Q set by the character string setting unit 22 with reference to the pronunciation rule data Da1 for the first language. (Sc1). Therefore, the phonetic symbols recognized as natural readings as words in the first language are determined from the fixed form Qa.

また、音声合成部２４Cは、指定文字列Ｑの非定型部分Ｑbに対応する発音記号を、第２言語用の発音規則データＤa2を参照して決定する（Ｓc2）。したがって、第２言語の語句として自然な読み方と認識される発音記号が非定型部分Ｑbから決定される。なお、定型部分Ｑaの発音記号の決定（Ｓc1）と非定型部分Ｑbの発音記号の決定（Ｓc2）との先後を逆転することも可能である。 Further, the speech synthesis unit 24C determines the phonetic symbols corresponding to the atypical portion Qb of the designated character string Q with reference to the pronunciation rule data Da2 for the second language (Sc2). Therefore, the phonetic symbols recognized as natural readings as words in the second language are determined from the atypical part Qb. It is also possible to reverse the process of determining the phonetic symbol of the standard portion Qa (Sc1) and determining the phonetic symbol of the atypical portion Qb (Sc2).

音声合成部２４Cは、定型部分Ｑaおよび非定型部分Ｑbについて決定した発音記号の音声を表す音声信号Ｘを生成する（Ｓc3）。具体的には、音声合成部２４Cは、まず、定型部分Ｑaおよび非定型部分Ｑbの発音記号に対応する複数の音声素片を音声素片データＤb1から選択する（Ｓc31）。そして、音声合成部２４Cは、音声素片データＤb1から選択した各音声素片の特性を適宜に調整し（Ｓc32）、調整後の複数の音声素片を時間軸上で相互に接続することで音声信号Ｘを生成する（Ｓc33）。音声合成部２４Cが生成した音声信号Ｘが放音装置１５に供給されることで、施設内の利用者に対して案内音声Ｇが再生される。 The voice synthesis unit 24C generates a voice signal X representing the voice of the phonetic symbol determined for the standard portion Qa and the atypical portion Qb (Sc3). Specifically, the voice synthesis unit 24C first selects a plurality of voice elements corresponding to the phonetic symbols of the standard portion Qa and the atypical portion Qb from the voice fragment data Db1 (Sc31). Then, the voice synthesis unit 24C appropriately adjusts the characteristics of each voice piece selected from the voice piece data Db1 (Sc32), and connects the adjusted plurality of voice pieces to each other on the time axis. Generates a voice signal X (Sc33). By supplying the voice signal X generated by the voice synthesis unit 24C to the sound emitting device 15, the guidance voice G is reproduced for the users in the facility.
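The processing of FIG. 11 can be sketched as follows. This is a minimal illustration that assumes the phonetic symbols produced by both rule sets can all be found in the single element inventory Db1; the tables `DA1`, `DA2`, `DB1`, the symbol set, and the fixed gain are hypothetical.

```python
# Toy sketch of the third embodiment: Sc1, Sc2 (symbol determination per part)
# followed by Sc3 (one shared synthesis pass). All data is hypothetical.

DA1 = {"next": ["n", "e", "k", "s", "t"]}                      # first-language rules Da1
DA2 = {"tateyama": ["t", "a", "t", "e", "y", "a", "m", "a"]}   # second-language rules Da2
PHONES = ["n", "e", "k", "s", "t", "a", "y", "m"]
DB1 = {p: [float(i + 1)] for i, p in enumerate(PHONES)}        # single element inventory Db1

def determine_symbols(parts):
    """parts: list of (text, is_atypical) pairs in utterance order."""
    symbols = []
    for text, is_atypical in parts:
        rules = DA2 if is_atypical else DA1    # Sc2 for atypical parts, Sc1 otherwise
        symbols.extend(rules[text])
    return symbols

def generate_voice_signal(parts, gain=0.5):
    symbols = determine_symbols(parts)
    selected = [DB1[s] for s in symbols]                    # Sc31: select from Db1 only
    adjusted = [[gain * x for x in e] for e in selected]    # Sc32: adjust
    return [x for e in adjusted for x in e]                 # Sc33: connect

signal_x = generate_voice_signal([("next", False), ("tateyama", True)])
```

Only the symbol lookup differs between the two parts; selection, adjustment, and connection run once over the whole symbol sequence, so no separate step for joining two signals is needed.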

第３実施形態では、指定文字列Ｑのうち非定型部分Ｑbについては第２言語用の発音規則データＤa2（音声合成データＤ2）を利用した音声合成処理が実行される。したがって、第１実施形態と同様に、非定型部分Ｑbについて音韻および抑揚が聴感的に自然な案内音声Ｇを再生することが可能である。 In the third embodiment, the voice synthesis process using the pronunciation rule data Da2 (speech synthesis data D2) for the second language is executed for the atypical part Qb of the designated character string Q. Therefore, as in the first embodiment, it is possible to reproduce the guidance voice G whose phoneme and intonation are audibly natural for the atypical portion Qb.

また、第３実施形態では、音声合成プログラムＰ1および音声素片データＤb1を利用して音声信号Ｘが生成されるから、第１実施形態および第２実施形態で例示した第２言語用の音声合成プログラムＰ2および音声素片データＤb2は不要である。したがって、第１言語用の音声合成プログラムＰ1および音声素片データＤb1と第２言語用の音声合成プログラムＰ2および音声素片データＤb2とが必要な第２実施形態と比較して、記憶装置１２に必要な記憶容量が削減されるという利点もある。また、第３実施形態では、第１信号Ｘ1と第２信号Ｘ2とを接続する接続処理（Ｓa3，Ｓb3）が不要である。例えば、第１信号Ｘ1と第２信号Ｘ2との時間的な関係を調整する処理（すなわち、第１信号Ｘ1のうち非定型部分Ｑbに対応した区間に第２信号Ｘ2を移動する処理）が不要である。したがって、定型部分Ｑaと非定型部分Ｑbとが自然に連結された案内音声Ｇが再生されるという利点もある。 Further, in the third embodiment, the voice signal X is generated using the voice synthesis program P1 and the speech element data Db1, so the voice synthesis program P2 and the speech element data Db2 for the second language exemplified in the first and second embodiments are unnecessary. Compared with the second embodiment, which requires both the voice synthesis program P1 and speech element data Db1 for the first language and the voice synthesis program P2 and speech element data Db2 for the second language, this has the advantage of reducing the storage capacity required of the storage device 12. In addition, the third embodiment does not need the connection process (Sa3, Sb3) that connects the first signal X1 and the second signal X2. For example, a process of adjusting the temporal relationship between the first signal X1 and the second signal X2 (that is, a process of moving the second signal X2 to the section of the first signal X1 that corresponds to the atypical portion Qb) is unnecessary. There is therefore the further advantage that a guidance voice G in which the fixed portion Qa and the atypical portion Qb are naturally joined is reproduced.

＜第４実施形態＞ <Fourth Embodiment>
図１２は、第４実施形態に係る音声処理装置１００の機能に着目した構成図である。図１２に例示される通り、第４実施形態の記憶装置１２は、相異なる指定文字列Ｑ（具体的には定型部分Ｑa）に対応する複数の配信情報Ｖを記憶する。任意の１種類の指定文字列Ｑに対応する配信情報Ｖは、当該指定文字列Ｑに関連する情報（以下「関連情報」という）Ｃを識別するための識別情報である。関連情報Ｃは、案内音声Ｇの再生とともに施設の利用者に提示すべき情報である。例えば指定文字列Ｑに関連する文字列、または、当該文字列を他言語に翻訳した文字列が、関連情報Ｃの好適例である。 FIG. 12 is a configuration diagram focusing on the functions of the voice processing device 100 according to the fourth embodiment. As illustrated in FIG. 12, the storage device 12 of the fourth embodiment stores a plurality of pieces of distribution information V corresponding to different designated character strings Q (specifically, to their fixed portions Qa). The distribution information V corresponding to any one designated character string Q is identification information for identifying information C related to that designated character string Q (hereinafter referred to as "related information"). The related information C is information to be presented to users of the facility together with the reproduction of the guidance voice G. For example, a character string related to the designated character string Q, or a character string obtained by translating that character string into another language, is a preferable example of the related information C.

第４実施形態の制御装置１１は、図１２に例示される通り、第１実施形態から第３実施形態の何れかと同様の文字列設定部２２および音声合成部２４（２４A−２４Cの何れか）に加えて、変調処理部２６および混合処理部２８として機能する。変調処理部２６は、文字列設定部２２が設定した指定文字列Ｑに応じた変調信号Ｍを生成する。変調信号Ｍは、指定文字列Ｑに対応した配信情報Ｖを音響成分として含む信号である。変調処理部２６は、記憶装置１２に記憶された複数の配信情報Ｖのうち指定文字列Ｑに対応する配信情報Ｖを検索し、当該配信情報Ｖを示す変調信号Ｍを生成する。具体的には、変調処理部２６は、例えば所定の周波数の正弦波等の搬送波を配信情報Ｖにより変調する周波数変調、または、拡散符号を利用した配信情報Ｖの拡散変調等の変調処理により変調信号Ｍを生成する。配信情報Ｖの音響成分の周波数帯域は、例えば、放音装置１５による再生が可能な周波数帯域であり、かつ、利用者が通常の環境で聴取する音の周波数帯域を上回る範囲（例えば１８ｋＨｚ以上かつ２０ｋＨｚ以下）に包含される。 As illustrated in FIG. 12, the control device 11 of the fourth embodiment functions as a modulation processing unit 26 and a mixing processing unit 28 in addition to a character string setting unit 22 and a voice synthesis unit 24 (any of 24A-24C) similar to those of any of the first to third embodiments. The modulation processing unit 26 generates a modulated signal M corresponding to the designated character string Q set by the character string setting unit 22. The modulated signal M is a signal that includes, as an acoustic component, the distribution information V corresponding to the designated character string Q. The modulation processing unit 26 searches the plurality of pieces of distribution information V stored in the storage device 12 for the distribution information V corresponding to the designated character string Q and generates a modulated signal M indicating that distribution information V. Specifically, the modulation processing unit 26 generates the modulated signal M by modulation processing such as frequency modulation, in which a carrier wave such as a sine wave of a predetermined frequency is modulated by the distribution information V, or spread modulation of the distribution information V using a spreading code. The frequency band of the acoustic component of the distribution information V is, for example, a band that the sound emitting device 15 can reproduce and that lies within a range above the frequency band of the sounds a user hears in a normal environment (for example, 18 kHz or more and 20 kHz or less).
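As one illustration of frequency modulation in the 18 kHz to 20 kHz band, the following sketch encodes an identifier with binary frequency-shift keying. The sample rate, the two tone frequencies, the bit duration, the 8-bit width, and the amplitude are all assumptions chosen for the example; the description does not specify them.

```python
import math

# Hypothetical parameters; none of these values come from the description.
SAMPLE_RATE = 48_000             # Hz, high enough to represent tones up to 20 kHz
F0, F1 = 18_500.0, 19_500.0      # tone for bit 0 / bit 1, inside the 18-20 kHz band
SAMPLES_PER_BIT = 480            # 10 ms per bit

def to_bits(value, width=8):
    """Most-significant-bit-first bit list of an integer identifier."""
    return [(value >> i) & 1 for i in reversed(range(width))]

def modulate(distribution_info, amplitude=0.1):
    """Return a modulated signal M carrying `distribution_info` as a tone sequence."""
    samples = []
    phase = 0.0
    for bit in to_bits(distribution_info):
        step = 2.0 * math.pi * (F1 if bit else F0) / SAMPLE_RATE
        for _ in range(SAMPLES_PER_BIT):
            samples.append(amplitude * math.sin(phase))
            phase += step        # phase carries over between bits (continuous phase)
    return samples

signal_m = modulate(0xA5)
```

Keeping the phase continuous across bit boundaries avoids abrupt jumps in the waveform, which would otherwise spread energy into the audible band.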

図１２の混合処理部２８は、音声合成部２４が生成した音声信号Ｘと変調処理部２６が生成した変調信号Ｍとを混合（例えば加算）することで音響信号Ｙを生成する。第４実施形態では、混合処理部２８が生成した音響信号Ｙが放音装置１５に供給される。放音装置１５は、音響信号Ｙが表す音を放音する。すなわち、音声信号Ｘが表す案内音声Ｇと変調信号Ｍが表す配信情報Ｖの音響成分とが放音装置１５から再生される。以上の説明から理解される通り、第４実施形態の放音装置１５は、指定文字列Ｑを表す案内音声Ｇを再生する音響機器として機能するほか、空気振動としての音波を伝送媒体とした音響通信で配信情報Ｖを送信する送信部としても機能する。 The mixing processing unit 28 of FIG. 12 generates an acoustic signal Y by mixing (for example, adding) the voice signal X generated by the voice synthesis unit 24 and the modulated signal M generated by the modulation processing unit 26. In the fourth embodiment, the acoustic signal Y generated by the mixing processing unit 28 is supplied to the sound emitting device 15. The sound emitting device 15 emits the sound represented by the acoustic signal Y. That is, the guidance voice G represented by the voice signal X and the acoustic component of the distribution information V represented by the modulated signal M are reproduced from the sound emitting device 15. As understood from the above description, the sound emitting device 15 of the fourth embodiment functions as an acoustic device that reproduces the guidance voice G representing the designated character string Q, and also as a transmission unit that transmits the distribution information V by acoustic communication using sound waves, that is, air vibration, as the transmission medium.
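Mixing by addition can be sketched in a few lines. The zero-padding of the shorter signal and the clipping limit are assumptions added for the example so that it is safe to run on signals of different lengths; the description only states that the signals are mixed, for example by addition.

```python
def mix(voice_signal, modulated_signal, limit=1.0):
    """Add signals X and M sample by sample to obtain the acoustic signal Y.

    The shorter input is treated as zero-padded, and the sum is clipped to
    [-limit, limit]; both choices are assumptions made for this sketch.
    """
    length = max(len(voice_signal), len(modulated_signal))
    mixed = []
    for i in range(length):
        x = voice_signal[i] if i < len(voice_signal) else 0.0
        m = modulated_signal[i] if i < len(modulated_signal) else 0.0
        mixed.append(max(-limit, min(limit, x + m)))
    return mixed

signal_y = mix([0.5, 0.9], [0.25, 0.5, -0.25])
```

In practice the modulated component would be kept at a low amplitude relative to the guidance voice so that clipping is rare.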

施設内の利用者は、図１２の端末装置５０を携帯する。端末装置５０は、例えば携帯電話機またはスマートフォン等の可搬型の情報端末である。なお、例えば、鉄道事業者の施設内に設置される電光掲示板、または商業施設に設置される電子看板（例えばデジタルサイネージ）等の案内用の表示端末を端末装置５０として利用することも可能である。 A user in the facility carries the terminal device 50 shown in FIG. 12. The terminal device 50 is a portable information terminal such as a mobile phone or a smartphone. A display terminal for guidance, such as an electric bulletin board installed in a railway operator's facility or an electronic signboard (for example, digital signage) installed in a commercial facility, can also be used as the terminal device 50.

図１３は、端末装置５０の構成図である。図１３に例示される通り、端末装置５０は、制御装置５１と記憶装置５２と収音装置５３と表示装置５４とを具備する。収音装置５３は、周囲の音を収音する音響機器（マイクロホン）である。具体的には、収音装置５３は、音声処理装置１００の放音装置１５による再生音を収音して音響信号Ｚを生成する。音響信号Ｚは、配信情報Ｖの音響成分を含み得る。以上の説明から理解される通り、収音装置５３は、端末装置５０の相互間の音声通話または動画撮影時の音声収録に利用されるほか、空気振動としての音波を伝送媒体とする音響通信で配信情報Ｖを受信する受信部としても機能する。 FIG. 13 is a configuration diagram of the terminal device 50. As illustrated in FIG. 13, the terminal device 50 includes a control device 51, a storage device 52, a sound collecting device 53, and a display device 54. The sound collecting device 53 is an audio device (microphone) that collects ambient sound. Specifically, the sound collecting device 53 collects the sound reproduced by the sound emitting device 15 of the voice processing device 100 and generates an acoustic signal Z. The acoustic signal Z may include the acoustic component of the distribution information V. As understood from the above description, the sound collecting device 53 is used for voice communication between the terminal devices 50 or voice recording at the time of moving image shooting, and is also used for acoustic communication using sound waves as air vibration as a transmission medium. It also functions as a receiving unit that receives the distribution information V.

制御装置５１は、例えばＣＰＵ等の処理回路で構成され、端末装置５０の各要素を統括的に制御する。表示装置５４（例えば液晶表示パネル）は、制御装置５１による制御のもとで各種の画像を表示する。記憶装置５２は、制御装置５１が実行するプログラムと制御装置５１が使用する各種のデータとを記憶する。例えば半導体記録媒体または磁気記録媒体等の公知の記録媒体が記憶装置５２として採用され得る。第４実施形態の記憶装置５２は、図１３に例示される通り、参照テーブルＴを記憶する。参照テーブルＴは、音声処理装置１００から送信され得る複数の配信情報Ｖ（Ｖ1，Ｖ2，…）の各々について関連情報Ｃ（Ｃ1，Ｃ2，…）が登録されたデータテーブルであり、配信情報Ｖに対応する関連情報Ｃを特定するために使用される。 The control device 51 is composed of a processing circuit such as a CPU and controls each element of the terminal device 50 in an integrated manner. The display device 54 (for example, a liquid crystal display panel) displays various images under the control of the control device 51. The storage device 52 stores a program executed by the control device 51 and various data used by the control device 51. A known recording medium such as a semiconductor recording medium or a magnetic recording medium can be adopted as the storage device 52. As illustrated in FIG. 13, the storage device 52 of the fourth embodiment stores a reference table T. The reference table T is a data table in which related information C (C1, C2, ...) is registered for each of a plurality of pieces of distribution information V (V1, V2, ...) that can be transmitted from the voice processing device 100, and it is used to identify the related information C corresponding to a piece of distribution information V.

制御装置５１は、記憶装置５２に記憶されたプログラムを実行することで、音声処理装置１００が送信した配信情報Ｖに関する処理を実行するための複数の機能（情報抽出部５１１および提示制御部５１３）を実現する。なお、制御装置５１の一部の機能を専用の電子回路で実現した構成、または、制御装置５１の機能を複数の装置に分散した構成も採用され得る。 By executing the program stored in the storage device 52, the control device 51 realizes a plurality of functions (an information extraction unit 511 and a presentation control unit 513) for processing the distribution information V transmitted by the voice processing device 100. A configuration in which some functions of the control device 51 are realized by a dedicated electronic circuit, or a configuration in which the functions of the control device 51 are distributed over a plurality of devices, can also be adopted.

情報抽出部５１１は、収音装置５３が生成した音響信号Ｚから配信情報Ｖを抽出する。具体的には、情報抽出部５１１は、音響信号Ｚのうち配信情報Ｖの音響成分を含む周波数帯域を強調するフィルタ処理と、配信情報Ｖに対する変調処理に対応した復調処理とを実行する。 The information extraction unit 511 extracts the distribution information V from the acoustic signal Z generated by the sound collecting device 53. Specifically, the information extraction unit 511 executes a filter process for emphasizing the frequency band including the acoustic component of the distribution information V in the acoustic signal Z, and a demodulation process corresponding to the modulation process for the distribution information V.
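The band-emphasizing filter and the demodulation step can be illustrated together with the Goertzel algorithm, which measures signal power near a single frequency. The sketch below assumes the transmitter used binary frequency-shift keying with the hypothetical parameters shown (two tones, 10 ms per bit, 8 bits) and that bit boundaries are already aligned; the description does not fix any of these details.

```python
import math

# Hypothetical parameters, assumed to match the transmitter side.
SAMPLE_RATE = 48_000
F0, F1 = 18_500.0, 19_500.0      # tone for bit 0 / bit 1
SAMPLES_PER_BIT = 480

def goertzel_power(block, freq):
    """Power of `block` near `freq` (Goertzel algorithm, acts as a narrow band filter)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / SAMPLE_RATE)
    s1 = s2 = 0.0
    for x in block:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def demodulate(acoustic_signal, n_bits=8):
    """Recover the identifier by comparing tone powers in each bit interval."""
    value = 0
    for k in range(n_bits):
        block = acoustic_signal[k * SAMPLES_PER_BIT:(k + 1) * SAMPLES_PER_BIT]
        bit = 1 if goertzel_power(block, F1) > goertzel_power(block, F0) else 0
        value = (value << 1) | bit
    return value
```

With 480 samples per bit at 48 kHz, both tone frequencies fall on exact multiples of 100 Hz, so each tone contributes almost nothing to the power measured at the other frequency. A real receiver would also need synchronization and error detection, which are omitted here.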

提示制御部５１３は、表示装置５４による情報の表示を制御する。第４実施形態の提示制御部５１３は、情報抽出部５１１が抽出した配信情報Ｖに対応する関連情報Ｃを表示装置５４に表示させる。具体的には、提示制御部５１３は、参照テーブルＴに登録された複数の関連情報Ｃのうち情報抽出部５１１が抽出した配信情報Ｖに対応する関連情報Ｃを検索し、当該関連情報Ｃを表示装置５４に表示させる。したがって、音声処理装置１００の放音装置１５による案内音声Ｇの再生に並行して、当該案内音声Ｇに対応した関連情報Ｃが表示装置５４に表示される。 The presentation control unit 513 controls the display of information by the display device 54. The presentation control unit 513 of the fourth embodiment causes the display device 54 to display the related information C corresponding to the distribution information V extracted by the information extraction unit 511. Specifically, the presentation control unit 513 searches the plurality of pieces of related information C registered in the reference table T for the related information C corresponding to the distribution information V extracted by the information extraction unit 511, and causes the display device 54 to display that related information C. Therefore, in parallel with the reproduction of the guidance voice G by the sound emitting device 15 of the voice processing device 100, the related information C corresponding to the guidance voice G is displayed on the display device 54.
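The lookup in the reference table T amounts to a keyed search. A minimal sketch, assuming the distribution information V is an integer identifier and the related information C is a text string; the table entries are invented for illustration.

```python
# Hypothetical reference table T: distribution information V -> related information C.
REFERENCE_TABLE = {
    1: "The next stop is Tateyama.",
    2: "The doors on the left side will open.",
}

def related_information(distribution_info, table=REFERENCE_TABLE):
    """Return the related information C for an extracted V, or None if unregistered."""
    return table.get(distribution_info)

text_to_display = related_information(1)
```

Returning `None` for an unregistered identifier lets the caller decide whether to show nothing or a fallback message.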

第４実施形態においても第１実施形態と同様の効果が実現される。また、第４実施形態では、関連情報Ｃを示す配信情報Ｖが音声処理装置１００から端末装置５０に送信される。したがって、案内音声Ｇに関連する関連情報Ｃを端末装置５０により利用者に提示することが可能である。 The same effect as that of the first embodiment is realized in the fourth embodiment. Further, in the fourth embodiment, the distribution information V indicating the related information C is transmitted from the voice processing device 100 to the terminal device 50. Therefore, the related information C related to the guidance voice G can be presented to the user by the terminal device 50.

＜変形例＞ <Modifications>
以上に例示した各形態は多様に変形され得る。前述の各形態に適用され得る具体的な変形の態様を以下に例示する。以下の例示から任意に選択された２以上の態様は、相互に矛盾しない範囲で適宜に併合され得る。 Each of the embodiments exemplified above can be modified in various ways. Specific modifications that can be applied to each of the above embodiments are illustrated below. Two or more modifications arbitrarily selected from the following examples can be combined as appropriate to the extent that they do not contradict each other.

（１）前述の各形態では、表示装置１３に表示された入力欄１３２に入力された文字列を非定型部分Ｑbとしたが、指定文字列Ｑのうちの非定型部分Ｑbを文字列設定部２２が設定する方法は以上の例示に限定されない。例えば、形態素解析等の自然言語処理を指定文字列Ｑに対して実行することで固有名詞を抽出し、指定文字列Ｑのうち固有名詞の部分を非定型部分Ｑbとして設定することも可能である。また、定型部分Ｑaとは別個の文字種を利用して管理者が非定型部分Ｑbを図３の入力欄１３２に入力することも可能である。また、指定文字列Ｑを管理者が音声入力できる構成も好適である。例えば、管理者が発声した音声に対する音声認識で指定文字列Ｑが特定される。 (1) In each of the above embodiments, the character string entered in the input field 132 displayed on the display device 13 is used as the atypical portion Qb, but the method by which the character string setting unit 22 sets the atypical portion Qb of the designated character string Q is not limited to this example. For example, a proper noun can be extracted by applying natural language processing such as morphological analysis to the designated character string Q, and the proper-noun part of the designated character string Q can be set as the atypical portion Qb. The administrator can also enter the atypical portion Qb in the input field 132 of FIG. 3 using a character type different from that of the fixed portion Qa. A configuration in which the administrator can input the designated character string Q by voice is also preferable. For example, the designated character string Q is identified by speech recognition of the voice uttered by the administrator.
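The character-type variant can be sketched by splitting the designated character string into runs of the same character type. The example assumes, purely for illustration, that the fixed portion is typed in Latin letters and the atypical portion in katakana; the description does not say which character types are used.

```python
from itertools import groupby

def is_katakana(ch):
    """True for characters in the Unicode Katakana block (U+30A0 to U+30FF)."""
    return "\u30a0" <= ch <= "\u30ff"

def split_by_character_type(designated_string):
    """Split into (text, is_atypical) runs; katakana runs are treated as atypical."""
    return [("".join(group), atypical)
            for atypical, group in groupby(designated_string, key=is_katakana)]

parts = split_by_character_type("The next stop is タテヤマ.")
# parts == [("The next stop is ", False), ("タテヤマ", True), (".", False)]
```

The resulting list keeps the utterance order, so each run can be passed to the processing for its own language.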

（２）移動体通信網またはインターネット等の通信網を介して端末装置（例えば携帯電話機またはスマートフォン）と通信するサーバ装置により音声処理装置１００を実現することも可能である。例えば、音声処理装置１００は、端末装置から通信網を介して受信した指定文字列Ｑから音声信号Ｘを生成し、当該音声信号Ｘを端末装置に送信する。音声処理装置１００が生成した音声信号Ｘのうちの非定型部分Ｑbを、第１実施形態の収録信号Ｒとして利用することも可能である。また、使用頻度が低い（あるいは低音質でよい）非定型部分Ｑbの第２信号Ｘ2を、スマートフォン等の情報端末で実現された音声処理装置１００により生成し、使用頻度が高い（あるいは高品質が要求される）非定型部分Ｑbの第２信号Ｘ2を、サーバ装置で実現された音声処理装置１００により生成することも可能である。 (2) The voice processing device 100 can also be realized by a server device that communicates with a terminal device (for example, a mobile phone or a smartphone) via a communication network such as a mobile communication network or the Internet. For example, the voice processing device 100 generates a voice signal X from the designated character string Q received from the terminal device via the communication network and transmits the voice signal X to the terminal device. The atypical portion Qb of the voice signal X generated by the voice processing device 100 can also be used as a recorded signal R of the first embodiment. It is also possible to generate the second signal X2 of an atypical portion Qb that is used infrequently (or for which low sound quality is acceptable) with a voice processing device 100 realized by an information terminal such as a smartphone, and to generate the second signal X2 of an atypical portion Qb that is used frequently (or for which high quality is required) with a voice processing device 100 realized by a server device.

（３）第４実施形態では、音波を伝送媒体とする音響通信で音声処理装置１００から端末装置５０に配信情報Ｖを送信したが、音声処理装置１００から配信情報Ｖを送信するための通信方式は音響通信に限定されない。例えば、電波または赤外線等の電磁波を伝送媒体とした無線通信で音声処理装置１００から端末装置５０に配信情報Ｖを送信することも可能である。例えば、前述の各形態における放音装置１５が無線通信用の通信機器に置換される。具体的には、Bluetooth（登録商標）またはWiFi（登録商標）等の無線通信が配信情報Ｖの送信に好適である。 (3) In the fourth embodiment, the distribution information V is transmitted from the voice processing device 100 to the terminal device 50 by acoustic communication using sound waves as the transmission medium, but the communication method for transmitting the distribution information V from the voice processing device 100 is not limited to acoustic communication. For example, the distribution information V can also be transmitted from the voice processing device 100 to the terminal device 50 by wireless communication using electromagnetic waves, such as radio waves or infrared rays, as the transmission medium. In that case, the sound emitting device 15 in each of the above embodiments is replaced with a communication device for wireless communication. Specifically, wireless communication such as Bluetooth (registered trademark) or WiFi (registered trademark) is suitable for transmitting the distribution information V.

以上の例示から理解される通り、音声処理装置１００による配信情報Ｖの送信には、移動体通信網等の通信網が介在しない近距離無線通信が好適であり、音波を伝送媒体とする音響通信と電磁波を伝送媒体とする無線通信とは、近距離無線通信の例示である。なお、前述の各形態で例示した音響通信によれば、例えば遮音壁の設置により通信範囲を容易に制御できるという利点がある。 As understood from the above examples, short-range wireless communication that does not pass through a communication network such as a mobile communication network is suitable for transmitting the distribution information V from the voice processing device 100; acoustic communication using sound waves as the transmission medium and wireless communication using electromagnetic waves as the transmission medium are both examples of short-range wireless communication. The acoustic communication exemplified in each of the above embodiments has the advantage that the communication range can be controlled easily, for example by installing a sound insulation wall.

（４）前述の各形態では、関連情報Ｃの識別情報を配信情報Ｖとして例示したが、関連情報Ｃ自体を配信情報Ｖとして音声処理装置１００から送信することも可能である。関連情報Ｃを配信情報Ｖとして送信する構成では、端末装置５０に参照テーブルＴを保持する必要はない。以上の例示から理解される通り、配信情報Ｖは、関連情報Ｃを示す情報として包括的に表現される。 (4) In each of the above-described forms, the identification information of the related information C is exemplified as the distribution information V, but the related information C itself can be transmitted from the voice processing device 100 as the distribution information V. In the configuration in which the related information C is transmitted as the distribution information V, it is not necessary to hold the reference table T in the terminal device 50. As understood from the above examples, the distribution information V is comprehensively expressed as information indicating the related information C.

（５）前述の各形態では、関連情報Ｃを表示装置５４に表示したが、関連情報Ｃを端末装置５０の利用者に提示する方法は以上の例示に限定されない。例えば、関連情報Ｃが表す音声を放音装置１５により再生することで関連情報Ｃを利用者に提示することも可能である。関連情報Ｃが表す音声の生成には、例えば公知の音声合成技術が利用され得る。 (5) In each of the above-described forms, the related information C is displayed on the display device 54, but the method of presenting the related information C to the user of the terminal device 50 is not limited to the above examples. For example, it is also possible to present the related information C to the user by reproducing the sound represented by the related information C by the sound emitting device 15. For example, a known speech synthesis technique can be used to generate the speech represented by the related information C.

（６）第１実施形態において、収録信号Ｒが表す音声の発声者と、音声素片データＤb2が表す音声素片の発声者とが相違する場合がある。この場合、第１信号Ｘ1と第２信号Ｘ2とで声質が相違するから、音声信号Ｘが表す音声が聴感的に不自然な印象となる可能性がある。そこで、第１信号Ｘ1および第２信号Ｘ2の一方または双方の声質を調整することで、第１信号Ｘ1と第２信号Ｘ2との声質を近付ける（理想的には一致させる）構成が好適である。声質の調整には、公知の声質変換技術が任意に採用され得る。 (6) In the first embodiment, the speaker of the voice represented by the recorded signal R and the speaker of the speech elements represented by the speech element data Db2 may be different. In this case, since the voice quality differs between the first signal X1 and the second signal X2, the voice represented by the voice signal X may give an audibly unnatural impression. Therefore, it is preferable to adjust the voice quality of one or both of the first signal X1 and the second signal X2 so that the voice qualities of the first signal X1 and the second signal X2 are brought close to each other (ideally matched). Any known voice quality conversion technique can be adopted for adjusting the voice quality.

（７）音声合成処理に利用される音声合成データＤ（Ｄ1またはＤ2）の内容は、以上の例示に限定されない。例えば、音声の抑揚（例えば音高または音量の時間的な変化）を決定するための抑揚データを音声合成データＤに含ませてもよい。例えば、音声合成データＤ1には、第１言語の発音時の抑揚の傾向が反映された抑揚データが含まれ、音声合成データＤ2には、第２言語の発音時の抑揚の傾向が反映された抑揚データが含まれる。第１実施形態または第２実施形態において、第２処理（Ｓa2，Ｓb2）には音声合成データＤ2の抑揚データが適用される。また、第２実施形態の第１処理Ｓb1には音声合成データＤ1の抑揚データが適用される。 (7) The content of the voice synthesis data D (D1 or D2) used for the voice synthesis processing is not limited to the above examples. For example, the speech synthesis data D may include intonation data for determining the intonation of speech (for example, a change in pitch or volume over time). For example, the speech synthesis data D1 includes intonation data that reflects the tendency of intonation at the time of pronunciation of the first language, and the speech synthesis data D2 reflects the tendency of intonation at the time of pronunciation of the second language. Includes intonation data. In the first embodiment or the second embodiment, the intonation data of the voice synthesis data D2 is applied to the second processing (Sa2, Sb2). Further, the intonation data of the voice synthesis data D1 is applied to the first process Sb1 of the second embodiment.

（８）前述の各形態に係る音声処理装置１００は、各形態での例示の通り、制御装置１１とプログラムとの協働により実現される。本発明の好適な態様に係るプログラムは、コンピュータに、指定文字列のうちの第１部分を第１言語で発音した音声と、前記指定文字列のうち前記第１部分とは相違する第２部分を発音した音声とを表す音声信号を生成する音声合成処理を実行させ、音声合成処理では、前記第２部分について、前記第１言語とは相違する第２言語用の音声合成データを利用する。以上に例示したプログラムは、コンピュータが読取可能な記録媒体に格納された形態で提供されてコンピュータにインストールされ得る。記録媒体は、例えば非一過性（non-transitory）の記録媒体であり、ＣＤ-ＲＯＭ等の光学式記録媒体（光ディスク）が好例であるが、半導体記録媒体または磁気記録媒体等の公知の任意の形式の記録媒体を包含し得る。なお、非一過性の記録媒体とは、一過性の伝搬信号（transitory, propagating signal）を除く任意の記録媒体を含み、揮発性の記録媒体を除外するものではない。また、通信網を介した配信の形態でプログラムをコンピュータに提供することも可能である。 (8) As illustrated in each embodiment, the voice processing device 100 according to each of the above embodiments is realized by cooperation between the control device 11 and a program. A program according to a preferred aspect of the present invention causes a computer to execute voice synthesis processing that generates a voice signal representing a voice in which a first part of a designated character string is pronounced in a first language and a voice in which a second part of the designated character string, different from the first part, is pronounced; in the voice synthesis processing, voice synthesis data for a second language different from the first language is used for the second part. The program exemplified above can be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium, a good example being an optical recording medium (optical disc) such as a CD-ROM, but it may be any known type of recording medium, such as a semiconductor recording medium or a magnetic recording medium. Note that non-transitory recording media include any recording medium other than a transitory, propagating signal, and volatile recording media are not excluded. The program can also be provided to a computer by distribution via a communication network.

（９）以上に例示した形態から、例えば以下の構成が把握される。
＜態様１＞
本発明の好適な態様（態様１）に係る音声処理方法は、指定文字列のうちの第１部分を第１言語で発音した音声と、前記指定文字列のうち前記第１部分とは相違する第２部分を発音した音声とを表す音声信号を生成し、前記音声信号の生成においては、前記第２部分について、前記第１言語とは相違する第２言語用の音声合成データを利用した音声合成処理を実行する。以上の態様では、指定文字列のうちの第１部分を第１言語で発音した音声と、指定文字列のうち第２部分を発音した音声とを表す音声信号を生成する過程において、第２部分については第２言語用の音声合成データを利用した音声合成処理が実行される。したがって、指定文字列の全体について第１言語用の音声合成データを利用した音声合成処理を実行する場合と比較して、第２部分について音韻および抑揚が聴感的に自然である音声の音声信号を生成できる。
＜態様２＞
態様１の好適例（態様２）において、前記音声信号の生成は、前記指定文字列のうち前記第１部分に対応する音声を表す第１信号を、事前に収録された音声を表す複数の収録信号から選択する第１処理と、前記指定文字列のうち前記第２部分に対応する音声を表す第２信号を、前記第２言語用の音声合成データを利用した音声合成処理により生成する第２処理と、前記第１処理で選択した前記第１信号と前記第２処理で生成した前記第２信号とを接続することで前記音声信号を生成する接続処理とを含む。以上の態様では、指定文字列のうち第１部分に対応する音声を表す第１信号が複数の収録信号から選択される。したがって、高音質な音声で第１部分が発音された音声信号を生成できるという利点がある。
＜態様３＞
態様１の好適例（態様３）において、前記音声信号の生成は、前記指定文字列のうち前記第１部分に対応する音声を表す第１信号を、前記第１言語用の音声合成データを利用した音声合成処理により生成する第１処理と、前記指定文字列のうち前記第２部分に対応する音声を表す第２信号を、前記第２言語用の音声合成データを利用した音声合成処理により生成する第２処理と、前記第１処理で生成した前記第１信号と前記第２処理で生成した前記第２信号とを接続することで前記音声信号を生成する接続処理とを含む。以上の態様では、指定文字列のうち第１部分に対応する音声を表す第１信号が、第１言語用の音声合成データを利用した音声合成処理により生成される。したがって、第１部分の音声を事前に収録する必要がないという利点がある。
＜態様４＞
態様１の好適例（態様４）では、前記音声信号の生成において、前記第１言語用の発音規則データにより前記第１部分の発音記号を決定し、前記第１言語用の発音規則データとは相違する前記第２言語用の発音規則データにより前記第２部分の発音記号を決定し、前記第１部分および前記第２部分について決定した発音記号の音声を表す前記音声信号を生成する。以上の態様では、第１部分の発音記号が第１言語用の発音規則データにより決定され、第２部分の発音記号が第２言語用の発音規則データにより決定されて、各発音記号の音声を表す音声信号が生成される。したがって、発音記号から音声信号を生成する処理を第１部分と第２部分とで共通化できるという利点がある。
＜態様５＞
態様１から態様４の何れかの好適例（態様５）において、前記第２部分は、前記指定文字列のうち固有名詞の部分である。指定文字列のうち固有名詞の部分は一般的に使用頻度が低いから、音声を事前に収録することは困難である。指定文字列のうち固有名詞の部分を第２部分とした構成によれば、使用頻度が低い第２部分についても音声を生成できるという利点がある。
＜態様６＞
態様１から態様５の何れかの好適例（態様６）において、前記音声信号と、当該音声信号が表す音声に対応した関連情報を示す配信情報を音響成分として含む変調信号とを混合して放音装置に供給する。以上の態様では、配信情報を音響成分として含む変調信号が音声信号に混合されたうえで放音装置から再生される。すなわち、音声信号が表す音声を放音するための放音装置が、配信情報を送信するための送信機として利用される。したがって、配信情報の送信に専用される送信機が必要である構成と比較して、装置構成が簡素化されるという利点がある。
＜態様７＞
本発明の好適な態様（態様７）に係る音声処理装置は、指定文字列のうちの第１部分を第１言語で発音した音声と、前記指定文字列のうち前記第１部分とは相違する第２部分を発音した音声とを表す音声信号を生成する音声合成部を具備し、前記音声合成部は、前記第２部分について、前記第１言語とは相違する第２言語用の音声合成データを利用した音声合成処理を実行する。以上の態様では、指定文字列のうちの第１部分を第１言語で発音した音声と、指定文字列のうち第２部分を発音した音声とを表す音声信号を生成する音声合成部が、第２部分については第２言語用の音声合成データを利用した音声合成処理を実行する。したがって、指定文字列の全体について第１言語用の音声合成データを利用した音声合成処理を実行する構成と比較して、第２部分について音韻および抑揚が聴感的に自然である音声の音声信号を生成できる。
(9) From the embodiments exemplified above, the following configurations, for example, can be derived.
<Aspect 1>
The voice processing method according to a preferred aspect (Aspect 1) of the present invention generates a voice signal representing a voice in which a first part of a designated character string is pronounced in a first language and a voice in which a second part of the designated character string, different from the first part, is pronounced, and, in generating the voice signal, executes for the second part a voice synthesis process using voice synthesis data for a second language different from the first language. In the above aspect, in the process of generating a voice signal representing a voice in which the first part of the designated character string is pronounced in the first language and a voice in which the second part of the designated character string is pronounced, a voice synthesis process using the voice synthesis data for the second language is executed for the second part. Therefore, compared with the case where a voice synthesis process using voice synthesis data for the first language is executed for the entire designated character string, a voice signal of a voice whose phonemes and intonation are audibly natural can be generated for the second part.
<Aspect 2>
In a preferred example of Aspect 1 (Aspect 2), the generation of the voice signal includes a first process of selecting a first signal representing the voice corresponding to the first part of the designated character string from a plurality of recorded signals representing voices recorded in advance, a second process of generating a second signal representing the voice corresponding to the second part of the designated character string by a voice synthesis process using the voice synthesis data for the second language, and a connection process of generating the voice signal by connecting the first signal selected in the first process and the second signal generated in the second process. In the above aspect, the first signal representing the voice corresponding to the first part of the designated character string is selected from the plurality of recorded signals. Therefore, there is an advantage that a voice signal in which the first part is pronounced with high sound quality can be generated.
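
The three processes of Aspect 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiments: signals are plain lists of samples, the function name `generate_voice_signal`, the `recorded` lookup table, and the `synthesize_second_language` callback are all hypothetical, and the connection process is reduced to simple concatenation with no smoothing at the joins.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Signal = List[float]  # mono samples; a stand-in for real audio buffers


def generate_voice_signal(
    parts: Sequence[Tuple[str, str]],
    recorded: Dict[str, Signal],
    synthesize_second_language: Callable[[str], Signal],
) -> Signal:
    """Build the voice signal for a designated character string.

    parts: (kind, text) pairs in utterance order, where kind is "first"
           (first-language phrase) or "second" (second-language part).
    """
    out: Signal = []
    for kind, text in parts:
        if kind == "first":
            # First process: select a pre-recorded signal for this phrase.
            segment = recorded[text]
        elif kind == "second":
            # Second process: synthesize using second-language synthesis data.
            segment = synthesize_second_language(text)
        else:
            raise ValueError(f"unknown part kind: {kind!r}")
        # Connection process: append the segments in utterance order.
        out.extend(segment)
    return out
```

A caller would pass, for instance, an English fixed phrase as a "first" part and a Japanese place name as a "second" part, with `synthesize_second_language` wrapping whatever synthesizer holds the second-language data.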
<Aspect 3>
In a preferred example of Aspect 1 (Aspect 3), the generation of the voice signal includes a first process of generating a first signal representing the voice corresponding to the first part of the designated character string by a voice synthesis process using voice synthesis data for the first language, a second process of generating a second signal representing the voice corresponding to the second part of the designated character string by a voice synthesis process using the voice synthesis data for the second language, and a connection process of generating the voice signal by connecting the first signal generated in the first process and the second signal generated in the second process. In the above aspect, the first signal representing the voice corresponding to the first part of the designated character string is generated by the voice synthesis process using the voice synthesis data for the first language. Therefore, there is an advantage that it is not necessary to record the voice of the first part in advance.
<Aspect 4>
In a preferred example of Aspect 1 (Aspect 4), in the generation of the voice signal, the phonetic symbols of the first part are determined using pronunciation rule data for the first language, the phonetic symbols of the second part are determined using pronunciation rule data for the second language, which differs from the pronunciation rule data for the first language, and the voice signal representing the voice of the phonetic symbols determined for the first part and the second part is generated. In the above aspect, the phonetic symbols of the first part are determined using the pronunciation rule data for the first language, the phonetic symbols of the second part are determined using the pronunciation rule data for the second language, and a voice signal representing the voice of each phonetic symbol is generated. Therefore, there is an advantage that the process of generating a voice signal from phonetic symbols can be shared between the first part and the second part.
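
To make the division of labor in Aspect 4 concrete, the sketch below converts each part to phonetic symbols with its own rule table and returns one combined symbol sequence that a single, shared synthesis stage could then render. Everything here is an assumption for illustration: the tiny tables `RULES_FIRST` and `RULES_SECOND`, the symbol inventory, and the greedy longest-match lookup are hypothetical stand-ins for real pronunciation rule data.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical pronunciation rule data: text chunk -> space-separated symbols.
RULES_FIRST: Dict[str, str] = {"next": "n e k s t", "stop": "s t o p", " ": ""}
RULES_SECOND: Dict[str, str] = {"ta": "t a", "te": "t e", "ya": "j a", "ma": "m a"}
RULES_BY_LANGUAGE = {"first": RULES_FIRST, "second": RULES_SECOND}


def to_phonetic_symbols(text: str, rules: Dict[str, str]) -> List[str]:
    """Greedy longest-match conversion of text into phonetic symbols."""
    symbols: List[str] = []
    max_len = max(len(key) for key in rules)
    i = 0
    while i < len(text):
        # Try the longest chunk first, then progressively shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in rules:
                symbols.extend(rules[chunk].split())
                i += size
                break
        else:
            raise ValueError(f"no rule covers {text[i:]!r}")
    return symbols


def phonetic_sequence(
    parts: Sequence[Tuple[str, str]],
    rules_by_language: Dict[str, Dict[str, str]],
) -> List[str]:
    """Apply each part's own rule data, then merge into one symbol sequence."""
    symbols: List[str] = []
    for language, text in parts:
        symbols.extend(to_phonetic_symbols(text, rules_by_language[language]))
    return symbols
```

Because both parts end up in the same symbol sequence, only the rule lookup differs per language; the symbol-to-signal stage does not need to know which part a symbol came from.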
<Aspect 5>
In a preferred example (Aspect 5) of any of Aspects 1 to 4, the second part is a proper-noun part of the designated character string. Since the proper-noun part of a designated character string is generally used infrequently, it is difficult to record its voice in advance. With the configuration in which the proper-noun part of the designated character string is the second part, there is an advantage that a voice can be generated even for the infrequently used second part.
<Aspect 6>
In a preferred example (Aspect 6) of any of Aspects 1 to 5, the voice signal is mixed with a modulated signal that contains, as an acoustic component, distribution information indicating related information corresponding to the voice represented by the voice signal, and the mixed signal is supplied to a sound emitting device. In the above aspect, the modulated signal containing the distribution information as an acoustic component is mixed with the voice signal and then reproduced from the sound emitting device. That is, the sound emitting device for emitting the voice represented by the voice signal is also used as a transmitter for transmitting the distribution information. Therefore, there is an advantage that the device configuration is simplified compared with a configuration that requires a transmitter dedicated to transmitting the distribution information.
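
The mixing described in Aspect 6 might look roughly like the sketch below. The passage above does not specify a modulation scheme, so the binary frequency-shift keying used here, the carrier frequencies, the bit length, and the names `modulate_fsk` and `mix` are all assumptions chosen only to make the idea concrete; the tone switching is also not phase-continuous, which a practical design would likely address.

```python
import math
from typing import List


def modulate_fsk(
    payload: bytes,
    sample_rate: int = 44100,
    bit_samples: int = 441,
    f0: float = 18000.0,  # assumed tone for a 0 bit
    f1: float = 19000.0,  # assumed tone for a 1 bit
    gain: float = 0.1,
) -> List[float]:
    """Encode payload bits as high-frequency tones (most significant bit first)."""
    samples: List[float] = []
    n = 0  # running sample index across the whole payload
    for byte in payload:
        for bit_index in range(7, -1, -1):
            freq = f1 if (byte >> bit_index) & 1 else f0
            for _ in range(bit_samples):
                samples.append(gain * math.sin(2 * math.pi * freq * n / sample_rate))
                n += 1
    return samples


def mix(voice: List[float], modulated: List[float]) -> List[float]:
    """Add the two signals sample by sample, padding the shorter with silence."""
    length = max(len(voice), len(modulated))
    mixed: List[float] = []
    for i in range(length):
        v = voice[i] if i < len(voice) else 0.0
        m = modulated[i] if i < len(modulated) else 0.0
        # Clip to the nominal [-1.0, 1.0] range expected by a playback device.
        mixed.append(max(-1.0, min(1.0, v + m)))
    return mixed
```

The output of `mix` is what would be handed to the sound emitting device, so the same loudspeaker carries both the guidance voice and the distribution information.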
<Aspect 7>
The voice processing device according to a preferred aspect (Aspect 7) of the present invention includes a voice synthesis unit that generates a voice signal representing a voice in which a first part of a designated character string is pronounced in a first language and a voice in which a second part of the designated character string, different from the first part, is pronounced, and the voice synthesis unit executes, for the second part, a voice synthesis process using voice synthesis data for a second language different from the first language. In the above aspect, the voice synthesis unit, which generates a voice signal representing a voice in which the first part of the designated character string is pronounced in the first language and a voice in which the second part of the designated character string is pronounced, executes for the second part the voice synthesis process using the voice synthesis data for the second language. Therefore, compared with a configuration in which a voice synthesis process using voice synthesis data for the first language is executed for the entire designated character string, a voice signal of a voice whose phonemes and intonation are audibly natural can be generated for the second part.

１００…音声処理装置、１１…制御装置、１２…記憶装置、１３…表示装置、１４…操作装置、１５…放音装置、２２…文字列設定部、２４A，２４B，２４C…音声合成部、２６…変調処理部、２８…混合処理部、３２A，３２B…第１処理部、３４…第２処理部、３６…接続処理部、５０…端末装置、５１…制御装置、５２…記憶装置、５３…収音装置、５４…表示装置。
100 ... voice processing device, 11 ... control device, 12 ... storage device, 13 ... display device, 14 ... operation device, 15 ... sound emitting device, 22 ... character string setting unit, 24A, 24B, 24C ... voice synthesis unit, 26 ... modulation processing unit, 28 ... mixing processing unit, 32A, 32B ... first processing unit, 34 ... second processing unit, 36 ... connection processing unit, 50 ... terminal device, 51 ... control device, 52 ... storage device, 53 ... sound collecting device, 54 ... display device.

Claims

A voice processing method realized by a computer system, the method generating a voice signal representing a voice that pronounces a first part of a designated character string corresponding to a first language and a voice that pronounces a second part of the designated character string corresponding to a second language different from the first language,
wherein the generation of the voice signal includes:
a first process of selecting a first signal representing the voice corresponding to the first part of the designated character string from a plurality of recorded signals representing voices recorded in advance;
a second process of generating a second signal representing the voice corresponding to the second part of the designated character string by a voice synthesis process using voice synthesis data for the second language; and
a connection process of generating the voice signal by connecting the first signal selected in the first process and the second signal generated in the second process.

The voice processing method according to claim 1, wherein the voice signal and a modulated signal including distribution information indicating related information corresponding to the voice represented by the voice signal are mixed and supplied to a sound emitting device.

A voice processing device comprising a voice synthesis unit that generates a voice signal representing a voice that pronounces a first part of a designated character string corresponding to a first language and a voice that pronounces a second part of the designated character string corresponding to a second language different from the first language,
wherein the voice synthesis unit executes:
a first process of selecting a first signal representing the voice corresponding to the first part of the designated character string from a plurality of recorded signals representing voices recorded in advance;
a second process of generating a second signal representing the voice corresponding to the second part of the designated character string by a voice synthesis process using voice synthesis data for the second language; and
a connection process of generating the voice signal by connecting the first signal selected in the first process and the second signal generated in the second process.

A program that causes a computer to function as a voice synthesis unit that generates a voice signal representing a voice that pronounces a first part of a designated character string corresponding to a first language and a voice that pronounces a second part of the designated character string corresponding to a second language different from the first language,
wherein the voice synthesis unit executes:
a first process of selecting a first signal representing the voice corresponding to the first part of the designated character string from a plurality of recorded signals representing voices recorded in advance;
a second process of generating a second signal representing the voice corresponding to the second part of the designated character string by a voice synthesis process using voice synthesis data for the second language; and
a connection process of generating the voice signal by connecting the first signal selected in the first process and the second signal generated in the second process.