JP2009139390A - Information processing system, processing method and program

Info

Publication number: JP2009139390A
Application number: JP2007312147A
Authority: JP
Inventors: Kaneyasu Jo; 金安徐; Seiya Osada; 誠也長田; Kiyoshi Yamahata; 潔山端
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2007-12-03
Filing date: 2007-12-03
Publication date: 2009-06-25

Abstract

PROBLEM TO BE SOLVED: To provide an information processing system, processing method and program for smooth user communication.

SOLUTION: This information processing system for achieving smooth communication between users processes communications from a first speaker to a second speaker and has a second speaker feature extraction means which extracts a feature of the second speaker, and a communication processing means which processes input data from the first speaker based on the feature of the second speaker.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to an information processing system, a processing method, and a program for facilitating smooth communication between users.

Various systems have been proposed for communication between a human and a machine, and for communication between humans mediated by a machine, so that the output can adapt smoothly even when the users differ.

For example, in the field of speech recognition, a speech recognition system that uses features of the speaker's face image has been proposed (see Patent Document 1).

As shown in FIG. 30, the speech recognition system described in Patent Document 1 comprises a speech recognition input unit, a video signal input unit, a speaker-independent speech recognition unit, a speaker-dependent speech recognition unit, an image processing unit, and a recognition result integration unit. The image processing unit comprises a face region extraction unit, a face image database, and an image comparison unit. The speaker-dependent speech recognition unit comprises a speech processing unit, a speech database, and a speech recognition processing unit.

This system identifies the speaker from features of the speaker's face image and provides a speech recognition scheme that achieves a high recognition rate even for input from a plurality of specific speakers.

In addition, an information processing method has been proposed that outputs synthesized speech from a text by using the honorific (relative social standing) relationship between the sender and the receiver of the text (see Patent Document 2).

As shown in FIG. 31, the information processing method described in Patent Document 2 involves a mailbox, a command input unit, an operation management unit, an e-mail management unit, a text reading unit, a voice transmission unit, a text analysis unit, a text generation unit, an honorific relationship determination unit, and a semantic representation summarization unit.

In this system, the honorific relationship is determined using profile information of the sender and the receiver that is stored in advance.

Also, a voice response device has been invented that identifies the type of a user and selects speech recognition, dialogue control, and speech synthesis accordingly (see Patent Document 3).
As shown in FIG. 32, the voice response device described in Patent Document 3 comprises user type identification means, speech recognition means, dialogue control means, a database, and speech synthesis means.

This system identifies the user type of a single user, performs speech recognition, dialogue control, and speech synthesis using that identification information, and outputs a response voice suited to that user.
Japanese Patent Laid-Open No. 11-282492; Japanese Patent Laid-Open No. 10-149361; Japanese Patent Laid-Open No. 2004-163541

However, the first problem with the techniques described in the above patent documents is that they cannot apply communication processing suited to the characteristics of a second speaker, such as speech recognition, text conversion, machine translation, or speech synthesis, to input from a first speaker.

One reason is that Patent Documents 1 and 3 provide no means for acquiring the personal attributes of the second speaker.

Another reason is that Patent Document 2 acquires the users' personal attributes from the profiles of the sender and the receiver of the text, so the characteristics of the second speaker cannot be acquired when the second speaker's profile has not been provided in advance.

The present invention has therefore been made in view of the above problems, and its object is to provide an information processing system, a processing method, and a program for facilitating smooth communication between users.

The present invention that achieves the above object is an information processing system that processes communication from a first speaker to a second speaker, the system comprising: second speaker feature extraction means for extracting features of the second speaker; and communication processing means for performing optimization processing of input data from the first speaker based on the features of the second speaker.

The present invention that achieves the above object is also a processing method for processing communication from a first speaker to a second speaker, the method comprising: a first speaker feature extraction process of extracting features of the second speaker; and a communication process of performing communication processing of input data from the first speaker based on the features of the second speaker.

The present invention that achieves the above object is also a program for an information processing system that processes communication from a first speaker to a second speaker, the program causing the information processing system to execute: a first speaker feature extraction process of extracting features of the second speaker; and a communication process of performing optimization processing of input data from the first speaker based on the features of the second speaker.

According to the present invention, for utterances by speakers having various attributes, communication processing from the first speaker to the second speaker, such as speech recognition, text conversion, machine translation, or speech synthesis, can be performed using the speaker features of the second speaker without preparing a profile of the second speaker in advance.

<First Embodiment>
A first embodiment will be described.

In the first embodiment, a system that performs communication processing, such as translation and speech synthesis, between a first speaker who originates communication by voice, text, or the like and a second speaker who receives that communication is provided with second speaker feature extraction means for extracting feature values of the second speaker, so that the personal attributes of the second speaker are extracted in real time. The system then calculates the similarity between the speaker attributes of the second speaker extracted by the second speaker feature extraction means and the attributes of each model in the model group stored in a dictionary/model database, selects the model with the highest similarity, and applies communication processing suited to the second speaker, such as speech recognition, text conversion, machine translation, or speech synthesis, to the first speaker's voice, text, or the like.

The dictionaries and models stored in the dictionary/model database of the communication processing system of the first embodiment, namely speech recognition dictionaries/models, machine translation dictionaries/models, text conversion dictionaries/models, speech synthesis dictionaries/models, and so on, are constructed in advance based on the characteristics of the second speaker. For example, when constructing language models for speech recognition, utterances from the first speaker to the second speaker can be classified into the following types, and a language model can be built for each: "utterance to an elderly man", "utterance to an elderly woman", "utterance to a middle-aged man", "utterance to a middle-aged woman", "utterance to a male youth (青年)", "utterance to a female youth (青年)", "utterance to a young man (若年)", "utterance to a young woman (若年)", "utterance to a male child", and "utterance to a female child". Similarly, machine translation dictionaries/models, text conversion dictionaries/models, speech synthesis dictionaries/models, and the like based on the characteristics of the second speaker are constructed in advance and stored in the dictionary/model database.

When speech recognition of the first speaker is performed as the communication processing, the extracted features of the second speaker can likewise be used to select, from the dictionary/model database, an appropriate speech recognition dictionary/model for recognizing the first speaker's speech. For example, when the features of the second speaker extracted by the second speaker feature extraction means are "child, female", the attribute information "child, female" is used to select, from the speech recognition dictionary/model database, the "child, female" language model, that is, the model used when the first speaker talks to a female child. Specifically, the similarity between the second speaker's attribute information (child, female) and the attribute information of each speech recognition dictionary/model stored in the dictionary/model database is calculated, and the one with the highest similarity is selected. In this way, the speech recognition language model having the "child, female" attributes is selected and used to recognize the first speaker's speech. Similarly, when the second speaker has other attributes, the dictionary/model selection means selects the speech recognition language model that matches those attributes, and speech recognition is performed with it.
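The selection step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the description does not define how similarity is computed, so Jaccard similarity over attribute sets is an assumption, and the `Model` class, the model names, and the English attribute labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    attributes: frozenset  # attributes of the listener this model targets

def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two attribute sets (assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def select_model(listener_attrs, models):
    """Return the model whose attributes best match the second speaker."""
    target = frozenset(listener_attrs)
    return max(models, key=lambda m: similarity(target, m.attributes))

# Ten utterance types: five age groups x two genders.
# Labels map to 老人, 壮年, 青年, 若年, 子供 in the original text.
AGE_GROUPS = ["elderly", "middle-aged", "youth", "young", "child"]
GENDERS = ["male", "female"]
MODELS = [
    Model(f"lm_utterance_to_{age}_{gender}", frozenset({age, gender}))
    for age in AGE_GROUPS
    for gender in GENDERS
]

chosen = select_model({"child", "female"}, MODELS)
print(chosen.name)
```

With the attributes "child, female", only one model matches both attributes, so it is the unique maximum. A real system would need a tie-breaking rule when several models score equally.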

When machine translation of the first speaker's language is performed as the communication processing, a translation dictionary stored in the dictionary/model database is selected using the characteristics of the second speaker. For example, for English-to-Japanese translation, Japanese generation dictionaries are constructed in advance according to the attributes of the second speaker and stored in the dictionary/model database. Taking the English personal pronoun "your" as an example, the Japanese translation candidates for "your" include "あなたの" (anata no), "君の" (kimi no), and "僕の" (boku no). Accordingly, in the machine translation dictionary, the target-language entry for the source-language word "your" can be built from three blocks: a first target-language dictionary block consisting of the Japanese headword "あなたの", part-of-speech information, and information indicating the second speaker's attributes, such as "young (若年), male"; a second block consisting of the headword "君の", part-of-speech information, and information such as "youth (青年), male"; and a third block consisting of the headword "僕の", part-of-speech information, and information such as "child, male". Suppose the English sentence "What is your name?" is input to the system by the first speaker and the second speaker is estimated to be "child, male". First, English morphological analysis and syntactic analysis are performed, yielding the morphemes: the interrogative pronoun "What", the be-verb "be", the personal pronoun "your", and the noun "name". On the Japanese side, the English-Japanese translation dictionary yields "何" for "What", "だ" for "be", "名前" for "name", and the three candidates "あなたの", "君の", and "僕の" for "your". Next, using the "child, male" information, the similarity to the speaker attributes of each of the three Japanese candidates for "your" is calculated. The "僕の" block, which carries the attributes "child, male", has the highest similarity, so "your" is translated as "僕の". The morphemes for generating the Japanese sentence are therefore "何", "だ", "僕の", and "名前". Machine translation is then performed using the case frame of the Japanese independent auxiliary verb "だ" and the translation rules, and the output "僕の名前は何だ？" is produced. Similarly, when the second speaker's attributes are "youth (青年), male", the translation result is "君の名前は何ですか？", and when they are "young (若年), male", the translation result is "あなたの名前は何ですか？".
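The lexical choice for "your" can be sketched as a lookup over target-language blocks. This is a hedged illustration only: the headwords and attribute pairings follow the description (あなたの for 若年 male, 君の for 青年 male, 僕の for child male), while the `pos` value, the English attribute labels, and the overlap-count similarity are assumptions. Sentence generation with case frames is not modeled.

```python
# Hypothetical target-language blocks for the source word "your".
YOUR_BLOCKS = [
    {"headword": "あなたの", "pos": "pronoun + particle", "attrs": {"young", "male"}},  # 若年
    {"headword": "君の", "pos": "pronoun + particle", "attrs": {"youth", "male"}},      # 青年
    {"headword": "僕の", "pos": "pronoun + particle", "attrs": {"child", "male"}},      # 子供
]

def similarity(listener_attrs, block_attrs):
    """Number of shared attributes (an assumed similarity measure)."""
    return len(set(listener_attrs) & set(block_attrs))

def choose_headword(blocks, listener_attrs):
    """Pick the headword of the block most similar to the second speaker."""
    best = max(blocks, key=lambda b: similarity(listener_attrs, b["attrs"]))
    return best["headword"]

for attrs in ({"child", "male"}, {"youth", "male"}, {"young", "male"}):
    print(sorted(attrs), "->", choose_headword(YOUR_BLOCKS, attrs))
```

For a second speaker estimated as "child, male", the 僕の block shares both attributes while the other two share only "male", so 僕の is selected, matching the worked example in the text.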

When the communication processing is speech synthesis and the second speaker is elderly, the dictionary/model selection means uses the characteristics of the second speaker to select a speech synthesis dictionary/model for elderly listeners. By making the duration of the synthesized speech waveform somewhat longer and appropriately adjusting the thresholds for sound pressure level and pitch frequency, the first speaker's text, speech, or the like can be spoken slowly at a pace suited to elderly listeners, at a slightly higher volume, and with a controlled pitch.
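As a rough sketch of how listener attributes might map to synthesis settings: the description gives no numeric values, so the scale factors below are purely illustrative, and `SynthesisParams` and `params_for_listener` are hypothetical names rather than part of any synthesis library.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SynthesisParams:
    duration_scale: float = 1.0  # > 1.0 lengthens the waveform (slower speech)
    gain_db: float = 0.0         # > 0.0 raises the output level
    pitch_scale: float = 1.0     # multiplier applied to the pitch frequency

def params_for_listener(listener_attrs, base=SynthesisParams()):
    """Return synthesis settings adapted to the second speaker's attributes."""
    if "elderly" in listener_attrs:
        # Illustrative values only: slightly slower, slightly louder,
        # and a small pitch adjustment.
        return replace(
            base,
            duration_scale=base.duration_scale * 1.2,
            gain_db=base.gain_db + 3.0,
            pitch_scale=base.pitch_scale * 0.95,
        )
    return base

elderly = params_for_listener({"elderly", "male"})
default = params_for_listener({"child", "female"})
print(elderly)
print(default)
```

Keeping the settings in an immutable dataclass makes it straightforward to store one preset per speech synthesis dictionary/model in the database.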

Next, a specific configuration of the first embodiment will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of the first embodiment for carrying out the present invention.

Referring to FIG. 1, the first embodiment of the present invention comprises input means 101 for inputting data such as the first speaker's voice, image, or character information; communication processing means 100 for processing the input data; second speaker feature extraction means 105 for extracting features of the second speaker; and output means 116 for outputting information such as characters, voice, and images output from the communication processing means 100.

The communication processing means 100 performs communication processing using one or more of dictionary/model selection means, speech recognition means, text conversion means, machine translation means, speech synthesis means, and the like.

In the present embodiment, so that the communication processing means 100 can perform its various kinds of processing, the various dictionaries/models are constructed in advance and stored in the dictionary/model database, as described above.

For example, when constructing language models for speech recognition, utterances from the first speaker are classified into the following types, and a language model is built for each: "utterance to an elderly man", "utterance to an elderly woman", "utterance to a middle-aged man", "utterance to a middle-aged woman", "utterance to a male youth (青年)", "utterance to a female youth (青年)", "utterance to a young man (若年)", "utterance to a young woman (若年)", "utterance to a male child", and "utterance to a female child". Similarly, machine translation dictionaries/models, text conversion dictionaries/models, speech synthesis dictionaries/models, and the like based on the speaker features of the second speaker are constructed in advance and stored in the dictionary/model database.

The second speaker feature extraction means 105 extracts the features of the second speaker from the second speaker's voice data, face image, fingerprint, or other biometric information. The features can also be obtained when a profile of the second speaker is provided, or by using a sensor, an IC card capable of communicating the speaker's personal attributes, a terminal device incorporating such an IC card, or the like. For example, attribute information such as age and gender can be extracted from the second speaker's voice data or from image data that includes a face image.
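One way to combine these sources is sketched below. The description lists the possible sources but not how they are prioritized, so the order used here (explicit profile, then IC card, then biometric estimators) is an assumption, and `extract_listener_features` and the estimator callables are hypothetical stand-ins for real voice or face analysis.

```python
from typing import Callable, Iterable, Optional

def extract_listener_features(
    profile: Optional[dict] = None,
    ic_card: Optional[dict] = None,
    estimators: Iterable[Callable[[], Optional[dict]]] = (),
) -> dict:
    """Collect second-speaker attributes from whichever source is available."""
    if profile:
        return dict(profile)   # a supplied profile is used as-is
    if ic_card:
        return dict(ic_card)   # attributes read from an IC card
    merged: dict = {}
    for estimate in estimators:
        # Each estimator stands in for voice, face, or other biometric analysis.
        for key, value in (estimate() or {}).items():
            merged.setdefault(key, value)  # the first estimator to answer wins
    return merged

features = extract_listener_features(
    estimators=[
        lambda: {"gender": "female"},                       # e.g. from voice
        lambda: {"age_group": "child", "gender": "male"},   # e.g. from face image
    ]
)
print(features)
```

Because the biometric path needs no stored profile, this fallback is what lets the system operate when the second speaker is previously unknown.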

Next, the operation of the first embodiment for carrying out the present invention will be described in detail with reference to FIG. 1 and FIG. 22.

Data is input to the system through the input means 101 (step 2201).

Next, the second speaker feature extraction means 105 extracts speaker features indicating the personal attributes of the second speaker (step 2202).

Next, using the features of the second speaker extracted by the second speaker feature extraction means 105, the dictionary/model selection means selects the dictionary/model to be used for the communication processing (step 2203). For example, when the input is a sentence and machine translation is performed as the communication processing, the dictionary/model selection means selects a translation dictionary/model suited to the speaker features of the second speaker. When the input is speech and speech recognition is performed as the communication processing, the dictionary/model selection means selects a speech recognition dictionary/model suited to the speaker features of the second speaker. The dictionary/model selection means calculates the similarity between the features of the second speaker extracted by the second speaker feature extraction means 105 and the attribute information of each dictionary/model stored in the dictionary/model database, and selects the one with the highest similarity.

Next, the communication processing means 100 performs communication processing using the dictionary/model selected in the preceding step (step 2204). Speech recognition, text conversion, machine translation, speech synthesis, and the like can be carried out as the communication processing.

Then, the result of the processing by the communication processing means 100 is output through the output device (step 2205).
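Steps 2201 to 2205 can be sketched end to end as follows. This is an illustrative outline under stated assumptions: `run_communication` is a hypothetical function, each model is reduced to a dictionary holding an attribute set and a processing callable, and similarity is taken to be the number of shared attributes.

```python
def run_communication(input_data, extract_listener_attrs, models, emit):
    """Run one pass of the flow in steps 2201-2205."""
    # Step 2201: input_data has already been received from the input means.
    # Step 2202: extract the second speaker's attributes.
    listener_attrs = set(extract_listener_attrs())
    # Step 2203: select the dictionary/model with the highest attribute overlap.
    model = max(models, key=lambda m: len(listener_attrs & m["attrs"]))
    # Step 2204: perform the communication processing with the selected model.
    result = model["process"](input_data)
    # Step 2205: pass the result to the output means.
    emit(result)
    return result

# Stub models: real ones would wrap recognition, translation, or synthesis.
models = [
    {"attrs": {"child", "male"}, "process": lambda text: f"[child-male model] {text}"},
    {"attrs": {"elderly", "female"}, "process": lambda text: f"[elderly-female model] {text}"},
]

outputs = []
run_communication("What is your name?", lambda: {"child", "male"}, models, outputs.append)
print(outputs[0])
```

Passing the extractor and the output sink as callables keeps the flow independent of which sensors and output devices are actually attached.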

The effects of the first embodiment for carrying out the present invention will now be described.

In the first embodiment of the present invention, providing the second speaker feature extraction means makes it possible to perform communication processing that uses the speaker features of the second speaker without preparing a profile of the second speaker in advance.

Specifically, when speech recognition is performed as the communication processing, a speech recognition method suited to the characteristics of the second speaker is provided, which has the effect of improving speech recognition accuracy.

For example, when the speaker features of the second speaker extracted by the second speaker feature extraction means are "child, female", the attribute information "child, female" is used to select, from the speech recognition dictionary/model database, the speech recognition language model having the second-speaker features "child, female" for utterances from the first speaker. Specifically, the similarity between the second speaker's attribute information (child, female) and the attribute information of each speech recognition dictionary/model stored in the dictionary/model database is calculated, and the one with the highest similarity is selected. In this way, the speech recognition language model having the "child, female" attributes is selected and used to recognize the first speaker's utterance. Similarly, when the second speaker has other attributes, the dictionary/model selection means selects the speech recognition language model that matches those attributes, and speech recognition is performed with it.

When machine translation is performed as the communication processing, selecting a translation dictionary stored in the dictionary/model database using the characteristics of the second speaker provides a machine translation technique suited to the second speaker and has the effect of enabling diverse translation results.

For example, for English-to-Japanese translation, Japanese generation dictionaries are constructed in advance according to the attributes of the second speaker and stored in the dictionary/model database. Taking the English personal pronoun "your" as an example, the Japanese translation candidates for "your" can include "あなたの" (anata no), "君の" (kimi no), and "僕の" (boku no). Accordingly, in the machine translation dictionary, the target-language entry for the source-language word "your" is built from three blocks: a first target-language dictionary block consisting of the Japanese headword "あなたの", part-of-speech information, and information indicating the second speaker's attributes, such as "young (若年), male"; a second block consisting of the headword "君の", part-of-speech information, and information such as "youth (青年), male"; and a third block consisting of the headword "僕の", part-of-speech information, and information such as "child, male". Suppose the English sentence "What is your name?" is input to the system by the first speaker and the second speaker is estimated to be "child, male". Machine translation first performs English morphological analysis and syntactic analysis, yielding the morphemes: the interrogative pronoun "What", the be-verb "be", the personal pronoun "your", and the noun "name". On the Japanese side, the English-Japanese translation dictionary yields "何" for "What", "だ" for "be", "名前" for "name", and the three candidates "あなたの", "君の", and "僕の" for "your". Next, using the "child, male" information, the similarity to the speaker attributes of each of the three Japanese candidates for "your" is calculated. The "僕の" block, which carries the attributes "child, male", has the highest similarity, so "your" is translated as "僕の". The morphemes for generating the Japanese sentence are therefore "何", "だ", "僕の", and "名前". Machine translation is performed using the case frame of the Japanese independent auxiliary verb "だ" and the translation rules, and the output "僕の名前は何だ？" is produced. Similarly, when the second speaker's attributes are "youth (青年), male", the translation result is "君の名前は何ですか？", and when they are "young (若年), male", the translation result is "あなたの名前は何ですか？". Providing a translation method that generates such results makes diverse translation results possible.

When speech synthesis is performed as the communication processing, the embodiment has the effect of making the synthesized speech more practical.

For example, when the second speaker is elderly, the dictionary/model selection means selects a speech synthesis dictionary/model using the second speaker's features, the duration of the synthesized speech waveform is made somewhat longer, and the thresholds for sound pressure level and pitch frequency are adjusted appropriately. This provides a speech synthesis method that speaks slowly at a pace suited to elderly listeners, at a slightly higher volume, and with a controlled pitch, which makes the synthesized speech more practical.

<Second Embodiment>
In the second embodiment, in addition to the configuration of the first embodiment, first speaker feature extraction means for extracting features of the first speaker is further provided, so that the personal attributes of both speakers are acquired in real time. Using the first speaker's attributes and the second speaker's attributes, the dictionary/model selection means calculates the similarity to the attributes of each model in the dictionary/model database prepared in advance, selects the one with the highest similarity, and communication processing such as speech recognition, text conversion, machine translation, or speech synthesis is performed.

第２の実施の形態における、第一の話者の話者特徴と第二の話者の話者特徴とを共に用いたコミュニケーション処理システムにおいては、辞書・モデルデータベースに格納された辞書・モデルでは、音声認識用辞書、機械翻訳用辞書、テキスト変換用辞書、音声合成用辞書等の各種の辞書を予め第一の話者の話者特徴と第二の話者の話者特徴とを共に考慮して構築する。例えば、音声認識用データベースの構築は、予め第一の話者の話者特徴と第二話者の話者特徴との組み合わせにより、第一話者からの発話内容や発話タイプを両話者の特徴により反映することが可能な内容で、種々の音声認識用辞書・モデルを構築して辞書・モデルデータベースに格納する。例えば、子供男性への発話タイプを、「老人男性から子供男性への発話」タイプと、「老人女性から子供男性への発話」タイプと、「青年男性から子供男性への発話」タイプと、「青年女性から子供男性への発話」タイプと、「若年男性から子供男性への発話」タイプと、「若年女性から子供男性への発話」タイプとの種類に分類することができる。同様に、子供女性への発話タイプを、「老人男性から子供女性への発話」タイプと、「老人女性から子供女性への発話」タイプと、「青年男性から子供女性への発話」タイプと、「青年女性から子供女性への発話」タイプと、「若年男性から子供女性への発話」タイプと、「若年女性から子供女性への発話」タイプとの種類に分類することができる。このようにして分類した発話タイプを示す言語を用いて、「子供への発話」における種々の音声認識用辞書・モデルを構築して辞書・モデルデータベースに格納することができる。同様に、機械翻訳、テキスト変換、音声合成等の辞書・モデルも予め構築して辞書・モデルデータベースに格納する。 In the communication processing system of the second embodiment, which uses the speaker features of both the first speaker and the second speaker, the dictionaries / models stored in the dictionary / model database (dictionaries for speech recognition, machine translation, text conversion, speech synthesis, and so on) are built in advance with the features of both speakers taken into account. For example, to build the speech recognition database, various speech recognition dictionaries / models are constructed in advance for each combination of first-speaker and second-speaker features, so that the content and type of the first speaker's utterances reflect the features of both speakers, and are stored in the dictionary / model database. For example, utterances to a male child can be classified into the following types: “elderly male to male child”, “elderly female to male child”, “young adult male to male child”, “young adult female to male child”, “adolescent male to male child”, and “adolescent female to male child”. Similarly, utterances to a female child can be classified into the types “elderly male to female child”, “elderly female to female child”, “young adult male to female child”, “young adult female to female child”, “adolescent male to female child”, and “adolescent female to female child”. Using language that represents the utterance types classified in this way, various speech recognition dictionaries / models for “utterances to children” can be constructed and stored in the dictionary / model database. Dictionaries / models for machine translation, text conversion, speech synthesis, and so on are likewise constructed in advance and stored in the dictionary / model database.

また、第２の実施の形態において、コミュニケーション処理として音声認識を行うよう実施した場合は、第一の話者の特徴と第二話者の特徴を同時に用いて、辞書・モデルデータベースに格納された音声認識用辞書・モデルを選択する。例えば、第一の話者が青年女性で、第二の話者が子供男性の時、辞書・モデル選択手段により、第一の話者の話者特徴を示す「青年、女性」と第二の話者の話者特徴を示す「子供、男性」を用いて、辞書・モデルデータベースに格納された辞書・モデルとの類似度を算出し、類似度が最大となるものを選択する。具体的に第一話者からの発話に対する音声認識処理においては、音響モデルの選出は第一の話者の話者属性を用いて辞書・データベースに格納されている音声認識用音響モデルが持つ話者の属性との類似度を算出し、類似度が最大となるものを選択する。言語モデルの選出は第一の話者の話者特徴と第二の話者の話者特徴とを用いて話者関係判定手段により得られた話者関係と両話者の特徴とを同時に用いて、辞書・データベースに格納されている音声認識用言語モデルの属性情報との類似度を算出し、類似度が最大となるものを選択する。このようにして、第一言語の話者が「青年、女性」で、なおかつ、第二の話者が「子供、男性」との属性を持つ音声認識用辞書・モデルを選択されて音声認識を行うことが可能となる。同様に、第一の話者の話者特徴と第二の話者の話者特徴が別の特徴であっても、前記のように音声認識用辞書・モデルを選択して、その両話者の特徴に適した音声認識を行うことが可能となる。 In the second embodiment, when speech recognition is performed as the communication processing, the features of the first speaker and the second speaker are used together to select a speech recognition dictionary / model stored in the dictionary / model database. For example, when the first speaker is a young adult female and the second speaker is a male child, the dictionary / model selection means uses “young adult, female”, the features of the first speaker, and “child, male”, the features of the second speaker, to calculate the similarity to the dictionaries / models stored in the dictionary / model database, and selects the one with the maximum similarity. Specifically, in speech recognition of an utterance from the first speaker, the acoustic model is chosen by using the first speaker's attributes to calculate the similarity to the speaker attributes of each speech recognition acoustic model stored in the dictionary / database, and selecting the one with the maximum similarity. The language model is chosen by using the speaker relationship, which the speaker relationship determination means obtains from the features of the first and second speakers, together with the features of both speakers, to calculate the similarity to the attribute information of each speech recognition language model stored in the dictionary / database, and selecting the one with the maximum similarity. In this way, a speech recognition dictionary / model whose attributes are “young adult, female” for the first-language speaker and “child, male” for the second speaker is selected, and speech recognition can be performed. Similarly, when the first and second speakers have other features, a speech recognition dictionary / model is selected as described above, and speech recognition suited to the features of both speakers can be performed.

コミュニケーション処理手段として機械翻訳を行うよう実施した場合、形態素解析、構文解析、目的言語を生成する目的言語生成処理等が含まれる。目的言語生成処理では、第一の話者の話者特徴と第二話者の話者特徴を同時に用いて、辞書・モデル選択手段により、辞書・モデルデータベースに格納された翻訳辞書を選択する。 When machine translation is performed as communication processing means, morphological analysis, syntax analysis, target language generation processing for generating a target language, and the like are included. In the target language generation process, the dictionary / model selection means selects the translation dictionary stored in the dictionary / model database by using the speaker characteristics of the first speaker and the speaker characteristics of the second speaker at the same time.

説明するために、仮に中日翻訳辞書が予め構築されている。その翻訳辞書の中国語側では、適用できる第一話者の特徴を付与されている。日本語側では、第二の話者の話者特徴に応じて訳語が付与されている。 For the purpose of explanation, assume that a Chinese-Japanese translation dictionary has been constructed in advance. On the Chinese side of the dictionary, each entry is annotated with the first-speaker features to which it applies. On the Japanese side, translations are assigned according to the speaker features of the second speaker.

中国語人称代名詞「晩生」を例として説明すると、「晩生」は一般的に謙譲的な言い方で、聞き手より年下の人に、男性でも女性でも使用できる。そのため、原言語である「晩生」のブロックに「原言語話者特徴（年下）」との属性情報を付与する。「晩生」と対応する日本語の訳語候補は、「わたくし」、「わたし」、「僕」等となるため、構築できる機械翻訳用辞書は原言語側では、「晩生」と「年下」から構成される原言語ブロックで、生成側では、「わたくし」と、品詞情報と、目的言語話者特徴（年上、女性）から構成する一番目の目的言語生成ブロックと、「わたし」と、品詞情報と、目的言語話者特徴（年上、男性）から構成する二番目の目的言語生成ブロックと、「僕」と、品詞情報と、目的言語話者特徴（年下、男性）から構成する三番目の目的言語生成ブロックで辞書を構築することができる。 Taking the Chinese personal pronoun 「晩生」 as an example: 「晩生」 is generally a humble form that can be used by a speaker younger than the listener, whether male or female. Attribute information “source language speaker feature (younger)” is therefore attached to the source-language block for 「晩生」. The Japanese translation candidates for 「晩生」 include 「わたくし」 (watakushi), 「わたし」 (watashi), and 「僕」 (boku). The machine translation dictionary can therefore be built with, on the source side, a source-language block consisting of 「晩生」 and “younger”, and, on the generation side, a first target language generation block consisting of 「わたくし」, part-of-speech information, and the target language speaker feature (older, female); a second target language generation block consisting of 「わたし」, part-of-speech information, and the target language speaker feature (older, male); and a third target language generation block consisting of 「僕」, part-of-speech information, and the target language speaker feature (younger, male).
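
The dictionary entry described above can be pictured as a small data structure. The following is only a sketch under assumptions: the class and field names (`SourceEntry`, `TargetBlock`, `listener_relation`, and so on) are hypothetical, and the patent does not specify a storage format.

```python
from dataclasses import dataclass, field


@dataclass
class TargetBlock:
    """One target-language generation block."""
    word: str
    part_of_speech: str
    listener_relation: str  # target language speaker feature: "older" or "younger"
    listener_sex: str       # target language speaker feature: "male" or "female"


@dataclass
class SourceEntry:
    """A source-language block plus its candidate target blocks."""
    word: str
    speaker_relation: str   # source language speaker feature, e.g. "younger"
    targets: list = field(default_factory=list)


entry = SourceEntry(
    word="晩生",
    speaker_relation="younger",
    targets=[
        TargetBlock("わたくし", "pronoun", "older", "female"),  # first block
        TargetBlock("わたし", "pronoun", "older", "male"),      # second block
        TargetBlock("僕", "pronoun", "younger", "male"),        # third block
    ],
)
print([t.word for t in entry.targets])
```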

次に、「図３３の文章１」を翻訳例として説明する。例えば、第一の話者が青年男性で、第二の話者が「老人、男性」の時、この際の機械翻訳は、まず、中国語の形態素解析と構文解析を行い、その結果、形態素要素代名詞「晩生」、アスペクト助字「在」、動詞「図３３の単語１」、名詞「図３３の単語２」とをから構成された構文情報が得られる。一方、日本語側では、中日翻訳辞書から、代名詞「晩生」から「わたくし」、「わたし」と「僕」を、アスペクト助字「在」から「ている」を、動詞「図３３の単語１」から「読む」を、名詞「図３３の単語２」から「本」を読み込む。 Next, “sentence 1 in FIG. 33” is used as a translation example. For example, when the first speaker is a young adult male and the second speaker is “elderly, male”, machine translation first performs morphological analysis and syntactic analysis of the Chinese, which yields syntax information composed of the following morphemes: the pronoun 「晩生」, the aspect particle 「在」, the verb “word 1 in FIG. 33”, and the noun “word 2 in FIG. 33”. On the Japanese side, the following are read from the Chinese-Japanese translation dictionary: 「わたくし」 (watakushi), 「わたし」 (watashi), and 「僕」 (boku) for the pronoun 「晩生」; 「ている」 (-te iru) for the aspect particle 「在」; 「読む」 (yomu, “read”) for the verb “word 1 in FIG. 33”; and 「本」 (hon, “book”) for the noun “word 2 in FIG. 33”.

次に、前記の第一の話者の話者特徴「青年、男性」と第二の話者の話者特徴「老人、男性」の情報を用いて、話者関係判定手段により両話者の年齢を比較して、原言語話者特徴（年下）と目的言語話者特徴（年上）との結果が得られる。さらに、第一の話者の話者特徴量「青年、男性、原言語話者特徴（年下）」と第二の話者の話者特徴量「老人、男性、目的言語話者特徴（年上）」とを用いて、翻訳辞書が持つ属性情報との類似度を算出して、類似度が最大のものを選択する。その結果、「晩生」の三つの訳語候補の中から、「わたし」の訳語を選択することになる。 Next, using the first speaker's features “young adult, male” and the second speaker's features “elderly, male”, the speaker relationship determination means compares the ages of the two speakers and obtains the result: source language speaker feature (younger) and target language speaker feature (older). Then, using the first speaker's feature values “young adult, male, source language speaker feature (younger)” and the second speaker's feature values “elderly, male, target language speaker feature (older)”, the similarity to the attribute information held in the translation dictionary is calculated, and the entry with the maximum similarity is selected. As a result, 「わたし」 (watashi) is selected from the three translation candidates for 「晩生」.
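
The age comparison and candidate selection just described can be sketched as below. Several details are assumptions made for illustration and are not taken from the patent: the ordering in `AGE_RANK`, the weights (the age relation counts twice as much as sex), and all function names.

```python
# Assumed ordering of the age categories, youngest to oldest.
AGE_RANK = {"child": 0, "adolescent": 1, "young adult": 2, "elderly": 3}

# Candidate translations of 「晩生」 with the listener features they fit.
CANDIDATES = [
    ("わたくし", {"relation": "older", "sex": "female"}),
    ("わたし", {"relation": "older", "sex": "male"}),
    ("僕", {"relation": "younger", "sex": "male"}),
]

WEIGHTS = {"relation": 2, "sex": 1}  # assumed weights, not from the patent


def speaker_relation(first_age: str, second_age: str) -> tuple:
    """Return (source speaker feature, target speaker feature)."""
    a, b = AGE_RANK[first_age], AGE_RANK[second_age]
    if a < b:
        return ("younger", "older")
    if a > b:
        return ("older", "younger")
    return ("same", "same")


def choose_translation(first: dict, second: dict) -> str:
    _, target_relation = speaker_relation(first["age"], second["age"])
    listener = {"relation": target_relation, "sex": second["sex"]}

    def score(candidate):
        # Weighted count of listener features matched by this candidate.
        return sum(WEIGHTS[k] for k, v in listener.items()
                   if candidate[1][k] == v)

    return max(CANDIDATES, key=score)[0]


speaker = {"age": "young adult", "sex": "male"}
print(choose_translation(speaker, {"age": "elderly", "sex": "male"}))
```

With these assumptions, an elderly male listener selects 「わたし」, an elderly female listener selects 「わたくし」, and a child listener selects 「僕」, matching the examples in the text.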

続いて、前記の形態素情報、格フレームを持つ構文情報と翻訳規則を用いて機械翻訳を行い、「わたしは本を読んでいます」との翻訳結果を生成することになる。同様に、第一の話者は青年男性で、第二の話者は老人女性の際に、前記翻訳例の出力は「わたくしは本を読んでいます」との翻訳結果を生成することになる。第一の話者は青年男性で、第二の話者は子供の際に、翻訳結果は「僕は本を読んでいます」との翻訳結果を生成することになる。 Machine translation is then performed using the morpheme information, the syntax information with case frames, and the translation rules, producing the translation result 「わたしは本を読んでいます」 (“I am reading a book”, with watashi). Similarly, when the first speaker is a young adult male and the second speaker is an elderly female, the translation example yields 「わたくしは本を読んでいます」 (with watakushi). When the first speaker is a young adult male and the second speaker is a child, the result is 「僕は本を読んでいます」 (with boku).

また、コミュニケーション処理手段として音声合成を行うよう実施した場合、第一の話者の話者特徴と第二の話者の特徴を同時に用いて、話者関係を判定する。特に、第二言語側の話者が老人の場合、辞書・モデル選択手段により、第二の話者特徴を用いて音声合成用辞書・モデルを選択して、合成音声波形の持続時間をやや長くし、音圧レベルとピッチ周波数の閾値を適切に調整することにより、高齢者に適した速度でゆっくり喋らせ、音量をやや大きくして、声の高さをコントロールするができる。 When speech synthesis is performed as the communication processing, the speaker relationship is determined using the features of the first and second speakers together. In particular, when the second-language speaker is elderly, the dictionary / model selection means uses the second speaker's features to select a dictionary / model for speech synthesis, slightly lengthens the duration of the synthesized speech waveform, and appropriately adjusts the thresholds of the sound pressure level and the pitch frequency, so that the synthesized voice speaks slowly at a speed suited to elderly listeners, at a slightly higher volume and with a controlled pitch.
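
The adjustment for an elderly listener could be expressed as a parameter override, as in the sketch below. The patent gives no numeric values, so the numbers, the direction of the pitch change, and the names `SynthesisParams` and `params_for_listener` are illustrative assumptions only.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SynthesisParams:
    duration_scale: float = 1.0  # values above 1.0 lengthen the waveform (slower speech)
    volume_gain_db: float = 0.0  # positive values raise the sound pressure level
    pitch_scale: float = 1.0     # multiplier applied to the pitch frequency


def params_for_listener(listener_age: str,
                        base: SynthesisParams = SynthesisParams()) -> SynthesisParams:
    if listener_age == "elderly":
        # Hypothetical values: slightly slower, slightly louder, slightly lower pitch.
        return replace(base, duration_scale=1.2, volume_gain_db=3.0, pitch_scale=0.95)
    return base


print(params_for_listener("elderly"))
```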

次に、第２の実施の形態における具体的な構成について図面を参照して詳細に説明する。 Next, a specific configuration in the second embodiment will be described in detail with reference to the drawings.

図2は、第２の実施の形態の構成を示すブロック図である。 FIG. 2 is a block diagram showing a configuration of the second embodiment.

図2を参照すると、第２の実施の形態は、図1に示された第１の実施の形態の構成と同様に、入力手段101と、第二の話者特徴抽出手段105と、前記出力手段116を備えている。その他に、第一の話者の特徴を抽出する第一の話者の特徴抽出手段104と、コミュニケーション処理手段200とを備えている。 Referring to FIG. 2, the second embodiment, like the first embodiment shown in FIG. 1, includes the input means 101, the second speaker feature extraction means 105, and the output means 116. In addition, it includes first speaker feature extraction means 104, which extracts features of the first speaker, and communication processing means 200.

コミュニケーション処理手段200では、話者関係判定手段、辞書・モデル選択手段、音声認識手段、テキスト変換手段、機械翻訳手段、音声合成手段等の中から、一つまたは二つ以上の構成でコミュニケーション処理を実施する。 The communication processing means 200 performs communication processing using one or more of the following: speaker relationship determination means, dictionary / model selection means, speech recognition means, text conversion means, machine translation means, speech synthesis means, and the like.

尚、話者関係判定手段は、第一の話者の特徴抽出手段104及び第二の話者特徴抽出手段105により抽出された第一及び第二の話者の特徴に基づいて、第一の話者と第二の話者との関係を判定するものである。 Note that the speaker relationship determination means is based on the first and second speaker features extracted by the first speaker feature extraction means 104 and the second speaker feature extraction means 105. The relationship between the speaker and the second speaker is determined.

本実施の形態においては、コミュニケーション処理手段200の各種の処理を行うための各種の辞書・モデルを予め構築して辞書・モデルデータベースに格納する。例えば、子供男性への発話タイプを、「老人男性から子供男性への発話」タイプと、「老人女性から子供男性への発話」タイプと、「青年男性から子供男性への発話」タイプと、「青年女性から子供男性への発話」タイプと、「若年男性から子供男性への発話」タイプと、「若年女性から子供男性への発話」タイプとの種類に分類することができる。同様に、子供女性への発話タイプを、「老人男性から子供女性への発話」タイプと、「老人女性から子供女性への発話」タイプと、「青年男性から子供女性への発話」タイプと、「青年女性から子供女性への発話」タイプと、「若年男性から子供女性への発話」タイプと、「若年女性から子供女性への発話」タイプとの種類に分類することができる。このようにして分類した発話タイプを示す言語を用いて、「子供への発話」における種々の音声認識用辞書・モデルを構築して辞書・モデルデータベースに格納する。同様に、機械翻訳、テキスト変換、音声合成等の辞書・モデルも予め構築して辞書・モデルデータベースに格納する。 In the present embodiment, the various dictionaries / models used for the processing of the communication processing means 200 are constructed in advance and stored in the dictionary / model database. For example, utterances to a male child can be classified into the following types: “elderly male to male child”, “elderly female to male child”, “young adult male to male child”, “young adult female to male child”, “adolescent male to male child”, and “adolescent female to male child”. Similarly, utterances to a female child can be classified into the types “elderly male to female child”, “elderly female to female child”, “young adult male to female child”, “young adult female to female child”, “adolescent male to female child”, and “adolescent female to female child”. Using language that represents the utterance types classified in this way, various speech recognition dictionaries / models for “utterances to children” are constructed and stored in the dictionary / model database. Dictionaries / models for machine translation, text conversion, speech synthesis, and so on are likewise constructed in advance and stored in the dictionary / model database.

次に、図2と図23を参照して本発明を実施するための第２の実施の形態の動作について詳細に説明する。 Next, the operation of the second embodiment for carrying out the present invention will be described in detail with reference to FIG. 2 and FIG. 23.

入力手段101を通じてシステムにデータを入力する（ステップ2301）。 Data is input to the system through the input means 101 (step 2301).

次に、第一の話者特徴抽出手段104により第一の話者の個人属性を示す話者特徴を抽出し、第二の話者特徴抽出手段105により第二の話者の個人属性を示す話者特徴を抽出する（ステップ2302）。 Next, the first speaker feature extraction means 104 extracts speaker features indicating the personal attributes of the first speaker, and the second speaker feature extraction means 105 extracts speaker features indicating the personal attributes of the second speaker (step 2302).

第一の話者特徴抽出手段104から抽出された第一話者の話者特徴と第二の話者特徴抽出手段105から抽出された第二の話者の話者特徴とを用いて、話者関係判定手段により話者の関係を判定する。例えば、両話者の年齢を比較して「年上」と「年下」との判定結果が得られる（ステップ2303）。 Using the first speaker's features extracted by the first speaker feature extraction means 104 and the second speaker's features extracted by the second speaker feature extraction means 105, the speaker relationship determination means determines the relationship between the speakers. For example, comparing the ages of the two speakers yields the determination results “older” and “younger” (step 2303).

前記話者関係判定手段により得られた両話者の話者関係情報を用いて、辞書・モデル選択手段により、コミュニケーション処理用の辞書・モデルの選択処理を行う（ステップ2304）。 Using the speaker relationship information obtained by the speaker relationship determination means, the dictionary / model selection means performs a dictionary / model selection process for communication processing (step 2304).

例えば、入力は文で、コミュニケーション処理として機械翻訳を行うよう実施した場合、辞書・モデル選択手段により、両話者の話者特徴に適した翻訳辞書・モデルを選択する。入力は音声で、コミュニケーション処理として音声認識を行うよう実施した場合、辞書・モデル選択手段により、両話者の話者特徴に適した音声認識用辞書・モデルを選択する。辞書・モデル選択手段では、前記話者特徴抽出手段104と前記話者特徴抽出手段105から抽出した両話者の属性情報と、話者関係判定手段により得られた話者関係情報を用いて、辞書・モデルデータベースに格納された各種の辞書・モデルが持つ属性情報との類似度を算出して類似度が最大となるものを選択することにより辞書・モデルの選択処理を行う。 For example, when the input is a sentence and machine translation is performed as the communication processing, the dictionary / model selection means selects a translation dictionary / model suited to the speaker features of both speakers. When the input is speech and speech recognition is performed as the communication processing, the dictionary / model selection means selects a speech recognition dictionary / model suited to the speaker features of both speakers. The dictionary / model selection means performs this selection by using the attribute information of both speakers extracted by the speaker feature extraction means 104 and the speaker feature extraction means 105, together with the speaker relationship information obtained by the speaker relationship determination means, to calculate the similarity to the attribute information of each dictionary / model stored in the dictionary / model database, and selecting the one with the maximum similarity.

次に、前記のステップで選択された辞書・モデルを用いて、コミュニケーション処理手段200によりコミュニケーション処理を行う（ステップ2305）。コミュニケーション処理方式として、音声認識、テキスト変換、機械翻訳、音声合成等を実施することができる。 Next, communication processing is performed by the communication processing means 200 using the dictionary / model selected in the above step (step 2305). As a communication processing method, speech recognition, text conversion, machine translation, speech synthesis, and the like can be performed.

そして、前記コミュニケーション処理手段200での処理結果を、出力装置を通じて出力する（ステップ2306）。 Then, the processing result of the communication processing means 200 is output through the output device (step 2306).
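
Steps 2301 to 2306 form a linear pipeline, which can be sketched as follows. The function `run_pipeline` and the stub components are hypothetical; in particular, using numeric ages and taking both speakers' data from one input record are simplifying assumptions made for this example.

```python
def run_pipeline(data, extract_first, extract_second, judge_relation,
                 select_model, process, output):
    """Run steps 2301 to 2306 in order; every component is supplied by the caller."""
    # Step 2301: `data` is whatever the input means delivered.
    first = extract_first(data)                    # step 2302
    second = extract_second(data)                  # step 2302
    relation = judge_relation(first, second)       # step 2303
    model = select_model(first, second, relation)  # step 2304
    result = process(data, model)                  # step 2305
    return output(result)                          # step 2306


# Stub components, for illustration only.
demo = run_pipeline(
    {"text": "hello", "first_age": 30, "second_age": 8},
    extract_first=lambda d: {"age": d["first_age"]},
    extract_second=lambda d: {"age": d["second_age"]},
    judge_relation=lambda a, b: "older" if a["age"] > b["age"] else "younger",
    select_model=lambda a, b, rel: f"model[{rel}]",
    process=lambda d, m: f"{m}:{d['text']}",
    output=lambda r: r,
)
print(demo)
```

Passing the components as callables keeps the step order fixed while allowing speech recognition, text conversion, machine translation, or speech synthesis to be plugged in as the processing stage.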

第２の実施の形態の効果について説明する。 The effect of the second embodiment will be described.

第２の実施の形態では、第一の話者特徴抽出手段及び第二の話者特徴抽出手段を設けることにより、両話者のプロファイルを予め用意しなくても、両話者の話者特徴を共に利用したコミュニケーション処理が可能となる。 In the second embodiment, providing the first speaker feature extraction means and the second speaker feature extraction means makes possible communication processing that uses the speaker features of both speakers, without preparing profiles of the two speakers in advance.

具体的には、コミュニケーション処理として音声認識を行うよう実施した場合、両話者の特徴を共に適した音声認識を行う手法を提供することが可能となる効果を有する。 Specifically, when speech recognition is performed as communication processing, there is an effect that it is possible to provide a method for performing speech recognition suitable for the characteristics of both speakers.

例えば、第一の話者が青年女性で、第二の話者が子供男性の時、辞書・モデル選択手段により、第一の話者の話者特徴を示す「青年、女性」と第二の話者の話者特徴を示す「子供、男性」を用いて、辞書・モデルデータベースに格納された辞書・モデルとの類似度を算出し、類似度が最大となるものを選択する。具体的に第一話者からの発話に対する音声認識処理においては、音響モデルの選出は第一の話者の話者属性を用いて辞書・データベースに格納されている音声認識用音響モデルが持つ話者の属性との類似度を算出し、類似度が最大となるものを選択する。言語モデルの選出は第一の話者の話者特徴と第二の話者の話者特徴を用いて話者関係判定手段により得られた話者関係と両話者の特徴とを同時に用いて、辞書・データベースに格納されている音声認識用言語モデルの属性情報との類似度を算出し、類似度が最大となるものを選択する。このようにして、第一言語の話者が「青年、女性」で、なおかつ、第二の話者が「子供、男性」との属性を持つ音声認識用辞書・モデルを選択されて音声認識を行うことが可能となる。同様に、第一の話者の話者特徴と第二の話者の話者特徴が別の特徴であっても、前記のように音声認識用辞書・モデルを選択して、その両話者の特徴に適した音声認識を行うことが可能となる。 For example, when the first speaker is a young adult female and the second speaker is a male child, the dictionary / model selection means uses “young adult, female”, the features of the first speaker, and “child, male”, the features of the second speaker, to calculate the similarity to the dictionaries / models stored in the dictionary / model database, and selects the one with the maximum similarity. Specifically, in speech recognition of an utterance from the first speaker, the acoustic model is chosen by using the first speaker's attributes to calculate the similarity to the speaker attributes of each speech recognition acoustic model stored in the dictionary / database, and selecting the one with the maximum similarity. The language model is chosen by using the speaker relationship, which the speaker relationship determination means obtains from the features of the first and second speakers, together with the features of both speakers, to calculate the similarity to the attribute information of each speech recognition language model stored in the dictionary / database, and selecting the one with the maximum similarity. In this way, a speech recognition dictionary / model whose attributes are “young adult, female” for the first-language speaker and “child, male” for the second speaker is selected, and speech recognition can be performed. Similarly, when the first and second speakers have other features, a speech recognition dictionary / model is selected as described above, and speech recognition suited to the features of both speakers can be performed.

また、コミュニケーション処理として機械翻訳を行うよう実現した場合、形態素解析、構文解析、目的言語を生成する目的言語生成処理等が含まれる。目的言語生成処理では、第一の話者の話者特徴と第二話者の話者特徴を同時に用いて、辞書・モデル選択手段により、辞書・モデルデータベースに格納された翻訳辞書を選択する。 Further, when machine translation is realized as communication processing, morphological analysis, syntax analysis, target language generation processing for generating a target language, and the like are included. In the target language generation process, the dictionary / model selection means selects the translation dictionary stored in the dictionary / model database by using the speaker characteristics of the first speaker and the speaker characteristics of the second speaker at the same time.

例えば、中日機械翻訳を行う場合、中国語「図３３の文章１」の例では、中国語人称代名詞「晩生」は一般的に謙譲的な言い方で、聞き手より年下の人が男性でも女性でも使用できる。例えば日中翻訳辞書の原言語である「晩生」のブロックに「原言語話者特徴（年下）」との属性情報を付与する。「晩生」と対応する日本語の訳語候補は、「わたくし」、「わたし」、「僕」等となるため、構築できる機械翻訳用辞書は原言語側では、「晩生」と「年下」から構成される原言語ブロックで、生成側では、「わたくし」と目的言語話者特徴（年上、女性）から構成する目的言語生成ブロックと、「わたし」と目的言語話者特徴（年上、男性）から構成する目的言語生成ブロックと、「僕」と目的言語話者特徴（年下、男性）等の情報から構成するブロックで辞書を構築することができる。 For example, when performing Chinese-Japanese machine translation, in the Chinese example “sentence 1 in FIG. 33”, the Chinese personal pronoun 「晩生」 is generally a humble form that can be used by a speaker younger than the listener, whether male or female. For example, attribute information “source language speaker feature (younger)” is attached to the block for 「晩生」, the source-language side of the translation dictionary. The Japanese translation candidates for 「晩生」 include 「わたくし」 (watakushi), 「わたし」 (watashi), and 「僕」 (boku). The machine translation dictionary can therefore be built with, on the source side, a source-language block consisting of 「晩生」 and “younger”, and, on the generation side, a target language generation block consisting of 「わたくし」 and the target language speaker feature (older, female), a target language generation block consisting of 「わたし」 and the target language speaker feature (older, male), and a block consisting of 「僕」 and information such as the target language speaker feature (younger, male).

次に、前記の形態素情報、格フレームを持つ構文情報と翻訳規則を用いて機械翻訳を行い、「わたしは本を読んでいます」との翻訳結果を生成することになる。同様に、第一の話者は青年男性で、第二の話者は老人女性の際に、前記翻訳例の出力は「わたくしは本を読んでいます」との翻訳結果を生成することになる。第一の話者は青年男性で、第二の話者は子供の際に、翻訳結果は「僕は本を読んでいます」との翻訳結果を生成する翻訳方法を提供することにより、翻訳結果の多様性を実現することができる。 Next, machine translation is performed using the morpheme information, the syntax information with case frames, and the translation rules, producing the translation result 「わたしは本を読んでいます」 (“I am reading a book”, with watashi). Similarly, when the first speaker is a young adult male and the second speaker is an elderly female, the translation example yields 「わたくしは本を読んでいます」 (with watakushi). When the first speaker is a young adult male and the second speaker is a child, the result is 「僕は本を読んでいます」 (with boku). Providing a translation method that produces these results makes a variety of translation results possible.

例えば、第一の話者の話者特徴と第二の話者の特徴を同時に用いて、話者関係を判定する。特に、第二言語側の話者が老人の場合、辞書・モデル選択手段により、両話者の特徴を用いて音声合成用辞書・モデルを選択して、合成音声波形の持続時間をやや長くし、音圧レベルとピッチ周波数の閾値を適切に調整することにより、高齢者に適した速度でゆっくり喋らせ、音量をやや大きくして、声の高さをコントロールするができる音声合成方法を提供することにより、合成された音声の実用性を高めることができる。 For example, the speaker relationship is determined using the features of the first and second speakers together. In particular, when the second-language speaker is elderly, the dictionary / model selection means uses the features of both speakers to select a dictionary / model for speech synthesis, slightly lengthens the duration of the synthesized speech waveform, and appropriately adjusts the thresholds of the sound pressure level and the pitch frequency, so that the synthesized voice speaks slowly at a speed suited to elderly listeners, at a slightly higher volume and with a controlled pitch. Providing such a speech synthesis method improves the practicality of the synthesized speech.
＜第３の実施の形態＞
<Third Embodiment>
第３の実施の形態を説明する。 A third embodiment will be described.

図3は、本発明を実施するための第３の実施の形態の構成を示すブロック図である。 FIG. 3 is a block diagram showing a configuration of the third embodiment for carrying out the present invention.

図3を参照すると、本発明の第３の実施の形態は、図2に示された第２の実施の形態の構成において、前記入力手段101と、前記第一の話者特徴抽出手段104と、前記第二の話者特徴抽出手段105と、前記出力手段116を備えているほか、両話者の対話履歴を保存する話者対話履歴データベース310と、コミュニケーション処理手段300とを備えている。両話者の対話履歴データベース310では、両話者の話者特徴と共に、話者の対話履歴を時系列で記録する。 Referring to FIG. 3, the third embodiment of the present invention includes, as in the configuration of the second embodiment shown in FIG. 2, the input means 101, the first speaker feature extraction means 104, the second speaker feature extraction means 105, and the output means 116, and additionally includes a speaker dialogue history database 310, which stores the dialogue history of both speakers, and communication processing means 300. The dialogue history database 310 records the speakers' dialogue history in time series together with the speaker features of both speakers.

コミュニケーション処理手段300では、話者関係判定手段、対話履歴管理手段、辞書・モデル選択手段、音声認識手段、テキスト変換手段、機械翻訳手段、音声合成手段等の中から、一つまたは二つ以上の構成でコミュニケーション処理を実施する。 The communication processing means 300 performs communication processing using one or more of the following: speaker relationship determination means, dialogue history management means, dictionary / model selection means, speech recognition means, text conversion means, machine translation means, speech synthesis means, and the like.

また、話者からの発話に対し、常に話者の特徴に応じて時系列で対話履歴データベース310に記録する。発話者が新規話者の場合、新規ユーザとして発話者の特徴量とその発話を話者対話履歴データベース310に保存する。新規話者でない場合、話者の対話履歴を検索して、時系列で話者対話履歴データベース310に記録する。 Further, utterances from the speaker are always recorded in the dialogue history database 310 in time series according to the characteristics of the speaker. When the speaker is a new speaker, the feature amount of the speaker and the utterance are stored in the speaker interaction history database 310 as a new user. If the speaker is not a new speaker, the conversation history of the speaker is searched and recorded in the speaker conversation history database 310 in time series.
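
The recording rule above (register a new speaker with features and utterance, otherwise append to the existing history in time order) might look like the following sketch. The class `DialogueHistory` and the use of an explicit `speaker_id` are assumptions made for illustration; the patent identifies speakers by their extracted features and does not describe a storage layout.

```python
import itertools
from collections import defaultdict


class DialogueHistory:
    """In-memory stand-in for the speaker dialogue history database."""

    def __init__(self):
        self._features = {}                   # speaker_id -> speaker features
        self._utterances = defaultdict(list)  # speaker_id -> [(sequence, utterance)]
        self._clock = itertools.count()       # monotonically increasing order stamp

    def record(self, speaker_id, features, utterance) -> bool:
        """Store one utterance; return True if the speaker was new."""
        is_new = speaker_id not in self._features
        if is_new:
            self._features[speaker_id] = features
        self._utterances[speaker_id].append((next(self._clock), utterance))
        return is_new

    def history(self, speaker_id) -> list:
        """Return the speaker's utterances in the order they were recorded."""
        return [text for _, text in self._utterances.get(speaker_id, [])]


db = DialogueHistory()
print(db.record("A", {"age": "young adult", "sex": "male"}, "first utterance"))
print(db.record("A", {"age": "young adult", "sex": "male"}, "second utterance"))
print(db.history("A"))
```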

また、本実施の形態における辞書・モデルデータベースは、前記第２の実施の形態と同様なものによりシステムを構築することができる。 The system of the present embodiment can be built using the same dictionary / model database as in the second embodiment.

また、対話履歴管理手段114は、図29に示すように、話者関係判定手段113から出力された両話者の特徴と話者関係に基づいて、話者対話履歴データベース310から両話者の対話履歴を抽出して分析を行う対話履歴分析手段2901と、対話履歴の分析結果により前記第一の話者特徴抽出手段104と前記第二の話者特徴抽出手段105により抽出された話者の特徴及び前記話者関係判定手段113から出力された話者の関係が正しいかどうかを判定する話者特徴判定手段2902と、前記話者特徴判定手段2902からの判定結果において、「誤りがある」と判定された際、話者の特徴に対するフィードバック処理を行うフィードバック処理手段2903とを構成される。 As shown in FIG. 29, the dialogue history management means 114 comprises: dialogue history analysis means 2901, which extracts the dialogue history of both speakers from the speaker dialogue history database 310 based on the features of both speakers and the speaker relationship output from the speaker relationship determination means 113, and analyzes it; speaker feature determination means 2902, which uses the analysis result to determine whether the speaker features extracted by the first speaker feature extraction means 104 and the second speaker feature extraction means 105, and the speaker relationship output from the speaker relationship determination means 113, are correct; and feedback processing means 2903, which performs feedback processing on the speaker features when the speaker feature determination means 2902 determines that “there is an error”.

対話履歴分析手段2901は、話者関係判定手段113から出力された両話者の特徴と話者関係に基づいて、話者対話履歴データベース310から両話者の対話履歴を抽出して、話者特徴と、対話スタイルと、対話内容のキーワードなどを含むベクトルや、又は前記ベクトルを時系列化したモデルを生成する処理などを行う。 The dialogue history analysis means 2901 extracts the dialogue history of both speakers from the speaker dialogue history database 310 based on the features of both speakers and the speaker relationship output from the speaker relationship determination means 113, and performs processing such as generating a vector that contains the speaker features, the dialogue style, keywords of the dialogue content, and the like, or a model in which such vectors are arranged in time series.

話者特徴判定手段2902は、前記話者対話履歴分析手段2901で生成された対話履歴の特徴ベクトルやモデルなどを用いて、前記第一の話者特徴抽出手段104と前記第二の話者特徴抽出手段105とにより抽出された話者の特徴及び前記話者関係判定手段113から出力された話者の関係に対する判定結果が正しいかどうかを判定する。例えば、日本語の「女言葉」の言語表現を示す特徴量と、「男言葉」の言語表現を示す特徴量とをそれぞれ用意して、話者対話履歴データベース310に保存しておいて、対話履歴分析手段2901からえられた話者の特徴ベクトルと照合することにより、話者の性別の判定を行うことができる。 The speaker feature determination means 2902 uses the feature vectors, models, and the like of the dialogue history generated by the dialogue history analysis means 2901 to determine whether the speaker features extracted by the first speaker feature extraction means 104 and the second speaker feature extraction means 105, and the speaker relationship output from the speaker relationship determination means 113, are correct. For example, feature values representing Japanese “feminine speech” (女言葉) expressions and feature values representing “masculine speech” (男言葉) expressions are prepared and stored in the speaker dialogue history database 310; by matching them against the speaker's feature vector obtained from the dialogue history analysis means 2901, the speaker's sex can be determined.

フィードバック処理手段2903は、前記話者特徴判定手段2902からの判定結果において、「誤りがある」と判定された際、話者の特徴に対してフィードバック処理を行う。例えば、第一の話者と第二の話者が共に男性と推定されたが、第一の話者の入力文が「鍵を持ってきてくれてよかった。ありがとうございました。どうしてかしら、あたし、最近よく忘れ物をするわよ。」がある時に、前記話者特徴判定手段2902により第一の話者が「女性」であることを判定して、フィードバック処理を行うことにより、第一話者の性別の属性値を直すことできる。 The feedback processing means 2903 performs feedback processing on the speaker features when the speaker feature determination means 2902 determines that “there is an error”. For example, suppose the first and second speakers were both estimated to be male, but the first speaker's input is 「鍵を持ってきてくれてよかった。ありがとうございました。どうしてかしら、あたし、最近よく忘れ物をするわよ。」 (“I'm glad you brought the key. Thank you very much. I wonder why, but I keep forgetting things lately.”), which contains feminine expressions such as 「かしら」, 「あたし」, and 「わよ」. In that case, the speaker feature determination means 2902 determines that the first speaker is “female”, and feedback processing corrects the sex attribute value of the first speaker.
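
A toy version of this check and correction is sketched below. It is an assumption-laden illustration: the patent matches feature values against a feature vector, whereas this sketch simply counts marker strings, and the marker lists (in particular the masculine markers 「俺」 and 「だぜ」) and function names are hypothetical.

```python
# Hypothetical marker lists; a real system would use learned feature values.
FEMININE_MARKERS = ("かしら", "あたし", "わよ")
MASCULINE_MARKERS = ("僕", "俺", "だぜ")


def infer_sex(utterances):
    """Guess the speaker's sex from marker counts; None when undecided."""
    text = "".join(utterances)
    feminine = sum(text.count(marker) for marker in FEMININE_MARKERS)
    masculine = sum(text.count(marker) for marker in MASCULINE_MARKERS)
    if feminine > masculine:
        return "female"
    if masculine > feminine:
        return "male"
    return None


def apply_feedback(features: dict, utterances) -> dict:
    """Return corrected features when the history contradicts the estimate."""
    inferred = infer_sex(utterances)
    if inferred is not None and inferred != features.get("sex"):
        return {**features, "sex": inferred}  # "there is an error": fix the attribute
    return features


estimated = {"age": "young adult", "sex": "male"}
sentence = "鍵を持ってきてくれてよかった。ありがとうございました。どうしてかしら、あたし、最近よく忘れ物をするわよ。"
print(apply_feedback(estimated, [sentence]))
```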

In this way, information obtained from the state and progress of the speakers' dialogue is used to judge automatically whether the speaker features are correct, so that errors in a speaker's personal attributes can be detected and corrected automatically.
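The check and feedback described above can be sketched as follows. This is only an illustration under stated assumptions: the marker lists, the function names, and the `margin` threshold are hypothetical, and simple substring counting stands in for the feature vectors or models that the speaker dialogue history analysis means 2901 would actually produce.

```python
# Hypothetical marker lists; a real system would use learned feature
# vectors for "women's language" and "men's language" instead.
FEMININE_MARKERS = ("かしら", "あたし", "わよ")
MASCULINE_MARKERS = ("だぜ", "俺", "だぞ")


def count_markers(utterances, markers):
    """Count occurrences of any marker across the dialogue history."""
    return sum(text.count(m) for text in utterances for m in markers)


def check_gender(estimated, utterances, margin=2):
    """Return (is_consistent, gender_after_feedback).

    The estimate is overridden only when markers of the opposite style
    outnumber those of the estimated style by at least `margin`.
    """
    fem = count_markers(utterances, FEMININE_MARKERS)
    masc = count_markers(utterances, MASCULINE_MARKERS)
    if estimated == "male" and fem - masc >= margin:
        return False, "female"
    if estimated == "female" and masc - fem >= margin:
        return False, "male"
    return True, estimated


history = ["鍵を持ってきてくれてよかった。ありがとうございました。"
           "どうしてかしら、あたし、最近よく忘れ物をするわよ。"]
consistent, gender = check_gender("male", history)
```

With the example utterance from the text, the initial estimate "male" is judged inconsistent and corrected to "female".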

Next, the operation of the third embodiment of the present invention will be described in detail with reference to FIG. 3 and FIG. 24.

入力手段101を通じてシステムにデータを入力する（ステップ2401）。 Data is input to the system through the input means 101 (step 2401).

Next, the first speaker feature extraction means 104 extracts speaker features indicating the personal attributes of the first speaker, and the second speaker feature extraction means 105 extracts speaker features indicating the personal attributes of the second speaker (step 2402).

Using the speaker features of the first speaker extracted by the first speaker feature extraction means 104 and those of the second speaker extracted by the second speaker feature extraction means 105, the speaker relationship determination means determines the relationship between the speakers. For example, comparing the ages of the two speakers yields a determination of "older" or "younger" (step 2403).

Based on the features of both speakers and the speaker relationship output from the speaker relationship determination means 113, the dialogue history of both speakers is extracted from the speaker dialogue history database 310 and analyzed by the dialogue history analysis means 2901. Using the analysis result, the speaker feature determination means 2902 determines whether the features of both speakers and the speaker relationship are correct. If it determines that "there is an error", the feedback processing means 2903 performs feedback processing so that speaker feature extraction is carried out again (step 2404).

Using the verified speaker attributes and speaker relationship obtained in the previous step, the dictionary/model selection means selects the dictionaries and models for communication processing (step 2405). For example, when the input is text and machine translation is performed as the communication processing, the dictionary/model selection means selects a translation dictionary and model suited to the speaker features of both speakers. When the input is speech and speech recognition is performed as the communication processing, it selects a speech recognition dictionary and model suited to the speaker features of both speakers.

The dictionary/model selection means takes the attribute information of both speakers extracted by the speaker feature extraction means 104 and 105, together with the speaker relationship information obtained by the speaker relationship determination means, calculates its similarity to the attribute information attached to each dictionary and model stored in the dictionary/model database, and selects the dictionary or model with the highest similarity.

続いて、前記のステップで選択された辞書・モデルを用いて、コミュニケーション処理手段100によりコミュニケーション処理を行う（ステップ2406）。コミュニケーション処理方式として、音声認識、テキスト変換、機械翻訳、音声合成等を実施することができる。 Subsequently, communication processing is performed by the communication processing means 100 using the dictionary / model selected in the above step (step 2406). As a communication processing method, speech recognition, text conversion, machine translation, speech synthesis, and the like can be performed.

そして、前記コミュニケーション処理手段300での処理結果を、出力装置を通じて出力する（ステップ2407）。 Then, the processing result of the communication processing means 300 is output through the output device (step 2407).
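The flow of steps 2401 to 2407 can be sketched as below. The sketch is illustrative only: `Speaker`, `run_pipeline`, the callback parameters, the "same age" label, and the `max_retries` bound are assumptions introduced here, since the text does not say how many times extraction is repeated after feedback.

```python
from dataclasses import dataclass


@dataclass
class Speaker:
    age: int
    gender: str


def judge_relationship(first, second):
    """Step 2403: label the first speaker relative to the second by age."""
    if first.age > second.age:
        return "older"
    if first.age < second.age:
        return "younger"
    return "same age"  # assumed label; the text names only older/younger


def run_pipeline(text, extract, verify, select_model, process, max_retries=3):
    """Steps 2401-2407 with a bounded feedback loop around extraction.

    If verification never succeeds, the last estimate is used as-is.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for _ in range(max_retries):
        first, second = extract(text)                  # step 2402
        relation = judge_relationship(first, second)   # step 2403
        if verify(first, second, relation):            # step 2404
            break                                      # features accepted
    model = select_model(first, second, relation)      # step 2405
    return process(text, model)                        # steps 2406-2407
```

Passing the stages as callables keeps the sketch independent of whether the communication processing is speech recognition, text conversion, machine translation, or speech synthesis.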

第３の実施の形態の効果について説明する。 The effect of the third embodiment will be described.

As described above, the third embodiment of the present invention provides the effects of the second embodiment and, in addition, is provided with the speaker dialogue history management means 114 and the speaker dialogue history database 310. The speakers' dialogue history is analyzed, and information obtained from the state and progress of the dialogue is used to check the speaker features extracted by the first speaker feature extraction means 104 and the second speaker feature extraction means 105 and to apply feedback processing to them. Errors can thus be detected automatically, so that the speaker features of both speakers are used correctly in the communication processing.

本発明の実施例１を、図面を参照して説明する。かかる実施例は第１の実施の形態に対応するものである。 A first embodiment of the present invention will be described with reference to the drawings. Such an example corresponds to the first embodiment.

This example is shown in FIG. 4. It comprises text input means 102, second speaker feature extraction means 105, text conversion means 106, a dictionary/model database 107, dictionary/model selection means 112, and text output means 115.

本実施例におけるコミュニケーション処理手段100はテキスト変換手段106と、辞書・モデル選択手段112とを有する。 The communication processing unit 100 in this embodiment includes a text conversion unit 106 and a dictionary / model selection unit 112.

テキスト入力手段102としてキーボードを、テキスト出力手段115としてディスプレイを利用する。 A keyboard is used as the text input means 102 and a display is used as the text output means 115.

The second speaker feature extraction means 105 extracts speaker features in real time. Methods for extracting speaker features include, for example, estimating speaker attributes such as age and gender from the speaker's voice data, and estimating speaker attributes from an image containing the speaker's face. Extracting speaker features from an image containing the speaker's face generally involves applying grayscale conversion, image angle normalization, image size normalization, image feature extraction, and speaker attribute estimation to the input image data.

For extracting features from a speaker's face image, techniques using Gabor wavelets are known; see, for example, the non-patent document "Phantom Faces for Face Analysis", Pattern Recognition 30(6), pp. 837-846, 1997. An effective pixel-position sampling method called retina sampling is also known as a way of setting feature points on facial parts; see, for example, the non-patent document "Face Authentication by retinotopic sampling of the Gabor decomposition and Support Vector Machines", Audio and Video based Person Authentication - AVBPA99, pp. 125-129, 1999.

Known conventional techniques for estimating a speaker's personal attributes from facial features include methods using facial texture, methods using the distance from an average face, and methods using multiple feature quantities.

For example, in the method using the distance from an average face, an average face is created for each age group and gender, and feature points are placed on each average face and, in the same way, on the input face. The distance between the feature points of the input face and those of each average face is obtained from their squared error, evaluated with an evaluation function representing the feature points of the facial parts, and the age group and gender of the average face at the minimum distance are taken as those of the speaker. The techniques disclosed in, for example, JP 2003-99779 A and JP 2006-344236 A can also be used in the apparatus of this embodiment.
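A minimal sketch of the average-face method, assuming each face is reduced to a short list of (x, y) feature points and that a plain sum of squared errors serves as the evaluation function. The coordinates and labels in `AVERAGE_FACES` are made-up illustration values, not data from the cited publications.

```python
def squared_error(points_a, points_b):
    """Sum of squared distances between corresponding (x, y) feature points."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(points_a, points_b))


def nearest_average_face(input_points, average_faces):
    """Return the (age_group, gender) label of the closest average face."""
    return min(average_faces,
               key=lambda label: squared_error(input_points, average_faces[label]))


# Hypothetical average faces with three feature points each.
AVERAGE_FACES = {
    ("young", "female"): [(30, 40), (70, 40), (50, 80)],
    ("elderly", "male"): [(28, 44), (72, 44), (50, 90)],
}

label = nearest_average_face([(29, 43), (71, 44), (50, 88)], AVERAGE_FACES)
```

Here the input face has a squared error of 7 against the ("elderly", "male") average and 91 against the ("young", "female") average, so the former label is returned.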

前記の従来技術のほか、隠れマルコフモデル、遺伝的アルゴリズムを用いることも可能であるが、これだけに限定されない。 In addition to the prior art described above, a hidden Markov model or a genetic algorithm can be used, but is not limited thereto.

Speaker attributes such as age and gender can also be estimated from the speaker's voice data. In a typical method, voice data for each age group, labeled with speaker attributes such as gender and age, is used as training data, and the average voice feature quantity of each age group is obtained by, for example, spectrum analysis of the voice signals. Next, the voice feature quantity of a given speaker is obtained by applying the same kind of analysis to that speaker's voice signal. The similarity between this feature quantity and the average voice feature quantity of each age group is then calculated, and the group with the highest similarity is selected, giving an estimate of the speaker's age, gender, and so on.

For example, speaker features can also be obtained by acoustically analyzing the speaker's voice to obtain acoustic feature quantities, calculating their similarity to standard patterns prepared in advance for each gender and for each of the child, young, young adult, middle-aged, and elderly groups, and taking the age group and gender of the most similar standard pattern.
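The voice-based estimation can be sketched in two stages: averaging labeled training features per group, then picking the group whose average is most similar to the new speaker. The sketch assumes the acoustic analysis has already produced fixed-length numeric feature vectors, and it uses 1 / (1 + Euclidean distance) as the similarity, which is an assumed choice; the function names are hypothetical.

```python
import math
from collections import defaultdict


def class_means(labeled_features):
    """Average the feature vectors of each label, e.g. ("elderly", "male")."""
    sums, counts = {}, defaultdict(int)
    for label, vec in labeled_features:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def similarity(a, b):
    """Assumed similarity: larger when the vectors are closer together."""
    return 1.0 / (1.0 + math.dist(a, b))


def estimate_attributes(features, means):
    """Return the label whose average feature vector is most similar."""
    return max(means, key=lambda label: similarity(features, means[label]))
```

For instance, after `means = class_means(training_data)`, calling `estimate_attributes(new_features, means)` returns the (age group, gender) label of the closest standard pattern.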

The speaker feature extraction means 105 is not limited to extraction from the speaker's voice data or face image. It may also use a sensor, an IC card capable of communicating the speaker's personal attributes, or a terminal device incorporating such an IC card; obtain the speaker's biometric information by other means; or obtain information such as the speaker's race, age, and gender from the speaker's profile.

The dictionary/model selection means 112 selects dictionaries and models suited to the speaker, for speech recognition, text conversion or machine translation, and speech synthesis, from the database 107, which stores dictionaries and models tagged with identification information indicating speaker features.

The algorithm for selecting a dictionary or model can use the inner product, or a similarity based on the Jaccard coefficient, cosine, Dice coefficient, chi-square statistic, or the reciprocal of the Euclidean distance, but is not limited to these.

For example, let the speaker feature vector of standard speaker A be

and its normalized weights be

let the speaker feature vector of a speaker B be

and its normalized weights be

Let X·Y denote the operation that, for the attributes X and Y have in common, multiplies the weights of each shared attribute and returns the sum of the products. The similarity between speaker B and standard speaker A by the inner product, Jaccard coefficient, cosine, and Dice coefficient methods is then calculated by the following formulas.

The similarity between the speaker's features and the features of the standard speaker attached to each dictionary or model is calculated by the method shown in the above formulas, and the dictionary or model with the highest similarity is selected.
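The similarity-based selection can be sketched as follows, assuming each speaker is represented as a mapping from attribute name to normalized weight and that the four measures take their standard weighted forms built on the X·Y operation defined above. The formulas here are those standard definitions rather than a transcription of the patent's own equations, and all function and model names are hypothetical.

```python
import math


def dot(x, y):
    """X·Y: sum of weight products over attributes present in both vectors."""
    return sum(w * y[attr] for attr, w in x.items() if attr in y)


def inner_product(x, y):
    return dot(x, y)


def cosine(x, y):
    denom = math.sqrt(dot(x, x) * dot(y, y))
    return dot(x, y) / denom if denom else 0.0


def dice(x, y):
    denom = dot(x, x) + dot(y, y)
    return 2 * dot(x, y) / denom if denom else 0.0


def jaccard(x, y):
    denom = dot(x, x) + dot(y, y) - dot(x, y)
    return dot(x, y) / denom if denom else 0.0


def select_model(speaker, models, similarity=cosine):
    """Return the name of the model whose standard speaker is most similar."""
    return max(models, key=lambda name: similarity(speaker, models[name]))


# Hypothetical standard speakers attached to two dictionaries.
MODELS = {
    "polite": {"elderly": 1.0, "male": 1.0},
    "casual": {"young": 1.0, "male": 1.0},
}
chosen = select_model({"elderly": 1.0, "male": 1.0}, MODELS)
```

In this toy case every measure ranks "polite" first, because the speaker shares both attributes with it and only one with "casual".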

The dictionary/model database 107 is a database of dictionaries and models for speech recognition, text conversion or machine translation, and speech synthesis, together with thesaurus dictionaries and the like. Each dictionary and model is built by generalizing over the attributes, such as age and gender, of a large number of speakers, and is tagged with the feature quantities of the standard speaker of each age group as identification information. For example, separate dictionaries can be prepared for each gender and for each of the child, young, young adult, middle-aged, and elderly groups.

The text conversion means 106 paraphrases the input text. It applies processing such as morphological analysis and syntactic analysis to the input from the text input means 102. Then, using the speaker attributes obtained by the second speaker feature extraction means 105, the text can be paraphrased into expressions suited to the second speaker, drawn from the dictionary/model database 107 by the dictionary/model selection means 112.

For example, for Japanese paraphrasing, assume that a paraphrase dictionary has been built in advance according to the attributes of the second speaker and stored in the dictionary/model database. The Japanese te-connective compound predicate 「ほしい」 ("want [someone to do]") is used here as an example. The generation-side dictionary is constructed as follows. The source-side block for 「ほしい」, the target of paraphrasing, consists of the unique part 「欲しい」, the part-of-speech information "te-connective compound predicate (i-type)", conjugation information, connection number information, and so on. The generation side consists of two blocks: 「頂く」, which is more polite, and 「下さい」, which is less polite. The first block consists of the unique part 「頂きたい」, part-of-speech information, conjugation information, connection number information, the second speaker attributes (elderly, male, female), and so on. The second block consists of the unique part 「下さい」, part-of-speech information, conjugation information, connection number information, the second speaker attributes (child, young, male, female), and so on.

For example, suppose the first speaker inputs 「今週のレポートを見せてほしいですが」 ("I would like you to show me this week's report") to the system, and the second speaker is estimated to be "elderly, male". The paraphrasing process first performs Japanese morphological and syntactic analysis, which yields the morphemes: noun 「今週」, case particle 「の」, noun 「レポート」, case particle 「を」, ichidan verb 「見せる」, conjunctive particle 「て」, te-connective compound predicate 「ほしい」, auxiliary verb 「だ」, and sentence-final particle 「が」. At this point the two candidates 「頂きたい」 and 「下さい」 are read from the dictionary entry for 「ほしい」, the word that carries the politeness level.

Next, the speaker attributes "elderly, male" of the second speaker are used to calculate the similarity to the speaker features of each of the two candidates, 「頂きたい」 and 「下さい」. The block for 「頂きたい」, which carries "elderly, male", has the highest similarity, so 「頂きたい」 is chosen as the paraphrase of 「ほしい」.

The morphemes for generating the paraphrased sentence are then 「今週」, 「の」, 「レポート」, 「を」, 「見せる」, 「て」, 「頂きたい」, the auxiliary verb 「だ」, and the sentence-final particle 「が」.

Next, the paraphrased target sentence is generated using Japanese syntactic rule information. The result is 「今週のレポートを見せて頂きたいですが」. Similarly, when the speaker attributes of the speaker on the second-language side are "young, male" or "young, female", the target sentence is 「今週のレポートを見せてください」.
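The candidate selection in this example can be sketched as follows. Only the lexical choice between 「頂きたい」 and 「下さい」 is shown; morphological analysis and sentence generation are outside the sketch. The `PARAPHRASES` layout, the English attribute labels, and the use of a set-based Jaccard coefficient are assumptions made for illustration.

```python
# Hypothetical generation-side blocks for 「ほしい」, keeping only the
# surface form and the second speaker attributes of each block.
PARAPHRASES = {
    "ほしい": [
        {"surface": "頂きたい", "listener": {"elderly", "male", "female"}},
        {"surface": "下さい", "listener": {"child", "young", "male", "female"}},
    ],
}


def jaccard(a, b):
    """Set-based Jaccard coefficient between two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def choose_paraphrase(word, listener_attrs):
    """Pick the candidate whose listener attributes best match the listener."""
    candidates = PARAPHRASES.get(word)
    if not candidates:
        return word  # no paraphrase registered; keep the original word
    best = max(candidates, key=lambda c: jaccard(c["listener"], listener_attrs))
    return best["surface"]
```

For a listener with attributes {"elderly", "male"} the first block scores 2/3 against 1/5 for the second, so 「頂きたい」 is chosen; for {"young", "female"} the scores are 1/4 and 1/2, so 「下さい」 is chosen.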

実施例２を、図面を参照して説明する。かかる実施例は本発明を実施するための第１の実施の形態に対応するものである。 A second embodiment will be described with reference to the drawings. This example corresponds to the first embodiment for carrying out the present invention.

Referring to FIG. 5, Example 2 of the present invention replaces the text output means 115 of Example 1 (FIG. 4) with speech synthesis means 110, which synthesizes speech from the text output by the text conversion means 106, and speech output means 111, which outputs the synthesized speech. In other respects it is the same as Example 1.

音声合成手段110では、テキスト変換手段106からの出力に対して、第二の話者特徴抽出手段105により抽出した第二の話者特徴を用いて、辞書・モデル選択手段112により、辞書・モデルデータベース107の中から、第二の話者特徴に適する音声合成用辞書を選択し、音声合成処理を行う。 The speech synthesis unit 110 uses the second speaker feature extracted by the second speaker feature extraction unit 105 for the output from the text conversion unit 106, and the dictionary / model selection unit 112 uses the dictionary / model selection unit 112. A speech synthesis dictionary suitable for the second speaker feature is selected from the database 107, and speech synthesis processing is performed.

For example, when the speaker on the second-language side is elderly, the dictionary/model selection means 112 uses the second speaker features to select a speech synthesis dictionary and model from the dictionary/model database 107. The duration of the synthesized speech waveform is made slightly longer, and the thresholds for sound pressure level and pitch frequency are adjusted appropriately, so that the speech is delivered slowly at a speed suited to elderly listeners, at a slightly higher volume, and with a controlled pitch. Providing such a speech synthesis method makes the synthesized speech more practical.

音声出力手段111は、前記音声合成手段110から出力された合成音声波形信号を音声信号としてスピーカを通して合成音声を出力する音声出力手段である。 The voice output unit 111 is a voice output unit that outputs a synthesized voice through a speaker using the synthesized voice waveform signal output from the voice synthesis unit 110 as a voice signal.

In this example, given the input 「今週のレポートを見せてほしいですが」 from the example sentence of Example 1, the paraphrasing described above is performed, and a speech synthesis dictionary and model are then selected using the speaker features of the second speaker to synthesize speech from the paraphrased result. When the second speaker is "elderly, male" or "elderly, female", the duration of the synthesized waveform is made slightly longer and the thresholds for sound pressure level and pitch frequency are adjusted appropriately, so that 「今週のレポートを見せて頂きたいですが」 is output as synthesized speech spoken slowly at a speed suited to elderly listeners, at a slightly higher volume, and with a controlled pitch. When the second speaker is "young, male" or "young, female", 「今週のレポートを見せてください」 is output as synthesized speech with ordinary synthesis settings.
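One way to picture the listener-dependent synthesis settings is a small table of prosody profiles keyed by listener attribute. This is a sketch only: the `Prosody` fields and the numeric values are hypothetical, since the text says no more than "slightly longer" and "slightly higher", and no particular speech synthesis engine is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    rate: float    # relative speaking rate; below 1.0 means slower speech
    volume: float  # relative loudness; above 1.0 means louder speech
    pitch: float   # relative pitch scale


DEFAULT = Prosody(rate=1.0, volume=1.0, pitch=1.0)

# Hypothetical profile for elderly listeners: slower and slightly louder.
PROFILES = {"elderly": Prosody(rate=0.8, volume=1.2, pitch=1.0)}


def select_prosody(listener_attrs):
    """Return the profile for the first matching attribute, else the default."""
    for group, profile in PROFILES.items():
        if group in listener_attrs:
            return profile
    return DEFAULT
```

With these values, a listener described as {"elderly", "male"} gets the slower, louder profile, while {"young", "female"} falls back to the ordinary settings.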

実施例３を、図面を参照して説明する。かかる実施例は本発明を実施するための第１の実施の形態に対応するものである。 A third embodiment will be described with reference to the drawings. This example corresponds to the first embodiment for carrying out the present invention.

As shown in FIG. 6, this example replaces the text conversion means 106 of Example 1 with machine translation means 109; in other respects it is the same as Example 1.

The machine translation means 109 translates the text input from the text input means 102 into the second language. For the dictionaries needed for machine translation, the dictionary/model selection means 112 uses the second speaker's features extracted by the second speaker feature extraction means 105 to select, from the dictionary/model database 107, translation dictionaries suited to those features, and translation is then performed.

As an example, the function of outputting a machine translation result suited to the features of the second speaker is described with reference to FIG. 25 and FIG. 26.

The data shown in FIG. 25 is a Japanese-English translation dictionary entry, using the Japanese word 「名前」 ("name") on the source-language side as an example of the data structure. The source-language fields include headword information, reading information, standard notation information, part-of-speech information, sentence style information, and so on. The target-language fields include reading information, part-of-speech information, information indicating target-language speaker features, sentence style information, and so on. Specifically, the dictionary entry for the Japanese word 「名前」 alone has, on the Japanese side, "headword ("名前")", "reading ("なまえ")", and the part-of-speech information "part of speech (noun (common noun))", which correspond on the English side to "E_reading ("name")" and the part-of-speech information "part of speech (NOUN (c))".

In addition, for the dictionary entry consisting of "standard notation ("お名前は")", the standard notation of a Japanese question using the word 「名前」, the part-of-speech information "part of speech (noun (common noun))", and "J_sentence style (question)", which indicates the sentence style of the Japanese question, the English target-language side consists of fields such as "E_reading ("name")", "part of speech (NOUN (c))", sentence style, source-language speaker features, and target-language speaker features. The attribute values of the source-language and target-language speaker features are male, female, child, young, young adult, middle-aged, elderly, "younger" or "older" to indicate the age difference between the speakers, or attribute information combining age and gender. Sets of attribute values such as young adult female, young female, middle-aged male, and middle-aged female can also be used.

For example, suppose the first speaker inputs 「お名前は？」 and the second speaker is a young male. His age and gender attribute value matches "young male" in "target-language speaker features (younger, child, young male)", so the first speaker's utterance 「お名前は？」 is translated for the second speaker, using these target-language speaker features, as "What is your name?". Likewise, when the attribute value of the second speaker is either "child" or "young male", the translation result is "What is your name?"; when it is "middle-aged male", the result is "May I have your name?"; when it is either "middle-aged female" or "elderly", the result is "Could you please tell me your name?"; and when it is "young female" or "young adult female", the result is output as "I would like to ask you whether you would grant me permission to have your name?".
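The lookup described in this example can be sketched as a table from source sentence to candidate translations, each tagged with the target-language speaker features it applies to. The `TRANSLATIONS` structure, the English attribute labels, and the exact-match rule are assumptions for illustration; the dictionary of FIG. 25 carries more fields than are modeled here.

```python
# Hypothetical entries: (target-language speaker features, translation).
TRANSLATIONS = {
    "お名前は？": [
        ({"younger", "child", "young male"},
         "What is your name?"),
        ({"middle-aged male"},
         "May I have your name?"),
        ({"middle-aged female", "elderly"},
         "Could you please tell me your name?"),
        ({"young female", "young adult female"},
         "I would like to ask you whether you would grant me "
         "permission to have your name?"),
    ],
}


def translate(source, listener_attr):
    """Return the translation whose speaker features contain the attribute.

    Returns None when the sentence or the attribute is not covered.
    """
    for features, target in TRANSLATIONS.get(source, []):
        if listener_attr in features:
            return target
    return None
```

For example, `translate("お名前は？", "young male")` yields "What is your name?", and `translate("お名前は？", "elderly")` yields "Could you please tell me your name?".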

実施例４を、図面を参照して説明する。かかる実施例は本発明を実施するための第１の実施の形態に対応するものである。 Example 4 will be described with reference to the drawings. This example corresponds to the first embodiment for carrying out the present invention.

Referring to FIG. 7, Example 4 of the present invention adds to Example 3 (FIG. 6) speech synthesis means 110, which synthesizes speech from the translation result output by the machine translation means 109, and speech output means 111, which outputs the synthesized speech. In other respects it is the same as Example 3.

In the machine translation means 109, for the input from the text input means 101, the dictionary/model selection means 112 uses the second speaker's features extracted by the second speaker feature extraction means 105 to select, from the dictionary/model database 107, a target-language dictionary suited to those features, and machine translation is performed.

In the speech synthesis means 110, for the translation result from the machine translation means 109, a speech synthesis dictionary and model suited to the second speaker's features, as extracted by the second speaker feature extraction means 105, are selected from the dictionary/model database 107, and speech synthesis is performed.

音声出力手段111では、前記音声合成手段110から出力された合成音声波形信号を音声信号としてスピーカを通して合成音声を出力する。 The voice output unit 111 outputs a synthesized voice through a speaker using the synthesized voice waveform signal output from the voice synthesis unit 110 as a voice signal.

For example, in English-to-Japanese translation, suppose the first speaker inputs "How old are you?". In this example, when the second speaker on the Japanese side is a young woman, the input is translated as 「ご芳齢は？」 and output as synthesized speech. When the Japanese-side speaker is elderly, it is translated as 「おいくつですか?」, and the synthesized speech is delivered slowly, at a slightly higher volume and slightly higher pitch. When the Japanese-side speaker is a child, it is translated as 「何歳なの?」 and delivered in a gentle synthesized voice.

実施例５を、図面を参照して説明する。かかる実施例は本発明を実施するための第１の実施の形態に対応するものである。 Example 5 will be described with reference to the drawings. This example corresponds to the first embodiment for carrying out the present invention.

図8を参照すると、本実施例は実施例２のテキスト入力手段102の代わりに、第一話者の音声を入力する音声入力手段103を備え、入力音声を認識処理を行う音声認識手段108を備えている。その他の点は実施例２を同じである。 Referring to FIG. 8, this example replaces the text input means 102 of Example 2 with a voice input means 103 that takes in the first speaker's voice, and adds a speech recognition means 108 that recognizes the input speech. In all other respects it is the same as Example 2.

音声入力手段103は、音声を取り込むために指向性マイクを用いて話者の音声を取り込んで音声入力解析処理を行う。マイクには場所や環境などに限定しない、特に携帯端末などに組み込まれたマイクなどを指す。音声入力解析処理は、音声自動通訳機のようなモバイル端末における様々な環境などで、話者の発話音声データを取り込んで、音声信号における各種の音声処理を行うことである。例えば、8kHzや16kHzのサンプリング周波数でアナログ信号をディジタル信号に変換する処理、音声認識を行うための信号を並列データに変換し、レジスタなどに格納する処理、耐雑音処理など。 The voice input means 103 captures the speaker's voice with a directional microphone and performs voice input analysis processing. The microphone is not restricted to any particular place or environment; it refers in particular to a microphone built into a portable terminal or the like. Voice input analysis processing means capturing the speaker's utterance data in the various environments in which a mobile terminal such as an automatic speech interpreter is used, and applying various kinds of speech processing to the speech signal. Examples include converting the analog signal to a digital signal at a sampling frequency of 8 kHz or 16 kHz, converting the signal for speech recognition into parallel data and storing it in a register or the like, and noise-robustness processing.
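
As a rough illustration of the analog-to-digital step mentioned above, the following sketch samples a test tone at 16 kHz and 8 kHz and quantizes it to signed 16-bit values. The 440 Hz tone, the 16-bit depth, and the function name `sample_and_quantize` are assumptions made for illustration only; the description specifies only the sampling frequencies.

```python
import math

def sample_and_quantize(freq_hz, sample_rate_hz, duration_ms):
    """Sample a sine tone and quantize it to signed 16-bit PCM values.

    The sine tone stands in for the analog microphone signal (an assumption).
    """
    n_samples = sample_rate_hz * duration_ms // 1000
    pcm = []
    for n in range(n_samples):
        # Value of the "analog" signal at sampling instant n / sample_rate_hz.
        x = math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz)
        # Scale to the 16-bit range and clamp to guard against overflow.
        value = int(round(x * 32767))
        pcm.append(max(-32768, min(32767, value)))
    return pcm

pcm_16k = sample_and_quantize(440, 16000, 10)  # 10 ms at 16 kHz
pcm_8k = sample_and_quantize(440, 8000, 10)    # 10 ms at 8 kHz
print(len(pcm_16k), len(pcm_8k))
```

Halving the sampling frequency halves the number of samples for the same duration, which is the trade-off between bandwidth and data volume that the choice of 8 kHz or 16 kHz implies.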

音声認識手段108では、音声入力手段103から出力された音声信号に対して音声認識を行って認識結果を出力するものである。音声認識処理手段108に行われる音声認識処理はＬＰＣ音声分析、音声区間検出、パターン照合、判定などの連続した音声認識処理を行う。音声認識の手法は特定のものとする必要がなく、ＨＭＭ、ニューラルネットワーク、Ｎグラム言語モデルなど、一般的に用いられる既存の手法を採用すればよい。 The speech recognition means 108 performs speech recognition on the speech signal output from the voice input means 103 and outputs the recognition result. The recognition performed by the speech recognition means 108 is a sequence of steps such as LPC speech analysis, speech segment detection, pattern matching, and decision. No particular recognition method is required; commonly used existing methods such as HMMs, neural networks, and N-gram language models may be adopted.
音声認識手段108における音声認識処理の辞書・モデルの選択は、第二の話者特徴抽出手段105により抽出された第二の話者の特徴を用いて、辞書・モデル選択手段112により辞書・モデルデータベース107に格納されているモデル群の中から,第二の話者の特徴に適した音声認識用言語モデルを選択する。 The dictionary / model used in the speech recognition processing of the speech recognition means 108 is selected as follows: using the second speaker's features extracted by the second speaker feature extraction means 105, the dictionary / model selection means 112 selects, from the models stored in the dictionary / model database 107, a speech recognition language model suited to the second speaker's features.

例えば、第二の話者特徴抽出手段により抽出した第二の話者の話者特徴は「子供、女性」の時、第一話者からの発話に対して、音声認識用言語モデルの選択は、「子供、女性」の属性情報を用いて、音声認識用辞書・モデルデータベースから、「子供、女性」との第二の話者特徴を持つ音声認識用言語モデルを選択することができる。具体的に、第二話者の話者属性情報（子供、女性）と、辞書・モデルデータベースに格納されている各種の音声認識用辞書・モデルが持つ属性情報との類似度を算出して、類似度が最大となるものを選択する。このようにして、「子供、女性」との属性を持つ音声認識用言語モデルを選択して音声認識を行うことが可能となる。同様に、第二の話者が他の属性を持つ話者の時、その話者の属性にあう音声認識用言語モデルを、辞書・モデル選択手段により選択して音声認識を行うことにより、音声認識精度を向上することができる。 For example, when the speaker features of the second speaker extracted by the second speaker feature extraction means are "child, female", the language model used to recognize the first speaker's utterance is chosen by using the attribute information "child, female" to select, from the speech recognition dictionary / model database, a speech recognition language model that carries the second-speaker features "child, female". Specifically, the similarity between the second speaker's attribute information (child, female) and the attribute information of each speech recognition dictionary / model stored in the dictionary / model database is computed, and the one with the highest similarity is selected. In this way, recognition can be performed with a language model that has the "child, female" attributes. Likewise, when the second speaker has other attributes, the dictionary / model selection means selects a language model matching those attributes, which can improve speech recognition accuracy.
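
The similarity-based selection described above can be sketched as follows. The description does not state which similarity measure is used, so the Jaccard index here is an assumption, as are the model names, the attribute labels, and the helper names `jaccard` and `select_model`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LanguageModel:
    name: str              # hypothetical identifier of a stored model
    attributes: frozenset  # speaker attributes attached to the model

def jaccard(a, b):
    """Jaccard similarity between two attribute sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_model(listener_attrs, models):
    """Return the model whose attributes are most similar to the listener's."""
    return max(models, key=lambda m: jaccard(listener_attrs, m.attributes))

# Hypothetical contents of the dictionary / model database.
MODELS = [
    LanguageModel("lm_child_female", frozenset({"child", "female"})),
    LanguageModel("lm_child_male", frozenset({"child", "male"})),
    LanguageModel("lm_elderly_female", frozenset({"elderly", "female"})),
]

chosen = select_model(frozenset({"child", "female"}), MODELS)
print(chosen.name)  # lm_child_female
```

Any measure that ranks an exact attribute match highest would serve the same purpose; a set-based measure is used here only because the attributes in the example are categorical.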

テキスト変換処理106では、音声認識手段108からの出力結果を、第二の話者の特徴を利用して辞書・モデル選択手段112により辞書・モデル選択手段112により辞書・モデルデータベース107に格納されているモデル群の中からテキスト変換処理用の辞書・モデルを選択してテキスト変換処理を行う。 In the text conversion means 106, the output of the speech recognition means 108 is converted as follows: using the second speaker's features, the dictionary / model selection means 112 selects a dictionary / model for text conversion from the models stored in the dictionary / model database 107, and text conversion is performed with it.

例えば、「林檎」の言い換え用辞書に、子供向けの言い換え用生成ブロックは読み情報の「りんご」と「名詞」である品詞情報と「果物」である意味分類情報と「第二話者の話者特徴（子供）」である第二話者特徴情報から構成されて、成人や外国人向けの言い換え用生成ブロックは読み情報の「アップル」と「名詞」である品詞情報と「果物」である意味分類情報と「第二話者の話者特徴（成人、外国人）」である第二話者特徴情報から構成すれば、テキスト変換処理手段106における言い換え処理は、第二話者の特徴を参照することにより、「アップルを食べてね」の入力に対し、第二話者側は子供の時、「りんごを食べてね」、第二話者が成人の時「アップルをたべてね」と言い換えを行うことができる。 For example, suppose the paraphrase dictionary entry for 「林檎」 (apple) has a generation block for children, consisting of the reading 「りんご」, the part of speech "noun", the semantic class "fruit", and the second-speaker feature information "second speaker's features (child)", and a generation block for adults and foreigners, consisting of the reading 「アップル」, the part of speech "noun", the semantic class "fruit", and the second-speaker feature information "second speaker's features (adult, foreigner)". Then, by referring to the second speaker's features, the paraphrase processing in the text conversion means 106 can rewrite the input 「アップルを食べてね」 as 「りんごを食べてね」 when the second speaker is a child, and as 「アップルをたべてね」 when the second speaker is an adult.
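
The paraphrase dictionary entry above can be modeled as a list of generation blocks. The following is a minimal sketch under stated assumptions: the class name `GenerationBlock`, the overlap-count scoring, and the plain string replacement are illustrative choices, not the processing defined in this description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationBlock:
    reading: str                  # surface form to generate
    pos: str                      # part of speech
    semantic_class: str           # semantic classification
    listener_features: frozenset  # second-speaker features the block targets

# Entry for 「林檎」 with the two generation blocks from the example.
PARAPHRASE_DICT = {
    "林檎": [
        GenerationBlock("りんご", "noun", "fruit", frozenset({"child"})),
        GenerationBlock("アップル", "noun", "fruit", frozenset({"adult", "foreigner"})),
    ],
}

def paraphrase(text, listener_features):
    """Rewrite each dictionary word into the form suited to the listener."""
    for blocks in PARAPHRASE_DICT.values():
        # Pick the block sharing the most features with the listener;
        # ties fall back to the first block in the entry.
        best = max(blocks, key=lambda b: len(b.listener_features & listener_features))
        for block in blocks:
            if block is not best:
                text = text.replace(block.reading, best.reading)
    return text

print(paraphrase("アップルを食べてね", frozenset({"child"})))
print(paraphrase("アップルを食べてね", frozenset({"adult"})))
```

For a child listener the word is rewritten to 「りんご」; for an adult listener the input already uses the adult-oriented form and is returned unchanged.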

また、音声合成手段110では、テキスト変換手段106からの変換結果に対し、第二の話者特徴抽出手段105から獲得された第二の話者の特徴を用いて、辞書・モデル選択手段112により辞書・モデルデータベース107に格納されているモデル群の中から音声合成用の辞書・モデルを選択する。 In addition, for the conversion result from the text conversion means 106, the speech synthesis means 110 uses the second speaker's features obtained from the second speaker feature extraction means 105, and the dictionary / model selection means 112 selects a dictionary / model for speech synthesis from the models stored in the dictionary / model database 107.

実施例６を、図面を参照して説明する。かかる実施例は本発明を実施するための第１の実施の形態に対応するものである。 Example 6 will be described with reference to the drawings. This example corresponds to the first embodiment for carrying out the present invention.

図9を参照すると、本実施例は実施例５のテキスト変換手段106の代わりに、機械翻訳手段109を備えている。その他の点は実施例５を同じである。 Referring to FIG. 9, this embodiment includes machine translation means 109 instead of the text conversion means 106 of the fifth embodiment. The other points are the same as in the fifth embodiment.

本実施例にける音声入力手段103は、第一の話者（第一言語の話者）の音声を取り込んで音声入力解析処理を行うものである。入力音声を解析するための処理は前記第五の実施例の音声入力手段と同様な処理を行ってもよい。 The voice input means 103 in the present embodiment performs voice input analysis processing by taking in the voice of the first speaker (speaker in the first language). The process for analyzing the input voice may be the same as the voice input means of the fifth embodiment.

音声認識手段108は、入力された第一言語の話者の音声を認識するものである。音声認識手法は前記第五の実施例の音声認識処理と同様に処理してもよい。 The voice recognition means 108 recognizes the input voice of the speaker in the first language. The voice recognition method may be processed in the same manner as the voice recognition process of the fifth embodiment.

音声認識手段108における音声認識処理の辞書・モデルの選択は、第二の話者特徴抽出手段105により抽出された第二言語の話者の特徴を用いて、辞書・モデル選択手段112により辞書・モデルデータベース107に格納されているモデル群の中から音声認識用の辞書・モデルを選択する。結果として、第二言語の話者の特徴に適した音声認識を行うことができる。 The dictionary / model selection of the speech recognition processing in the speech recognition means 108 is performed by the dictionary / model selection means 112 using the feature of the speaker in the second language extracted by the second speaker feature extraction means 105. A dictionary / model for speech recognition is selected from the model group stored in the model database 107. As a result, speech recognition suitable for the characteristics of the speaker of the second language can be performed.

機械翻訳手段109では、音声認識手段108からの出力結果を、第二言語の話者の特徴を利用して辞書・モデル選択手段112により辞書・モデルデータベース107に格納されているモデル群の中から翻訳用単語辞書、規則辞書などの辞書・モデルを選択して翻訳処理を行う。 The machine translation means 109 translates the output of the speech recognition means 108: using the features of the second-language speaker, the dictionary / model selection means 112 selects dictionaries / models such as a translation word dictionary and a rule dictionary from the models stored in the dictionary / model database 107, and translation is performed with them.

例えば、英日翻訳の場合、「Please eat the apple!」の音声認識結果を翻訳処理を行うようとする場合、仮に、英日翻訳用単語辞書の単語「apple」は、原言語側の情報を示すブロックと二つの生成ブロックからなる。具体的に、日本語生成側は「りんご」の読み情報と「名詞」の品詞情報と「果物」である意味分類情報と第二言語の話者の属性情報である「第二話者の話者特徴（子供）」で構成する子供向けの一番目の日本語生成ブロックと、「アップル」の読み情報と「名詞」の品詞情報と「果物」である意味分類情報と第二言語の話者の属性情報である「第二話者の話者特徴（成人）」で構成する成人向けの二番目の日本語生成ブロックからなる。同様に、仮に、動詞「eat」の辞書が「eat」の原言語辞書ブロックと、二つの目的言語生成用ブロック「食べる」と「召し上がる」とのブロックからなる。各ブロックは読み情報と、品詞情報と、意味分類情報、第二の話者属性情報などが付与されている。前記「Please eat the apple!」を翻訳する際に、形態素解析と構文解析と目的言語生成処理等の一連の処理を行う。目的言語生成する際に、第二の話者特性を用いて、各訳語候補が持つ話者属性との類似度を算出することにより、訳語選択を行う。子供に「りんご」を選択して「りんごを食べてね！」、成人に「アップル」を選択して丁寧な文法ルールを用いて「アップルをお召し上がりください！」との翻訳結果を提供することができる。 For example, in English-to-Japanese translation, consider translating the speech recognition result "Please eat the apple!". Suppose the entry for the word "apple" in the English-Japanese translation word dictionary consists of a block holding the source-language information and two generation blocks. Specifically, the Japanese generation side consists of a first Japanese generation block for children, made up of the reading 「りんご」, the part of speech "noun", the semantic class "fruit", and the second-language speaker attribute "second speaker's features (child)", and a second Japanese generation block for adults, made up of the reading 「アップル」, the part of speech "noun", the semantic class "fruit", and the second-language speaker attribute "second speaker's features (adult)". Similarly, suppose the entry for the verb "eat" consists of a source-language dictionary block for "eat" and two target-language generation blocks, 「食べる」 and 「召し上がる」. Each block carries reading information, part-of-speech information, semantic classification information, second-speaker attribute information, and so on. Translating "Please eat the apple!" involves a series of steps such as morphological analysis, syntactic analysis, and target-language generation. During target-language generation, the translation is chosen by using the second speaker's features to compute the similarity with the speaker attributes of each translation candidate. For a child, 「りんご」 is selected and the result is 「りんごを食べてね！」; for an adult, 「アップル」 is selected and, using polite grammar rules, the result is 「アップルをお召し上がりください！」.
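
The word-selection step of this translation example can be sketched as below. Only the selection among generation blocks is shown; morphological analysis, syntactic analysis, and sentence generation are omitted. Attaching "child" to 「食べる」 and "adult" to 「召し上がる」 is an assumption inferred from the two output sentences, and the names `TRANSLATION_DICT` and `select_translation` are hypothetical.

```python
# Each source word maps to generation blocks: (target word, listener features).
TRANSLATION_DICT = {
    "apple": [("りんご", {"child"}), ("アップル", {"adult"})],
    # Feature assignment for the verb blocks is assumed, not stated in the text.
    "eat": [("食べる", {"child"}), ("召し上がる", {"adult"})],
}

def select_translation(source_word, listener_features):
    """Choose the generation block that shares the most features with the listener."""
    blocks = TRANSLATION_DICT[source_word]
    target, _ = max(blocks, key=lambda block: len(block[1] & listener_features))
    return target

def select_words(source_words, listener_features):
    """Select a target word for every content word of the analyzed sentence."""
    return [select_translation(w, listener_features) for w in source_words]

# Content words of "Please eat the apple!" after analysis (assumed input).
print(select_words(["eat", "apple"], {"child"}))
print(select_words(["eat", "apple"], {"adult"}))
```

The selected words would then be passed to target-language generation, where the politeness of the surrounding grammar is chosen in the same listener-dependent way.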

音声合成手段110は前記第五の実施例の音声合成手段110と同様な処理を行ってもよい。 The speech synthesizer 110 may perform the same processing as the speech synthesizer 110 of the fifth embodiment.

音声出力手段111は前記第五の実施例の音声出力手段111と同様な処理を行ってもよい。 The audio output unit 111 may perform the same processing as the audio output unit 111 of the fifth embodiment.

以上のように、第一話者（第一言語の話者）の発話を音声入力手段103へ入力し、第二の話者特徴抽出手段105により獲得された第二の話者の特徴を用いて、辞書・モデル選択手段に112より、辞書・モデルデータベース107から音声認識用の辞書・モデルを選択して音声認識を行う。音声認識手段108からの認識結果に対して、第二の話者の特徴を用いて辞書・モデル選択手段に112より、辞書・モデルデータベース107から機械翻訳用辞書を選択して機械翻訳処理を行う。機械翻訳手段109からの翻訳結果に対して、第二の話者の特徴を用いて辞書・モデル選択手段に112より、辞書・モデルデータベース107から音声合成用辞書を選択して音声合成処理を行う。音声合成処理手段110からの出力は、スピーカなどの音声出力手段を通して出力する。 As described above, the utterance of the first speaker (the first-language speaker) is input to the voice input means 103, and, using the second speaker's features obtained by the second speaker feature extraction means 105, the dictionary / model selection means 112 selects a dictionary / model for speech recognition from the dictionary / model database 107, and speech recognition is performed. For the recognition result from the speech recognition means 108, the dictionary / model selection means 112 uses the second speaker's features to select a machine translation dictionary from the dictionary / model database 107, and machine translation is performed. For the translation result from the machine translation means 109, the dictionary / model selection means 112 uses the second speaker's features to select a speech synthesis dictionary from the dictionary / model database 107, and speech synthesis is performed. The output of the speech synthesis means 110 is output through a voice output means such as a loudspeaker.
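
The flow summarized above, in which the same second-speaker features drive resource selection at every stage, can be sketched as a small planning function. The database contents, stage names, and the overlap-count scoring are all hypothetical; the actual recognition, translation, and synthesis engines are not modeled.

```python
# Hypothetical dictionary / model database: stage -> {resource name: features}.
DATABASE = {
    "speech_recognition": {"asr_for_child": {"child"}, "asr_for_elderly": {"elderly"}},
    "machine_translation": {"mt_for_child": {"child"}, "mt_for_elderly": {"elderly"}},
    "speech_synthesis": {"tts_for_child": {"child"}, "tts_for_elderly": {"elderly"}},
}

# Processing order of the three stages described in this example.
STAGES = ("speech_recognition", "machine_translation", "speech_synthesis")

def select_resource(stage, listener_features):
    """Select the resource for one stage that best matches the listener."""
    candidates = DATABASE[stage]
    return max(candidates, key=lambda name: len(candidates[name] & listener_features))

def plan_pipeline(listener_features):
    """Return (stage, selected resource) pairs in processing order."""
    return [(stage, select_resource(stage, listener_features)) for stage in STAGES]

for stage, resource in plan_pipeline({"elderly", "female"}):
    print(stage, "->", resource)
```

Keeping selection in one function mirrors the single dictionary / model selection means 112 that serves all three processing means.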

実施例７を、図面を参照して説明する。かかる実施例は本発明を実施するための第２の実施の形態に対応するものである。 Example 7 will be described with reference to the drawings. This example corresponds to the second embodiment for carrying out the present invention.

図10を参照すると、本実施例は、テキスト入力手段102と、コミュニケーション処理手段200と、第一の話者の特徴抽出手段104と、第二の話者の特徴抽出手段105と、辞書・モデルデータベース107と、テキスト出力手段115により構成される。 Referring to FIG. 10, the present embodiment includes a text input means 102, a communication processing means 200, a first speaker feature extraction means 104, a second speaker feature extraction means 105, a dictionary / model. It comprises a database 107 and text output means 115.

コミュニケーション処理手段200はテキスト変換手段106と、辞書・モデル選択手段112と、話者関係判定手段113とを備えている。 The communication processing unit 200 includes a text conversion unit 106, a dictionary / model selection unit 112, and a speaker relationship determination unit 113.

また、本実施例において、コミュニケーション処理手段200の各種の処理を行うための各種の辞書・モデルを予め構築して辞書・モデルデータベース107に格納する。例えば、子供男性への発話タイプを、「老人男性から子供男性への発話」タイプと、「老人女性から子供男性への発話」タイプと、「青年男性から子供男性への発話」タイプと、「青年女性から子供男性への発話」タイプと、「若年男性から子供男性への発話」タイプと、「若年女性から子供男性への発話」タイプとの種類に分類することができる。同様に、子供女性への発話タイプを、「老人男性から子供女性への発話」タイプと、「老人女性から子供女性への発話」タイプと、「青年男性から子供女性への発話」タイプと、「青年女性から子供女性への発話」タイプと、「若年男性から子供女性への発話」タイプと、「若年女性から子供女性への発話」タイプとの種類に分類することができる。このようにして分類した発話タイプを示す言語を用いて、「子供への発話」における種々の音声認識用辞書・モデルを構築して辞書・モデルデータベースに格納することができる。同様に、機械翻訳、テキスト変換、音声合成等の辞書・モデルも予め構築して辞書・モデルデータベースに格納する。 In this example, the various dictionaries / models needed for the processing of the communication processing means 200 are built in advance and stored in the dictionary / model database 107. For example, utterances addressed to a male child can be classified into the types "elderly male to male child", "elderly female to male child", "young adult (青年) male to male child", "young adult female to male child", "juvenile (若年) male to male child", and "juvenile female to male child". Likewise, utterances addressed to a female child can be classified into the types "elderly male to female child", "elderly female to female child", "young adult male to female child", "young adult female to female child", "juvenile male to female child", and "juvenile female to female child". Using language that represents the utterance types classified in this way, various speech recognition dictionaries / models for "utterances to children" can be built and stored in the dictionary / model database. Dictionaries / models for machine translation, text conversion, speech synthesis, and so on are likewise built in advance and stored in the dictionary / model database.
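
The utterance types listed above are the combinations of speaker age group, speaker gender, and listener. As a small sketch, they can be enumerated as follows; the English labels and the arrow notation are assumptions chosen for illustration.

```python
from itertools import product

SPEAKER_AGES = ("elderly", "young adult", "juvenile")  # 老人, 青年, 若年
GENDERS = ("male", "female")
LISTENERS = ("child male", "child female")             # 子供男性, 子供女性

def utterance_types():
    """Enumerate every 'speaker -> listener' utterance type label."""
    return [
        f"{age} {gender} -> {listener}"
        for listener, age, gender in product(LISTENERS, SPEAKER_AGES, GENDERS)
    ]

types = utterance_types()
print(len(types))  # 12
print(types[0])    # elderly male -> child male
```

Each label could serve as the key under which a dictionary / model built from that type of utterance is stored.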

テキスト入力手段101と、テキスト出力手段115と、辞書・モデルデータベース107と、第二の話者特徴抽出手段105においては、それぞれ、前記第一の実施例との対応する部分が同様な処理で構成する。 The text input means 101, the text output means 115, the dictionary / model database 107, and the second speaker feature extraction means 105 are each configured with the same processing as the corresponding parts of the first example.

第一の話者の特徴抽出手段104は、前記第一の実施例の第二の話者特徴抽出手段105と同様な構成で同じ処理を行ってもよい。 The first speaker feature extraction unit 104 may perform the same processing with the same configuration as the second speaker feature extraction unit 105 of the first embodiment.

話者関係判定手段113では、第一の話者の特徴抽出手段104により抽出された第一の話者特徴と、第二の話者の特徴抽出手段105により抽出された第二の話者特徴と比較して話者の関係を判定する。例えば、第一の話者の個人属性が「青年、男性」で、第二の話者の個人属性が「老人、女性」である際に、比較結果は、第一言語側の話者の属性は「第一の話者の特徴（年下、青年男性）」で、第二の話者の属性は「第二の話者の特徴（年上、老年女性）」で得られる。 The speaker relationship determination means 113 compares the first speaker's features extracted by the first speaker feature extraction means 104 with the second speaker's features extracted by the second speaker feature extraction means 105 and determines the relationship between the speakers. For example, when the first speaker's personal attributes are "young adult, male" and the second speaker's are "elderly, female", the comparison yields "first speaker's features (younger, young adult male)" for the speaker on the first-language side and "second speaker's features (older, elderly female)" for the second speaker.
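
The age comparison performed by the speaker relationship determination means can be sketched as below. The ordering of age groups in `AGE_RANK`, the dictionary-based feature representation, and the "same" outcome for equal age groups are assumptions; the description only gives the younger / older outcome for the example pair.

```python
# Assumed ordering of age groups from youngest to oldest
# (子供 < 若年 < 青年 < 老人).
AGE_RANK = {"child": 0, "juvenile": 1, "young adult": 2, "elderly": 3}

def judge_relation(first, second):
    """Annotate both speakers with their age relative to the other speaker."""
    r1, r2 = AGE_RANK[first["age"]], AGE_RANK[second["age"]]
    if r1 < r2:
        rel1, rel2 = "younger", "older"
    elif r1 > r2:
        rel1, rel2 = "older", "younger"
    else:
        rel1 = rel2 = "same"  # case not covered in the description
    return (
        {**first, "relative_age": rel1},
        {**second, "relative_age": rel2},
    )

first, second = judge_relation(
    {"age": "young adult", "gender": "male"},
    {"age": "elderly", "gender": "female"},
)
print(first["relative_age"], second["relative_age"])  # younger older
```

The annotated feature sets are what the dictionary / model selection means would then compare against the attributes stored with each dictionary / model.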

辞書・モデル選択手段112では、前記話者関係判定手段113からの出力に基づいて、両話者の話者関係を用いて辞書・モデルデータベース107の中からテキスト変換用辞書・モデルを選択する。 The dictionary / model selection unit 112 selects a dictionary / model for text conversion from the dictionary / model database 107 using the speaker relationship between the two speakers based on the output from the speaker relationship determination unit 113.

前述のように、本実施例では、両話者の特徴を同時に抽出して、話者関係判定手段113を通して両話者の関係を示す結果により、辞書・モデル選択手段112により、テキスト変換用辞書・モデルを選択し、両話者の特徴を共に考慮したテキスト変換処理を行う。 As described above, in this example the features of both speakers are extracted at the same time, and, based on the result from the speaker relationship determination means 113 indicating the relationship between the two speakers, the dictionary / model selection means 112 selects a dictionary / model for text conversion, so that text conversion takes the features of both speakers into account.

実施例８を、図面を参照して説明する。かかる実施例は本発明を実施するための第２の実施の形態に対応するものである。 Example 8 will be described with reference to the drawings. This example corresponds to the second embodiment for carrying out the present invention.

図11を参照すると、本発明の実施例８に係る音声合成手段110は、図10に示された第七の実施例におけるテキスト出力手段115の代わりに、テキスト変換手段106から出力されたテキストを合成音声を行う音声合成手段110を備え、また、合成音声を出力する音声出力手段111を備えている。その他の点は実施例７を同じである。 Referring to FIG. 11, Example 8 of the present invention replaces the text output means 115 of Example 7 shown in FIG. 10 with a speech synthesis means 110 that synthesizes speech from the text output by the text conversion means 106, together with a voice output means 111 that outputs the synthesized speech. In all other respects it is the same as Example 7.

音声合成手段110では、テキスト変換手段106からの出力に対して、話者関係判定手段113から出力された両話者の関係を示す判定結果を用いて、辞書・モデル選択手段112により、辞書・モデルデータベース107の中から、両話者の話者関係を示す音声合成用辞書・モデルを選択し、音声合成処理を行うためのパラメータをコントロールする。 The speech synthesizer 110 uses the determination result indicating the relationship between the two speakers output from the speaker relationship determination unit 113 with respect to the output from the text conversion unit 106, and the dictionary / model selection unit 112 performs the dictionary / A speech synthesis dictionary / model indicating the speaker relationship between the two speakers is selected from the model database 107, and parameters for speech synthesis processing are controlled.

実施例９を、図面を参照して説明する。かかる実施例は本発明を実施するための第２の実施の形態に対応するものである。 Example 9 will be described with reference to the drawings. This example corresponds to the second embodiment for carrying out the present invention.

図12を参照すると、本実施例は実施例７のテキスト変換手段106の代わりに、機械翻訳手段109に変えたものを備えている。その他の点は実施例７を同じである。 Referring to FIG. 12, this embodiment includes a machine translation means 109 instead of the text conversion means 106 of the seventh embodiment. The other points are the same as in the seventh embodiment.

機械翻訳手段109では、前記テキスト入力手段102から入力テキストを第二言語に翻訳する手段である。機械翻訳に必要な各種の辞書の選択は、前記第一の話者特徴抽出手段と前記第二の話者の特徴抽出手段により抽出された話者の特徴を前記話者関係判定手段113に出力し、話者関係の判定を行う。話者関係判定手段113からの出力結果を用いて、辞書・モデルデータベース107から機械翻訳用辞書を選択して機械翻訳処理を行う。 The machine translation means 109 is means for translating the input text from the text input means 102 into a second language. Selection of various dictionaries necessary for machine translation is performed by outputting the speaker features extracted by the first speaker feature extraction unit and the second speaker feature extraction unit to the speaker relationship determination unit 113. Then, the speaker relationship is determined. Using the output result from the speaker relation determination means 113, a machine translation dictionary is selected from the dictionary / model database 107 and machine translation processing is performed.

例えば、「会議通訳システム」に本実施例を導入すれば、会議参加者のＰＣへのメモ書きを翻訳する場合、一人の発話者の発話をそれぞれの聞き手の属性に合わせた翻訳結果を提供することができるようになる。 For example, if this example is introduced into a "conference interpretation system", then when translating notes written on conference participants' PCs, a single speaker's utterance can be translated to suit the attributes of each individual listener.

ここでは、中日通訳が行える会議通訳システムを想定して説明する。仮に青年男性の発話者Ａ（一郎）、老人女性の話者Ｂ、若年男性の話者Ｃの三人がいると想定する．また、仮に、辞書データベース１０７に格納されている翻訳辞書は、「晩生」、「叫」と「一郎」の内容を含む。 Here, a conference interpretation system capable of Chinese-Japanese interpretation is assumed. Suppose there are three people: speaker A (一郎, Ichiro), a young adult male; speaker B, an elderly female; and speaker C, a juvenile male. Suppose also that the translation dictionary stored in the dictionary database 107 contains entries for 「晩生」, 「叫」, and 「一郎」.

まず、中国語「晩生」を例として説明すると、「晩生」は人称代名詞で、一般的に聞き手より年下の人に、男性でも女性でも使用できる。そのため、「晩生」の原言語側のブロックに「原言語話者特徴（年下）」との属性情報を付与する。「晩生」と対応する日本語の訳語候補は、「わたくし」、「わたし」、「僕」等となるため、構築できる機械翻訳用辞書は原言語側では、「晩生」と原言語話者特徴（年下）から構成される原言語ブロックで、生成側では、「わたくし」と、品詞情報等と、目的言語話者特徴（年上、女性）等の情報から構成する一番目の目的言語生成ブロックと、「わたし」と、品詞情報等と、目的言語話者特徴（年上、男性）等の情報から構成する二番目の目的言語生成ブロックと、「僕」と、品詞情報等と、目的言語話者特徴（年下、男性）等の情報から構成する三番目の目的言語生成ブロックで辞書を構築する。 First, take the Chinese word 「晩生」 as an example. 「晩生」 is a personal pronoun that is generally used by a person younger than the listener, whether male or female. The source-language block of 「晩生」 is therefore given the attribute information "source-language speaker features (younger)". The Japanese translation candidates for 「晩生」 are 「わたくし」, 「わたし」, 「僕」, and so on. The machine translation dictionary entry is thus built, on the source-language side, from a source-language block consisting of 「晩生」 and the source-language speaker features (younger), and, on the generation side, from a first target-language generation block consisting of 「わたくし」, part-of-speech information and the like, and the target-language speaker features (older, female); a second target-language generation block consisting of 「わたし」, part-of-speech information and the like, and the target-language speaker features (older, male); and a third target-language generation block consisting of 「僕」, part-of-speech information and the like, and the target-language speaker features (younger, male).

次に、中国語動詞「叫」の使用できる人は年齢と性別との関係がなしに対して、日本語側では丁寧度の異なるもの、独立助動詞「だ」と五段動詞「申す」と対応することができる。そのため、「叫」の翻訳辞書は、原言語側では原言語話者特徴（Φ）の属性を付与する。意味は、すべてのユーザに使用できることを示す。一方、生成側では、独立助動詞「だ」の見出し情報と、固有部情報と、品詞情報と、目的言語話者特徴（年下、男性）等の情報から構成される一番目の生成ブロックと、五段動詞「申す」の見出し情報と、固有部情報と、品詞情報と、目的言語話者特徴（年上、女性）等の情報から構成される二番目の生成ブロックで構成されている。また、名詞「一郎」は、一般的な翻訳辞書で、年齢や性別の情報を付与しなくてもよい。 Next, the Chinese verb 「叫」 can be used regardless of the speaker's age or gender, whereas on the Japanese side it can correspond to words of different politeness levels: the independent auxiliary verb 「だ」 and the godan verb 「申す」. The translation dictionary entry for 「叫」 therefore gives the source-language side the attribute source-language speaker features (Φ), which means that the word can be used by all users. The generation side consists of a first generation block made up of the headword information for the independent auxiliary verb 「だ」, unique-part information, part-of-speech information, and the target-language speaker features (younger, male), and a second generation block made up of the headword information for the godan verb 「申す」, unique-part information, part-of-speech information, and the target-language speaker features (older, female). The noun 「一郎」 is an ordinary translation dictionary entry and needs no age or gender information.

次に、話者Ａからの自己紹介の例文「晩生叫一郎．」を翻訳例として説明する。 Next, the self-introduction sentence 「晩生叫一郎．」 spoken by speaker A is used as a translation example.
この文を日本語に翻訳する時、まず、中国語の形態素解析と構文解析を行い、その結果、形態素要素代名詞「晩生」、動詞「叫」、名詞「一郎」とをから構成された構文情報が得られる。一方、日本語側では、中日翻訳辞書から、代名詞「晩生」から「わたくし」、「わたし」と「僕」を、動詞「叫」から「だ」と「申す」を、名詞「一郎」から「一郎」を読み込む。 To translate this sentence into Japanese, Chinese morphological analysis and syntactic analysis are performed first, yielding syntactic information composed of the morphemes 「晩生」 (pronoun), 「叫」 (verb), and 「一郎」 (noun). On the Japanese side, the Chinese-Japanese translation dictionary supplies 「わたくし」, 「わたし」, and 「僕」 for the pronoun 「晩生」, 「だ」 and 「申す」 for the verb 「叫」, and 「一郎」 for the noun 「一郎」.

次に、話者Ａの発話を話者Ｂに翻訳するとき、まず、前記の第一の話者特徴抽出手段104と第二の話者特徴抽出手段105により、両話者の特徴を抽出する。次に話者Ａの特徴「青年、男性」と話者Ｂの特徴「老人、女性」の情報を用いて、話者関係判定手段により両話者の年齢を比較して、原言語話者特徴（年下）と目的言語話者特徴（年上）との結果が得られる。さらに、第一の話者の話者特徴量「青年、男性、原言語話者特徴（年下）」と第二の話者の話者特徴量「老人、女性、目的言語話者特徴（年上）」とを用いて、翻訳辞書が持つ属性情報との類似度を算出して、類似度が最大のものを選択する。その結果、「晩生」の三つの訳語候補の中から、「わたくし」の訳語を選択することになる。同様に「叫」の訳語候補の中から「申す」を選択することになる。 Next, when speaker A's utterance is translated for speaker B, the first speaker feature extraction means 104 and the second speaker feature extraction means 105 first extract the features of both speakers. Then, using speaker A's features "young adult, male" and speaker B's features "elderly, female", the speaker relationship determination means compares the ages of the two speakers and obtains the source-language speaker features (younger) and the target-language speaker features (older). Further, using the first speaker's feature set "young adult, male, source-language speaker features (younger)" and the second speaker's feature set "elderly, female, target-language speaker features (older)", the similarity with the attribute information held in the translation dictionary is computed, and the candidate with the highest similarity is selected. As a result, 「わたくし」 is selected from the three translation candidates for 「晩生」. Likewise, 「申す」 is selected from the translation candidates for 「叫」.

そのため、日本語生成側の形態素は「わたくし」、「申す」、「一郎」からなる。次に、五段動詞「申す」の格フレームと日本語構文生成規則を用いて目的言語を生成し、「わたくしは一郎と申します」との翻訳結果を話者Ｂに提示することができる。 The morphemes on the Japanese generation side are therefore 「わたくし」, 「申す」, and 「一郎」. The target-language sentence is then generated using the case frame of the godan verb 「申す」 and the Japanese syntax generation rules, and the translation 「わたくしは一郎と申します」 can be presented to speaker B.

同様に、話者Ａの発話を話者Ｃに翻訳する時、「僕は一郎です」との翻訳結果を話者Ｃに提示することができる。 Similarly, when speaker A's utterance is translated for speaker C, the translation 「僕は一郎です」 can be presented to speaker C.
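
The translation-word selection in the conference example can be sketched as follows, using the generation blocks given for 「晩生」 and 「叫」. The overlap-count similarity, the English feature labels, and the function names are assumptions; the source-side attribute check and the final sentence generation (case frames, politeness endings) are not modeled.

```python
# Generation blocks from the example: (target word, target-language speaker features).
ZH_JA_DICT = {
    "晩生": [
        ("わたくし", {"older", "female"}),
        ("わたし", {"older", "male"}),
        ("僕", {"younger", "male"}),
    ],
    "叫": [
        ("だ", {"younger", "male"}),
        ("申す", {"older", "female"}),
    ],
    "一郎": [("一郎", set())],  # proper noun: no age or gender information
}

def select_target_word(source_word, listener_features):
    """Pick the candidate whose features overlap most with the listener's."""
    candidates = ZH_JA_DICT[source_word]
    word, _ = max(candidates, key=lambda c: len(c[1] & listener_features))
    return word

def select_target_words(morphemes, listener_features):
    """Select a target word for each source morpheme."""
    return [select_target_word(m, listener_features) for m in morphemes]

morphemes = ["晩生", "叫", "一郎"]
# Listener B: elderly female, older than speaker A.
print(select_target_words(morphemes, {"older", "female"}))
# Listener C: juvenile male, younger than speaker A.
print(select_target_words(morphemes, {"younger", "male"}))
```

For listener B the selected words match the morphemes 「わたくし」, 「申す」, 「一郎」 stated in the example; for listener C the sketch selects 「僕」 and 「だ」, from which sentence generation would produce the final wording.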

実施例１０を、図面を参照して説明する。かかる実施例は本発明を実施するための第２の実施の形態に対応するものである。 Example 10 will be described with reference to the drawings. This example corresponds to the second embodiment for carrying out the present invention.

図13を参照すると、本発明の第九の実施例に係る音声合成手段110は、図12に示された第八の実施例におけるテキスト出力手段115の代わりに、機械翻訳手段109から出力された翻訳結果を合成音声を行う音声合成手段110を備え、また、合成音声を出力する音声出力手段111を備えている。その他の点は実施例８を同じである。 Referring to FIG. 13, the speech synthesis means 110 according to the ninth embodiment of the present invention is output from the machine translation means 109 instead of the text output means 115 in the eighth embodiment shown in FIG. The speech synthesis means 110 for synthesizing the translation result and the speech output means 111 for outputting the synthesized speech are provided. The other points are the same as in the eighth embodiment.

図13において、第一の話者特徴抽出手段104により第一の話者特徴を抽出する。第二の話者特徴抽出手段105により第二の話者の特徴を抽出する。抽出された第一、第二の話者の話者特徴を用いて話者関係判定手段113により、両話者の関係を判定する。判定された話者関係の結果を用いて、辞書・モデル選択手段112により辞書・モデルデータベース107から機械翻訳用辞書を選択して機械翻訳手段109に利用されて翻訳処理を行う。次に、機械翻訳の結果に対して、判定された話者関係の結果を用いて、辞書・モデル選択手段112により辞書・モデルデータベース107から音声合成用辞書・モデルを選択して音声合成を行う。音声合成手段110からの合成音声を音声出力手段111を通じて出力する。 In FIG. 13, the first speaker feature extraction means 104 extracts the first speaker's features, and the second speaker feature extraction means 105 extracts the second speaker's features. Using the extracted features of the first and second speakers, the speaker relationship determination means 113 determines the relationship between the two speakers. Using the determined speaker relationship, the dictionary / model selection means 112 selects a machine translation dictionary from the dictionary / model database 107, and the machine translation means 109 uses it to perform translation. Next, for the machine translation result, the dictionary / model selection means 112 again uses the determined speaker relationship to select a dictionary / model for speech synthesis from the dictionary / model database 107, and speech synthesis is performed. The synthesized speech from the speech synthesis means 110 is output through the voice output means 111.

実施例１１を、図面を参照して説明する。かかる実施例は本発明を実施するための第２の実施の形態に対応するものである。 Example 11 will be described with reference to the drawings. This example corresponds to the second embodiment for carrying out the present invention.

図14を参照すると、本実施例は実施例８のテキスト入力手段102の代わりに、第一話者の音声を入力する音声入力手段103を備え、入力音声を認識処理を行う音声認識手段108を備えている。その他の点は実施例８を同じである。 Referring to FIG. 14, this example replaces the text input means 102 of Example 8 with a voice input means 103 that takes in the first speaker's voice, and adds a speech recognition means 108 that recognizes the input speech. In all other respects it is the same as Example 8.

音声入力手段103は、第一の話者の音声を取り込んで音声入力解析処理を行うものである。入力音声を解析するための処理は前記第五の実施例の音声入力手段と同様な処理を行ってもよい。 The voice input means 103 takes in the voice of the first speaker and performs voice input analysis processing. The process for analyzing the input voice may be the same as the voice input means of the fifth embodiment.

音声認識手段108における音声認識処理の辞書・モデルの選択は、第一の話者特徴抽出手段104により抽出された第一の話者の特徴と、第二の話者特徴抽出手段105により抽出された第二の話者の特徴を用いて、前記話者関係判定手段113により話者関係を判定して、判定結果を利用して前記辞書・モデル選択手段112により辞書・モデルデータベース107に格納されているモデル群の中から音声認識用の辞書・モデルを選択する。 The selection of the dictionary / model of the speech recognition processing in the speech recognition means 108 is extracted by the features of the first speaker extracted by the first speaker feature extraction means 104 and the second speaker feature extraction means 105. Using the characteristics of the second speaker, the speaker relationship determination unit 113 determines the speaker relationship, and using the determination result, the dictionary / model selection unit 112 stores it in the dictionary / model database 107. Select a dictionary / model for speech recognition from a group of models.

The text conversion means 106 performs text conversion on the output of the speech recognition means 108. For this conversion, the dictionary/model selection means 112 uses the speaker relationship obtained by the speaker relationship determination means 113 to select a paraphrase word dictionary, a conversion rule dictionary, and the like from the models stored in the dictionary/model database 107.

The speech synthesis means 110 and the voice output means 111 can perform the same processing as the speech synthesis means 110 and the voice output means 111 of Example 5.

Example 12 will be described with reference to the drawings. This example corresponds to the second embodiment of the present invention.

Referring to FIG. 12, this example replaces the text conversion means 106 of Example 11 with machine translation means 109. In all other respects it is the same as Example 11.

The voice input means 103 and the speech recognition means 108 can perform processing with the same configuration as the voice input means 103 and the speech recognition means 108 of Example 11.

For the machine translation means 109, the speaker relationship determination means 113 determines the speaker relationship, and based on that result the dictionary/model selection means 112 selects a word dictionary, a translation rule dictionary, a translation model, and the like for machine translation from the models stored in the dictionary/model database 107. Machine translation is then performed with the selected resources.

For example, in Chinese-to-Japanese translation, consider translating the speech recognition result of "sentence 2 in FIG. 33". On the generation side of the entry for the word "word 3 in FIG. 33" in the declinable-word dictionary for Chinese-Japanese translation, two target-language generation blocks can be provided: 「〜ください」 for a listener younger than the speaker, and 「お〜ください」 for a listener older than the speaker. Likewise, the dictionary entry for 「吃」 can be given two generation blocks: 「食べる」 (plain "eat", for a younger listener) and 「召し上がる」 (honorific "eat", for an older listener). The machine translation process performs morphological and syntactic analysis on "sentence 2 in FIG. 33", and the rules for generating Japanese are selected using the speaker relationship obtained by the speaker relationship determination means 113. The input sentence can therefore be translated into Japanese as either 「林檎を食べてください」 (plain "Please eat the apple") or 「林檎をお召し上がりください」 (honorific "Please eat the apple"). When the second speaker is younger than the speaker, 「林檎を食べてください」 is generated; when the second speaker is older, 「林檎をお召し上がりください」 is generated.
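
The generation-block idea in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table layout, the relation labels `"younger"` and `"older"`, and the function name `generate_request` are hypothetical, and the verb forms are stored already conjugated (食べて, 召し上がり) so that no morphological generation is needed, whereas the text above lists the dictionary forms 食べる and 召し上がる.

```python
# Hypothetical generation blocks for the source verb, keyed by the listener's
# age relative to the speaker. Forms are pre-conjugated for simplicity.
VERB_BLOCKS = {
    "吃": {"younger": "食べて", "older": "召し上がり"},
}

# Hypothetical request patterns corresponding to 「〜ください」 and 「お〜ください」.
REQUEST_PATTERNS = {
    "younger": "{obj}を{verb}ください",
    "older": "{obj}をお{verb}ください",
}


def generate_request(obj: str, source_verb: str, listener_relation: str) -> str:
    """Pick the verb block and request pattern that match the listener relation."""
    verb = VERB_BLOCKS[source_verb][listener_relation]
    return REQUEST_PATTERNS[listener_relation].format(obj=obj, verb=verb)


print(generate_request("林檎", "吃", "younger"))  # 林檎を食べてください
print(generate_request("林檎", "吃", "older"))    # 林檎をお召し上がりください
```

Both the lexical choice and the sentence pattern are indexed by the same relation label, which mirrors the way the example selects the word-level block and the generation rule from a single determination result.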

Example 13 will be described with reference to the drawings. This example corresponds to the third embodiment of the present invention.

Referring to FIG. 16, this example comprises text input means 102, first speaker feature extraction means 104 for extracting the features of the first speaker, second speaker feature extraction means 105 for extracting the features of the second speaker, communication processing means 300, a dictionary/model database 107, a speaker dialogue history database 310, and text output means 115. The communication processing means 300 in this example has text conversion means 106, dictionary/model selection means 112, and dialogue management means 114.

The text input means 102, text output means 115, first speaker feature extraction means 104, second speaker feature extraction means 105, and dictionary/model database 107 of this example shown in FIG. 16 can be realized by the same construction method as the corresponding parts of Example 12.

The speaker dialogue history database 310 records the speakers' dialogue history in chronological order, together with the features of both speakers.

Next, the communication processing means 300 of this example will be described with reference to the drawings. The speaker relationship determination means 113, dictionary/model selection means 112, and text conversion means 106 of the communication processing means 300 can be realized by the same construction method as the corresponding parts of Example 12. As shown in FIG. 29, the dialogue management means 114 comprises dialogue history analysis means 2901, speaker feature determination means 2902, and feedback processing means 2903.

Based on the features of both speakers and the speaker relationship output from the speaker relationship determination means 113, the dialogue history analysis means 2901 extracts the dialogue history of the two speakers from the speaker dialogue history database 310. From it, the means generates, for example, a vector containing the speaker features, the dialogue style, keywords of the dialogue content, and the like, or a time-series model of such vectors.

The speaker feature determination means 2902 uses the feature vectors, models, and the like of the dialogue history generated by the dialogue history analysis means 2901 to determine whether the speaker features extracted by the first speaker feature extraction means 104 and the second speaker feature extraction means 105, and the speaker relationship output from the speaker relationship determination means 113, are correct.

For example, feature quantities representing the linguistic expressions of Japanese "women's language" (女言葉) and of "men's language" (男言葉) can each be prepared and stored in the speaker dialogue history database 310. The gender of a speaker can then be determined by matching these against the speaker's feature vector obtained from the dialogue history analysis means 2901.
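
As a rough illustration of this matching step, the Python sketch below counts gendered expressions in a speaker's utterances and returns the better-supported label. The marker lists, the counting rule, and the function names are assumptions made for this sketch; the patent does not specify the feature quantities or the matching method.

```python
from typing import List, Optional, Sequence

# Illustrative marker lists (assumptions, not taken from the patent).
FEMININE_MARKERS = ("かしら", "あたし", "わよ")
MASCULINE_MARKERS = ("俺", "だぜ", "だぞ")


def count_markers(text: str, markers: Sequence[str]) -> int:
    """Count how many times any of the marker strings occurs in the text."""
    return sum(text.count(marker) for marker in markers)


def judge_gender_from_history(utterances: List[str]) -> Optional[str]:
    """Return 'female', 'male', or None when the history gives no clear evidence."""
    text = "".join(utterances)
    feminine = count_markers(text, FEMININE_MARKERS)
    masculine = count_markers(text, MASCULINE_MARKERS)
    if feminine > masculine:
        return "female"
    if masculine > feminine:
        return "male"
    return None


history = ["どうしてかしら、あたし、最近よく忘れ物をするわよ。"]
print(judge_gender_from_history(history))  # female
```

Returning `None` for a tie keeps the sketch from overriding the sensor-based estimate when the dialogue history carries no usable evidence.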

When the determination result from the speaker feature determination means 2902 indicates an error, the feedback processing means 2903 performs feedback processing on the speaker features.

For example, suppose that the first and second speakers were both estimated to be male, but the first speaker's input is 「鍵を持ってきてくれてよかった。ありがとうございました。どうしてかしら、あたし、最近よく忘れ物をするわよ。」 ("I'm glad you brought the key. Thank you very much. I wonder why, I keep forgetting things lately."). In that case the speaker feature determination means 2902 determines that the first speaker is female, and feedback processing corrects the gender attribute value of the first speaker.
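
The correction step can be pictured as overwriting only those estimated attributes that the history-based judgement contradicts. The following Python sketch assumes attributes are held in plain dictionaries and that a judgement of `None` means "no evidence"; both the data layout and the name `apply_feedback` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def apply_feedback(
    estimated: Dict[str, str],
    judged: Dict[str, Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Return corrected attributes plus the names of the attributes that changed."""
    corrected = dict(estimated)  # leave the caller's estimate untouched
    changed = []
    for name, value in judged.items():
        # Only overwrite when the history gives a judgement that disagrees.
        if value is not None and corrected.get(name) != value:
            corrected[name] = value
            changed.append(name)
    return corrected, changed


estimated = {"gender": "male", "age_group": "adult"}
judged = {"gender": "female", "age_group": None}
print(apply_feedback(estimated, judged))
# ({'gender': 'female', 'age_group': 'adult'}, ['gender'])
```

Reporting which attributes changed would let a caller decide whether the speaker relationship and the selected dictionaries need to be recomputed.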

Next, Example 14 will be described with reference to the drawings. This example corresponds to the third embodiment of the present invention.

This example is shown in FIG. 17. Referring to FIG. 17, Example 14 replaces the text output means 115 of Example 13 shown in FIG. 16 with speech synthesis means 110, which synthesizes speech from the text output by the text conversion means 106, and voice output means 111, which outputs the synthesized speech. In all other respects it is the same as Example 13.

The speech synthesis means 110 and the voice output means 111 are the same as the corresponding parts of Example 12.

In this way, for text input by the first speaker, the result of text conversion that takes into account the features of the first speaker, the features of the second speaker, and the dialogue history of both speakers can be output as synthesized speech.

Example 15 will be described with reference to the drawings. This example corresponds to the third embodiment of the present invention.

This example is shown in FIG. 18. Referring to FIG. 18, this example replaces the text conversion means 106 of Example 13 with machine translation means 109. In all other respects it is the same as Example 13.

The machine translation means 109 translates the text input from the text input means 102 into a second language. The dictionaries needed for machine translation are selected as follows. The speaker features extracted by the first speaker feature extraction means 104 and the second speaker feature extraction means 105 are output to the speaker relationship determination means 113, which determines the speaker relationship. For that result, the speakers' dialogue history is extracted from the speaker dialogue history database 310, and the dialogue management means 114 determines whether the speaker features are correct. Using the correct speaker features and speaker relationship output from the dialogue management means 114, the dictionary/model selection means 112 selects a machine translation dictionary from the dictionary/model database 107, and machine translation is performed.
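
The order of operations in this paragraph can be summarized as a small pipeline. The Python sketch below only fixes that order; every component is passed in as a function, and the stand-in functions at the bottom (`by_age`, `accept_as_is`, `dictionary_for`, `fake_translate`) are hypothetical placeholders rather than anything described in the patent.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, int]


def run_translation(
    text: str,
    first: Features,
    second: Features,
    history: List[str],
    determine_relation: Callable[[Features, Features], str],
    verify: Callable[[Features, Features, str, List[str]], Tuple[Features, Features, str]],
    select_dictionary: Callable[[str], str],
    translate: Callable[[str, str], str],
) -> str:
    """Relation determination, history-based verification, selection, translation."""
    relation = determine_relation(first, second)                         # means 113
    first, second, relation = verify(first, second, relation, history)  # means 114
    dictionary = select_dictionary(relation)                             # means 112 / database 107
    return translate(text, dictionary)                                   # means 109


# Trivial stand-ins so the sketch runs end to end.
def by_age(first: Features, second: Features) -> str:
    if second["age"] == first["age"]:
        return "peer"
    return "older" if second["age"] > first["age"] else "younger"


def accept_as_is(first, second, relation, history):
    return first, second, relation


def dictionary_for(relation: str) -> str:
    return f"mt_dictionary_{relation}"


def fake_translate(text: str, dictionary: str) -> str:
    return f"{text} (translated with {dictionary})"


result = run_translation(
    "source text", {"age": 30}, {"age": 62}, [],
    by_age, accept_as_is, dictionary_for, fake_translate,
)
print(result)  # source text (translated with mt_dictionary_older)
```

Placing verification between relation determination and dictionary selection means that a corrected attribute changes which dictionary is chosen before any translation is attempted.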

In this way, for text input by the first speaker, the system can provide a machine translation result that takes into account the features of the first speaker, the features of the second speaker, and the dialogue history of both speakers.

Example 16 will be described with reference to the drawings. This example corresponds to the third embodiment of the present invention.

This example is shown in FIG. 19. Referring to FIG. 19, this example replaces the text output means 115 of Example 15 shown in FIG. 18 with speech synthesis means 110, which synthesizes speech from the translation result output by the machine translation means 109, and voice output means 111, which outputs the synthesized speech. In all other respects it is the same as Example 15.

The speech synthesis means 110 and the voice output means 111 are the same as the corresponding parts of Example 14.

In this way, for text input by the first speaker, a machine translation result that takes into account the features of the first speaker, the features of the second speaker, and the dialogue history of both speakers can be output as synthesized speech.

Example 17 will be described with reference to the drawings. This example corresponds to the third embodiment of the present invention.

This example is shown in FIG. 20. Referring to FIG. 20, this example replaces the text input means 102 of Example 14 shown in FIG. 17 with voice input means 103, which inputs the voice of the first speaker, and adds speech recognition means 108, which performs recognition on the input voice. In all other respects it is the same as Example 14.

In this way, for voice input from the first speaker, the result of text conversion that takes into account the features of the first speaker, the features of the second speaker, and the dialogue history of both speakers can be output as synthesized speech.

Example 18 will be described with reference to the drawings. This example corresponds to the third embodiment of the present invention.

This example is shown in FIG. 21. Referring to FIG. 21, this example replaces the text conversion means 106 of Example 17 with machine translation means 109. In all other respects it is the same as Example 17.

In this way, for voice input from the first speaker, a machine translation result that takes into account the features of the first speaker, the features of the second speaker, and the dialogue history of both speakers can be output as synthesized speech.

The embodiments and examples of the present invention are not limited to a single computer device or system; the system can also be configured from a plurality of terminals, computers, and the like.

For example, corresponding to the second embodiment, the input means 101 and the first speaker feature extraction means 104 can be placed on a first terminal, the output means 111 and the second speaker feature extraction means 105 on a second terminal, and the communication processing means 200 on a server computer, so that the terminals and the server computer execute the processing while communicating with one another over a network. Such an implementation is straightforward.

The communication system of the present invention can be provided as: a text and voice output method including the functions for realizing each module; a voice communication program for causing an electronic device or the like to execute each of its procedures; an electronic-device-readable recording medium on which such programs are recorded; a program product, including these programs, that can be loaded into the internal memory of an electronic device; a computer such as a mobile terminal or server that includes the program; a voice interpretation device; and the like.

As described above, the present invention can assist communication between speakers of the same language, or between speakers of different language systems, in a two-party communication system such as a voice dialogue system, a video conference system, a videophone automatic interpretation system, or a voice interpretation system. In addition, by using the personal attributes of the speakers for speech recognition, paraphrasing, machine translation, and speech synthesis, the present invention can provide users with high speech recognition accuracy, high machine translation accuracy, rich speech synthesis functions, and diverse language expression functions. In particular, by constructing a system suited to the need, the present invention can realize smooth communication between speakers, whether they speak the same language or different languages.

Block diagram showing the schematic configuration of the information processing system in the first embodiment of the present invention.
Block diagram showing the schematic configuration of the information processing system in the second embodiment of the present invention.
Block diagram showing the schematic configuration of the information processing system in the third embodiment of the present invention.
Block diagram showing the schematic configuration of Example 1 in the first embodiment of the present invention.
Block diagram showing the schematic configuration of Example 2 in the first embodiment of the present invention.
Block diagram showing the schematic configuration of Example 3 in the first embodiment of the present invention.
Block diagram showing the schematic configuration of Example 4 in the first embodiment of the present invention.
Block diagram showing the schematic configuration of Example 5 in the first embodiment of the present invention.
Block diagram showing the schematic configuration of Example 6 in the first embodiment of the present invention.
Block diagram showing the schematic configuration of Example 1 in the second embodiment of the present invention.
Block diagram showing the schematic configuration of Example 2 in the second embodiment of the present invention.
Block diagram showing the schematic configuration of Example 3 in the second embodiment of the present invention.
Block diagram showing the schematic configuration of Example 4 in the second embodiment of the present invention.
Block diagram showing the schematic configuration of Example 5 in the second embodiment of the present invention.
Block diagram showing the schematic configuration of Example 6 in the second embodiment of the present invention.
Block diagram showing the schematic configuration of Example 1 in the third embodiment of the present invention.
Block diagram showing the schematic configuration of Example 2 in the third embodiment of the present invention.
Block diagram showing the schematic configuration of Example 3 in the third embodiment of the present invention.
Block diagram showing the schematic configuration of Example 4 in the third embodiment of the present invention.
Block diagram showing the schematic configuration of Example 5 in the third embodiment of the present invention.
Block diagram showing the schematic configuration of Example 6 in the third embodiment of the present invention.
Flowchart showing the algorithm of the first embodiment of the present invention.
Flowchart showing the algorithm of the second embodiment of the present invention.
Flowchart showing the algorithm of the third embodiment of the present invention.
Example of a Japanese-English machine translation dictionary structure in the second embodiment.
Diagram showing translation results that can be generated from the dictionary example of FIG. 25.
Diagram showing identification information of speech recognition models.
Example showing Japanese style information of the machine translation dictionaries stored in the dictionary/model database.
Schematic configuration diagram of a conventional speech recognition device.
Schematic configuration diagram of a conventional device that reads mail text aloud and outputs synthesized speech.
Schematic configuration diagram of a conventional voice response device.
Schematic configuration diagram of a conventional voice response device.
Diagram for explaining the present invention.

Explanation of symbols

100 Communication processing means
101 Input means
102 Text input means
103 Voice input means
104 First speaker feature extraction means
105 Second speaker feature extraction means
106 Text conversion means
107 Dictionary/model database
108 Speech recognition means
109 Machine translation means
110 Speech synthesis means
111 Voice output means
112 Dictionary/model selection means
113 Speaker relationship determination means
114 Dialogue history management means
115 Text output means
116 Output means
200 Communication processing means
300 Communication processing means
310 Speaker dialogue history database

Claims

An information processing system for processing communication from a first speaker to a second speaker,
Second speaker feature extraction means for extracting features of the second speaker;
An information processing system comprising: communication processing means for processing input data from the first speaker based on the characteristics of the second speaker.

The information processing system according to claim 1, further comprising first speaker feature extraction means for extracting features of the first speaker,
wherein the communication processing means processes input data from the first speaker based on the characteristics of the first speaker and the characteristics of the second speaker.

The communication processing means includes
Speaker relation determining means for determining speaker relations based on the characteristics of the first speaker and the characteristics of the second speaker;
The information processing system according to claim 2, wherein input data from the first speaker is processed with reference to the determined relationship of the speakers.

The information processing system according to claim 2 or claim 3, further comprising a conversation history database that stores the conversation history of the speakers,
wherein the communication processing means processes input data from the first speaker based on the characteristics of the first speaker or the characteristics of the second speaker and on the speakers' conversation history in the conversation history database.

The communication processing means includes
Dictionary and model database,
A dictionary / model selection means for selecting a dictionary or model from a dictionary / model database based on the characteristics of the second speaker;
5. Data conversion means for converting input data from the first speaker into data suitable for the second speaker using the selected dictionary or model. An information processing system according to any one of the above.

The information processing system according to claim 5, wherein the data conversion means is text conversion means for converting input data from the first speaker into text suitable for the second speaker using the selected dictionary or model.

The information processing system according to claim 5, wherein the data conversion means is voice synthesis means for performing voice synthesis suitable for the second speaker on input data from the first speaker using the selected dictionary or model.

The information processing system according to claim 5, wherein the data conversion means is translation means for translating input data from the first speaker into a language expression suitable for the second speaker using the selected dictionary or model.

The data converting means translates input data from the first speaker into a language expression suitable for the second speaker using the selected dictionary or model, and the translated language expression is converted into the second speaker. The information processing system according to claim 5, wherein the information processing system is a translation / speech synthesizer that performs speech synthesis suitable for the user.

The information processing system according to any one of claims 1 to 9, further comprising voice recognition means for voice recognition of the voice of the first speaker and outputting as input data.

11. The information processing according to claim 3, wherein the communication processing unit selects a dictionary or a model from the dictionary / model database based on the determination by the speaker relation determination unit. system.

The communication processing means is a dialog management that performs collation with various patterns obtained by analyzing the conversation history of the speaker from the conversation history database of the speaker based on the speaker feature extracted from the speaker feature extraction means. The information processing system according to claim 3, further comprising: means.

13. The speaker feature extraction unit according to claim 1, wherein the first speaker feature extraction unit or the second speaker feature extraction unit extracts speaker features from the speech data of the speaker. Information processing system.

The information processing system according to claim 1, wherein the first speaker feature extraction means or the second speaker feature extraction means extracts the features of the speaker from the speaker's face image.

The information processing system according to any one of claims 1 to 12, wherein the first speaker feature extraction means or the second speaker feature extraction means extracts speaker features by a sensor.

The information processing system according to any one of claims 1 to 12, wherein the first speaker feature extraction means or the second speaker feature extraction means includes an IC card capable of communicating the personal attributes of the speaker, or a terminal device incorporating such an IC card, and extracts the speaker features by using the IC card or the terminal device.

The first speaker feature extraction unit or the second speaker feature extraction unit outputs personal attributes indicating the gender, age, race, body posture, or biological information of the speaker. The information processing system according to claim 12.

A processing method for processing communication from a first speaker to a second speaker,
A first speaker feature extraction process for extracting features of the second speaker;
And a communication process for processing input data from the first speaker based on the characteristics of the second speaker.

A second speaker feature extraction process for extracting features of the first speaker;
The processing method according to claim 18, wherein the communication process processes input data from the first speaker based on the characteristics of the first speaker and the characteristics of the second speaker.

The communication process is:
A speaker relationship determination process for determining a speaker relationship based on the characteristics of the first speaker and the characteristics of the second speaker;
The processing method according to claim 19, wherein input data from the first speaker is processed with reference to the determined relationship of the speakers.

The communication process is characterized in that input data from the first speaker is input based on the characteristics of the first speaker or the characteristics of the second speaker and the conversation history of the speaker. Item 20. The processing method according to Item 19 or Item 20.

The communication processing means selects a dictionary or model from a dictionary / model database based on the characteristics of the second speaker, and uses the selected dictionary or model to input data from the first speaker. The processing method according to any one of claims 18 to 21, further comprising a data conversion process for converting the data into data suitable for the second speaker.

23. The data conversion process according to claim 22, wherein the data conversion process is a process of converting input data from the first speaker into a text suitable for the second speaker using the selected dictionary or model. The processing method described.

23. The process according to claim 22, wherein the data conversion process is a process of performing speech synthesis suitable for the second speaker on the input data from the first speaker using the selected dictionary or model. The processing method described.

The data conversion process is a process of translating input data from the first speaker into a language expression suitable for the second speaker, using the selected dictionary or model. The processing method as described in.

The processing method according to claim 22, wherein the data conversion process, using the selected dictionary or model, translates input data from the first speaker into a language expression suitable for the second speaker and performs speech synthesis suitable for the second speaker on the translated language expression.
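
The two-stage conversion recited above (translation followed by listener-adapted speech synthesis) can be sketched as a small pipeline. This is only an illustration under stated assumptions: the word-level translation table, the voice settings, the age threshold, and the `SynthesisRequest` type are hypothetical, and no actual audio is produced; the second stage merely prepares the parameters a synthesizer would receive.

```python
from dataclasses import dataclass

# Hypothetical translation dictionaries keyed by the listener's language.
TRANSLATION_DICTS = {
    "fr": {"hello": "bonjour", "thanks": "merci"},
}

# Hypothetical synthesis settings keyed by listener age group.
VOICE_MODELS = {
    "senior": {"rate": 0.8, "volume": 1.2},
    "default": {"rate": 1.0, "volume": 1.0},
}

SENIOR_AGE = 65  # assumed threshold


@dataclass(frozen=True)
class SynthesisRequest:
    text: str
    rate: float
    volume: float


def translate(text: str, listener_language: str) -> str:
    """Word-by-word lookup; unknown words and languages pass through."""
    table = TRANSLATION_DICTS.get(listener_language, {})
    return " ".join(table.get(word, word) for word in text.lower().split())


def convert_for_listener(text: str, listener_language: str, listener_age: int) -> SynthesisRequest:
    """Translate first, then attach synthesis settings chosen for the listener."""
    translated = translate(text, listener_language)
    model = VOICE_MODELS["senior" if listener_age >= SENIOR_AGE else "default"]
    return SynthesisRequest(translated, model["rate"], model["volume"])
```

Keeping translation and synthesis settings as separate lookups mirrors the claim wording, in which both steps draw on a dictionary or model selected for the second speaker.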

The processing method according to claim 18, further comprising a voice recognition process for recognizing the voice of the first speaker and outputting the recognition result as the input data.

The processing method according to claim 20, wherein, in the communication process, a dictionary or model is selected from a dictionary/model database based on the determination in the speaker relationship determination process.

The processing method according to any one of claims 21 to 28, wherein the communication process includes a dialog management process that, based on the extracted speaker characteristics, performs matching against patterns obtained by analyzing the speaker's conversation history stored in a speaker conversation history database.

The processing method according to claim 18, wherein a feature of the first speaker or a feature of the second speaker is extracted from the voice data of the speaker.

The processing method according to claim 18, wherein a feature of the first speaker or a feature of the second speaker is extracted from the face image of the speaker.

The processing method according to claim 18, wherein a feature of the first speaker or a feature of the second speaker is extracted by a sensor.

The processing method according to claim 18, wherein a feature of the first speaker or a feature of the second speaker is extracted by using an IC card capable of communicating the personal attributes of the speaker, or a terminal device incorporating such an IC card.

The processing method according to claim 18, wherein a personal attribute indicating a speaker's sex, age, race, body posture, or biological information is output.

A program for an information processing system that processes communication from a first speaker to a second speaker, the program causing the information processing system to execute:
a first speaker feature extraction process for extracting features of the second speaker; and
a communication process for processing input data from the first speaker based on the characteristics of the second speaker.

The program according to claim 35, further causing the information processing system to execute a second speaker feature extraction process for extracting features of the first speaker,
wherein the communication process processes input data from the first speaker based on the characteristics of the first speaker and the characteristics of the second speaker.

The program according to claim 36, wherein the communication process includes:
a process of determining a speaker relationship based on the characteristics of the first speaker and the characteristics of the second speaker; and
a process of processing input data from the first speaker with reference to the determined speaker relationship.