JP2007251581A

JP2007251581A - Voice transmission terminal and voice reproduction terminal

Info

Publication number: JP2007251581A
Application number: JP2006071971A
Authority: JP
Inventors: Motoyasu Tanaka; 基康田中; Takashi Matsutani; 隆司松谷; Yusuke Nara; 裕介奈良
Original assignee: MegaChips LSI Solutions Inc
Current assignee: MegaChips Corp
Priority date: 2006-03-16
Filing date: 2006-03-16
Publication date: 2007-09-27

Abstract

PROBLEM TO BE SOLVED: To provide a technique that enables diverse and varied communication even in voice calls.

SOLUTION: When the operator of a cellular phone set 10 speaks, a voice input part 11 inputs the voice and a voice recognition part 12 performs voice recognition on it. A voice synthesizing part 13 refers to a voice material database 101 to obtain voice data (a sound effect, BGM, etc.) associated with the voice recognition result, and synthesizes the obtained voice data with the voice input from the voice input part 11. A communication part 14 transmits the synthesized voice to a cellular phone set 20. The cellular phone set 20 receives the synthesized voice through the same processing as for a normal call and reproduces it from a speaker. Thus, voice in which a sound effect or BGM has been synthesized with the speaker's words is reproduced on the receiving side.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a technique used in devices that transmit and receive voice, such as mobile phones. More specifically, it relates to a voice or data transmission technique that enables diverse and varied communication.

Recent mobile phones have gained a wide range of functions and have become increasingly sophisticated. In addition to the original voice call function, they now offer e-mail transmission and reception, Internet connectivity, sound playback, movie playback, a camera function, a movie recording function, and so on.

For example, by connecting to the Internet and downloading a sound file or movie file, a user can play the sound or movie on the mobile phone. A user can also shoot still images and videos with the camera and movie functions and view them on the mobile phone, or send them to friends by e-mail.

As mobile phones become multifunctional in this way, using these functions in combination makes it possible to realize new methods of communication and to offer users new ways to enjoy their phones. The voice call function, however, still works the way it always has. Many techniques aim to improve the quality of the transmitted voice or to reduce noise, but their purpose is limited to conveying accurately what the caller said.

Patent Document 1 below relates to a technique for outputting sound based on a person's emotional state. A voice signal is acquired from the voice carried on a telephone line, the emotional state is determined from the acquired voice, and appropriate music is output.

Japanese Unexamined Patent Application Publication No. 2005-352151 (JP 2005-352151 A)

As described above, voice calls currently still rely on the same simple, conventional method. If there were a technique that enhanced the expressiveness of voice calls and allowed intentions to be conveyed in diverse forms, it would add diversity and variety to communication using mobile phones and similar devices, and mobile phones could be expected to serve as communication tools with even higher added value.

The technique of Patent Document 1 determines the emotional state from the voice received by a voice output device. The receiving device must therefore have a voice recognition function, so for many users to use such a function, the telephones of all those users must support it. For this reason, it is not easy to build a system in which many users can widely receive services based on this technique.

In view of the above problems, an object of the present invention is to provide a technique that enables diverse and varied communication even in voice calls.

In order to solve the above problem, the invention according to claim 1 comprises: a voice material database that stores voice material data associated with voice recognition results; voice input means; voice recognition means for performing voice recognition on the voice input from the voice input means; means for acquiring, from the voice material database, the voice material data associated with the recognition result of the voice recognition means and synthesizing the acquired voice material data with the voice input from the voice input means; and communication means for transmitting the voice with which the voice material data has been synthesized to a partner terminal.

The invention according to claim 2 is the voice transmitting terminal according to claim 1, further comprising: a video material database that stores video material data associated with voice recognition results; and means for acquiring, from the video material database, the video material data associated with the recognition result of the voice recognition means, wherein the communication means transmits the video acquired from the video material database to the partner terminal together with the voice with which the voice material data has been synthesized.

The invention according to claim 3 is the voice transmitting terminal according to claim 1, further comprising: a vibration setting database in which voice recognition results are associated with vibration setting data; and means for determining vibration setting data based on the recognition result of the voice recognition means, wherein the communication means transmits the vibration setting data to the partner terminal together with the voice with which the voice material data has been synthesized.

The invention according to claim 4 is the voice transmitting terminal according to claim 2 or claim 3, further comprising means for acquiring terminal information of the partner terminal, wherein transmission of the video material data or the vibration setting data is stopped depending on the type of the partner terminal.

The invention according to claim 5 is the voice transmitting terminal according to any one of claims 2 to 4, wherein the communication means further transmits the recognition result of the voice recognition means to the partner terminal.

The invention according to claim 6 comprises: a video material database that stores video material data associated with voice recognition results; voice input means; voice recognition means for performing voice recognition on the voice input from the voice input means; means for acquiring, from the video material database, the video material data associated with the recognition result of the voice recognition means; and communication means for transmitting the video acquired from the video material database to a partner terminal together with the voice input from the voice input means.

The invention according to claim 7 comprises: a vibration setting database in which voice recognition results are associated with vibration setting data; voice input means; voice recognition means for performing voice recognition on the voice input from the voice input means; means for determining vibration setting data based on the recognition result of the voice recognition means; and communication means for transmitting the vibration setting data to a partner terminal together with the voice input from the voice input means.

The invention according to claim 8 is the voice transmitting terminal according to any one of claims 1 to 7, wherein the voice recognition result includes the result of converting the voice input by the voice input means into text and/or information on the tone of the voice determined from the voice input by the voice input means.

The invention according to claim 9 is the voice transmitting terminal according to claim 1, wherein the voice material database is stored on a memory card, and the voice material database becomes available when the memory card is inserted into the voice transmitting terminal.

The invention according to claim 10 is the voice transmitting terminal according to claim 2 or claim 6, wherein the video material database is stored on a memory card, and the video material database becomes available when the memory card is inserted into the voice transmitting terminal.

The invention according to claim 11 is the voice transmitting terminal according to claim 3 or claim 7, wherein the vibration setting database is stored on a memory card, and the vibration setting database becomes available when the memory card is inserted into the voice transmitting terminal.

The invention according to claim 12 is a terminal that receives the voice transmitted from the voice transmitting terminal according to claim 1, and that outputs the voice with which the voice material data has been synthesized from a speaker.

The invention according to claim 13 is a terminal that receives the voice and data transmitted from the voice transmitting terminal according to claim 2, and that outputs the received video material data to a monitor while outputting the voice with which the voice material data has been synthesized from a speaker.

The invention according to claim 14 is a terminal that receives the voice and data transmitted from the voice transmitting terminal according to claim 3, and that drives a vibrator based on the received vibration setting data while outputting the voice with which the voice material data has been synthesized from a speaker.

The invention according to claim 15 is a terminal that receives the synthesized voice and the voice recognition result transmitted from the voice transmitting terminal according to claim 5, wherein the receiving terminal comprises a receiving-side video material database that stores video material data associated with voice recognition results, and wherein, when the receiving terminal receives video material data from the voice transmitting terminal, it can select between (a) reproducing the received video material data together with the synthesized voice and (b) acquiring the corresponding video material data from the receiving-side video material database based on the received recognition result and reproducing the acquired video material data together with the synthesized voice.

The invention according to claim 16 is a terminal that receives the synthesized voice and the voice recognition result transmitted from the voice transmitting terminal according to claim 5, wherein the receiving terminal comprises a receiving-side vibration setting database in which voice recognition results are associated with vibration setting data, and wherein, when the receiving terminal receives vibration setting data from the voice transmitting terminal, it can select between (a) reproducing the synthesized voice while driving a vibrator based on the received vibration setting data and (b) determining vibration setting data by referring to the receiving-side vibration setting database based on the received voice recognition result, and reproducing the synthesized voice while driving the vibrator based on the determined vibration setting data.

The invention according to claim 17 comprises: a voice material database that stores voice material data associated with key operations; voice input means; means for acquiring, from the voice material database, the voice material data associated with a key operation input during a voice call and synthesizing the acquired voice material data with the voice input from the voice input means; and communication means for transmitting the voice with which the voice material data has been synthesized to a partner terminal.

The invention according to claim 18 comprises: a video material database that stores video material data associated with key operations; voice input means; means for acquiring, from the video material database, the video material data associated with a key operation input during a voice call; and communication means for transmitting the acquired video material data to a partner terminal together with the voice input from the voice input means.

The invention according to claim 19 comprises: a vibration setting database that stores vibration setting data associated with key operations; voice input means; means for acquiring, from the vibration setting database, the vibration setting data associated with a key operation input during a voice call; and communication means for transmitting the acquired vibration setting data to a partner terminal together with the voice input from the voice input means.

The voice transmitting terminal of the present invention acquires the voice material data associated with the voice recognition result and synthesizes it with the voice input from the caller. It then transmits the voice with which the voice material data has been synthesized to the partner terminal. As a result, when the caller utters a word, a sound effect or BGM associated with that word is synthesized and transmitted, which enhances expressiveness in communication. In addition, because the synthesized voice is generated on the transmitting terminal, the receiving terminal only needs ordinary voice reception and playback functions. In other words, the two terminals in a call do not both have to support the function, so a service that is easy to introduce and to spread can be realized.

Furthermore, the voice transmitting terminal of the present invention acquires the video material data associated with the voice recognition result and transmits the video data to the partner terminal together with the synthesized voice. The caller's emotions and intentions can therefore be conveyed expressively by means of video.

Furthermore, the voice transmitting terminal of the present invention determines vibration setting data based on the voice recognition result and transmits the vibration setting data to the partner terminal together with the synthesized voice. By using vibration, the caller's emotions and intentions can therefore be conveyed in a more vivid and realistic form.

{First embodiment}
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a diagram illustrating how the communication system using mobile phones according to the present invention is used. The mobile phone 10 and the mobile phone 20 have, in addition to a voice call function, a function of connecting to a network and transmitting and receiving data. In the present invention, when the caller operating the mobile phone 10 speaks, various special effects are applied to the voice and presented on the mobile phone 20. For example, as shown in the figure, when the operator of the mobile phone 10 utters the word "liar", the mobile phone 20 reproduces the word "liar" and displays a video related to the word "liar" on its monitor. Alternatively, BGM related to the word "liar" is played on the mobile phone 20.

FIG. 2 is a block diagram of the mobile phone 10 and the mobile phone 20. First, the configuration and functions of the mobile phone 10 will be described. The voice input unit 11 inputs the caller's voice. The input voice is output to the voice recognition unit 12 and the voice synthesis unit 13.

The voice recognition unit 12 performs voice recognition processing in real time on the voice input by the voice input unit 11. Any well-known method may be used for the voice recognition processing. In general, the frequency content of the input voice is analyzed to recognize phonemes, and words and sentences are then recognized to analyze the input voice. Finally, the voice recognition unit 12 converts the input voice into text data.

The voice recognition unit 12 also identifies the tone of the caller's voice from the sound pressure (sound intensity), tempo, frequency, and other properties of the input voice. To identify the tone, the voice recognition unit 12 holds in advance a table that associates the sound pressure, tempo, frequency, and so on of a voice with caller state patterns (voice tone patterns). That is, for each predetermined state pattern, such as "cheerful tone", "subdued tone", "shouting", or "laughter", numerical values for sound pressure, tempo, frequency, and so on are set. When the sound pressure, tempo, frequency, and other values of the input voice are close to the set values of one of the registered state patterns (that is, within the configured threshold range), the unit determines that the caller's current tone of voice matches that registered state pattern.
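The table-based tone determination described above could be sketched as follows. This is only an illustrative sketch: the `TonePattern` fields, the reference values, the use of a single relative tolerance per pattern, and the function name `classify_tone` are all assumptions made for the example and are not specified in this document.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TonePattern:
    """One registered state pattern (all values here are hypothetical)."""
    name: str
    pressure_db: float  # reference sound pressure
    tempo_sps: float    # reference tempo, syllables per second
    pitch_hz: float     # reference frequency
    tolerance: float    # allowed relative deviation for every feature


# Hypothetical table; a real table would come from the terminal or a content provider.
TONE_TABLE = (
    TonePattern("shouting", pressure_db=85.0, tempo_sps=5.0, pitch_hz=300.0, tolerance=0.20),
    TonePattern("subdued", pressure_db=50.0, tempo_sps=2.5, pitch_hz=120.0, tolerance=0.20),
)


def classify_tone(pressure_db: float, tempo_sps: float, pitch_hz: float,
                  table: Sequence[TonePattern] = TONE_TABLE) -> Optional[str]:
    """Return the first pattern whose reference values all lie within tolerance."""
    for pattern in table:
        pairs = (
            (pressure_db, pattern.pressure_db),
            (tempo_sps, pattern.tempo_sps),
            (pitch_hz, pattern.pitch_hz),
        )
        if all(abs(measured - ref) <= pattern.tolerance * ref for measured, ref in pairs):
            return pattern.name
    return None  # no registered state pattern matches
```

Returning `None` when nothing matches reflects the reading that a tone is reported only when the measured values fall inside a registered pattern's threshold range.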

The voice synthesis unit 13 determines the voice to be synthesized based on the voice recognition result input from the voice recognition unit 12, and synthesizes the determined voice with the voice input from the voice input unit 11. It uses the voice material database 101 to determine, from the voice recognition result, which voice to synthesize.

FIG. 3 is a diagram showing an example of entries in the voice material database 101. The voice material database 101 associates voice recognition results from the voice recognition unit 12 with voice data, and also stores the associated voice data. Keywords or caller state patterns are registered in the voice recognition result field of the voice material database 101. In practice, the database that associates keywords with voice data and the database that associates state patterns with voice data would be managed separately, but for simplicity they are described here as a single database. In FIG. 3, an entry that is a keyword is shown as the keyword text alone, and an entry that is a state pattern is shown as the state pattern name followed by "(state pattern)".

The voice material database 101 is stored on a memory card 30 (shown in FIG. 1) and is provided by, for example, a content provider. A user can use the voice material database 101 by purchasing the memory card 30 on which it is stored and inserting the card into the memory card slot of the mobile phone 10. Alternatively, the voice material database 101 may be downloaded over a network and stored in the built-in memory of the mobile phone 10 or on a memory card.

The voice synthesis unit 13 searches the text data of the voice input from the voice recognition unit 12 for keywords registered in the voice material database 101. When a keyword is found, it selects the voice data associated with that keyword as the voice to be synthesized. Alternatively, the voice synthesis unit 13 selects, as the voice to be synthesized, the voice data associated with the caller's state pattern input from the voice recognition unit 12.

In the example in the figure, if the caller's speech contains a keyword such as "worst", "blue", "get depressed", or "got depressed", sound effect data with a dark, gloomy atmosphere is selected. Conversely, if the caller's speech contains a keyword such as "happy", "glad", or "fun", sound effect data with a bright, sparkling atmosphere is selected. When the state pattern input from the voice recognition unit 12 is "shouting", sound effect data of an explosion is selected.
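The lookup described above might look like the following minimal sketch. The two dictionaries mirror the separate keyword and state pattern databases mentioned in the text; the file names, the English keywords, the substring matching, and the rule that a keyword hit takes precedence over a state pattern are assumptions for illustration only.

```python
from typing import Optional

# Hypothetical keyword database: recognized keyword -> voice material (file name).
KEYWORD_TO_MATERIAL = {
    "worst": "gloomy.wav",
    "blue": "gloomy.wav",
    "depressed": "gloomy.wav",
    "happy": "sparkle.wav",
    "glad": "sparkle.wav",
    "fun": "sparkle.wav",
}

# Hypothetical state pattern database: tone of voice -> voice material.
STATE_TO_MATERIAL = {
    "shouting": "explosion.wav",
}


def select_material(recognized_text: str, state: Optional[str] = None) -> Optional[str]:
    """Pick the material for the first registered keyword found in the text.

    If no keyword is found, fall back to the caller's state pattern, if any.
    """
    lowered = recognized_text.lower()
    for keyword, material in KEYWORD_TO_MATERIAL.items():
        if keyword in lowered:  # naive substring match, enough for a sketch
            return material
    if state is not None:
        return STATE_TO_MATERIAL.get(state)
    return None
```

The same shape would apply to the video material and vibration setting lookups, with the dictionary values replaced by video data or vibration settings.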

Having acquired voice data from the voice material database 101 based on the voice recognition result, the voice synthesis unit 13 synthesizes the acquired voice with the voice input from the voice input unit 11 and outputs the result to the communication unit 14.
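The document does not say how the two signals are combined. One simple possibility, shown here purely as an assumption, is additive mixing of 16-bit PCM samples with a gain on the material and clipping to the valid range. The function name, the `offset` and `gain` parameters, and the sample format are all hypothetical.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_voice_with_material(voice: Sequence[int], material: Sequence[int],
                            offset: int = 0, gain: float = 0.5) -> List[int]:
    """Overlay `material` onto `voice` starting at sample index `offset`.

    Both inputs are assumed to be 16-bit PCM samples at the same sample rate.
    Material samples that would fall past the end of the voice are dropped.
    """
    mixed = list(voice)
    for i, sample in enumerate(material):
        j = offset + i
        if j >= len(mixed):
            break
        value = mixed[j] + int(sample * gain)
        mixed[j] = max(INT16_MIN, min(INT16_MAX, value))  # clip to the int16 range
    return mixed
```

Because the mixing happens before transmission, the receiving terminal would see an ordinary voice stream, which is consistent with the point made earlier that the receiver needs only normal reception and playback functions.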

The voice recognition unit 12 also outputs the voice recognition result to the video determination unit 15. The video determination unit 15 determines the video to be transmitted to the mobile phone 20 based on the voice recognition result input from the voice recognition unit 12. It uses the video material database 102 to determine which video to select from the voice recognition result.

FIG. 4 is a diagram showing an example of entries in the video material database 102. The video material database 102 associates voice recognition results from the voice recognition unit 12 with video data, and also stores the associated video data. Keywords or caller state patterns are registered in the voice recognition result field of the video material database 102. The video determination unit 15 searches the text data of the voice input from the voice recognition unit 12 for keywords registered in the video material database 102 and, when a keyword is found, acquires the video data associated with that keyword. Alternatively, the video determination unit 15 acquires the video data associated with the caller's state pattern input from the voice recognition unit 12.

In the example in the figure, if the caller's speech contains a keyword such as "liar", "all lies", or "nothing but lies", video data depicting a liar being reprimanded is selected. If the caller's speech contains a keyword such as "I won't forgive you", "I can't forgive you", or "unforgivable", video data depicting a tough-looking man angrily cracking his knuckles is selected. When the state pattern input from the voice recognition unit 12 is "shouting", a video of lightning, a figurative representation of someone shouting, is selected. In the example in the figure, moving image data is registered as the video data, but still image data may be registered instead.

The video material database 102 is likewise stored on the memory card 30 and provided in that form. Alternatively, the video material database 102 may be downloaded over a network and stored in the built-in memory of the mobile phone 10 or on a memory card.

Having acquired video data from the video material database 102 based on the voice recognition result, the video determination unit 15 outputs the acquired video data to the communication unit 14.

The voice recognition unit 12 also outputs the voice recognition result to the vibration determination unit 16. The vibration determination unit 16 determines the vibration setting data to be transmitted to the mobile phone 20 based on the voice recognition result input from the voice recognition unit 12. It uses the vibration setting database 103 to obtain vibration setting data from the voice recognition result.

FIG. 5 is a diagram showing an example of entries in the vibration setting database 103. The vibration setting database 103 associates voice recognition results from the voice recognition unit 12 with vibration setting data. Keywords or caller state patterns are registered in the voice recognition result field of the vibration setting database 103. The vibration determination unit 16 searches the text data of the voice input from the voice recognition unit 12 for keywords registered in the vibration setting database 103 and, when a keyword is found, acquires the vibration setting data associated with that keyword. Alternatively, the vibration determination unit 16 acquires the vibration setting data associated with the caller's state pattern input from the voice recognition unit 12.

In the example of the figure, if the caller's speech contains a keyword such as "うそつき" ("liar"), "うそばっかり", or "うそばかり" (both roughly "nothing but lies"), vibration setting data that produces a single strong vibration is selected. When the state pattern input from the voice recognition unit 12 is "shouting", vibration setting data that produces five strong vibrations in succession is selected. The vibration setting data may allow the number of vibrations, vibration duration, vibration interval, vibration strength, and similar parameters to be set individually or in combination.
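
The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the class and function names, the millisecond values, and the rule that a keyword match takes precedence over a state pattern are hypothetical and are not specified in the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VibrationSetting:
    count: int         # number of vibrations
    duration_ms: int   # length of each vibration (assumed value)
    interval_ms: int   # pause between vibrations (assumed value)
    strength: str      # e.g. "strong"


# Hypothetical registrations modelled on the FIG. 5 example.
KEYWORD_TABLE = {
    ("うそつき", "うそばっかり", "うそばかり"): VibrationSetting(1, 500, 0, "strong"),
}
STATE_TABLE = {
    "shouting": VibrationSetting(5, 500, 200, "strong"),
}


def lookup_vibration(text: Optional[str] = None,
                     state: Optional[str] = None) -> Optional[VibrationSetting]:
    """Return the setting for a keyword found in the text, else for the state pattern."""
    if text:
        for keywords, setting in KEYWORD_TABLE.items():
            if any(keyword in text for keyword in keywords):
                return setting
    if state is not None:
        return STATE_TABLE.get(state)
    return None
```

Keeping keywords and state patterns in separate tables mirrors the two kinds of entries in the voice recognition result field; a real database could equally hold both in one table.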

このバイブレーション設定データベース１０３についても、メモリカード３０に格納されて提供されている。ただし、バイブレーション設定データベース１０３は、ネットワーク上からダウンロードし、携帯電話機１０の内蔵メモリやメモリカードに格納する形態であってもよい。 The vibration setting database 103 is also stored in the memory card 30 and provided. However, the vibration setting database 103 may be downloaded from the network and stored in the built-in memory or the memory card of the mobile phone 10.

バイブレーション決定部１６は、音声認識結果に基づいてバイブレーション設定データベース１０３からバイブレーション設定データを取得すると、取得したバイブレーション設定データを通信部１４に出力する。 When the vibration determination unit 16 acquires the vibration setting data from the vibration setting database 103 based on the voice recognition result, the vibration determination unit 16 outputs the acquired vibration setting data to the communication unit 14.

通信部１４は、他の携帯電話機との間で音声の送受信を行う機能部とネットワークを介してデータを送受信する機能部とを備えている。上述したように、通信部１４は、音声合成部１３から、効果音やＢＧＭが合成された通話者の音声を入力する。また、通信部１４は、映像決定部１５から音声認識結果に基づいて決定された映像データを入力する。さらに、通信部１４は、バイブレーション決定部１６から音声認識結果に基づいて決定されたバイブレーション設定データを入力する。そして、通信部１４は、これら合成音声とデータとを携帯電話機２０に送信する。 The communication unit 14 includes a functional unit that transmits / receives audio to / from another mobile phone and a functional unit that transmits / receives data via a network. As described above, the communication unit 14 inputs the voice of the caller synthesized with the sound effects and BGM from the voice synthesis unit 13. In addition, the communication unit 14 inputs video data determined based on the voice recognition result from the video determination unit 15. Further, the communication unit 14 inputs the vibration setting data determined based on the voice recognition result from the vibration determination unit 16. Then, the communication unit 14 transmits these synthesized speech and data to the mobile phone 20.

Here, the communication unit 14 transmits the synthesized voice in the same way as ordinary call voice. Because the voice processing unit of the mobile phone 10 has already mixed the sound effects and BGM into the voice acquired from the caller, the communication unit 14 can process and transmit the synthesized voice exactly like ordinary voice. A mobile phone receiving the synthesized voice therefore needs no special function: the receiving mobile phone 20 simply receives the synthesized voice like ordinary call voice and outputs it from its speaker.

映像データおよびバイブレーション設定データは、音声通信とは別にＴＣＰ／ＩＰなどのデータ送受信プロトコルを利用して送信される。通信部１４は、このような通信を可能とするデータ通信機能を備えている。あるいは、通信部１４は、音声通信の特定の周波数帯域に映像データやバイブレーション設定データを重畳させて送信するようにしてもよい。 Video data and vibration setting data are transmitted using a data transmission / reception protocol such as TCP / IP separately from voice communication. The communication unit 14 has a data communication function that enables such communication. Alternatively, the communication unit 14 may superimpose video data and vibration setting data on a specific frequency band for voice communication and transmit the data.

なお、携帯電話機１０から送信するバイブレーション設定データのデータ形式は、たとえば、バイブレータコマンドと、振動回数、振動時間、振動間隔、振動の強さなどを示す引数とで構成すればよい。 Note that the data format of the vibration setting data transmitted from the mobile phone 10 may be composed of, for example, a vibrator command and an argument indicating the number of vibrations, vibration time, vibration interval, vibration intensity, and the like.
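
As a rough illustration of such a format, the sketch below encodes a vibrator command and its arguments as a space-separated text message. The command name `VIB`, the argument order, the units, and the numeric strength scale are all assumptions made for this example; the description does not define a concrete wire format.

```python
from dataclasses import dataclass

COMMAND_NAME = "VIB"  # hypothetical command name


@dataclass(frozen=True)
class VibratorCommand:
    count: int        # number of vibrations
    duration_ms: int  # length of each vibration
    interval_ms: int  # pause between vibrations
    strength: int     # hypothetical scale, 1 (weak) to 3 (strong)


def encode(cmd: VibratorCommand) -> str:
    """Serialize the command name followed by its four integer arguments."""
    return f"{COMMAND_NAME} {cmd.count} {cmd.duration_ms} {cmd.interval_ms} {cmd.strength}"


def decode(message: str) -> VibratorCommand:
    """Parse a message produced by encode(); raise ValueError on anything else."""
    name, *args = message.split()
    if name != COMMAND_NAME or len(args) != 4:
        raise ValueError(f"not a vibrator command: {message!r}")
    count, duration_ms, interval_ms, strength = (int(a) for a in args)
    return VibratorCommand(count, duration_ms, interval_ms, strength)
```

The receiving terminal would pass the decoded parameters to its vibrator; a binary encoding would serve equally well if message size matters.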

Next, the configuration and functions of the mobile phone 20 are described. The communication unit 21 receives the synthesized voice, video data, and vibration setting data transmitted from the communication unit 14. The communication unit 21 has both a function for transmitting and receiving voice signals and a data communication function. As described above, when the video data and vibration setting data are sent separately from the voice signal using the data communication function, the communication unit 21 receives them using its own data communication function. When the video data and vibration setting data are superimposed on the voice signal, the communication unit 21 separates them from the voice signal.

通信部２１は、受信した音声信号を音声再生部２２に出力する。音声再生部２２は、携帯電話機２０が備えるスピーカから合成音声を出力する。このようにして、携帯電話機１０において通話者が言葉を発すると、その言葉に効果音やＢＧＭが合成され、合成音が携帯電話機２０において再生されるのである。 The communication unit 21 outputs the received audio signal to the audio reproduction unit 22. The voice reproduction unit 22 outputs synthesized voice from a speaker included in the mobile phone 20. In this way, when a caller utters a word on the mobile phone 10, a sound effect or BGM is synthesized with the word, and the synthesized sound is reproduced on the mobile phone 20.

Therefore, the caller on the mobile phone 20 does not merely hear the words spoken by the caller on the mobile phone 10; the sound effects and BGM added to those words make the communication far more vivid. For example, when the caller on the mobile phone 10 says "happy", a sparkling sound effect is reproduced along with the word on the mobile phone 20, so the caller on the mobile phone 20 can fully sense the other party's emotion.

As described above, the synthesized voice is mixed in the transmitting mobile phone 10 and transmitted in the same way as ordinary voice, so the receiving mobile phone 20 can be an ordinary mobile phone and still reproduce it. Using this synthesized voice reproduction function therefore does not require both the transmitting and receiving terminals to support it. With a mobile phone equipped with the voice synthesis function of the present invention, expressive synthesized voice can be sent to any voice reproduction device (not only a mobile phone but also a landline telephone, a videophone, or the like).

通信部２１は、また、受信した映像データを映像再生部２３に出力する。映像再生部２３は、入力した映像データを携帯電話機２０のモニタに再生するのである。映像再生部２３は、受信した合成音声が音声再生部２２から再生されるのに同期して、モニタに映像データを再生するのである。映像データが動画データである場合には、映像再生部２３はモニタに動画を再生し、映像データが静止画データである場合には、モニタに静止画を表示する。これにより、携帯電話機２０では、受信した合成音声が再生されるとともに、音声認識結果から決定された映像が再生されるのである。 The communication unit 21 also outputs the received video data to the video playback unit 23. The video playback unit 23 plays back the input video data on the monitor of the mobile phone 20. The video reproduction unit 23 reproduces the video data on the monitor in synchronization with the received synthesized sound being reproduced from the audio reproduction unit 22. When the video data is moving image data, the video reproducing unit 23 reproduces the moving image on the monitor, and when the video data is still image data, the still image is displayed on the monitor. As a result, the mobile phone 20 reproduces the received synthesized voice and the video determined from the voice recognition result.

For example, using the databases of FIGS. 3 and 4, when the words "許さない" ("I won't forgive you") are spoken on the mobile phone 10, the mobile phone 20 reproduces that speech mixed with a knuckle-cracking sound effect, and the monitor of the mobile phone 20 plays a video of a tough-looking man cracking his knuckles.

また、通信部２１は受信したバイブレーション設定データをバイブレータ２４に出力する。バイブレータ２４は、入力したバイブレーション設定データに基づいて振動を発生させるのである。つまり、バイブレータ２４は、バイブレーション設定データで指定されている振動回数、振動時間、振動間隔、振動の強さなどのパラメータにしたがって振動を発生させることができる。これにより、携帯電話機２０では、受信した合成音声が再生されるとともに、音声認識結果から決定されたバイブレーションが発生するのである。 In addition, the communication unit 21 outputs the received vibration setting data to the vibrator 24. The vibrator 24 generates vibration based on the input vibration setting data. That is, the vibrator 24 can generate vibrations according to parameters such as the number of vibrations, vibration time, vibration interval, and vibration intensity specified in the vibration setting data. As a result, the mobile phone 20 reproduces the received synthesized speech and generates vibration determined from the speech recognition result.

For example, using the databases of FIGS. 3, 4, and 5, when the word "うそつき" ("liar") is spoken on the mobile phone 10, the mobile phone 20 reproduces the spoken word, plays a video admonishing the person who lied, and generates a single strong vibration.

このように本実施の形態の携帯電話コミュニケーションシステムを利用すれば、携帯電話機１０の操作者は、自分の気持ちを表現豊かに伝えることが可能である。言葉だけでは中々伝わらない意思、感情を映像やバイブレーションが表現力を増強させてくれるのである。また、携帯電話機１０の通話者は、感情をあらわにすることが苦手であっても、伝えたい言葉だけを発すれば、携帯電話機１０が表現力を増強させてくれるのである。一方、携帯電話機２０の操作者は、相手の気持ちを感情豊かに受け取ることができる。また、単なる意思伝達に多様性を持たせるというだけでなく、遊びの要素を取り入れ、バラエティ性の高いコミュニケーションが可能となる。 As described above, by using the mobile phone communication system of the present embodiment, the operator of the mobile phone 10 can convey his / her feelings in an expressive manner. Images and vibrations enhance the expressive power of emotions and feelings that are not communicated by words alone. In addition, even if the caller of the mobile phone 10 is not good at expressing emotions, the mobile phone 10 enhances the expressive power if only speaking the words that he wants to convey. On the other hand, the operator of the mobile phone 20 can receive the feelings of the other party in an emotional manner. In addition, it is possible not only to give diversity to simple communication but also to incorporate elements of play and to communicate with high variety.

{Second Embodiment}
Next, a second embodiment of the present invention is described. As explained above, the mobile phone 10 transmits the synthesized voice in the same way as ordinary voice, so the receiving side needs only an ordinary voice reproduction function. The video data and vibration setting data, in contrast, require corresponding functions in the receiving terminal. The mobile phone 20 includes the video playback unit 23 and the vibrator 24, and when these units receive data along with the voice signal, they play the video and drive the vibrator in synchronization with reproduction of the synthesized voice.

In the second embodiment, the mobile phone 10 selects the data to transmit according to the terminal type of the destination mobile phone 20. As shown in FIG. 6, the mobile phone 20 includes a terminal information storage unit 25, which stores information such as the manufacturer name, model name, and serial number of the mobile phone 20.

一方、携帯電話機１０は、図６に示すように端末情報取得部１７を備えている。端末情報取得部１７は、合成音声やその他のデータを送信する前に、携帯電話機２０から端末機種の情報を取得するのである。第２の実施の形態においても、第１の実施の形態と同様の方法で、合成音声が生成され、映像データおよびバイブレーション設定データが決定される。そして、第１の実施の形態においては、合成音声を生成し、映像データおよびバイブレーション設定データを決定すると、そのまま合成音声とデータを送信した。第２の実施の形態においては、まず、端末情報取得部１７が携帯電話機２０の端末機種情報を取得し、処理の方法を決定するのである。 On the other hand, the mobile phone 10 includes a terminal information acquisition unit 17 as shown in FIG. The terminal information acquisition unit 17 acquires information on the terminal model from the mobile phone 20 before transmitting synthesized speech and other data. Also in the second embodiment, synthesized sound is generated and video data and vibration setting data are determined by the same method as in the first embodiment. In the first embodiment, when the synthesized voice is generated and the video data and the vibration setting data are determined, the synthesized voice and the data are transmitted as they are. In the second embodiment, first, the terminal information acquisition unit 17 acquires the terminal model information of the mobile phone 20 and determines the processing method.

Specifically, when the voice call starts (that is, when the terminals are connected), the terminal information acquisition unit 17 generates a command requesting terminal model information, and the communication unit 14 transmits the command to the mobile phone 20. The mobile phone 20 reads its terminal model information from the terminal information storage unit 25 and transmits it to the mobile phone 10. In this way, the terminal information acquisition unit 17 obtains the terminal model information of the mobile phone 20.

The terminal information acquisition unit 17 outputs the terminal model information to the video determination unit 15 and the vibration determination unit 16. The video determination unit 15 determines from the terminal model information whether the destination mobile phone can play video during a voice call. If it determines that the destination has this function, it outputs the video data acquired from the video material database 102 to the communication unit 14, as in the first embodiment. If it determines that the destination lacks this function, it stops outputting video data to the communication unit 14.

The vibration determination unit 16 determines from the terminal model information whether the destination mobile phone has a vibration function usable during a voice call. If it determines that the destination has this function, it outputs the vibration setting data determined by referring to the vibration setting database 103 to the communication unit 14, as in the first embodiment. If it determines that the destination lacks this function, it stops outputting vibration setting data to the communication unit 14.

これにより、通信部１４は、送信先の携帯電話機が音声通話時の映像再生機能を備えているが、バイブレーション機能を備えていない場合には、合成音声および映像データのみを送信する。送信先の携帯電話機が音声通話時のバイブレーション機能を備えているが、映像再生機能を備えていない場合には、合成音声およびバイブレーション設定データのみを送信する。送信先の携帯電話機が音声通話時の映像再生機能およびバイブレーション機能の両方を備えている場合には、合成音声に加えて映像データおよびバイブレーション設定データを送信するのである。 Thereby, the communication unit 14 transmits only the synthesized voice and the video data when the destination mobile phone has the video playback function at the time of the voice call but does not have the vibration function. If the destination mobile phone has a vibration function for voice calls, but does not have a video playback function, only the synthesized voice and vibration setting data are transmitted. When the destination mobile phone has both a video playback function and a vibration function during a voice call, video data and vibration setting data are transmitted in addition to the synthesized voice.
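
The selection logic of the second embodiment might look like the sketch below. The capability table, the manufacturer and model names, and the choice to send voice only to unknown models are assumptions for illustration; the description says only that the decision is made from the terminal model information.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class Capabilities:
    video_during_call: bool
    vibration_during_call: bool


# Hypothetical table keyed by (manufacturer name, model name).
CAPABILITY_TABLE: Dict[Tuple[str, str], Capabilities] = {
    ("MakerA", "X100"): Capabilities(True, True),
    ("MakerA", "X50"): Capabilities(True, False),
    ("MakerB", "V1"): Capabilities(False, True),
}


def select_payload(terminal: Tuple[str, str],
                   synthesized_voice: bytes,
                   video: Optional[bytes],
                   vibration: Optional[str]) -> Dict[str, Any]:
    """Always send the voice; add video or vibration data only if the destination supports it."""
    # Assumption: an unknown model is treated as supporting neither extra function.
    caps = CAPABILITY_TABLE.get(terminal, Capabilities(False, False))
    payload: Dict[str, Any] = {"voice": synthesized_voice}
    if caps.video_during_call and video is not None:
        payload["video"] = video
    if caps.vibration_during_call and vibration is not None:
        payload["vibration"] = vibration
    return payload
```

Because the synthesized voice needs no special support on the receiving side, it is included unconditionally, while the other two items are gated on the looked-up capabilities.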

Because the data to transmit is chosen according to the type of the destination mobile phone, no useless data that the destination cannot handle is transmitted. This reduces the cost of data transmission and keeps useless data from consuming communication bandwidth.

{Third Embodiment}
Next, a third embodiment of the present invention is described. In the first embodiment, the mobile phone 20 played the video data transmitted from the mobile phone 10 as is, and drove the vibrator 24 based on the vibration setting data transmitted from the mobile phone 10. In the third embodiment, in contrast, the mobile phone 20 can also determine the video data and the vibration setting data itself.

第３の実施の形態においては、図７に示すように、音声認識部１２は、音声認識結果を通信部１４に出力する。そして、通信部１４は、合成音声と映像データとバイブレーション設定データとともに音声認識結果を携帯電話機２０に送信する。通信部１４は、映像データ等と同じ方法で音声認識結果を送信する。つまり、データ通信機能を利用するか音声信号に重畳させて音声認識結果を送信する。 In the third embodiment, as shown in FIG. 7, the voice recognition unit 12 outputs a voice recognition result to the communication unit 14. The communication unit 14 transmits the voice recognition result to the mobile phone 20 together with the synthesized voice, the video data, and the vibration setting data. The communication unit 14 transmits the voice recognition result by the same method as the video data or the like. That is, the speech recognition result is transmitted using the data communication function or superposed on the speech signal.

携帯電話機２０では、通信部２１が音声信号から映像データやバイブレーション設定データとともに音声認識結果を分離して取得する。通信部２１は、映像再生部２３に対して、受信した映像データとともに音声認識結果を出力する。 In the mobile phone 20, the communication unit 21 separates and acquires the voice recognition result together with the video data and the vibration setting data from the voice signal. The communication unit 21 outputs the voice recognition result together with the received video data to the video reproduction unit 23.

携帯電話機２０は、図７に示すように、映像素材データベース２０１を備えている。映像素材データベース２０１のデータベース構造は、携帯電話機１０が備える映像素材データベース１０２と同様である。つまり、音声認識結果と映像データとが対応付けるとともに、対応付けられている映像データを蓄積している。 The mobile phone 20 includes a video material database 201 as shown in FIG. The database structure of the video material database 201 is the same as the video material database 102 provided in the mobile phone 10. That is, the voice recognition result and the video data are associated with each other and the associated video data is stored.

ただし、映像素材データベース２０１における音声認識結果と映像データとの対応付けは映像素材データベース１０２と異なる場合もある。たとえば、同じキーワードや同じ状態パターンに対しても異なる映像データが対応付けられている場合がある。たとえば、映像素材データベースを同じコンテンツプロバイダから取得していれば、内容も同じとなるが、異なるコンテンツプロバイダから取得していれば内容が異なる。あるいは、１つのコンテンツプロバイダから複数の異なる映像素材データベースが提供されていてもよい。映像素材データベース２０１についても、メモリカードに格納されてユーザに提供されてもよいし、ユーザがネットワーク経由でダウンロードする形態であってもよい。 However, the correlation between the voice recognition result and the video data in the video material database 201 may be different from that in the video material database 102. For example, different video data may be associated with the same keyword or the same state pattern. For example, if the video material database is acquired from the same content provider, the content is the same, but if the video material database is acquired from a different content provider, the content is different. Alternatively, a plurality of different video material databases may be provided from one content provider. The video material database 201 may also be stored in a memory card and provided to the user, or may be downloaded by the user via a network.

映像再生部２３は、通信部２１から映像データと音声認識結果を入力すると、入力した映像データ、つまり携帯電話機１０から送信された映像データをそのままモニタに再生してもよい。あるいは、映像再生部２３は、入力した音声認識結果に基づいて映像素材データベース２０１を参照して対応する映像データを決定し、決定した映像データをモニタに再生させてもよい。どちらの映像を再生するかについては、携帯電話機２０において予め設定可能としておけばよい。相手の端末から送信されてくる映像を楽しみたいのであれば、受信する映像データを再生する設定にすればよいし、自分の携帯電話機２０で利用している映像素材データベース２０１を利用したいのであれば、受信する音声認識結果を利用するように設定すればよい。 When the video reproduction unit 23 receives the video data and the voice recognition result from the communication unit 21, the video reproduction unit 23 may reproduce the input video data, that is, the video data transmitted from the mobile phone 10 as it is on the monitor. Alternatively, the video reproduction unit 23 may determine the corresponding video data by referring to the video material database 201 based on the input voice recognition result, and cause the monitor to reproduce the determined video data. Which video is to be reproduced may be set in advance in the mobile phone 20. If you want to enjoy the video sent from the other party's terminal, you can set it to play the received video data, or if you want to use the video material database 201 used by your mobile phone 20 It may be set to use the received voice recognition result.
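
The receiver-side choice can be sketched roughly as follows. The local table contents, the file names, and the fallback to the received video when the local database has no match are assumptions; the description states only that the preference is set in advance on the mobile phone 20.

```python
from typing import Optional

# Hypothetical local entries standing in for the video material database 201.
LOCAL_VIDEO_DB = {
    "許さない": "local_knuckles.mp4",
}


def choose_video(received_video: Optional[str],
                 recognition_result: str,
                 prefer_local: bool) -> Optional[str]:
    """Pick the video to play: a local database match if preferred, else the received one."""
    if prefer_local:
        for keyword, video in LOCAL_VIDEO_DB.items():
            if keyword in recognition_result:
                return video
    # Assumption: with no local match, fall back to the video sent by the caller.
    return received_video
```

The same pattern would apply to the vibrator 24 and the vibration setting database 202.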

また、通信部２１は、バイブレータ２４に対して受信したバイブレーション設定データとともに音声認識結果を出力する。携帯電話機２０は、図７に示すように、バイブレーション設定データベース２０２を備えている。バイブレーション設定データベース２０２のデータベース構造は、携帯電話機１０が備えるバイブレーション設定データベース１０３と同様である。つまり、音声認識結果に対してバイブレーション設定データが対応付けられている。 Further, the communication unit 21 outputs the voice recognition result together with the vibration setting data received to the vibrator 24. The mobile phone 20 includes a vibration setting database 202 as shown in FIG. The database structure of the vibration setting database 202 is the same as the vibration setting database 103 provided in the mobile phone 10. That is, the vibration setting data is associated with the voice recognition result.

同様に、バイブレーション設定データベース１０３とバイブレーション設定データベース２０２は異なるデータベースであってもよいし、同じデータベースであってもよい。バイブレーション設定データベース２０２についても、メモリカードに格納されてユーザに提供されてもよいし、ユーザがネットワーク経由でダウンロードする形態であってもよい。 Similarly, the vibration setting database 103 and the vibration setting database 202 may be different databases or the same database. The vibration setting database 202 may also be stored in a memory card and provided to the user, or may be downloaded by the user via a network.

バイブレータ２４は、通信部２１からバイブレーション設定データと音声認識結果を入力すると、入力したバイブレーション設定データ、つまり携帯電話機１０から送信されたバイブレーション設定データに基づいて振動を発生させてもよい。あるいは、バイブレータ２４は、入力した音声認識結果に基づいてバイブレーション設定データベース２０２を参照してバイブレーション設定データを決定し、決定したデータに基づいて振動を発生させてもよい。 When vibrator 24 receives vibration setting data and a voice recognition result from communication unit 21, vibrator 24 may generate vibration based on the input vibration setting data, that is, vibration setting data transmitted from mobile phone 10. Alternatively, the vibrator 24 may determine the vibration setting data with reference to the vibration setting database 202 based on the input voice recognition result, and may generate vibration based on the determined data.

Thus, in the third embodiment, it is possible to choose whether the video played together with the synthesized voice is the one acquired by the transmitting mobile phone 10 or one acquired by the receiving mobile phone 20. If the receiving side holds a richer database, that database can be used for even more expressive communication. Using the receiving side's database also offers the fun of having a video the sender did not intend play on the receiving side.

なお、第３の実施の形態においては、音声認識部１２の認識結果をそのまま通信部１４を介して携帯電話機２０に送信することとした。つまり、通話者の音声をテキストデータに変換したものか、あるいは、音声認識部１２において特定された状態パターンを指定したデータを携帯電話機２０に送信することとした。 In the third embodiment, the recognition result of the voice recognition unit 12 is directly transmitted to the mobile phone 20 via the communication unit 14. In other words, data obtained by converting the caller's voice into text data or data designating the state pattern specified by the voice recognition unit 12 is transmitted to the mobile phone 20.

As another method, only the keywords actually used on the mobile phone 10 side may be transmitted. As described above, when the voice synthesis unit 13, the video determination unit 15, and the vibration determination unit 16 receive the text data of the voice recognition result from the voice recognition unit 12, each searches for keywords by referring to the databases 101 to 103, and when a keyword is found, acquires the voice data, video data, or vibration setting data associated with it. In this variation, when a keyword is found, the keyword itself is also output to the communication unit 14. The communication unit 14 then transmits only the keyword to the mobile phone 20, rather than all of the text data converted from the voice. This reduces the amount of data transmitted.
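
A minimal sketch of this keyword-only variation is shown below. The keyword list and the function name are hypothetical; the point is simply that the matched keywords are typically much shorter than the full recognized text.

```python
from typing import List

# Hypothetical keywords registered in the sender's databases.
REGISTERED_KEYWORDS = ["うそつき", "許さない", "ハッピー"]


def keywords_to_send(recognized_text: str) -> List[str]:
    """Return only the registered keywords that occur in the recognized text."""
    return [keyword for keyword in REGISTERED_KEYWORDS if keyword in recognized_text]
```

The receiving terminal can then look up its own databases using the keywords alone, without needing the rest of the recognized text.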

The functions of the second embodiment may also be incorporated into the third embodiment. That is, the terminal model information of the destination mobile phone may be acquired at the start of the voice call, and the video data and voice recognition result transmitted only when the destination can play video during a voice call. Likewise, the vibration setting data and voice recognition result may be transmitted only when the destination has a vibration function usable during a voice call.

{Fourth Embodiment}
In the first to third embodiments, voice recognition was performed in real time on the voice input to the mobile phone 10, and the voice material data, video material data, and vibration setting data were acquired by referring to the databases 101 to 103 based on the voice recognition result. In other words, the databases 101 to 103 associated voice recognition results with material data and vibration setting data.

第４の実施の形態では、音声素材データベースに登録されている音声素材データは、携帯電話機１０のキー操作と対応付けられている。同様に、映像素材データベースに登録されている映像素材データやバイブレーション設定データベースに登録されているバイブレーション設定データは、携帯電話機１０のキー操作と対応付けられている。 In the fourth embodiment, the sound material data registered in the sound material database is associated with the key operation of the mobile phone 10. Similarly, the video material data registered in the video material database and the vibration setting data registered in the vibration setting database are associated with key operations of the mobile phone 10.

Accordingly, the mobile phone 10 performs no voice recognition; it mixes audio and transmits video or vibration setting data based on the operator's key operations. For example, when the operator presses the "1" key during a voice call, the corresponding BGM is selected, mixed into the call voice, and transmitted to the mobile phone 20. Or, when the operator presses the "2" key during a voice call, the corresponding video data and vibration setting data are selected and transmitted to the mobile phone 20 together with the call voice.

In this way, by performing an assigned key operation at any time during a voice call, the user can have BGM or a sound effect mixed into the call voice at that moment, and can likewise send video data or vibration setting data to the other terminal at any time. In the example above a single key operation is associated with each item of voice material data and the like, but a sequence of key operations may be used instead. For example, the "1" key may correspond to an instruction to mix audio, and a subsequent key from "1" to "9" may then specify which voice data to use.
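
The two-key scheme in the last example could be handled by a small state machine such as the one below. The clip names and the decision to silently ignore an invalid second key are assumptions made for illustration.

```python
from typing import Optional

AUDIO_PREFIX = "1"  # key that announces "mix audio", as in the example above
# Hypothetical clip names selected by the second key, "1" to "9".
AUDIO_CLIPS = {str(n): f"clip_{n}" for n in range(1, 10)}


class KeySequenceDecoder:
    """Turns a stream of key presses into audio clip selections."""

    def __init__(self) -> None:
        self._awaiting_clip = False

    def feed(self, key: str) -> Optional[str]:
        """Return the selected clip once a full two-key sequence has been entered."""
        if self._awaiting_clip:
            self._awaiting_clip = False
            # Assumption: an invalid second key cancels the sequence.
            return AUDIO_CLIPS.get(key)
        if key == AUDIO_PREFIX:
            self._awaiting_clip = True
        return None
```

Feeding "1" and then "3" yields `clip_3`, while a lone "3" yields nothing because no prefix key preceded it.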

この方法によっても、通話音声に表現力を増強させる特殊効果を付加することが可能である。また、リアルタイムの音声認識処理を行わないので、携帯電話機１０における処理負荷を小さくすることが可能である。 Also by this method, it is possible to add a special effect that enhances the expressive power to the call voice. In addition, since real-time voice recognition processing is not performed, the processing load on the mobile phone 10 can be reduced.

{Modifications}
In each of the above embodiments, the video data acquired from the voice recognition result is transmitted together with the synthesized voice. As another example, voice synthesis may be omitted, and the video data acquired from the voice recognition result may be transmitted together with the voice uttered by the caller.

Likewise, in each of the above embodiments, the vibration setting data acquired from the voice recognition result is transmitted together with the synthesized voice. As another example, voice synthesis may be omitted, and the vibration setting data acquired from the voice recognition result may be transmitted together with the voice uttered by the caller.

The embodiments of the present invention have been described above using the example in which the transmitting-side and receiving-side terminals are mobile phones, but one or both of them may be a terminal other than a mobile phone. For example, a terminal may be a landline telephone having a video playback function and a vibration function, or a messenger application running on a personal computer.

It is also desirable that the transmitting-side terminal have a function for turning the functions of the above embodiments ON and OFF. That is, voice synthesis, video data transmission, and vibration setting data transmission should each be individually switchable. For example, all of the functions can be turned OFF for a call concerning important business.
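As a non-limiting illustration, the individual ON/OFF switching described above might be sketched as follows. The names `EffectSwitches` and `plan_effects`, and the representation of database matches as a dictionary of flags, are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EffectSwitches:
    """Independent ON/OFF switches for the three transmitting-side functions."""
    synthesize_audio: bool = True
    send_video: bool = True
    send_vibration: bool = True

    def turn_all_off(self) -> None:
        """Disable every function, e.g. before an important business call."""
        self.synthesize_audio = False
        self.send_video = False
        self.send_vibration = False


def plan_effects(switches: EffectSwitches, matched: Dict[str, bool]) -> Dict[str, bool]:
    """Combine the user's switches with what the databases actually matched."""
    return {
        "audio": switches.synthesize_audio and matched.get("audio", False),
        "video": switches.send_video and matched.get("video", False),
        "vibration": switches.send_vibration and matched.get("vibration", False),
    }
```

Keeping the three switches separate lets a user, for example, retain vibration effects while suppressing video transmission.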

In each of the above embodiments, the databases were described as being provided on a memory card or obtained by download. However, the terminal may instead be equipped with a simple authoring tool so that users can create these databases themselves.

An image diagram of a communication system using mobile phones.
A block diagram of the transmitting-side and receiving-side mobile phones.
A diagram showing a registration example of the audio material database.
A diagram showing a registration example of the video material database.
A diagram showing a registration example of the vibration setting database.
A block diagram of the transmitting-side and receiving-side mobile phones in the second embodiment.
A block diagram of the transmitting-side and receiving-side mobile phones in the third embodiment.

Explanation of symbols

10 (Transmitting side) Mobile phone
20 (Receiving side) Mobile phone
101 Audio material database
102 Video material database
103 Vibration setting database
201 (Receiving side) Video material database
202 (Receiving side) Vibration setting database

Claims

A voice transmission terminal comprising:
an audio material database that stores audio material data associated with voice recognition results;
voice input means;
voice recognition means for performing voice recognition on the voice input from the voice input means;
means for acquiring, from the audio material database, the audio material data associated with the recognition result of the voice recognition means, and synthesizing the acquired audio material data with the voice input from the voice input means; and
communication means for transmitting the synthesized voice data to a partner terminal.
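
As a non-limiting illustration only, and not as part of the claim language, the acquire-and-synthesize step recited above might be sketched as follows. The sketch assumes 16-bit PCM samples held in Python lists, keyword matching on recognized text, and a fixed mixing gain; the recognizer itself is outside the sketch, and every name and database entry is hypothetical.

```python
from typing import Dict, List, Optional

Samples = List[int]  # assumed representation: 16-bit PCM sample values

# Hypothetical audio material database: keyword -> sound effect samples.
AUDIO_MATERIAL_DB: Dict[str, Samples] = {
    "congratulations": [1000, -1000, 1000, -1000],
}


def lookup_material(recognized_text: str) -> Optional[Samples]:
    """Return the material whose keyword appears in the recognized text."""
    lowered = recognized_text.lower()
    for keyword, material in AUDIO_MATERIAL_DB.items():
        if keyword in lowered:
            return material
    return None


def mix(voice: Samples, material: Samples, gain: float = 0.5) -> Samples:
    """Add the scaled material to the voice, clipping to the 16-bit range."""
    mixed = []
    for i, v in enumerate(voice):
        m = material[i] if i < len(material) else 0  # material may be shorter
        mixed.append(max(-32768, min(32767, int(v + gain * m))))
    return mixed


def prepare_for_transmission(voice: Samples, recognized_text: str) -> Samples:
    """Return the samples to transmit: mixed if a material matched, else as-is."""
    material = lookup_material(recognized_text)
    return mix(voice, material) if material is not None else list(voice)
```

Because the mixing happens before transmission, a receiving terminal in this sketch needs no special capability beyond ordinary voice playback.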

The voice transmission terminal according to claim 1, further comprising:
a video material database that stores video material data associated with voice recognition results; and
means for acquiring, from the video material database, the video material data associated with the recognition result of the voice recognition means,
wherein the communication means transmits the video material data acquired from the video material database to the partner terminal together with the voice synthesized with the audio material data.

The voice transmission terminal according to claim 1, further comprising:
a vibration setting database in which voice recognition results are associated with vibration setting data; and
means for determining vibration setting data on the basis of the recognition result of the voice recognition means,
wherein the communication means transmits the vibration setting data to the partner terminal together with the voice synthesized with the audio material data.

The voice transmission terminal according to claim 2 or 3, further comprising:
means for acquiring terminal information of the partner terminal,
wherein transmission of the video material data or the vibration setting data is stopped according to the type of the partner terminal.
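
As a non-limiting illustration only, and not as part of the claim language, stopping transmission according to the type of the partner terminal might be sketched as follows. The terminal type names, the capability table, and the policy of sending voice only to unknown types are assumptions made for this sketch.

```python
from typing import Dict, Set

# Hypothetical capability table keyed by partner terminal type.
CAPABILITIES: Dict[str, Set[str]] = {
    "video_phone": {"video", "vibration"},
    "basic_phone": {"vibration"},
    "landline": set(),
}


def filter_attachments(terminal_type: str, attachments: Dict[str, bytes]) -> Dict[str, bytes]:
    """Keep only the data kinds the partner terminal is assumed to support.

    Unknown terminal types are treated as supporting nothing, so only the
    voice itself would be sent to them.
    """
    supported = CAPABILITIES.get(terminal_type, set())
    return {kind: data for kind, data in attachments.items() if kind in supported}
```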

The voice transmission terminal according to any one of claims 2 to 4,
wherein the communication means further transmits the recognition result of the voice recognition means to the partner terminal.

A voice transmission terminal comprising:
a video material database that stores video material data associated with voice recognition results;
voice input means;
voice recognition means for performing voice recognition on the voice input from the voice input means;
means for acquiring, from the video material database, the video material data associated with the recognition result of the voice recognition means; and
communication means for transmitting the video material data acquired from the video material database to a partner terminal together with the voice input from the voice input means.

A voice transmission terminal comprising:
a vibration setting database in which voice recognition results are associated with vibration setting data;
voice input means;
voice recognition means for performing voice recognition on the voice input from the voice input means;
means for determining vibration setting data on the basis of the recognition result of the voice recognition means; and
communication means for transmitting the vibration setting data to a partner terminal together with the voice input from the voice input means.

The voice transmission terminal according to any one of claims 1 to 7,
wherein the voice recognition result includes the result of converting the voice input by the voice input means into text and/or information on the tone of the voice determined from the voice input by the voice input means.
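
As a non-limiting illustration only, and not as part of the claim language, a recognition result carrying text and/or tone information might be represented as follows. The field names, the example entries, and the rule that a text match takes priority over a tone match are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecognitionResult:
    """Either field may be absent, mirroring the "and/or" in the text above."""
    text: Optional[str] = None  # text conversion of the input voice
    tone: Optional[str] = None  # e.g. "angry"; the labels are hypothetical


# Hypothetical database entries keyed by phrase and by tone label.
MATERIAL_BY_TEXT = {"happy birthday": "birthday_jingle"}
MATERIAL_BY_TONE = {"angry": "thunder"}


def select_material(result: RecognitionResult) -> Optional[str]:
    """Pick a material name; a text match is assumed to take priority."""
    if result.text is not None:
        lowered = result.text.lower()
        for phrase, material in MATERIAL_BY_TEXT.items():
            if phrase in lowered:
                return material
    if result.tone is not None:
        return MATERIAL_BY_TONE.get(result.tone)
    return None
```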

The voice transmission terminal according to claim 1,
wherein the audio material database is stored on a memory card and becomes usable when the memory card is inserted into the voice transmission terminal.

The voice transmission terminal according to claim 2 or 6,
wherein the video material database is stored on a memory card and becomes usable when the memory card is inserted into the voice transmission terminal.

The voice transmission terminal according to claim 3 or 7,
wherein the vibration setting database is stored on a memory card and becomes usable when the memory card is inserted into the voice transmission terminal.

A voice reproduction terminal that receives the voice transmitted from the voice transmission terminal according to claim 1,
wherein the voice synthesized with the audio material data is output from a speaker.

A voice reproduction terminal that receives the voice and data transmitted from the voice transmission terminal according to claim 2,
wherein the received video material data is output to a monitor while the voice synthesized with the audio material data is output from a speaker.

A voice reproduction terminal that receives the voice and data transmitted from the voice transmission terminal according to claim 3,
wherein a vibrator is driven on the basis of the received vibration setting data while the voice synthesized with the audio material data is output from a speaker.

A voice reproduction terminal that receives the synthesized voice and the voice recognition result transmitted from the voice transmission terminal according to claim 5, comprising:
a receiving-side video material database that stores video material data associated with voice recognition results,
wherein, when video material data is received from the voice transmission terminal, the voice reproduction terminal can select between reproducing the received video material data together with the synthesized voice, and acquiring the corresponding video material data from the receiving-side video material database on the basis of the received recognition result and reproducing the acquired video material data together with the synthesized voice.
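
As a non-limiting illustration only, and not as part of the claim language, the receiving-side selection recited above might be sketched as follows. The function name, the `prefer_local` flag, and the fallback to the received video when the receiving-side database has no entry are assumptions made for this sketch.

```python
from typing import Dict


def choose_video(
    received_video: str,
    recognition_result: str,
    prefer_local: bool,
    local_db: Dict[str, str],
) -> str:
    """Select which video material to reproduce with the synthesized voice.

    If the user prefers the receiving-side database and it has an entry for
    the received recognition result, that entry is used. Otherwise the video
    received from the transmitting terminal is used (an assumed fallback).
    """
    if prefer_local:
        local_video = local_db.get(recognition_result)
        if local_video is not None:
            return local_video
    return received_video
```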

A voice reproduction terminal that receives the synthesized voice and the voice recognition result transmitted from the voice transmission terminal according to claim 5, comprising:
a receiving-side vibration setting database in which voice recognition results are associated with vibration setting data,
wherein, when vibration setting data is received from the voice transmission terminal, the voice reproduction terminal can select between reproducing the synthesized voice while driving a vibrator on the basis of the received vibration setting data, and determining vibration setting data by referring to the receiving-side vibration setting database on the basis of the received voice recognition result and reproducing the synthesized voice while driving the vibrator on the basis of the determined vibration setting data.

A voice transmission terminal comprising:
an audio material database that stores audio material data associated with key operations;
voice input means;
means for acquiring, from the audio material database, the audio material data associated with a key operation input during a voice call, and synthesizing the acquired audio material data with the voice input from the voice input means; and
communication means for transmitting the synthesized voice data to a partner terminal.

A voice transmission terminal comprising:
a video material database that stores video material data associated with key operations;
voice input means;
means for acquiring, from the video material database, the video material data associated with a key operation input during a voice call; and
communication means for transmitting the acquired video material data to a partner terminal together with the voice input from the voice input means.

A voice transmission terminal comprising:
a vibration setting database that stores vibration setting data associated with key operations;
voice input means;
means for acquiring, from the vibration setting database, the vibration setting data associated with a key operation input during a voice call; and
communication means for transmitting the acquired vibration setting data to a partner terminal together with the voice input from the voice input means.