JP4600643B2

JP4600643B2 - Videophone device having character display function and voice character conversion display method in videophone device

Info

Publication number: JP4600643B2
Application number: JP2004164121A
Authority: JP
Inventors: 麻奈美大森
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2004-06-02
Filing date: 2004-06-02
Publication date: 2010-12-15
Anticipated expiration: 2024-06-02
Also published as: JP2005348006A

Description

The present invention relates to a videophone device having a character display function, and to a method of converting received voice data into characters and displaying them in a videophone device. In particular, it relates to such a device and method suited to mobile communication terminals equipped with a videophone function, such as mobile phones, PHS phones, and portable information terminals.

Mobile phones equipped with a camera and a monitor and offering a videophone function have recently become widespread. During a videophone call on such a phone, the user is filmed by the camera while watching the monitor, so it is difficult to hold the earpiece against the ear as one would with a phone that has no videophone function.

For this reason, either an earphone-microphone is used so that the received sound is heard through the earphone, or a hands-free function is used in which a microphone and a speaker amplify the voice through the speaker. The former is used mainly where leaking sound is a problem, for example where it would disturb others or where the call should not be overheard. The latter is mostly used where leaking sound causes no problem, such as at home or in a private office.

An incoming videophone call can arrive wherever the mobile terminal happens to be, but few users keep an earphone-microphone permanently attached to the terminal. Attaching one after the call arrives delays the start of the videophone call and is inconvenient.

Because the hands-free function is built into the phone itself, it requires no separate accessory, unlike an earphone-microphone, and a videophone call can be taken with a button press alone. During a hands-free call, however, the other party's amplified voice spreads from the speaker into the surroundings. Depending on the environment, this may disturb people nearby or allow the call to be overheard, for example when a call arrives at a videophone-capable mobile phone during a meeting or on a train.

Mobile phones that display text on the display unit instead of, or together with, voice have been proposed. For example, Japanese Patent Laid-Open No. 2003-18278 (Patent Document 1) receives voice data, converts it to text data by voice recognition, and displays the text on the screen. That phone is intended for users with impaired hearing.

Japanese Patent Laid-Open No. 2003-188948 (Patent Document 2) likewise discloses a mobile phone for users with a disability that can display text together with voice output.

These conventional technologies display text on mobile phones aimed at specific users. They do not display text on a videophone, or on a phone with a videophone function, aimed at general users.

Patent Document 1: Japanese Patent Laid-Open No. 2003-18278
Patent Document 2: Japanese Patent Laid-Open No. 2003-188948

An object of the present invention is to provide a videophone device, or a mobile communication terminal having a videophone function, that converts received voice data into text and displays that text.

Another object of the present invention is to provide, for a videophone device or a mobile communication terminal having a videophone function, a method of converting voice data into text and a method of displaying the converted characters.

According to the present invention, there is provided a mobile communication terminal with a videophone function that includes means for extracting voice data from received voice data by referring to received video data, means for converting the extracted voice data into text data, and means for displaying a character string based on the text data.

Preferably, the extraction means outputs extracted voice data from the received voice data for the portions in which movement of the other party's mouth is present in the received video data.

Also preferably, the extraction means outputs no extracted voice data for portions in which no movement of the other party's mouth is present in the received video.

More preferably, even when movement of the other party's mouth is present in the received video, the extraction means outputs no extracted voice data if the voice level of the received data is at or below a predetermined level.
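
The level condition above can be sketched as follows. This is a minimal illustration, not the patented implementation: the document does not say how the voice level is measured, so the RMS measure, the function names, and the default threshold `min_level` are all assumptions.

```python
import math
from typing import Sequence


def rms_level(samples: Sequence[float]) -> float:
    """Root-mean-square level of one audio frame (0.0 for an empty frame)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_extract(samples: Sequence[float], mouth_moving: bool,
                   min_level: float = 0.05) -> bool:
    """Extract only when the mouth moves AND the level exceeds the threshold.

    min_level stands in for the "predetermined level"; its value is hypothetical.
    A level equal to the threshold is rejected ("at or below").
    """
    return mouth_moving and rms_level(samples) > min_level
```

With these assumptions, a loud frame with a moving mouth is extracted, while a quiet frame is rejected even if the mouth is moving.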

In a preferred aspect, when the hands-free mode, in which call voice is emitted from the speaker, is not set and the text display mode, which activates the character string display means, is set, the mobile communication terminal with a videophone function operates in the text display mode.

In another aspect, the mobile communication terminal with a videophone function includes storage means for storing the text data supplied by the text conversion means.

In yet another aspect, when the hands-free mode, in which call voice is emitted from the speaker, is not set and a save mode that activates storage of the text data is set, the mobile communication terminal with a videophone function operates in the save mode.

The present invention also provides a method of displaying voice data as text in a mobile communication terminal with a videophone function, in which voice data is extracted from received voice data by referring to received video data, the extracted voice data is converted into text data, and a character string is displayed based on the text data.

Preferably, the voice data is extracted from the received voice data for the portions in which movement of the other party's mouth is present in the received video data. More preferably, the voice data is not extracted for portions in which no movement of the other party's mouth is present in the received video.

The present invention further provides a voice-data text conversion method for use in a mobile communication terminal with a videophone function, in which voice data is extracted from received voice data by referring to received video data, and the extracted voice data is converted into text data.

Preferably, in this text conversion method, the voice data is extracted from the received voice data for the portions in which movement of the other party's mouth is present in the received video data. Further, the voice data is not extracted for portions in which no movement of the other party's mouth is present in the received video.

The present invention also provides a videophone device that includes data analysis means for outputting extracted voice data from received voice data by referring to received video data, means for converting the extracted voice data into text data, and means for displaying a character string based on the text data.

According to the present invention, a text display mode can be set as needed in a videophone device or a mobile communication terminal with a videophone function. The user can be filmed and the image sent to the other party, the other party's image can be shown on the monitor, and what the other party says can be displayed as a text message. A videophone call can therefore be made even in an environment where a hands-free call might disturb the surroundings, or where it would be inconvenient for the call to be overheard.

In the embodiments of the present invention, voice data is converted to text only when voice data is present and the mouth in the video data is moving. The received voice data includes components of ambient sound picked up by the microphone in the environment where the transmitting TV phone is used. Of that data, only the voice data coinciding with mouth movement in the video is extracted and converted, so text conversion of misrecognized or unnecessary voice data is suppressed.

The same applies to storing text data. Of the voice data that includes ambient sound picked up by the microphone in the environment where the transmitting TV phone is used, only the voice data coinciding with mouth movement is extracted, converted to text, and stored. Misrecognized or unnecessary text data can therefore be excluded from the stored text data.

Next, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram of an embodiment of the present invention in which the videophone function is implemented in a mobile phone (hereinafter, the TV phone). In the TV phone 100, a control unit 101 controls each function of the TV phone. The control unit 101 includes a CPU 102 and performs control under a program. A wireless unit 103 communicates with the outside through an antenna: it modulates image and voice signals, converts them to radio-frequency signals, and transmits them, and it demodulates image and voice signals from received radio-frequency signals. Reference numeral 104 denotes an LCD display unit, 105 an operation unit with various keys for input operations, and 106 a memory unit that stores dictionary data, a phone book, an address book, and the like. The display unit 104 displays the video received from the other party's TV phone via the wireless unit 103, the character string obtained by converting the other party's voice data to text, and the image captured by the user's own camera. The operation unit 105 is used to start a call when a voice or videophone call arrives, to place a videophone call, and to set the hands-free and text display modes. Reference numeral 107 denotes a mouthpiece/microphone, 108 a speaker, and 109 a camera unit. The mouthpiece/microphone 107 picks up the user's voice during a voice or videophone call and converts it into an electrical signal. The speaker 108 outputs the ring tone for voice and videophone calls and, when hands-free is set, the other party's voice. The camera unit 109 films the user while the videophone function is in use. The TV phone further includes a data analysis unit 110 and a text conversion unit 111. The data analysis unit 110 analyzes the video signal and the voice data sent from the other party and outputs voice data. The text conversion unit 111 converts the voice data analyzed by the data analysis unit 110 into text data.

FIG. 2 is a plan view showing the external appearance of the TV phone opened out from its folded state. External parts corresponding to the blocks described in FIG. 1 carry the same reference numerals.
In the figure, the upper panel 200 carries an earpiece 21 for listening to the received voice by ear, the camera 109, and the display unit 104. The display unit shows the other party's video 22 and the user's own video 23 captured by the user's camera. The lower panel carries the mouthpiece/microphone 107, the speaker 108, and the operation unit 105, on which buttons for various functions are arranged. When a videophone call arrives, the buttons of the operation unit 105 are used to start the call and to set the hands-free function, the text display function, and so on.

FIG. 3 shows the screen of the display unit of the TV phone when the text display mode is set, in (a), and when the hands-free mode is set, in (b).

Referring to FIG. 3(b), in the hands-free mode the display unit shows, in addition to the other party's video and the user's own video, the indication 'sound', which shows that hands-free operation is active. The other party's voice is emitted as amplified sound from the speaker 108 rather than from the earpiece 21. If a call arrives while hands-free is set and the start button or TV button is pressed, the other party normally starts talking and amplified sound is emitted at once. This is very convenient, since the user can listen without bringing the earpiece to the ear or putting on an earphone, but it is often unsuitable in public places.

FIG. 3(a) shows the screen and the voice output when, in such an unsuitable environment, the text display mode has been set so that the other party's voice data is converted to text on the receiving side and shown on the display screen when a videophone call arrives. When the text display mode is set with the operation unit 105, the indication 'text' appears in the display area 25 of the display unit 104 to show that the mode is set. When the text display mode is selected for an incoming call and the start button is pressed, the other party's voice signal is converted to text and shown in the text display area of the display unit as the character string 24, for example "Hello, this is Suzuki...". The voice is output from the earpiece 21 but is not emitted from the speaker. The conversion to text is based on the voice signal and on the video signal that displays the other party's video 22; the details are described later.

The TV phone of the present invention can therefore switch between the hands-free mode and the text display mode to suit its surroundings while carrying on a call with video. The mode can be switched as needed during a call by operating the operation unit 105. The user's own voice is picked up by the mouthpiece/microphone 107, but it is only as loud as the ordinary voice used on a normal phone. So long as ordinary phone use is permitted, even in a public place, it is far quieter than the speaker sound of a hands-free call.

Next, conversion from voice data to text is described with reference to FIG. 4. In this embodiment the text is created from the received voice data. Under the control of the control unit 101, the data analysis unit 110 analyzes the received video and voice data of the other party. From the presence or absence of voice data and from the other party's video data (for example, mouth movement), it determines whether the other party is speaking. The presence or absence of voice data and the other party's video yield the following four patterns.

In pattern 1, there is no voice data and no movement in the other party's video. No voice data is extracted in this state, so no text conversion or character string display takes place. FIG. 4(a) shows the upper panel in pattern 1: the image of the other party is shown on the display unit 104, and the mouth 41 in that image is closed.

In pattern 2, voice data is present but there is no movement in the other party's video. The voice data is regarded as noise and is not extracted, so no text conversion or character string display takes place. FIG. 4(b) shows pattern 2: sound 42 is emitted from the earpiece 21, but the mouth 41 is closed, and the sound in this state is regarded as noise.

In pattern 3, there is no voice data but there is movement in the other party's video. The other party is regarded as making a movement other than speech (for example a yawn, in which the mouth moves but no voice is output), and no voice data is extracted. No text conversion or character string display takes place. FIG. 4(c) shows pattern 3: the mouth 41 is open, but there is no received sound.

In pattern 4, voice data is present and there is movement in the other party's video. The other party is regarded as speaking, and voice data is extracted. Text conversion and character string display are performed on the extracted data. In FIG. 4(d), voice 42 is emitted from the earpiece 21, the mouth 41 is open, and the character string obtained by converting the received voice data to text is displayed in the area 24 below the other party's image.
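
The four patterns amount to a two-input decision table. The sketch below restates it in Python for illustration only; the enum member names and function names are hypothetical labels, and only the pattern numbers and outcomes come from the description.

```python
from enum import Enum


class Pattern(Enum):
    SILENT_STILL = 1       # pattern 1: no voice data, no mouth movement
    NOISE = 2              # pattern 2: voice data, no mouth movement
    NON_SPEECH_MOTION = 3  # pattern 3: mouth movement, no voice data (e.g. a yawn)
    SPEECH = 4             # pattern 4: voice data and mouth movement


def classify(audio_present: bool, mouth_moving: bool) -> Pattern:
    """Map the two observations onto patterns 1 to 4."""
    if audio_present and mouth_moving:
        return Pattern.SPEECH
    if audio_present:
        return Pattern.NOISE
    if mouth_moving:
        return Pattern.NON_SPEECH_MOTION
    return Pattern.SILENT_STILL


def extracts_audio(pattern: Pattern) -> bool:
    """Only pattern 4 leads to extraction, text conversion, and display."""
    return pattern is Pattern.SPEECH
```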

Mouth movement can be detected with known techniques such as feature point extraction.
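
As one possible reading of feature-point-based detection, the sketch below assumes that some upstream detector has already located an upper-lip and a lower-lip point in each frame, and treats the mouth as moving when the lip gap varies across recent frames. The detector, the pixel units, and the `min_change` threshold are assumptions; the document names no specific algorithm.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixels, from a hypothetical landmark detector


def mouth_gap(upper_lip: Point, lower_lip: Point) -> float:
    """Vertical distance between the two lip feature points."""
    return abs(lower_lip[1] - upper_lip[1])


def mouth_is_moving(gaps: Sequence[float], min_change: float = 2.0) -> bool:
    """True if the lip gap varies by more than min_change over the given frames."""
    if len(gaps) < 2:
        return False
    return max(gaps) - min(gaps) > min_change
```

A closed, steady mouth yields nearly constant gaps and is reported as not moving; opening and closing produces a spread larger than the threshold.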

FIG. 5 is a flowchart for explaining the operation of the TV phone of the present invention. The operation is described with reference to FIG. 5 and FIGS. 1 to 4.

When TV phone A places a call to TV phone B (step 501), the call arrives at TV phone B via the communication line (step 502). The videophone call starts when the start button or the TV button of the operation unit 105 is operated (step 503). The control unit 101 checks whether the hands-free mode is set (step 504). If it is, the videophone call continues in that mode (step 512): the other party's voice is emitted from the speaker 108 and the call proceeds with the other party's image shown on the display.

If the hands-free mode is not set, the control unit checks whether the text display mode is set (step 505). If it is not, the videophone call continues (step 512). In this case the voice comes from the earpiece, so it is hard to hear while the user is looking at the screen.
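
The two checks in steps 504 and 505 give the hands-free setting priority over the text display setting. A minimal sketch of that ordering follows; the function name and the returned mode labels are hypothetical.

```python
def select_mode(hands_free: bool, text_display: bool) -> str:
    """Choose the operating mode in the order of steps 504 and 505."""
    if hands_free:        # step 504: yes -> continue the call on the speaker (step 512)
        return "hands_free"
    if text_display:      # step 505: yes -> go on to the voice data check (step 506)
        return "text_display"
    return "earpiece"     # step 505: no -> ordinary call through the earpiece (step 512)
```

Note that when both settings are on, the hands-free branch is taken and no text is displayed.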

If the text display mode is set, the data analysis unit 110 checks whether the received signal contains voice data (step 506). If there is none, it checks whether the videophone call is still in progress (step 511); if so, it returns to check for voice data (step 506). If the call is no longer in progress at step 511, disconnection processing is performed (step 513).

If voice data is present at step 506, the data analysis unit 110 checks whether the mouth is moving in the other party's video (step 507). If it is not, processing goes to step 511 to check whether the call is still in progress; if so, the loop of checking for voice data and checking for mouth movement in the other party's video is repeated.

If the mouth is moving in the other party's video at step 507, the data analysis unit 110 extracts the voice data (step 508). The extraction result, that is, the voice data spoken by the other party, is supplied to the text conversion unit. The text conversion unit 111 performs voice recognition on the extraction result and converts it to characters (step 509), and the text is shown on the display unit (step 510). After the text is displayed, the check at step 511 is made; if the call is still in progress, the next voice data and mouth movement are checked, and text is displayed piece by piece. When step 511 eventually finds that the call is no longer in progress, disconnection processing is performed (step 513) and the operation ends.
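
The loop through steps 506 to 513 can be sketched as below. This is an illustration under simplifying assumptions: each received interval is modeled as a hypothetical `Frame` tuple whose two flags have already been computed, the end of the iterable stands in for the call ending at step 511, and `recognize` and `display` are caller-supplied placeholders for the text conversion unit and the display unit.

```python
from typing import Callable, Iterable, Tuple

# (voice data present, mouth moving, audio payload); a hypothetical representation
Frame = Tuple[bool, bool, str]


def run_text_display(frames: Iterable[Frame],
                     recognize: Callable[[str], str],
                     display: Callable[[str], None]) -> None:
    """Process frames until the call ends (the iterable is exhausted)."""
    for audio_present, mouth_moving, audio in frames:  # step 511: call in progress
        if not audio_present:    # step 506: no voice data
            continue
        if not mouth_moving:     # step 507: mouth not moving, treat as noise
            continue
        extracted = audio        # step 508: extract the voice data
        text = recognize(extracted)  # step 509: voice recognition
        display(text)            # step 510: show the text
    # step 513: disconnection processing would follow here
```

For example, feeding in one frame of each of the four patterns with `str.upper` as a stand-in recognizer displays only the frame in which voice data and mouth movement coincide.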

The text on the display unit 104 may be shown as a ticker in which the character string scrolls from right to left, as in FIG. 6(a), or on multiple lines, as in FIG. 6(b).
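
The two display styles can be illustrated as follows. This is a sketch only: the display width in characters, the function names, and the use of Python's `textwrap` (which breaks at whitespace, so it suits space-separated text better than Japanese) are assumptions.

```python
import textwrap
from typing import List


def ticker_frames(text: str, width: int) -> List[str]:
    """Successive windows of a ticker that scrolls from right to left."""
    padded = " " * width + text + " " * width
    return [padded[i:i + width] for i in range(len(padded) - width + 1)]


def multi_line(text: str, width: int, max_lines: int) -> List[str]:
    """Wrap the text to the display width and keep the most recent lines."""
    return textwrap.wrap(text, width)[-max_lines:]
```

In the ticker, the text enters at the right edge and leaves at the left; in the multi-line form, older lines drop off as new text arrives.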

Next, a second embodiment of the present invention is described. In the first embodiment, voice data was converted to text and the character string was displayed on the screen. In this embodiment, the text information can be stored in a memory inside or outside the TV phone. Rather than simply converting all voice data to text data and storing it, only the voice data associated with movement of the other party's mouth is converted to text data and stored.

FIG. 7 schematically shows storage in an external storage device 61. The screen of the display unit 104 of the TV phone shows the other party's video 22 and the user's own video 23 captured by the user's camera, and the mode display area 25 shows 'memory' to indicate the save mode. While this mode is set, voice data is converted to text and the text is stored in the storage device.

In the second embodiment, voice data is converted to text data in the same way as in the first embodiment. The operation of the second embodiment is therefore described using the flowchart of FIG. 8, with reference also to FIGS. 1 to 4 used for the first embodiment.

When TV phone A places a call to TV phone B (step 801), the call arrives at TV phone B via the communication line (step 802). The videophone call starts when the start button or the TV button of the operation unit 105 is operated (step 803). The control unit 101 checks whether the hands-free mode is set (step 804). If it is, the videophone call continues in that mode (step 812): the other party's voice is emitted from the speaker 108 and the call proceeds with the other party's image shown on the display.

If the hands-free mode is not set, the control unit checks whether the memory save mode is set (step 805). If it is not, the videophone call continues (step 812). In this case the voice comes from the earpiece, so it is hard to hear while the user is looking at the screen.

If the memory save mode is set, the data analysis unit 110 checks whether the received signal contains voice data (step 806). If there is none, it checks whether the videophone call is still in progress (step 811); if so, it returns to check for voice data (step 806), and this loop is repeated. If the call is no longer in progress at step 811, disconnection processing is performed (step 813).

If voice data is present in step 806, the data analysis unit 110 checks whether the mouth is moving in the other party's video (step 807). If the mouth is not moving, processing goes to step 811, which checks whether the videophone call is still in progress; if so, the loop of checking for voice data and checking for mouth movement in the other party's video is repeated.

If the mouth is moving in the other party's video in step 807, the data analysis unit 110 extracts the voice data (step 808). The extraction result, that is, the voice data uttered by the other party, is supplied to the text conversion unit. The text conversion unit 111 converts it into a character string (step 809) and saves the text information in the external memory 61 (step 810). Next, whether the videophone call is still in progress is checked (step 811); if so, the presence of further voice data and of mouth movement in the received video is checked, and the voice data is converted to text and saved to memory piece by piece. Character strings thus accumulate in the memory. Eventually, when step 811 determines that the videophone call is no longer in progress, communication disconnection processing is performed (step 813) and the process ends.
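The flow of steps 804 to 813 can be sketched roughly as follows. This is only an illustrative outline, not the implementation described in the specification: the `Frame` structure, the `run_call` function, and the `to_text` callback are hypothetical names, the mouth movement result is assumed to be already available per received unit, and a plain list stands in for the external memory 61.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Frame:
    """One received unit of the call (hypothetical structure for illustration)."""
    audio: Optional[bytes]  # None when the received signal carries no voice data
    mouth_moving: bool      # assumed result of analysing the other party's video


def run_call(frames: Iterable[Frame],
             hands_free: bool,
             save_mode: bool,
             to_text: Callable[[bytes], str]) -> List[str]:
    """Return the character strings saved during one call."""
    saved: List[str] = []  # stands in for the external memory 61

    # Steps 804 and 805: in hands-free mode, or without the memory storage
    # mode, the call simply continues (step 812) and nothing is saved.
    if hands_free or not save_mode:
        return saved

    # Running out of frames plays the role of step 811 deciding that the
    # call is over, after which disconnection (step 813) would follow.
    for frame in frames:
        if frame.audio is None:      # step 806: no voice data
            continue
        if not frame.mouth_moving:   # step 807: mouth not moving
            continue
        # Steps 808 to 810: extract, convert to text, save.
        saved.append(to_text(frame.audio))
    return saved
```

In this sketch `to_text` is a placeholder for whatever speech recognition the text conversion unit 111 performs; any callable that maps audio bytes to a string can be passed in.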

If call messages on the videophone are saved as text, as in the second embodiment, the amount of data is smaller than that of the voice data itself, so more messages can be saved. Saving them as text data also makes it possible to reuse the messages in documents, e-mail, and the like.

Furthermore, in this embodiment the voice data is not simply converted into text data and saved; the voice data to be converted is extracted in association with the video. That is, only voice data received while voice data is present and the mouth in the video data is moving is converted into text. The voice data sent from the transmitting side includes ambient sound picked up by the microphone in the environment where the transmitting videophone is used, but because only the voice data coinciding with mouth movement in the video is extracted and converted, text conversion of misrecognized or unnecessary voice data is suppressed. Misrecognized or unnecessary text data can therefore be kept out of the saved text data.
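The gating condition described here can be written as a small predicate. The following is a minimal sketch under stated assumptions: `Segment`, `should_extract`, and `extract` are hypothetical names, the segment labels and level values are made-up illustration data, and the optional `min_level` parameter reflects the variant recited in the claims, in which voice data at or below a predetermined level is not extracted even when the mouth is moving.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    """A stretch of the received call (hypothetical structure for illustration)."""
    label: str          # what the segment contains, for readability only
    has_audio: bool     # voice data present in the received signal
    mouth_moving: bool  # mouth movement detected in the received video
    level: float        # voice level in arbitrary units (assumed scale)


def should_extract(seg: Segment, min_level: float = 0.0) -> bool:
    # Extract only when audio is present AND the mouth is moving AND the
    # level is above the predetermined level (at or below it is skipped).
    return seg.has_audio and seg.mouth_moving and seg.level > min_level


def extract(segments: Iterable[Segment], min_level: float = 0.0) -> List[str]:
    """Return the labels of the segments that would be passed to text conversion."""
    return [s.label for s in segments if should_extract(s, min_level)]


timeline = [
    Segment("traffic noise", True, False, 0.6),       # ambient sound, mouth still
    Segment("hello", True, True, 0.8),                # speech with mouth movement
    Segment("silent lip movement", False, True, 0.0),  # movement but no audio
    Segment("whisper", True, True, 0.05),             # very quiet speech
]

print(extract(timeline))                 # mouth-movement gating only
print(extract(timeline, min_level=0.1))  # with the level threshold added
```

With the sample timeline, the ambient noise and the silent lip movement are dropped in both calls, and the quiet segment is additionally dropped once a threshold of 0.1 is applied.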

A block diagram of a mobile phone showing an embodiment of the present invention.
A plan view of the mobile phone of the embodiment with its upper panel opened.
A plan view showing the states of the text display mode and the hands-free mode used in the mobile phone of the embodiment.
A schematic diagram showing the relationship patterns between voice and video used in the mobile phone of the embodiment.
A flowchart showing an operation example of the mobile phone of the present invention.
A schematic diagram showing an example screen display of the text display used in the embodiment.
A diagram schematically showing the second embodiment of the present invention.
A flowchart showing an operation example of the mobile phone of the second embodiment of the present invention.

Explanation of symbols

21 Earpiece
22 Image of the other party
23
24 Text display area
25 Mode display area
61 External memory
100 Mobile phone with videophone function
101 Control unit
103 Radio unit
104 Display unit
105 Operation unit
108 Speaker
109 Camera
110 Data analysis unit
111 Text conversion unit

Claims

1. A mobile communication terminal with a videophone function, switchable between a text display mode and a voice output mode, comprising: means for referring to received video data and, from received voice data, not extracting the voice data in portions corresponding to the absence of movement of the other party's mouth in the received video, while extracting the voice data in portions corresponding to the presence of movement of the other party's mouth in the received video data; means for converting the extracted voice data into text data; and means for displaying a character string based on the text data, wherein the character string is displayed together with the received video when the videophone function is operating in the text display mode.

2. The mobile communication terminal with a videophone function according to claim 1, wherein the extraction means does not extract voice data when the voice level of the received data is equal to or lower than a predetermined level, even if there is movement of the other party's mouth in the received video.

3. The mobile communication terminal with a videophone function according to claim 1 or 2, wherein the terminal operates in the text display mode when a hands-free function mode, in which call voice is output from a speaker, is not set and the text display mode, which activates the character string display means, is set.

4. The mobile communication terminal with a videophone function according to claim 1, further comprising storage means for storing text data supplied by the text conversion means.

5. The mobile communication terminal with a videophone function according to claim 4, wherein the terminal operates in a saving mode when a hands-free function mode, in which call voice is output from a speaker, is not set and the saving mode, which starts saving of the text data, is set.

6. A method of displaying voice data as text in a mobile communication terminal with a videophone function that is switchable between a text display mode and a voice output mode and that displays text together with received video when the videophone function is operating in the text display mode, the method comprising: referring to received video data and, from received voice data, not extracting the voice data in portions corresponding to the absence of movement of the other party's mouth in the received video, while extracting the voice data in portions corresponding to the presence of movement of the other party's mouth in the received video data; converting the extracted voice data into text data; and displaying a character string based on the text data.

7. The method of displaying voice data as text in a mobile communication terminal with a videophone function according to claim 6, wherein the voice data is not extracted when the voice level of the received data is equal to or lower than a predetermined level, even if there is movement of the other party's mouth in the received video.

8. A method of converting voice data into text data for use in a mobile communication terminal with a videophone function that is switchable between a text display mode and a voice output mode and that displays text together with received video when the videophone function is operating in the text display mode, the method comprising: referring to received video data and, from received voice data, not extracting the voice data in portions corresponding to the absence of movement of the other party's mouth in the received video, while extracting the voice data in portions corresponding to the presence of movement of the other party's mouth in the received video data; and converting the extracted voice data into text data.

9. The method of converting voice data into text data for use in a mobile communication terminal with a videophone function according to claim 8, wherein the voice data is not extracted when the voice level of the received data is equal to or lower than a predetermined level, even if there is movement of the other party's mouth in the received video.

10. A videophone switchable between a text display mode and a voice output mode, comprising: means for referring to received video data and, from received voice data, not extracting the voice data in portions corresponding to the absence of movement of the other party's mouth in the received video, while extracting the voice data in portions corresponding to the presence of movement of the other party's mouth in the received video data; means for converting the extracted voice data into text data; and means for displaying a character string based on the text data, wherein the character string is displayed together with the video when the videophone is operating in the text display mode.