JPH09331385A

JPH09331385A - Voice image communication system and image information retrieval system

Info

Publication number: JPH09331385A
Application number: JP8150933A
Authority: JP
Inventors: Hitoshi Sato; 均佐藤; Toshiyuki Matsuda; 俊幸松田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1996-06-12
Filing date: 1996-06-12
Publication date: 1997-12-22

Abstract

PROBLEM TO BE SOLVED: To provide a voice image communication system that is convenient to use, in which retrieval and confirmation are made easy by associating a voice keyword with image information and telephone number information and storing them in a storage section. SOLUTION: A database 201 is stored in a storage section 107, linked to voice keywords 200. The database 201 holds an image information database 300 and a telephone number database 400 that is cross-referenced with the image information. A plurality of sets of image information are cross-referenced with one voice keyword, and one telephone number can be linked with a plurality of sets of image information. Thus, a plurality of objects can be retrieved with one voice keyword, and visual information such as the image information corresponding to the keyword is presented. Moreover, even if one voice keyword is forgotten, retrieval is still possible because a plurality of keywords are cross-referenced with one telephone number.

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a voice image communication system and an image information retrieval system, and more particularly to technology for systems with multimedia functions that integrate a computer, a telephone, image transmission and reception, television broadcast reception, and the like.

[0002]

2. Description of the Related Art

Telephones with a voice dial function, which place calls using voice recognition, and voice command control have been proposed. In voice dialing, for example, the user first utters a word corresponding to a telephone number; the feature values obtained by voice analysis are registered in memory as a standard pattern, associated with the telephone number at the word level. To place a call, instead of keying in the telephone number, the user utters the corresponding word; the feature values obtained by voice analysis are matched against the standard patterns to recognize the word, and the number information associated with the recognition result is sent to the telephone line (see Japanese Patent Laid-Open No. 2-143651). Present telephones, which have no image display, use voice output or character output as the means of confirming the voice recognition result.

[0003]

Recently, with the spread of ISDN (Integrated Services Digital Network) and the like, desktop videophones and videoconference systems that integrate a personal computer (PC) or workstation (WS) with camera input have been coming into wide use. To connect to the other parties, for example in a star-type multipoint videoconference, the user selects multiple parties either by keying in each telephone number or by selecting from names, telephone numbers, and the like registered in advance.

[0004]

Apart from telephony, methods of inputting and recording image information digitally, such as digital still cameras and digital video cameras, have recently been proposed. Such digital information is loaded into a personal computer or workstation and used for editing such as processing, correction, and modification, so a wide range of applications is conceivable. For retrieval, at present, if the image information carries time-series additional information such as a date or a serial number, a search can be performed with it using a keyboard or mouse.

[0005]

Problems to Be Solved by the Invention

In the methods known so far, a registration operation is required in advance as preparation for placing a call with the voice dial function, and this registration associates one telephone number with one voice keyword. The reason is that, in the conventional methods, the means of confirming the result are limited, which makes it difficult to assign a plurality of telephone numbers to one voice keyword. Consequently, to call a plurality of parties the user must remember a different voice keyword for each and utter each one, which is troublesome and inconvenient.

[0006]

If a plurality of telephone numbers are associated with one voice keyword, the problem arises of how to confirm the plurality of telephone numbers obtained by voice recognition.

[0007]

Meanwhile, for digital information such as images, time-series additional information such as the date, time, or serial number is not enough on its own: once the data has been loaded into a personal computer or workstation, searching is not easy if the user cannot pin down the date or time of shooting. As the amount of information such as images stored in memory grows, the confirmation work needed for searching increases, which is a nuisance for the user and may prevent the intended editing work from being done adequately.

[0008]

An object of the present invention is to provide an easy-to-use voice image communication system in which, even when a plurality of telephone numbers are registered in association with one voice keyword, the desired telephone number can be identified reliably and easily by using image information.

[0009]

Another object of the present invention is to provide an image information retrieval system combining visual information such as image information with voice information, in which image information is registered with a voice information index when it is loaded into a personal computer or workstation, so that search and confirmation are easy.

[0010]

Means for Solving the Problems

To achieve the above objects, the voice image communication system of the present invention is provided with: a voice input unit to which voice is input; a key input unit through which a telephone number is input by key or mouse; a telephone number storage unit that stores, for redialing, the telephone number input from the key input unit; an image storage unit that captures a still image of the other party; and a storage unit that stores, in association with one another, the telephone number stored for redialing, the still image of the other party, and the voice input from the voice input unit.

[0011]

The voice image communication system of the present invention is also provided with a storage unit that, at keyword registration, stores a voice keyword in association with a plurality of pieces of image information corresponding to it.

[0012]

The voice image communication system of the present invention is further provided with an image display unit that, when a recognition result is displayed and the input voice keyword corresponds to a plurality of images, can read the corresponding image information from the storage unit and display it in a plurality of divided areas. This lets the user confirm the result visually with ease.

[0013]

The image information retrieval system of the present invention is provided with a storage unit that registers, for image information, a corresponding voice keyword as an index, and an image display unit that can read the corresponding image information from the storage unit and display it in a plurality of divided areas.

[0014]

Embodiments of the Invention

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram of a terminal device (a terminal device with a telephone function) of a voice image communication system according to one embodiment of the present invention.

[0015]

In FIG. 1, 100 is a key input unit for inputting a telephone number; 101 is a display unit that displays telephone numbers and other messages; 102 is an image input unit for inputting still images, moving images, and the like; 103 is an image display unit that outputs the image of the other party or of the user; 104 is a voice input unit for inputting voice; 105 is a voice output unit that outputs the voice of the other party or of the user, or other audio such as a registration confirmation message; 106 is a control unit that controls each unit in the terminal device; 107 is a storage unit that stores the databases of voice keywords, image information, and telephone number information, together with a database of the correspondences among them; 108 is an image processing unit that analyzes input images; 109 is an image CODEC for transmitting images; 110 is a voice processing unit that analyzes input voice; 111 is a voice CODEC for transmitting voice; and 112 is a line connection unit for connecting to a line network such as the public network. The terminal device with a telephone function of this embodiment is built around units 100 to 112. Reference numeral 113 denotes a line network such as the widely used public telephone network or ISDN.

[0016]

In this embodiment, the user first goes off-hook and inputs a telephone number from the key input unit 100, such as a numeric keypad. The number is temporarily stored in the redial memory. It is then sent from the line connection unit 112, under control of the control unit 106, to the line network 113 and passed to an ordinary telephone exchange, which connects the call. Once the other party has gone off-hook and image information is being received, a frame at some point in time is captured as a still image, to be used as visual confirmation data the next time the call is placed. In addition, the voice signal input from the voice input unit 104 is analyzed by the voice processing unit 110 and converted into feature parameters. Registration is completed by storing these feature parameters, the telephone number input from the key input unit 100, and the captured still image in association with one another.

[0017]

When voice is input from the voice input unit 104 in the off-hook state with no telephone number entered from the key input unit 100, the voice processing unit 110 obtains the feature parameters, computes their similarity to the standard patterns of the voice keywords in the storage unit 107, and finds the word with the highest similarity. The image information corresponding to that keyword is then looked up in the storage unit 107, the control unit 106 sends the multiple items to the image processing unit 108, and the image display unit 103 shows them together in a split-screen multi-display. The user selects or confirms on the displayed screen; the telephone number corresponding to the selected image is read from the storage unit 107 into the control unit 106, passed to the line connection unit 112, and dialed onto the line network 113.
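
The matching step in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify the feature parameters or the similarity measure, so fixed-length feature vectors and cosine similarity are assumed here, and the names `cosine_similarity`, `recognize`, and the sample keywords are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recognize(features, standard_patterns):
    """Rank registered keywords by similarity to the input features, best first.

    standard_patterns maps each keyword to its registered standard pattern.
    """
    scored = [
        (keyword, cosine_similarity(features, pattern))
        for keyword, pattern in standard_patterns.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical standard patterns for two registered keywords.
patterns = {"sato": [1.0, 0.0, 0.0], "matsuda": [0.0, 1.0, 0.0]}
ranking = recognize([0.9, 0.1, 0.0], patterns)
best_keyword = ranking[0][0]
```

Returning the full ranking rather than only the best keyword keeps the next candidates available, which the later description of similarity-based display sizing relies on.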

[0018]

As shown in FIG. 2, the storage unit 107 stores a database 201 linked to each voice keyword 200.

[0019]

As shown in FIG. 3, the database 201 in the storage unit 107 holds an image information database 300, in which a plurality of pieces of image information are stored in association with one voice keyword. Conversely, a plurality of voice keywords are stored in association with one piece of image information.

[0020]

Further, as shown in FIG. 4, the database 201 in the storage unit 107 holds the image information database 300 and a corresponding telephone number database 400. A plurality of pieces of image information correspond to one voice keyword, and one telephone number can be linked to a plurality of pieces of image information. If a plurality of keywords are thus associated with one telephone number, the user can still cope after forgetting one of the voice keywords, so a broad retrieval system can be built without restricting the spoken search vocabulary.
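
The correspondences of FIGS. 2 to 4 amount to a many-to-many link between keywords and images, plus a telephone number per image. A minimal in-memory sketch follows; the class `KeywordDirectory`, its method names, the image identifiers, and the telephone numbers are all hypothetical, and the source does not say how the databases are actually laid out.

```python
from collections import defaultdict


class KeywordDirectory:
    """Hypothetical stand-in for databases 201, 300, and 400."""

    def __init__(self):
        self.images_by_keyword = defaultdict(set)  # keyword -> image ids
        self.keywords_by_image = defaultdict(set)  # image id -> keywords
        self.phone_by_image = {}                   # image id -> telephone number

    def register(self, keyword, image_id, phone_number):
        """Link one keyword, one image, and the image's telephone number."""
        self.images_by_keyword[keyword].add(image_id)
        self.keywords_by_image[image_id].add(keyword)
        self.phone_by_image[image_id] = phone_number

    def lookup(self, keyword):
        """Return (image id, telephone number) pairs for a keyword."""
        images = self.images_by_keyword.get(keyword, ())
        return sorted((img, self.phone_by_image[img]) for img in images)

    def keywords_for_number(self, phone_number):
        """Return every keyword that reaches the given telephone number."""
        found = set()
        for image_id, number in self.phone_by_image.items():
            if number == phone_number:
                found |= self.keywords_by_image[image_id]
        return sorted(found)


directory = KeywordDirectory()
directory.register("sales", "img_sato", "03-0000-0001")
directory.register("sales", "img_matsuda", "03-0000-0002")
directory.register("sato", "img_sato", "03-0000-0001")
```

Here `directory.lookup("sales")` yields two candidates to display, while `directory.keywords_for_number("03-0000-0001")` shows that the same number stays reachable through either keyword if the other is forgotten.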

[0021]

FIG. 5 shows that, by placing the storage unit 107 on the server 500 side, a plurality of clients 504 can share access to the database server 501 through a LAN 503 and a public line 502.

[0022]

FIG. 6 is a flowchart of the sequence for registering a voice keyword and image information against telephone number information in a videophone, videoconference system, or the like to which this embodiment is applied.

[0023]

First, from the off-hook state of step 600, when a telephone number is input from the key input unit 100 in step 601, the process proceeds to step 602 and the telephone number is stored in the telephone number database 400 of the storage unit 107. The number is sent from the line connection unit 112, under control of the control unit 106, to the line network 113 and passed to an ordinary telephone exchange, which connects the call. In step 603, it is determined whether the other party's telephone can send and receive video signals. If it can, the process proceeds to step 604, where the video signal that the other party sends after going off-hook is received. Then, in step 605, the received video signal is captured as a still image from the image processing unit 108 via the image CODEC 109. In this processing, the control unit 106 captures a frame at some point in time of the incoming image as a still image. For example, after the other party goes off-hook, the voice processing unit 110 detects the voice section in which the party is speaking, and the control unit 106 automatically captures the image received at that moment from the image processing unit 108. Alternatively, the user may be allowed to register any received image in the storage unit 107 using a selection key or the like on the key input unit 100. An image can also be registered by having the other party send a selected image. In step 608, the captured image information is stored in the image database 300, and the process proceeds to step 609.

[0024]

If, on the other hand, it is determined in step 603 that the called party's telephone cannot send and receive video signals, the process proceeds to step 606, where the user chooses whether to register image information. To register an image, the process proceeds to step 607, where the user inputs a still image such as a photograph or drawing from the image input unit 102. Alternatively, instead of capturing an image, substitute character information can be entered from the key input unit 100 using a mouse or the like and stored, so that it can be checked later.

[0025]

After the call ends, the process proceeds to step 609 and the user goes on-hook. In step 610, the voice keyword input from the voice input unit 104 is stored in memory. As for voice keyword registration, voice recognition used to require that each word to be recognized be registered by voice to create a speaker-dependent standard pattern, but there are now also methods that create speaker-independent standard patterns from freely registered text data. Accordingly, the index can be registered by entering text from the key input unit 100 with a keyboard or mouse, by inputting voice and converting it to text data, or by storing the voice information itself in the storage unit 107. The process then proceeds to step 611, where the voice keyword is stored in the storage unit 107 in association with the image information and the telephone number information.

[0026]

FIG. 7 shows an example of a method of capturing a still image from a moving image. It is important that the still image can be captured simply, without making the user conscious of the operation. It is also important that capture be reliable, avoiding as far as possible failures that force a retry, such as when the other party cannot be identified from the captured image or when the image is not the one the user intended.

[0027]

People normally converse while looking at each other's face, so in a videophone call or videoconference too, a person who is speaking is likely to be facing forward, looking at the camera or at the picture on the screen. Therefore, the voice signal sent while the other party is talking is passed through the voice CODEC 111, and the voice processing unit 110 detects a voice section 701. Within that voice section, images are captured from the incoming moving image, for example at a fixed period T. A memory 703 is provided for temporary storage, and the captured image data is stored in it sequentially. Because the capacity of the memory 703 is limited, once it is full its contents are overwritten with newer data so that the most recent information is always held, which saves memory.
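
The capture scheme of FIG. 7 can be sketched as a bounded buffer that samples frames at period T only inside detected voice sections. The sketch assumes that voice sections are already available as (start, end) timestamps and that the oldest frame is the one overwritten when the buffer is full; the function name `capture_frames` and the millisecond timestamps are hypothetical.

```python
from collections import deque


def capture_frames(frames, voice_sections, period, capacity):
    """Sample frames at a fixed period inside voice sections.

    frames: iterable of (timestamp, frame) in time order.
    voice_sections: list of (start, end) timestamps, assumed precomputed.
    capacity: size of the temporary memory; the oldest frame is dropped
    when it is full (an assumption about the overwrite policy).
    """
    buffer = deque(maxlen=capacity)
    next_capture = None
    for timestamp, frame in frames:
        speaking = any(start <= timestamp <= end for start, end in voice_sections)
        if not speaking:
            next_capture = None  # restart the period at the next voice section
            continue
        if next_capture is None or timestamp >= next_capture:
            buffer.append((timestamp, frame))
            next_capture = timestamp + period
    return list(buffer)


# Ten frames, 100 ms apart; the other party speaks from 200 ms to 700 ms.
frames = [(t, f"frame{t}") for t in range(0, 1000, 100)]
kept = capture_frames(frames, [(200, 700)], period=200, capacity=2)
```

With these inputs, frames at 200, 400, and 600 ms are sampled, and the two-slot buffer retains only the last two for the user to choose from.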

[0028]

The control unit 106 then reads the contents of the memory 703 in order and sends the data to the image processing unit 108, and the image display unit 103 shows multiple frames in chronological order on a multi-screen. The still image that the user confirms and selects is stored from the memory 703 into the storage unit 107. Because frames are held temporarily in the memory 703 in this way, a still image can be registered simply and reliably not only after the call starts but also after it ends.

[0029]

FIG. 8 is a flowchart of the sequence in which a voice keyword is uttered, recognized, and used to place a call in a videophone, videoconference system, or the like to which this embodiment is applied.

[0030]

First, from the off-hook state of step 800, when the user utters a voice keyword in step 801, the voice input unit 104 captures it and the process proceeds to step 802. In step 802, the voice processing unit 110 performs recognition and sends the result to the control unit 106. In step 803, the control unit 106 looks up the information corresponding to the recognition result in the storage unit 107. In step 804, it checks whether the image database 300 contains image information for that result. If so, the process proceeds to step 805; if not, it proceeds to step 810.

[0031]

In step 805, it is further determined whether a plurality of pieces of image information and telephone numbers correspond to the one voice keyword. If so, the process proceeds to step 806; otherwise it proceeds to step 807.

[0032]

In step 806, the control unit 106 sends the multiple items to the image processing unit 108, the image display unit 103 shows them in a split-screen multi-display, and the process proceeds to step 808. This makes visual confirmation easier for the user.

[0033]

In step 807, on the other hand, the control unit 106 likewise sends the data to the image processing unit 108, and the image display unit 103 shows it on a single screen. Here, if not only the image information for the keyword with the highest similarity from recognition but also that of the next candidates is shown at the same time in a multi-screen display, the size of each tile can be allotted in proportion to its similarity, making visual confirmation even easier.

[0034]

In step 808, the user selects or confirms using the displayed images. In step 809, the telephone number corresponding to the selected image is read from the telephone number database 400 of the storage unit 107. In step 810, the control unit 106 sends that telephone number to the line connection unit 112, which dials it onto the line network 113.
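
Steps 803 to 810 can be summarized in a short sketch. It is a hedged reading of the flow, not the patented implementation: the function name `resolve_number`, the `choose` callback standing in for the user's selection on screen, and the `fallback_numbers` table used when no image is registered are all assumptions, since the source does not state which number is dialed on the direct path from step 804 to step 810.

```python
def resolve_number(keyword, images_by_keyword, phone_by_image, fallback_numbers, choose):
    """Return (layout, telephone number) for a recognized keyword.

    layout is "none" (no image registered), "single", or "multi".
    choose(images, layout) stands in for the user's on-screen selection.
    """
    images = images_by_keyword.get(keyword, [])      # step 803: look up
    if not images:                                   # step 804: no image
        return "none", fallback_numbers.get(keyword)
    # Step 805: several images give a split display (806), one gives a single display (807).
    layout = "multi" if len(images) > 1 else "single"
    selected = choose(images, layout)                # step 808: select or confirm
    return layout, phone_by_image[selected]          # step 809: read the number


images_by_keyword = {"sales": ["img_sato", "img_matsuda"], "sato": ["img_sato"]}
phone_by_image = {"img_sato": "03-0000-0001", "img_matsuda": "03-0000-0002"}
fallback_numbers = {"home": "03-0000-0003"}


def pick_first(images, layout):
    """Trivial selection used only for this example."""
    return images[0]


layout, number = resolve_number(
    "sales", images_by_keyword, phone_by_image, fallback_numbers, pick_first
)
```

The returned number would then be passed to the line connection unit for dialing (step 810), which is outside the scope of this sketch.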

[0035]

As described above, displaying on the screen the multiple pieces of image information that correspond to one voice keyword means that, when calling several parties at once, for example in a multipoint TV conference system on personal computers or workstations, the call can be placed just by uttering one keyword, which saves the effort of selecting each party individually. When one telephone number corresponds to one keyword, visual confirmation can be made easier by showing its image larger than those of the next candidates, according to the similarity obtained from recognition. With telephones that can send and receive video signals, if the other party's image information is retained, then in cases where the user remembers the party's face but not the telephone number, the user can utter the voice keyword, display the image, and confirm the party, which avoids misdialing. The method therefore has many advantages: for instance, when the same voice keyword is assigned to several parties, as when their names are alike, the party can be confirmed from the image even if not from the telephone number. Using image information in this way makes visual confirmation easy, allows a plurality of telephone numbers to be associated with the same voice keyword, and reduces the burden of having the user select each one.

[0036]

Thus voice dialing can be realized in combination with reference to image information, which greatly improves convenience.

[0037]

An image information retrieval system is conceivable as an application of the present system other than dialing. For example, when image information is loaded into a personal computer or workstation with an image scanner or the like for editing, a search function becomes necessary as the amount of information held in memory grows. There is also a strong desire to shorten retrieval time for frequently referenced information so that editing can proceed efficiently. In such cases, retrieval is easy if a voice index corresponding to the image information has been attached.

[0038]

FIG. 9 shows a flowchart of the registration sequence in an image information retrieval system as another embodiment of the present invention. The configuration of this image information retrieval system is the same as that of FIG. 1; where there is no need to communicate with the outside through the line network 113, the line connection unit 112 can be omitted.

[0039] First, in step 900, image information is input to a personal computer or workstation from the image input unit 102, such as an image scanner. Then, in step 901, this digital data is stored in the image database 300 of the storage unit 107, and the process proceeds to step 902. When still images are captured from a moving image and audio has been recorded in time synchronization with the moving image data, the voice sections can be detected, as described with reference to FIG. 7, and the moving image can be divided into a plurality of still images in synchronization with those sections. In step 902, the system confirms whether a voice index is to be added to the captured image data. If so, the process proceeds to step 903, where the index is added. A voice index can be assigned by entering text with a keyboard or mouse, by inputting speech and converting it into text data, or by storing the voice information itself in the storage unit 107. Then, in step 904, the association between the input image information and the index information is stored in the storage unit 107, and registration ends. In addition, if captured data is stored as original information, then when the original is later processed or edited, the same index as the original can be attached automatically, without adding index information each time.
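The registration flow of steps 900 to 904 can be sketched roughly as follows. This is a minimal illustration only, not the patented implementation: the class `ImageIndexStore`, its method names, and the image identifiers are hypothetical, and the index is assumed to be held as a text keyword (one of the three options the description mentions).

```python
from dataclasses import dataclass, field


@dataclass
class ImageIndexStore:
    """Hypothetical stand-in for storage unit 107 (not from the patent text)."""
    images: dict = field(default_factory=dict)  # image_id -> raw image data (image database 300)
    index: dict = field(default_factory=dict)   # index keyword -> list of image_ids

    def register(self, image_id, data, keyword=None):
        # Steps 900-901: take in the image and store the digital data.
        self.images[image_id] = data
        # Step 902: an index is optional; None means the user declined.
        if keyword is not None:
            # Steps 903-904: attach the index and record the association.
            self.index.setdefault(keyword, []).append(image_id)

    def register_edited(self, image_id, data, original_id):
        # Store an edited copy and give it every keyword of its original,
        # so the index does not have to be entered again.
        self.images[image_id] = data
        for ids in self.index.values():
            if original_id in ids and image_id not in ids:
                ids.append(image_id)


store = ImageIndexStore()
store.register("scan-001", b"...", keyword="floor plan")
store.register("scan-002", b"...", keyword="floor plan")
store.register("scan-003", b"...")  # no index attached
store.register_edited("scan-001-crop", b"...", "scan-001")
```

Holding a list of image identifiers per keyword reflects the idea that one keyword may correspond to several images; a real system would also need to handle stored voice data rather than text.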

[0040] FIG. 10 shows a flowchart of the recognition sequence in the image information retrieval system as another embodiment of the present invention.

[0041] The processing flow in FIG. 10 is the same as that described with reference to FIG. 8, except that it omits the off-hook operation of step 800, the check in step 804 for whether image information exists, the reading of the telephone number from the database in step 809, and the call origination to the line network 113 in step 810.
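With the telephone-specific steps removed, what remains of the flow is essentially a lookup from a recognized keyword to the images registered under it. The sketch below assumes that speech recognition has already produced a text keyword; the function `retrieve` and the sample data are hypothetical and serve only to show one keyword returning several candidate images for display.

```python
def retrieve(index, images, recognized_keyword):
    """Return every (image_id, data) pair registered under the keyword."""
    return [
        (image_id, images[image_id])
        for image_id in index.get(recognized_keyword, [])
    ]


# Sample contents of the storage unit (illustrative values only).
images = {"scan-001": b"A", "scan-002": b"B", "scan-003": b"C"}
index = {"floor plan": ["scan-001", "scan-002"]}

hits = retrieve(index, images, "floor plan")   # two candidates to show side by side
misses = retrieve(index, images, "elevation")  # unknown keyword gives an empty list
```

Returning a list rather than a single image leaves the final choice to the user, who can pick from the candidates shown on the display.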

[0042] In the image information retrieval system described above, a user who is viewing one screen during editing and wants to refer to different image information can retrieve it with a voice keyword. No key or mouse operation is needed to refer to items within the same category or directory, and even when the hierarchy grows deeper as the amount of information increases, retrieval by voice avoids the effort of navigating between levels and checking the results. Furthermore, if a voice index has been added to an original image, then even when searching after editing, the results can be displayed linked in a tree to the original image information, so that the user can confirm which original image an edited image came from and can retrieve other information related to the original image at the same time. This further widens the freedom of retrieval. Beyond what has been described so far, it is of course also possible, when image information is stored on a digital camera or digital video camera, to add a voice keyword at the same time as part of the data format and then import the data as is into a personal computer or the like for use.
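The tree-shaped link between edited images and their originals can be illustrated with a simple parent map. This is a sketch under stated assumptions: the `parent` dictionary, the function names, and the image identifiers are hypothetical, and the links are assumed to form a tree with no cycles.

```python
def trace_to_original(parent, image_id):
    """Follow parent links upward until an image with no parent is reached."""
    while image_id in parent:
        image_id = parent[image_id]
    return image_id


def derived_from(parent, original_id):
    """List every image whose chain of parent links ends at original_id."""
    return [i for i in parent if trace_to_original(parent, i) == original_id]


# Each edited image points to the image it was made from (illustrative data).
parent = {"crop": "scan-001", "crop-gray": "crop", "resize": "scan-002"}

root = trace_to_original(parent, "crop-gray")   # which original was this edited from?
family = derived_from(parent, "scan-001")       # everything related to that original
```

Looking up the original first and then its derived images is one way to retrieve related information together, as the paragraph above describes.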

[0043] The image information retrieval system described above also lends itself to the following uses. For example, when explanatory materials are to be shown on screen during an audiographics conference, storing voice keywords and image information in association with each other in the storage unit 107 beforehand allows the materials to be retrieved by voice from the storage unit 107 when needed, without delay. Furthermore, if speaker-independent voice keywords are used, the voice keyword serving as the index can be sent to the other party's terminal together with the image information, so that other users can use the same keyword on their own terminals. This is a significant advantage: the recipients do not need to register the keyword themselves, and retrieval time is greatly shortened. Alternatively, even if individual terminals do not each hold a storage unit 107 associating voice keywords with image information, a speaker-independent database can be held on a server so that many users can access and use it in common.
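Sending an image together with its keyword, so that the receiving terminal can reuse the index without registering it again, might look like the following. The message layout (JSON with a base64-encoded image), the function names `pack` and `unpack_into`, and the sample data are all assumptions made for illustration; the description does not specify a transfer format.

```python
import base64
import json


def pack(image_id, image_bytes, keywords):
    """Bundle an image and its index keywords into one text message."""
    return json.dumps({
        "image_id": image_id,
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "keywords": list(keywords),
    })


def unpack_into(message, images, index):
    """Store a received image and register its keywords on the receiving side."""
    record = json.loads(message)
    images[record["image_id"]] = base64.b64decode(record["image"])
    for keyword in record["keywords"]:
        ids = index.setdefault(keyword, [])
        if record["image_id"] not in ids:
            ids.append(record["image_id"])


# Sender side: one slide indexed under the keyword "budget".
message = pack("slide-3", b"\x00\x01", ["budget"])

# Receiver side: starts empty and ends up with the same image and index.
remote_images, remote_index = {}, {}
unpack_into(message, remote_images, remote_index)
```

The same record layout could equally be held in a shared server database instead of being sent to each terminal.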

[0044] Services in the medical field are also conceivable. For example, associating voice keywords with a medical image database makes retrieval easy. The system can therefore be used for applications such as pathological diagnosis, in which a hospital in a remote area and a specialized hospital in an urban area are connected by an ISDN line and exchange patient diagnostic data.

[0045] The present invention has been described above by way of several examples, but it goes without saying that various other modifications are possible without departing from the spirit of the invention.

【００４６】[0046]

[Effects of the Invention] As described in detail above, according to the present invention, a plurality of objects can be retrieved with a single voice keyword, and corresponding visual information such as image information is presented, which reduces the user's burden in retrieval and confirmation.

[Brief description of drawings]

FIG. 1 is a block diagram of a terminal device (a terminal device with a telephone function) of a voice image communication system according to one embodiment of the present invention.

FIG. 2 is an explanatory diagram of the storage unit in FIG. 1.

FIG. 3 is an explanatory diagram showing the image database in the storage unit of FIG. 2.

FIG. 4 is an explanatory diagram showing the image information database and the telephone number information database in the storage unit of FIG. 2.

FIG. 5 is an explanatory diagram showing an example of a network configuration in the voice image communication system according to one embodiment of the present invention.

FIG. 6 is a flowchart showing an example of a registration sequence in one embodiment of the present invention.

FIG. 7 is an explanatory diagram showing an example of capturing still images at the time of registration in one embodiment of the present invention.

FIG. 8 is a flowchart of the sequence in which a voice keyword is uttered, recognition processing is performed, and a call is placed, in a videophone, videoconference system, or the like as an application example of one embodiment of the present invention.

FIG. 9 is a flowchart of the registration sequence in an image information retrieval system according to another embodiment of the present invention.

FIG. 10 is a flowchart of the recognition sequence in the image information retrieval system according to another embodiment of the present invention.

[Explanation of symbols]

100 key input unit
101 display unit
102 image input unit
103 image display unit
104 voice input unit
105 voice output unit
106 control unit
107 storage unit
108 image processing unit
109 image CODEC
110 voice processing unit
111 voice CODEC
112 line connection unit
113 line network

Claims

[Claims]

1. A voice image communication system for transmitting a telephone number using voice, comprising: a voice input unit for inputting voice; a key input unit for inputting a telephone number; an image input unit for inputting an image; a storage unit that stores a telephone number input from the key input unit for redialing; a storage unit that stores an image from the image input unit; and a storage unit that stores a voice keyword input from the voice input unit, an image input from the image input unit, and a telephone number input from the key input unit in association with one another.

2. A voice image communication system for transmitting a telephone number using voice, comprising: a voice input unit for inputting voice; a key input unit for inputting a telephone number; an image input unit for inputting an image; a storage unit that stores a telephone number input from the key input unit for redialing; a storage unit that stores an image from the image input unit; and a storage unit that stores one voice keyword input from the voice input unit, a plurality of pieces of image information input from the image input unit, and the corresponding telephone number input from the key input unit in association with one another.

3. A voice image communication system for transmitting a telephone number using voice, comprising: a voice input unit for inputting voice; a key input unit for inputting a telephone number; an image input unit for inputting an image; a storage unit that stores a telephone number input from the key input unit for redialing; a storage unit that stores an image from the image input unit; a storage unit that stores a voice keyword input from the voice input unit, an image input from the image input unit, and a telephone number input from the key input unit in association with one another; and an image display unit capable of displaying a plurality of pieces of image information simultaneously in a multi-display.

4. An image information retrieval system for retrieving image information using voice, comprising: an image input unit for inputting an image; a voice input unit for inputting voice; a storage unit that stores the input voice as a voice keyword; a storage unit that stores the image from the image input unit; a storage unit that stores a plurality of pieces of image information in association with one voice keyword input from the voice input unit; and an image display unit capable of simultaneously displaying a plurality of pieces of image information.