JPH11305983A

JPH11305983A - Voice controlled image screen display system

Info

Publication number: JPH11305983A
Application number: JP10129478A
Authority: JP
Inventors: Tsukimi Wakabayashi; つきみ若林
Original assignee: Victor Company of Japan Ltd
Current assignee: Victor Company of Japan Ltd
Priority date: 1998-04-23
Filing date: 1998-04-23
Publication date: 1999-11-05

Abstract

PROBLEM TO BE SOLVED: To convert the image maps of an existing system to a voice-interactive form in a system that provides a voice-controlled image screen display. SOLUTION: The system comprises an image recognition unit 5 that recognizes features of an image by referring to image data information; a keyword/link setting unit 6 that sets combinations of keywords representing the features of the image and data links; a grammar generation unit 7 that generates a grammar for speech recognition from the keywords; a speech recognition unit 9 that recognizes voice input according to the grammar and converts it to text data; and an input evaluation unit 8 that extracts and evaluates keywords from the text data on the basis of the grammar and the keyword/data-link combinations and, when a data link corresponds to the keywords, instructs an image screen display unit to retrieve and display the linked document data.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention]

The present invention relates to a system that provides a voice-controlled image screen display, and in particular to converting the image maps of an existing system to a voice-interactive form.

[0002]

[Description of the Related Art]

Conventionally, video (image) screen displays have been controlled with a keyboard or a pointing device such as a mouse. FIG. 5 shows a conventional configuration for browsing pages on the WWW network with a personal computer or the like. WWW pages are written as Hypertext Markup Language (HTML) documents. The video screen display unit (browser) 11 accesses the WWW network 12, retrieves and interprets an HTML document, and displays it. When the user clicks a link shown on the video screen with the mouse 13, the video screen display unit (browser) 11 receives the input, accesses the WWW network 12, retrieves the HTML document of the linked page, and displays it. In this way, the user can view pages on the WWW network 12 one after another.

[0003] In this case, space is needed for operating the keyboard or mouse, and the user is always required to operate the system by hand. Voice-controlled image screen displays that include a speech recognition system are well known, but they use speech recognition that depends on a language prepared in advance, and they do not convert an existing non-speech system into a voice-interactive one. As a technique for solving this, Japanese Patent Application Laid-Open No. 8-335160 proposes a method that recognizes the data links associated with a video screen display, creates a grammar for voice input, and evaluates the recognized text data to control the video screen display. Taking the World Wide Web (WWW) as an example, it interprets documents written in Hypertext Markup Language (HTML) and creates a grammar for speech recognition.

[0004]

[Problems to be Solved by the Invention]

In an image screen display, the display is often controlled by pointing at a figure or still image shown on the screen, for example when clicking a button drawn as a picture brings up a detailed description or a photograph. For such a display, taking the WWW as an example, the HTML document specifies a data link and an image data file that serves as its anchor. The image data file is ordinary image data such as GIF or JPEG. The conventional method can extract the data links written in a document and create a grammar for speech recognition, but it does not consider turning the images specified in the document into a grammar in a form that is easy to input. HTML tags have an attribute for text that substitutes for an image, but this text is not shown on the screen while the image is displayed, so it is not necessarily a word the user would naturally say on seeing the image.

[0005] In an image screen display, the display is also often controlled by pointing at a particular part of a figure or still image shown on the screen, for example when clicking a place on a displayed facility map brings up a detailed description or photograph of that place. For such a display, taking the WWW as an example, the HTML document specifies an image data file and a map. The image data file is ordinary image data such as GIF or JPEG. The map describes the correspondence between coordinate ranges in the image and the data link to follow when each range is selected.

[0006] The conventional method can extract the data links written in a document and create a grammar for speech recognition, but it does not consider turning the coordinates of the image data specified in the document into a grammar in a form the user can easily input. Even if the map data were interpreted and every coordinate value associated with a data link were turned into grammar, it would be difficult and unnatural for the user to look at the image on the screen, work out the coordinate values of the desired region, and speak them. The problem was therefore to make display control that works by selecting a figure or still image on the image screen, or by pointing at a particular part of one, voice-interactive in a form the user can easily use. Accordingly, an object of the present invention is to recognize the content represented by the image data, create a grammar in words with which the user can easily designate that content, and thereby make the image screen display voice-interactive.

[0007]

[Means for Solving the Problems]

To achieve the above object, the present invention provides a voice-controlled image screen display system comprising: an image screen display unit that retrieves display document data over a data connection, interprets it, and displays it on an image screen; a document analysis unit that analyzes the display document data and extracts the image data information in the document and the data link information associated with that image data information; an image recognition unit that recognizes features of the image by referring to the image data information; a keyword/link setting unit that sets combinations of keywords representing the features of the image and data links; a grammar generation unit that generates a grammar for speech recognition from the keywords; a speech recognition unit that recognizes voice input according to the grammar and converts it to text data; and an input evaluation unit that extracts and evaluates keywords from the text data on the basis of the grammar and the keyword/data-link combinations and, if a data link corresponds to the keywords, instructs the image screen display unit to retrieve the linked document data over the data connection and display it.

[0008] The present invention further provides a voice-controlled image screen display system comprising: an image screen display unit that retrieves display document data over a data connection, interprets it, and displays it on an image screen; a document analysis unit that analyzes the display document data and extracts the image data information in the document and the map data information relating regions of the image to data links; a map extraction unit that extracts combinations of regions in the image data and data links by referring to the map data information; an image recognition unit that recognizes regions of the image and their features by referring to the image data information; a keyword/link setting unit that sets combinations of keywords representing the features of the regions in the image and data links; a grammar generation unit that generates a grammar for speech recognition from the keywords; a speech recognition unit that recognizes voice input according to the grammar and converts it to text data; and an input evaluation unit that extracts and evaluates keywords from the text data on the basis of the grammar and the keyword/data-link combinations and, if a data link corresponds to the keywords, instructs the image screen display unit to retrieve the linked document data over the data connection and display it.

[0009]

[Embodiments of the Invention]

The present invention provides a method of controlling a video (image) screen display by speech recognition for an image screen display system that was not designed for speech control. In particular, for image data specified in an image screen display document, it enables control by speech recognition in situations where the user would conventionally click part of an image shown on the image screen with a pointing device such as a mouse and the image screen display device would perform display control based on the image selection or the clicked coordinates. An embodiment of the present invention is described below for the case where speech recognition is used to operate a browser that accesses and displays the World Wide Web (WWW) network of the Internet, a well-known field.

[0010] FIG. 5 showed the conventional configuration for browsing WWW pages with a personal computer or the like. The function by which clicking different regions of an image displayed in the browser from an HTML document jumps to different link destinations is called an image map. An HTML document that uses an image map specifies an image data file and a map. The image data file is image data in a standard format such as GIF or JPEG. The map associates regions of the image with link destinations; for each region it describes the coordinates, extent (shape and size), and link destination. When the user clicks on the displayed image, the linked page is displayed if the clicked position lies within a region for which a link destination is specified. HTML also allows an image data file to be written as the anchor of a link. In that case, clicking the image displayed on the image screen jumps to the link destination.
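The click behaviour of an image map described above can be sketched as a point-in-region test. This is a minimal illustration, not the browser's actual implementation: the function name `hit_test`, the dictionary layout, the coordinates, and the file names are all assumptions made for the example, and only rectangles and circles are handled.

```python
def hit_test(x, y, areas):
    """Return the link of the first area containing point (x, y), or None."""
    for area in areas:
        coords = area["coords"]
        if area["shape"] == "rect":
            x1, y1, x2, y2 = coords
            inside = x1 <= x <= x2 and y1 <= y <= y2
        elif area["shape"] == "circle":
            cx, cy, r = coords
            inside = (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2
        else:
            inside = False  # other shapes are out of scope for this sketch
        if inside:
            return area["href"]
    return None


# Hypothetical map: coordinates and file names are made up for illustration.
AREAS = [
    {"shape": "circle", "coords": (50, 150, 30), "href": "link1.html"},
    {"shape": "rect", "coords": (120, 120, 180, 180), "href": "link2.html"},
]

print(hit_test(55, 160, AREAS))
print(hit_test(150, 150, AREAS))
print(hit_test(0, 0, AREAS))
```

A click outside every linked region returns `None`, which corresponds to the case where no page change occurs.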

[0011] The image map and image-anchored links described above are explained below with reference to FIG. 1, according to one embodiment of the present invention. The video screen display unit (browser) 1 shown in FIG. 1 connects to the data connection (WWW network) 2, retrieves an HTML document, and displays the page. The document analysis unit 3 connects to the video screen display unit (browser) 1 and reads and analyzes the HTML document. If the document contains an image map description, the document analysis unit 3 extracts the map data specified by that description and the image data in the HTML document, and sends the map data to the map extraction unit 4 and the image data to the image recognition unit 5. If the document contains a link description with an image as its anchor, the link destination and size information are sent to the map extraction unit 4 and the image data is sent to the image recognition unit 5.

[0012] The map extraction unit 4 builds from the map data a list of regions in the image and their link destinations (the map list) and sends it to the keyword/link setting unit 6. Each region is expressed by values such as position, shape, and size. For a link anchored on an image, the whole image is registered in the map list as a single rectangular region. The image recognition unit 5 recognizes the regions of the image data that could be selection targets. It recognizes character strings (including single characters), figures, and depicted objects in the image and, for each recognized target, extracts its region, position, shape (name), size, color, and, for a character string, the string itself. From these it builds an image list and sends it to the keyword/link setting unit 6. Figures, logos, and images to be recognized can be registered in the recognition image database 5a, and the recognition character database 5b is used for character recognition. For a link anchored on an image, one region is extracted per image and registered in the image list.
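As a rough illustration of what the map extraction unit 4 produces, the sketch below collects a map list from the `<area>` tags of an HTML image map using Python's standard `html.parser` module. The class and function names, the sample markup, and the dictionary layout are assumptions for the example; the patent does not specify a data format for the map list.

```python
from html.parser import HTMLParser


class MapListParser(HTMLParser):
    """Collects one entry per <area> tag: shape, coordinates, link destination."""

    def __init__(self):
        super().__init__()
        self.map_list = []

    def handle_starttag(self, tag, attrs):
        if tag != "area":
            return
        a = dict(attrs)
        raw = a.get("coords") or ""
        coords = [int(v) for v in raw.split(",") if v.strip()]
        self.map_list.append(
            {"shape": a.get("shape") or "rect", "coords": coords, "href": a.get("href")}
        )


def extract_map_list(html):
    parser = MapListParser()
    parser.feed(html)
    return parser.map_list


# Hypothetical page fragment; file names and coordinates are made up.
SAMPLE = """
<img src="menu.gif" usemap="#menu">
<map name="menu">
  <area shape="circle" coords="50,150,30" href="link1.html">
  <area shape="rect" coords="120,120,180,180" href="link2.html">
  <area shape="circle" coords="250,150,30" href="link3.html">
</map>
"""

for entry in extract_map_list(SAMPLE):
    print(entry)
```

The image list from the image recognition unit 5 would come from actual image analysis, which is outside the scope of this sketch.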

[0013] The keyword/link setting unit 6 matches the map list from the map extraction unit 4 against the image list from the image recognition unit 5 and sets keywords and a link destination for each region. It takes the image-list region that matches each map-list region and sets that region's position, shape (name), size, color, character string, and so on, in words, as keywords for the corresponding link destination. The keywords are set with reference to the keyword dictionary 6a, based on the information in the image list and taking the relative relationships among the regions into account. For example, when three figures are lined up horizontally, the position keywords "left", "middle", and "right" are set for them respectively. As for size, if a figure of size 10 is next to a figure of size 5, the size-10 figure is "large" and the size-5 figure is "small"; but if a figure of size 10 is next to a figure of size 20, the size-10 figure is "small" and the size-20 figure is "large". In this way a keyword/link list is created that pairs a keyword group with a link destination for each region. The list generated by the keyword/link setting unit 6 is sent to the grammar generation unit 7 and the input evaluation unit 8.
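The relative keyword assignment in this paragraph can be sketched as follows. The function names and the input layout (a mapping from region id to a center x-coordinate or to a size value) are assumptions for illustration; the size cases reproduce the 10-versus-5 and 10-versus-20 examples from the text.

```python
def position_keywords(centers_x):
    """Assign 'left', 'middle', 'right' to three regions lined up horizontally.

    centers_x maps a region id to the x-coordinate of its center.
    """
    if len(centers_x) != 3:
        raise ValueError("this sketch only handles exactly three regions")
    ordered = sorted(centers_x, key=centers_x.get)
    return dict(zip(ordered, ["left", "middle", "right"]))


def size_keywords(sizes):
    """Assign 'large' / 'small' relative to the other regions shown together."""
    biggest, smallest = max(sizes.values()), min(sizes.values())
    if biggest == smallest:
        return {}  # no size difference, so size is not a useful keyword
    result = {}
    for region, size in sizes.items():
        if size == biggest:
            result[region] = "large"
        elif size == smallest:
            result[region] = "small"
    return result


print(position_keywords({"G2": 50, "G3": 150, "G4": 250}))
print(size_keywords({"a": 10, "b": 5}))
print(size_keywords({"a": 10, "b": 20}))
```

The same figure of size 10 receives "large" in one layout and "small" in the other, which is the point of setting keywords relative to the neighbouring regions.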

[0014] The grammar generation unit 7 generates, from the keywords in the keyword/link list, a grammar that defines the recognition candidates for speech recognition. The grammar can be written using syntax rules and compiled. The grammar data generated by the grammar generation unit 7 is sent to the input evaluation unit 8 and the speech recognition unit 9. The speech recognition unit 9 recognizes the user's speech on the basis of the grammar generated by the grammar generation unit 7 and sends the recognition result to the input evaluation unit 8 as text data. The speech recognition unit 9 includes an input instruction unit 9a. On receiving grammar data, the input instruction unit 9a outputs a prompt asking the user for input; on receiving re-input instruction information from the input evaluation unit 8, it outputs a prompt asking for re-input. The prompt can be output as text-to-speech audio, or shown on part of the image screen or on a separately provided display device.
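The patent does not give the grammar format, so the following is only a sketch of the idea that the keyword/link list determines the recognition vocabulary. It emits a single rule in a JSGF-like notation that accepts one or more keywords; the rule name `<selection>`, the function name, and the sample entries are assumptions, and a real recognizer would need its own grammar syntax and compilation step.

```python
def build_grammar(keyword_link_list):
    """Return a one-rule, JSGF-like grammar string covering every keyword."""
    words = sorted({kw for entry in keyword_link_list for kw in entry["keywords"]})
    alternatives = " | ".join(words)
    return f"public <selection> = ( {alternatives} )+;"


# Hypothetical keyword/link list with English keywords for readability.
ENTRIES = [
    {"keywords": ["left", "circle"], "link": "link1.html"},
    {"keywords": ["square"], "link": "link2.html"},
]

print(build_grammar(ENTRIES))
```

Restricting the recognizer to words derived from the current page keeps the vocabulary small, which is why the grammar is regenerated whenever a new keyword/link list arrives.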

[0015] The input evaluation unit 8 analyzes the text data sent from the speech recognition unit 9 according to the grammar and matches it against the keyword groups in the keyword/link list. If the extracted keywords determine a unique link destination, the input evaluation unit 8 judges that this link destination has been selected and inputs the selection to the video screen display unit (browser) 1. On receiving the input, the video screen display unit (browser) 1 accesses the WWW network 2, retrieves the HTML document of the linked page, and displays it. In this way, the user can view pages on the WWW network 2 one after another. If the keywords extracted by the input evaluation unit 8 do not determine a unique link destination, or if an inappropriate combination of keywords is extracted, the input evaluation unit 8 requests a re-input prompt from the speech recognition unit 9. The speech recognition unit 9 issues the re-input prompt, recognizes the user's additional or corrected speech, and sends the text data to the input evaluation unit 8. The input evaluation unit 8 analyzes this text data according to the grammar and extracts and evaluates the additional or corrected keywords. This can be repeated until the link destination is uniquely determined.
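The repeat-until-unique dialogue described in this paragraph can be sketched as a loop over recognized utterances. The function name `resolve`, the entry layout, and the rule for telling an addition from a correction (try the new keywords together with the earlier ones first, and fall back to the new keywords alone if nothing matches) are assumptions for illustration; the patent only says that additional or corrected keywords are extracted and evaluated.

```python
def resolve(turns, entries):
    """Accumulate keywords over turns until exactly one link remains.

    turns: iterable of keyword sets, one per recognized utterance.
    Returns (link, reprompts); link is None if the turns run out first.
    """
    collected = set()
    reprompts = 0
    for spoken in turns:
        trial = collected | set(spoken)
        hits = [e for e in entries if trial <= e["keywords"]]
        if not hits:
            # Assumed rule: treat the utterance as a correction, not an addition.
            trial = set(spoken)
            hits = [e for e in entries if trial <= e["keywords"]]
        links = {e["link"] for e in hits}
        if len(links) == 1:
            return links.pop(), reprompts
        collected = trial if hits else set()
        reprompts += 1  # here the system would ask the user for re-input
    return None, reprompts


# Hypothetical keyword/link list.
ENTRIES = [
    {"keywords": {"circle", "left", "yellow"}, "link": "link1"},
    {"keywords": {"circle", "right", "red"}, "link": "link3"},
]

print(resolve([{"circle"}, {"right"}], ENTRIES))
print(resolve([{"left"}], ENTRIES))
```

In the first call, "circle" alone is ambiguous, so one re-prompt is counted before "right" settles the choice.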

[0016] Next, an embodiment of the present invention that makes the simple image map shown in FIG. 2 voice-interactive is described. The X and Y axes in FIG. 2 indicate the image coordinates, which increase from the upper left toward the lower right. The rectangle spanning (Xp, Yp) is the image as displayed on the image screen. The dotted lines in the figure are there only to make the coordinates easier to read and are not part of the displayed image. The image map is assumed to link the round region "あ" (a) to link destination 1, the square region "い" (i) to link destination 2, and the round region "う" (u) to link destination 3. The rectangle containing "えらんでください" ("Please choose") is a title, so no link is described for it. Table 1 shows an example of the map list extracted by the map extraction unit 4, and Table 2 shows an example of the image list extracted by the image recognition unit 5.

[0017]

[Table 1]

[Table 2]

[0018] The regions in the image list are used for matching against the regions in the map list. The position is the coordinates of the center of the recognized region; the shape is the name of the drawn figure or object; the size is the width and height of the rectangle that bounds the recognized region; the color is the region's dominant color expressed in 256 levels each of R, G, and B; and the character string is the recognized string data. The keyword/link setting unit 6 sets keywords for each region from the map list and the image list and creates the keyword/link list. The keywords are set on the basis of the information in the image list, taking the relative relationships among the regions into account. For example, regions G1 and G3 are positioned one above the other, so they are given the keywords "top" and "bottom"; both are four-sided, but their aspect ratios clearly differ, so they are given "rectangle" and "square"; and regions G2, G3, and G4 are lined up horizontally, so they are given "left", "middle", and "right" respectively. For color, a color name suited to distinguishing each region's color is given as a keyword. If there are regions with similar colors, keywords such as "dark" and "light" are added. Table 3 shows an example of the keyword/link list.
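Two of the conversions in this paragraph, from an RGB value to a color word and from a bounding-box size to "square" or "rectangle", can be sketched as below. The palette, the nearest-color rule, the 10% tolerance, and the function names are assumptions chosen for illustration; the patent does not state how color names or shape names are chosen.

```python
# Assumed reference colors, each in 256-level R, G, B.
PALETTE = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
}


def color_name(rgb):
    """Return the palette name closest to rgb by squared distance."""
    return min(
        PALETTE,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, PALETTE[name])),
    )


def quad_name(width, height, tolerance=0.1):
    """Name a four-sided region 'square' if its aspect ratio is close to 1."""
    ratio = width / height
    return "square" if abs(ratio - 1.0) <= tolerance else "rectangle"


print(color_name((250, 240, 10)))
print(quad_name(60, 60))
print(quad_name(200, 40))
```

Handling similar colors with "dark" and "light", as the text mentions, would need a comparison between regions and is not covered here.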

[0019]

[Table 3]

[0021] The speech recognition unit 9 recognizes speech according to the grammar and sends it to the input evaluation unit 8 as text data. The input evaluation unit 8 analyzes the input text data on the basis of the grammar data and the keyword/link list and judges whether a link destination can be determined from the keywords in the text data. For example, if "yellow circle", "あ" (a), or "left circle" is input, this indicates region 2 and the link destination is judged to be link destination 1; but if only "circle" is input, it cannot be determined whether region 2 or region 4 is meant. In such a case, a message asking for an additional keyword, such as "Which circle?", is output from the speech recognition unit 9 and the additional keyword is obtained. If a combination of keywords for which no region exists, such as "blue circle", is input, a message saying that there is no matching region is output and re-input is requested. If the input has no link destination, as with "えらんでください" ("Please choose"), a message saying that it is not linked is output and re-input is requested. If only "square" is input, regions 1 and 3 are both candidates, but no link destination is specified for region 1. In such a case, region 1 is regarded as a display element that is not among the choices, "square" is taken to mean region 3, and link destination 2 can be invoked.
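The four outcomes in this paragraph (a unique link, an ambiguous input, no matching region, and a region without a link) can be sketched as a single evaluation function. Table 3 is not reproduced in this text, so the keyword sets below are assumptions reconstructed from the examples in the paragraph; in particular the colors of regions 3 and 4 and the English renderings of the labels are hypothetical.

```python
# Assumed keyword/link list for FIG. 2 (Table 3 is not available here).
REGIONS = {
    1: {"keywords": {"square", "rectangle", "top", "please choose"}, "link": None},
    2: {"keywords": {"circle", "left", "yellow", "a"}, "link": "link1"},
    3: {"keywords": {"square", "middle", "blue", "i"}, "link": "link2"},
    4: {"keywords": {"circle", "right", "red", "u"}, "link": "link3"},
}


def evaluate(spoken, regions=REGIONS):
    """Classify a set of recognized keywords into one of four outcomes."""
    spoken = set(spoken)
    matches = [rid for rid, r in regions.items() if spoken <= r["keywords"]]
    if not matches:
        return ("no_region", None)  # e.g. a combination no region satisfies
    linked = [rid for rid in matches if regions[rid]["link"]]
    if not linked:
        return ("no_link", None)  # only unlinked regions, such as a title
    if len(linked) == 1:
        # Unlinked regions are ignored, so "square" resolves to region 3.
        return ("selected", regions[linked[0]]["link"])
    return ("ambiguous", linked)  # ask which one, e.g. "Which circle?"


for words in (["yellow", "circle"], ["circle"], ["blue", "circle"],
              ["please choose"], ["square"]):
    print(words, evaluate(words))
```

Filtering out unlinked regions before checking for uniqueness is what lets "square" select link destination 2 without a follow-up question.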

[0022] The shapes of images extracted by the image recognition unit 5 are not limited to simple figures such as circles and squares; they are also extracted as the names of the objects depicted. The three pictures (A) to (C) shown in FIG. 3 are all square regions of the same size in the map list, but the image recognition unit 5, by referring to the recognition image database 5a, recognizes and distinguishes them by names such as "notebook", "newspaper", and "calendar". Likewise, FIG. 3(D) appears in the map list as two circular regions, but the image recognition unit 5 recognizes it as a single object, a "snowman". The user can therefore make a selection by the name of the object seen in the image.

[0023] The keyword/link setting unit 6 has the keyword dictionary 6a, which holds the associative knowledge listed in Table 4. For an image in which a yellow circle is drawn, as shown in FIG. 4, the keyword/link setting unit 6 can set a keyword that satisfies the conditions of the associative knowledge: "egg yolk" if the background region is white, "sun" if it is blue, and "moon" if it is black. The embodiment above was applied to an image screen (browser) 1 connected to the WWW network 2, but the invention can also be used effectively to make display control voice-interactive in displays on terminal screens of a local network system and in stand-alone image screen display systems, wherever display control is performed by selecting a particular region of the displayed image. The formats and units used in this embodiment to express region, position, size, color, and so on are examples for explanation only, and the scope of the present invention is not limited to them.
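The associative knowledge in the keyword dictionary 6a can be pictured as a small rule table. Table 4 is not reproduced in this text, so the sketch below encodes only the three examples given in the paragraph; the rule layout and the function name are assumptions.

```python
# Each rule: (conditions that must all hold, keyword to add).
ASSOCIATIONS = [
    ({"shape": "circle", "color": "yellow", "background": "white"}, "egg yolk"),
    ({"shape": "circle", "color": "yellow", "background": "blue"}, "sun"),
    ({"shape": "circle", "color": "yellow", "background": "black"}, "moon"),
]


def associated_keywords(features):
    """Return every keyword whose conditions are all met by the region features."""
    return [
        keyword
        for conditions, keyword in ASSOCIATIONS
        if all(features.get(k) == v for k, v in conditions.items())
    ]


print(associated_keywords({"shape": "circle", "color": "yellow", "background": "blue"}))
print(associated_keywords({"shape": "circle", "color": "yellow", "background": "green"}))
```

A region that matches no rule simply keeps its ordinary keywords such as shape and color.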

[0024]

[Table 4]

[0025]

[Effects of the Invention]

The voice-controlled image screen display system of the present invention recognizes the content represented by image data and, from its correspondence with the data links of the map, creates a grammar in words that the user can easily say; it can then control the image screen display by looking up the data link from the speech recognition result. In other words, the keywords for a linked region of an image can be set as words that describe the features of the image in that region. As a result, in situations where the user would conventionally click part of an image displayed on the image screen with a pointing device such as a mouse and the image screen display device would perform display control based on the image selection or the clicked coordinates, display control can instead be performed by designating the region of the image with words that come naturally to the user on seeing the displayed image. That is, providing a voice-controlled image screen display system that supports image maps enables hands-free navigation of image screen displays such as the World Wide Web and interactive television.

[0026] The voice-controlled image screen display system of the present invention can be used together with conventional methods of controlling an image screen display, and it can add voice control that supports image maps to conventional image screen display control, including conventional voice-interactive control.

[Brief description of the drawings]

FIG. 1 is a block diagram of an embodiment of the voice-controlled image screen display system of the present invention.

FIG. 2 is an example of an image-map image in an embodiment of the present invention.

FIG. 3 is an example of images recognized by the image recognition unit in an embodiment of the present invention.

FIG. 4 is an example of an image in an embodiment of claim 6 of the present invention.

FIG. 5 is a block diagram of an example of a conventional image screen display device.

[Explanation of symbols]

1 Video (image) screen display unit (browser)
2 Data connection (WWW network)
3 Document analysis unit
4 Map extraction unit
5 Image recognition unit
5a Recognition image database
5b Recognition character database
6 Keyword/link setting unit
6a Keyword dictionary
7 Grammar generation unit
8 Input evaluation unit
9 Voice recognition unit
9a Input instruction unit

Continuation of front page
(51) Int.Cl.6, identification symbol, FI: G06T 7/00 G10L 3/00 521C G09G 5/00 530 551G G10L 3/00 521 551Z 551 G06F 15/62 321A 15/70 330Q

Claims

[Claims]

1. A voice-controlled image screen display system comprising: an image screen display unit that retrieves display document data from a data connection, interprets it, and displays it on an image screen; a document analysis unit that analyzes the display document data and extracts the image data information in the document and the data link information associated with it; an image recognition unit that recognizes features on an image by referring to the image data information; a keyword/link setting unit that sets combinations of a keyword representing a feature of the image and a data link; a grammar generation unit that generates a grammar for voice recognition from the keywords; a voice recognition unit that recognizes voice input according to the grammar and converts it into text data; and an input evaluation unit that extracts and evaluates a keyword from the text data based on the grammar and the combinations of keyword and data link and, if a data link corresponding to the keyword exists, instructs the image screen display unit to retrieve the linked document data from the data connection and display it.

2. A voice-controlled image screen display system comprising: an image screen display unit that retrieves display document data from a data connection, interprets it, and displays it on an image screen; a document analysis unit that analyzes the display document data and extracts the image data information in the document and the map data information relating regions on the image to data links; a map extraction unit that extracts combinations of a region in the image data and a data link by referring to the map data information; an image recognition unit that recognizes each region on the image and its features; a keyword/link setting unit that sets combinations of a keyword representing a feature of a region in the image and a data link; a grammar generation unit that generates a grammar for voice recognition from the keywords; a voice recognition unit that recognizes voice input according to the grammar and converts it into text data; and an input evaluation unit that extracts and evaluates a keyword from the text data based on the grammar and the combinations of keyword and data link and, if a data link corresponding to the keyword exists, instructs the image screen display unit to retrieve the linked document data from the data connection and display it.

3. The voice-controlled image screen display system according to claim 1, wherein the image recognition unit has at least one of a recognition image database and a recognition character database; the recognition image database is used to recognize shapes, logos, people, and other objects and to give their names as features of the region in the image; and the recognition character database is used to recognize a single character or a character string in the image and to give that character or character string as a feature of the region in the image.

4. The voice-controlled image screen display system according to claim 1, wherein the keyword/link setting unit holds a keyword dictionary, compares the features of a plurality of image regions displayed on the image screen, and assigns keywords based on the relative relationships among the plurality of regions.

5. The voice-controlled image screen display system according to claim 1, wherein the input evaluation unit extracts and evaluates a keyword from the text data based on the grammar and the combinations of keyword and data link and, if there is no data link destination corresponding to the keyword, requests re-input from the voice recognition unit; and the voice recognition unit outputs an input instruction when it receives the grammar data and outputs a re-input instruction in response to the re-input request from the input evaluation unit.

6. The voice-controlled image screen display system according to claim 4, wherein the keyword dictionary holds associative knowledge about at least one of color and shape, and the keyword/link setting unit sets keywords associated with at least one of color and shape among the plurality of image regions.
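
As an informal illustration of keywords set from the relative relationship between regions (claims 4 and 6), the sketch below assigns a color or shape word as a keyword only when that word distinguishes one region from the others on the screen. This selection rule, the `relative_keywords` function, and the dictionary layout of each region are assumptions for illustration. The claims do not specify how the comparison is carried out.

```python
from collections import Counter


def relative_keywords(regions):
    """Return {keyword: link}, using only color or shape values that are
    unique among the displayed regions (an assumed selection rule)."""
    keywords = {}
    for attr in ("color", "shape"):
        counts = Counter(r[attr] for r in regions)
        for r in regions:
            if counts[r[attr]] == 1:
                keywords[r[attr]] = r["link"]
    return keywords


# Hypothetical regions with features supplied by an image recognition step.
regions = [
    {"color": "red", "shape": "circle", "link": "a.html"},
    {"color": "blue", "shape": "circle", "link": "b.html"},
    {"color": "green", "shape": "square", "link": "c.html"},
]
print(relative_keywords(regions))
```

Here "circle" is shared by two regions and is therefore not offered as a keyword, while each color and the word "square" identify a single region.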