JPH07191599A

JPH07191599A - Video apparatus

Info

Publication number: JPH07191599A
Application number: JP5332899A
Authority: JP
Inventors: Takuo Shimada; 拓生嶋田
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1993-12-27
Filing date: 1993-12-27
Publication date: 1995-07-28

Abstract

PURPOSE: To obtain a video apparatus that converts audio signals to character information and displays it on the screen, so that even people with hearing or speech impairments can understand spoken content simply by viewing the screen. CONSTITUTION: An audio circuit 6 amplifies the audio intermediate frequency, performs FM detection and demodulation to convert it to an audio signal, and supplies this signal to a character generation means 11 and, via a speech recognition means 12, to a sign language image generation means 13. The character generation means 11 converts the audio signal to character information. The speech recognition means 12 recognizes the semantic content of the audio signal, and the sign language image generation means 13 converts the recognized content to sign language animation images. A superimposing display means 10 superimposes the images selected by a switching means 14 on part of the video signal corresponding to the basic screen and supplies the prescribed signals to the electrodes of a picture tube 5.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to video equipment, such as a television set or video recorder, that is particularly suited to deaf people.

[0002]

[Prior Art] As shown in FIG. 8, a television, a typical example of conventional video equipment of this type, supplies the radio waves received by a receiving antenna 1 to a tuner 2, which amplifies and frequency-converts them and selects a channel according to a channel selection remote control 3. A video circuit 4 amplifies and detects the intermediate frequency, separates the audio intermediate frequency, the color subcarrier, and the synchronization signal, then amplifies the luminance signal, adds the color difference signals, and supplies a predetermined video signal to the electrodes of a picture tube 5. An audio circuit 6 amplifies the audio intermediate frequency, performs FM detection and demodulation to convert it into an audio signal, and supplies that signal to a speaker 7. A synchronization/deflection circuit 8 separates the synchronization signal, synchronizes the vertical and horizontal oscillators, generates a sawtooth current, and supplies it to a deflection coil 9. Here, the media input unit that inputs the audio signal and the video signal consists of the receiving antenna 1, the tuner 2, the channel selection remote control 3, the video circuit 4, and the audio circuit 6.

[0003] As for television broadcasting for deaf people, closed-caption broadcasting, which displays the conversations and announcements in a program as subtitles on the television screen, is carried out in the United States. In Japan, at present, only a very limited number of programs are broadcast with subtitles or sign language corresponding to the audio signal.

[0004] Recent research has also begun to consider speaker-independent speech recognition and the synthesis of sign language animation images corresponding to semantic information (for example, Xu Jun, Makoto Tanahashi, Yuji Sakamoto, Yoshinao Aoki: "Hand gesture description and word dictionary construction method for intelligent sign language image communication," IEICE Transactions A, J76-A, 9, pp. 1332-1341 (1993-9)).

[0005] Furthermore, some wired communication systems convert content equivalent to that of a voice service into character information for deaf users and superimpose it on the screen (for example, JP-A-62-48890).

[0006]

[Problems to Be Solved by the Invention] In the conventional configurations described above, however, deaf people cannot understand spoken content such as conversations and announcements at all unless the provider of the video media has separately added information corresponding to the audio in advance, as in closed-caption broadcasts or broadcasts with subtitles or sign language.

[0007] The present invention solves the above problem, and its object is to provide video equipment that allows deaf people and others to understand spoken content simply by looking at the screen.

[0008]

[Means for Solving the Problems] To solve the above problem, the video equipment of the present invention comprises a media input unit that inputs an audio signal and a video signal, a character generation means that converts the audio signal received by the media input unit into character information, and a display means that displays the character information generated by the character generation means on a screen.

[0009] Alternatively, the video equipment comprises a media input unit that inputs an audio signal and a video signal, a speech recognition means that recognizes the semantic content of the audio signal received by the media input unit, a sign language image generation means that converts the semantic content recognized by the speech recognition means into a sign language animation image, and a display means that displays the image information generated by the sign language image generation means on a screen.

[0010] Alternatively, the video equipment comprises a media input unit that inputs an audio signal and a video signal, a speech recognition means that recognizes the semantic content of the audio signal received by the media input unit, a position specifying means that specifies the position of the speaker from the audio signal or the video signal, a sign language image generation means that converts the semantic content recognized by the speech recognition means into a sign language animation image, and a display means that displays the image information generated by the sign language image generation means on a screen at a position corresponding to the speaker position specified by the position specifying means.

[0011] Alternatively, the video equipment comprises a media input unit that inputs an audio signal and a video signal, a speech recognition means that recognizes the semantic content of the audio signal received by the media input unit, a speaker identification means that identifies the speaker from the audio signal or the video signal, a sign language image generation means that converts the semantic content recognized by the speech recognition means into a sign language animation image, and a display means that displays the image information generated by the sign language image generation means on a screen while changing the type of sign language animation according to each speaker identified by the speaker identification means.

[0012]

[Operation] With the above configuration, spoken content such as conversations and announcements in any video medium, television broadcasting included, is displayed on the screen immediately as characters or as a sign language animation image.

[0013] In particular, by providing a position specifying means that specifies the position of the speaker from the audio signal or the video signal, and by changing the display position of the sign language animation image according to that position, an easy-to-understand screen with a sense of presence that matches the original video signal is constructed.

[0014] Alternatively, by providing a speaker identification means that identifies the speaker from the audio signal or the video signal, and by changing the type of sign language animation according to each identified speaker, an easy-to-understand screen with an even greater sense of presence is constructed.

[0015]

[Embodiments] A first embodiment of the present invention is described below with reference to FIGS. 1 to 3. In FIG. 1, components with the same functions as those shown in the conventional example are given the same reference numerals, and their description is partly omitted. As in the conventional example, in a television, a typical example of video equipment, the radio waves received by the receiving antenna 1 are first supplied to the tuner 2, which amplifies and frequency-converts them and selects a channel according to the channel selection remote control 3. The video circuit 4 amplifies and detects the intermediate frequency, separates the audio intermediate frequency, the color subcarrier, and the synchronization signal, then amplifies the luminance signal, adds the color difference signals, and supplies a video signal corresponding to the basic screen to a superimposing display means 10. The audio circuit 6 amplifies the audio intermediate frequency, performs FM detection and demodulation to convert it into an audio signal, and supplies that signal to the speaker 7, to a character generation means 11, and, via a speech recognition means 12, to a sign language image generation means 13. Here, the media input unit that inputs the audio signal and the video signal consists of the receiving antenna 1, the tuner 2, the channel selection remote control 3, the video circuit 4, and the audio circuit 6.

[0016] The character generation means 11 converts the audio signal into character information. The speech recognition means 12 recognizes the semantic content of the audio signal, and the sign language image generation means 13 then converts the recognized semantic content into a sign language animation image. A switching means 14 can select either the caption image generated by the character generation means 11 or the sign language animation image generated by the sign language image generation means 13, or it can select both or neither. The superimposing display means 10 combines the image selected by the switching means 14 with the video signal corresponding to the basic screen so that it is superimposed on part of that signal, and then supplies the predetermined video signal to the electrodes of the picture tube 5.
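The four states of the switching means 14 (caption only, sign language animation only, both, neither) can be illustrated with a minimal Python sketch. The names `Overlay` and `select_overlays` are hypothetical and do not appear in the specification, and the images are stand-in values; this only shows one possible way to model the selection.

```python
from enum import Flag, auto


class Overlay(Flag):
    """Hypothetical encoding of the four states of switching means 14."""
    NONE = 0
    CAPTION = auto()          # caption image from character generation means 11
    SIGN = auto()             # animation from sign language image generation means 13
    BOTH = CAPTION | SIGN


def select_overlays(mode, caption_image, sign_image):
    """Return the images that would be passed to superimposing display means 10."""
    selected = []
    if Overlay.CAPTION in mode:
        selected.append(caption_image)
    if Overlay.SIGN in mode:
        selected.append(sign_image)
    return selected
```

With `Overlay.NONE` the function returns an empty list, which corresponds to showing the basic screen unchanged.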

[0017] The configuration of the character generation means 11 is described with reference to FIG. 2. An acoustic analysis unit 11a analyzes the audio input from the audio circuit 6 frame by frame, at a fixed frame period, and obtains the speech power and the autocorrelation function. Based on this analysis data, a phoneme recognition unit 11b separates the audio input into vowels and consonants and extracts a phoneme sequence, sound by sound, by matching against standard phoneme patterns stored in advance in a phoneme rule unit 11c. A word recognition unit 11d predicts word sequences using the word knowledge stored in advance in a word dictionary 11e and extracts words by matching these predictions against the phoneme sequence. A syntax analysis unit 11f then uses the linguistic knowledge stored in advance in a grammar/syntax rule unit 11g to check and correct the consistency of the result as a sentence. The phoneme rule unit 11c, the word dictionary 11e, and the grammar/syntax rule unit 11g work together to form a knowledge database. A kanji-kana conversion unit 11h converts the sentence extracted by the syntax analysis unit 11f into a Japanese character string of mixed kanji and kana, and a subtitle image generation unit 11i renders this string as an image and outputs it.
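As an illustration of the first stage only, the frame-by-frame analysis performed by the acoustic analysis unit 11a (speech power and autocorrelation function) might be sketched as follows. The frame length, the hop size, and the function names are assumptions made for this example; the specification does not give them.

```python
def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into frames of frame_len samples taken every hop samples."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def frame_power(frame):
    """Mean squared amplitude of one frame (short-time power)."""
    return sum(x * x for x in frame) / len(frame)


def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def analyze(samples, frame_len=256, hop=128, max_lag=12):
    """Return (power, autocorrelation) for every frame; parameter values are illustrative."""
    return [
        (frame_power(f), autocorrelation(f, max_lag))
        for f in frame_signal(samples, frame_len, hop)
    ]
```

The later stages (phoneme matching, word prediction, syntax checking) depend on stored patterns and rules that the specification does not detail, so they are not sketched here.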

[0018] The configurations of the speech recognition means 12 and the sign language image generation means 13 are described with reference to FIG. 3. As in the character generation means 11, an acoustic analysis unit 12a analyzes the audio input from the audio circuit 6 frame by frame, at a fixed frame period, and obtains the speech power and the autocorrelation function. Based on this analysis data, a phoneme recognition unit 12b separates the audio input into vowels and consonants and extracts a phoneme sequence, sound by sound, by matching against standard phoneme patterns stored in advance in a phoneme rule unit 12c. A word recognition unit 12d predicts word sequences using the word knowledge stored in advance in a word dictionary 12e and extracts words by matching these predictions against the phoneme sequence. A syntax analysis unit 12f then uses the linguistic knowledge stored in advance in a grammar/syntax rule unit 12g to check and correct the consistency of the result as a sentence. The phoneme rule unit 12c, the word dictionary 12e, and the grammar/syntax rule unit 12g work together to form a knowledge database. A semantic inference unit 12h recognizes, at the language level, the semantic content of the sentence extracted by the syntax analysis unit 12f and transmits it to the sign language image generation means 13 as meaningful sentence information divided into basic units.

[0019] In the sign language image generation means 13, a sign language description unit 13a retrieves, for this sentence information, the required sign language motion patterns from the sign language words stored in advance in a sign language word dictionary 13b and combines them. A sign language image generation unit 13c converts the sign language motion composed by the sign language description unit 13a into an animation image and outputs it.
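The dictionary lookup and composition performed by the sign language description unit 13a can be pictured with the short sketch below. The dictionary entries are placeholder labels rather than real sign descriptions, and the fingerspelling fallback for unknown words is an assumption added for the example; the specification does not say how missing words are handled.

```python
# Placeholder stand-in for sign language word dictionary 13b:
# each word maps to an ordered list of motion pattern labels.
SIGN_WORD_DICTIONARY = {
    "hello": ["hello_motion_1", "hello_motion_2"],
    "thanks": ["thanks_motion_1"],
}


def describe_signs(words, dictionary=SIGN_WORD_DICTIONARY):
    """Concatenate the motion patterns for a sequence of recognized words.

    Unknown words fall back to a 'fingerspell:<word>' label, which is an
    assumption of this sketch and not part of the specification.
    """
    motions = []
    for word in words:
        motions.extend(dictionary.get(word, ["fingerspell:" + word]))
    return motions
```

The resulting list of motion labels is what a renderer corresponding to the sign language image generation unit 13c would turn into animation frames.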

[0020] In the above configuration, the character generation means 11 converts the audio signal into character information, and the superimposing display means 10 displays this character information on the television screen immediately, so the content can be understood from the television screen alone, even without audio output from the speaker 7. Likewise, the speech recognition means 12 and the sign language image generation means 13 convert the audio signal into a sign language animation image, and the superimposing display means 10 displays this image information on the television screen immediately, so the necessary information can be conveyed visually, quickly, and concisely even for rapid speech that would be hard to follow from a character display alone. A further effect is that viewing is less tiring, particularly when the screen of the picture tube 5 is small.

[0021] Next, a second embodiment of the present invention is described with reference to FIGS. 4 and 5. FIG. 4 differs from the first embodiment in that the character generation means 11 and the switching means 14 are absent, and in that a position specifying means 15, which specifies the position of the speaker from the video signal and the audio signal, and a superimposing display means 16, which displays the image on the screen at a position corresponding to the speaker, are newly provided. Here, the audio circuit 6 is assumed to output a stereo audio signal with independent left and right channels.

[0022] Because the rest of the configuration is the same as in the first embodiment, its description is omitted, and only the configuration of the position specifying means 15 is described with reference to FIG. 5. A speech signal extraction unit 15a is a filter that extracts only the speech signal from the audio input from the audio circuit 6. For the audio signal that has passed through the speech signal extraction unit 15a, a volume detection unit 15b detects the volume, and a direction detection unit 15c calculates the direction of the speaker in the sound field from the left-right volume balance and its changes. Meanwhile, a human body position detection unit 15d processes the video input from the video circuit 4 to detect human body positions on the two-dimensional screen, and a lip motion detection unit 15e detects lip motion within the detected bodies and calculates the position of the speaker on the screen. A position determination unit 15f estimates the position of the speaker in a virtual three-dimensional space from the inputs from the volume detection unit 15b, the direction detection unit 15c, and the lip motion detection unit 15e, and passes this position information to the superimposing display means 16.
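The left-right volume balance used by the direction detection unit 15c can be illustrated with a small sketch. It assumes that the balance is measured as a normalized difference of the RMS levels of the two channels; the specification only says that the balance and its changes are used, so the formula and the function names here are illustrative choices.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def stereo_pan(left, right):
    """Estimate horizontal direction from one block of stereo speech samples.

    Returns a value in [-1.0, 1.0]: -1.0 means fully left, +1.0 fully right,
    0.0 centered (or silent). This balance measure is an assumption of the sketch.
    """
    level_left, level_right = rms(left), rms(right)
    total = level_left + level_right
    if total == 0.0:
        return 0.0
    return (level_right - level_left) / total
```

Calling `stereo_pan` on successive blocks gives the change in balance over time, which could then be combined with the on-screen lip position in a unit corresponding to the position determination unit 15f.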

[0023] The superimposing display means 16 changes the on-screen display position of the superimposed sign language animation image according to the speaker position information specified by the position specifying means 15. Within the two-dimensional display area used for superimposition, depth is expressed by the size of the animation image.
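One way to picture this placement rule is the sketch below, which maps a normalized speaker position to an overlay rectangle and shrinks the rectangle as the speaker moves farther away. The coordinate convention, the base size, and the minimum scale are all assumptions for the example and are not taken from the specification.

```python
def clamp01(value):
    """Clamp a number to the range [0.0, 1.0]."""
    return min(max(value, 0.0), 1.0)


def overlay_rect(x_norm, y_norm, depth, screen_w, screen_h,
                 base_size=120, min_scale=0.4):
    """Return (left, top, width, height) of the sign language overlay.

    x_norm, y_norm: speaker position on screen, 0.0 (left/top) to 1.0 (right/bottom).
    depth: 0.0 for the nearest speaker, 1.0 for the farthest; a larger depth
    yields a smaller overlay, so size stands in for perspective.
    """
    scale = 1.0 - (1.0 - min_scale) * clamp01(depth)
    size = round(base_size * scale)
    left = round(clamp01(x_norm) * (screen_w - size))
    top = round(clamp01(y_norm) * (screen_h - size))
    return left, top, size, size
```

For a 640 by 480 screen, a near speaker at the top right gets a 120-pixel overlay at the right edge, while a distant speaker at the bottom left gets a 48-pixel overlay at the left edge.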

[0024] In the above configuration, the display position of the sign language animation image changes immediately according to the speaker position specified by the position specifying means 15. For example, when a speaker judged from the audio signal to be on the right begins to talk, the sign language animation appears toward the right of the screen and begins signing in accordance with what is said. If the speaker moves to the left while talking, the sign language animation also moves to the left on the screen. The same applies to the vertical and depth directions. In other words, the position information of the speaker, which would be lost if the audio content were merely replaced with sign language, can be incorporated into the screen in a way that conveys a sense of presence.

[0025] Next, a third embodiment of the present invention is described with reference to FIGS. 6 and 7. FIG. 6 differs from the second embodiment in that the position specifying means 15 for specifying the position of the speaker is absent, and in that a speaker identification means 17, which identifies the speaker from the video signal, and a display means 18, which changes the type of the sign language animation according to that speaker and displays it on the screen, are newly provided. The speaker identification means 17 extracts all persons from the video signal input from the video circuit 4 and identifies the speaker among them, thereby changing the type of animation.

[0026] The configuration of the speaker identification means 17 is described with reference to FIG. 7. A speech signal extraction unit 17a is a filter that extracts only the speech signal from the audio input from the audio circuit 6. For the audio signal that has passed through the speech signal extraction unit 17a, a volume detection unit 17b, a pitch detection unit 17c, a timbre detection unit 17d, and a speed detection unit 17e detect the volume, pitch, and timbre of the speech signal and the speaking rate of the speaker, respectively. Meanwhile, a human body shape detection unit 17f processes the video input from the video circuit 4 to detect human body shapes on the two-dimensional screen, and a lip motion detection unit 17g and a feature extraction unit 17h calculate the speaker features obtainable from the image. Based on the outputs of the volume detection unit 17b, the pitch detection unit 17c, the timbre detection unit 17d, the speed detection unit 17e, the human body shape detection unit 17f, and the lip motion detection unit 17g, a speaker determination unit 17i classifies the speaker features into, for example, one of 20 patterns prepared in advance and passes this classification to the superimposing display means 18. Classification is realized by plotting the feature vector of each individual in a multidimensional space, using a multivariate analysis technique such as principal component analysis or a neural network technique such as learning vector quantization. The superimposing display means 18 immediately switches the type of the superimposed sign language animation according to each speaker identified by the speaker identification means 17.
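The classification step can be illustrated with a nearest-prototype rule, which is how a trained learning vector quantization model assigns a class at inference time. The sketch below assumes that the prototypes have already been trained and that features are plain numeric tuples; the feature layout and the function name are hypothetical, and training is not shown.

```python
import math


def nearest_prototype(features, prototypes):
    """Return the label of the prototype closest to the feature vector.

    features: tuple of numbers, e.g. (volume, pitch, timbre, speed) in some
    normalized form (the layout is an assumption of this sketch).
    prototypes: dict mapping a pattern label to its prototype vector; the
    specification mentions, as an example, 20 patterns prepared in advance.
    """
    best_label = None
    best_distance = math.inf
    for label, prototype in prototypes.items():
        distance = math.dist(features, prototype)  # Euclidean distance
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label
```

The returned label would then select the animation type in a unit corresponding to the superimposing display means 18.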

[0027] In the above configuration, the type of sign language animation changes immediately according to each speaker identified by the speaker identification means 17. In other words, the characteristics of individual speakers, which would be lost if the audio content were merely replaced with sign language, can be incorporated into the screen in a way that conveys a sense of presence.

[0028] Although the first to third embodiments were described using a television as an example, the present invention is applicable to any video equipment that outputs an audio signal and a video signal. It does not depend on the type of audio signal or video signal either. Although the superimposing display means 10, 16, and 18 were used to display on the screen, the display may instead be split off into a sub-screen and shown independently, or the newly generated character information or sign language animation image may be shown on a separate screen. The screen need not use a picture tube 5 such as a CRT. The display screen may also be connected by a wired or wireless communication line and installed at a location away from the other components.

[0029] Furthermore, the language handled by the character generation means 11 is not limited to Japanese. The input to the position specifying means 15 or the speaker identification means 17 need not combine both the audio signal and the video signal. The speaker identification means 17 may distinguish as few as two speaker types, for example by gender. Conversely, the size, shape, facial expression, motion, type, and so on of the animation may be adjusted continuously according to the speaker features. Voice features may also be converted into color information such as hue and brightness to vary the expression of the sign language animation.
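As one illustration of converting voice features into color information, the sketch below maps pitch to hue and volume to brightness using the standard `colorsys` module. The pitch range of 80 to 400 Hz, the blue-to-red hue mapping, and the brightness floor are arbitrary assumptions for the example; the specification does not prescribe any particular mapping.

```python
import colorsys


def voice_to_rgb(pitch_hz, volume, pitch_range=(80.0, 400.0)):
    """Map pitch and volume to an (R, G, B) tuple with components in 0..255.

    pitch_hz: detected pitch; values outside pitch_range are clamped.
    volume: normalized loudness, clamped to 0.0..1.0.
    Low pitch maps to blue and high pitch to red (an illustrative choice).
    """
    low, high = pitch_range
    t = min(max((pitch_hz - low) / (high - low), 0.0), 1.0)
    hue = (2.0 / 3.0) * (1.0 - t)
    value = 0.4 + 0.6 * min(max(volume, 0.0), 1.0)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
    return round(r * 255), round(g * 255), round(b * 255)
```

The resulting color could tint the animated figure, so that different voices remain visually distinguishable even when the same animation type is used.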

[0030]

[Effects of the Invention] As described above, the video equipment of the present invention provides the following effects.

[0031] (1) Deaf people, and people who want to follow video content without sound, can understand spoken content simply by looking at the screen. The equipment can also be used as is with the vast body of video media accumulated to date and with highly urgent news and live broadcasts.

[0032] (2) When spoken content is displayed as a sign language animation image, the necessary information can be conveyed visually, quickly, and concisely even for rapid speech that would be hard to follow from a character display alone. For those who have learned it, sign language is an easily understood means of communication, and it is less tiring to watch even when the display screen is small.

[0033] (3) Furthermore, by simultaneously displaying the sign language animation image on the screen at a position corresponding to the specified speaker position, an easy-to-understand screen with a sense of presence can be constructed.

[0034] (4) Alternatively, by changing the type of the sign language animation image according to each speaker, an easy-to-understand screen with an even greater sense of presence can be constructed.

[Brief description of drawings]

FIG. 1 is a configuration diagram of the video equipment in a first embodiment of the present invention.

FIG. 2 is a configuration diagram of the character generation means in the same embodiment.

FIG. 3 is a configuration diagram of the speech recognition means and the sign language image generation means in the same embodiment.

FIG. 4 is a configuration diagram of the video equipment in a second embodiment of the present invention.

FIG. 5 is a configuration diagram of the position specifying means in the same embodiment.

FIG. 6 is a configuration diagram of the video equipment in a third embodiment of the present invention.

FIG. 7 is a configuration diagram of the speaker identification means in the same embodiment.

FIG. 8 is a configuration diagram of conventional video equipment.

[Explanation of symbols]

2 Tuner
4 Video circuit
6 Audio circuit
10 Superimposing display means
11 Character generation means
12 Speech recognition means
13 Sign language image generation means
15 Position specifying means
17 Speaker identification means

Claims

[Claims]

1. Video equipment comprising: a media input unit that inputs an audio signal and a video signal; a character generation means that converts the audio signal received by the media input unit into character information; and a display means that displays the character information generated by the character generation means on a screen.

2. Video equipment comprising: a media input unit that inputs an audio signal and a video signal; a speech recognition means that recognizes the semantic content of the audio signal received by the media input unit; a sign language image generation means that converts the semantic content recognized by the speech recognition means into a sign language animation image; and a display means that displays the image information generated by the sign language image generation means on a screen.

3. Video equipment comprising: a media input unit that inputs an audio signal and a video signal; a speech recognition means that recognizes the semantic content of the audio signal received by the media input unit; a position specifying means that specifies the position of the speaker from the audio signal or the video signal; a sign language image generation means that converts the semantic content recognized by the speech recognition means into a sign language animation image; and a display means that displays the image information generated by the sign language image generation means on a screen at a position corresponding to the speaker position specified by the position specifying means.

4. Video equipment comprising: a media input unit that inputs an audio signal and a video signal; a speech recognition means that recognizes the semantic content of the audio signal received by the media input unit; a speaker identification means that identifies the speaker from the audio signal or the video signal; a sign language image generation means that converts the semantic content recognized by the speech recognition means into a sign language animation image; and a display means that displays the image information generated by the sign language image generation means on a screen while changing the type of the sign language animation according to each speaker identified by the speaker identification means.