JP3162832B2

JP3162832B2 - Superimposed subtitle screen creation apparatus

Info

Publication number: JP3162832B2
Application number: JP28997392A
Authority: JP
Inventors: 亨今井; 彰男安藤; 俊朗原賀; 栄一宮坂
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1992-10-28
Filing date: 1992-10-28
Publication date: 2001-05-08
Anticipated expiration: 2016-05-08
Also published as: JPH06141240A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a superimposed subtitle screen creation apparatus that creates the superimposed subtitle screens used in television programs.

[0002] [Summary of the Invention] The present invention relates to an apparatus for creating superimposed subtitle screens during television program production. It automatically recognizes live speech uttered by a plurality of specified speakers, such as announcers, performers, and directors, who speak on the premise that their speech will be recognized and superimposed as subtitles. From the character string obtained from the recognition result, it generates a subtitle screen as a television signal and combines it with external video to create a superimposed subtitle screen, so that such screens can be created more easily and quickly than with conventional keyboard input or the like.

[0003]

[Prior Art] Conventionally, superimposed subtitle screens for television program production have been created by the following three methods.

[0004] The first method uses photographic technology. Characters are printed on paper with a phototypesetter and photographed with a camera, and the subtitle screen obtained in this way is combined with arbitrary external video to create a superimposed subtitle screen.

[0005] The second method uses electronic technology. A device called an electronic character generator, which resembles a word processor, is used: characters are entered from the device's keyboard to call up built-in character fonts, a television signal for the subtitle screen is generated from those fonts, and this television signal is combined with arbitrary external video to create a superimposed subtitle screen.

[0006] The third method uses programming. A program that displays the characters and graphics to be superimposed is written in advance on a personal computer, a subtitle screen is called up with the keyboard's numeric keypad, a touch panel, or the like, and this subtitle screen is combined with arbitrary external video to create a superimposed subtitle screen.

[0007]

[Problems to Be Solved by the Invention] However, the conventional methods of creating superimposed subtitle screens have the following problems.

[0008] Specifically, the methods using photographic or electronic technology require a phototypesetter or word processor for character input, so skilled operators must be trained. In terms of securing personnel and cost, these methods were not well suited to creating superimposed subtitle screens.

[0009] In addition, because the methods using photographic or electronic technology require a certain amount of time for character input, when superimposed subtitle screens are to be used in a broadcast, the subtitle screens must be completed before the program is recorded.

[0010] Furthermore, with the methods using photographic or electronic technology, the entered characters cannot easily be corrected after the superimposed subtitle screen has been created, so sudden changes are difficult to accommodate.

[0011] The method using programming has the advantage that a desired subtitle screen can be called up immediately even during a broadcast. However, only predetermined subtitle screens can be created, so the method can be used only for purposes such as superimposing player names in sports programs and is not suited to creating superimposed subtitle screens containing arbitrary characters.

[0012] To solve these problems, methods have been proposed in which the receiver automatically recognizes the speech in a television program and creates a superimposed subtitle screen based on the recognition result.

[0013] Related techniques include, for example, the "Method relating to a superimposed subtitle character display system for TV and monitor displays using a speaker-independent voice input device" disclosed in Japanese Patent Application No. 60-106779, and the "Television receiver" with a built-in speech recognition device disclosed in Japanese Utility Model Application No. 63-131212.

[0014] However, these techniques, which are aimed mainly at hearing-impaired viewers, have the receiver automatically recognize the speech in a television program sent from the broadcast station and superimpose the recognition result as subtitles on the video displayed by the receiver. For the reasons given below, they appear extremely difficult to realize.

[0015] First, in most cases the audio of a television program sent from a broadcast station contains background sound such as background music (BGM) and the voices of other speakers, so extracting the voice of a particular speaker is difficult.

[0016] Second, the speakers in broadcast programs are an unspecified large number of people, and reliably recognizing the speech of such unspecified speakers is difficult.

[0017] Third, the vocabulary spoken in broadcast programs is enormous. Unless the program content or the words to be superimposed are specified in advance, the vocabulary to be recognized is too large and the recognition dictionary becomes huge.

[0018] In view of the above circumstances, an object of the present invention is to provide a superimposed subtitle screen creation apparatus that allows the broadcast station to create superimposed subtitle screens for television programs more easily and quickly than with conventional keyboard input or the like, thereby eliminating the various difficulties that arise when subtitles are created on the receiver side.

[0019]

[Means for Solving the Problems] To achieve the above object, the superimposed subtitle screen creation apparatus according to the present invention comprises a speech recognition unit that performs speech recognition processing on an input speech signal of a specific speaker to generate character string data, and a subtitle screen generation unit that creates a subtitle screen based on the character string data generated by the speech recognition unit and combines this subtitle screen with external video to create a superimposed subtitle screen.

[0020]

[Operation] In the above configuration, live speech uttered by a plurality of specific speakers, on the premise that it will be recognized or superimposed as subtitles, is subjected to speech recognition processing, and a subtitle screen is created from the resulting character string to generate a television signal. As a result, even a person who is not a skilled operator can create superimposed subtitle screens during television program production more easily and quickly than with conventional keyboard input or the like.

[0021]

[Embodiment] FIG. 1 is a block diagram showing a superimposed subtitle screen creation system to which an embodiment of the superimposed subtitle screen creation apparatus according to the present invention is applied.

[0022] The superimposed subtitle screen creation system shown in this figure comprises a speech recognition unit 1, which recognizes an input speech signal of a specific speaker and generates character string data, and a subtitle screen generation unit 2, which creates a subtitle screen based on the character string data generated by the speech recognition unit 1 and combines this subtitle screen with external video to create a superimposed subtitle screen. When a speech signal of a specific speaker is input, the system captures and recognizes it, creates a subtitle screen based on the character string data obtained by the recognition, and combines this subtitle screen with external video to create a superimposed subtitle screen.

[0023] As shown in FIG. 2, the speech recognition unit 1 comprises an A/D converter 3, an acoustic analysis unit 4, a vowel recognition unit 5, a consonant recognition unit 6, a recognition dictionary 7, and a language processing unit 8. When a speech signal is input, it is captured and digitized, acoustic features are extracted from the resulting speech data, and vowels are recognized from those features. Hypotheses are then generated from the vowel recognition result, the consonant likelihood is computed for each hypothesis, the best hypothesis is selected on the basis of these likelihoods, and the character string data of that hypothesis is supplied to the subtitle screen generation unit 2 as the recognition result.

[0024] The A/D converter 3 receives a speech signal picked up by a microphone, for example the signal produced when a specific operator reads aloud a word, phrase, or sentence to be superimposed. It digitizes this signal at a preset sampling frequency, for example 15 kHz, which is sufficient for extracting acoustic features, and supplies the resulting speech data to the acoustic analysis unit 4.

[0025] The acoustic analysis unit 4 takes in the speech data output from the A/D converter 3 and divides it into frames using a 20 ms Hamming window advanced at a 5 ms period. It then performs linear predictive analysis and zero-crossing analysis on the speech data of each frame to obtain acoustic parameters such as 18-dimensional LPC cepstrum coefficients, the zero-crossing count, and power, and supplies these acoustic parameters to the vowel recognition unit 5 and the consonant recognition unit 6.
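The framing step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the 15 kHz sampling rate given as an example for the A/D converter 3, computes only the zero-crossing count and power of each frame, and leaves out the 18-dimensional LPC cepstrum. The function names `hamming` and `frame_features` are hypothetical.

```python
import math


def hamming(n):
    """Return a Hamming window of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_features(samples, rate_hz=15000, win_ms=20, hop_ms=5):
    """Split samples into overlapping frames and compute simple features.

    Each frame is win_ms long and starts hop_ms after the previous one.
    The LPC cepstrum mentioned in the text is deliberately omitted.
    """
    win_len = rate_hz * win_ms // 1000  # 300 samples at 15 kHz
    hop_len = rate_hz * hop_ms // 1000  # 75 samples at 15 kHz
    window = hamming(win_len)
    features = []
    for start in range(0, len(samples) - win_len + 1, hop_len):
        frame = samples[start:start + win_len]
        # Count sign changes between neighbouring samples of the raw frame.
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        # Power is measured on the Hamming-windowed frame.
        windowed = [s * w for s, w in zip(frame, window)]
        power = sum(x * x for x in windowed) / win_len
        features.append(
            {"start": start, "zero_crossings": zero_crossings, "power": power}
        )
    return features
```

With these settings each frame overlaps its predecessor by 15 ms, so one second of speech yields roughly 200 frames.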

[0026] The vowel recognition unit 5 takes in the acoustic parameters output from the acoustic analysis unit 4 and compares them with vowel reference patterns learned in advance, that is, reference patterns obtained by collecting and training on the voice of the person to be recognized beforehand, or reference patterns adapted to a new speaker from the already trained reference patterns of other speakers. Based on the comparison, it detects the vowels in the speech data, creates vowel sequence data for the input speech, and supplies it to the language processing unit 8.

[0027] The consonant recognition unit 6 takes in the acoustic parameters output from the acoustic analysis unit 4. It uses HMMs (hidden Markov models) trained in advance, that is, HMMs created either by training on previously collected speech of the person to be recognized or by adapting the already trained HMMs of other speakers to a new speaker. For each hypothesis output from the language processing unit 8, it computes the likelihood of the consonant portions of the input speech under that hypothesis and supplies this likelihood to the language processing unit 8.

[0028] The recognition dictionary 7 describes the independent words to be recognized in text format. This allows the language processing unit 8 to use a general-purpose phrase grammar to automatically generate all the phrase data that can be constructed from each independent word.

[0029] The language processing unit 8 takes in the vowel sequence data output from the vowel recognition unit 5, searches the recognition dictionary 7 using this vowel sequence data as a key, and reads out the word data or phrase data that contain the vowel sequence. Based on these, it generates several candidates as hypotheses by replacing uncertain vowels in the vowel sequence data with other vowels, deleting them, or inserting new vowels, and supplies the hypotheses to the consonant recognition unit 6. When the consonant recognition unit 6 outputs likelihood data (data indicating certainty) for each hypothesis, the language processing unit 8 integrates this likelihood data with the vowel sequence data output from the vowel recognition unit 5, determines how close each hypothesis is to the input speech, and supplies the character string data of the closest hypothesis to the subtitle screen generation unit 2 as the recognition result.
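The hypothesis generation and selection described above can be illustrated with a small sketch. It rests on several assumptions that are not in the source: words are written in romaji, only a single vowel substitution, deletion, or insertion is tried, the dictionary is matched on the whole vowel sequence, and the consonant log-likelihoods are invented numbers standing in for the HMM output of the consonant recognition unit 6. All function names and the `edit_penalty` parameter are hypothetical.

```python
VOWELS = "aiueo"


def vowel_key(romaji):
    """Reduce a romaji word to its vowel sequence, e.g. 'yorikiri' -> 'oiii'."""
    return "".join(ch for ch in romaji if ch in VOWELS)


def one_edit_variants(seq):
    """Return seq plus every sequence one vowel edit away from it."""
    variants = {seq}
    for i in range(len(seq) + 1):
        for v in VOWELS:
            variants.add(seq[:i] + v + seq[i:])  # insertion
        if i < len(seq):
            variants.add(seq[:i] + seq[i + 1:])  # deletion
            for v in VOWELS:
                variants.add(seq[:i] + v + seq[i + 1:])  # substitution
    return variants


def generate_hypotheses(observed, dictionary):
    """Return sorted (word, edit_cost) pairs whose vowel key is near observed."""
    index = {}
    for word in dictionary:
        index.setdefault(vowel_key(word), []).append(word)
    hypotheses = set()
    for variant in one_edit_variants(observed):
        for word in index.get(variant, []):
            hypotheses.add((word, 0 if variant == observed else 1))
    return sorted(hypotheses)


def best_hypothesis(hypotheses, consonant_loglik, edit_penalty=2.0):
    """Combine the consonant score with the vowel edit cost and pick the best."""
    return max(
        hypotheses,
        key=lambda h: consonant_loglik[h[0]] - edit_penalty * h[1],
    )[0]
```

In this toy setup "yorikiri" and "konishiki" share the vowel sequence "oiii", so the vowel stage alone cannot separate them. The consonant likelihood is what decides between such candidates, which is the role the text assigns to the consonant recognition unit 6.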

[0030] As shown in FIG. 3, the subtitle screen generation unit 2 comprises a character font file 10, a character string/subtitle screen conversion unit 11, a video RAM 12, a scan converter 13, and a synthesis unit 14. It takes in the character string data output from the speech recognition unit 1, accesses the character font file 10 based on this data to obtain character font information, and creates a subtitle screen. It then converts the subtitle screen into a television video signal and combines it with video supplied from outside (external video) to create a superimposed subtitle screen.

[0031] The character font file 10 stores the font information of each character used on subtitle screens. In response to a read command from the character string/subtitle screen conversion unit 11, it reads out the font information of the specified character and supplies it to the character string/subtitle screen conversion unit 11.

[0032] The character string/subtitle screen conversion unit 11 takes in the character string data output from the speech recognition unit 1, accesses the character font file 10 based on each character code making up the character string data, and reads out the corresponding character font information. It arranges this font information at the optimum positions on the screen to create subtitle screen data and supplies the data to the video RAM 12.
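As a rough illustration of the conversion from character codes to subtitle screen data, the sketch below looks up each character in a tiny bitmap font and places the glyphs in a frame buffer. The source does not say what the "optimum position" is, so horizontal centering near the bottom edge is an assumption here. The 3x5 font, the `FONT` table, and `render_caption` are all hypothetical.

```python
# Hypothetical 3x5 bitmap font; "1" marks a lit pixel.
FONT = {
    "A": ["010", "101", "111", "101", "101"],
    "B": ["110", "101", "110", "101", "110"],
    " ": ["000", "000", "000", "000", "000"],
}
GLYPH_W, GLYPH_H = 3, 5


def render_caption(text, width=32, height=12, bottom_margin=1, spacing=1):
    """Return a height x width grid of 0/1 pixels with text drawn on it.

    The text is centered horizontally and placed bottom_margin rows above
    the bottom edge (an assumed layout rule).
    """
    screen = [[0] * width for _ in range(height)]
    text_w = len(text) * GLYPH_W + max(len(text) - 1, 0) * spacing
    if text_w > width:
        raise ValueError("caption does not fit on one line")
    x0 = (width - text_w) // 2
    y0 = height - bottom_margin - GLYPH_H
    for n, ch in enumerate(text):
        glyph = FONT[ch]  # font lookup by character code
        gx = x0 + n * (GLYPH_W + spacing)
        for row in range(GLYPH_H):
            for col in range(GLYPH_W):
                if glyph[row][col] == "1":
                    screen[y0 + row][gx + col] = 1
    return screen
```

The returned grid plays the role of the subtitle screen data that the text says is written to the video RAM 12.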

[0033] The video RAM 12 takes in and stores the subtitle screen data output from the character string/subtitle screen conversion unit 11. When a read command is output from the scan converter 13, it reads out the stored subtitle screen data and supplies it to the scan converter 13.

[0034] The scan converter 13 takes in the subtitle screen data output from the video RAM 12, converts it into a television video signal of a specified standard, for example NTSC, PAL, SECAM, or HDTV, and supplies the signal to the synthesis unit 14.

[0035] The synthesis unit 14 combines the television video signal output from the scan converter 13 with a video signal supplied from outside (external video signal) to create and output a superimposed subtitle screen.
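The source does not describe how the two signals are combined. One simple possibility, shown here purely as an assumption, is to treat the subtitle image as a mask and replace the external video pixel wherever a subtitle pixel is lit. Frames are modeled as lists of rows of luminance values, and `superimpose` and `caption_level` are hypothetical names.

```python
def superimpose(caption, video, caption_level=255):
    """Overlay a 0/1 caption mask on a video frame of the same size.

    Pixels where the mask is set take caption_level; all other pixels
    keep the external video value.
    """
    if len(caption) != len(video) or any(
        len(c_row) != len(v_row) for c_row, v_row in zip(caption, video)
    ):
        raise ValueError("caption and video frames must have the same size")
    return [
        [caption_level if c else v for c, v in zip(c_row, v_row)]
        for c_row, v_row in zip(caption, video)
    ]
```

A real broadcast keyer works on analog or digital video signals and usually adds an edge or drop shadow for legibility, which this sketch does not attempt.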

[0036] To measure the speech recognition rate of this embodiment, a circuit with the configuration shown in FIG. 4 was built as actual hardware implementing the speech recognition unit 1 described above.

[0037] The circuit shown in this figure comprises the A/D converter 3, which performs A/D conversion; nine transputers 20 to 28, which are processors for parallel processing; and a disk device 29 used as the storage device of the control transputer 20.

[0038] The control transputer 20 performs the processing of the vowel recognition unit 5, controls all of the transputers 21 to 28, and exchanges data with the subtitle screen generation unit 2. The transputers 21 to 24 perform the processing of the acoustic analysis unit 4.

[0039] Among the transputers 21 to 24, transputer 21 performs acoustic analysis of the even-numbered frames of the input speech data, and transputer 22 performs vector quantization of the LPC cepstrum coefficients obtained by transputer 21. Transputer 23 performs acoustic analysis of the odd-numbered frames of the input speech data, and transputer 24 performs vector quantization of the LPC cepstrum coefficients obtained by transputer 23.
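The even/odd division of labor can be mimicked in software with two worker lanes, each running analysis followed by vector quantization on alternate frames, after which the results are interleaved back into frame order. This is only an analogy to the transputer arrangement: the two-element feature vector, the three-entry codebook, and the function names are all stand-ins chosen for illustration, not the 18-dimensional LPC cepstrum of the embodiment.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(frame):
    """Stand-in acoustic analysis: (mean absolute amplitude, peak amplitude)."""
    return (sum(abs(x) for x in frame) / len(frame), max(abs(x) for x in frame))


def quantize(vector, codebook):
    """Return the index of the codeword nearest to vector (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, codebook[i])),
    )


def process_stream(frames, codebook):
    """Process even and odd frames in separate lanes, then interleave."""

    def lane(parity):
        # Each lane analyzes and quantizes every second frame.
        return [quantize(analyze(f), codebook) for f in frames[parity::2]]

    with ThreadPoolExecutor(max_workers=2) as pool:
        even, odd = pool.map(lane, (0, 1))
    merged = [None] * len(frames)
    merged[0::2] = even
    merged[1::2] = odd
    return merged
```

Because frames arrive every 5 ms, alternating them between two lanes gives each lane twice as long per frame, which is presumably why the hardware splits the work this way.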

[0040] The transputers 25 to 28 perform the processing of the language processing unit 8 and the consonant recognition unit 6, each using a different hypothesis generation method.

[0041] The disk device 29 is used as the storage area that holds, in text format, the independent words used in the recognition dictionary 7.

[0042] When the speech recognition rate of this circuit was tested on actual speech signals, the notable results described below were obtained.

[0043] Specifically, taking a televised grand sumo tournament program as the subject, an announcer was asked to utter sentences about the names of the wrestlers in each bout and the winning techniques, and superimposed subtitle screens were created from the resulting speech signals.

[0044] As a result, when the sentences uttered by the announcer were delimited roughly phrase by phrase, as in "Takahanada and, Konishiki's, bout" and "The bout just now, Takahanada, won by yorikiri", a phrase recognition rate of 98% and a sentence recognition rate of 100% were obtained.

[0045] The time required for processing was also approximately real time, confirming that the apparatus can be used without any practical problem.

[0046] Thus, in this embodiment the speech to be recognized is that of a plurality of specific speakers uttered during television program production, and the speech recognition method obtains hypotheses from the vowel sequence and determines the character string of the speech signal from the consonant likelihood of each hypothesis. The input speech can therefore be recognized reliably. This allows the broadcast station to create superimposed subtitle screens for television programs more easily and quickly than with conventional keyboard input or the like and, as a result, eliminates the various difficulties that arise when subtitles are created on the receiver side.

[0047] In the embodiment described above, the acoustic analysis unit 4 of the speech recognition unit 1 uses an acoustic analysis technique in which the speech data is divided into frames using a 20 ms Hamming window advanced at a 5 ms period, and linear predictive analysis and zero-crossing analysis are then performed on the speech data of each frame to obtain acoustic parameters such as 18-dimensional LPC cepstrum coefficients, the zero-crossing count, and power. However, other techniques may be used instead, for example frequency spectrum analysis such as FFT analysis.

[0048] In the embodiment described above, vowel reference patterns and HMMs (hidden Markov models) are used as the speech recognition methods of the vowel recognition unit 5 and the consonant recognition unit 6 of the speech recognition unit 1. However, other speech recognition methods, for example ones using DP matching or neural networks, may also be used.

[0049]

[Effects of the Invention] As described above, according to the present invention, the use of speech recognition allows even a person who is not a skilled operator to create superimposed subtitle screens during television program production more easily and quickly than with conventional keyboard input or the like.

[Brief description of the drawings]

[FIG. 1] A block diagram showing a superimposed subtitle screen creation system to which an embodiment of the superimposed subtitle screen creation apparatus according to the present invention is applied.

[FIG. 2] A block diagram showing a detailed circuit configuration example of the speech recognition unit shown in FIG. 1.

[FIG. 3] A block diagram showing a detailed circuit configuration example of the subtitle screen generation unit shown in FIG. 1.

[FIG. 4] A block diagram showing a specific hardware configuration example of the speech recognition unit shown in FIG. 2.

[Explanation of symbols]

1 Speech recognition unit
2 Subtitle screen generation unit
3 A/D converter
4 Acoustic analysis unit
5 Vowel recognition unit
6 Consonant recognition unit
7 Recognition dictionary
8 Language processing unit
10 Character font file
11 Character string/subtitle screen conversion unit
12 Video RAM
13 Scan converter
14 Synthesis unit
20-28 Transputers
29 Disk device

Continuation of the front page

(72) Inventor: Eiichi Miyasaka, c/o Japan Broadcasting Corporation Broadcasting Technology Research Laboratories, 1-10-11 Kinuta, Setagaya-ku, Tokyo

(56) References: JP S61-264882 (JP, A); JP H05-176232 (JP, A); JP H02-53670 (JP, U); JP H06-7372 (JP, U)

(58) Fields searched (Int. Cl.7, DB name): H04N 5/278; G10L 3/00 551; G06F 3/16 320; G06F 15/20 503

Claims

(57) [Claims]

1. A superimposed subtitle screen creation apparatus comprising: a speech recognition unit that performs speech recognition processing on an input speech signal of a specific speaker to generate character string data; and a subtitle screen generation unit that creates a subtitle screen based on the character string data generated by the speech recognition unit and combines the subtitle screen with external video to create a superimposed subtitle screen.