JP5503507B2

JP5503507B2 - Character area detection apparatus and program thereof

Info

Publication number: JP5503507B2
Application number: JP2010256468A
Authority: JP
Inventors: 吉彦河合; 真人藤井
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2010-11-17
Filing date: 2010-11-17
Publication date: 2014-05-28
Anticipated expiration: 2030-11-17
Also published as: JP2012108689A

Description

The present invention relates to a character area detection device that detects, from an input image, a character area (the area occupied by characters in the input image), and to a program therefor.

Captions added to video such as television broadcasts are very useful information because they can be used to attach metadata to the video and to analyze its content. Various techniques for detecting caption areas have therefore been proposed.

For example, the technique described in Non-Patent Document 1 detects edges in each frame image of an input video and extracts only those edges that continue to appear at the same place for a certain period of time. It then detects an area where the extracted edges are dense as a caption area.

The technique described in Non-Patent Document 2 detects a caption area, based on motion compensation information, from video decoded by the MPEG (Moving Picture Experts Group) method.
The technique described in Non-Patent Document 3 estimates the movement parameters of a caption area from a set of edge pairs (pairs of edges whose gradient directions are reversed) between frame images, so that captions moving within the video can also be handled.
Like Non-Patent Document 1, the techniques of Non-Patent Documents 2 and 3 detect caption areas based on the density of temporally continuous edges.

Non-Patent Document 1: Washio, Ariki, Ogata, "Cross-media passage retrieval: a retrieval method for spoken documents using telop and CG flip character strings as queries", IEICE Transactions, Vol. J84-D-II, No. 8, pp. 1809-1816, 2001
Non-Patent Document 2: Sato, Niikura, Taniguchi, Akutsu, Tonomura, Hamada, "High-speed telop area detection method from MPEG encoded video", IEICE Transactions, Vol. J81-D-II, No. 8, pp. 1847-1855, 1998
Non-Patent Document 3: Mori, Kurakake, Sugimura, Shio, Suzuki, "Telop character recognition in video using shape features of background and characters and a dynamically modified discriminant function", IEICE Transactions, Vol. J83-D-II, No. 7, pp. 1658-1666, 2000

However, the techniques described in Non-Patent Documents 1 to 3 rely on the empirical knowledge that captions remain displayed at the same place for a certain period of time (the empirical knowledge of edge density), so they cannot detect captions whose appearance position or size changes over time. In addition, if a background or an object other than a caption appears at the same place for a certain period of time, these techniques may produce false detections.
Furthermore, because these techniques separate caption areas from other areas based on the empirical knowledge of edge density, they cannot detect captions that appear on a complicated background containing many edges.

Moreover, in the techniques described in Non-Patent Documents 1 to 3, the degree of edge density varies with the caption size (font size), so detections may be missed when captions of various sizes appear.
Finally, these techniques need a plurality of consecutive frame images to detect a caption area, so they cannot detect captions from a single frame image.

Accordingly, an object of the present invention is to solve the above problems and to provide a character area detection device, and a program therefor, that can accurately detect a character area from a single image.

In view of the above problems, the character area detection device according to the first invention of the present application detects, from an input image, a character area (the area occupied by characters in the input image), and is characterized by comprising resolution conversion means, scanning means, image feature vector calculation means, character candidate area determination means, character detection area determination means, enlargement means, and character area output means.

With this configuration, the resolution conversion means of the character area detection device receives the input image and converts it into one or more low-resolution images having a lower resolution than the input image. The scanning means then scans the input image and each low-resolution image with a scanning window of the same size, generating, for the input image and for each low-resolution image, scanning window area images corresponding to the area of the scanning window.

For example, when the input image contains characters, the characters in a low-resolution image become smaller as its resolution decreases. The size of the scanning window, on the other hand, is constant regardless of the resolution of the input image or the low-resolution images. The scanning means therefore generates scanning window area images in which the same characters as in the input image appear at various sizes.
In addition, because the scanning means scans the whole of the input image and of each low-resolution image, the character area detection device can detect a character area regardless of where the characters appear.

The image feature vector calculation means calculates, for the input image and for each low-resolution image, a feature vector of each scanning window area image generated by the scanning means. The character candidate area determination means then determines by machine learning, based on that feature vector, whether each scanning window area image is a character candidate area. Because the character candidate areas are determined from scanning window area images in which the characters of the input image appear at various sizes, the character area detection device can detect a character area regardless of character size.

The character detection area determination means calculates, for the input image and for each low-resolution image, the number of times the character candidate areas determined by the character candidate area determination means overlap each other, and determines a character candidate area whose overlap count is equal to or greater than a preset first threshold to be a character detection area.
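The overlap count can be illustrated with a small sketch. It assumes that candidate areas are axis-aligned rectangles (x, y, width, height) and that the overlap count is accumulated per pixel; the text above does not fix either choice, and the function names are hypothetical.

```python
def overlap_counts(width, height, candidates):
    """Count, for every pixel, how many candidate rectangles cover it.

    candidates: iterable of (x, y, w, h) rectangles (an assumed representation).
    """
    counts = [[0] * width for _ in range(height)]
    for x, y, w, h in candidates:
        for yy in range(max(y, 0), min(y + h, height)):
            for xx in range(max(x, 0), min(x + w, width)):
                counts[yy][xx] += 1
    return counts


def detection_mask(counts, first_threshold):
    """Mark pixels whose overlap count reaches the first threshold."""
    return [[c >= first_threshold for c in row] for row in counts]


# Three 16x16 candidate windows shifted by 4 pixels each.
candidates = [(0, 0, 16, 16), (4, 0, 16, 16), (8, 0, 16, 16)]
counts = overlap_counts(32, 16, candidates)
mask = detection_mask(counts, first_threshold=3)
```

In this example only the columns covered by all three windows survive the threshold, which is the intended effect: isolated candidate windows are discarded and densely overlapping ones are retained.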

The enlargement means enlarges the character detection areas determined for each low-resolution image by the enlargement ratio at which the corresponding low-resolution image reaches the same resolution as the input image. The character area output means then determines whether any of the enlarged character detection areas of the low-resolution images and the character detection areas of the input image overlap one another. Among character detection areas determined to overlap, it takes as the reference character detection area the one corresponding to the input image or to the low-resolution image with the highest resolution, and calculates the ratio at which each other character detection area overlaps the reference character detection area. When the calculated overlap ratio is equal to or greater than a preset second threshold, it outputs only the reference character detection area as the character area. In other words, the character area output means integrates a plurality of character detection areas corresponding to the same characters into a single character area.
As a result, the character area detection device can detect character areas that appear on a complicated background without relying on the empirical knowledge of edge density.
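The enlargement and integration steps can be sketched as follows. The sketch assumes rectangular detection areas, interprets the overlap ratio as the intersection area divided by the area of the reference rectangle, and processes areas greedily from the highest resolution downward; these are illustrative assumptions, and the names `Region`, `enlarge`, `overlap_ratio` and `integrate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    x: float
    y: float
    w: float
    h: float
    level: int  # 0 = input image; larger values = lower resolution


def enlarge(region, alpha):
    """Scale a region back to input-image coordinates (factor alpha**level)."""
    s = alpha ** region.level
    return Region(region.x * s, region.y * s, region.w * s, region.h * s, region.level)


def overlap_ratio(ref, other):
    """Fraction of the reference area covered by the other region (assumed definition)."""
    ix = max(0.0, min(ref.x + ref.w, other.x + other.w) - max(ref.x, other.x))
    iy = max(0.0, min(ref.y + ref.h, other.y + other.h) - max(ref.y, other.y))
    return (ix * iy) / (ref.w * ref.h)


def integrate(regions, second_threshold=0.5, alpha=2 ** 0.5):
    """Keep higher-resolution regions; drop lower-resolution ones that overlap them enough."""
    scaled = sorted((enlarge(r, alpha) for r in regions), key=lambda r: r.level)
    kept = []
    for r in scaled:
        if any(overlap_ratio(ref, r) >= second_threshold for ref in kept):
            continue  # same characters already represented by a reference region
        kept.append(r)
    return kept


# alpha = 2.0 is used here only to keep the numbers round.
detections = [
    Region(10, 10, 40, 20, 0),   # found in the input image
    Region(5, 5, 20, 10, 1),     # same caption, found one level down
    Region(100, 50, 10, 5, 1),   # a different caption, found one level down
]
merged = integrate(detections, second_threshold=0.5, alpha=2.0)
```

Regions that do not overlap any reference, or that overlap it by less than the threshold, are kept as separate outputs in this sketch.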

Here, the character area is information indicating the position (coordinates), shape, and size of an area in which characters appear in the input image.
The caption candidate area (character candidate area) is information indicating, for the input image and for each low-resolution image, the position (coordinates), shape, and size of an area that is a candidate for a character detection area.
The caption detection area (character detection area) is information indicating, for the input image and for each low-resolution image, the position (coordinates), shape, and size of an area in which characters appear.

The character area detection device according to the second invention of the present application preferably further comprises character detection area deletion means that, for the input image and for each low-resolution image, calculates as an evaluation value the size of, or the density of edge intensity in, each character detection area determined by the character detection area determination means, determines whether the calculated evaluation value is equal to or less than a preset third threshold, and deletes from the character detection areas those whose evaluation value is determined to be equal to or less than the third threshold.
With this configuration, when the evaluation value is equal to or less than the third threshold, the character area detection device judges that the character detection area was falsely detected and deletes it.
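A minimal sketch of this deletion step is given below. It assumes each detection is a dictionary holding a width, a height and a pre-computed sum of edge intensities inside the area; the field names and the threshold values are hypothetical and chosen only for illustration.

```python
def evaluation_value(det, mode):
    """Return the evaluation value of one detection area.

    mode "size": area in pixels; mode "edge_density": edge intensity per pixel.
    """
    area = det["w"] * det["h"]
    if mode == "size":
        return area
    if mode == "edge_density":
        return det["edge_sum"] / area
    raise ValueError(f"unknown mode: {mode}")


def delete_false_detections(detections, third_threshold, mode="size"):
    """Drop detections whose evaluation value is at or below the third threshold."""
    return [d for d in detections if evaluation_value(d, mode) > third_threshold]


detections = [
    {"w": 40, "h": 20, "edge_sum": 4000.0},  # large area with strong edges
    {"w": 4, "h": 4, "edge_sum": 8.0},       # tiny area with weak edges
]
by_size = delete_false_detections(detections, third_threshold=64, mode="size")
by_density = delete_false_detections(detections, third_threshold=1.0, mode="edge_density")
```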

In the character area detection device according to the third invention of the present application, the character area output means preferably further outputs, as character areas, the character detection areas determined not to overlap one another and the character detection areas determined to overlap but whose calculated overlap ratio is less than the second threshold.

The character area detection device according to the fourth invention of the present application preferably further comprises video decoding means that receives a video, decodes it, and outputs the frame images constituting the video to the resolution conversion means as input images. In this case, the character area output means outputs character areas for each frame image, and the device further comprises character area deletion means that determines whether the character area output for each frame image is absent from the other frame images within a preset range of that frame image, and deletes from the character areas output by the character area output means those determined to be absent from the other frame images.

In ordinary video, characters such as captions are very rarely displayed for only a single frame. Therefore, when a character area detected in a certain frame image does not exist in the frame images within a certain range, the character area deletion means judges that the character area was falsely detected and deletes it.
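The temporal check can be sketched as follows. The text does not say how a character area is judged to "exist" in another frame, so this sketch assumes rectangular areas and treats two areas as the same when their intersection-over-union reaches a hypothetical `min_iou`; `frame_range` stands in for the preset range.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def delete_isolated(regions_by_frame, frame_range=2, min_iou=0.5):
    """Keep a region only if a similar region exists in a nearby frame."""
    result = {}
    for f, regions in regions_by_frame.items():
        kept = []
        for r in regions:
            nearby = [g for g in range(f - frame_range, f + frame_range + 1) if g != f]
            if any(iou(r, q) >= min_iou
                   for g in nearby
                   for q in regions_by_frame.get(g, [])):
                kept.append(r)
        result[f] = kept
    return result


regions_by_frame = {
    0: [(10, 10, 40, 20)],
    1: [(10, 10, 40, 20), (200, 100, 30, 10)],  # second region appears only once
    2: [(12, 10, 40, 20)],
}
filtered = delete_isolated(regions_by_frame)
```

Here the region at (200, 100) is removed because no nearby frame contains a matching region, while the slightly shifted caption in frame 2 is still matched.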

The character area detection device according to the first invention of the present application can also be realized by a character area detection program that causes a general-purpose computer to function as the resolution conversion means, scanning means, image feature vector calculation means, character candidate area determination means, character detection area determination means, enlargement means, and character area output means.

The present invention provides the following advantageous effects.
According to the first invention of the present application, a character area can be detected regardless of where the characters appear, because the whole of the input image and of each low-resolution image is scanned, and regardless of character size, because scanning window area images in which the characters of the input image appear at various sizes are used. Consequently, even when a character area appears on a complicated background, the character area can be detected accurately from a single image without relying on the empirical knowledge of edge density.

According to the second invention of the present application, character detection areas that are likely to have been falsely detected are deleted, so the detection accuracy of character areas can be further increased.
According to the third invention of the present application, character detection areas that do not overlap one another and character detection areas whose overlap ratio is less than the second threshold are also output as character areas, so the detection accuracy of character areas can be further increased.
According to the fourth invention of the present application, character areas that are likely to have been falsely detected are deleted, so the detection accuracy of character areas can be further increased.

FIG. 1 is a block diagram showing the configuration of a video caption detection device according to an embodiment of the present invention.
FIG. 2 is a diagram explaining scanning by the scanning means of FIG. 1.
FIG. 3 is a diagram explaining the determination of caption candidate areas by the caption candidate area determination means of FIG. 1.
FIG. 4 is an image explaining the determination of caption detection areas by the caption detection area determination means of FIG. 1.
FIG. 5 is an image explaining the labeling process by the labeling means of FIG. 1.
FIG. 6 is a diagram showing an example of the caption detection areas detected by the caption detection means of FIG. 1.
FIG. 7 is a diagram explaining the enlargement of caption detection areas by the enlargement means of FIG. 1.
FIG. 8 (a) to (c) are diagrams explaining the integration of caption detection areas by the caption detection area integration means of FIG. 1.
FIG. 9 is a flowchart showing the overall operation of the video caption detection device of FIG. 1.
FIG. 10 is a flowchart showing the caption detection area detection process performed by the caption detection means of FIG. 1.
FIG. 11 is an image showing a first example of the experimental video in an example of the present invention.
FIG. 12 is an image showing a second example of the experimental video in an example of the present invention.

[Configuration of video caption detection device]
In the present embodiment, a video caption detection device that detects caption areas contained in a video is described as an example of the character area detection device according to the present invention.
The configuration of the video caption detection device 1 is described below with reference to FIG. 1. As shown in FIG. 1, the video caption detection device 1 detects, from a video, the caption areas contained in that video, and comprises video decoding means 10, resolution conversion means 20, storage means 30, caption detection means 40, and detection result integration means 50.

The video input to the video caption detection device 1 consists of consecutive frame images (input images), and each frame image can be identified by a time code or a frame number.
The video is assumed to contain captions of some kind. Examples of captions include character images superimposed on the video, telops, open captions, and scrolling superimposed text.
The caption area is information indicating the position (coordinates), shape, and size of the area in which a caption appears in each frame image of the video, together with the time (frame image) at which the caption appears in the video.

The video decoding means 10 receives a video, decodes it, and outputs the frame images constituting the video to the resolution conversion means 20 as input images. For example, when video in MPEG2 format is input, the video decoding means 10 is an ordinary MPEG2 decoder.

The resolution conversion means 20 receives a frame image from the video decoding means 10 and converts it into low-resolution images having a lower resolution than the input image. Specifically, the resolution conversion means 20 reduces the frame image by each reduction ratio α^(-i), converting it into low-resolution images of N different resolutions. The resolution conversion means 20 then stores the frame image input from the video decoding means 10 and the converted low-resolution images in the storage means 30.

Here, α is a constant (for example, √2), N is the number of resolution levels (for example, 6), and i is a counter satisfying 0 ≤ i < N. The values of α and N can be set arbitrarily by the operator of the video caption detection device 1.
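As a rough illustration of the reduction ratios, the sketch below lists the image sizes produced by scaling with α^(-i) for 0 ≤ i < N. Rounding to whole pixels and the function name `pyramid_sizes` are assumptions made for the example; note that i = 0 gives a ratio of 1, that is, the size of the frame image itself.

```python
import math


def pyramid_sizes(width, height, alpha=math.sqrt(2), n=6):
    """Return the (width, height) of each level scaled by alpha**(-i), 0 <= i < n."""
    sizes = []
    for i in range(n):
        scale = alpha ** (-i)
        sizes.append((max(1, round(width * scale)), max(1, round(height * scale))))
    return sizes


# With the example values from the text (alpha = sqrt(2), N = 6):
levels = pyramid_sizes(1920, 1080)

# With alpha = 2 each level halves the previous one:
halved = pyramid_sizes(1920, 1080, alpha=2, n=3)
```

With α = √2, every second level halves the width and height, so six levels cover character sizes spanning a factor of about 5.7.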

The storage means 30 is a storage device such as a memory or a hard disk that stores the frame images, the low-resolution images, the caption candidate areas (character candidate areas) described later, and the caption detection areas (character detection areas) described later.

The caption candidate area is information indicating, for the frame image and for each low-resolution image, the position (coordinates), shape, and size of an area that is a candidate for a caption detection area.
The caption detection area is information indicating, for the frame image and for each low-resolution image, the position (coordinates), shape, and size of an area in which a caption appears.
In the present embodiment, the time code of the original frame image is attached to each low-resolution image, caption candidate area, and caption detection area so that the time within the video can be identified.

The caption detection means 40 detects caption detection areas by applying a caption detection area detection process to the frame image and to each low-resolution image. For this purpose, the caption detection means 40 comprises scanning means 41, image feature vector calculation means 42, caption candidate area determination means (character candidate area determination means) 43, caption detection area determination means (character detection area determination means) 44, labeling means 45, circumscribed area calculation means 46, and filtering means (character detection area deletion means) 47.

The scanning means 41 scans the frame image and each low-resolution image stored in the storage means 30 with a scanning window of the same size, generating, for the frame image and for each low-resolution image, scanning window area images corresponding to the area of the scanning window.

As shown in FIG. 2, the initial position of the scanning window Wi is the upper left of the image. First, the scanning means 41 moves the scanning window Wi little by little to the right (horizontal direction) from this initial position until it reaches the right edge, generating a scanning window area image for the area of the scanning window Wi at each position. Next, when the scanning window Wi has reached the right edge, the scanning means 41 moves it to a position slightly below the initial position (vertical direction) and repeats the horizontal scan in the same manner. In this way, the scanning means 41 scans the whole of the frame image and of each low-resolution image to generate the scanning window area images.

Here, the scanning window Wi has the same fixed size (for example, 16 pixels high by 16 pixels wide) for the frame image and for every low-resolution image, regardless of their resolutions.
The number of pixels by which the scanning window Wi moves in the horizontal and vertical directions is set to an arbitrary value (for example, 4 pixels) by the operator.
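The raster scan described above can be sketched as a generator of window positions. The example values (a 16 by 16 window moved 4 pixels at a time) come from the text; the function name and the choice to stop when the window would cross the image border are assumptions.

```python
def window_positions(width, height, win=16, step=4):
    """Yield the top-left (x, y) of each scanning window in raster order."""
    for y in range(0, height - win + 1, step):      # move downward after each row
        for x in range(0, width - win + 1, step):   # move rightward within a row
            yield x, y


def crop(image, x, y, win=16):
    """Cut the scanning window area image out of a list-of-rows image."""
    return [row[x:x + win] for row in image[y:y + win]]


# A 32x24 image yields 5 horizontal and 3 vertical positions.
positions = list(window_positions(32, 24))
image = [[0] * 32 for _ in range(24)]
first_window = crop(image, *positions[0])
```

Because the window size is fixed while the image shrinks at each resolution level, a window at level i covers an area roughly α^i times larger in the original frame.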

The image feature vector calculation means 42 calculates, for the frame image and for each low-resolution image, a feature vector of each scanning window area image generated by the scanning means 41. Here, the image feature vector calculation means 42 calculates from the scanning window area image either a feature vector based on shape or a feature vector based on color and texture.

A first example and a second example of the feature vector calculation method are described below in order.
<First Example: Feature Vector Based on Shape>
When a shape-based feature vector is used, the image feature vector calculation means 42 applies edge detection (for example, a first-order differential filter such as a Sobel filter, a Prewitt filter, or a Roberts filter) to the scanning window area image to generate an edge image. The image feature vector calculation means 42 then calculates the shape-based feature vector from the generated edge image.

Examples of shape-based feature vectors include the edge direction histogram and the edge intensity histogram.
Edge direction histogram: the edge direction is quantized into several patterns, and a histogram of the frequency of each pattern within the scanning window area image is calculated and used as the feature vector.
Edge intensity histogram: the edge intensity is quantized into several patterns, and a histogram of the frequency of each pattern within the scanning window area image is calculated and used as the feature vector.
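The two histograms can be sketched in plain Python as follows. The sketch uses a Sobel filter (one of the filters named above) on a grayscale window stored as a list of rows; the number of bins, the normalization constant `mag_max`, and the rule of ignoring near-zero gradients when counting directions are assumptions made for illustration.

```python
import math


def sobel(img, x, y):
    """Horizontal and vertical Sobel responses at an interior pixel."""
    gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
          - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
    gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
          - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
    return gx, gy


def edge_histograms(img, dir_bins=8, mag_bins=4, mag_max=1020.0, min_mag=1.0):
    """Return (edge direction histogram, edge intensity histogram) for one window."""
    h, w = len(img), len(img[0])
    dir_hist = [0] * dir_bins
    mag_hist = [0] * mag_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx, gy = sobel(img, x, y)
            mag = math.hypot(gx, gy)
            # Quantize intensity; values above mag_max fall into the last bin.
            mag_hist[min(int(mag / mag_max * mag_bins), mag_bins - 1)] += 1
            if mag >= min_mag:  # direction is undefined where there is no edge
                angle = math.atan2(gy, gx) % (2 * math.pi)
                dir_hist[int(angle / (2 * math.pi) * dir_bins) % dir_bins] += 1
    return dir_hist, mag_hist


# 16x16 window with a vertical step edge: left half black, right half white.
window = [[0] * 8 + [255] * 8 for _ in range(16)]
dir_hist, mag_hist = edge_histograms(window)
feature_vector = dir_hist + mag_hist
```

For the step-edge window all edge pixels share one gradient direction, so the direction histogram has a single non-empty bin; text regions typically spread their counts over several directions.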

<Second Example: Feature Vector Based on Color / Texture>
Examples of feature vectors based on color and texture include the luminance moment, a feature vector obtained by the Haar wavelet transform, and the local binary pattern.
Luminance moment: the mean and variance of the luminance within the scanning window area image are calculated and used as the feature vector.
Haar wavelet: a two-dimensional Haar wavelet transform is applied to the scanning window area image over several levels, and the variance in each subband is calculated and used as the feature vector.

Local binary pattern: the scanning window area image is divided into blocks, the local binary pattern is calculated at each pixel in a block, a histogram of local binary patterns is generated for each block, and the histograms are concatenated to form the feature vector.
Details of the local binary pattern are described, for example, in T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, 2002.
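The block-wise local binary pattern feature can be sketched as follows. This uses the basic 8-neighbor, 3 by 3 pattern rather than the multiresolution, rotation-invariant variants of the cited paper, skips border pixels that lack a full neighborhood, and assumes a block size of 8 pixels; all of these are simplifying assumptions.

```python
# Clockwise 8-neighborhood starting at the top-left pixel.
NEIGHBOURS = [(-1, -1), (0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0)]


def lbp_code(img, x, y):
    """Basic LBP: set one bit per neighbor that is >= the center pixel."""
    center = img[y][x]
    code = 0
    for bit, (dx, dy) in enumerate(NEIGHBOURS):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_feature(img, block=8):
    """Concatenate one 256-bin LBP histogram per block of the window."""
    h, w = len(img), len(img[0])
    feature = []
    for by in range(0, h, block):
        for bx in range(0, w, block):
            hist = [0] * 256
            # Border pixels of the window are skipped (no full neighborhood).
            for y in range(max(by, 1), min(by + block, h - 1)):
                for x in range(max(bx, 1), min(bx + block, w - 1)):
                    hist[lbp_code(img, x, y)] += 1
            feature.extend(hist)
    return feature


# A flat 16x16 window: every neighbor equals the center, so every code is 255.
window = [[128] * 16 for _ in range(16)]
feature_vector = lbp_feature(window)
```

A 16 by 16 window with 8-pixel blocks gives four histograms, so the feature vector in this sketch has 4 × 256 = 1024 elements.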

The image feature vector calculation means 42 may calculate the feature vector by any of the methods described above, or by a combination of two or more of them.

The description of the configuration of the video caption detection device 1 continues below.
The caption candidate area determination means 43 determines by machine learning, for the frame image and for each low-resolution image, whether each scanning window area image is a caption candidate area, based on the feature vector of that scanning window area image calculated by the image feature vector calculation means 42.

具体的には、字幕候補領域判定手段４３では、画像特徴ベクトル算出手段４２によって算出された走査窓領域画像の特徴ベクトルを識別器に入力して、走査窓領域画像が字幕候補領域であるか否かを判定する。その後、字幕候補領域判定手段４３は、字幕候補領域と判定されたものを記憶手段３０に記憶させる。 Specifically, the caption candidate area determination unit 43 inputs the feature vector of the scanning window area image calculated by the image feature vector calculation unit 42 to the discriminator, and determines whether or not the scanning window area image is a caption candidate area. After that, the caption candidate area determination unit 43 stores the areas determined to be caption candidate areas in the storage unit 30.

ここで、識別器は、字幕領域が出現する画像から求めた特徴ベクトル（正例）と、字幕領域が出現しない画像から求めた特徴ベクトル（負例）とを学習データとして、サポートベクタマシンなどの機械学習によって事前に準備しておく。例えば、サポートベクタマシンでは、学習データの中で、他クラスと最も近い位置になるものを基準として、ユークリッド距離が最大となる位置に識別境界を設定する。つまり、サポートベクタマシンでは、字幕領域が出現するクラスから、字幕領域が出現しないクラスまでのマージンを最大化するように識別境界を設定する。そして、サポートベクタマシンでは、識別境界を基準として、入力された走査窓領域画像の特徴ベクトルを、字幕領域が出現するクラス、または、字幕領域が出現しないクラスの何れかに分類する。 Here, the discriminator is prepared in advance by machine learning such as a support vector machine, using as learning data the feature vectors obtained from images in which a caption area appears (positive examples) and the feature vectors obtained from images in which no caption area appears (negative examples). For example, a support vector machine sets the identification boundary at the position that maximizes the Euclidean distance to the learning data closest to the other class. That is, the support vector machine sets the identification boundary so as to maximize the margin between the class in which a caption area appears and the class in which no caption area appears. The support vector machine then classifies the feature vector of the input scanning window area image, relative to the identification boundary, into either the class in which a caption area appears or the class in which no caption area appears.

図３に示すように、字幕候補領域判定手段４３は、図２のフレーム画像の左上部「停戦○○地区被害の詳細明らかに」という字幕部分や、左下部「破壊された町その実態は」という字幕部分において、矩形状の字幕候補領域ＫＡを多数判定している。
なお、図３では、図面を見やすくするため、一部の字幕候補領域ＫＡのみ符号を図示した。 As shown in FIG. 3, the caption candidate area determination means 43 determines a large number of rectangular caption candidate areas KA in the caption portion "Ceasefire: details of damage in the ○○ district revealed" at the upper left of the frame image of FIG. 2 and in the caption portion "The destroyed town: its actual state" at the lower left.
Note that, in FIG. 3, only a part of the caption candidate areas KA is illustrated for easy understanding of the drawing.

字幕検出領域判定手段４４は、フレーム画像および低解像度画像ごとに、記憶手段３０に記憶された字幕候補領域が互いに重なる回数を算出し、算出した重なる回数が第１閾値以上となる字幕候補領域を字幕検出領域として判定するものである。
なお、第１閾値は、オペレータによって任意の値（例えば、２）に設定される。 The subtitle detection area determination unit 44 calculates the number of times that the subtitle candidate areas stored in the storage unit 30 overlap each other for each frame image and low-resolution image, and selects a subtitle candidate area where the calculated number of overlaps is equal to or greater than the first threshold. It is determined as a caption detection area.
The first threshold is set to an arbitrary value (for example, 2) by the operator.

具体的には、字幕検出領域判定手段４４は、記憶手段３０に記憶された字幕候補領域の座標にその字幕候補領域を配置した際、他の字幕候補領域に重なるか否かを判定する。そして、字幕検出領域判定手段４４は、字幕候補領域ごとに、その字幕候補領域が他の字幕候補領域に重なると判定された回数をカウントして、重なる回数を算出する。その後、字幕検出領域判定手段４４は、字幕候補領域ごとの重なる回数が第１閾値以上になるものを、字幕検出領域として判定する。ここで、字幕検出領域判定手段４４は、全ての字幕候補領域が走査窓領域画像（つまり、図２の走査窓Ｗｉ）と同じ大きさのため、字幕候補領域が重なるか否かを容易に判定できる。 Specifically, the caption detection area determination unit 44 determines whether or not each caption candidate area overlaps another caption candidate area when it is placed at its coordinates stored in the storage unit 30. Then, for each caption candidate area, the caption detection area determination unit 44 counts the number of times that the caption candidate area is determined to overlap another caption candidate area, and thereby calculates the number of overlaps. Thereafter, the caption detection area determination unit 44 determines the caption candidate areas whose number of overlaps is equal to or greater than the first threshold to be caption detection areas. Here, because all the caption candidate areas have the same size as the scanning window area image (that is, the scanning window Wi in FIG. 2), the caption detection area determination unit 44 can easily determine whether or not caption candidate areas overlap.
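The overlap counting described above can be sketched as follows. This is an illustration only: the function names are hypothetical, candidates are represented by the top-left corner of a square window of side `size`, and windows that merely touch at an edge are treated as non-overlapping, which the patent does not specify.

```python
def rects_overlap(a, b, size):
    """True if two size-by-size windows with top-left corners a and b intersect."""
    # Equal-sized windows intersect exactly when both offsets are below size.
    return abs(a[0] - b[0]) < size and abs(a[1] - b[1]) < size


def select_detection_windows(candidates, size, first_threshold=2):
    """Keep candidates that overlap at least first_threshold other candidates."""
    kept = []
    for i, a in enumerate(candidates):
        count = sum(
            1
            for j, b in enumerate(candidates)
            if i != j and rects_overlap(a, b, size)
        )
        if count >= first_threshold:
            kept.append(a)
    return kept
```

Because every candidate has the same size as the scanning window, the overlap test reduces to two coordinate differences, which is the simplification the text points out.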

図４に示すように、字幕検出領域判定手段４４は、例えば、図３のフレーム画像の左上部「停戦○○地区被害の詳細明らかに」という字幕部分や、左下部「破壊された町その実態は」という字幕部分で字幕候補領域ＫＡが多数重なるため、これら字幕部分を字幕検出領域として判定している。
なお、図４では、図３の字幕候補領域ＫＡが重なる回数を濃淡（白黒）で表されており、重なる回数が多くなるほど淡く（白く）、重なる回数が少ないほど濃く（黒く）なる。 As shown in FIG. 4, the subtitle detection area determination means 44 determines, for example, the caption portion "Ceasefire: details of damage in the ○○ district revealed" at the upper left of the frame image of FIG. 3 and the caption portion "The destroyed town: its actual state" at the lower left to be subtitle detection areas, because a large number of subtitle candidate areas KA overlap in these caption portions.
In FIG. 4, the number of times the caption candidate areas KA of FIG. 3 overlap is represented by shading (black and white): the more overlaps there are, the lighter (whiter) the shade, and the fewer overlaps there are, the darker (blacker) the shade.

ラベリング手段４５は、字幕検出領域判定手段４４によって判定された字幕検出領域に対して、ラベリング処理を施すものである。つまり、ラベリング手段４５は、どの画素がどの字幕検出領域に属するかを調べるためにラベリング処理を施す。図５では、ラベリング処理によって、異なる字幕検出領域に属する画素が、赤、青、緑などの異なる色で表されている。 The labeling unit 45 performs a labeling process on the caption detection area determined by the caption detection area determination unit 44. That is, the labeling means 45 performs a labeling process to check which pixel belongs to which caption detection area. In FIG. 5, pixels belonging to different caption detection areas are represented by different colors such as red, blue, and green by the labeling process.

外接領域算出手段４６は、ラベリング手段４５によってラベリング処理が施された字幕検出領域の外接矩形を算出して、字幕検出領域を矩形状に整形するものである。具体的には、外接領域算出手段４６は、ラベリング処理が施された字幕検出領域の左上点、右上点、右下点および左下点を頂点とする矩形状に字幕検出領域を整形する。 The circumscribing area calculation unit 46 calculates a circumscribing rectangle of the caption detection area subjected to the labeling process by the labeling unit 45, and shapes the caption detection area into a rectangular shape. Specifically, the circumscribing area calculation means 46 shapes the caption detection area into a rectangular shape with the upper left point, the upper right point, the lower right point, and the lower left point of the labeled caption detection area subjected to the labeling process as vertices.
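The labeling and circumscribed-rectangle steps can be sketched together as follows. This is a minimal sketch that assumes the caption detection areas are given as a binary mask and that regions are 4-connected; the function name `label_bounding_boxes` is hypothetical, and the patent does not state which labeling algorithm or connectivity is used.

```python
from collections import deque


def label_bounding_boxes(mask):
    """Label 4-connected regions of truthy cells and return one
    (left, top, right, bottom) box per region, in raster-scan order."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first flood fill collects one labeled region.
            queue = deque([(sy, sx)])
            seen[sy][sx] = True
            left, top, right, bottom = sx, sy, sx, sy
            while queue:
                y, x = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return boxes
```

Each returned box has inclusive pixel bounds, so its four corners are the upper left, upper right, lower right and lower left points mentioned in the text.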

フィルタリング手段４７は、フレーム画像および低解像度画像ごとに、外接領域算出手段４６によって整形された字幕検出領域の評価値を算出し、算出した評価値が第３閾値以下になるか否かを判定すると共に、外接領域算出手段４６によって整形された字幕検出領域から、第３閾値以下になると判定された字幕検出領域を削除するものである。 The filtering means 47 calculates the evaluation value of the caption detection area shaped by the circumscribed area calculation means 46 for each frame image and low resolution image, and determines whether or not the calculated evaluation value is equal to or less than the third threshold value. At the same time, the caption detection area determined to be equal to or smaller than the third threshold is deleted from the caption detection area shaped by the circumscribed area calculation means 46.

具体的には、フィルタリング手段４７は、字幕検出領域の大きさ（面積）、または、字幕検出領域のエッジ強度の密度を評価値として算出する。そして、フィルタリング手段４７は、この評価値が第３閾値以下と判定された字幕検出領域を誤検出として削除し、削除されなかった字幕検出領域を記憶手段３０に記憶させる。 Specifically, the filtering unit 47 calculates the size (area) of the caption detection area or the edge strength density of the caption detection area as an evaluation value. Then, the filtering unit 47 deletes the caption detection area in which the evaluation value is determined to be equal to or less than the third threshold as an erroneous detection, and causes the storage unit 30 to store the caption detection area that has not been deleted.

ここで、エッジ強度の密度を評価値とする場合、フィルタリング手段４７は、字幕検出領域にエッジ検出処理を施して、字幕検出領域のエッジ画像を生成する。そして、フィルタリング手段４７は、字幕検出領域のエッジ画像内にあるエッジ成分の強度の総和を字幕検出領域の面積で割ることで、エッジ強度の密度を算出する。 Here, when the density of the edge strength is used as the evaluation value, the filtering unit 47 performs edge detection processing on the caption detection area to generate an edge image of the caption detection area. Then, the filtering unit 47 calculates the density of the edge strength by dividing the sum of the strengths of the edge components in the edge image of the caption detection region by the area of the caption detection region.
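The edge strength density and the threshold test can be sketched as follows. The patent does not name an edge detector, so this sketch assumes a simple forward-difference gradient with the sum of absolute horizontal and vertical differences as the edge strength; `edge_strength_density` and `passes_filter` are hypothetical names.

```python
def edge_strength_density(gray, box):
    """Sum of edge strengths inside box divided by the box area.

    gray is a 2-D list of intensities; box is (left, top, right, bottom)
    with inclusive pixel bounds.
    """
    left, top, right, bottom = box
    total = 0.0
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            # Forward differences; pixels on the last row or column of the
            # image contribute a zero gradient in that direction.
            gx = gray[y][x + 1] - gray[y][x] if x + 1 < len(gray[0]) else 0
            gy = gray[y + 1][x] - gray[y][x] if y + 1 < len(gray) else 0
            total += abs(gx) + abs(gy)
    area = (right - left + 1) * (bottom - top + 1)
    return total / area


def passes_filter(value, third_threshold):
    """An area is deleted when its evaluation value is at or below the threshold."""
    return value > third_threshold
```

The same `passes_filter` test applies when the evaluation value is the area of the region instead of the edge strength density.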

なお、第３閾値は、オペレータによって任意の値に設定される。例えば、第３閾値は、字幕検出領域の大きさ（面積）を評価値とする場合には７００平方画素、エッジ強度の密度を評価値とする場合には４である。
また、本実施形態では、フィルタリング手段４７は、後記する拡大手段５１が字幕検出領域を拡大するため、字幕検出領域に、元となる低解像度画像の縮小率を付加することとする。 Note that the third threshold is set to an arbitrary value by the operator. For example, the third threshold is 700 square pixels when the size (area) of the caption detection region is used as an evaluation value, and 4 when the density of edge strength is used as the evaluation value.
Further, in the present embodiment, the filtering unit 47 adds the reduction ratio of the original low-resolution image to the caption detection area because the enlargement unit 51 described later expands the caption detection area.

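従って、図６に示すように、図２のフレーム画像の左上部「停戦○○地区被害の詳細明らかに」という字幕部分や、左下部「破壊された町その実態は」という字幕部分が、字幕検出領域ＳＡとして判定されることになる。 Therefore, as shown in FIG. 6, the caption portion "Ceasefire: details of damage in the ○○ district revealed" at the upper left of the frame image of FIG. 2 and the caption portion "The destroyed town: its actual state" at the lower left are determined to be caption detection areas SA.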

検出結果統合手段５０は、Ｎ＋１種類の解像度に対応する字幕検出領域を統合して、字幕領域を検出結果として出力するものである。このため、検出結果統合手段５０は、拡大手段５１と、字幕検出領域統合手段（文字領域出力手段）５２と、誤検出削除手段（文字領域削除手段）５３とを備える。 The detection result integration unit 50 integrates subtitle detection areas corresponding to N + 1 types of resolutions and outputs the subtitle areas as detection results. Therefore, the detection result integration unit 50 includes an enlargement unit 51, a caption detection area integration unit (character area output unit) 52, and an erroneous detection deletion unit (character area deletion unit) 53.

拡大手段５１は、記憶手段３０に記憶された字幕検出領域を、この字幕検出領域に対応する低解像度画像がフレーム画像と同じ解像度になる拡大率で拡大するものである。つまり、拡大手段５１は、記憶手段３０に記憶された字幕検出領域のうち、低解像度画像に対応する字幕検出領域を拡大する。当然、拡大手段５１は、フレーム画像から判定された字幕検出領域を拡大する必要がない。
なお、拡大率は、各字幕検出領域に付加された縮小率の逆数であり、α^ｉと表すことができる。 The enlarging means 51 enlarges the caption detection area stored in the storage means 30 at an enlargement ratio at which the low resolution image corresponding to the caption detection area becomes the same resolution as the frame image. That is, the enlarging means 51 enlarges the caption detection area corresponding to the low resolution image among the caption detection areas stored in the storage means 30. Naturally, the enlarging means 51 does not need to enlarge the caption detection area determined from the frame image.
The enlargement rate is the reciprocal of the reduction rate added to each caption detection area, and can be expressed as α^i.

以下、図７を参照し、図２のフレーム画像の左下部「破壊された町その実態は」という字幕部分に着目して、字幕検出領域の拡大について説明する。なお、図７では、説明を簡易にするため、この字幕部分以外の図示を省略している。 Hereinafter, with reference to FIG. 7, the enlargement of the subtitle detection area will be described, focusing on the caption portion "The destroyed town: its actual state" at the lower left of the frame image of FIG. 2. In FIG. 7, illustrations other than this caption portion are omitted for simplicity of explanation.

図７（ａ）に示すように、フレーム画像Ｐ_ｌには、元のサイズの字幕が含まれている。一方、低解像度画像Ｐ_ｓは、解像度変換手段２０によって縮小されるため、字幕のサイズが小さくなっている。この場合、図７（ｂ）に示すように、字幕検出手段４０は、フレーム画像Ｐ_ｌから大きなサイズの字幕検出領域ＳＡ_ｌを判定し、低解像度画像Ｐ_ｓから小さなサイズの字幕検出領域ＳＡ_ｓを判定することになる。しかし、この状態では、字幕検出領域ＳＡ_ｌ，ＳＡ_ｓは、互いの縮小率が異なるために、重なるか否かを正確に判定できない。 As shown in FIG. 7A, the frame image P_l contains the caption at its original size. On the other hand, since the low-resolution image P_s is reduced by the resolution conversion means 20, the caption in it is smaller. In this case, as shown in FIG. 7B, the caption detection unit 40 determines a large caption detection area SA_l from the frame image P_l and a small caption detection area SA_s from the low-resolution image P_s. In this state, however, the caption detection areas SA_l and SA_s have different reduction rates, so it cannot be accurately determined whether or not they overlap.

そこで、図７（ｃ）に示すように、拡大手段５１は、図７（ｂ）の字幕検出領域ＳＡ_ｓの座標ＰＴ_ｓ（ｘ，ｙ）に拡大率を乗じて、拡大後の字幕検出領域ＳＡ_ｓ´の座標ＰＴ_ｓ´（ｘ´，ｙ´）を算出する。さらに、拡大手段５１は、字幕検出領域ＳＡ_ｓの縦辺および横辺の長さに拡大率を乗じて、拡大後の字幕検出領域ＳＡ_ｓ´の縦辺および横辺の長さを算出する。このようにして、拡大手段５１は、低解像度画像Ｐ_ｓがフレーム画像Ｐ_ｌと同じ大きさとなるように、字幕検出領域ＳＡ_ｓを拡大する。 Therefore, as shown in FIG. 7C, the enlarging means 51 multiplies the coordinates PT_s(x, y) of the caption detection area SA_s in FIG. 7B by the enlargement rate to calculate the coordinates PT_s'(x', y') of the enlarged caption detection area SA_s'. Furthermore, the enlarging means 51 multiplies the lengths of the vertical and horizontal sides of the caption detection area SA_s by the enlargement rate to calculate the lengths of the vertical and horizontal sides of the enlarged caption detection area SA_s'. In this way, the enlarging means 51 enlarges the caption detection area SA_s so that the low-resolution image P_s has the same size as the frame image P_l.

なお、字幕検出領域ＳＡ_ｌ，ＳＡ_ｓ，ＳＡ_ｓ´の座標ＰＴ_ｌ，ＰＴ_ｓ，ＰＴ_ｓ´は、画像左上を原点ＰＴ_０（０，０）として、字幕検出領域内に設定された基準点（例えば、字幕検出領域の左上点）を表すこととする。また、ｘ，ｘ´は水平座標であり、ｙ，ｙ´は垂直座標である。 Note that the coordinates PT_l, PT_s, and PT_s' of the caption detection areas SA_l, SA_s, and SA_s' represent a reference point set within each caption detection area (for example, the upper left point of the caption detection area), with the upper left corner of the image as the origin PT_0(0, 0). Here, x and x' are horizontal coordinates, and y and y' are vertical coordinates.
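The enlargement of a caption detection area can be sketched as follows, assuming each area is stored as its upper left reference point plus width and height, and that the reduction rate of the i-th low-resolution image is α^-i as stated in the description. The function name `enlarge_region` is hypothetical.

```python
def enlarge_region(region, alpha, i):
    """Scale an (x, y, width, height) region detected in the i-th
    low-resolution image back to frame-image coordinates."""
    scale = alpha ** i  # reciprocal of the reduction rate alpha ** -i
    x, y, w, h = region
    # The reference point and both side lengths are multiplied by the same rate.
    return (x * scale, y * scale, w * scale, h * scale)
```

With i = 0 the scale is 1, which matches the remark that areas determined from the frame image itself need no enlargement.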

以下、図１に戻り、映像字幕検出装置１の構成について説明を続ける。
字幕検出領域統合手段５２は、拡大手段５１によって拡大された低解像度画像ごとの字幕検出領域と、フレーム画像の字幕検出領域との何れか１以上が重なるか否かを判定し、字幕検出領域が互いに重なる割合に応じて、字幕検出領域を統合するものである。 Hereinafter, returning to FIG. 1, the description of the configuration of the video subtitle detection apparatus 1 will be continued.
The caption detection area integration unit 52 determines whether or not any one or more of the caption detection areas for each low-resolution image enlarged by the enlargement unit 51 and the caption detection areas of the frame image overlap, and integrates the caption detection areas according to the ratio at which they overlap each other.

説明を簡易にするため、図８に示すように、拡大手段５１によって拡大された字幕検出領域（つまり、低解像度画像に対応する字幕検出領域）ＳＡ_ｓ´と、フレーム画像に対応する字幕検出領域ＳＡ_ｌとを統合する例で説明する。 To simplify the explanation, as shown in FIG. 8, an example is described in which the caption detection area SA_s' enlarged by the enlargement unit 51 (that is, the caption detection area corresponding to the low-resolution image) is integrated with the caption detection area SA_l corresponding to the frame image.

字幕検出領域統合手段５２は、字幕検出領域ＳＡ_ｌ，ＳＡ_ｓ´の座標ＰＴ_ｌ，ＰＴ_ｓ´に字幕検出領域ＳＡ_ｌ，ＳＡ_ｓ´を配置した際、字幕検出領域ＳＡ_ｌ，ＳＡ_ｓ´が重なるか否かを判定する。そして、字幕検出領域ＳＡ_ｌ，ＳＡ_ｓ´が重なる場合、字幕検出領域統合手段５２は、字幕検出領域ＳＡ_ｓ´の全面積のうち、基準字幕検出領域ＳＡ_ｌに重なる部分の面積（図８のハッチング部分）を重なる割合として算出する。さらに、字幕検出領域統合手段５２は、この重なる割合が第２閾値以上であるか否かを判定する。 When the caption detection areas SA_l and SA_s' are placed at their coordinates PT_l and PT_s', the caption detection area integration means 52 determines whether or not the caption detection areas SA_l and SA_s' overlap. When they overlap, the caption detection area integration means 52 calculates, as the overlapping ratio, the proportion of the total area of the caption detection area SA_s' that overlaps the reference caption detection area SA_l (the hatched portion in FIG. 8). Further, the caption detection area integration means 52 determines whether or not this overlapping ratio is equal to or greater than the second threshold.

なお、互いに重なる字幕検出領域のうち、フレーム画像または解像度が最大の低解像度画像に対応する字幕検出領域を基準字幕検出領域（基準文字検出領域）と呼ぶ。図８の例では、字幕検出領域ＳＡ_ｌがフレーム画像に対応するため、基準文字検出領域となる。また、図８において、縮小された低解像度画像のうち、解像度が最大の低解像度画像に字幕検出領域ＳＡ_ｌが対応し、より解像度が低い低解像度画像に字幕検出領域ＳＡ_ｓ´が対応する場合、字幕検出領域ＳＡ_ｌが基準文字検出領域となる。
また、第２閾値は、オペレータによって任意の値（例えば、８０％）に設定される。 Of the caption detection areas that overlap each other, the caption detection area corresponding to the frame image or to the low-resolution image with the highest resolution is referred to as the reference caption detection area (reference character detection area). In the example of FIG. 8, the caption detection area SA_l corresponds to the frame image and is therefore the reference character detection area. Likewise, if in FIG. 8 the caption detection area SA_l corresponded to the highest-resolution image among the reduced low-resolution images and the caption detection area SA_s' corresponded to a low-resolution image of lower resolution, the caption detection area SA_l would be the reference character detection area.
The second threshold value is set to an arbitrary value (for example, 80%) by the operator.

具体的には、算出した割合が第２閾値以上の場合（図８（ａ）参照）、字幕検出領域統合手段５２は、基準文字検出領域ＳＡ_ｌのみを字幕領域として誤検出削除手段５３に出力する。言い換えるなら、字幕検出領域統合手段５２は、字幕検出領域ＳＡ_ｓ´を検出結果から削除するように、字幕検出領域ＳＡ_ｌ，ＳＡ_ｓ´を統合する。 Specifically, when the calculated ratio is equal to or greater than the second threshold (see FIG. 8A), the caption detection area integration means 52 outputs only the reference character detection area SA_l to the erroneous detection deletion means 53 as a caption area. In other words, the caption detection area integration means 52 integrates the caption detection areas SA_l and SA_s' so as to delete the caption detection area SA_s' from the detection result.

また、算出した割合が第２閾値未満の場合（図８（ｂ）参照）、字幕検出領域統合手段５２は、字幕検出領域ＳＡ_ｌ，ＳＡ_ｓ´を字幕領域として誤検出削除手段５３に出力する。言い換えるなら、字幕検出領域統合手段５２は、字幕検出領域ＳＡ_ｌ，ＳＡ_ｓ´の両方で囲われる最大領域が１つの字幕領域となるように、字幕検出領域ＳＡ_ｌ，ＳＡ_ｓ´を統合する。 When the calculated ratio is less than the second threshold (see FIG. 8B), the caption detection area integration unit 52 outputs the caption detection areas SA_l and SA_s' to the erroneous detection deletion unit 53 as a caption area. In other words, the caption detection area integration unit 52 integrates the caption detection areas SA_l and SA_s' so that the maximum region enclosed by both of them becomes one caption area.

また、字幕検出領域ＳＡ_ｌ，ＳＡ_ｓ´が重ならない場合（図８（ｃ）参照）、字幕検出領域統合手段５２は、字幕検出領域ＳＡ_ｌ，ＳＡ_ｓ´を別々の字幕領域として誤検出削除手段５３に出力する。
なお、図８では、２個の字幕検出領域を例として説明したが、字幕検出領域統合手段５２は、全ての字幕検出領域について互いに重なるか否かを判定する。 When the caption detection areas SA_l and SA_s' do not overlap (see FIG. 8C), the caption detection area integration unit 52 outputs the caption detection areas SA_l and SA_s' to the erroneous detection deletion means 53 as separate caption areas.
In FIG. 8, two subtitle detection areas are described as an example, but the subtitle detection area integration unit 52 determines whether or not all the subtitle detection areas overlap each other.
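The three cases of FIG. 8 can be sketched for one pair of areas as follows. This is an interpretation, not a definitive implementation: areas are (x, y, width, height) tuples, the function names are hypothetical, and for the case below the second threshold the "maximum region enclosed by both" is assumed to be the smallest rectangle enclosing both areas.

```python
def intersection_area(a, b):
    """Area shared by two (x, y, width, height) rectangles."""
    w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0


def integrate(reference, other, second_threshold=0.8):
    """Integrate the reference area SA_l with an enlarged area SA_s'."""
    inter = intersection_area(reference, other)
    if inter == 0:
        # Case (c): no overlap, both stay as separate caption areas.
        return [reference, other]
    # Overlapping ratio is measured against the area of SA_s'.
    ratio = inter / (other[2] * other[3])
    if ratio >= second_threshold:
        # Case (a): SA_s' is dropped and only the reference area remains.
        return [reference]
    # Case (b): assumed here to mean the enclosing rectangle of both areas.
    left = min(reference[0], other[0])
    top = min(reference[1], other[1])
    right = max(reference[0] + reference[2], other[0] + other[2])
    bottom = max(reference[1] + reference[3], other[1] + other[3])
    return [(left, top, right - left, bottom - top)]
```

A full implementation would apply this pairwise test to all caption detection areas of a frame, as the note on FIG. 8 indicates.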

以下、図１に戻り、映像字幕検出装置１の構成について説明を続ける。
誤検出削除手段５３は、字幕検出領域統合手段５２によって出力されたフレーム画像ごとの字幕領域が、そのフレーム画像から前後Ｍフレームの範囲内の他のフレーム画像に存在しないか否かを判定すると共に、字幕検出領域統合手段５２によって出力されたフレーム画像ごとの字幕領域から、他のフレーム画像に存在しないと判定された字幕領域を削除するものである。
なお、フレームの範囲Ｍは、オペレータによって任意の値（例えば、３０フレーム）に設定される。 Hereinafter, returning to FIG. 1, the description of the configuration of the video subtitle detection apparatus 1 will be continued.
The erroneous detection deletion means 53 determines whether or not the caption area for each frame image output by the caption detection area integration means 52 does not exist in other frame images within the range of the preceding and following M frames from the frame image. The subtitle area determined not to exist in the other frame images is deleted from the subtitle area for each frame image output by the subtitle detection area integration means 52.
The frame range M is set to an arbitrary value (for example, 30 frames) by the operator.

通常の映像では、１フレームだけ字幕が表示されることは極めて少ないと思われる。このため、誤検出削除手段５３は、フレーム画像間の時間的な連続性を考慮して、誤検出した可能性が高いと思われる字幕領域を削除する。 In normal video, it seems that subtitles are not displayed for only one frame. For this reason, the erroneous detection deletion unit 53 deletes a caption area that is considered to be highly likely to be erroneously detected in consideration of temporal continuity between frame images.

具体的には、誤検出削除手段５３は、あるフレーム画像の字幕領域が、そのフレーム画像の前後Ｍフレームの範囲内に位置する他のフレーム画像の字幕領域の何れかと重なるか否かを判定する。そして、誤検出削除手段５３は、そのフレーム画像の字幕領域が、他のフレーム画像の字幕領域の何れとも重ならない場合、そのフレーム画像の字幕領域を削除する。一方、誤検出削除手段５３は、そのフレーム画像の字幕領域が、他のフレーム画像の字幕領域の何れかと重なる場合、そのフレーム画像の字幕領域を削除しない。さらに、誤検出削除手段５３は、映像内の全ての字幕領域を処理した後、削除されずに残った字幕領域を検出結果として出力する。
なお、本実施形態では、誤検出削除手段５３は、映像内に字幕が出現する時刻を特定するために、字幕領域に元となるフレーム画像のタイムコードを付加することとする。 Specifically, the erroneous detection deletion unit 53 determines whether or not the caption area of a certain frame image overlaps with any of the caption areas of other frame images located within the range of the preceding and following M frames of the frame image. Then, when the caption area of the frame image does not overlap any of the caption areas of the other frame images, the erroneous detection deletion unit 53 deletes the caption area of the frame image. On the other hand, when the caption area of the frame image overlaps with any of the caption areas of the other frame images, the erroneous detection deletion unit 53 does not delete the caption area of the frame image. Further, after processing all the caption areas in the video, the erroneous detection deletion unit 53 outputs the caption areas that remain without being deleted as the detection result.
In the present embodiment, the erroneous detection deletion means 53 adds the time code of the original frame image to the caption area in order to specify the time when the caption appears in the video.
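The temporal filtering can be sketched as follows, assuming the caption areas of the whole video are available as a list indexed by frame number and each area is an (x, y, width, height) tuple. The function name `remove_isolated_regions` is hypothetical, and time codes are omitted from this sketch.

```python
def remove_isolated_regions(frames, m=30):
    """Drop caption areas that overlap no caption area in the
    preceding or following m frames.

    frames[t] is the list of (x, y, width, height) areas of frame t.
    """
    def overlaps(a, b):
        return (a[0] < b[0] + b[2] and b[0] < a[0] + a[2]
                and a[1] < b[1] + b[3] and b[1] < a[1] + a[3])

    result = []
    for t, regions in enumerate(frames):
        lo, hi = max(0, t - m), min(len(frames), t + m + 1)
        # Areas from every other frame inside the window of +/- m frames.
        neighbours = [r for u in range(lo, hi) if u != t for r in frames[u]]
        result.append([r for r in regions
                       if any(overlaps(r, n) for n in neighbours)])
    return result
```

Filtering against the original `frames` list rather than the partially filtered result keeps the outcome independent of processing order, which is a design choice of this sketch.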

［映像字幕検出装置の全体動作］
以下、図９を参照して、映像字幕検出装置１の全体動作について説明する（適宜図１参照）。 [Overall operation of video caption detection device]
Hereinafter, the overall operation of the video caption detection device 1 will be described with reference to FIG. 9 (see FIG. 1 as appropriate).

映像字幕検出装置１は、映像復号手段１０によって、入力された映像を復号し、復号した映像の全てのフレーム画像の出力を終了したか否かを判定する（ステップＳ１）。ここで、全てのフレーム画像の出力を終了していない場合（ステップＳ１でＮｏ）、映像字幕検出装置１は、映像復号手段１０によって、次のフレーム画像を解像度変換手段２０に出力する（ステップＳ２）。 The video subtitle detection apparatus 1 uses the video decoding means 10 to decode the input video and determines whether or not the output of all frame images of the decoded video has been completed (step S1). Here, when the output of all the frame images is not completed (No in step S1), the video subtitle detection apparatus 1 outputs the next frame image to the resolution conversion unit 20 by the video decoding unit 10 (step S2). ).

映像字幕検出装置１は、解像度変換手段２０によって、カウンタｉを０に設定する（ステップＳ３）。そして、映像字幕検出装置１は、解像度変換手段２０によって、カウンタｉが低解像度画像の枚数Ｎ未満であるか否かを判定する（ステップＳ４）。 The video subtitle detection apparatus 1 sets the counter i to 0 by the resolution conversion means 20 (step S3). Then, the video subtitle detection apparatus 1 determines whether or not the counter i is less than the number N of low-resolution images by the resolution conversion unit 20 (step S4).

カウンタｉが低解像度画像の枚数Ｎ未満である場合（ステップＳ４でＹｅｓ）、映像字幕検出装置１は、解像度変換手段２０によって、フレーム画像を縮小率α^−ｉだけ縮小する（ステップＳ５）。 When the counter i is less than the number N of low-resolution images (Yes in step S4), the video subtitle detection apparatus 1 reduces the frame image by the reduction rate α^-i by the resolution conversion unit 20 (step S5).

映像字幕検出装置１は、字幕検出手段４０によって、フレーム画像および低解像度画像に対し、字幕検出領域検出処理を施すことで、字幕検出領域を検出する（ステップＳ６）。なお、字幕検出領域検出処理の詳細は、後記する。
映像字幕検出装置１は、解像度変換手段２０によって、カウンタｉをインクリメントし（ステップＳ７）、ステップＳ４の処理に戻る。 The video subtitle detection apparatus 1 detects a subtitle detection area by performing subtitle detection area detection processing on the frame image and the low-resolution image by the subtitle detection means 40 (step S6). Details of the caption detection area detection process will be described later.
The video caption detection device 1 increments the counter i by the resolution conversion means 20 (step S7), and returns to the process of step S4.

カウンタｉが低解像度画像の枚数Ｎ未満でない場合（ステップＳ４でＮｏ）、映像字幕検出装置１は、拡大手段５１によって、低解像度画像に対応する字幕検出領域を、その低解像度画像がフレーム画像と同じ解像度になる拡大率で拡大する（ステップＳ８）。 If the counter i is not less than the number N of low resolution images (No in step S4), the video subtitle detection apparatus 1 uses the enlargement unit 51 to set a subtitle detection area corresponding to the low resolution image, and the low resolution image is a frame image. The image is enlarged at an enlargement rate that results in the same resolution (step S8).

映像字幕検出装置１は、字幕検出領域統合手段５２によって、低解像度画像ごとの字幕検出領域と、フレーム画像の字幕検出領域との何れか１以上が重なるか否かを判定し、字幕検出領域が互いに重なる割合に応じて、字幕検出領域を統合する。ここで、字幕検出領域が互いに重ならない場合、または、字幕検出領域の重なる割合が第２閾値未満の場合、映像字幕検出装置１は、字幕検出領域統合手段５２によって、各字幕検出領域をそれぞれ字幕領域として出力する。一方、字幕検出領域の重なる割合が第２閾値以上の場合、映像字幕検出装置１は、字幕検出領域統合手段５２によって、基準文字検出領域のみを字幕領域として出力する（ステップＳ９）。 The video subtitle detection apparatus 1 determines whether or not one or more of the subtitle detection area for each low resolution image and the subtitle detection area of the frame image overlap each other by the subtitle detection area integration unit 52. The subtitle detection areas are integrated according to the overlapping ratio. Here, when the subtitle detection areas do not overlap each other, or when the overlapping ratio of the subtitle detection areas is less than the second threshold, the video subtitle detection apparatus 1 uses the subtitle detection area integration unit 52 to subtitle each subtitle detection area. Output as a region. On the other hand, when the overlapping ratio of the caption detection areas is equal to or greater than the second threshold value, the video caption detection apparatus 1 outputs only the reference character detection area as the caption area by the caption detection area integration unit 52 (step S9).

映像字幕検出装置１は、誤検出削除手段５３によって、時間的連続性に基づいて、誤検出した可能性が高い字幕領域を削除する。つまり、映像字幕検出装置１は、誤検出削除手段５３によって、あるフレーム画像の字幕領域が、そのフレーム画像から前後Ｍフレームの範囲内の他のフレーム画像に存在しないか否かを判定する。そして、他のフレーム画像の字幕領域と全く重ならない場合、映像字幕検出装置１は、誤検出削除手段５３によって、そのフレーム画像の字幕領域を削除する（ステップＳ１０）。その後、映像字幕検出装置１は、ステップＳ１の処理に戻る。
全てのフレーム画像の出力を終了した場合（ステップＳ１でＹｅｓ）、映像字幕検出装置１は、全体動作を終了する。 In the video caption detection device 1, the erroneous detection deletion unit 53 deletes caption areas that are highly likely to have been erroneously detected, based on temporal continuity. That is, the video subtitle detection apparatus 1 determines, by the erroneous detection deletion unit 53, whether or not the subtitle area of a certain frame image is absent from the other frame images within the range of the preceding and following M frames. If the subtitle area does not overlap any subtitle area of the other frame images, the video subtitle detection apparatus 1 deletes the subtitle area of that frame image by the erroneous detection deletion means 53 (step S10). Thereafter, the video subtitle detection apparatus 1 returns to the process of step S1.
When the output of all the frame images is completed (Yes in step S1), the video subtitle detection device 1 ends the entire operation.

［字幕検出領域検出処理］
以下、図１０を参照して、字幕検出領域検出処理（図９のステップＳ６）について説明する（適宜図１参照）。 [Subtitle detection area detection processing]
Hereinafter, the caption detection area detection process (step S6 in FIG. 9) will be described with reference to FIG. 10 (see FIG. 1 as appropriate).

字幕検出手段４０は、走査手段４１によって、フレーム画像または低解像度画像全体の走査を終了したか否かを判定する（ステップＳ６１）。ここで、フレーム画像または低解像度画像全体の走査を終了していない場合（ステップＳ６１でＮｏ）、字幕検出手段４０は、走査手段４１によって、走査窓を少し移動させる（ステップＳ６２）。 The caption detection unit 40 determines whether or not the scanning unit 41 has finished scanning the entire frame image or low-resolution image (step S61). Here, when the scanning of the entire frame image or low-resolution image is not completed (No in step S61), the caption detection unit 40 moves the scanning window slightly by the scanning unit 41 (step S62).

字幕検出手段４０は、画像特徴ベクトル算出手段４２によって、走査窓領域画像から、その形状に基づく特徴ベクトル、または、その色・テクスチャに基づく特徴ベクトルを算出する（ステップＳ６３）。 The caption detection unit 40 calculates the feature vector based on the shape or the feature vector based on the color / texture from the scanning window region image by the image feature vector calculation unit 42 (step S63).

字幕検出手段４０は、字幕候補領域判定手段４３によって、字幕候補領域を識別器によって判定する。つまり、字幕検出手段４０は、字幕候補領域判定手段４３によって、走査窓領域画像の特徴ベクトルを識別器に入力し、この走査窓領域画像が字幕候補領域であるか否かを判定する（ステップＳ６４）。その後、字幕検出手段４０は、ステップＳ６１の処理に戻る。 The caption detection means 40 determines the caption candidate area by the classifier by the caption candidate area determination means 43. That is, the caption detection means 40 inputs the feature vector of the scanning window area image to the discriminator by the caption candidate area determination means 43, and determines whether or not this scanning window area image is a caption candidate area (step S64). ). Thereafter, the caption detection unit 40 returns to the process of step S61.
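The scanning loop of steps S61 to S64 can be sketched as follows. The window side `size`, the step width `step`, and the callback `classify` (standing in for feature extraction plus the discriminator) are assumptions of this sketch; the patent only says that the scanning window is moved slightly each time.

```python
def scan_windows(width, height, size, step):
    """Yield top-left corners of size-by-size windows in raster order."""
    for y in range(0, height - size + 1, step):
        for x in range(0, width - size + 1, step):
            yield (x, y)


def detect_candidates(width, height, size, step, classify):
    """Collect the windows that the classifier accepts as caption candidates."""
    return [(x, y)
            for x, y in scan_windows(width, height, size, step)
            if classify(x, y)]
```

Running the same fixed-size scan over the frame image and each low-resolution image is what lets a single classifier see captions at several apparent sizes.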

フレーム画像または低解像度画像全体の走査を終了した場合（ステップＳ６１でＹｅｓ）、字幕検出手段４０は、字幕検出領域判定手段４４によって、字幕検出領域を判定する。つまり、字幕検出手段４０は、字幕検出領域判定手段４４によって、字幕候補領域が互いに重なる回数を算出し、重なる回数が第１閾値以上となる字幕候補領域を字幕検出領域として判定する（ステップＳ６５）。 When scanning of the entire frame image or low-resolution image is completed (Yes in step S61), the caption detection unit 40 determines the caption detection region by the caption detection region determination unit 44. That is, the subtitle detection means 40 calculates the number of times that the subtitle candidate areas overlap each other by the subtitle detection area determination means 44, and determines the subtitle candidate area where the number of overlaps is equal to or greater than the first threshold as the subtitle detection area (step S65). .

字幕検出手段４０は、ラベリング手段４５によって、字幕検出領域のラベリングを行う（ステップＳ６６）。そして、字幕検出手段４０は、外接領域算出手段４６によって、字幕検出領域の外接矩形を算出する（ステップＳ６７）。 The caption detection means 40 labels the caption detection area by the labeling means 45 (step S66). Then, the caption detection means 40 calculates the circumscribed rectangle of the caption detection area by the circumscribed area calculation means 46 (step S67).

字幕検出手段４０は、フィルタリング手段４７によって、字幕検出領域の評価値を算出し、この評価値による字幕検出領域のフィルタリングを行う。つまり、字幕検出手段４０は、フィルタリング手段４７によって、字幕検出領域の大きさ、または、字幕検出領域のエッジ強度の密度を評価値として算出し、この評価値が第３閾値以下と判定された字幕検出領域を誤検出として削除する（ステップＳ６８）。 The caption detection means 40 calculates the evaluation value of the caption detection area by the filtering means 47 and performs filtering of the caption detection area based on the evaluation value. That is, the subtitle detection means 40 calculates the size of the subtitle detection area or the density of the edge strength of the subtitle detection area as the evaluation value by the filtering means 47, and the subtitle for which the evaluation value is determined to be equal to or less than the third threshold value. The detection area is deleted as a false detection (step S68).

以上のように、本発明の実施形態に係る映像字幕検出装置１は、フレーム画像および低解像度画像全体を走査対象とするために、字幕出現位置に係わらず字幕領域を検出できる。また、映像字幕検出装置１は、フレーム画像および各低解像度画像を同じ大きさの走査窓によって走査するので、フレーム画像の字幕が様々なサイズで走査窓領域画像に出現することになり、字幕サイズに係わらず字幕領域を検出できる。このように、エッジの密集という経験的な知識に依存しないため、映像字幕検出装置１は、字幕領域が複雑な背景の上に出現する場合でも、精度よく字幕領域を検出することができる。例えば、映像字幕検出装置１は、従来の字幕検出技術に比べて、スクロールスーパーや不規則な動きをする字幕を精度よく検出することができる。 As described above, because the video subtitle detection apparatus 1 according to the embodiment of the present invention scans the entire frame image and the entire low-resolution images, it can detect a subtitle area regardless of where the subtitle appears. In addition, since the video subtitle detection apparatus 1 scans the frame image and each low-resolution image with a scanning window of the same size, the subtitles of the frame image appear in the scanning window area images at various sizes, so the subtitle area can be detected regardless of the subtitle size. Because it thus does not depend on the empirical knowledge that edges are dense, the video subtitle detection apparatus 1 can accurately detect a subtitle area even when the subtitle area appears on a complicated background. For example, the video subtitle detection apparatus 1 can detect scrolling captions and subtitles that move irregularly more accurately than conventional subtitle detection techniques.

また、映像字幕検出装置１は、フィルタリング手段４７によって、誤検出した可能性が高い字幕検出領域を削除するため、字幕領域の検出精度をより高くすることができる。さらに、映像字幕検出装置１は、誤検出削除手段５３によって、誤検出した可能性が高い字幕領域を削除するため、字幕領域の検出精度をより高くすることができる。 Moreover, since the video subtitle detection apparatus 1 deletes the subtitle detection area that is highly likely to be erroneously detected by the filtering unit 47, the detection accuracy of the subtitle area can be further increased. Furthermore, since the video caption detection device 1 deletes a caption area that is highly likely to be erroneously detected by the erroneous detection deletion unit 53, the detection accuracy of the caption area can be further increased.

Accordingly, by applying character recognition to the detected caption areas, the video caption detection device 1 can be used as a keyword detection technique for searching scenes on recording equipment such as a hard disk recorder, and as a technique for generating additional information for analyzing video content. Specifically, the video caption detection device 1 is expected to be useful for generating digests of broadcast programs, extracting representative frames, searching for similar programs, and searching for related information.

In a conventional caption detection technique, machine learning on training data consisting of images in which captions of various sizes appear would presumably improve the detection accuracy for caption areas regardless of caption size. In that case, however, the effort required to collect the training data would be enormous, which is not realistic.
In contrast, the video caption detection device 1 converts the captions of a frame image to various sizes before the classifier makes its determination, so there is no need to machine-learn images in which captions of various sizes appear, and the number of machine learning runs can be reduced.

Although the present embodiment has been described as including the circumscribed area calculation means 46, the present invention is not limited to this. For example, when the caption area need not be rectangular, the video caption detection device 1 does not need to include the circumscribed area calculation means 46.
The video caption detection device 1 may also give the caption area a shape other than a rectangle.

Although the present embodiment has been described with the video caption detection device 1 detecting captions in video, the present invention is not limited to this. For example, the character detection device according to the present invention can detect, from a still image captured by a camera, characters in the background of that image (for example, characters on a signboard).

Although the present embodiment has been described with the caption candidate area determination means 43 using a support vector machine, the present invention is not limited to this. For example, the character detection device according to the present invention can use a neural network as the machine learning method.

Although the present embodiment has described the video caption detection device according to the present invention as a stand-alone device, the present invention can also be realized by a video caption detection program that makes the hardware resources of a general-purpose computer, such as the CPU, memory, and hard disk, operate cooperatively as each of the means described above. This video caption detection program may be distributed over a communication line, or written to a recording medium such as a CD-ROM or flash memory and distributed.

The character detection device according to the present invention was implemented on a general-purpose computer, a caption area detection experiment was performed on an experimental video, and the results were evaluated.

For this experiment, a device was prepared in which the video caption detection device 1 of FIG. 1 lacks the false detection deletion means 53 and performs no filtering in the time direction (hereinafter, video caption detection device 1B). In the video caption detection device 1B, the number of low-resolution images was set to N = 1, and only the frame image was processed, without generating low-resolution images. The video caption detection device 1B calculated the feature vector by combining an edge strength histogram, luminance moments, Haar wavelets, and local binary patterns. A Sobel filter was used for edge detection. In addition, 2,500 positive and 2,500 negative frame images were collected as training data from a different broadcast program, and this training data was machine-learned with a support vector machine.
Values whose description is omitted here, such as the first to third thresholds, are the same as in the embodiment described above.
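As a rough illustration of one of the features named above, the following sketch computes Sobel edge strength and a normalized edge strength histogram in plain Python. The bin count, the `max_value` normalization, and the handling of border pixels are assumptions; the experiment's actual feature parameters are not given in this section.

```python
import math


def sobel_edge_strength(image):
    """Gradient magnitude per pixel using 3x3 Sobel kernels.

    `image` is a list of rows of grayscale values; border pixels are left at 0.
    """
    height, width = len(image), len(image[0])
    strength = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            strength[y][x] = math.hypot(gx, gy)
    return strength


def edge_strength_histogram(strength, bins, max_value):
    """Normalized histogram of edge strengths (assumed binning scheme)."""
    counts = [0] * bins
    for row in strength:
        for value in row:
            index = min(int(value / max_value * bins), bins - 1)
            counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins
```

A histogram like this would form only one part of the feature vector; the luminance moments, Haar wavelet, and local binary pattern components would be concatenated with it before classification.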

A 30-minute news program was then input to the video caption detection device 1B as the experimental video. The experimental video had a resolution of 432 pixels wide by 240 pixels high, and 236 different captions appeared in it.

As the experimental results, the recall and precision given by the following detection result evaluation formulas were calculated. Here, Nb is the total number of caption areas correctly detected from the experimental video, Nc is the total number of caption areas that appear in the experimental video, and No is the total number of caption areas detected from the experimental video. That is, recall indicates how few captions are missed, and precision indicates how few false detections occur.
Recall = Nb / Nc
Precision = Nb / No ... (detection result evaluation formulas)
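The two evaluation formulas translate directly into code. The function and parameter names below are illustrative, and the counts in the usage note are made-up values, not results from the experiment.

```python
def recall(n_b, n_c):
    """Recall = Nb / Nc: correctly detected areas over areas that appear."""
    return n_b / n_c if n_c else 0.0


def precision(n_b, n_o):
    """Precision = Nb / No: correctly detected areas over all detected areas."""
    return n_b / n_o if n_o else 0.0
```

For instance, with hypothetical counts Nb = 8, Nc = 10, and No = 16, `recall(8, 10)` gives 0.8 and `precision(8, 16)` gives 0.5.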

For comparison, caption areas were detected from the same experimental video using the technique described in Non-Patent Document 1, and recall and precision were calculated with the same detection result evaluation formulas.
The table below shows the experimental results of the video caption detection device 1B as the example, and the experimental results of the technique described in Non-Patent Document 1 as the comparative example.

According to the table above, recall was about 80% for both the example and the comparative example. The experimental video contained no special captions such as moving captions, and neither method could detect caption areas with a small brightness difference from the background, so no large difference in recall arose between the two methods.

For precision, the example was about 15% higher than the comparative example. In the comparative example, false detections occurred in regions where background edges are dense, as in the images of FIGS. 11 and 12. In the example, by contrast, the caption areas were detected accurately from the images of FIGS. 11 and 12. As a result, a large difference arose in the precision of the two methods.

From the above, it was found that the character detection device according to the present invention greatly improves the detection accuracy for caption areas even when it applies the caption detection area detection processing only to the frame image, without using low-resolution images. If the processing were also applied to low-resolution images, the character detection device according to the present invention would be expected to improve the detection accuracy for caption areas dramatically compared with conventional character detection techniques.

1, 1B Video caption detection device
10 Video decoding means
20 Resolution conversion means
30 Storage means
40 Caption detection means
41 Scanning means
42 Image feature vector calculation means
43 Caption candidate area determination means (character candidate area determination means)
44 Caption detection area determination means (character detection area determination means)
45 Labeling means
46 Circumscribed area calculation means
47 Filtering means (character detection area deletion means)
50 Detection result integration means
51 Enlargement means
52 Caption detection area integration means (character area output means)
53 False detection deletion means (character area deletion means)

Claims

A character area detection device that detects, from an input image, a character area, which is an area of characters included in the input image, the device comprising:
resolution conversion means for receiving the input image and converting the input image into one or more low-resolution images having a lower resolution than the input image;
scanning means for scanning the input image and each low-resolution image converted by the resolution conversion means with a scanning window of the same size, thereby generating, for each of the input image and the low-resolution images, scanning window area images corresponding to the area of the scanning window;
image feature vector calculation means for calculating, for each of the input image and the low-resolution images, a feature vector of each scanning window area image generated by the scanning means;
character candidate area determination means for determining by machine learning, for each of the input image and the low-resolution images, whether each scanning window area image is a character candidate area, based on the feature vector calculated by the image feature vector calculation means;
character detection area determination means for calculating, for each of the input image and the low-resolution images, the number of times the character candidate areas determined by the character candidate area determination means overlap one another, and determining an area in which the calculated number of overlaps is equal to or greater than a preset first threshold to be a character detection area;
enlargement means for enlarging the character detection area of each low-resolution image determined by the character detection area determination means, at the enlargement ratio at which the low-resolution image corresponding to that character detection area has the same resolution as the input image; and
character area output means for determining whether at least one of the character detection areas of the low-resolution images enlarged by the enlargement means and the character detection area of the input image overlap, calculating, among the character detection areas determined to overlap one another, the ratio at which the other character detection areas overlap a reference character detection area, the reference character detection area being the character detection area corresponding to the input image or low-resolution image having the highest resolution, and outputting only the reference character detection area as the character area when the calculated overlap ratio is equal to or greater than a preset second threshold.

The character area detection device according to claim 1, further comprising:
character detection area deletion means for calculating, for each of the input image and the low-resolution images, the size of the character detection area determined by the character detection area determination means or the density of its edge strength as an evaluation value, determining whether the calculated evaluation value is equal to or less than a preset third threshold, and deleting the character detection areas determined to be equal to or less than the third threshold from the character detection areas determined by the character detection area determination means.

The character area detection device according to claim 1, wherein the character area output means further outputs, as the character area, the character detection areas determined not to overlap one another and the character detection areas determined to overlap one another for which the calculated overlap ratio is less than the second threshold.

The character area detection device according to claim 1, further comprising:
video decoding means for decoding an input video and outputting each frame image constituting the video to the resolution conversion means as the input image; and
character area deletion means for determining whether the character area of each frame image output by the character area output means is absent from the other frame images within a preset range of that frame image, and deleting, from the character areas of each frame image output by the character area output means, the character areas determined to be absent from the other frame images,
wherein the character area output means outputs the character area for each frame image.

A character area detection program for detecting, from an input image, a character area, which is an area of characters included in the input image, the program causing a computer to function as:
resolution conversion means for receiving the input image and converting the input image into one or more low-resolution images having a lower resolution than the input image;
scanning means for scanning the input image and each low-resolution image converted by the resolution conversion means with a scanning window of the same size, thereby generating, for each of the input image and the low-resolution images, scanning window area images corresponding to the area of the scanning window;
image feature vector calculation means for calculating, for each of the input image and the low-resolution images, a feature vector of each scanning window area image generated by the scanning means;
character candidate area determination means for determining by machine learning, for each of the input image and the low-resolution images, whether each scanning window area image is a character candidate area, based on the feature vector calculated by the image feature vector calculation means;
character detection area determination means for calculating, for each of the input image and the low-resolution images, the number of times the character candidate areas determined by the character candidate area determination means overlap one another, and determining an area in which the calculated number of overlaps is equal to or greater than a preset first threshold to be a character detection area;
enlargement means for enlarging the character detection area of each low-resolution image determined by the character detection area determination means, at the enlargement ratio at which the low-resolution image corresponding to that character detection area has the same resolution as the input image; and
character area output means for determining whether at least one of the character detection areas of the low-resolution images enlarged by the enlargement means and the character detection area of the input image overlap, calculating, among the character detection areas determined to overlap one another, the ratio at which the other character detection areas overlap a reference character detection area, the reference character detection area being the character detection area corresponding to the input image or low-resolution image having the highest resolution, and outputting only the reference character detection area as the character area when the calculated overlap ratio is equal to or greater than a preset second threshold.