JP4483085B2

JP4483085B2 - Learning device, application device, learning method, and application method

Info

Publication number: JP4483085B2
Application number: JP2000392253A
Authority: JP
Inventors: 哲二郎近藤; 智勅使川原; 泰弘藤森
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2000-12-25
Filing date: 2000-12-25
Publication date: 2010-06-16
Anticipated expiration: 2020-12-25
Also published as: JP2002199349A

Description

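[0001]
This invention relates to a learning device, an application device, a learning method, and an application method applicable to generating an output moving image signal by converting the frame frequency of an input moving image signal.
[0002]
[Prior art]
To use an image signal created under one broadcasting system in another broadcasting system, its frame frequency is converted. Conventional frame frequency conversion changes the number of frames according to the relationship between the frame frequencies and generates the converted image by moving pixels according to detected motion vectors. Conventional frame frequency conversion has been performed regardless of the properties of the original image (input image).
[0003]
Telecine conversion is also performed to convert movie material shot on film into a video signal for television broadcasting. Because film generally runs at 24 frames per second, telecine conversion, depending on the frame frequency of the target video signal, either displays each film frame twice or uses the method called 2-3 pulldown, in which a pair of a two-time repetition and a three-time repetition is performed periodically.
[0004]
[Problems to be solved by the invention]
Because the conventional frame frequency conversion method generates the output image by a fixed method regardless of the properties of the input image, problems sometimes arise in the resolution of the output image. In addition, the conventional telecine conversion method has the problem that jerkiness of motion is very noticeable in the converted image.
[0005]
Accordingly, an object of this invention is to provide a learning device, an application device, a learning method, and an application method capable of obtaining a moving image signal with higher resolution and more natural motion when converting an image signal.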
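[0006]
To solve the above problems, the first invention is a learning device that calculates, based on a student moving image signal having a pre-conversion frame frequency and a teacher moving image signal having a post-conversion frame frequency, prediction coefficients for generating an output moving image signal by converting the frame frequency of an input moving image signal, the learning device comprising: motion vector detection means for obtaining an inter-frame motion vector of the pixel of a student image included in the student moving image signal that corresponds to a pixel of interest set in a teacher image included in the teacher moving image signal; tap center position determination means for determining a tap center position on the student image by shifting the motion vector to the position of the pixel of interest and allocating the motion vector to the frames from which it was calculated; class pixel group extraction means for extracting one or more pixels from the student image as a class pixel group based on the tap center position; class value determination means for calculating, from the class pixel group, a class value corresponding to the tap center position; student pixel group extraction means for extracting one or more pixels from the student image as a student pixel group based on the tap center position; and prediction coefficient learning means for generating a predicted teacher pixel based on the student pixel group and the prediction coefficients, and calculating the prediction coefficients for each class value so that the error between the pixel value of the predicted teacher pixel and the pixel value of the teacher pixel is minimized.
[0007]
The second invention is an application device comprising: a prediction coefficient memory that stores prediction coefficients; second motion vector detection means for obtaining, as a second motion vector, an inter-frame motion vector of an input image included in an input moving image signal; second tap center position determination means for determining, as a second tap center position, a tap center position on the input image based on the second motion vector; second class pixel group extraction means for extracting one or more pixels from the input moving image signal as a second class pixel group based on the second tap center position; second class value determination means for calculating, as a second class value, a class value corresponding to the second tap center position from the second class pixel group; prediction pixel group extraction means for extracting one or more pixels from the input image as a prediction pixel group based on the second tap center position; and prediction calculation means for generating an output moving image signal based on the prediction pixel group and the prediction coefficients corresponding to the second class value.
[0008]
The third invention is a learning method that calculates, based on a student moving image signal having a pre-conversion frame frequency and a teacher moving image signal having a post-conversion frame frequency, prediction coefficients for generating an output moving image signal by converting the frame frequency of an input moving image signal, the learning method comprising: a motion vector detection step of obtaining an inter-frame motion vector of the pixel of a student image included in the student moving image signal that corresponds to a pixel of interest set in a teacher image included in the teacher moving image signal; a tap center position determination step of determining a tap center position on the student image by shifting the motion vector to the position of the pixel of interest and allocating the motion vector to the frames from which it was calculated; a class pixel group extraction step of extracting one or more pixels from the student image as a class pixel group based on the tap center position; a class value determination step of calculating, from the class pixel group, a class value corresponding to the tap center position; a student pixel group extraction step of extracting one or more pixels from the student image as a student pixel group based on the tap center position; and a prediction coefficient learning step of generating a predicted teacher pixel based on the student pixel group and the prediction coefficients, and calculating the prediction coefficients for each class value so that the error between the pixel value of the predicted teacher pixel and the pixel value of the teacher pixel is minimized.
[0009]
The fourth invention is an application method comprising: a prediction coefficient memory that stores prediction coefficients; a second motion vector detection step of obtaining, as a second motion vector, an inter-frame motion vector of an input image included in an input moving image signal; a second tap center position determination step of determining, as a second tap center position, a tap center position on the input image based on the second motion vector; a second class pixel group extraction step of extracting one or more pixels from the input moving image signal as a second class pixel group based on the second tap center position; a second class value determination step of calculating, as a second class value, a class value corresponding to the second tap center position from the second class pixel group; a prediction pixel group extraction step of extracting one or more pixels from the input image as a prediction pixel group based on the second tap center position; and a prediction calculation step of generating an output moving image signal based on the prediction pixel group and the prediction coefficients corresponding to the second class value.
[0010]
According to this invention, motion information for the pixel of interest is obtained with a precision equal to or finer than the pixel interval of the second image signal, and the pixel of interest is classified based on that motion information, so that an image signal with higher resolution and more natural motion can be obtained.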
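[0011]
[Embodiments of the invention]
An embodiment in which this invention is applied to frame frequency conversion is described below with reference to the drawings. FIG. 1 shows the overall configuration of the embodiment. Reference numeral 100 denotes an image signal relation learning device, and reference numeral 200 denotes an image signal relation application device. A teacher image signal and a student image signal are supplied to the learning device 100. The teacher image signal is an image signal at the post-conversion frame frequency, and the student image signal is an image signal at the pre-conversion frame frequency that shows the same picture content as the teacher image signal. The pre-conversion and post-conversion frame frequencies can take various values, depending for example on differences between television systems or on a desired increase in temporal resolution. The learning device 100 learns the relationship between the teacher image signal and the student image signal and outputs image signal relation coefficients.
[0012]
The coefficients obtained by the learning device 100 are given to the application device 200. The application device 200 takes as input the coefficients learned by the learning device 100 and an image signal at the pre-conversion frame frequency, and outputs an image signal at the post-conversion frame frequency.
[0013]
FIG. 2 shows an example of the learning device 100. A student image at the pre-conversion frame frequency, a teacher image at the post-conversion frame frequency, and information on the frequencies before and after conversion are input to the learning device 100. The student image signal is supplied to the motion vector detection unit 101, which detects motion vectors between a predetermined number of frames. The frequency information before and after conversion is output from the frequency information controller 102. The frequency information is supplied to the time mode value determination unit 103, which determines the time mode value. The time mode value is a value relating to the temporal position of the frame to be learned.
[0014]
The motion vector and the time mode value are supplied to the tap center position determination unit 104, which determines the tap center position. The tap center position is supplied to the student pixel group extraction unit 105 and the class pixel group extraction unit 106. The extraction unit 105 extracts a student pixel group consisting of a plurality of student pixels based on the tap center position. The extracted student pixel group is supplied to the prediction coefficient learning unit 107.
[0015]
The extraction unit 106 extracts a class pixel group consisting of a plurality of student pixels based on the tap center position. The extracted class pixel group is supplied to the class value determination unit 108, which determines a class value from the class pixel group. The class value is supplied to the prediction coefficient learning unit 107.
[0016]
The position mode value determination unit 109 determines the position mode value based on the tap center position, the motion vector, and the time mode value, as described later. The position mode value is supplied to the prediction coefficient learning unit 107. In addition, the teacher pixel extraction unit 110 cuts out a teacher pixel based on the time mode value. The teacher pixel is supplied to the prediction coefficient learning unit 107.
[0017]
The prediction coefficient learning unit 107 is supplied with the time mode value, the position mode value, the class value, the student pixel group, and the teacher pixel. Using this information, the prediction coefficient learning unit 107 learns prediction coefficients for predicting the teacher pixel from the student pixel group, as described later.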
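[0018]
Although FIG. 2 shows the learning device 100 as a block diagram, the learning device 100 can also be realized in software. FIG. 3 is a flowchart showing the processing of the learning method. First, in step S1, a student image and a teacher image are input. In step S2, the time mode value is determined based on the frequency information.
[0019]
In step S3, motion vectors are detected from the student image. In step S4, the tap center position is determined based on the time mode value and the motion vector. In step S5, the position mode value is determined from the time mode value, the motion vector, and the tap center position.
[0020]
In step S6, a class pixel group is extracted from the student image based on the tap center position. In step S7, class classification is performed using the class pixel group, and the class value is determined. In step S8, a student pixel group is extracted from the student image based on the tap center position. In step S9, a teacher pixel is extracted from the teacher image.
[0021]
Steps S10 through S15 are the prediction coefficient learning process based on the least squares method. That is, when a predicted value is estimated as a linear combination of a plurality of prediction coefficients and student pixels, the prediction coefficients are determined so as to minimize the sum of squared errors between the predicted values and the true values in the teacher image. In practice, the expression for the sum of squared errors is partially differentiated, and the prediction coefficients are determined so that the partial derivatives become 0. This yields normal equations, which are solved by a general matrix solution method such as the sweep-out method to calculate the prediction coefficients.
[0022]
Step S10 adds the data into the normal equations of each class. In step S11, it is determined whether all pixels in the frame have been processed. If not, the process returns to step S4 (tap center position determination). When all pixels in the frame have been processed, step S12 determines whether all frames in the image have been processed. If not, the process returns to step S2 (time mode value determination). Further, step S13 determines whether all input images have been processed. If not, the process returns to step S1 (image input).
[0023]
When step S13 determines that all input images have been processed, the normal equations are solved by the sweep-out method in step S14. The resulting prediction coefficients are output in step S15.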
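[0024]
The embodiment is now described in more detail. First, motion vector detection in the motion vector detection unit 101 is described. Block matching, the gradient method, or the like can be used to detect motion vectors. The motion vector precision need only be pixel-unit precision or finer. Motion vectors may be detected for every pixel or, as long as the finally generated image is not degraded, every few pixels in the horizontal and/or vertical direction.
[0025]
Motion vectors are detected over two or more frames of the student image. Using more frames can reduce false detections in regions of low spatial activity and regions containing repetitive patterns. On the other hand, it may increase false detections for motion that deviates from the assumed linearity and in covered/uncovered background. In the following description, as an example, motion vectors are detected for every pixel with pixel-unit precision between two frames of the student image.
[0026]
Next, the time mode value determination unit 103 is described with reference to FIG. 4. The time mode value determination unit 103 determines, from the frequencies before and after conversion, the time mode value relating to the temporal position of the frame to be learned. The number of modes and the determination method differ depending on the ratio of the frame frequencies before and after conversion and on whether learning is performed for frames at which the student image and the teacher image share the same temporal position.
[0027]
FIG. 4A shows the case of learning coefficients for a conversion that doubles the frame frequency. In this case, two frames of the teacher image fall between two frames of the student image. Mode 0 and mode 1 are defined according to which of the two teacher frame positions the coefficients are learned for. For example, when learning coefficients for predicting pixel values on the temporally earlier teacher frame between the two student frames, the time mode value is 0, and when learning coefficients for predicting pixel values on the other teacher frame, the time mode value is 1.
[0028]
FIG. 4B shows the case of learning coefficients for a conversion that multiplies the frame frequency by 2.5. In this case, coefficients are not learned for frames at which the student image and the teacher image share the same temporal position, and two frames of the teacher image fall between two frames of the student image. Coefficients are therefore learned for predicting pixel values on teacher frames at four kinds of temporal positions. The time mode value takes a value from 0 to 3 depending on the frame position for which coefficients are learned.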
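[0029]
Next, the tap center position determination unit 104, which receives the motion vector and the time mode value and determines the tap center position, is described with reference to FIG. 5. FIG. 5A shows the teacher image and a motion vector determined from two frames of the student image in a conversion that multiplies the frame frequency by 5. As an example, the motion vector has only a vertical component, and a downward motion vector of two pixels has been detected. The tap center position is determined by the following procedure.
[0030]
First, the motion vector at the pixel position on the student frame corresponding to the pixel-of-interest position set on the teacher frame is shifted to the pixel-of-interest position. Next, according to the time mode value, the motion vector is allocated to the two frames of the student image that were used to detect it. The position displaced from the pixel-of-interest position by the allocated motion vector is calculated. That position, rounded to an integer pixel position by rounding to nearest, truncation, or rounding up, is the tap center position. A tap center position is set for each of the two student frames located before and after the teacher frame. When pixels (the student pixel group and the class pixel group) are cut out around the tap center position, they are cut out from each of these two frames.
[0031]
FIGS. 5B and 5C show cases with different time mode values. In the example of FIG. 5B, even after the position displaced by the allocated motion vector is calculated and rounded to an integer pixel position, the tap center pixel on the student image does not move by one pixel. In the example of FIG. 5C, by contrast, rounding shifts the tap center pixel on the student image by one pixel.
[0032]
FIG. 6 is a flowchart showing the flow when the tap center position determination is performed in software. In step S21, the pixel-of-interest position on the teacher image is determined. In step S22, the motion vector at the pixel-of-interest position on the student image is read. In step S23, the motion vector corresponding to the inter-frame temporal distance ratio is calculated. This process allocates the motion vector, according to the time mode value, to the two student frames that were used to detect it. For example, when the teacher frame lies midway between two student frames, the motion vector is multiplied by 1/2.
[0033]
In step S24, the corresponding pixel position on the student image (with fractional precision) is calculated according to the motion vector calculated in step S23. In step S25, the corresponding pixel position is rounded to an integer pixel position by truncation, rounding up, rounding to nearest, or the like. This completes the tap center position determination.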
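[0034]
FIG. 7 shows several examples of class pixel groups extracted by the class pixel group extraction unit 106 based on the tap center position described above. In FIG. 7, the pixel at the tap center position on the student image, determined as described above, is shown as a black circle. Surrounding pixels used as class pixels are hatched. Class pixel groups with the same arrangement are extracted from each of two temporally consecutive frames of the student image.
[0035]
To extract a class pixel group, the relative positions with respect to the tap center position determined by the tap center position determination unit 104 are read from memory, each class pixel position is obtained from the tap center position, and the corresponding pixel values are read from the frame buffer. The number and positions of the class pixels are chosen as appropriate in view of learning efficiency, memory limits, processing speed, and the like.
[0036]
Next, class value determination in the class value determination unit 108 is described with reference to FIG. 8. The pixel values of the class pixel group extracted as described above are encoded by 1-bit ADRC (Adaptive Dynamic Range Coding), and the encoding result (a bit string) read as an integer is used as the class value.
[0037]
When pixel values are represented with 8 bits, they can take values from 0 to 255. In FIG. 8, each of two temporally consecutive frames of the student image contains five class pixels, so the class pixel group consists of ten pixels in total. The difference between the maximum and minimum of these ten class pixel values is the dynamic range DR. Because 1-bit ADRC is used, the value obtained by halving DR is taken as the mid value, and each class pixel value is compared with this mid value. A class pixel value smaller than the mid value is encoded as "0", and a class pixel value equal to or greater than the mid value is encoded as "1". In the example of FIG. 8, the bit string resulting from 1-bit ADRC is (0000100001). This bit string read as an integer (= 33) is the class value.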
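The 1-bit ADRC classification described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `adrc_class_value` is hypothetical, the mid value is taken here as the minimum plus DR/2 (one reading of "the value obtained by halving DR"), and the pixel values are made up because the actual values of FIG. 8 are not given in the text.

```python
def adrc_class_value(pixels):
    """1-bit ADRC: compare each class pixel with the mid level and pack the bits."""
    lo, hi = min(pixels), max(pixels)
    dynamic_range = hi - lo            # DR = maximum - minimum
    mid = lo + dynamic_range / 2       # assumption: mid level = minimum + DR/2
    class_value = 0
    for p in pixels:
        bit = 1 if p >= mid else 0     # "1" when equal to or greater than the mid value
        class_value = (class_value << 1) | bit
    return class_value

# Hypothetical pixel values (not taken from FIG. 8): five class pixels per frame.
frame0 = [10, 12, 11, 13, 200]
frame1 = [14, 12, 11, 10, 180]
print(adrc_class_value(frame0 + frame1))  # bit string 0000100001 -> 33
```

With these values the fifth and tenth pixels lie above the mid level, which reproduces the bit string (0000100001) and the class value 33 quoted in the text.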
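[0038]
To reduce the number of classes, the value obtained by inverting the encoded bit string bit by bit may be used as the class value (that is, a bit string and its bitwise inversion are treated as the same class). In this case the number of classes is halved. In addition, when the tap arrangement is symmetric left-right or top-bottom, the pixel values may be reordered and the same calculation performed, halving the number of classes for each symmetry.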
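A minimal sketch of the class reduction by bit inversion follows. The text does not say which member of an inverted pair represents the class; taking the smaller integer is an assumption made here, and `fold_by_inversion` is a hypothetical name.

```python
def fold_by_inversion(class_value, nbits):
    """Map a class value and its bitwise inversion to one representative."""
    mask = (1 << nbits) - 1
    inverted = class_value ^ mask          # invert every one of the nbits bits
    return min(class_value, inverted)      # assumption: keep the smaller value

# For a 10-pixel class tap, 1024 raw classes fold into 512.
distinct = {fold_by_inversion(c, 10) for c in range(1 << 10)}
print(len(distinct))  # 512
```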
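[0039]
Next, position mode value determination in the position mode value determination unit 109 is described with reference to FIG. 9. The position mode value determination unit 109 receives the motion vector, the time mode value, and the tap center position. As described above, when the tap center position is determined, the motion vector is allocated from the pixel-of-interest position on the teacher frame to each of the two frames of the student image. The motion vector is obtained with pixel-unit precision. Therefore, the fractional offset between the position shifted from the pixel-of-interest position by the allocated motion vector and the integer grid position takes two values, 0 and 0.5, when the frame frequency is doubled. When the frame frequency is multiplied by 5, as in the example of FIG. 5, there are five patterns: 0, 0.2, 0.4, 0.6, and 0.8. When the frame frequency is multiplied by 2.5, there are the same five patterns as for a factor of 5.
[0040]
Considering this fractional offset separately in the horizontal and vertical directions, 2 × 2 = 4 position mode values are defined when the frame frequency is doubled, and 5 × 5 = 25 position mode values are defined when the frame frequency is multiplied by 5. FIG. 9 shows pixel positions in one of the two student frames for the example of multiplying the frame frequency by 5. FIG. 9A shows the case where the position mode value is determined by truncation. When a fractional-position tap center (shown as a white circle) lies within the 5 × 5 region, truncation changes it to the integer-position tap center shown as a black circle. With rounding to nearest, as shown in FIG. 9B, when a fractional-position tap center lies within the 5 × 5 region, the black-circle pixel at the center of the region becomes the integer-position tap center after rounding. In either case, 25 (5 × 5) position mode values are defined. The position mode value is supplied to the prediction coefficient learning unit 107.
[0041]
The flow of determining the position mode value in software is described with reference to FIG. 10. In step S31, the pixel-of-interest position on the teacher image is determined. In step S32, the motion vector at the pixel-of-interest position on the student image is read. In step S33, the motion vector is calculated according to the inter-frame temporal distance ratio. In step S34, the corresponding pixel position on the student image (with fractional precision) is calculated according to the calculated motion vector. In step S35, the difference between the corresponding pixel position and the tap center position is calculated. In step S36, the difference is converted into a position mode value.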
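Steps S35 and S36 can be illustrated roughly as below. The sketch assumes truncation is used to obtain the tap center (the FIG. 9A case) and assumes a row-major numbering of the mode values; the text does not specify how a difference is mapped to a number, and `position_mode` is a hypothetical name. Exact fractions are used so that offsets such as 0.2 are not disturbed by floating-point error.

```python
from fractions import Fraction
import math

def position_mode(exact_y, exact_x, steps):
    """Convert the sub-pixel offset of a position into a mode index.

    steps is the number of offset patterns per axis (2 for x2, 5 for x5 or x2.5).
    The tap center is assumed to be floor(position), i.e. truncation.
    """
    indices = []
    for v in (Fraction(exact_y), Fraction(exact_x)):
        scaled = (v - math.floor(v)) * steps   # offset in units of 1/steps
        if scaled.denominator != 1:
            raise ValueError("offset is not a multiple of 1/steps")
        indices.append(int(scaled))
    iy, ix = indices
    return iy * steps + ix                     # assumed row-major numbering

# x5 conversion: vertical position 9.6, horizontal position 3 -> offsets (0.6, 0).
print(position_mode(Fraction(48, 5), 3, 5))    # 15
```

Enumerating all offsets k/5 on both axes yields 25 distinct values, matching the 5 × 5 = 25 position modes stated in the text.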
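[0042]
The student pixel group extraction unit 105 extracts the student pixel group based on the tap center position determined by the tap center position determination unit 104. FIG. 11 shows several examples of extracted student pixel groups. Extraction of the student pixel group is the same as extraction of the class pixel group described above (see FIG. 7), except that the relative positions with respect to the tap center position differ. The number and positions of the student pixels extracted as the student pixel group are chosen as appropriate in view of learning efficiency, memory limits, processing speed, and the like.
[0043]
The teacher pixel extraction unit 110 extracts a teacher pixel from the teacher image based on the time mode value. That is, according to the time mode value whose coefficients are currently being learned, it selects one of the input teacher images and reads the pixel value corresponding to the pixel-of-interest position from the frame buffer.
[0044]
The prediction coefficient learning unit 107 learns prediction coefficients for predicting a teacher pixel from a student pixel group, based on the time mode value, the position mode value, the class value, the student pixel group, and the teacher pixel. The processing of the learning unit 107 is described below. Pairs of a student pixel group x_si (1 ≤ i ≤ n) and a teacher pixel x_t that share a particular combination of time mode value m_t, position mode value m_p, and class value c are considered to have nearly the same properties. Such pairs are therefore collected to obtain the prediction coefficients (w_1 to w_n).
[0045]
As an example, the teacher pixel is predicted by a linear combination of the pixel values of the student pixel group. That is,
x_t^ = w_1 × x_s1 + w_2 × x_s2 + ... + w_n × x_sn
where ^ denotes a predicted value.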
[0046]
Applying the above prediction formula to all student pixel groups x_sji (1 ≤ i ≤ n, 1 ≤ j ≤ m) that share the same combination (m_t, m_p, c) gives the following relation.
[0047]
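[Equation 1]

$$
\begin{pmatrix} \hat{x}_{t1} \\ \hat{x}_{t2} \\ \vdots \\ \hat{x}_{tm} \end{pmatrix}
=
\begin{pmatrix}
x_{s11} & x_{s12} & \cdots & x_{s1n} \\
x_{s21} & x_{s22} & \cdots & x_{s2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{sm1} & x_{sm2} & \cdots & x_{smn}
\end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix}
$$
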
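[0048]
Because the actual teacher pixel values (x_t1, x_t2, ..., x_tm) are supplied to the prediction coefficient learning unit 107, the coefficients (w_1, w_2, ..., w_n) of the linear combination are obtained by the least squares method so that the error between the actual teacher pixel values and the predicted pixel values (x_t1^, x_t2^, ..., x_tm^) is minimized. These are the prediction coefficients. Prediction coefficients are obtained for every combination (m_t, m_p, c).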
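[0049]
FIG. 12 shows an example configuration of the image signal relation application device 200. The prediction coefficients obtained by the learning device 100 described above are stored in the prediction coefficient memory 210. The memory 210 stores coefficients for each combination (m_t, m_p, c) of time mode value m_t, position mode value m_p, and class value c.
[0050]
An input image at the pre-conversion frame frequency and information on the frequencies before and after conversion are input to the application device 200. The input image signal is supplied to the motion vector detection unit 201, which detects motion vectors between a predetermined number of frames. When the input image has been compressed using motion vectors, as in MPEG, and the motion vectors sent as side information have the precision required by the subsequent processing, the motion vectors obtained during decoding can be reused as they are and the motion vector detection can be omitted. The frequency information before and after conversion is output from the frequency information controller 202. The frequency information is supplied to the time mode value determination unit 203, which determines the time mode value. The time mode value is a value relating to the temporal position of the frame to be learned.
[0051]
The motion vector and the time mode value are supplied to the tap center position determination unit 204, which determines the tap center position. The tap center position is supplied to the prediction pixel group extraction unit 205 and the class pixel group extraction unit 206. The extraction unit 205 extracts a prediction pixel group consisting of a plurality of prediction pixels based on the tap center position. The extracted prediction pixel group is supplied to the prediction calculation unit 207.
[0052]
The extraction unit 206 extracts a class pixel group consisting of a plurality of prediction pixels based on the tap center position. The extracted class pixel group is supplied to the class value determination unit 208, which determines a class value from the class pixel group. The class value is supplied to the prediction coefficient memory 210.
[0053]
The position mode value determination unit 209 determines the position mode value based on the tap center position, the motion vector, and the time mode value. The position mode value is supplied to the prediction coefficient memory 210. The time mode value is also supplied to the prediction coefficient memory 210.
[0054]
The prediction coefficient memory 210 is supplied with the time mode value m_t, the position mode value m_p, and the class value c, and outputs the prediction coefficients (a set of n coefficients) (w_1, w_2, ..., w_n) corresponding to the combination (m_t, m_p, c). These prediction coefficients and the prediction pixel group from the prediction pixel group extraction unit 205 are supplied to the prediction calculation unit 207. The prediction calculation unit 207 generates the image signal at the post-conversion frame frequency by a linear combination of the prediction pixels of the prediction pixel group and the prediction coefficients. Once predicted pixel values have been accumulated for a full frame, the prediction calculation unit 207 outputs the image data of that frame.
[0055]
The prediction calculation of the prediction calculation unit 207 uses the prediction coefficients (w_1, w_2, ..., w_n) and the prediction pixel group x_pi (1 ≤ i ≤ n) to generate the predicted pixel value x^ by the calculation
x^ = w_1 x_p1 + w_2 x_p2 + ... + w_n x_pn
.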
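The coefficient lookup and prediction calculation can be sketched as follows. The dictionary standing in for the prediction coefficient memory 210, the function name `predict_pixel`, and the coefficient and pixel values are all hypothetical and serve only to illustrate the data flow.

```python
def predict_pixel(coefficient_memory, m_t, m_p, c, prediction_pixels):
    """Read the coefficient set for (m_t, m_p, c) and form the linear combination."""
    weights = coefficient_memory[(m_t, m_p, c)]
    if len(weights) != len(prediction_pixels):
        raise ValueError("coefficient count must match the prediction tap count")
    return sum(w * x for w, x in zip(weights, prediction_pixels))

# Hypothetical memory contents: one 4-tap coefficient set for a single class.
memory = {(0, 15, 33): [0.25, 0.25, 0.25, 0.25]}
print(predict_pixel(memory, 0, 15, 33, [100, 104, 96, 100]))  # 100.0
```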
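[0056]
The motion vector detection unit 201, frequency information controller 202, time mode value determination unit 203, tap center position determination unit 204, prediction pixel group extraction unit 205, class pixel group extraction unit 206, class value determination unit 208, and position mode value determination unit 209 have the same functions as the motion vector detection unit 101, frequency information controller 102, time mode value determination unit 103, tap center position determination unit 104, student pixel group extraction unit 105, class pixel group extraction unit 106, class value determination unit 108, and position mode value determination unit 109 of the learning device 100 described above, so their detailed description is omitted.
[0057]
FIG. 13 is a flowchart showing the processing flow when the image signal relation application device 200 is realized in software. In step S41, the input image signal is supplied. In step S42, the time mode value is determined. In step S43, motion vectors are detected. In step S44, the tap center position is determined. In step S45, the position mode value is determined.
[0058]
In step S46, a class pixel group is extracted based on the determined tap center position. In step S47, the class value is determined, and classification is performed using the class value, the position mode value, and the time mode value. In step S48, a prediction pixel group is extracted. In step S49, the prediction coefficients corresponding to the class are read from the stored prediction coefficients. In step S50, a predicted pixel is generated by a linear combination (prediction calculation) of the pixels of the prediction pixel group and the prediction coefficients. In step S51, the generated predicted pixel is output.
[0059]
In step S52, it is determined whether all pixels in the frame have been processed. If not, the process returns to step S44 (tap center position determination). When all pixels in the frame have been processed, step S53 determines whether all frames in the image have been processed. If not, the process returns to step S42 (time mode value determination). Further, step S54 determines whether all input images have been processed. If not, the process returns to step S41 (image input).
[0060]
This invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention. For example, a prediction formula other than a linear combination may be used. Classification may also be performed using features other than level.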
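[0061]
[Effects of the invention]
As described above, this invention learns the relationship between a teacher image signal at the post-conversion frame frequency, actually shot with high definition and natural motion, and the corresponding student image signal at the pre-conversion frame frequency, for each class of properties of the student image signal. The prediction coefficients obtained by learning are then used to convert an input image signal at the pre-conversion frame frequency into an output image signal at the post-conversion frame frequency. This processing yields an output image signal with high spatial resolution and natural motion.
[Brief description of the drawings]
[FIG. 1] A block diagram showing the schematic configuration of an embodiment of this invention.
[FIG. 2] A block diagram of an example of the image signal relation learning device.
[FIG. 3] A flowchart illustrating an example of the learning process when the image signal relation learning device is realized in software.
[FIG. 4] A schematic diagram for explaining the time mode value in the embodiment.
[FIG. 5] A schematic diagram for explaining the tap center position determination method in the embodiment.
[FIG. 6] A flowchart illustrating an example of processing when the tap center position determination method is realized in software.
[FIG. 7] A schematic diagram showing several example arrangements of class pixel groups.
[FIG. 8] A schematic diagram for explaining an example of the class determination method.
[FIG. 9] A schematic diagram used to explain the position mode value.
[FIG. 10] A flowchart illustrating an example of processing when the method of obtaining the position mode value is realized in software.
[FIG. 11] A schematic diagram showing several example arrangements of student pixel groups.
[FIG. 12] A block diagram of an example of the image signal relation application device.
[FIG. 13] A flowchart illustrating an example of processing when the image signal relation application device is realized in software.
[Explanation of reference numerals]
101, 201: motion vector detection unit; 103, 203: time mode value determination unit; 104, 204: tap center position determination unit; 105, 205: student pixel group extraction unit; 106, 206: class pixel group extraction unit; 107: prediction coefficient learning unit; 108, 208: class value determination unit; 109, 209: position mode value determination unit; 110: teacher pixel extraction unit; 207: prediction calculation unit; 210: prediction coefficient memory
[0001]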
This invention relates to a learning device, an application device, a learning method, and an application method applicable to generating an output moving image signal by converting the frame frequency of an input moving image signal.
[0002]
[Prior art]
In order to use an image signal created in one broadcasting system in another broadcasting system, the frame frequency is converted. In the conventional frame frequency conversion, the number of frames is converted based on the relationship between the frame frequencies, and the conversion destination image is generated by moving the pixels based on the detected motion vector. The conventional frame frequency conversion is performed regardless of the nature of the original image (input image).
[0003]
In addition, telecine conversion is performed to convert movie material shot on film into a video signal for television broadcasting. Because film generally runs at 24 frames per second, telecine conversion, depending on the frame frequency of the target video signal, either displays each film frame twice or uses the method called 2-3 pulldown, in which a pair of a two-time repetition and a three-time repetition is performed periodically.
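As a small illustration of the 2-3 pulldown repetition pattern described above, the following sketch repeats successive film frames two and three times in alternation. The function name `pulldown_2_3` is hypothetical, and the sketch only models the repetition count, not the field structure of an actual video signal.

```python
from itertools import cycle

def pulldown_2_3(film_frames):
    """Repeat successive film frames 2, 3, 2, 3, ... times."""
    output = []
    for frame, repeats in zip(film_frames, cycle([2, 3])):
        output.extend([frame] * repeats)
    return output

print(pulldown_2_3(["A", "B", "C", "D"]))
# ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D', 'D']
print(len(pulldown_2_3(list(range(24)))))  # 60 output pictures from 24 film frames
```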
[0004]
[Problems to be solved by the invention]
Because the conventional frame frequency conversion method generates the output image by a fixed method regardless of the properties of the input image, problems sometimes arise in the resolution of the output image. In addition, the conventional telecine conversion method has the problem that jerkiness of motion is very noticeable in the converted image.
[0005]
Accordingly, an object of the present invention is to provide a learning device, an application device, a learning method, and an application method capable of obtaining a moving image signal with higher resolution and more natural motion when converting an image signal.
[0006]
To solve the above problems, the first invention is a learning device that calculates, based on a student moving image signal having a pre-conversion frame frequency and a teacher moving image signal having a post-conversion frame frequency, prediction coefficients for generating an output moving image signal by converting the frame frequency of an input moving image signal, the learning device comprising: motion vector detection means for obtaining an inter-frame motion vector of the pixel of a student image included in the student moving image signal that corresponds to a pixel of interest set in a teacher image included in the teacher moving image signal; tap center position determination means for determining a tap center position on the student image by shifting the motion vector to the position of the pixel of interest and allocating the motion vector to the frames from which it was calculated; class pixel group extraction means for extracting one or more pixels from the student image as a class pixel group based on the tap center position; class value determination means for calculating, from the class pixel group, a class value corresponding to the tap center position; student pixel group extraction means for extracting one or more pixels from the student image as a student pixel group based on the tap center position; and prediction coefficient learning means for generating a predicted teacher pixel based on the student pixel group and the prediction coefficients, and calculating the prediction coefficients for each class value so that the error between the pixel value of the predicted teacher pixel and the pixel value of the teacher pixel is minimized.
[0007]
The second invention is an application device comprising: a prediction coefficient memory that stores prediction coefficients; second motion vector detection means for obtaining, as a second motion vector, an inter-frame motion vector of an input image included in an input moving image signal; second tap center position determination means for determining, as a second tap center position, a tap center position on the input image based on the second motion vector; second class pixel group extraction means for extracting one or more pixels from the input moving image signal as a second class pixel group based on the second tap center position; second class value determination means for calculating, as a second class value, a class value corresponding to the second tap center position from the second class pixel group; prediction pixel group extraction means for extracting one or more pixels from the input image as a prediction pixel group based on the second tap center position; and prediction calculation means for generating an output moving image signal based on the prediction pixel group and the prediction coefficients corresponding to the second class value.
[0008]
The third invention is a learning method that calculates, based on a student moving image signal having a pre-conversion frame frequency and a teacher moving image signal having a post-conversion frame frequency, prediction coefficients for generating an output moving image signal by converting the frame frequency of an input moving image signal, the learning method comprising: a motion vector detection step of obtaining an inter-frame motion vector of the pixel of a student image included in the student moving image signal that corresponds to a pixel of interest set in a teacher image included in the teacher moving image signal; a tap center position determination step of determining a tap center position on the student image by shifting the motion vector to the position of the pixel of interest and allocating the motion vector to the frames from which it was calculated; a class pixel group extraction step of extracting one or more pixels from the student image as a class pixel group based on the tap center position; a class value determination step of calculating, from the class pixel group, a class value corresponding to the tap center position; a student pixel group extraction step of extracting one or more pixels from the student image as a student pixel group based on the tap center position; and a prediction coefficient learning step of generating a predicted teacher pixel based on the student pixel group and the prediction coefficients, and calculating the prediction coefficients for each class value so that the error between the pixel value of the predicted teacher pixel and the pixel value of the teacher pixel is minimized.
[0009]
The fourth invention is an application method comprising: a prediction coefficient memory that stores prediction coefficients; a second motion vector detection step of obtaining, as a second motion vector, an inter-frame motion vector of an input image included in an input moving image signal; a second tap center position determination step of determining, as a second tap center position, a tap center position on the input image based on the second motion vector; a second class pixel group extraction step of extracting one or more pixels from the input moving image signal as a second class pixel group based on the second tap center position; a second class value determination step of calculating, as a second class value, a class value corresponding to the second tap center position from the second class pixel group; a prediction pixel group extraction step of extracting one or more pixels from the input image as a prediction pixel group based on the second tap center position; and a prediction calculation step of generating an output moving image signal based on the prediction pixel group and the prediction coefficients corresponding to the second class value.
[0010]
According to the present invention, motion information for the pixel of interest is obtained with a precision equal to or finer than the pixel interval of the second image signal, and the pixel of interest is classified based on that motion information, so that an image signal with higher resolution and more natural motion can be obtained.
[0011]
[Embodiments of the invention]
Hereinafter, an embodiment in which the present invention is applied to a frame frequency conversion process will be described with reference to the drawings. FIG. 1 shows the overall configuration of one embodiment. Reference numeral 100 is an image signal relation learning apparatus, and reference numeral 200 is an image signal relation application apparatus. A teacher image signal and a student image signal are supplied to the learning device 100. The teacher image signal is an image signal having a post-conversion frame frequency, and the student image signal is an image signal having the same image as the teacher image signal and having a pre-conversion frame frequency. As the frame frequency conversion, various values can be used as the values of the pre-conversion frame frequency and the post-conversion frame frequency depending on the difference in the television system, the case where the temporal resolution is increased, and the like. The learning device 100 learns the relationship between the teacher image signal and the student image signal, and the image signal relationship coefficient is output from the learning device 100.
[0012]
The coefficient obtained by the learning device 100 is given to the application device 200. The application apparatus 200 receives the coefficient learned by the learning apparatus 100 and the image signal having the pre-conversion frame frequency, and outputs the image signal having the post-conversion frame frequency.
[0013]
FIG. 2 shows an example of the learning device 100. A student image having a pre-conversion frame frequency, a teacher image having a post-conversion frame frequency, and frequency information before and after conversion are input to the learning apparatus 100. The student image signal is supplied to the motion vector detection unit 101, and a motion vector between a predetermined number of frames is detected. The frequency information before and after conversion is output from the frequency information controller 102. The frequency information is supplied to the time mode value determining unit 103, and the time mode value is determined. The time mode value is a value related to the time position of the frame to be learned.
[0014]
The motion vector and the time mode value are supplied to the tap center position determination unit 104, and the tap center position is determined. The tap center position is supplied to the student pixel group extraction unit 105 and the class pixel group extraction unit 106. The extraction unit 105 extracts a student pixel group including a plurality of student pixels based on the tap center position. The extracted student pixel group is supplied to the prediction coefficient learning unit 107.
[0015]
The extraction unit 106 extracts a class pixel group including a plurality of student pixels based on the tap center position. The extracted class pixel group is supplied to the class value determination unit 108. The class value determining unit 108 determines a class value from the class pixel group. The class value is supplied to the prediction coefficient learning unit 107.
[0016]
The position mode value determining unit 109 determines the position mode value based on the tap center position, the motion vector, and the time mode value, as will be described later. The position mode value is supplied to the prediction coefficient learning unit 107. Further, the teacher pixel cutout unit 110 cuts out the teacher pixel based on the time mode value. The cut out teacher pixels are supplied to the prediction coefficient learning unit 107.
[0017]
The prediction coefficient learning unit 107 is supplied with the time mode value, the position mode value, the class value, the student pixel group, and the teacher pixel. Using this information, the prediction coefficient learning unit 107 learns prediction coefficients for predicting the teacher pixel from the student pixel group, as described later.
[0018]
In FIG. 2, the learning device 100 is shown in a block diagram, but the learning device 100 can also be realized by software. FIG. 3 is a flowchart showing processing of the learning method. First, in step S1, a student image and a teacher image are input. A time mode value is determined in step S2 based on the frequency information.
[0019]
In step S3, motion vectors are detected from the student image. In step S4, the tap center position is determined based on the time mode value and the motion vector. In step S5, the position mode value is determined from the time mode value, the motion vector, and the tap center position.
[0020]
In step S6, a class pixel group is extracted from the student image based on the information on the tap center position. In step S7, class classification processing is performed using the class pixel group, and the class value is determined. In step S8, a student pixel group is extracted from the student image based on the information on the tap center position. In step S9, teacher pixels are extracted from the teacher image.
[0021]
Steps S10 through S15 are the prediction coefficient learning process based on the least squares method. That is, when a predicted value is estimated as a linear combination of a plurality of prediction coefficients and student pixels, the prediction coefficients are determined so as to minimize the sum of squared errors between the predicted values and the true values in the teacher image. In practice, the expression for the sum of squared errors is partially differentiated, and the prediction coefficients are determined so that the partial derivatives become zero. This yields normal equations, which are solved by a general matrix solution method such as the sweep-out method to calculate the prediction coefficients.
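The accumulation and solution of the normal equations can be sketched as below. This is an illustration under assumptions, not the patented implementation: the function names are hypothetical, the sweep-out method is implemented here as Gauss-Jordan elimination with partial pivoting, and the sample data are made up so that the exact answer (0.5, 1.5) is known. In the learning device one such accumulator would be kept per class.

```python
def new_accumulator(n):
    """Return (A, b) for the normal equations A w = b of one class."""
    return [[0.0] * n for _ in range(n)], [0.0] * n

def accumulate(acc, student_pixels, teacher_pixel):
    """Add one (student pixel group, teacher pixel) pair into A and b."""
    a, b = acc
    for i, xi in enumerate(student_pixels):
        b[i] += xi * teacher_pixel
        for j, xj in enumerate(student_pixels):
            a[i][j] += xi * xj

def solve_sweep_out(acc):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    a, b = acc
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations: too few samples")
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]

# Made-up samples generated from the weights (0.5, 1.5).
acc = new_accumulator(2)
for pixels, teacher in [([2.0, 0.0], 1.0), ([0.0, 2.0], 3.0), ([2.0, 2.0], 4.0)]:
    accumulate(acc, pixels, teacher)
print(solve_sweep_out(acc))  # approximately [0.5, 1.5]
```

Keeping only the sums A and b means that individual samples need not be stored, which is consistent with the flow in which data are added pixel by pixel and the equations are solved once at the end.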
[0022]
Step S10 is a process of adding data to the normal equation for each class. In step S11, it is determined whether all pixels in the frame have been processed. If the process has not ended, the process returns to step S4 (determining the tap center position). If it is determined that all the pixels in the frame have been processed, it is determined in step S12 whether or not the processing of all the frames in the image has been completed. If the process has not ended, the process returns to step S2 (time mode value determination). In step S13, it is determined whether all input images have been processed. If the processing has not ended, the process returns to step S1 (input image).
[0023]
If it is determined in step S13 that all input images have been processed, the normal equations are solved by the sweep-out method in step S14. The resulting prediction coefficients are output in step S15.
[0024]
The embodiment is now described in more detail. First, motion vector detection in the motion vector detection unit 101 is described. Block matching, the gradient method, or the like can be used to detect motion vectors. The motion vector precision need only be pixel-unit precision or finer. Motion vectors may be detected for every pixel or, as long as the finally generated image is not degraded, every few pixels in the horizontal and/or vertical direction.
[0025]
Motion vectors are detected over two or more frames of the student image. Using more frames can reduce false detections in regions of low spatial activity and regions containing repetitive patterns. On the other hand, it may increase false detections for motion that deviates from the assumed linearity and in covered/uncovered background. In the following description, as an example, motion vectors are detected for every pixel with pixel-unit precision between two frames of the student image.
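A minimal block matching sketch with pixel-unit precision between two frames is shown below. The text names block matching only as one usable method; the sum of absolute differences (SAD) as the matching cost, the block and search sizes, the function names, and the test frames are all assumptions made for this illustration.

```python
def block_matching(prev, curr, y, x, block=1, search=2):
    """Return (dy, dx) minimising the SAD between the block around (y, x) in prev
    and the displaced block in curr; candidates leaving the frame are skipped."""
    h, w = len(prev), len(prev[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            sad, valid = 0, True
            for by in range(-block, block + 1):
                for bx in range(-block, block + 1):
                    py, px = y + by, x + bx
                    cy, cx = py + dy, px + dx
                    if not (0 <= py < h and 0 <= px < w and 0 <= cy < h and 0 <= cx < w):
                        valid = False
                        break
                    sad += abs(prev[py][px] - curr[cy][cx])
                if not valid:
                    break
            if valid and (best is None or sad < best[0]):
                best = (sad, dy, dx)
    return None if best is None else (best[1], best[2])

def make_frame(top, left, size=8):
    """Hypothetical test frame: a 3x3 patch of distinct values on a zero background."""
    frame = [[0] * size for _ in range(size)]
    value = 10
    for r in range(3):
        for c in range(3):
            frame[top + r][left + c] = value
            value += 10
    return frame

prev_frame = make_frame(2, 2)
curr_frame = make_frame(4, 2)     # the patch has moved two pixels downward
print(block_matching(prev_frame, curr_frame, 3, 3))  # (2, 0)
```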
[0026]
Next, the time mode value determination unit 103 will be described with reference to FIG. The time mode value determination unit 103 determines a time mode value related to the time position of the frame to be learned from the frequencies before and after conversion. The number of modes and the determination method differ depending on the ratio of the frame frequency before and after the conversion and whether or not learning is performed in the frame at the same time position in the student image and the teacher image.
[0027]
FIG. 4A shows the case of learning coefficients for a conversion that doubles the frame frequency. In this case, two frames of the teacher image fall between two frames of the student image. Mode 0 and mode 1 are defined according to which of the two teacher frame positions the coefficients are learned for. For example, when learning coefficients for predicting pixel values on the temporally earlier teacher frame between the two student frames, the time mode value is set to 0, and when learning coefficients for predicting pixel values on the other teacher frame, the time mode value is set to 1.
[0028]
FIG. 4B shows the case of learning coefficients for a conversion that multiplies the frame frequency by 2.5. In this case, coefficients are not learned for frames at which the student image and the teacher image share the same temporal position, and two frames of the teacher image fall between two frames of the student image. Coefficients are therefore learned for predicting pixel values on teacher frames at four kinds of temporal positions. The time mode value takes a value from 0 to 3 depending on the frame position for which coefficients are learned.
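The number of time modes can be illustrated by enumerating the distinct temporal phases of teacher frames relative to the preceding student frame. This sketch rests on assumptions not stated in the text: teacher frames are evenly spaced, the `offset` parameter (the phase of the first teacher frame) is hypothetical, the offset of 1/4 used for the doubling case is only one arrangement that places two teacher frames between two student frames, and numbering the modes in order of increasing phase is likewise assumed.

```python
from fractions import Fraction

def time_mode_phases(ratio, offset=Fraction(0), skip_coincident=True):
    """List the distinct teacher-frame phases in [0, 1), in student-frame units.

    ratio: post-conversion / pre-conversion frame frequency (e.g. Fraction(5, 2)).
    The index of a phase in the returned list is used as its time mode value.
    """
    ratio = Fraction(ratio)
    step = 1 / ratio                       # teacher frame period
    phases = set()
    for k in range(ratio.numerator):       # phases repeat after 'numerator' frames
        phase = (Fraction(offset) + k * step) % 1
        if skip_coincident and phase == 0:
            continue                       # same temporal position as a student frame
        phases.add(phase)
    return sorted(phases)

print(time_mode_phases(Fraction(5, 2)))            # four phases: 1/5, 2/5, 3/5, 4/5
print(time_mode_phases(2, offset=Fraction(1, 4)))  # two phases: 1/4, 3/4
```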
[0029]
Next, the tap center position determination unit 104, which receives the motion vector and the time mode value and determines the tap center position, is described with reference to FIG. 5. FIG. 5A shows the teacher image and a motion vector determined from two frames of the student image in a conversion that multiplies the frame frequency by 5. As an example, the motion vector has only a vertical component, and a downward motion vector of two pixels has been detected. The tap center position is determined by the following procedure.
[0030]
First, the motion vector at the pixel position on the student frame corresponding to the target pixel position set on the teacher frame is shifted to the target pixel position. Next, the motion vector is allocated, according to the time mode value, to the two frames of the student image that were used to detect it. The position displaced from the target pixel position by the allocated motion vector is then calculated. This position, rounded to an integer pixel position by rounding to nearest, rounding down, or rounding up, is used as the tap center position. A tap center position is set on each of the two student image frames located before and after the teacher image frame. When pixels (the student pixel group and the class pixel group) are extracted around the tap center position, they are extracted from each of these two frames.
[0031]
The example of FIGS. 5B and 5C shows the case of different time mode values. In the example of FIG. 5B, even if a position shifted by the allocated motion vector is calculated and the position is rounded to an integer pixel position, the pixel at the tap center position on the student image does not move by one pixel. On the other hand, in the example of FIG. 5C, as a result of rounding, the pixel at the tap center position on the student image is shifted by one pixel.
[0032]
FIG. 6 is a flowchart showing a flow when the tap center position determination process is performed by software processing. In step S21, the target pixel position on the teacher image is determined. In the next step S22, the motion vector at the target pixel position on the student image is read out. In step S23, a motion vector corresponding to the inter-frame time distance ratio is calculated. This process is to allocate motion vectors to the two frames of the student image used to detect the motion vectors according to the time mode value. For example, when the frame of the teacher image is located at the center position between two frames of the student image, the motion vector is halved.
[0033]
In step S24, the corresponding pixel position (decimal precision) on the student image corresponding to the motion vector calculated in step S23 is calculated. In step S25, the corresponding pixel position is rounded to the integer pixel position by processing such as rounding down, rounding up, and rounding off. This completes the tap center position determination process.
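Steps S21 to S25 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the motion vector is taken to point from the earlier student frame to the later one, `phase` is the temporal position of the teacher frame between the two student frames (0 to 1), and the sign convention used to allocate the vector to each frame is hypothetical. The names `round_position` and `tap_centers` are illustrative.

```python
import math


def round_position(value, mode):
    """Round a fractional coordinate to an integer pixel position (step S25)."""
    if mode == "nearest":
        return math.floor(value + 0.5)
    if mode == "down":
        return math.floor(value)
    if mode == "up":
        return math.ceil(value)
    raise ValueError("mode must be 'nearest', 'down', or 'up'")


def tap_centers(target, motion_vector, phase, mode="nearest"):
    """Return the tap center on the earlier and the later student frame.

    target        : (x, y) target pixel position on the teacher frame (S21)
    motion_vector : (vx, vy) read at the target position (S22)
    phase         : temporal position of the teacher frame, 0 <= phase <= 1
    """
    tx, ty = target
    vx, vy = motion_vector
    # S23/S24: allocate the vector by the inter-frame time distance ratio and
    # compute the fractional positions (assumed sign convention).
    prev_pos = (tx - phase * vx, ty - phase * vy)
    next_pos = (tx + (1 - phase) * vx, ty + (1 - phase) * vy)
    # S25: round to integer pixel positions.
    prev_center = tuple(round_position(v, mode) for v in prev_pos)
    next_center = tuple(round_position(v, mode) for v in next_pos)
    return prev_center, next_center


if __name__ == "__main__":
    # Downward vector of two pixels, teacher frame midway: the vector is halved.
    print(tap_centers((10, 10), (0, 2), 0.5))
```

With the teacher frame at the center position the allocated vector is half the detected vector, as in the example given for step S23.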
[0034]
FIG. 7 shows some examples of class pixel groups extracted by the class pixel group extraction unit 106 based on the tap center position described above. The pixel at the tap center position determined as described above on the student image is indicated by a black circle in FIG. In addition, in the surrounding pixels, pixels used as class pixels are shaded. Class pixel groups having the same arrangement relationship are extracted from each of two frames of the student images that are temporally continuous.
[0035]
When extracting a class pixel group, the relative positions from the tap center position determined by the tap center position determination unit 104 are read from the memory, each class pixel position is obtained from the tap center position, and the corresponding pixel values are read from the frame buffer. The number and positions of the class pixels are determined as appropriate in consideration of learning efficiency, memory limitations, processing speed, and the like.
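The extraction of a pixel group from stored relative positions might look like the following Python sketch. The cross-shaped five-pixel arrangement `CROSS_OFFSETS`, the function name `extract_taps`, and the clamping of positions that fall outside the frame are all assumptions made for illustration; the text does not specify a border policy or a particular arrangement.

```python
def extract_taps(frame, center, offsets):
    """Return the pixel values at center + offset for each relative offset.

    frame   : frame buffer as a list of rows, indexed as frame[y][x]
    center  : (x, y) integer tap center position
    offsets : list of (dx, dy) relative positions, as read from memory
    """
    height, width = len(frame), len(frame[0])
    cx, cy = center
    taps = []
    for dx, dy in offsets:
        # Assumption: positions outside the frame are clamped to the border.
        x = min(max(cx + dx, 0), width - 1)
        y = min(max(cy + dy, 0), height - 1)
        taps.append(frame[y][x])
    return taps


# Hypothetical arrangement: the tap center and its four direct neighbours.
CROSS_OFFSETS = [(0, -1), (-1, 0), (0, 0), (1, 0), (0, 1)]

if __name__ == "__main__":
    frame = [[row * 10 + col for col in range(4)] for row in range(4)]
    print(extract_taps(frame, (1, 1), CROSS_OFFSETS))
```

The same routine would be applied to each of the two temporally consecutive student frames, and with a different offset list it also serves for the student pixel group.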
[0036]
Next, class value determination processing in the class value determination unit 108 will be described with reference to FIG. The pixel value of the class pixel group extracted as described above is encoded by 1-bit ADRC (Adaptive Dynamic Range Coding), and a value obtained by considering the encoding result (bit string) as an integer is defined as a class value.
[0037]
When a pixel value is expressed by 8 bits, it can take a value from 0 to 255. In FIG. 8, five class pixels are taken from each of two temporally consecutive frames of the student image, so that the class pixel group consists of 10 pixels in total. The difference between the maximum value and the minimum value of these 10 class pixel values is the dynamic range DR. Since 1-bit ADRC is used, the value obtained by halving DR is taken as the intermediate value, and each class pixel value is compared with this intermediate value. If the class pixel value is less than the intermediate value, it is encoded as "0", and if it is equal to or greater than the intermediate value, it is encoded as "1". In the example of FIG. 8, the bit string resulting from 1-bit ADRC is (0000100001). The value obtained by regarding this bit string as an integer (= 33) is set as the class value.
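The class value calculation can be written as a short Python sketch. It assumes that the intermediate value is measured from the minimum pixel value (minimum plus half of DR), which is the usual reading of 1-bit ADRC but is not spelled out above. The function name `adrc_class_value` and the sample pixel values are illustrative.

```python
def adrc_class_value(class_pixels):
    """Encode a class pixel group with 1-bit ADRC and return the class value."""
    lowest, highest = min(class_pixels), max(class_pixels)
    dynamic_range = highest - lowest
    # Assumption: the intermediate value is the minimum plus half of DR.
    intermediate = lowest + dynamic_range / 2
    value = 0
    for pixel in class_pixels:
        # "1" if the pixel is at or above the intermediate value, else "0".
        bit = 1 if pixel >= intermediate else 0
        # Append the bit, so the first pixel ends up as the most significant bit.
        value = (value << 1) | bit
    return value


if __name__ == "__main__":
    # Ten class pixels (five per frame); two bright pixels give 0000100001.
    pixels = [10, 10, 10, 10, 200, 10, 10, 10, 10, 200]
    print(adrc_class_value(pixels))
```

With ten class pixels the class value ranges over 2^10 = 1024 values. In this sketch a completely flat pixel group (DR = 0) encodes as all ones, because every pixel equals the intermediate value.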
[0038]
To reduce the number of classes, a bit string and the bit string obtained by inverting each of its bits may be treated as the same class value. In this case, the number of classes is halved. When the tap arrangement is symmetric left-right or up-down, the pixel values may likewise be rearranged before performing the same calculation, again halving the number of classes.
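One way to realize the halving by bit inversion is to map each class value and its bitwise complement to a single representative. The Python sketch below does this by taking the smaller of the two; choosing the smaller value as the representative, and the name `canonical_class`, are assumptions for illustration.

```python
def canonical_class(class_value, num_bits):
    """Map a class value and its bit-inverted counterpart to one representative."""
    mask = (1 << num_bits) - 1
    inverted = class_value ^ mask  # invert every bit of the encoding result
    # Assumption: the smaller of the two values represents the pair.
    return min(class_value, inverted)


if __name__ == "__main__":
    values = {canonical_class(c, 10) for c in range(1 << 10)}
    print(len(values))  # number of classes left after merging
```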
[0039]
Next, position mode value determination processing in the position mode value determination unit 109 will be described with reference to FIG. The position mode value determination unit 109 is given a motion vector, a time mode value, and a tap center position. As described above, when determining the tap center position, the motion vector is allocated to the two frames of the student image from the target pixel position on the teacher frame. The motion vector is obtained with pixel-unit accuracy. Accordingly, when the frame frequency is doubled, there are two kinds of deviation, 0 and 0.5, between the position displaced from the target pixel position by the motion vector allocation and the integer grid point position. As in the example shown in FIG. 5, when the frame frequency is multiplied by 5, there are five patterns: 0, 0.2, 0.4, 0.6, and 0.8. When the frame frequency is multiplied by 2.5, there are likewise five patterns, as in the 5-times case.
[0040]
Considering this fractional shift separately in the horizontal and vertical directions, 2 × 2 = 4 position mode values are defined when the frame frequency is doubled, and 5 × 5 = 25 position mode values are defined when the frame frequency is multiplied by 5. FIG. 9 shows pixel positions in one of the two student frames for the example of multiplying the frame frequency by 5. FIG. 9A shows the case where the position mode value is determined with truncation. When the fractional-position tap center (indicated by a white circle) lies within the 5 × 5 region, truncation moves it to the integer-position tap center indicated by a black circle. In the case of rounding, as shown in FIG. 9B, when the fractional-position tap center lies within the 5 × 5 region, the black-circle pixel at the center of the region becomes the integer-position tap center after rounding. In either case, 5 × 5 = 25 position mode values are defined. The position mode value is supplied to the prediction coefficient learning unit 107.
[0041]
With reference to FIG. 10, the flow of processing when determining the position mode value by software processing will be described. In step S31, the target pixel position on the teacher image is determined. In step S32, the motion vector at the target pixel position on the student image is read out. In step S33, a motion vector is calculated according to the inter-frame time distance ratio. In step S34, the corresponding pixel position (decimal precision) on the student image corresponding to the calculated motion vector is calculated. In step S35, the difference between the corresponding pixel position and the tap center position is calculated. In step S36, the difference is converted into a position mode value.
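Steps S31 to S36 can be illustrated with the following Python sketch for one of the two student frames, using truncation. It relies on several assumptions that are not stated in the text: the allocation sign convention (the earlier student frame is displaced by minus `phase` times the motion vector), the use of exact fractions, and the packing of the horizontal and vertical indices into a single value as `iy * denom + ix`. The name `position_mode` is illustrative.

```python
import math
from fractions import Fraction


def position_mode(target, motion_vector, phase, denom):
    """Return a position mode value for the earlier student frame (truncation).

    target        : (x, y) target pixel position on the teacher image (S31)
    motion_vector : (vx, vy) with pixel-unit accuracy (S32)
    phase         : temporal position of the teacher frame as a Fraction
    denom         : number of fractional patterns per axis (2 for x2, 5 for x5)
    """
    indices = []
    for t, v in zip(target, motion_vector):
        pos = Fraction(t) - Fraction(phase) * v   # S33/S34: fractional position
        center = math.floor(pos)                  # tap center by truncation
        diff = pos - center                       # S35: difference in [0, 1)
        indices.append(int(diff * denom))         # S36: index along this axis
    ix, iy = indices
    # Assumption: the two per-axis indices are packed into one mode value.
    return iy * denom + ix


if __name__ == "__main__":
    # Frame frequency x5, downward vector of two pixels, teacher frame at 1/5.
    print(position_mode((10, 10), (0, 2), Fraction(1, 5), 5))
```

Enumerating integer motion vectors with this sketch produces 25 distinct values for the 5-times case and 4 for the doubling case, matching the counts given above.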
[0042]
Based on the tap center position determined by the tap center position determination unit 104 described above, the student pixel group extraction unit 105 extracts the student pixel group. FIG. 11 shows some examples of student pixel groups to be extracted. The class pixel group extraction (see FIG. 7) and the student pixel group extraction described above are the same except that the relative position from the tap center position is different. The number and position of student pixels to be extracted as a student pixel group are determined as appropriate in consideration of learning efficiency, memory limitations, processing speed, and the like.
[0043]
The teacher pixel extraction unit 110 extracts teacher pixels from the teacher image based on the time mode value. That is, one of the input teacher images is selected according to the time mode value for which coefficients are currently being learned, and the pixel value corresponding to the target pixel position is read from the frame buffer.
[0044]
The prediction coefficient learning unit 107 learns prediction coefficients for predicting a teacher pixel from a student pixel group, based on the time mode value, the position mode value, the class value, the student pixel group, and the teacher pixel. The processing of the learning unit 107 is described below. Pairs of a student pixel group $x_{si}$ (1 ≦ i ≦ n) and a teacher pixel $x_t$ that share a specific combination of time mode value $m_t$, position mode value $m_p$, and class value $c$ are considered to have nearly the same properties. Such pairs are therefore collected, and the prediction coefficients $(w_1, \ldots, w_n)$ are learned from them.
[0045]
As an example, a teacher pixel is predicted by a linear combination of the pixel values of a student pixel group. That is, it can be expressed as
$$\hat{x}_t = w_1 x_{s1} + w_2 x_{s2} + \cdots + w_n x_{sn}$$
where the hat (^) denotes a predicted value.
[0046]
When the above prediction formula is applied to all student pixel groups $x_{sji}$ (1 ≦ i ≦ n, 1 ≦ j ≦ m) having the same $(m_t, m_p, c)$, the following relationship is established.
[0047]
[Expression 1]
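From the definitions in the surrounding paragraphs, the relationship for the m collected pairs can be written in matrix form as follows. This is a reconstruction from those definitions and is offered as a sketch of the expression rather than a reproduction of it; $\hat{x}_{tj}$ denotes the predicted value of the j-th teacher pixel.

```latex
$$
\begin{pmatrix} \hat{x}_{t1} \\ \hat{x}_{t2} \\ \vdots \\ \hat{x}_{tm} \end{pmatrix}
=
\begin{pmatrix}
x_{s11} & x_{s12} & \cdots & x_{s1n} \\
x_{s21} & x_{s22} & \cdots & x_{s2n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
x_{sm1} & x_{sm2} & \cdots & x_{smn}
\end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix}
$$
```

Each row is the linear prediction formula applied to one student pixel group.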

[0048]
The prediction coefficient learning unit 107 is supplied with the actual teacher pixel values $(x_{t1}, x_{t2}, \ldots, x_{tm})$ and determines, by the least squares method, the linear combination coefficients $(w_1, w_2, \ldots, w_n)$ that minimize the error between the actual teacher pixel values and the predicted pixel values $(\hat{x}_{t1}, \hat{x}_{t2}, \ldots, \hat{x}_{tm})$. These are the prediction coefficients. The prediction coefficients are obtained for every combination $(m_t, m_p, c)$.
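A minimal Python sketch of per-class least squares learning is given below. The text only states that the least squares method is used; accumulating the normal equations and solving them by Gaussian elimination is one common way to do that and is an assumption here, as are the names `solve_linear` and `learn_coefficients` and the sample format.

```python
def solve_linear(a, b):
    """Solve a * w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough independent samples")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def learn_coefficients(samples):
    """Learn one coefficient set per (time mode, position mode, class value).

    samples : iterable of (key, student_pixels, teacher_pixel), where key is
              the tuple (m_t, m_p, c) and student_pixels has n values.
    """
    sums = {}
    for key, xs, xt in samples:
        n = len(xs)
        if key not in sums:
            sums[key] = ([[0.0] * n for _ in range(n)], [0.0] * n)
        xtx, xty = sums[key]
        # Accumulate the normal equations (X^T X) w = X^T x_t for this key.
        for i in range(n):
            xty[i] += xs[i] * xt
            for j in range(n):
                xtx[i][j] += xs[i] * xs[j]
    return {key: solve_linear(xtx, xty) for key, (xtx, xty) in sums.items()}


if __name__ == "__main__":
    key = (0, 15, 33)
    data = [(key, [1.0, 0.0], 0.25), (key, [0.0, 1.0], 0.75), (key, [2.0, 2.0], 2.0)]
    print(learn_coefficients(data))
```

Because the sums are accumulated per key, the student and teacher pixels do not need to be held in memory; only an n × n matrix and an n-vector are kept for each combination.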
[0049]
FIG. 12 shows an example of the configuration of the image signal relationship application apparatus 200. The prediction coefficients obtained by the learning device 100 described above are stored in the prediction coefficient memory 210. The memory 210 stores the coefficients for each combination $(m_t, m_p, c)$ of time mode value $m_t$, position mode value $m_p$, and class value $c$.
[0050]
An input image having the pre-conversion frame frequency and frequency information before and after conversion are input to the application apparatus 200. The input image signal is supplied to the motion vector detection unit 201, and motion vectors between a predetermined number of frames are detected. If the input image is compressed using motion vectors, as in MPEG, and the motion vectors sent as side information have the accuracy required by subsequent processing, the motion vectors obtained at decoding time can be used as they are and the motion vector detection process can be omitted. The frequency information before and after conversion is output from the frequency information controller 202. The frequency information is supplied to the time mode value determination unit 203, and the time mode value is determined. The time mode value is a value related to the time position of the frame to be learned.
[0051]
The motion vector and the time mode value are supplied to the tap center position determination unit 204, and the tap center position is determined. The tap center position is supplied to the prediction pixel group extraction unit 205 and the class pixel group extraction unit 206. The extraction unit 205 extracts a prediction pixel group including a plurality of prediction pixels based on the tap center position. The extracted prediction pixel group is supplied to the prediction calculation unit 207.
[0052]
The extraction unit 206 extracts a class pixel group including a plurality of class pixels based on the tap center position. The extracted class pixel group is supplied to the class value determination unit 208. The class value determination unit 208 determines a class value from the class pixel group. The class value is supplied to the prediction coefficient memory 210.
[0053]
The position mode value determination unit 209 determines the position mode value based on the tap center position, the motion vector, and the time mode value. The position mode value is supplied to the prediction coefficient memory 210. Further, the time mode value is supplied to the prediction coefficient memory 210.
[0054]
The prediction coefficient memory 210 is supplied with the time mode value $m_t$, the position mode value $m_p$, and the class value $c$, and outputs the prediction coefficients (a set of n coefficients) $(w_1, w_2, \ldots, w_n)$ corresponding to the combination $(m_t, m_p, c)$. The prediction coefficients and the prediction pixel group from the prediction pixel group extraction unit 205 are supplied to the prediction calculation unit 207. The prediction calculation unit 207 generates an image signal having the post-conversion frame frequency by a linear combination of the prediction pixels of the prediction pixel group and the prediction coefficients. When the predicted pixel values have been accumulated for a whole frame, the prediction calculation unit 207 outputs the image data of that frame.
[0055]
The prediction calculation unit 207 performs the following calculation on the prediction coefficients $(w_1, w_2, \ldots, w_n)$ and the prediction pixel group $x_{pi}$ (1 ≦ i ≦ n):
$$\hat{x} = w_1 x_{p1} + w_2 x_{p2} + \cdots + w_n x_{pn}$$
The predicted pixel value $\hat{x}$ is generated by this calculation.
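The lookup and prediction calculation can be sketched in Python as follows, with the prediction coefficient memory modeled as a dictionary keyed by the tuple (time mode value, position mode value, class value). The dictionary representation, the function name `predict_pixel`, and the sample numbers are illustrative assumptions.

```python
def predict_pixel(coefficient_memory, time_mode, position_mode, class_value,
                  prediction_pixels):
    """Read the coefficients for (m_t, m_p, c) and form the linear combination."""
    weights = coefficient_memory[(time_mode, position_mode, class_value)]
    if len(weights) != len(prediction_pixels):
        raise ValueError("coefficient count must match prediction pixel count")
    # x_hat = w_1 * x_p1 + w_2 * x_p2 + ... + w_n * x_pn
    return sum(w * x for w, x in zip(weights, prediction_pixels))


if __name__ == "__main__":
    memory = {(0, 15, 33): [0.25, 0.75]}  # hypothetical learned coefficients
    print(predict_pixel(memory, 0, 15, 33, [100, 200]))
```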
[0056]
The motion vector detection unit 201, frequency information controller 202, time mode value determination unit 203, tap center position determination unit 204, prediction pixel group extraction unit 205, class pixel group extraction unit 206, class value determination unit 208, and position mode value determination unit 209 have the same functions as the motion vector detection unit 101, frequency information controller 102, time mode value determination unit 103, tap center position determination unit 104, student pixel group extraction unit 105, class pixel group extraction unit 106, class value determination unit 108, and position mode value determination unit 109, respectively, in the learning device 100 described above, and detailed description thereof is omitted.
[0057]
FIG. 13 is a flowchart showing a processing flow when the image signal relationship applying apparatus 200 is realized by software processing. In step S41, an input image signal is supplied. In step S42, a time mode value is determined. In step S43, a motion vector is detected. In step S44, the tap center position is determined. In step S45, the position mode value is determined.
[0058]
In step S46, a class pixel group is extracted based on the determined tap center position. In step S47, a class value is determined, and classification is performed based on the class value, the position mode value, and the time mode value. In step S48, a prediction pixel group is extracted. In step S49, the prediction coefficients corresponding to the class are read out from the stored prediction coefficients. In step S50, a predicted pixel is generated by a linear combination (prediction calculation) of the plurality of pixels of the prediction pixel group and the plurality of prediction coefficients. In step S51, the generated predicted pixel is output.
[0059]
In step S52, it is determined whether all the pixels in the frame have been processed. If the process has not ended, the process returns to step S44 (determining the tap center position). If it is determined that all the pixels in the frame have been processed, it is determined in step S53 whether or not the processing of all the frames in the image has been completed. If the process has not ended, the process returns to step S42 (time mode value determination). In step S53, it is determined whether all input images have been processed. If the process has not ended, the process returns to step S41 (input image).
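The per-pixel loop of this flowchart for a single output frame can be outlined in Python as below. The individual processing stages are passed in as callables through a hypothetical `steps` dictionary so that the sketch stays independent of how each stage is implemented; the key names, the function name `convert_frame`, and the dictionary-based coefficient memory are assumptions made for illustration.

```python
def convert_frame(width, height, time_mode, steps, coefficient_memory):
    """Generate one output frame by looping over its pixels (S44 to S52).

    steps : dict of hypothetical callables standing in for the stages:
            "motion_vector"(x, y), "tap_center"(target, mv, time_mode),
            "position_mode"(target, mv, time_mode), "class_pixels"(center),
            "class_value"(pixels), "prediction_pixels"(center)
    """
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            target = (x, y)
            mv = steps["motion_vector"](x, y)  # vector detected in S43
            center = steps["tap_center"](target, mv, time_mode)        # S44
            pos_mode = steps["position_mode"](target, mv, time_mode)   # S45
            class_pixels = steps["class_pixels"](center)               # S46
            class_value = steps["class_value"](class_pixels)           # S47
            pred_pixels = steps["prediction_pixels"](center)           # S48
            weights = coefficient_memory[(time_mode, pos_mode, class_value)]  # S49
            output[y][x] = sum(w * p for w, p in zip(weights, pred_pixels))   # S50, S51
    return output  # the loop ends when all pixels are processed (S52)


if __name__ == "__main__":
    stub = {
        "motion_vector": lambda x, y: (0, 0),
        "tap_center": lambda target, mv, tm: target,
        "position_mode": lambda target, mv, tm: 0,
        "class_pixels": lambda center: [1],
        "class_value": lambda pixels: 0,
        "prediction_pixels": lambda center: [float(center[0]), float(center[1])],
    }
    print(convert_frame(2, 2, 1, stub, {(1, 0, 0): [1.0, 10.0]}))
```

The outer loops over frames and over input images would wrap this function, with the time mode value determined anew for each output frame.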
[0060]
The present invention is not limited to the embodiment described above, and various modifications and applications can be made without departing from the gist of the present invention. For example, a prediction calculation expression other than a linear combination may be used. Further, the classification may be performed using features other than pixel levels.
[0061]
[Effects of the Invention]
As described above, the present invention learns the relationship between an actually captured teacher image signal at the post-conversion frame frequency, which is high-definition and has natural motion, and the corresponding student image signal at the pre-conversion frame frequency. Learning is performed for each class of the properties of the student image signal at the pre-conversion frame frequency. An input image signal having the pre-conversion frame frequency is then converted into an output image signal having the post-conversion frame frequency by using the prediction coefficients obtained as a result of learning. By such processing, an output image signal having high spatial resolution and natural motion can be obtained.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of an embodiment of the present invention.
FIG. 2 is a block diagram of an example of an image signal relation learning device.
FIG. 3 is a flowchart for explaining an example of learning processing when the image signal relation learning device is realized by software.
FIG. 4 is a schematic diagram for illustrating a time mode value in one embodiment of the present invention.
FIG. 5 is a schematic diagram for explaining a tap center position determining method according to an embodiment of the present invention.
FIG. 6 is a flowchart for explaining an example of processing when the tap center position determination method is realized by software.
FIG. 7 is a schematic diagram illustrating some arrangement examples of class pixel groups.
FIG. 8 is a schematic diagram for explaining an example of a class determination method.
FIG. 9 is a schematic diagram used for explaining position mode values.
FIG. 10 is a flowchart for explaining an example of processing when a method for obtaining a position mode value is realized by software.
FIG. 11 is a schematic diagram illustrating some arrangement examples of student pixel groups.
FIG. 12 is a block diagram of an example of an apparatus for applying an image signal relationship.
FIG. 13 is a flowchart illustrating an example of processing when the image signal relationship application apparatus is realized by software.
[Explanation of symbols]
101, 201 ... motion vector detection unit, 103, 203 ... time mode value determination unit, 104, 204 ... tap center position determination unit, 105, 205 ... student pixel group extraction unit, 106, 206 ... class pixel group extraction unit, 107 ... prediction coefficient learning unit, 108, 208 ... class value determination unit, 109, 209 ... position mode value determination unit, 110 ... teacher pixel extraction unit, 207 ... Prediction calculation unit, 210 ... Prediction coefficient memory

Claims

A learning device that calculates, based on a student moving image signal having a pre-conversion frame frequency and a teacher moving image signal having a post-conversion frame frequency, prediction coefficients for converting the frame frequency of an input moving image signal to generate an output moving image signal, the learning device comprising:
motion vector detecting means for obtaining an inter-frame motion vector of a pixel of a student image included in the student moving image signal, the pixel corresponding to a target pixel set in a teacher image included in the teacher moving image signal;
tap center position determining means for determining a tap center position on the student image by shifting the motion vector to the position of the target pixel and allocating the motion vector to the frames from which the motion vector was calculated;
class pixel group extracting means for extracting one or more pixels from the student image as a class pixel group based on the tap center position;
class value determining means for calculating a class value corresponding to the tap center position from the class pixel group;
student pixel group extraction means for extracting one or more pixels from the student image as a student pixel group based on the tap center position; and
prediction coefficient learning means for generating a predicted teacher pixel based on the student pixel group and the prediction coefficients, and calculating the prediction coefficients for each class value so that the error between the pixel value of the predicted teacher pixel and the pixel value of the teacher pixel is minimized.

An application device comprising:
a prediction coefficient memory for storing prediction coefficients calculated by the learning device according to claim 1;
second motion vector detecting means for obtaining, as a second motion vector, an inter-frame motion vector of an input image included in an input moving image signal;
second tap center position determining means for determining a tap center position on the input image as a second tap center position based on the second motion vector;
second class pixel group extracting means for extracting one or more pixels from the input moving image signal as a second class pixel group based on the second tap center position;
second class value determining means for calculating, as a second class value, a class value corresponding to the second tap center position from the second class pixel group;
prediction pixel group extraction means for extracting one or more pixels from the input image as a prediction pixel group based on the second tap center position; and
prediction calculation means for generating an output moving image signal based on the prediction pixel group and the prediction coefficients corresponding to the second class value.

The learning apparatus according to claim 1, wherein the teacher moving image signal has a larger number of frames per unit time than the student moving image signal.

A learning method for calculating, based on a student moving image signal having a pre-conversion frame frequency and a teacher moving image signal having a post-conversion frame frequency, prediction coefficients for converting the frame frequency of an input moving image signal to generate an output moving image signal, the learning method comprising:
a motion vector detecting step of obtaining an inter-frame motion vector of a pixel of a student image included in the student moving image signal, the pixel corresponding to a target pixel set in a teacher image included in the teacher moving image signal;
a tap center position determining step of determining a tap center position on the student image by shifting the motion vector to the position of the target pixel and allocating the motion vector to the frames from which the motion vector was calculated;
a class pixel group extraction step of extracting one or more pixels from the student image as a class pixel group based on the tap center position;
a class value determining step of calculating a class value corresponding to the tap center position from the class pixel group;
a student pixel group extraction step of extracting one or more pixels from the student image as a student pixel group based on the tap center position; and
a prediction coefficient learning step of generating a predicted teacher pixel based on the student pixel group and the prediction coefficients, and calculating the prediction coefficients for each class value so that the error between the pixel value of the predicted teacher pixel and the pixel value of the teacher pixel is minimized.

An application method comprising:
a prediction coefficient memory for storing prediction coefficients calculated by the learning method according to claim 4;
a second motion vector detecting step of obtaining, as a second motion vector, an inter-frame motion vector of an input image included in an input moving image signal;
a second tap center position determining step of determining a tap center position on the input image as a second tap center position based on the second motion vector;
a second class pixel group extraction step of extracting one or more pixels from the input moving image signal as a second class pixel group based on the second tap center position;
a second class value determining step of calculating, as a second class value, a class value corresponding to the second tap center position from the second class pixel group;
a prediction pixel group extraction step of extracting one or more pixels from the input image as a prediction pixel group based on the second tap center position; and
a prediction calculation step of generating an output moving image signal based on the prediction pixel group and the prediction coefficients corresponding to the second class value.