CN105741252A - Sparse representation and dictionary learning-based video image layered reconstruction method - Google Patents

Sparse representation and dictionary learning-based video image layered reconstruction method

Info

Publication number
CN105741252A
Authority
CN
China
Prior art keywords: resolution, image, dictionary, texture, low
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510789969.6A
Other languages
Chinese (zh)
Other versions
CN105741252B (en)
Inventor
王海
王柯
刘岩
张皓迪
李彬
毛敏泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201510789969.6A priority Critical patent/CN105741252B/en
Publication of CN105741252A publication Critical patent/CN105741252A/en
Application granted granted Critical
Publication of CN105741252B publication Critical patent/CN105741252B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a sparse representation and dictionary learning-based video image layered reconstruction method, whose main objective is to solve the problem of the long time consumed by video image reconstruction in the prior art. The method includes the following steps: (1) obtaining a sample set; (2) layering the images in the sample set; (3) training the sample-set images before and after layering to obtain high-resolution and low-resolution dictionaries for the sample set before and after layering; (4) dividing the image to be reconstructed into a main region, a secondary region and a region of non-interest; (5) reconstructing the main region according to the high-resolution and low-resolution dictionaries of the layered sample set; (6) reconstructing the secondary region according to the high-resolution and low-resolution dictionaries of the sample set before layering; (7) reconstructing the region of non-interest; (8) fusing the reconstructed main region and the reconstructed secondary region into the reconstructed region of non-interest to obtain a complete reconstructed image. The method reduces the reconstruction time of the image and can be used for the processing of medical images, natural images and remote sensing images.

Description

Video image hierarchical reconstruction method based on sparse representation and dictionary learning
Technical Field
The invention belongs to the technical field of video and image processing, and relates to a super-resolution reconstruction method for video images that can be used in applications requiring high-resolution images, such as medical images, natural images and remote sensing images.
Background
Due to the inherent limitations of imaging systems and the influence of many factors such as atmospheric interference, a single acquired image or video often suffers from poor imaging quality and low resolution. How to recover the original appearance of a video image, or improve quality indexes such as resolution and definition as far as possible, given the existing hardware and the acquired video image, has always been a hot topic in video-image research and engineering applications. Super-resolution reconstruction is a technology that can effectively improve the resolution level of a video image: it reconstructs an acquired single-frame or multi-frame low-resolution image using prior knowledge such as an image mathematical model, so as to obtain a high-resolution image.
Currently there are three main approaches to super-resolution reconstruction: interpolation-based, reconstruction-based and learning-based methods. Traditional interpolation methods include nearest-neighbor, bilinear and bicubic interpolation; although interpolation is simple and easy to implement, the edges of the reconstructed image suffer from defects such as discontinuity, ringing or over-smoothing. Reconstruction-based methods model the acquisition process of the low-resolution image, express the prior knowledge corresponding to the high-resolution information as regularization constraints, and convert the image super-resolution problem into the estimation of the high-resolution image from the low-resolution image, i.e. an optimization problem under a constrained cost criterion. Learning-based super-resolution reconstruction has been the mainstream approach in image restoration in recent years, and its idea is derived from machine learning. Freeman et al. proposed an example-based super-resolution method: high- and low-resolution sample images are divided into blocks through machine learning, a Markov network is used to model their spatial relationship, and for each block of the low-resolution image to be reconstructed the most appropriate position in the Markov grid is sought in the learned model, thereby realizing super-resolution reconstruction. Although this method can recover more detail, it processes the full image area, usually requires a long reconstruction time, and is not suitable for reconstructing video images containing multiple moving objects.
Disclosure of Invention
The invention aims to provide, against the defects of the prior art, a video image hierarchical reconstruction method based on sparse representation and dictionary learning: for video images containing multiple moving objects, it preserves the reconstruction quality of the main content while reducing the reconstruction time, laying a foundation for real-time video reconstruction.
The technical idea of the invention is as follows: the images in a sample set are layered with a morphological component analysis (MCA) method; the images before and after layering are trained with the KSVD algorithm to obtain training dictionaries; the image to be reconstructed is divided into a region of interest and a region of non-interest with the Snake algorithm, and the region of interest is further divided into a main region and a secondary region according to the size of the moving targets; the main region is super-resolution reconstructed with a dual-dictionary learning method, the secondary region with a single-dictionary learning method, and the region of non-interest by interpolation; finally the reconstructed main region, secondary region and region of non-interest are fused to obtain the reconstructed image. The method comprises the following specific steps:
(1) obtaining a sample set I = {I_h, I_l} from a sample database, where I_h = {I_h^i} represents the high-resolution sample set and I_l = {I_l^i} represents the low-resolution sample set; a high-resolution image I_h^i and a low-resolution image I_l^i with the same content in the sample set I form a sample pair I^i = {I_h^i, I_l^i};
(2) carrying out texture layering and structure layering on the images in the sample set I with a morphological component analysis method to obtain a high-resolution texture layer I_ht, a high-resolution structure layer I_hs, a low-resolution texture layer I_lt and a low-resolution structure layer I_ls;
(3) training the high-resolution sample images I_h and the low-resolution sample images I_l in the sample set I with the KSVD algorithm to obtain a high-resolution dictionary D_h and a low-resolution dictionary D_l;
(4) training each layered image in the sample set I with the KSVD algorithm to obtain a texture high-resolution dictionary D_ht, a structure high-resolution dictionary D_hs, a texture low-resolution dictionary D_lt and a structure low-resolution dictionary D_ls;
(5) Dividing a low-resolution video single-frame image to be reconstructed into an interested area and an uninteresting area;
(6) dividing an interested area of a low-resolution video single-frame image to be reconstructed into a main area and a secondary area;
(7) performing super-resolution reconstruction on the main region by adopting a double-dictionary learning method, performing super-resolution reconstruction on the sub-region by adopting a single-dictionary learning method, and performing interpolation reconstruction on the region of no interest by adopting an interpolation method;
(8) and fusing the reconstructed main region and the reconstructed secondary region into the reconstructed region of no interest to obtain a complete reconstructed image.
Compared with the prior art, the invention has the following advantages:
1. the invention carries out hierarchical reconstruction on the video image, adopts reconstruction methods with different precision levels for different areas in the video image, carries out reconstruction based on double-dictionary learning on a main area, carries out reconstruction based on single-dictionary learning on a sub-area, and carries out interpolation reconstruction on an uninteresting area, thereby solving the problem of longer reconstruction time caused by the fact that the existing super-resolution reconstruction method based on dictionary learning acts on a full-image area and laying a foundation for real-time reconstruction of the video;
2. when the interesting region of the video image is extracted, the accurate closed contour of the moving target is detected by using a Snake algorithm, and the minimum rectangular region containing the accurate closed contour is used as the interesting region, so that the obtained interesting region can be minimum while containing the moving target;
3. according to the method, the moving target in the video image is divided into the main target and the secondary target according to the pixel area, and the primary and secondary targets are represented by using the pixel area, so that the primary and secondary classification of the target can be directly and effectively carried out, the calculation complexity is not increased on the whole, and the reconstruction time of the video image is further shortened;
4. according to the method, the minimum rectangular region containing the main target in the single-frame image of the low-resolution video to be reconstructed is used as the main region, so that when the main region is reconstructed by a super-resolution reconstruction algorithm based on double-dictionary learning, the action region is minimum, and the reconstruction time of the main region is shortened;
5. when calculating the sparse representation of each block in the main region, a search algorithm is used to find the best matching blocks in the three frames before and after the frame containing the block, and the weighted sum of the sparse representations of these best matching blocks is taken as the sparse representation of the block; the sparse coefficients obtained by exploiting the spatio-temporal correlation between neighboring video frames are more accurate, which further improves the reconstruction of the main region in the single frame to be reconstructed;
In conclusion, the invention can effectively perform hierarchical reconstruction of low-resolution video images; while ensuring the reconstruction quality of the main content (the main targets), it reduces the reconstruction time of the video image and lays a foundation for real-time video reconstruction.
Drawings
FIG. 1 is a general flow chart of an implementation of the present invention;
FIG. 2 is a sub-flowchart of texture layer sparse representation and structure layer sparse representation of the calculation of the main region in the present invention.
Detailed description of the preferred embodiments
The steps of the present invention are described in further detail below with reference to FIG. 1:
step 1, a sample set is obtained.
The image set provided by the PASCAL VOC challenge was used as the sample database; it comprises 20 classes in four major categories: person, animal, vehicle and indoor. The animals comprise bird, cat, cattle, dog, horse and sheep; the vehicles comprise airplane, bicycle, boat, bus, car, motorcycle and train; the indoor objects comprise bottle, chair, dining table, potted plant, sofa and television.
10 images were selected at random under each class, giving 200 sample images. These 200 sample images form the high-resolution sample set I_h = {I_h^i, i = 1, 2, ..., 200}. Each of the 200 sample images is down-sampled by a factor of 3 to obtain 200 low-resolution images, which form the low-resolution sample set I_l = {I_l^i, i = 1, 2, ..., 200}. The high-resolution sample set I_h and the low-resolution sample set I_l together form the sample set I = {I_h, I_l}; a high-resolution image I_h^i and a low-resolution image I_l^i with the same content form a sample pair I^i = {I_h^i, I_l^i}.
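For concreteness, the following minimal sketch shows how one high/low-resolution sample pair of step 1 can be produced; the use of block averaging for the 3× down-sampling (rather than some other decimation filter) is an assumption, since the description only states that each high-resolution sample is down-sampled by a factor of 3.

```python
# Sketch of step 1 (sample-pair construction). The averaging-based decimation is
# an assumption; the text only fixes the down-sampling factor of 3.
import numpy as np

def build_sample_pair(img_h: np.ndarray, factor: int = 3):
    """Return one (high-resolution, low-resolution) sample pair."""
    # Crop so both dimensions are exact multiples of the down-sampling factor.
    h = (img_h.shape[0] // factor) * factor
    w = (img_h.shape[1] // factor) * factor
    img_h = img_h[:h, :w].astype(np.float64)
    # 3x down-sampling: average each factor x factor cell.
    img_l = img_h.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return img_h, img_l

# sample_set = [build_sample_pair(img) for img in high_res_images]   # 200 pairs I^i
```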
And 2, carrying out texture layering and structure layering on the images in the sample set by using a morphological component analysis method.
The core of morphological component analysis (MCA) is to represent the image morphologies with optimal sparsity. Suppose the image X to be processed contains different morphologies, i.e. X is composed of different transparent layers {X_λ, λ = 1, 2, ..., Λ}, with X = X_1 + X_2 + ... + X_λ + ... + X_Λ. The MCA method uses a set of over-complete dictionaries {T_1, T_2, ..., T_λ, ..., T_Λ} to describe the layers of the image X: the λ-th layer X_λ can be sparsely represented only by the dictionary T_λ and cannot be represented by the atoms of any other dictionary T_γ (γ ≠ λ), so the layering of the image X can be realized by constructing such a set of over-complete dictionaries {T_1, T_2, ..., T_λ, ..., T_Λ}.
This example decomposes the image into two different morphologies, a texture layer X_t and a structure layer X_s, so an over-complete dictionary pair {T_t, T_s} must be constructed, where T_t is the dictionary describing the image texture information and T_s is the dictionary describing the image structure information.
Tools for constructing texture dictionaries include the Gabor transform and the DCT transform, and tools for constructing structure dictionaries include the wavelet, curvelet, ridgelet and contourlet transforms. The dictionary is usually chosen according to a fidelity measurement function or a similar criterion, but selecting the optimal dictionary from a theoretical function is too complicated; therefore, in much image-processing work the image is analyzed empirically and a common transform that represents texture or structure well is chosen to separate the texture and structure parts of the image. This example selects, but is not limited to, the DCT transform to construct the texture dictionary of the image and the contourlet transform to construct the structure dictionary of the image. The concrete steps are as follows:
2.1) construction of texture dictionary
A DCT transform is applied to each of the 200 sample pairs I^i, i = 1, 2, ..., 200, in the sample set I, giving 200 DCT transform matrices for the high-resolution sample set I_h and 200 DCT transform matrices for the low-resolution sample set I_l. Taking these DCT transform matrices as image dictionaries yields 200 texture dictionaries T_ht^i for the high-resolution sample set I_h and 200 texture dictionaries T_lt^i for the low-resolution sample set I_l.
2.2) building a structural dictionary
A contourlet transform is applied to each of the 200 sample pairs I^i, i = 1, 2, ..., 200, in the sample set I, giving 200 contourlet transform matrices for the high-resolution sample set I_h and 200 contourlet transform matrices for the low-resolution sample set I_l. Taking these contourlet transform matrices as image dictionaries yields 200 structure dictionaries T_hs^i for the high-resolution sample set I_h and 200 structure dictionaries T_ls^i for the low-resolution sample set I_l.
2.3) calculating the optimal texture sparse coefficient and the optimal structure sparse coefficient by using a matching pursuit algorithm
To obtain the texture layer I_ht^i and the structure layer I_hs^i of a high-resolution sample image I_h^i, the optimal sparse representation of I_h^i under the high-resolution texture dictionary T_ht^i and the high-resolution structure dictionary T_hs^i must be computed, i.e. the following optimization problem is solved:

{α_ht^i*, α_hs^i*} = argmin_{α_ht^i, α_hs^i} { ||α_ht^i||_1 + ||α_hs^i||_1 }  s.t.  ||I_h^i − T_ht^i × α_ht^i − T_hs^i × α_hs^i||_2 ≤ ε,  i = 1, 2, ..., 200

where ε = 1.0 × 10^-6 is the sparsity empirical value, α_ht^i and α_hs^i are respectively the computed high-resolution texture sparse coefficient and high-resolution structure sparse coefficient, and α_ht^i* and α_hs^i* are respectively the high-resolution optimal texture sparse coefficient and the high-resolution optimal structure sparse coefficient.
Algorithms for solving this optimization problem include the matching pursuit algorithm, the basis pursuit algorithm and the orthogonal matching pursuit algorithm. The matching pursuit algorithm is a greedy algorithm that obtains the sparse representation of a signal by successive approximation; its principle is simple, it is convenient to implement, and it is currently the most common method for sparse signal decomposition. Therefore this example adopts, but is not limited to, the matching pursuit algorithm for the sparse decomposition of the image; the matching pursuit algorithm is also used for the sparse decomposition in the subsequent dictionary-learning-based image reconstruction.
The same processing is applied to the low-resolution sample images I_l^i to obtain the low-resolution optimal texture sparse coefficients α_lt^i*, i = 1, 2, ..., 200, and the low-resolution optimal structure sparse coefficients α_ls^i*, i = 1, 2, ..., 200.
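Since the matching pursuit algorithm is reused throughout this description (steps 2.3), 3.3) and 7.2)), a minimal sketch is given below. The residual-norm stopping rule and the cap on the number of selected atoms are assumptions; the text only fixes the sparsity threshold ε.

```python
# Minimal matching-pursuit sketch for the sparse decompositions used in this
# description. Assumes the columns (atoms) of D have unit l2 norm.
import numpy as np

def matching_pursuit(y: np.ndarray, D: np.ndarray, eps: float = 1e-6, max_atoms: int = 50):
    """Greedy sparse coding of the signal y over the dictionary D (columns = atoms)."""
    residual = y.astype(np.float64).copy()
    alpha = np.zeros(D.shape[1])
    for _ in range(max_atoms):
        correlations = D.T @ residual              # match every atom against the residual
        k = int(np.argmax(np.abs(correlations)))   # pick the best-matching atom
        alpha[k] += correlations[k]                # accumulate its coefficient
        residual -= correlations[k] * D[:, k]      # remove that contribution
        if np.linalg.norm(residual) <= eps:        # stop once the residual is small enough
            break
    return alpha
```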
2.4) Calculating the texture layer and the structure layer of each image:
2.4a) from the high-resolution texture dictionary T_ht^i and the high-resolution optimal texture sparse coefficient α_ht^i*, the high-resolution texture layer is obtained as I_ht^i = T_ht^i × α_ht^i*, i = 1, 2, ..., 200; denote I_ht = {I_ht^i, i = 1, 2, ..., 200};
2.4b) from the high-resolution structure dictionary T_hs^i and the high-resolution optimal structure sparse coefficient α_hs^i*, the high-resolution structure layer is obtained as I_hs^i = T_hs^i × α_hs^i*, i = 1, 2, ..., 200; denote I_hs = {I_hs^i, i = 1, 2, ..., 200};
2.4c) from the low-resolution texture dictionary T_lt^i and the low-resolution optimal texture sparse coefficient α_lt^i*, the low-resolution texture layer is obtained as I_lt^i = T_lt^i × α_lt^i*, i = 1, 2, ..., 200; denote I_lt = {I_lt^i, i = 1, 2, ..., 200};
2.4d) from the low-resolution structure dictionary T_ls^i and the low-resolution optimal structure sparse coefficient α_ls^i*, the low-resolution structure layer is obtained as I_ls^i = T_ls^i × α_ls^i*, i = 1, 2, ..., 200; denote I_ls = {I_ls^i, i = 1, 2, ..., 200}.
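The texture/structure split of step 2 can then be summarized as joint sparse coding over the concatenated dictionary pair {T_t, T_s}. The sketch below is a greedy stand-in for the ℓ1 problem of step 2.3); how the two dictionaries are built (DCT and contourlet transforms) is assumed to be handled elsewhere.

```python
# Sketch of the MCA layering of step 2: the image is sparse-coded over the
# concatenated over-complete dictionary {T_t, T_s}; coefficients belonging to T_t
# synthesize the texture layer, those belonging to T_s the structure layer.
import numpy as np

def mca_split(x: np.ndarray, T_t: np.ndarray, T_s: np.ndarray, sparse_code):
    """Split the vectorized image x into a texture layer and a structure layer."""
    T = np.concatenate([T_t, T_s], axis=1)   # over-complete dictionary {T_t, T_s}
    alpha = sparse_code(x, T)                # e.g. matching_pursuit from the sketch above
    k = T_t.shape[1]
    x_t = T_t @ alpha[:k]                    # texture layer X_t
    x_s = T_s @ alpha[k:]                    # structure layer X_s
    return x_t, x_s
```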
And 3, training the images in the sample set by using a KSVD algorithm.
Image super-resolution reconstruction based on dictionary learning usually needs to train a large number of sample images to obtain a high-resolution dictionary and a low-resolution dictionary, and the efficiency of training the dictionary is greatly influenced by the number of dictionary atoms, so that the method for effectively reducing the number of dictionary atoms is very important.
The methods for dictionary learning are mainly divided into two categories: unsupervised dictionary learning and supervised dictionary learning. Unsupervised dictionary learning aims at learning a dictionary with good representation capability, and supervised dictionary learning is commonly used in computer recognition tasks because of taking the discriminability of the dictionary into consideration. In the image super-resolution reconstruction based on dictionary learning, the optimal sparse representation of the image needs to be obtained, and a good dictionary can enable the corresponding sparse representation to have higher sparsity, so that the method of unsupervised dictionary learning is selected to train images of all layers in the example.
Representative unsupervised dictionary learning methods include the MOD method and the KSVD method, whose optimization objective functions are the same. During the dictionary update carried out with the matching pursuit algorithm, the MOD method solves for the whole dictionary at once with a global update, whereas the KSVD method improves on MOD with a sequential column-by-column update: each iteration updates only one column of the dictionary, i.e. one atom. The KSVD method, as an optimization method that updates columns sequentially, can effectively reduce the number of dictionary atoms, and the trained atoms can still linearly represent all the information of the initial dictionary; therefore this example uses, but is not limited to, the KSVD algorithm to train the sample images.
In this example the high-resolution sample images I_h and the low-resolution sample images I_l are trained with the KSVD algorithm in the same way; taking the high-resolution sample images I_h as an example, the specific implementation steps are as follows:
3.1) Overlapping blocking of the high-resolution sample images I_h
Each image I_h^i of the high-resolution sample set I_h is divided into overlapping blocks in a raster-scan manner; the block size is 9 × 9 pixels and the blocks overlap in both the horizontal and the vertical direction. This gives the high-resolution sample block set Y_h = {y_h^m, m = 1, 2, ..., M}, where y_h^m denotes the m-th block of the high-resolution sample images I_h and M denotes the number of blocks of I_h.
3.2) Constructing the initial value of the high-resolution dictionary D_h
The first 1024 sample blocks in the high-resolution sample block set Y_h are DCT-transformed to obtain 1024 DCT transform matrices of size 9 × 9. Each 9 × 9 DCT transform matrix is unfolded into a column vector, giving 1024 column vectors of length 81, and the 1024 column vectors are stacked column-wise into a matrix of size 81 × 1024; this matrix is used as the initial value of the high-resolution dictionary D_h.
3.3) computing an optimal high resolution dictionary
Using the KSVD algorithm, the high-resolution dictionary D_h is updated through the following optimization until the sparse representation of the high-resolution sample block set Y_h under D_h is optimal:

D_h* = argmin_{D_h} { Σ_{m=1}^{M} ||y_h^m − D_h × α_h^m||_2^2 }  s.t.  ||α_h^m||_0 ≤ ε,  m = 1, 2, ..., M

where α_h^m is the sparse representation of the sample block y_h^m under the high-resolution dictionary D_h, ε = 1.0 × 10^-6 is the sparsity empirical value, and D_h* is the optimal high-resolution dictionary.
Considering that the low-resolution sample images are obtained from the high-resolution sample images by 3× down-sampling, when training the low-resolution sample images I_l the block size in step 3.1) is set to 3 × 3 pixels, so the initial dictionary obtained in step 3.2) has size 9 × 1024; the other operations are the same as in steps 3.1)-3.3), and the optimal low-resolution dictionary D_l* is obtained.
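The following compressed sketch illustrates steps 3.1)-3.3) end to end: overlapping 9 × 9 blocks, a 1024-atom initial dictionary built from DCT-transformed blocks, and the column-by-column K-SVD update. The block step size and the number of K-SVD iterations are assumptions not fixed by the text; the sparse-coding stage reuses the matching_pursuit sketch given after step 2.3).

```python
# Sketch of steps 3.1)-3.3): block extraction, DCT-based initialization and K-SVD.
import numpy as np
from scipy.fft import dctn

def extract_blocks(img: np.ndarray, size: int = 9, step: int = 3) -> np.ndarray:
    """Overlapping size x size blocks, scanned row by row, returned as columns."""
    blocks = [img[r:r + size, c:c + size].ravel()
              for r in range(0, img.shape[0] - size + 1, step)
              for c in range(0, img.shape[1] - size + 1, step)]
    return np.array(blocks, dtype=np.float64).T          # shape (size*size, M)

def initial_dictionary(Y: np.ndarray, n_atoms: int = 1024) -> np.ndarray:
    """DCT-transform the first n_atoms blocks and stack them as unit-norm columns."""
    size = int(np.sqrt(Y.shape[0]))
    D = np.stack([dctn(Y[:, m].reshape(size, size), norm='ortho').ravel()
                  for m in range(n_atoms)], axis=1)      # e.g. 81 x 1024
    return D / np.linalg.norm(D, axis=0)

def ksvd(Y: np.ndarray, D: np.ndarray, n_iter: int = 10, eps: float = 1e-6):
    """Column-by-column K-SVD update of D for the training block set Y."""
    for _ in range(n_iter):
        # sparse-coding stage (matching_pursuit as sketched earlier)
        A = np.column_stack([matching_pursuit(Y[:, m], D, eps) for m in range(Y.shape[1])])
        for k in range(D.shape[1]):                      # update one atom at a time
            users = np.nonzero(A[k, :])[0]               # blocks that use atom k
            if users.size == 0:
                continue
            E = Y[:, users] - D @ A[:, users] + np.outer(D[:, k], A[k, users])
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]                            # rank-1 refit of the atom
            A[k, users] = S[0] * Vt[0, :]                # and of its coefficients
    return D, A
```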
And 4, training each layered image in the sample set by using a KSVD algorithm.
The high-resolution texture layer I_ht and the high-resolution structure layer I_hs are processed according to steps 3.1)-3.3) to obtain the texture high-resolution dictionary D_ht* and the structure high-resolution dictionary D_hs*.
When training the low-resolution texture layer I_lt and the low-resolution structure layer I_ls, the block size in step 3.1) is set to 3 × 3 pixels in this example, so the initial dictionary obtained in step 3.2) has size 9 × 1024; the other operations are the same as in steps 3.1)-3.3), and the optimal texture low-resolution dictionary D_lt* and the optimal structure low-resolution dictionary D_ls* are obtained.
And 5, dividing the low-resolution video single-frame image to be reconstructed into an interested area and an uninterested area.
Dividing the low-resolution video single frame to be reconstructed into a region of interest and a region of non-interest can be treated as the foreground/background separation problem in machine vision. In the machine-vision field, methods for separating the foreground and background of a video image fall mainly into two categories. The first builds a background model of the video or image sequence to obtain a background image, and obtains the foreground image by subtracting the background from the video image under test; conventional methods include Gaussian-mixture background modeling and the optical-flow method. However, the background extracted by these algorithms still contains blurred moving objects, so if they were used in this example the moving targets in the region of interest might not be sharp enough.
The second category directly extracts the moving targets using the motion information in the video image, takes the moving-target regions as the foreground image, and takes the remainder as the background image. Such methods usually extract the motion-vector information from the video bitstream and obtain, with the help of morphological processing, a binary image representing the motion region; however, the binary image usually describes the moving target with a large deviation, and if the minimal rectangular region containing that motion region were used as the foreground region, part of the moving target would inevitably be missing.
Considering that the Snake algorithm can detect a fairly accurate contour of a target even in a blurred image, this example adopts the Snake algorithm to extract the region of interest of the image to be reconstructed; the region of interest is the minimal rectangular region containing the fairly accurate contour of the moving target. The concrete steps are as follows:
5.1) acquiring a binary image representing a moving target:
5.1a) extracting motion information from an H.264 code stream of a low-resolution video single-frame image to be reconstructed to obtain a motion vector field MV of a current frame;
5.1b) representing the pixel gray value by using the vector length, normalizing the gray value to the range of [0,255], and converting the motion vector field MV of the current frame into a gray map G representing the motion area of the current frame;
5.1c) carrying out morphological processing on the gray-scale image G for representing the motion area of the current frame to obtain a binary image BW of the motion target.
5.2) extracting a more accurate contour of the moving target by using a Snake algorithm:
5.2a) extracting the closed outer contour of the moving-target binary image BW to obtain a curve v(s) = [x(s), y(s)], where x(s) and y(s) are respectively the abscissa and ordinate of a point on the contour curve and the parameter s ∈ [0, 1]; this curve is taken as the initial contour of the Snake algorithm;
5.2b) deforming the curve v(s) with the Snake algorithm so that it approaches the more accurate contour v(s)* of the moving target; this process can be converted into finding the following optimal solution:

v(s)* = argmin_{v(s)} ∫_0^1 E_snake(v(s)) ds = argmin_{v(s)} ∫_0^1 [ E_int(v(s)) + E_image(v(s)) + E_con(v(s)) ] ds

where E_int(v(s)) denotes the internal energy, defined from v_s and v_ss, the first and second derivatives of v(s), with α(s) and β(s) the weight parameters that respectively control the tension and smoothness of the curve v(s) and determine its degree of stretching and bending at a given point; E_image(v(s)) denotes the energy generated by the image force, which is usually designed from the image gray level and gradient information so as to highlight the salient features of the image and guide the curve v(s) towards the edge contour; E_con denotes the energy generated by the external constraint force, which this example sets to 0; v(s)* is the more accurate contour of the moving target.
5.3) Acquiring the region of interest and the region of non-interest
The minimal rectangular region of the low-resolution video single frame to be reconstructed that contains the more accurate closed contour v(s)* of the moving target is extracted as the region of interest P, and the part outside the region of interest is taken as the region of non-interest B.
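A minimal sketch of steps 5.1b), 5.1c) and 5.3) follows. The threshold on the normalized motion magnitude and the 3 × 3 structuring element are assumptions, and the Snake refinement of step 5.2) is omitted for brevity, so the rectangle returned is only the coarse motion bounding box.

```python
# Sketch of steps 5.1b)-5.1c) and 5.3): motion-vector field -> gray map -> binary
# motion mask -> minimal enclosing rectangle (region of interest).
import numpy as np
from scipy import ndimage

def motion_roi(mv: np.ndarray, thresh: int = 32):
    """mv: (H, W, 2) motion-vector field of the current frame."""
    length = np.hypot(mv[..., 0], mv[..., 1])               # vector length per position
    G = np.uint8(255 * length / max(length.max(), 1e-9))    # gray map normalized to [0, 255]
    BW = ndimage.binary_closing(G > thresh, structure=np.ones((3, 3)))
    BW = ndimage.binary_opening(BW, structure=np.ones((3, 3)))   # binary motion mask
    rows, cols = np.nonzero(BW)
    if rows.size == 0:
        return None                                          # no moving target found
    row, col = int(rows.min()), int(cols.min())
    del_row, del_col = int(rows.max()) - row + 1, int(cols.max()) - col + 1
    return row, col, del_row, del_col                        # Pos = [row, col, del_row, del_col]
```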
And 6, dividing the region of interest of the low-resolution video single-frame image to be reconstructed into a main region and a secondary region.
For a scene containing two or more moving targets, performing super-resolution reconstruction of the same precision on all moving targets, without distinguishing primary from secondary ones, makes the reconstruction time of the video image too long and occupies too many computing resources. Meanwhile, when a computer processes a digital image or a human observer watches a video image, more attention is usually paid to the information of the main target. Therefore, performing high-precision super-resolution reconstruction on the main targets in the video image and relatively lower-precision reconstruction on the secondary targets ensures the reconstruction quality of the main content of the video image while shortening the reconstruction time and improving the reconstruction efficiency.
Considering that when a video or an image is shot, a focused object tends to occupy more pixel area, in the example, a target with larger pixel area in a single-frame image of a low-resolution video to be reconstructed is taken as a main target, and a target with smaller pixel area is taken as a secondary target. The method comprises the following concrete steps:
6.1) using the more accurate closed contours v(s)* of the moving targets obtained in step 5.2), the pixel area of each target is calculated, A = {A_1, A_2, ..., A_n, ..., A_N}, where A_n denotes the pixel area of the n-th target, n = 1, 2, ..., N, and N denotes the number of moving targets in the video image;
6.2) the K-means algorithm is used to divide the target pixel areas A = {A_1, A_2, ..., A_n, ..., A_N} into two classes according to area: the class with the larger areas is recorded as the main targets A_m and the class with the smaller areas as the secondary targets A_sub (a minimal sketch of this division follows this list);
6.3) the minimal rectangular region containing the main targets A_m is taken as the main region P_m, and the part of the region of interest outside the main region is taken as the secondary region P_sub;
6.4) the position of the minimal rectangular region within the low-resolution video single frame to be reconstructed is recorded as Pos = [row, col, del_row, del_col], where (row, col) are the row and column coordinates of the top-left pixel of the minimal rectangular region, and del_row and del_col are respectively its numbers of rows and columns.
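A minimal sketch of the primary/secondary split of steps 6.1)-6.2) is given below; it assumes the pixel area of each target has already been measured from its closed contour, and it uses scikit-learn's K-means, which the description does not prescribe.

```python
# Sketch of step 6.2): two-class K-means on the target pixel areas A_1..A_N.
import numpy as np
from sklearn.cluster import KMeans

def split_primary_secondary(areas):
    """areas: pixel areas A_n of the detected moving targets; returns index sets."""
    A = np.asarray(areas, dtype=np.float64).reshape(-1, 1)
    if len(A) < 2:                                   # a single target is trivially primary
        return np.arange(len(A)), np.array([], dtype=int)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(A)
    # The cluster with the larger mean area holds the main targets A_m.
    large = int(A[labels == 1].mean() > A[labels == 0].mean())
    primary = np.nonzero(labels == large)[0]
    secondary = np.nonzero(labels != large)[0]
    return primary, secondary
```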
And 7, reconstructing the main region by adopting a double-dictionary learning method.
In order to solve the problem of long reconstruction time caused by the fact that the existing image super-resolution reconstruction method based on dictionary learning acts on a full-image region, the method carries out hierarchical reconstruction on a video image, wherein the super-resolution reconstruction based on double-dictionary learning is carried out on a main region containing a main target in the video image, and the specific implementation steps are as follows:
7.1) the main region P_m is divided into a texture layer P_mt and a structure layer P_ms according to step 2);
7.2) calculating the texture layer sparse representation and the structural layer sparse representation of the main area:
Most existing image super-resolution reconstruction methods directly process the image to be reconstructed with the trained dictionaries and achieve good reconstruction results. The input of this embodiment, however, is video, so when reconstructing a single frame, in order to further improve the reconstruction quality, the frame is not processed directly; instead, the temporal and spatial correlation between video frames is exploited: the three frames before and the three frames after the single frame to be reconstructed are selected as reference images, and the frame to be reconstructed is reconstructed indirectly by reconstructing the reference images.
Referring to fig. 2, this step is implemented as follows:
7.2a) the three frames before and the three frames after the frame containing the main region are selected as reference images, giving the reference image set P_r = {P_rj, j = 1, 2, ..., 6}, where P_rj denotes one reference image frame;
7.2b) according to step 2), the reference image set P_r is divided into texture layers P_rt = {P_rtj, j = 1, 2, ..., 6} and structure layers P_rs = {P_rsj, j = 1, 2, ..., 6}, where P_rtj is the texture layer of the reference image P_rj and P_rsj is its structure layer;
7.2c) the texture layer P_mt of the main region is divided into overlapping blocks in a raster-scan manner; the block size is 3 × 3 pixels and the blocks overlap in both the horizontal and the vertical direction, giving the block set of the main-region texture layer Y_mt = {y_mt^n, n = 1, 2, ..., N}, where y_mt^n denotes the n-th block of the main-region texture layer P_mt and N denotes the number of blocks of P_mt;
7.2d) using the Parallel Computing Toolbox in Matlab, six parallel tasks Pro_j, j = 1, 2, ..., 6, are created; each task Pro_j handles only the operations for the reference-image texture layer P_rtj;
7.2e) within task Pro_j, j = 1, 2, ..., 6, for each block y_mt^n of the main-region texture layer P_mt, a three-step search algorithm is used to find the best matching block y_rtj^n* in the reference-image texture layer P_rtj; the block-matching criterion is the MAD criterion, i.e. minimizing the mean absolute difference function MAD(d_h, d_v):
MAD(d_h, d_v) = (1 / (R·C)) Σ_{r=1}^{R} Σ_{c=1}^{C} | f(r, c) − f_rj(r + d_h, c + d_v) |

where R and C are respectively the numbers of rows and columns of the block y_mt^n, f(r, c) denotes the luminance of the pixel with coordinates (r, c) in the block, f_rj(r + d_h, c + d_v) denotes the luminance of the pixel with coordinates (r + d_h, c + d_v) in the reference-image texture layer P_rtj, and (d_h, d_v) is the motion displacement vector, d_h being the horizontal displacement and d_v the vertical displacement;
7.2f) from the texture low-resolution dictionary D_lt*, the sparse representation of the matching block is computed as α_rtj^n = (D_lt*)^(-1) × y_rtj^n*, where (D_lt*)^(-1) is the inverse matrix of D_lt*;
7.2g) the weight coefficient w_jn of the best matching block y_rtj^n* in the reference-image texture layer P_rtj is computed by the formula

w_jn = 1 / ( (y_mt^n − y_rtj^n*)(y_mt^n − y_rtj^n*)^T );
7.2h) the sparse representations α_rtj^n of the matching blocks are weighted and summed to obtain the texture-layer sparse representation of the main-region texture-layer block y_mt^n; the collection over all blocks is recorded as β_mt*, the texture-layer sparse representation of the main region (a condensed sketch of steps 7.2e)-7.2h) is given after step 7.5));
7.2i) the structure layer P_ms of the main region is processed according to steps 7.2c)-7.2h) to obtain the structure-layer sparse representation β_ms* of the main region.
7.3) from the texture-layer sparse representation β_mt* of the main region P_m and the texture high-resolution dictionary D_ht*, the reconstructed image of the main-region texture layer is obtained as P_mt* = D_ht* × β_mt*;
7.4) from the structure-layer sparse representation β_ms* of the main region and the structure high-resolution dictionary D_hs*, the reconstructed image of the main-region structure layer is obtained as P_ms* = D_hs* × β_ms*;
7.5) the reconstructed image P_mt* of the main-region texture layer and the reconstructed image P_ms* of the main-region structure layer are fused to obtain the complete reconstructed image of the main region.
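The condensed sketch below illustrates steps 7.2e)-7.2h) for a single 3 × 3 block of the main-region texture layer: MAD block matching in each reference texture layer, sparse coding of the matches with the (pseudo-)inverse of the texture low-resolution dictionary, and fusion with the inverse-squared-error weights. Exhaustive search replaces the three-step search, and the normalization of the weights is an assumption.

```python
# Sketch of steps 7.2e)-7.2h): MAD matching plus weighted fusion of sparse codes.
import numpy as np

def best_match(block: np.ndarray, ref: np.ndarray, size: int = 3) -> np.ndarray:
    """Block of `ref` minimizing the mean absolute difference (MAD) to `block`."""
    best, best_mad = None, np.inf
    for r in range(ref.shape[0] - size + 1):
        for c in range(ref.shape[1] - size + 1):
            cand = ref[r:r + size, c:c + size]
            mad = np.abs(block - cand).mean()
            if mad < best_mad:
                best, best_mad = cand, mad
    return best

def fused_sparse_code(block: np.ndarray, refs, D_lt: np.ndarray) -> np.ndarray:
    """Weighted sum of the sparse codes of the best matches in the reference frames."""
    D_inv = np.linalg.pinv(D_lt)                    # stands in for the inverse of D_lt*
    flat = block.ravel()
    codes, weights = [], []
    for ref in refs:                                # the six neighbouring texture layers
        match = best_match(block, ref).ravel()
        codes.append(D_inv @ match)                 # alpha_rtj^n = (D_lt*)^-1 x y_rtj^n*
        diff = flat - match
        weights.append(1.0 / max(float(diff @ diff), 1e-12))   # w_jn = 1 / ||y - y*||^2
    w = np.asarray(weights)
    w = w / w.sum()                                 # normalizing the weights is an assumption
    return sum(wi * ci for wi, ci in zip(w, codes)) # texture-layer code for this block
```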
And 8, reconstructing the secondary region by adopting a single dictionary learning method.
In order to shorten the reconstruction time of a video image and ensure the reconstruction quality of the main content of the video image, the embodiment carries out graded reconstruction on the video image, wherein super-resolution reconstruction based on single dictionary learning is carried out on a secondary region containing a secondary target in the video image, and the specific implementation steps are as follows:
8.1) from the optimal low-resolution dictionary D_l* obtained in step 3), the sparse representation of the secondary region P_sub is computed as β_sub = (D_l*)^(-1) × P_sub, where (D_l*)^(-1) is the inverse matrix of D_l*;
8.2) from the sparse representation β_sub of the secondary region and the optimal high-resolution dictionary D_h* obtained in step 3), the reconstructed image of the secondary region is obtained as P_sub* = D_h* × β_sub.
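A minimal sketch of step 8 as written follows: the secondary region is sparse-coded with the (pseudo-)inverse of the optimal low-resolution dictionary and re-synthesized with the optimal high-resolution dictionary. Treating the whole secondary region as one vectorized signal, rather than a grid of blocks, is a simplification.

```python
# Sketch of steps 8.1)-8.2): single-dictionary reconstruction of the secondary region.
import numpy as np

def reconstruct_secondary(P_sub: np.ndarray, D_l: np.ndarray, D_h: np.ndarray, hr_shape):
    """P_sub: low-resolution secondary region; D_l, D_h: optimal low/high-res dictionaries."""
    beta_sub = np.linalg.pinv(D_l) @ P_sub.ravel()   # beta_sub = (D_l*)^-1 x P_sub
    P_sub_hr = D_h @ beta_sub                        # P_sub* = D_h* x beta_sub
    return P_sub_hr.reshape(hr_shape)
```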
And 9, reconstructing the region of no interest by adopting an interpolation method.
Currently there are three main approaches to super-resolution reconstruction: interpolation-based, reconstruction-based and learning-based methods. Interpolation is algorithmically simple and easy to implement, but the quality of the reconstructed image is lower than with the other two approaches. This invention performs hierarchical reconstruction of the video image: the region of interest containing the moving targets is reconstructed with the learning-based method so that the moving targets are reconstructed well, while the region of non-interest is reconstructed by interpolation; although the reconstruction quality of the region of non-interest is sacrificed, the reconstruction quality of the main content (the moving targets) is preserved and the reconstruction time of the video image is shortened.
The interpolation method mainly includes a nearest neighbor interpolation method, a bilinear interpolation method and a bicubic interpolation method. The nearest neighbor interpolation method is simple and easy to realize, the calculated amount is small, but the image quality after interpolation is not high, and the block effect and the sawtooth effect often occur.
The bilinear interpolation method determines a corresponding weight value for the pixel value of each point to be interpolated according to the distance between the pixel value and the adjacent 4 points, and determines the pixel value of the point to be interpolated according to the weighted sum of the pixel values of the adjacent 4 points.
The bicubic interpolation method utilizes the gray values of 16 points around the point to be interpolated to carry out cubic interpolation, not only considers the gray influence of 4 directly adjacent points, but also considers the influence of the gray value change rate between the adjacent points, and the reconstruction effect is superior to the two methods. In this embodiment, the non-interesting region B is reconstructed by, but not limited to, bicubic interpolation, which has the following interpolation formula:
f(i + u, j + v) = A* × B* × C*

A* = [ S(1 + u)  S(u)  S(1 − u)  S(2 − u) ]

B* = | f(i−1, j−2)  f(i, j−2)  f(i+1, j−2)  f(i+2, j−2) |
     | f(i−1, j−1)  f(i, j−1)  f(i+1, j−1)  f(i+2, j−1) |
     | f(i−1, j)    f(i, j)    f(i+1, j)    f(i+2, j)   |
     | f(i−1, j+1)  f(i, j+1)  f(i+1, j+1)  f(i+2, j+1) |

C* = [ S(1 + v)  S(v)  S(1 − v)  S(2 − v) ]^T

S(w) = 1 − 2|w|^2 + |w|^3,           |w| < 1;
       4 − 8|w| + 5|w|^2 − |w|^3,    1 ≤ |w| < 2;
       0,                            |w| ≥ 2
wherein i and j are non-negative integers which respectively represent row coordinates and column coordinates of a point to be interpolated in an original image; u and v are floating point numbers in the interval of (0,1), and respectively represent the distance between a point to be interpolated and the nearest pixel point in the horizontal direction and the vertical direction; f (i, j) represents the pixel value of the original image at the coordinate (i, j); s (w) is a bicubic interpolation basis function, and an argument w belongs to R, and | w | represents taking an absolute value of the argument w.
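The kernel S(w) and the weighted 16-neighbour sum can be written compactly as below; the neighbour matrix uses the standard i−1..i+2 / j−1..j+2 layout implied by the kernel weights S(1+u), S(u), S(1−u), S(2−u), which may differ slightly from the matrix printed above.

```python
# Sketch of the bicubic interpolation of step 9: kernel S(w) and one interpolated point.
import numpy as np

def S(w: float) -> float:
    w = abs(w)
    if w < 1:
        return 1 - 2 * w**2 + w**3
    if w < 2:
        return 4 - 8 * w + 5 * w**2 - w**3
    return 0.0

def bicubic_point(f: np.ndarray, i: int, j: int, u: float, v: float) -> float:
    """Value of the up-sampled image at (i + u, j + v) with 0 < u, v < 1."""
    A = np.array([S(1 + u), S(u), S(1 - u), S(2 - u)])                  # row weights
    B = np.array([[f[i - 1 + m, j - 1 + n] for n in range(4)] for m in range(4)])
    C = np.array([S(1 + v), S(v), S(1 - v), S(2 - v)])                  # column weights
    return float(A @ B @ C)
```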
Step 10: the reconstructed main region obtained in step 7) and the reconstructed secondary region obtained in step 8) are fused, according to the spatial position Pos = [row, col, del_row, del_col] recorded in step 6.4), into the reconstructed region of non-interest obtained in step 9), yielding the complete reconstructed image.
The above description is only one specific example of the present invention and should not be construed as limiting the invention in any way. It will be apparent to persons skilled in the relevant art(s) that various modifications and changes in form and detail can be made therein without departing from the principles and structures of the invention, but such modifications and changes are within the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. A video image hierarchical reconstruction method based on sparse representation and dictionary learning comprises the following steps:
(1) obtaining a sample set I = {I_h, I_l} from a sample database, where I_h = {I_h^i} represents the high-resolution sample set and I_l = {I_l^i} represents the low-resolution sample set; a high-resolution image I_h^i and a low-resolution image I_l^i with the same content in the sample set I form a sample pair I^i = {I_h^i, I_l^i};
(2) carrying out texture layering and structure layering on the images in the sample set I with a morphological component analysis method to obtain a high-resolution texture layer I_ht, a high-resolution structure layer I_hs, a low-resolution texture layer I_lt and a low-resolution structure layer I_ls;
(3) training the high-resolution sample images I_h and the low-resolution sample images I_l in the sample set I with the KSVD algorithm to obtain a high-resolution dictionary D_h and a low-resolution dictionary D_l;
(4) training each layered image in the sample set I with the KSVD algorithm to obtain a texture high-resolution dictionary D_ht, a structure high-resolution dictionary D_hs, a texture low-resolution dictionary D_lt and a structure low-resolution dictionary D_ls;
(5) Dividing a low-resolution video single-frame image to be reconstructed into an interested area and an uninteresting area;
(6) dividing an interested area of a low-resolution video single-frame image to be reconstructed into a main area and a secondary area;
(7) performing super-resolution reconstruction on the main region by adopting a double-dictionary learning method, performing super-resolution reconstruction on the sub-region by adopting a single-dictionary learning method, and performing interpolation reconstruction on the region of no interest by adopting an interpolation method;
(8) and fusing the reconstructed main region and the reconstructed secondary region into the reconstructed region of no interest to obtain a complete reconstructed image.
2. The video image hierarchical reconstruction method based on sparse representation and dictionary learning according to claim 1, characterized in that: in the step (2), the texture layering and the structure layering are carried out on the image in the sample set I by using a morphological component analysis method, and the method comprises the following steps:
(2a) applying a DCT transform to each sample pair I^i, and constructing, from the transformed data, the high-resolution texture dictionary T_ht^i and the low-resolution texture dictionary T_lt^i;
(2b) applying a contourlet transform to each sample pair I^i, and constructing, from the transformed data, the high-resolution structure dictionary T_hs^i and the low-resolution structure dictionary T_ls^i;
(2c) computing, with a matching pursuit algorithm, the optimal sparse representation of the high-resolution image I_h^i under the high-resolution texture dictionary T_ht^i and the high-resolution structure dictionary T_hs^i, i.e. converting the calculation into the optimization

{α_ht^i*, α_hs^i*} = argmin_{α_ht^i, α_hs^i} { ||α_ht^i||_1 + ||α_hs^i||_1 }  s.t.  ||I_h^i − T_ht^i × α_ht^i − T_hs^i × α_hs^i||_2 ≤ ε,

where ε is the sparsity empirical value, α_ht^i and α_hs^i are respectively the high-resolution texture sparse coefficient and the high-resolution structure sparse coefficient computed with the matching pursuit algorithm, and α_ht^i* and α_hs^i* are respectively the high-resolution optimal texture sparse coefficient and the high-resolution optimal structure sparse coefficient;
(2d) computing, according to step (2c), the optimal sparse representation of the low-resolution image I_l^i under the low-resolution texture dictionary T_lt^i and the low-resolution structure dictionary T_ls^i, obtaining the low-resolution optimal texture sparse coefficient α_lt^i* and the low-resolution optimal structure sparse coefficient α_ls^i*;
(2e) from the high-resolution texture dictionary T_ht^i and the high-resolution optimal texture sparse coefficient α_ht^i*, obtaining the high-resolution texture layer I_ht^i = T_ht^i × α_ht^i*, and recording I_ht as the high-resolution texture layer of the sample set I; from the high-resolution structure dictionary T_hs^i and the high-resolution optimal structure sparse coefficient α_hs^i*, obtaining the high-resolution structure layer I_hs^i = T_hs^i × α_hs^i*, and recording I_hs as the high-resolution structure layer of the sample set I;
(2f) from the low-resolution texture dictionary T_lt^i and the low-resolution optimal texture sparse coefficient α_lt^i*, obtaining the low-resolution texture layer I_lt^i = T_lt^i × α_lt^i*, and recording I_lt as the low-resolution texture layer of the sample set I; from the low-resolution structure dictionary T_ls^i and the low-resolution optimal structure sparse coefficient α_ls^i*, obtaining the low-resolution structure layer I_ls^i = T_ls^i × α_ls^i*, and recording I_ls as the low-resolution structure layer of the sample set I.
3. The video image hierarchical reconstruction method based on sparse representation and dictionary learning according to claim 1, characterized in that: in the step (3), the KSVD algorithm is used for training the images in the sample set I, and the method comprises the following steps:
(3a) dividing the high-resolution sample images I_h of the sample set I into overlapping blocks to obtain the high-resolution sample block set Y_h = {y_h^m, m = 1, 2, ..., M}, where y_h^m represents the m-th block of the high-resolution sample images I_h and M denotes the number of blocks of I_h;
(3b) randomly selecting sample blocks from the high-resolution sample block set Y_h, DCT-transforming them, and forming the initial value of the high-resolution dictionary D_h from the transformed data;
(3c) using the KSVD algorithm, updating the high-resolution dictionary D_h through the following optimization until the sparse representation of the high-resolution sample block set Y_h under D_h is optimal:

D_h* = argmin_{D_h} { Σ_{m=1}^{M} ||y_h^m − D_h × α_h^m||_2^2 }  s.t.  ||α_h^m||_0 ≤ ε,  m = 1, 2, ..., M

where α_h^m is the sparse representation of the sample block y_h^m under the high-resolution dictionary D_h, ε is the sparsity empirical value, and D_h* is the optimal high-resolution dictionary;
(3d) processing the low-resolution sample images I_l of the sample set I according to steps (3a)-(3c) to obtain the optimal low-resolution dictionary D_l*.
4. The video image hierarchical reconstruction method based on sparse representation and dictionary learning according to claim 1, characterized in that: step (5) dividing the low-resolution video single-frame image to be reconstructed into an interested area and a non-interested area, and performing the following steps:
(5a) carrying out moving target detection on a low-resolution video single-frame image to be reconstructed to obtain a binary image of a moving target;
(5b) taking the closed outline of the moving target binary image as an initial outline value of a Snake algorithm, and obtaining an accurate closed outline of the moving target through a successive iteration process of the Snake algorithm;
(5c) and taking the minimum rectangular area containing the precise closed contour of the moving target in the low-resolution video single-frame image to be reconstructed as an interested area P, and taking the part except the interested area as a non-interested area B.
5. The video image hierarchical reconstruction method based on sparse representation and dictionary learning according to claim 1, characterized in that: step (6) the interested area of the low-resolution video single-frame image to be reconstructed is divided into a main area and a secondary area, and the method comprises the following steps:
(6a) calculating the pixel area of each target by using the accurate closed contour of the moving target obtained in the step (5 b);
(6b) dividing the target into a main target and a secondary target according to the area size of the pixel;
(6c) taking the minimal rectangular region containing the main targets as the main region P_m, and taking the part outside the main region as the secondary region P_sub.
6. The video image hierarchical reconstruction method based on sparse representation and dictionary learning according to claim 1, characterized in that: performing super-resolution reconstruction on the main region by adopting a double-dictionary learning method in the step (7) according to the following steps:
(7a) dividing the main region P_m into a texture layer P_mt and a structure layer P_ms according to step (2);
(7b) selecting reference images of the main region, and calculating the texture-layer sparse representation β_mt* and the structure-layer sparse representation β_ms* of the main region from the texture-layer and structure-layer sparse representations of the reference images;
(7c) from the texture-layer sparse representation β_mt* of the main region P_m and the texture high-resolution dictionary D_ht*, obtaining the reconstructed image P_mt* of the main-region texture layer; from the structure-layer sparse representation β_ms* of the main region and the structure high-resolution dictionary D_hs*, obtaining the reconstructed image P_ms* of the main-region structure layer;
(7d) fusing the reconstructed image P_mt* of the main-region texture layer and the reconstructed image P_ms* of the main-region structure layer to obtain the complete reconstructed image of the main region.
7. The video image hierarchical reconstruction method based on sparse representation and dictionary learning according to claim 1, characterized in that: the reference images of the main region are selected in step (7b), and the texture-layer sparse representation β_mt* and the structure-layer sparse representation β_ms* of the main region are calculated from the texture-layer and structure-layer sparse representations of the reference images, according to the following steps:
(7b1) taking the frames before and after the frame containing the main region as reference images to obtain the reference image set P_r = {P_rj}, where P_rj represents one reference image frame, j = 1, 2, ..., 6;
(7b2) according to step (2), dividing the reference image set P_r into texture layers P_rt = {P_rtj} and structure layers P_rs = {P_rsj}, where P_rtj is the texture layer of the reference image P_rj and P_rsj is its structure layer;
(7b3) dividing the texture layer P_mt of the main region into overlapping blocks to obtain the block set of the main-region texture layer Y_mt = {y_mt^n, n = 1, 2, ..., N}, where y_mt^n represents the n-th block of the main-region texture layer P_mt and N denotes the number of blocks of P_mt;
(7b4) for each block y_mt^n of the main-region texture layer P_mt, using a three-step search algorithm to find the best matching block y_rtj^n* in the reference-image texture layer P_rtj;
(7b5) from the texture low-resolution dictionary D_lt*, calculating the sparse representation of the matching block as α_rtj^n = (D_lt*)^(-1) × y_rtj^n*, where (D_lt*)^(-1) is the inverse matrix of D_lt*;
(7b6) calculating the weight coefficient w_jn of the best matching block y_rtj^n* in the reference-image texture layer P_rtj by the formula

w_jn = 1 / ( (y_mt^n − y_rtj^n*)(y_mt^n − y_rtj^n*)^T );

(7b7) weighting and summing the sparse representations α_rtj^n of the matching blocks to obtain the texture-layer sparse representation of the main-region texture-layer block y_mt^n, and recording the result as β_mt*, the texture-layer sparse representation of the main region;
(7b8) processing the structure layer P_ms of the main region according to steps (7b3)-(7b7) to obtain the structure-layer sparse representation β_ms* of the main region.
CN201510789969.6A 2015-11-17 2015-11-17 Video image grade reconstruction method based on rarefaction representation and dictionary learning Active CN105741252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510789969.6A CN105741252B (en) 2015-11-17 2015-11-17 Video image grade reconstruction method based on rarefaction representation and dictionary learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510789969.6A CN105741252B (en) 2015-11-17 2015-11-17 Video image grade reconstruction method based on rarefaction representation and dictionary learning

Publications (2)

Publication Number Publication Date
CN105741252A true CN105741252A (en) 2016-07-06
CN105741252B CN105741252B (en) 2018-11-16

Family

ID=56296191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510789969.6A Active CN105741252B (en) 2015-11-17 2015-11-17 Video image grade reconstruction method based on rarefaction representation and dictionary learning

Country Status (1)

Country Link
CN (1) CN105741252B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101950365A (en) * 2010-08-30 2011-01-19 西安电子科技大学 Multi-task super-resolution image reconstruction method based on KSVD dictionary learning
CN102800076A (en) * 2012-07-16 2012-11-28 西安电子科技大学 Image super-resolution reconstruction method based on double-dictionary learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MICHAL AHARON et al.: "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation", IEEE TRANSACTIONS ON SIGNAL PROCESSING *
WEI YANXIN: "Research on image super-resolution reconstruction algorithms based on dictionary training and sparse representation", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106558020A (en) * 2015-09-29 2017-04-05 北京大学 A kind of image rebuilding method and system based on network image block retrieval
CN106558020B (en) * 2015-09-29 2019-08-30 北京大学 A kind of image rebuilding method and system based on network image block retrieval
CN106570886B (en) * 2016-10-27 2019-05-14 南京航空航天大学 A kind of method for tracking target based on super-resolution rebuilding
CN106570886A (en) * 2016-10-27 2017-04-19 南京航空航天大学 Target tracking method based on super-resolution reconstruction
CN107871115B (en) * 2016-11-01 2021-05-04 中国科学院沈阳自动化研究所 Image-based submarine hydrothermal vent identification method
CN107871115A (en) * 2016-11-01 2018-04-03 中国科学院沈阳自动化研究所 A kind of recognition methods of the submarine hydrothermal solution spout based on image
CN106780331B (en) * 2016-11-11 2020-04-17 浙江师范大学 Novel super-resolution method based on neighborhood embedding
CN106780331A (en) * 2016-11-11 2017-05-31 浙江师范大学 A kind of new super-resolution method based on neighborhood insertion
CN106815922B (en) * 2016-11-14 2019-11-19 东阳市天杨建筑工程设计有限公司 A kind of paper money discrimination method and system based on cell phone application and cloud platform
CN106815922A (en) * 2016-11-14 2017-06-09 杭州数生科技有限公司 A kind of paper money discrimination method and system based on mobile phone APP and cloud platform
CN106981047A (en) * 2017-03-24 2017-07-25 武汉神目信息技术有限公司 A kind of method for recovering high-resolution human face from low resolution face
CN107888915A (en) * 2017-11-07 2018-04-06 武汉大学 A kind of perception compression method of combination dictionary learning and image block
CN108765524B (en) * 2018-06-06 2022-04-05 微幻科技(北京)有限公司 Animation generation method and device based on panoramic photo
CN108765524A (en) * 2018-06-06 2018-11-06 微幻科技(北京)有限公司 Animation producing method based on distant view photograph and device
WO2020048484A1 (en) * 2018-09-04 2020-03-12 清华-伯克利深圳学院筹备办公室 Super-resolution image reconstruction method and apparatus, and terminal and storage medium
CN109325916A (en) * 2018-10-16 2019-02-12 哈尔滨理工大学 A kind of video image super-resolution reconstruction method based on rarefaction representation
CN109409285A (en) * 2018-10-24 2019-03-01 西安电子科技大学 Remote sensing video object detection method based on overlapping slice
CN109409285B (en) * 2018-10-24 2021-11-09 西安电子科技大学 Remote sensing video target detection method based on overlapped slices
CN109949257A (en) * 2019-03-06 2019-06-28 西安电子科技大学 Area-of-interest compressed sensing image reconstructing method based on deep learning
CN109949257B (en) * 2019-03-06 2021-09-10 西安电子科技大学 Region-of-interest compressed sensing image reconstruction method based on deep learning
CN110176029B (en) * 2019-04-29 2021-03-26 华中科技大学 Image restoration and matching integrated method and system based on level sparse representation
CN110443172A (en) * 2019-07-25 2019-11-12 北京科技大学 A kind of object detection method and system based on super-resolution and model compression
CN110428366B (en) * 2019-07-26 2023-10-13 Oppo广东移动通信有限公司 Image processing method and device, electronic equipment and computer readable storage medium
WO2021017811A1 (en) * 2019-07-26 2021-02-04 Oppo广东移动通信有限公司 Image processing method and apparatus, electronic device, and computer readable storage medium
CN110428366A (en) * 2019-07-26 2019-11-08 Oppo广东移动通信有限公司 Image processing method and device, electronic equipment, computer readable storage medium
CN110689508A (en) * 2019-08-15 2020-01-14 西安理工大学 Sparse structure manifold embedding-based IHS remote sensing image fusion method
CN110689508B (en) * 2019-08-15 2022-07-01 西安理工大学 Sparse structure manifold embedding-based IHS remote sensing image fusion method
CN111563866A (en) * 2020-05-07 2020-08-21 重庆三峡学院 Multi-source remote sensing image fusion method
CN112597983B (en) * 2021-03-04 2021-05-14 湖南航天捷诚电子装备有限责任公司 Method for identifying target object in remote sensing image and storage medium and system thereof
CN112597983A (en) * 2021-03-04 2021-04-02 湖南航天捷诚电子装备有限责任公司 Method for identifying target object in remote sensing image and storage medium and system thereof
CN113447111A (en) * 2021-06-16 2021-09-28 合肥工业大学 Visual vibration amplification method, detection method and system based on morphological component analysis
CN116310883A (en) * 2023-05-17 2023-06-23 山东建筑大学 Agricultural disaster prediction method based on remote sensing image space-time fusion and related equipment
CN116310883B (en) * 2023-05-17 2023-10-20 山东建筑大学 Agricultural disaster prediction method based on remote sensing image space-time fusion and related equipment

Also Published As

Publication number Publication date
CN105741252B (en) 2018-11-16

Similar Documents

Publication Publication Date Title
CN105741252B (en) Video image hierarchical reconstruction method based on sparse representation and dictionary learning
CN113362223B (en) Image super-resolution reconstruction method based on attention mechanism and two-channel network
CN110119780B (en) Hyper-spectral image super-resolution reconstruction method based on generation countermeasure network
CN101976435B (en) Combination learning super-resolution method based on dual constraint
CN103279933B (en) A kind of single image super resolution ratio reconstruction method based on bilayer model
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
Shi et al. Exploiting multi-scale parallel self-attention and local variation via dual-branch transformer-CNN structure for face super-resolution
CN104077742B (en) Human face sketch synthetic method and system based on Gabor characteristic
CN112906675B (en) Method and system for detecting non-supervision human body key points in fixed scene
CN109035146A (en) A kind of low-quality image oversubscription method based on deep learning
CN106600533B (en) Single image super resolution ratio reconstruction method
He et al. Remote sensing image super-resolution using deep–shallow cascaded convolutional neural networks
Bao et al. SCTANet: A spatial attention-guided CNN-transformer aggregation network for deep face image super-resolution
CN103020940B (en) Local feature transformation based face super-resolution reconstruction method
CN110851627B (en) Method for describing sun black subgroup in full-sun image
CN108090873B (en) Pyramid face image super-resolution reconstruction method based on regression model
Xie et al. Super-resolution of Pneumocystis carinii pneumonia CT via self-attention GAN
CN114240811A (en) Method for generating new image based on multiple images
CN117994480A (en) Lightweight hand reconstruction and driving method
CN111696167A (en) Single image super-resolution reconstruction method guided by self-example learning
CN115917597A (en) Promoting 2D representations to 3D using attention models
CN117011357A (en) Human body depth estimation method and system based on 3D motion flow and normal map constraint
CN111724428A (en) Depth map sampling and reconstructing method based on-map signal model
CN104574320B (en) A kind of image super-resolution restored method based on sparse coding coefficients match
Wu et al. DeepShapeKit: accurate 4D shape reconstruction of swimming fish

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant