CN116546155A - Video calling device based on face recognition - Google Patents

Video calling device based on face recognition

Info

Publication number
CN116546155A
Authority
CN
China
Prior art keywords
image
frame
learning
teacher data
moving image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202310662284.XA
Other languages
Chinese (zh)
Inventor
张兵兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202310662284.XA
Publication of CN116546155A
Legal status: Withdrawn


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N 7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N 7/142 Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 12/00 Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W 12/06 Authentication
    • H04W 12/065 Continuous authentication
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 12/00 Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W 12/60 Context-dependent security
    • H04W 12/69 Identity-dependent
    • H04W 12/72 Subscriber identity
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N 7/142 Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
    • H04N 2007/145 Handheld terminals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention provides a video call device based on face recognition, which comprises a mobile phone body. A dialing component and a liquid crystal display screen are arranged on the mobile phone body, and a video call system based on face recognition is arranged in the mobile phone body. The video call system based on face recognition comprises a memory, a processor and a main control board, wherein the memory and the processor are each in communication connection with the main control board. The main control board is integrated with a face recognition module, a login module, a real-name authentication module, a voice input module, a voice output module and an anti-cracking module. The invention can ensure the security of video calls.

Description

Image processing apparatus, image processing method, and storage medium
Technical Field
The present invention relates to an image processing apparatus and an image processing method that use machine learning to give an image group high definition, and to a storage medium.
Background
Regarding super-resolution imaging using machine learning, when an image is enlarged and its resolution is converted, a high-definition image can be generated by using machine learning to estimate the high-frequency components that cannot be recovered via linear interpolation of pixel values. In super-resolution imaging, a learning model is first generated using, as teacher data, an image group G and degraded images obtained by degrading images of the image group G by an arbitrary method. The learning model is generated by learning the difference in pixel values between the original images and the degraded images and updating the super-resolution processing parameters themselves. When an image H lacking high-frequency components is input into a learning model generated in this way, the high-frequency components are obtained by performing inference using the learning model. By superimposing the high-frequency components obtained through the inference on the image H, a high-definition image can be generated. When super-resolution processing is performed on a moving image, a high-definition moving image can be generated by inputting all frames, one at a time, into the learning model.
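A rough illustrative sketch of the scheme outlined above follows (Python with PyTorch is assumed here; the function names are invented for illustration, the model can be any CNN that maps an enlarged image to its missing high-frequency residual, and tensors are assumed to be N x C x H x W; this is not the method of the present invention itself):

import torch
import torch.nn.functional as F

def make_teacher_pair(original, scale=2):
    # Degrade an image of group G by downscaling, then bring it back to the
    # original size via linear interpolation; (upscaled, original) is one teacher pair.
    degraded = F.interpolate(original, scale_factor=1.0 / scale, mode="bilinear")
    upscaled = F.interpolate(degraded, size=original.shape[-2:], mode="bilinear")
    return upscaled, original

def train_step(model, optimizer, original):
    upscaled, target = make_teacher_pair(original)
    # Learn the difference in pixel values, i.e. the missing high-frequency component.
    loss = F.mse_loss(model(upscaled), target - upscaled)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def super_resolve(model, image_h, target_size):
    # Inference: superimpose the inferred high-frequency component on the image H.
    enlarged = F.interpolate(image_h, size=target_size, mode="bilinear")
    with torch.no_grad():
        return enlarged + model(enlarged)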
Generally, when a product or service is provided using a learning model, the process of collecting teacher data and generating the learning model is performed by a developer, and the generated learning model is provided to the user. The content of the moving image that the user will input is therefore unknown at the time of learning. For this reason, the developer prepares, as teacher data, a large number of images of many types and kinds with no bias in image pattern and uses them in learning, so that inference of uniform accuracy can be performed on all kinds of inference-target moving images.
For example, Japanese Patent Laid-Open No. 2019-204167 (patent document 1) describes a technique in which super-resolution processing is performed on a moving image using a learning model trained with a wide variety of images. However, because the teacher data spans many kinds of images, the amount of teacher data having a high similarity with the inference-target moving image Q specified by the user may be very small. When such a learning model is used, learning results obtained from images having a low similarity with the inference-target moving image Q are reflected in the inference. As a result, the improvement is limited to increased sharpness from emphasizing the edges of the subject, and high-frequency components such as detailed patterns on the subject are difficult to infer accurately, which means that the accuracy of inference cannot be regarded as high.
An example of a system for addressing this problem is described in Japanese Patent Laid-Open No. 2019-129328 (patent document 2). In the method described there, only images that are similar to the inference-target moving image in terms of imaging place, imaging conditions, and the like are used as teacher data, and learning is performed on the user side, so that a moving image with higher definition is obtained than when learning is performed using a wide variety of images.
In patent document 2, learning is performed using teacher data having a common imaging place but different imaging times. More specifically, videos previously captured in a section S of a bus route are collected and used for learning, and inference is then performed on the real-time video of the section S using the learning model thus obtained. The teacher data in this case is limited to data photographed in the section S. A group of images having a relatively high degree of similarity to the inference target is thus obtained, which means that improved inference accuracy can be expected. However, among the videos captured in the section S, the imaging place differs between video taken at the start point of the section S and video taken at its end point. The photographed subjects therefore also differ greatly, making it difficult to say that the similarity is high. This reduces the accuracy of inference over the section S as a whole. In addition, the previous video serving as teacher data and the real-time video of the inference target may show the same place while the subjects shown differ. This also reduces inference accuracy, since accurate inference cannot be made for a subject that has not been learned.
Further, as described in patent document 2, previous videos are classified into a plurality of groups according to imaging conditions such as weather, and a plurality of learning models are generated by learning independently on the data of each group. This enables the learning model in use to be switched according to the imaging conditions of the real-time video, and a decrease in inference accuracy caused by a difference in imaging conditions can be suppressed. However, even when conditions such as weather are the same, if the illuminance or the like differs even slightly, the frequency components differ between the teacher data and the inference target. It therefore cannot be said that the decrease in inference accuracy is sufficiently suppressed. For these reasons, the technique of patent document 2 cannot provide sufficient inference accuracy for the high-frequency components.
Disclosure of Invention
According to an aspect of the present invention, there is provided an image processing apparatus that can use machine learning to make an image have high definition with high accuracy.
According to an aspect of the present invention, there is provided an image processing apparatus that uses a first image group to make an image of a second image group, which has fewer high-frequency components than the images of the first image group, have high definition, the image processing apparatus comprising: a calculation section for calculating a similarity between a current image selected as a high-definition target from the second image group and a previous image that preceded the current image as a high-definition target; a selection section for selecting, based on the current image, teacher data to be used in learning from a plurality of teacher data that use an image included in the first image group as one image of each image pair; a model generation section for generating a learning model for making the current image have high definition using the selected teacher data; an inference section that infers a high-frequency component of the current image using the learning model generated by the model generation section in a case where the similarity is equal to or smaller than a threshold value, and infers a high-frequency component of the current image using a learning model used for making the previous image have high definition in a case where the similarity is greater than the threshold value; and an image generation section for generating a high-definition image based on the current image and the inferred high-frequency component.
According to another aspect of the present invention, there is provided an image processing method for using a first image group to make an image of a second image group, which has fewer high-frequency components than the images of the first image group, have high definition, the image processing method comprising: calculating a similarity between a current image selected as a high-definition target from the second image group and a previous image that preceded the current image as a high-definition target; selecting, based on the current image, teacher data to be used in learning from a plurality of teacher data that use an image included in the first image group as one image of each image pair; generating a learning model for making the current image have high definition using the selected teacher data; inferring a high-frequency component of the current image using the learning model generated in the generating in a case where the similarity is equal to or less than a threshold value, and inferring a high-frequency component of the current image using a learning model used for making the previous image have high definition in a case where the similarity is greater than the threshold value; and generating a high-definition image based on the current image and the high-frequency component inferred in the inferring.
According to another aspect of the present invention, there is provided a storage medium storing a program for causing a computer to function as a component of the above-described image processing apparatus.
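The per-frame flow implied by the above aspects can be sketched as follows (an illustration only; every callable passed in, such as compute_similarity or select_teacher_data, is a placeholder standing in for the corresponding section of the apparatus, not an actual API):

def make_high_definition(frames_of_second_group, threshold, compute_similarity,
                         select_teacher_data, train_model, infer_high_frequency, enlarge):
    previous_image, previous_model = None, None
    output_images = []
    for current_image in frames_of_second_group:
        similarity = (compute_similarity(current_image, previous_image)
                      if previous_image is not None else 0.0)
        if previous_model is None or similarity <= threshold:
            teacher_data = select_teacher_data(current_image)    # pairs using the first image group
            model = train_model(teacher_data)                    # model generation section
        else:
            model = previous_model                               # reuse the previous image's model
        high_frequency = infer_high_frequency(model, current_image)    # inference section
        output_images.append(enlarge(current_image) + high_frequency)  # image generation section
        previous_image, previous_model = current_image, model
    return output_images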
Further features of the invention will become apparent from the following description of exemplary embodiments with reference to the attached drawings.
Drawings
Fig. 1 is a block diagram showing the structure of an image processing apparatus according to a first embodiment.
Fig. 2 is a diagram for explaining a functional structure of the image processing apparatus according to the first embodiment.
Fig. 3 is a diagram showing an example of a frame structure of an input moving image according to the first embodiment.
Fig. 4 is a diagram for explaining a functional structure of the image processing apparatus according to the first embodiment.
Fig. 5 is a diagram showing an example of a data structure of a candidate database according to the first embodiment.
Fig. 6 is a flowchart of the teacher data candidate obtaining process according to the first embodiment.
Fig. 7 is a flowchart of a high definition moving image generation process according to the first embodiment.
Fig. 8 is a schematic diagram for explaining the learning/inference process according to the first embodiment.
Fig. 9 is a diagram showing an example of a frame structure of an input moving image according to the second embodiment.
Fig. 10 is a flowchart of a teacher data candidate obtaining process according to the second embodiment.
Fig. 11 is a diagram showing an example of a frame structure of an input moving image according to the third embodiment.
Fig. 12 is a flowchart of a teacher data candidate obtaining process according to the third embodiment.
Fig. 13 is a diagram showing an example of a frame structure of a moving image according to the fifth embodiment.
Fig. 14 is a diagram for explaining a functional structure of an image processing apparatus according to the fifth embodiment.
Fig. 15 is a flowchart of a high definition moving image generation process according to the fifth embodiment.
Fig. 16 is a flowchart of a high-definition moving image generation process according to the sixth embodiment, the seventh embodiment, the eighth embodiment, and the ninth embodiment.
Fig. 17 is a diagram showing an example of learning/inference processing according to the sixth embodiment.
Fig. 18 is a flowchart of a high definition moving image generation process according to the eighth embodiment.
Fig. 19 is a diagram showing an example of teacher data area selection according to the ninth embodiment.
Fig. 20 is a flowchart of a high definition moving image generation process according to the tenth embodiment.
Detailed Description
Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. Note that the following embodiments are not intended to limit the scope of the claimed invention. Although a plurality of features are described in the embodiments, not all of these features are necessarily essential to the invention, and the features may be combined as appropriate. In the drawings, the same reference numerals are given to the same or similar configurations, and redundant description thereof is omitted.
First embodiment
Overview of image processing apparatus
The image processing apparatus of the first embodiment accepts as input two moving images, a moving image A and a moving image B, which are simultaneously captured by the same image capturing apparatus. The relationship between the resolution XA and the frame rate FA of the moving image A and the resolution XB and the frame rate FB of the moving image B corresponds to XA > XB and FA < FB. The image processing apparatus has the following function (a high-definition moving image generation function): a learning model is generated using frames of the moving image A and the moving image B, and a moving image C having the resolution XA and the frame rate FB is generated from the moving image B via inference using the generated learning model.
Description of the construction of the image processing apparatus
Fig. 1 is a block diagram showing an example of the hardware structure of an image processing apparatus 100 according to the first embodiment. The control unit 101 is an arithmetic device such as a central processing unit (hereinafter referred to as CPU). The control unit 101 realizes various types of functions by loading programs stored in a read only memory (hereinafter referred to as ROM) 102 onto a work area of a random access memory (hereinafter referred to as RAM) 103 and executing the programs. The control unit 101 may operate, for example, as various functional blocks including the analysis unit 211 and the decoded moving image generation unit 212 described below using fig. 2, and the candidate obtaining unit 413 and the teacher data extraction unit 414 described below using fig. 4. The ROM 102 stores a control program executed by the control unit 101. The RAM 103 serves as a work memory in which the control unit 101 executes programs, as a temporary storage area for various types of data, and the like.
The decoding unit 104 decodes moving images or image data compressed in an encoding format defined by the Moving Picture Experts Group (hereinafter referred to as MPEG) into uncompressed data. The learning/inference unit 105 includes a functional block (the learning unit 451 described below using fig. 4) that accepts teacher data as input and generates and updates a learning model. The learning/inference unit 105 also includes a functional block (the inference unit 452 described below using fig. 4) that generates a high-definition version of an input image by analyzing the input image and inferring high-frequency components using a learning model generated through learning. In the present embodiment, a convolutional neural network (hereinafter abbreviated as CNN) based model for super-resolution processing is used as the learning model. This model enlarges an input image via linear interpolation, generates a high-frequency component to be added to the enlarged image, and adds and synthesizes the two.
The storage unit 106 is constituted by a storage medium such as a Hard Disk Drive (HDD) or a memory card, which is detachably connected to the image processing apparatus 100, and a storage medium control apparatus which controls the storage medium. The storage medium control device controls storage medium initialization, data transfer for reading and writing of data between the storage medium and the RAM 103, and the like, in accordance with a command from the control unit 101. Bus 107 is an information communication path connecting functions. The control unit 101, ROM 102, RAM 103, decoding unit 104, learning/inference unit 105, and storage unit 106 are communicatively connected to each other.
Note that the hardware blocks described in the present embodiment and the functional blocks implemented by them need not have the above-described configuration. For example, two or more of the control unit 101, the decoding unit 104, and the learning/inference unit 105 may be implemented by a single piece of hardware. Furthermore, the function of one functional block, or the functions of a plurality of functional blocks, may be performed through cooperation between two or more pieces of hardware. Each functional block may be implemented by a CPU executing a computer program loaded into memory, or by dedicated hardware. Further, one or more of the functional blocks may reside on a cloud server and be configured to exchange the processing result data via communication. For example, the decoding unit 104 may be implemented by the same CPU as the control unit 101 or by a different CPU. Alternatively, the decoding unit 104 may be implemented by a graphics processing unit (GPU) that operates by receiving instructions from the control unit 101, or by hardware processing using an electronic circuit configured for combination processing. Similarly, the learning/inference unit 105 may be implemented by the same CPU as the control unit 101, by a different CPU, by a GPU that operates by receiving instructions from the control unit 101, or by hardware processing using an electronic circuit configured for learning and inference.
Data stored in storage medium and decoding and loading method thereof
Fig. 2 is a diagram for explaining functional blocks for executing processing for loading compressed moving image data via the control unit 101 (the analysis unit 211 and the decoding moving image generation unit 212). The storage unit 106 stores a moving image a and a moving image b as input data for high definition moving image generation processing. The term moving image as used herein refers to one or more image data that are consecutive in time. The moving image a and the moving image b of the present embodiment are simultaneously captured by an image capturing apparatus having an image sensor, and compressed by an MPEG method. The moving image a and the moving image b may be generated by additionally performing thinning-out or reduction processing on images captured by a single image sensor, or may be generated by capturing the same subject with image sensors having different resolutions and frame rates. Here, the moving image a and the moving image b are two image groups obtained by performing different image processing on a single image captured by a single image sensor of a single image capturing apparatus. The moving image data of the moving image a and the moving image b are compressed by the MPEG method, multiplexed together with the imaging time information, and stored in the MP4 format. Note that formats other than the above may be used as long as the image data from the storage unit 106 and the corresponding image capturing time information can be obtained in pairs.
The analysis unit 211 has the following function: it parses the moving image data (an MP4 file in this example) stored in the storage unit 106 and determines the storage locations, within the file, of the packaged compressed image data and of the time information registered as metadata. In the MP4 format, position information indicating the storage positions of the frame data and of the imaging time information within the file is stored in the Moov section. The analysis unit 211 loads the Moov portion of the moving image a from the storage unit 106 onto the RAM 103, parses it, and generates a table Pa including the frame numbers of the moving image a, position information indicating the storage position of the frame data, and position information indicating the storage position of the imaging time. The analysis unit 211 parses the Moov portion of the moving image b in a similar manner and generates a table Pb including the frame numbers of the moving image b, position information indicating the storage position of the frame data, and position information indicating the storage position of the imaging time. The table Pa and the table Pb are held in the RAM 103.
A process to convert the moving image a and the moving image b into an uncompressed format must be performed so that they can be used in the high-definition moving image generation process. As shown in fig. 2, the decoded moving image generation unit 212 of the control unit 101 decodes the moving image a and the moving image b, generates the moving image A and the moving image B, and stores them in the storage unit 106. More specifically, the decoded moving image generation unit 212 refers to the table Pa and the table Pb held in the RAM 103 and sequentially inputs the frame data of the moving image a and the moving image b stored in the storage unit 106 to the decoding unit 104. The decoded moving image generation unit 212 multiplexes the uncompressed frame data output by the decoding unit 104 with the imaging time information obtained by referring to the table Pa and the table Pb, and stores the result in the storage unit 106. Here, the moving image A is obtained by decoding the moving image a, and the moving image B is obtained by decoding the moving image b. Further, the decoded moving image generation unit 212 generates a table PA including the frame numbers of the moving image A, position information indicating the storage position of the frame data, and position information indicating the storage position of the imaging time, and stores it in the RAM 103. In a similar manner, the decoded moving image generation unit 212 generates a table PB including the frame numbers of the moving image B, position information indicating the storage position of the frame data, and position information indicating the storage position of the imaging time, and stores it in the RAM 103. An example of the frame structure of the moving image A and the moving image B is shown in fig. 3. In fig. 3, n is the total number of frames of the moving image A, and m is the total number of frames of the moving image B. The pairs of frames indicated by broken lines (image pairs A1 and B2, A2 and B5, A3 and B8, and so on) are pairs of frames carrying the same imaging time information, which indicates that the images of these frames were taken at the same timing. Further, as described above, the relationship between the resolution XA of the moving image A and the resolution XB of the moving image B is XA > XB, and the relationship between the frame rate FA of the moving image A and the frame rate FB of the moving image B is FA < FB.
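For illustration only, the entries of the tables PA and PB and the same-time frame pairs of fig. 3 can be pictured as follows (the field names and the pairing helper are assumptions, not the actual MP4 layout):

from dataclasses import dataclass

@dataclass
class FrameEntry:
    frame_number: int     # frame number within the moving image
    data_position: int    # storage position of the frame data in the file
    time_position: int    # storage position of the imaging time information
    imaging_time: float   # the imaging time itself, once read out

def same_time_pairs(table_pa, table_pb):
    # Return (frame number of A, frame number of B) pairs whose imaging times coincide,
    # corresponding to the broken-line pairs A1/B2, A2/B5, A3/B8, ... in fig. 3.
    time_to_b = {entry.imaging_time: entry.frame_number for entry in table_pb}
    return [(entry.frame_number, time_to_b[entry.imaging_time])
            for entry in table_pa if entry.imaging_time in time_to_b]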
Next, the process for generating a high-definition image according to the present embodiment will be described. The processing is roughly divided into two parts: the teacher data candidate obtaining process and the high-definition moving image generation process.
Fig. 4 is a diagram for explaining the structure and operation of the functional blocks related to the image processing performed by the image processing apparatus 100 of the first embodiment. As described with reference to fig. 2, the moving image A and the moving image B are held in the storage unit 106, and the tables PA and PB are held in the RAM 103. The teacher data candidate obtaining process is performed by the candidate obtaining unit 413. The high-definition moving image generation process is performed by the teacher data extraction unit 414, the learning unit 451, and the inference unit 452. The candidate obtaining unit 413 extracts, as teacher data candidates, pairs of frames suited to serve as teacher data for learning from the frame group of the moving image A and the frame group of the moving image B, and generates a teacher data candidate database (hereinafter referred to as a candidate database D1). A frame By, which is the high-resolution and high-definition target, is obtained from the frame group of the moving image B. In order to generate a learning model suited to inferring the high-frequency components of the frame By, the teacher data extraction unit 414 extracts teacher data suitable for that learning from among the teacher data candidates registered in the candidate database D1. The teacher data extraction unit 414 generates a teacher data database (hereinafter referred to as a teacher database D2) from the extracted teacher data. The learning unit 451 of the learning/inference unit 105 uses the teacher database D2 to generate a learning model M for the frame By. The inference unit 452 inputs the frame By, which is the high-resolution target, into the learning model M generated by the learning unit 451 and performs high-definition processing on the frame By. Hereinafter, the teacher data candidate obtaining process and the high-definition moving image generation process will be described in more detail.
Teacher data candidate acquisition processing
In the teacher data candidate obtaining process, the candidate database D1 is generated by the control unit 101 (candidate obtaining unit 413). In the first embodiment, the candidate obtaining unit 413 obtains, as teacher data candidates, pairs consisting of a frame of the moving image A and a frame of the moving image B whose imaging times coincide. Specifically, all pairs sharing a common imaging time between the moving image A and the moving image B (the pairs of frames indicated by broken lines in fig. 3) are obtained as teacher data candidates. The candidate obtaining unit 413 checks which frames can be used as teacher data before the learning processing described below is performed, constructs the candidate database D1, and registers the check result.
Fig. 5 is a diagram showing an example of the data structure of the candidate database D1. In the candidate database D1, the frame numbers, within the respective moving image files, of the frame group TA, which consists of frames of the moving image A capable of serving as teacher data, and of the frame group TB, which consists of frames of the moving image B capable of serving as teacher data, are registered. Pairs of frames (pairs of frame numbers) whose imaging times coincide are associated with each other and registered under an index I that is unique within the candidate database D1. For example, for the moving image A and the moving image B shown in fig. 3, the frame pairs A1 and B2, A2 and B5, and A3 and B8 (and so on) are combined as frames photographed at the same time. In the candidate database D1 shown in fig. 5, these pairs are stored by frame number, each with a unique index I. In this way, the candidate database D1 is used to manage the obtained teacher data candidates.
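For illustration (the concrete container type is an assumption), the candidate database D1 of fig. 5 can be pictured as a mapping from the unique index I to a pair of frame numbers:

def register_candidate(candidate_db, frame_number_a, frame_number_b):
    # Issue an index I unique within D1 and register the pair of frame numbers
    # (frame of group TA, frame of group TB) photographed at the same time.
    index = len(candidate_db)
    candidate_db[index] = {"TA": frame_number_a, "TB": frame_number_b}
    return index

# For example, the pairs of fig. 3 would be registered as follows.
candidate_db = {}
for a, b in [(1, 2), (2, 5), (3, 8)]:   # frame pairs A1/B2, A2/B5, A3/B8, ...
    register_candidate(candidate_db, a, b)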
The teacher data candidate obtaining process described above will now be described in further detail using the flowchart of fig. 6. In step S601, the candidate obtaining unit 413 selects one frame from the frames of the moving image A and obtains the time information corresponding to the selected frame from the table PA. In the present embodiment, the candidate obtaining unit 413 sequentially selects one frame at a time from the top of the moving image A stored in the storage unit 106. Hereinafter, the selected frame is referred to as the frame Ax. The candidate obtaining unit 413 refers to the table PA stored in the RAM 103, reads out the time information corresponding to the frame Ax from the storage unit 106, and transfers the time information to the RAM 103.
In step S602, the candidate obtaining unit 413 compares the time information of the frame Ax read out in step S601 with the time information of each frame of the moving image B. Specifically, the candidate obtaining unit 413 sequentially obtains the imaging time information of each frame of the moving image B from the storage unit 106 by referring to the position information of the imaging time stored in the table PB, and compares this imaging time information with the time information of the frame Ax. In step S603, the candidate obtaining unit 413 obtains the frame of the moving image B whose imaging time coincides with the time information of the frame Ax, and sets this frame as the frame Bx.
In step S604, the candidate obtaining unit 413 gives the index Ix unique to the candidate database D1 to the combination of the frame Ax and the frame Bx described above, and registers both in the candidate database D1. Specifically, the candidate obtaining unit 413 issues the unique index Ix to the combination of the frame Ax and the frame Bx, and registers the index Ix, the frame number in the moving image a of the frame Ax, and the frame number in the moving image B of the frame Bx in the candidate database D1.
In step S605, the control unit 101 determines whether the processing of steps S601 to S604 described above is completed for all frames of the moving image a. When the control unit 101 determines that the processing is completed (yes in step S605), the processing ends. In the case where the control unit 101 determines that the processing is not completed (no in step S605), the processing returns to step S601, and the above-described processing is performed for the next frame of the moving image a. The candidate database D1 is generated by this process.
Note that in the present embodiment, in step S602, the pair of frames to be registered in the candidate database D1 is determined via comparison of imaging times. However, such limitation is not intended. For example, the frame Ax is reduced to the resolution XB, and the similarity determination is made using an index indicating the similarity between the frame Ax and the images of the frames of the moving image B. The judgment result may then be used to select a pair of frames to be registered in the candidate database D1. In this case, the candidate obtaining unit 413 has a similarity judging function for judging the similarity by comparing two or more image data. Note that as an index indicating the similarity between images, for example, structural Similarity (SSIM) may be used. Further, when an index indicating the similarity is obtained, the image of the frame Ax is reduced to the resolution XB. However, such limitation is not intended. The image of the frame Ax may not be reduced, or the resolution after the reduction may be a resolution other than XB.
High definition moving image generation processing
Next, the high-definition moving image generation process performed by the control unit 101 (teacher data extraction unit 414) and the learning/inference unit 105 (learning unit 451 and inference unit 452) will be described. First, an outline of the high-definition moving image generation process will be given with reference to fig. 4. The teacher data extraction unit 414 selects, from the candidate database D1, teacher data suitable for learning a model to be used in inference for the target frame By, and generates the teacher database D2 (fig. 4) (details will be described below with reference to steps S702 to S703 of fig. 7). The learning unit 451 generates a learning model using the extracted teacher data (step S704). The inference unit 452 then uses the learning model to infer the high-frequency components of the inference target frame By and performs high-definition processing (step S705), obtaining a frame (image) Cy by converting the inference target frame By into high definition. Note that, before starting the high-definition moving image generation process, the control unit 101 creates a moving image C in the storage unit 106. At the start of high-definition moving image generation, the moving image C is in an empty state without any frame data. The inference unit 452 sequentially stores the generated frames Cy in the moving image C.
Next, a process for generating the above-described high-definition moving image will be described in detail with reference to a flowchart in fig. 7. In step S701, the teacher data extraction unit 414 reads out one frame as a high-definition target frame from the moving image B. In the present embodiment, the teacher data extraction unit 414 sequentially reads out frames one frame at a time from the top of the moving image B stored in the storage unit 106. Hereinafter, the frame read out in step S701 is defined as a frame By. More specifically, the teacher data extraction unit 414 refers to the table PB and reads out frame data of the frame By and imaging time information from the storage unit 106, and transfers it to the RAM 103.
In step S702, the teacher data extraction unit 414 extracts, from the teacher data candidates TB registered in the candidate database D1, frames whose imaging time difference from the frame By is smaller than a threshold value set in advance in the system, and registers these frames in the teacher database D2. As the threshold value, for example, the display period of one frame of the moving image A (the display period of one frame at the frame rate FA) can be used. The teacher database D2 has a structure similar to that of the candidate database D1 (fig. 5). Specifically, the teacher data extraction unit 414 first refers to the position information of the table PB and obtains the time information of each frame of the frame group TB registered in the candidate database D1. The teacher data extraction unit 414 then compares each piece of obtained time information with the time information of the frame By, extracts the frames whose difference is smaller than the threshold value from the frame group TB, and registers these frames in the teacher database D2 on the RAM 103. Hereinafter, the frame group of the moving image B registered in the teacher database D2 by this process is denoted by UB. Note that, in the present embodiment, when the teacher database D2 is constructed, a group of frames whose imaging time difference from the frame By is smaller than the threshold value is extracted from the candidate database D1. However, such limitation is not intended. The frame group UB may instead be extracted using an index indicating the similarity to the frame By. For example, the teacher data extraction unit 414 may use SSIM to extract, from the frame group TB, a group of frames whose index of similarity to the frame By is higher than a threshold value set in advance in the system, and register this group as the frame group UB.
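Step S702 can be sketched as follows (illustrative only; the candidate frames are assumed to be given together with their imaging times, and the threshold corresponds to the display period of one frame of the moving image A):

def extract_frame_group_ub(frame_by_time, candidate_tb, threshold):
    # Keep the frames of the candidate group TB whose imaging time differs from
    # that of the frame By by less than the threshold; they form the frame group UB.
    return [frame_number for frame_number, imaging_time in candidate_tb
            if abs(imaging_time - frame_by_time) < threshold]

# Example: with FA = 30 fps, the threshold is the display period of one frame of A.
ub = extract_frame_group_ub(frame_by_time=0.50,
                            candidate_tb=[(2, 0.49), (5, 0.60), (8, 0.71)],
                            threshold=1.0 / 30.0)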
In step S703, the teacher data extraction unit 414 registers the frame of the frame group TA corresponding to the pair of the frames of the frame group UB in the candidate database D1 in the teacher database D2. Specifically, the teacher data extraction unit 414 refers to the candidate database D1 on the RAM 103, and registers the frame of the frame group TA associated with each frame of the frame group UB via the index I in the teacher database D2. At this time, the combinations of the associated two frames are not changed, and the index J unique in the teacher database D2 is assigned to each combination. Hereinafter, a frame group of the moving image a registered in the teacher database D2 is denoted by UA.
In step S704, the learning unit 451 learns using the teacher data (frame group UA and frame group UB) registered in the teacher database D2, and generates a learning model M.
Fig. 8 is a diagram schematically showing the learning model generation function of the learning unit 451. The learning model generation function includes a learning process and an inference process, and the inference process is divided into a feature extraction process using a filter comprising a CNN and a reconfiguration process. First, in the feature extraction process, the learning unit 451 inputs a single image (defined as an image E) from the frame group UB into the CNN, extracts convolutional features via the CNN, and generates a plurality of feature maps. Next, in the reconfiguration process, the learning unit 451 upsamples all of the feature maps via transpose convolution and generates a high-frequency component. Further, in the reconfiguration process, the learning unit 451 reconstructs the image by adding the high-frequency component to an image E' obtained by enlarging the image E via the bicubic method or the like, and generates an estimated high-definition image G. In the learning process, the learning unit 451 compares the estimated high-definition image G generated in the above-described inference process with the image H, from the frame group UA, corresponding to the image E, and fine-tunes the learning model M by back propagation using the difference between the two. The learning unit 451 improves the inference accuracy by repeating this process for the same image E a predetermined number of times. By performing the above-described series of processing on each image of the frame group UB, a learning model M suited to the inference processing of the frame group UB is constructed.
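A hedged sketch of this learning model generation function follows (the layer sizes, optimizer, and iteration count are assumptions; only the overall structure of feature extraction, transpose-convolution upsampling, bicubic enlargement, and back propagation mirrors the description of fig. 8):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LearningModelM(nn.Module):
    def __init__(self, scale=2):
        super().__init__()
        # Feature extraction: convolutional layers producing a set of feature maps.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        # Reconfiguration: transpose convolution upsamples the feature maps into
        # a high-frequency component at the target resolution.
        self.upsample = nn.ConvTranspose2d(64, 3, kernel_size=2 * scale,
                                           stride=scale, padding=scale // 2)
        self.scale = scale

    def forward(self, image_e):
        high_frequency = self.upsample(self.features(image_e))
        # Enlarge image E (bicubic) to obtain E' and add the high-frequency component
        # to reconstruct the estimated high-definition image G.
        image_e_dash = F.interpolate(image_e, scale_factor=self.scale, mode="bicubic")
        return image_e_dash + high_frequency

def learn(model, frame_group_ub, frame_group_ua, repeats=10, learning_rate=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    for image_e, image_h in zip(frame_group_ub, frame_group_ua):
        for _ in range(repeats):                     # repeat for the same image E
            estimated_g = model(image_e)
            loss = F.mse_loss(estimated_g, image_h)  # difference between G and H
            optimizer.zero_grad()
            loss.backward()                          # fine-tune via back propagation
            optimizer.step()
    return model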
As described above, the learning unit 451 refers to the teacher database D2, the table PA, and the table PB, and reads out the frame data of the frame pair registered as the teacher data from the storage unit 106, and performs the above-described learning model generation function. The learning unit 451 stores the learning model M generated by the learning model generation function in the RAM 103.
In step S705, the inference unit 452 generates a high-definition frame Cy from the frame By via inference using the learning model M generated in step S704. Specifically, the inference unit 452 first reads out the learning model M stored in the RAM 103. Next, the inference unit 452 inputs the frame data (image) of the frame By, held in the RAM 103 in step S701, into the CNN of the learning model M, and generates the high-frequency component expected when the image of the frame By is enlarged to the resolution XA. The inference unit 452 adds the generated high-frequency component to an image obtained by linearly enlarging the image of the frame By to the resolution XA, generates the image of the high-definition frame Cy at the resolution XA, and stores the image in the RAM 103. Note that the processing performed for the frame By, from high-frequency component inference to high-definition image generation, is similar to the inference process described above with reference to fig. 8. The inference unit 452 adds the frame data of the high-definition frame Cy stored in the RAM 103 to the end of the high-definition moving image C in the storage unit 106. Further, the imaging time information of the frame By is copied and multiplexed as the imaging time of the high-definition frame Cy, and stored in the moving image C.
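The per-frame inference of step S705 can be summarized as below (a sketch under the assumption that the learning model M, given the linearly enlarged image, returns its high-frequency component as in fig. 8, and that the moving image C is represented simply as a list of image/imaging-time pairs):

import torch
import torch.nn.functional as F

def generate_frame_cy(model_m, frame_by_image, imaging_time, resolution_xa, moving_image_c):
    # Enlarge the image of the frame By linearly to the resolution XA,
    # infer the missing high-frequency component, and add the two to obtain Cy.
    enlarged = F.interpolate(frame_by_image, size=resolution_xa, mode="bilinear")
    with torch.no_grad():
        high_frequency = model_m(enlarged)   # assumed to output the high-frequency component
    frame_cy = enlarged + high_frequency
    # Copy the imaging time of By and append Cy to the end of the moving image C.
    moving_image_c.append((frame_cy, imaging_time))
    return frame_cy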
In step S706, the control unit 101 determines whether the above-described processing is completed for the frames of the inference target range of the moving image B (this may be all frames or only some frames of the moving image B). In the case where the control unit 101 determines that the processing is not completed (no in step S706), the processing returns to step S701, the next frame of the moving image B is selected as the frame By by the teacher data extraction unit 414, and the above-described processing is repeated. When the control unit 101 determines that the processing is completed (yes in step S706), the current processing ends. As described above, at the end of the high-definition moving image generation process, the high-definition moving image C having the resolution XA and the frame rate FB is stored in the storage unit 106 in an uncompressed format.
Note that, in the above-described embodiment, each of the functional blocks is implemented by only the control unit 101 or only the learning/inference unit 105. However, such limitation is not intended. For example, the functional blocks may be realized via cooperation between the control unit 101 and the learning/inference unit 105. For example, the function of the inference unit 452 may be implemented by the control unit 101 and the learning/inference unit 105, and the processing to store the high definition frame Cy and the imaging time in the moving image C on the storage unit 106 may be executed by the control unit 101.
Further, in the present embodiment, the teacher data candidate obtaining process is performed before the learning process and the high-definition moving image generating process are performed on all the moving images, but the teacher data candidate obtaining process may be performed in parallel with the high-definition moving image generating process. Further, in the present embodiment, in step S704, a learning model M is newly generated for each inference target frame, and the previously generated learning model M is discarded. However, such limitation is not intended. For example, the learning model M 'trained externally may be preloaded, and additional learning using the frame group UA and the frame group UB may be performed on the loaded learning model M' in step S704.
As described above, according to the first embodiment, a learning model M trained with images that are similar to the high-definition target image, selected from among the image groups captured in the same imaging period, is used. This enables an image to be given high definition with high accuracy.
Further, pairs of same-time images from the two image groups are used as teacher data. This enables learning with even higher accuracy.
Second embodiment
In the teacher data candidate obtaining process of the first embodiment, combinations of a frame of the moving image A and a frame of the moving image B whose imaging times coincide are registered in the candidate database D1. In the case where the moving image A and the moving image B are obtained from moving images simultaneously captured by the same image sensor of a single image capturing apparatus, frames having the same imaging time can be obtained from the moving image A and the moving image B, as shown in fig. 3. However, with this method, in the case where the moving image A and the moving image B are moving images captured by a plurality of image sensors in the same imaging period, the extraction of teacher data candidates may not be performed appropriately. This is because, as shown in fig. 9, for a given frame of the moving image A there is not always a frame with an identical imaging time in the moving image B. Note that examples of configurations for capturing the moving image A and the moving image B via a plurality of image sensors include a configuration that captures images using an image capturing apparatus including a plurality of image sensors, a configuration that captures images using a plurality of image capturing apparatuses each having one or more image sensors, and the like. In the teacher data candidate obtaining process of the second embodiment, this problem is solved by registering in the candidate database D1 combinations of frames whose time difference is smaller than a predetermined threshold even when the imaging times of the frames of the moving image A and the moving image B do not coincide.
In the second embodiment, the configuration of the image processing apparatus 100 and the high-definition image generation process are similar to those in the first embodiment, but a part of the process for obtaining teacher data candidates is different. Fig. 10 is a flowchart for explaining a process for obtaining teacher data candidates according to the second embodiment. Hereinafter, a portion different from the process (fig. 6) for obtaining teacher data candidates in the first embodiment will be mainly described.
The processing of steps S1001 to S1002 is similar to that of steps S601 to S602 of the first embodiment (fig. 6). In step S1003, the candidate obtaining unit 413 obtains, from the frames of the moving image B, a frame whose imaging time difference from the frame Ax of the moving image A is smaller than a predetermined threshold, sets it as the frame Bx, and registers it in the candidate database D1 on the RAM 103. Note that as the threshold value, for example, the display period of one frame of the moving image B at the frame rate FB may be used. The subsequent processing of steps S1004 to S1005 is similar to that of steps S604 to S605 of the first embodiment (fig. 6).
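An illustrative version of step S1003 might look as follows (names and container shapes are assumptions; here the closest frame within the threshold is chosen):

def find_frame_bx(time_of_ax, frames_b_times, threshold):
    # Among the frames of the moving image B, pick the one whose imaging time is
    # closest to that of the frame Ax, provided the difference is below the
    # threshold (e.g. one display period of B at the frame rate FB).
    frame_number, imaging_time = min(frames_b_times,
                                     key=lambda item: abs(item[1] - time_of_ax))
    if abs(imaging_time - time_of_ax) < threshold:
        return frame_number
    return None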
In this way, according to the second embodiment, even in the case where the moving image a and the moving image B are obtained by a plurality of image sensors, extraction of teacher data candidates can be appropriately performed.
Third embodiment
In the first and second embodiments, the moving image A and the moving image B are captured at least in the same imaging period. Therefore, in the teacher data candidate obtaining processing of the first and second embodiments, teacher data candidates cannot be obtained in the case where, as shown in fig. 11, the moving image A and the moving image B are photographed by the same image capturing apparatus or by a plurality of image capturing apparatuses at different times (non-overlapping imaging periods). In the third embodiment, a teacher data candidate obtaining process that appropriately obtains teacher data candidates for the moving image A and the moving image B shown in fig. 11 will be described.
In the process for obtaining teacher data candidates according to the third embodiment, an index indicating the similarity between frames of the moving image A and frames of the moving image B is calculated, and pairs of frames whose index is equal to or greater than a threshold value set in advance in the system are registered in the candidate database D1. Note that, as described above, SSIM may be used, for example, as the index indicating the similarity of frames. Further, in judging the similarity, the image of the frame of the moving image A may be reduced to the resolution XB, and the index indicating the similarity may be calculated using that image and the image of each frame of the moving image B. At this time, however, the image of the frame of the moving image A need not be reduced, or the resolution after reduction may be a resolution other than XB.
Fig. 12 is a flowchart for explaining a process for obtaining teacher data candidates according to the third embodiment. Hereinafter, a portion different from the process (fig. 6) for obtaining teacher data candidates in the first embodiment will be mainly described with reference to fig. 12.
In step S1201, the candidate obtaining unit 413 selects one frame from the frames of the moving image A and loads the frame data of the selected frame. The candidate obtaining unit 413 sequentially selects one frame at a time from the top of the moving image A stored in the storage unit 106 (hereinafter, the selected frame is referred to as the frame Ax). The candidate obtaining unit 413 refers to the table PA stored in the RAM 103 and transfers the frame data of the selected frame Ax from the storage unit 106 to the RAM 103.
In step S1202, the candidate obtaining unit 413 calculates the similarity between the frame Ax read out in step S1201 and each frame of the moving image B. More specifically, the candidate obtaining unit 413 refers to the position information (regarding the frame data) in the table PB and sequentially obtains the frame data of each frame of the moving image B from the storage unit 106 into the RAM 103. The candidate obtaining unit 413 then calculates a similarity index between the frame Ax and each frame using a similarity index calculation function (SSIM in this embodiment) and stores it in the RAM 103. In step S1203, the candidate obtaining unit 413 obtains, as the frame Bx, the frame of the moving image B having the highest of the similarity indices calculated in step S1202. The subsequent processing of steps S1204 to S1205 is similar to that of steps S604 to S605 of the first embodiment (fig. 6).
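Steps S1202 and S1203 can be sketched as follows, with the SSIM computation injected as a callable (for example, structural_similarity from scikit-image could serve; the helper names are assumptions):

def find_most_similar_frame(frame_ax_image, frames_b, ssim):
    # Step S1202: compute a similarity index between the frame Ax and every frame
    # of the moving image B; step S1203: take the frame with the highest index as Bx.
    scored = [(frame_number, ssim(frame_ax_image, image))
              for frame_number, image in frames_b]
    return max(scored, key=lambda item: item[1])[0]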
As described above, according to the third embodiment, even in the case where the imaging periods of the two image groups (moving image a and moving image B) do not overlap, the teacher data candidates can be appropriately obtained.
Fourth embodiment
In the fourth embodiment, an improvement of the performance of the learning model M that takes image similarity into account will be described for the learning processes of the first to third embodiments. As described in the first embodiment, appropriate teacher data is extracted for the frame By selected in step S701 of fig. 7, and that teacher data is used to generate or update the learning model M in step S704. When the learning model M is generated or updated, back propagation is used to adjust the network parameters, as shown in fig. 8. In the fourth embodiment, the strength of the adjustment applied via back propagation is controlled based on attributes (e.g., imaging time) of the frame used in learning (the image E) and of the frame By that is the high-resolution or high-definition target, or based on the images of these frames. More specifically, the learning unit 451 sets a coefficient such that, in the learning process, the influence on the network parameter update is strong when the similarity between the frame By and the frame of the frame group UB currently being input is high, and weak when the similarity is low. Here, the similarity between frames may be determined simply from the time difference between the frame By and the input image E, or may be determined by comparing the images of the two frames using SSIM or the like. In an example configuration using the former (the method based on the time difference), the adjustment strength is multiplied by a coefficient of 1 when the time difference is smaller than the threshold, and by a coefficient of 0.5 when the time difference is equal to or larger than the threshold, as follows.
if (abs(imaging time of By - imaging time of E) < threshold) { coefficient = 1 } else { coefficient = 0.5 }
In an example configuration using the latter (the method based on similarity), the SSIM value is used as the adjustment strength coefficient, as described below.
coefficient = SSIM(By, E), where 0 <= SSIM(x) <= 1
Note that examples of how this stronger or weaker influence is applied include a method of multiplying the network parameter update rate used in back propagation during the learning process by the coefficient described above, and a method of multiplying the number of times the learning cycle is performed on the input image E by the coefficient instead of multiplying the parameter update rate, and the like.
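A sketch of how such a coefficient could be computed and applied follows (the handling of the optimizer is an assumption; only the two coefficient rules follow the description above):

def adjustment_coefficient(frame_by_time, image_e_time, threshold, ssim_value=None):
    # Time-difference rule: full-strength update for nearby frames, half strength otherwise.
    if ssim_value is None:
        return 1.0 if abs(frame_by_time - image_e_time) < threshold else 0.5
    # Similarity rule: use the SSIM value (0 <= SSIM <= 1) directly as the coefficient.
    return ssim_value

def scaled_learning_rate(base_learning_rate, coefficient):
    # One way to apply the coefficient: multiply the parameter update rate used in
    # back propagation; alternatively, the number of learning repetitions for the
    # image E could be multiplied by the coefficient instead.
    return base_learning_rate * coefficient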
Fifth embodiment
The first to third embodiments described above have a configuration in which pairs consisting of a frame of the moving image A and a frame of the moving image B are extracted as teacher data candidates and registered in the candidate database D1. In the fifth embodiment, the moving image A is converted into the resolution XB of the moving image B to generate a moving image A', and the candidate obtaining unit 413 obtains teacher data candidates using the moving image A and the moving image A'. In other words, the candidate obtaining unit 413 of the fifth embodiment extracts, from the moving image A', the frame Ax' having the same frame number as a frame Ax of the moving image A, and registers the pair consisting of the frame Ax and the frame Ax' as a teacher data candidate in the candidate database D1. The fifth embodiment will be described in detail below.
Description of the structure of the image processing apparatus 100
The hardware structure and functional structure of the image processing apparatus 100 are similar to those of the first embodiment (fig. 1). However, the control unit 101 of the fifth embodiment also has a resolution converting function for reducing and converting the resolution of an image via the bicubic method. The resolution conversion function calculates the pixel value of the pixel to be interpolated by referring to surrounding pixels when performing resolution reduction processing on the image data stored in the RAM 103.
Data stored in storage unit 106 and decoding and loading method thereof
In the first embodiment, the moving image a and the moving image b in the storage unit 106 are converted into an uncompressed format, and the moving image A obtained by decoding the moving image a and the moving image B obtained by decoding the moving image b are stored in the storage unit 106. In the fifth embodiment, a moving image A' is additionally generated by converting the moving image A into the resolution XB of the moving image B. More specifically, the control unit 101 refers to the table PA stored in the RAM 103 and sequentially inputs the frame data of each frame of the moving image A (hereinafter referred to as a frame K) stored in the storage unit 106 into the resolution conversion function of the control unit 101. The resolution conversion function then outputs frame data at the resolution XB (hereinafter referred to as a frame K'). The control unit 101 refers to the table PA, multiplexes the frame K' with the imaging time information of the frame K read out from the storage unit 106, and stores the result in the storage unit 106 as a frame of the moving image A'. Further, a table PA' holding the frame numbers of the frames of the moving image A', position information indicating the storage positions of the frame data, and position information indicating the storage positions of the imaging time data is stored in the RAM 103.
An example of the moving image a, the moving image B, and the moving image a' is shown in fig. 13. The images (A1 ' to An ') generated by reducing the resolution of the images (A1 to An) of the frames of the moving image a to the resolution XB are stored in the storage unit 106 as the moving image a '. Note that in the above example, the resolution of the moving image a is reduced to XB, but such limitation is not intended. It is sufficient that the moving image a' includes an image converted into a resolution lower than that of the moving image a. However, by using an image converted into the same resolution as that of the high-definition target image, a learning model more suitable for the high-definition target image can be constructed.
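The resolution conversion from the moving image a to the moving image a' can be pictured with the following sketch, which assumes OpenCV's bicubic resize and 8-bit frames; the function names are illustrative only.

```python
# Sketch of the resolution conversion step: each frame K of moving image a is reduced
# to the resolution XB of moving image b with bicubic interpolation, yielding frame K'.
import cv2

def to_resolution_xb(frame_k, width_xb: int, height_xb: int):
    return cv2.resize(frame_k, (width_xb, height_xb), interpolation=cv2.INTER_CUBIC)

def build_moving_image_a_prime(frames_a, width_xb: int, height_xb: int):
    # Frame order/numbers are preserved; imaging time info would be copied alongside.
    return [to_resolution_xb(k, width_xb, height_xb) for k in frames_a]
```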
Teacher data candidate acquisition processing
Fig. 14 is a diagram showing the structure and operation of the functional blocks related to the image processing performed by the image processing apparatus 100 of the fifth embodiment. The candidate obtaining unit 413 obtains, for each frame of the moving image a and the moving image a', the combination of frames having the same frame number, and registers it in the candidate database D1. More specifically, for each frame of the moving image a listed in the table PA, the candidate obtaining unit 413 searches the moving image a' for the frame having the matching frame number by referring to the table PA'. The candidate obtaining unit 413 assigns a unique index I to each combination of frames of the moving image a and the moving image a' having the same frame number, and registers it in the candidate database D1. The frame group of the moving image a registered in the candidate database D1 is denoted by TA, and the frame group of the moving image a' is denoted by TA'.
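A minimal sketch of the candidate database D1 for this embodiment is shown below; the dictionary layout and table format are assumptions made for illustration, pairing frames of the moving image a with the frames of the moving image a' that share a frame number.

```python
# Sketch of candidate database D1: each entry pairs a frame of moving image a (group TA)
# with the frame of moving image a' (group TA') having the same frame number.
def build_candidate_database(table_pa, table_pa_prime):
    # table_pa / table_pa_prime are assumed to map frame_number -> frame record.
    candidate_db = {}
    for index, frame_number in enumerate(sorted(table_pa)):
        if frame_number in table_pa_prime:
            candidate_db[index] = (table_pa[frame_number],        # frame of group TA
                                   table_pa_prime[frame_number])  # frame of group TA'
    return candidate_db
```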
High definition moving image generation processing
Hereinafter, a portion different from the processing (fig. 7) of the first embodiment will be mainly described with reference to the flowchart of fig. 15.
The process of step S1501 is similar to step S701 of the first embodiment (fig. 7). In step S1502, the teacher data extraction unit 414 extracts, from the frame group TA' of the teacher data candidates registered in the candidate database D1, frames whose imaging time difference from the frame By is smaller than a threshold value set in advance in the system. As the threshold value, for example, the display period of one frame of the moving image a (the one-frame display period derived from the frame rate FA) can be used. The teacher data extraction unit 414 registers the extracted frames in the teacher database D2.
Specifically, first, the teacher data extraction unit 414 refers to the table PA' and obtains the time information of the frames registered in the frame group TA'. Then, among the frames of the frame group TA', the teacher data extraction unit 414 registers, in the teacher database D2 on the RAM 103, the frames whose imaging time difference from the frame By is smaller than the threshold value. Hereinafter, the frame group of the moving image a' registered in the teacher database D2 is referred to as a frame group UA'. Note that in the present embodiment, frames whose imaging time difference from the frame By is smaller than a predetermined threshold value are extracted from the candidate database D1. However, such limitation is not intended. For example, frames for which an index (for example, SSIM) indicating the similarity between the image of the frame By and the image of each frame in the frame group TA' is higher than a threshold value set in advance in the system may be extracted from the frame group TA' and registered in the teacher database D2.
In step S1503, the teacher data extraction unit 414 registers the frame of the frame group TA associated with each frame of the frame group UA' via the index I in the teacher database D2. Specifically, the teacher data extraction unit 414 refers to the candidate database D1 on the RAM 103, and registers the frame of the frame group TA associated with each frame of the frame group UA' via the index I in the teacher database D2. At this time, the associated combination (frame pair) is not changed, and the index J unique in the teacher database D2 is assigned to each combination. Hereinafter, the frame group of the moving image a registered in the teacher database D2 is referred to as a frame group UA.
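The extraction of steps S1502 and S1503 can be sketched as follows, under the assumption that each candidate entry carries its decoded pixels and imaging time; the attribute names and the SSIM-based alternative threshold are illustrative, not the patent's exact data layout.

```python
# Sketch of steps S1502-S1503: keep candidate pairs whose low-resolution frame is close
# to the high-definition target frame By, either by imaging time or by SSIM.
from skimage.metrics import structural_similarity

def extract_teacher_data(candidate_db, frame_by, by_time, time_threshold,
                         use_similarity=False, ssim_threshold=0.8):
    teacher_db = {}
    j = 0  # index J unique in the teacher database D2
    for frame_ta, frame_ta_prime in candidate_db.values():
        if use_similarity:
            keep = structural_similarity(frame_by, frame_ta_prime.pixels,
                                         data_range=255) > ssim_threshold
        else:
            keep = abs(frame_ta_prime.time - by_time) < time_threshold
        if keep:
            teacher_db[j] = (frame_ta, frame_ta_prime)  # pair (UA, UA')
            j += 1
    return teacher_db
```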
In step S1504, the learning unit 451 refers to the teacher database D2, and performs learning using the frame group UA and the frame group UA', and generates a learning model M. Specifically, first, the learning unit 451 refers to the teacher database D2 and tables PA and PA', reads out frame data from the storage unit 106, and inputs the frame data into the learning model generation function. The learning unit 451 performs learning using the frame data read out by the learning model generation function, and stores the learning model M generated as a learning result in the RAM 103. Details of learning of the learning model are described above with reference to fig. 8. The subsequent processing of steps S1505 to S1506 is similar to that of the first embodiment (the processing of steps S705 to S706 in fig. 7).
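The learning model generation function of fig. 8 is not reproduced in this text; purely as a hedged sketch of the general residual-learning idea (a small CNN trained so that a linearly enlarged low-resolution frame plus the predicted high-frequency component approximates the paired high-resolution frame), a PyTorch version might look like the following. The architecture, loss, optimizer, and integer scale factor are assumptions.

```python
# Generic residual-learning sketch (assumed architecture, not the patent's fig. 8):
# the CNN is trained so that upscale(low) + CNN(low) approximates the high-resolution frame.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFrequencyCNN(nn.Module):
    def __init__(self, channels=3, features=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(features, features, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(features, channels, 3, padding=1))

    def forward(self, low, scale):
        up = F.interpolate(low, scale_factor=scale, mode="bilinear", align_corners=False)
        return self.body(up)  # predicted high-frequency component at the target resolution

def train_model_m(pairs, scale, iterations=1000, lr=1e-4):
    # pairs: list of (high, low) tensors in NCHW layout with values in [0, 1].
    model = HighFrequencyCNN()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(iterations):
        for high, low in pairs:
            up = F.interpolate(low, scale_factor=scale, mode="bilinear", align_corners=False)
            loss = F.l1_loss(up + model(low, scale), high)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```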
As described above, according to the above-described embodiment, teacher data used in learning of the learning model is selected based on the high-definition target image. Thus, the learning model trained using the selected teacher data can infer the high-frequency component of the high-definition target image with higher accuracy, thereby enabling highly accurate high-definition images to be obtained. In other words, the accuracy of moving image super-resolution imaging for making a moving image have high definition can be improved.
Note that in the above-described embodiment, when obtaining the teacher data candidate, the image forming a pair with the image selected from the moving image a is the image selected from the moving image B based on the imaging time or the similarity with the image, or the image obtained by reducing the resolution of the selected image. However, the present embodiment is not limited thereto. It is sufficient that the image related to the selected image from among the moving images a to be used as the teacher data candidates is an image related to the selected image with a resolution lower than that of the selected image. For example, whether or not an image is related to an image selected from the moving image a may be determined based on common characteristics such as the air temperature at the time of image capturing, the image capturing place, or the image capturing direction.
Further, in the above-described embodiment, the process has two stages of generating the teacher database D2 after generating the candidate database D1. However, such limitation is not intended. For example, the teacher data extraction unit 414 may extract frames, which may be pairs with the teacher data, from the moving image a based on the frame By, and may obtain the teacher data using the extracted frames and the frames related to the extracted frames as pairs. However, in the case where a plurality of images of the moving image B are sequentially made to have high definition, as in the above-described embodiment, it is more efficient to generate the candidate database D1, and then extract and use appropriate teacher data from the candidate database D1 in accordance with the high-definition target image.
Further, in the above-described embodiments, the targets of the processing are the moving image a and the moving image b having a lower resolution than the moving image a. However, such limitation is not intended. For example, an uncompressed moving image a and a moving image b obtained by restoration after compression are also possible processing targets. In this case, the moving image a may be thinned out in terms of frames and stored. In this way, the relationship between the moving image a and the moving image b as the processing targets of the above-described embodiments is not limited to a resolution size relationship, and it is sufficient that the moving image a has better sharpness than the moving image b. In other words, it is sufficient that the image group forming the moving image a (moving image A) includes higher frequency components than the image group forming the moving image b (moving image B). For example, the processing of the above embodiments may be applied as long as each image in the image group of the moving image a corresponds to one or more images in the image group of the moving image b, and the images in the image group of the moving image a have higher frequency components than the corresponding images in the image group of the moving image b.
Further, the moving image data is described only briefly above. However, for example, in the case of a device that can generate a still image at a predetermined timing during recording of a moving image, the above-described embodiments can be applied as follows. In other words, still images may be used as the data corresponding to the moving image a, and a moving image may be used as the data corresponding to the moving image b. For example, assume that one of the above embodiments is applied to an image pickup apparatus that captures 6K RAW data at 60 fps using an image sensor. Further, assume that a still image is data stored, with the 6K size unchanged, in a format such as JPEG or HEIF after development processing and still image compression. Further, assume that a moving image is data stored in a format such as MP4 (moving image data of 2K size at 60 fps) after development processing and moving image compression are performed on raw data obtained by converting the 6K data from the image sensor into a 2K data size. Under these assumptions, when the user presses the release switch and continuously shoots still images while the image pickup apparatus is recording 2K moving image data at 60 fps, 6K still images can be generated at intervals of, for example, 10 fps relative to the frame rate (60 fps) of the moving image. By applying one of the above-described embodiments to the still images and the moving image generated in this way, data having still image quality can be generated for the period in which the plurality of still images were captured. In other words, a system can be realized that obtains a 6K-size moving image, as a sequence of still images, which looks as if it had been shot at a frame rate of 60 fps. Further, in this case, the still images and the moving image are prepared using an image pickup apparatus, and the learning and inference processing for generating data of still image quality corresponding to the moving image is performed in the image pickup apparatus.
Sixth embodiment
In the sixth embodiment, improvement of learning performance and inference performance in consideration of image similarity related to the learning process and the inference process of the first embodiment will be described.
In the first embodiment, appropriate teacher data is extracted for the frame By selected in step S701 of fig. 7, and the learning model M is generated or updated using the teacher data in step S704. Further, in step S705, the high-frequency component is inferred using the learning model M, and the high-definition frame Cy is generated. However, with this method, in the case where various textures (such as textures of a person, a building, vegetation, ocean, or the like) are included in the frame By, the amount of information learned at one time is large, which means that a desired learning performance may not be obtained. This is because high-frequency components of various patterns are included in one frame. Therefore, the learning process of the sixth embodiment solves this problem by extracting a certain region from one frame, generating a learning model for each local region, performing inference using the learning model for each local region, and generating an image converted into high definition for each local region and combining the images.
In the sixth embodiment, the hardware structure and the functional structure of the image processing apparatus 100 are similar to those of the first embodiment (fig. 1). The extracted teacher data may be the same as that according to any one of the first to fifth embodiments. The processing after the learning processing is different, and this will be described in detail using the flowchart in fig. 16 and an example of the learning inference processing in fig. 17.
The processing of steps S1601 to S1603 is similar to steps S701 to S703 of the first embodiment (fig. 7).
In step S1604, the inference unit 452 extracts a local area from the inference target frame By (local area determination), and holds the local area in the RAM 103. Hereinafter, the extracted partial region (partial image) is referred to as a partial region Byn 1701.
Next, in step S1605, the learning unit 451 selects, from the teacher data (frame groups UA and UB) registered in the teacher database D2, local areas UAn 1702 and UBn 1703 corresponding to the same coordinate positions as the local area Byn of the estimation target frame By (local area selection). The learning unit 451 holds the selected partial region UAn 1702 and the partial region UBn 1703 in the RAM 103. In the present embodiment, the teacher data is one pair of local areas, but the teacher data may be a plurality of pairs of local areas. Note that the partial region group is a rectangular region having a uniform size of several tens pixels×several tens pixels. However, such limitation is not intended.
Note that the expression "local area corresponding to the same coordinate position as the local area Byn 1701" of the inference target refers, in the case of the frame group UB, to the area indicated by exactly the same coordinates as the local area of the inference target frame By. In other words, if the coordinates of the local area of the inference target frame By are (sx, sy), the coordinates of the local area UBn 1703 are also (sx, sy). Further, in the frame group UA, the ratio between the resolution XA of the moving image A and the resolution XB of the moving image B is taken into account. For example, in the case where XA:XB corresponds to a 2:1 relationship in terms of width and height, if the coordinates of the local area of the inference target frame By are (sx, sy), the coordinates of the local area UAn 1702 are (sx×2, sy×2). Hereinafter, this is referred to as the "local area corresponding to the same coordinate position".
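A short sketch of this coordinate mapping, assuming an integer width/height ratio between the resolution XA and the resolution XB; the function name is illustrative.

```python
# Sketch of "local area corresponding to the same coordinate position": the UB-side region
# keeps the coordinates of Byn, while the UA-side region scales them by the ratio XA/XB.
def corresponding_regions(sx, sy, w, h, scale):
    ubn = (sx, sy, w, h)                                  # same coordinates in frame group UB
    uan = (sx * scale, sy * scale, w * scale, h * scale)  # scaled coordinates in frame group UA
    return uan, ubn

# Example: a 32x32 region at (sx, sy) = (64, 48) with XA:XB = 2:1
# maps to a 64x64 region at (128, 96) in the UA frame.
```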
In step S1606, the learning unit 451 generates a learning model Mn 1704 (local region learning model) using the local region UAn 1702 and the local region UBn1703, and using the learning model generation function shown in fig. 8. The learning unit 451 reads out frame data of a frame pair registered as teacher data from the storage unit 106, inputs the frame data to the learning model generation function for each local area, and stores the generated learning model Mn 1704 in the RAM 103.
In step S1607, the inference unit 452 performs inference on the local area Byn 1701 using the learning model Mn 1704 generated in step S1606, and generates the local area Cyn 1705 (local high-frequency component) of the high definition frame. First, the inference unit 452 reads out the learning model Mn 1704 stored in the RAM 103 in step S1606. Next, the inference unit 452 inputs the local area Byn 1701 held in the RAM 103 in step S1604 into the CNN of the learning model Mn 1704, and generates the high-frequency component that is expected when the local area Byn 1701 is enlarged to the size of the local area UAn 1702. The inference unit 452 generates the local area Cyn 1705 by adding the generated high-frequency component to the image obtained by linearly enlarging the image of the local area Byn 1701 to the size of the local area UAn 1702, and stores it in the RAM 103. Note that the process from high-frequency component inference to high-definition image generation performed for the local area Byn 1701 is similar to the inference process shown in fig. 8.
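Assuming the residual CNN sketched earlier, the per-region inference of step S1607 can be illustrated as follows; the tensor layout and clamping are assumptions.

```python
# Sketch of step S1607 for one local region: the high-frequency component inferred from Byn
# is added to the linearly enlarged Byn to obtain the high-definition region Cyn.
import torch
import torch.nn.functional as F

def infer_local_region(model_mn, byn, scale):
    # byn: 1xCxhxw tensor of the local region; scale: ratio of resolution XA to XB.
    with torch.no_grad():
        up = F.interpolate(byn, scale_factor=scale, mode="bilinear", align_corners=False)
        cyn = up + model_mn(byn, scale)   # enlarged region plus inferred high frequencies
    return cyn.clamp(0.0, 1.0)
```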
Next, in step S1608, the estimation unit 452 combines the partial areas Cyn 1705 stored in the RAM 103 based on the frame coordinate position information to generate a high definition frame Cy1706, and holds it in the RAM 103. Note that 1705 indicated by a broken line in fig. 17 represents a partial region Cyn, and 1706 indicated by a solid line represents a high definition frame Cy.
In step S1609, the control unit 101 determines whether the above-described processing is completed for all the partial areas of the frame By. In the case where the control unit 101 determines that the processing is not completed (no in step S1609), the processing advances to step S1605, and the above-described processing is repeated for the next partial region of the frame By. When the control unit 101 determines that the processing is completed (yes in step S1609), the processing proceeds to step S1610.
In step S1610, the inference unit 452 adds the frame data of the high-definition frame Cy1706 stored in the RAM 103 to the end of the high-definition moving image C on the storage unit 106. Further, the image capturing time information of the frame By is copied and multiplexed into the image capturing time of the high definition frame Cy1706, and stored in the moving image C.
In step S1611, the control unit 101 determines whether the above-described processing is completed for all frames of the moving image B. In the case where the control unit 101 determines that the processing is not completed (no in step S1611), the processing advances to step S1601, and the above-described processing is repeated with the next frame of the moving image B as the frame By. When the control unit 101 determines that the processing is completed (yes in step S1611), the current processing ends. As described above, at the end of the high-definition moving image generation process, the high-definition moving image C having the resolution XA and the frame rate FB is stored in the storage unit 106 in an uncompressed format.
As described above, according to the sixth embodiment, with respect to a high-definition target image having various textures and a large amount of information, by learning for each partial region, the amount of information used in one-pass learning can be reduced, thereby enabling learning with higher accuracy. Thus, a higher definition image can be generated.
Seventh embodiment
The seventh embodiment described below is an example of improving super resolution by changing the learning process for each partial region according to the sixth embodiment.
With the method of the sixth embodiment, a learning model is generated by learning, from a frame different from the inference target, a region at the same position as the inference target region. However, with this method, for example, in a case where the subject moves a lot, the inference target region and the content shown in the teacher data may differ, which may make it difficult to obtain the desired super-resolution performance.
In order to solve this problem, in the learning process of the seventh embodiment, a similarity evaluation function is provided. Via this similarity evaluation function, an area having a high similarity with the inferred area is searched for in the teacher data candidates, and the obtained area having the high similarity is used in learning.
High definition moving image generation processing
The difference between the seventh embodiment and the sixth embodiment is only the processing of step S1605 in the flowchart of the high-definition moving image generation processing shown in fig. 16. Therefore, only the process of step S1605 according to the seventh embodiment will be described.
In step S1605, the inference unit 452 extracts a region of the inference target frame By and holds it as a local area in the RAM 103. Note that the local area is a rectangular region having a uniform size of several tens of pixels × several tens of pixels. However, such limitation is not intended. The control unit 101 searches the frame group UB of the teacher data registered in the teacher database D2 for the region UBn having the highest similarity with the local area of the inference target frame By, using the SSIM provided to realize the similarity evaluation function, and holds it in the RAM 103. The learning unit 451 selects, from the frame group UA, the frame forming a pair with the frame to which the local area UBn held in the RAM 103 belongs, and accordingly holds the local area UAn located at the position corresponding to the local area UBn in the RAM 103. Note that peak signal-to-noise ratio (PSNR), signal-to-noise ratio (SNR), or mean square error (MSE) may be used for the similarity evaluation. Further, in the above description, the region UBn having the highest similarity is searched for among all the frames included in the frame group UB. However, such limitation is not intended. For example, the region UBn having the highest similarity may be searched for in each frame included in the frame group UB. In this case, the number of obtained pairs of the local area UBn and the local area UAn is equal to the number of frames included in the frame group UB.
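A hedged sketch of such a similarity search, assuming grayscale frames, a fixed search stride, and scikit-image's SSIM; the stride and the exhaustive window scan are illustrative choices rather than the patent's exact procedure.

```python
# Sketch of the similarity evaluation function of the seventh embodiment: slide over each
# teacher frame of group UB and keep the window most similar to the inference target region.
import numpy as np
from skimage.metrics import structural_similarity

def find_most_similar_region(byn, frames_ub, stride=8):
    h, w = byn.shape
    best = (-1.0, None, None)  # (score, frame index, (y, x))
    for f_idx, frame in enumerate(frames_ub):
        for y in range(0, frame.shape[0] - h + 1, stride):
            for x in range(0, frame.shape[1] - w + 1, stride):
                score = structural_similarity(byn, frame[y:y + h, x:x + w], data_range=255)
                if score > best[0]:
                    best = (score, f_idx, (y, x))
    return best  # the UAn region is then taken at the corresponding (scaled) position
```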
As described above, according to the seventh embodiment, learning is performed using a region having a high degree of similarity to the inferred region. Therefore, even for a moving image in which the subject moves a lot, a higher definition image can be generated.
Eighth embodiment
In the eighth embodiment, a solution to the problem of the sixth embodiment described in the seventh embodiment, which is different from the solution of the seventh embodiment, is explained.
In the eighth embodiment, a method of estimating a motion vector associated with a region is used to identify a region having high similarity. Note, however, that the eighth embodiment assumes that the moving image b has been compressed into the MPEG-4 AVC format using inter prediction. MPEG-4 AVC is an abbreviation for ISO/IEC 14496-10 "MPEG-4 Part 10: Advanced Video Coding".
Next, differences between the eighth embodiment and the sixth embodiment will be mainly described.
Data stored in storage medium and decoding and loading method thereof
In the processing of the analysis unit 211 according to the eighth embodiment, the following processing is performed in addition to the parsing of the moving image data stored in the storage unit 106 described in the first embodiment. The analysis unit 211 parses the MP4 file storing the moving image b and obtains the avcC box. Then, the analysis unit 211 obtains the sequence parameter set (hereinafter referred to as SPS) and the picture parameter set (hereinafter referred to as PPS) included in the avcC box, and stores both in the RAM 103.
High definition moving image generation processing
The high-definition moving image generation process between the eighth embodiment and the sixth embodiment is different in the processes of steps S1605 to S1607 in the flowchart of fig. 16. Therefore, the processing of steps S1605 to S1607 according to the eighth embodiment will be described using the flowchart of fig. 18.
Note that in step S1604 according to the sixth embodiment described above, the inference unit 452 extracts the local area Byn of the inference target frame By as a rectangular area having a uniform size of 16×16 pixels.
In step S1801, in the case where it is inferred that the target frame By is an I picture, the control unit 101 advances the process to step S1803. In the case where it is inferred that the target frame By is a P picture or a B picture, the control unit 101 advances the process to step S1802. For example, whether the inferred target frame is an I picture, a P picture, or a B picture may be determined by referring to SPS and PPS.
In step S1802, the control unit 101 obtains a macroblock layer from the local area Byn of the inference target frame By. In addition, in the case of using a sub-macroblock, sub-macroblock prediction is obtained. Otherwise, macroblock predictions are obtained.
The control unit 101 derives a prediction unit block area Bynb of a macroblock to which the local area Byn of the target frame By belongs via sub-macroblock prediction or macroblock prediction of the macroblock. The prediction unit block region Bynb may be a macroblock, blocks of a macroblock divided by a partition, blocks of a sub-macroblock, or blocks of a sub-macroblock divided by a partition. These blocks are units of motion compensation in MPEG-4 AVC.
The control unit 101 derives the motion vector of the block region Bynb, the referenced frame, mbPartIdx, and subMbPartIdx from the SPS, the PPS, and the macroblock prediction or sub-macroblock prediction.
Here, the control unit 101 generates six pieces of information ("mbPartIdx", "subMbPartIdx", "motion vector", "presence or absence of a motion vector", "reference/referenced frame", and "reference direction") for each block region Bynb, and stores them in the RAM 103. "mbPartIdx" and "subMbPartIdx" are information for identifying which block region in a macroblock is the block region Bynb. The "motion vector" indicates the temporal and spatial movement of the block region Bynb, and specifically indicates the reference destination block in the referenced frame. The "presence or absence of a motion vector" indicates whether or not the block region Bynb has such a motion vector. The "reference/referenced frame" refers to the referenced frame that is referred to when decoding the inference target frame By from which the block region Bynb is extracted, and to the reference frame that references the block region Bynb. When the "reference/referenced frame" is generated in step S1802, the referenced frame is stored. Further, regarding the "reference direction", the direction indicated by the motion vector from the macroblock of the local area Byn of the inference target frame By is the reference direction, and the direction from a macroblock of another frame toward the local area Byn of the inference target frame By is the referenced direction. Hereinafter, the above six pieces of information are collectively referred to as motion vector information.
The control unit 101 checks whether a frame identifiable via the "reference/referenced frame" of the generated motion vector information exists in the teacher data candidates. In the case where a frame identifiable by the "reference/referenced frame" exists in the teacher data candidate, the control unit 101 sets "presence or absence of a motion vector" from the motion vector information to "yes", and in the case where a frame identifiable by the "reference/referenced frame" does not exist in the teacher data candidate, the control unit 101 sets "presence or absence of a motion vector" to "no".
Further, for example, in the case where the inference target frame By is a B picture and the block has two motion vectors, the referenced frame that is temporally closer to the inference target frame By is used. When the temporal distances to the inference target frame By are the same, the motion vector indicating the smaller spatial distance and the information of its referenced frame are used. When both the temporal distance and the spatial distance are equal, either of the referenced frames may be used.
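The six pieces of motion vector information described above might be held in a record such as the following sketch; the field names and types are assumptions for illustration only.

```python
# Sketch of the six-field "motion vector information" kept per block region Bynb.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MotionVectorInfo:
    mb_part_idx: int                          # which partition of the macroblock
    sub_mb_part_idx: int                      # which sub-macroblock partition
    motion_vector: Optional[Tuple[int, int]]  # (dx, dy) to the reference destination block
    has_motion_vector: bool                   # "presence or absence of a motion vector"
    ref_frame: Optional[int]                  # "reference/referenced frame" (frame number)
    is_reference_direction: bool              # True: reference direction, False: referenced direction
```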
In step S1803, for each block region Bynb whose "presence or absence of a motion vector" in the motion vector information is "no", the control unit 101 searches the teacher data candidates for a block that references the block region Bynb. Hereinafter, a block that references the block region Bynb is also referred to as a reference source block. Note that the method for obtaining the motion vector and the reference frame information required to judge whether a block is a reference source block of the block region Bynb has been described in step S1802 and is therefore omitted here.
When a block that references the block region Bynb (a reference source block of the block region Bynb) is found, the "presence or absence of a motion vector" in the motion vector information of the block region Bynb is updated to "yes". Further, the frame including the block that references the block region Bynb is stored as the referenced frame in the "reference/referenced frame". Note that the range of searched frames is within three frames before or after the frame including the block region Bynb. Further, the range of searched macroblocks is within MaxVmvR of the level set in MPEG-4 AVC. MaxVmvR is derived from the SPS of the moving image b. Note that the range of searched frames and the range of searched macroblocks are not limited to these examples.
In step S1804, for each block region Bynb whose "presence or absence of a motion vector" in the motion vector information is "yes", the inference unit 452 obtains the reference destination or reference source block region UBXnb from the frame group UB and holds it in the RAM 103. Further, the inference unit 452 obtains, from the frame group UA, the block region UAXnb corresponding to the same coordinate position as the block region UBXnb obtained via the motion vector information of each block region Bynb stored in the RAM 103, and holds it in the RAM 103. In other words, the inference unit 452 obtains the block region UAXnb corresponding to the same coordinate position as the block region UBXnb from the frame of the frame group UA that forms a pair with the frame to which the block region UBXnb belongs. Further, the inference unit 452 associates the block region UAXnb with the block region UBXnb and holds them in the RAM 103.
In step S1805, the control unit 101 determines whether the "presence or absence of a motion vector" in the motion vector information is "yes" or "no" for all the block regions Bynb included in the local area Byn of the inference target frame By. When the control unit 101 determines that the "presence or absence of a motion vector" is "yes" for all the block regions Bynb (yes in step S1805), the process advances to step S1806.
In step S1806, the inference unit 452 combines the block areas UBXnb stored in the RAM103 based on the coordinate position information of the block areas Bynb, and generates the local area UBXn. The inference unit 452 holds the generated local area UBXn in the RAM 103.
Further, the inference unit 452 combines the block areas UAXnb corresponding to the same coordinate positions as the block areas UBXnb stored in the RAM103 based on the coordinate position information of the block areas Bynb, and generates the local areas UAXn. The inference unit 452 holds the generated local area UAXn in the RAM 103.
Further, the learning unit 451 generates the learning model Mn using the local area UAXn and the local area UBXn stored in the RAM103, and the learning model generating function shown in fig. 8. Note that the partial area UBXn is teacher data corresponding to the same coordinate position as the partial area UAXn of the paired frame. The learning unit 451 reads out teacher data from the RAM103, performs a learning model generation function, and stores the generated learning model Mn in the RAM 103.
In step S1807, the estimation unit 452 uses the learning model Mn generated in step S1806 to estimate the local area Byn of the frame By, and generates the local area Cyn of the high-definition frame.
First, the inference unit 452 reads out the learning model Mn stored in the RAM 103 in step S1806. Next, the inference unit 452 inputs the local area Byn of the frame By held in the RAM 103 into the CNN of the learning model Mn, and generates a high-frequency component expected in the local area Byn when the inference target frame By is enlarged to the resolution XA. The inference unit 452 generates a local area Cyn by adding the generated high-frequency component to a local area Byn obtained by linear amplification based on the ratio between the resolution XB and the resolution XA, and stores it in the RAM 103. Note that the process from high-frequency component inference to high-definition image generation performed for the local area Byn is a process similar to that of the inference process shown in fig. 8.
In step S1805, when the control unit 101 determines that the local area Byn includes a block region Bynb whose "presence or absence of a motion vector" is "no" (no in step S1805), the process proceeds to step S1808. In step S1808, the control unit 101 determines, for each block region Bynb included in the local area Byn, whether the "presence or absence of a motion vector" in its motion vector information is "yes" or "no". When the control unit 101 determines that the "presence or absence of a motion vector" is "yes" (yes in step S1808), the process advances to step S1809. On the other hand, when the control unit 101 determines in step S1808 that the "presence or absence of a motion vector" is "no" (no in step S1808), the process advances to step S1811.
In step S1809, the learning unit 451 generates a learning model Mnb to be used for the block region Bynb, using the learning model generation function shown in fig. 8, and holds it in the RAM 103.
More specifically, in step S1809, the learning unit 451 generates a learning model Mnb for use in the inference of the block region Bynb using the local region UBXnb and the local region UAXnb stored in the RAM103 and the learning model generation function shown in fig. 8. Note that the local area UBXnb is teacher data corresponding to the same coordinate position as the local area UAXnb of the paired frame. The learning unit 451 reads out teacher data from the RAM103, inputs it into the learning model generation function, and stores the generated learning model Mnb in the RAM 103.
In step S1810, the inference unit 452 uses the learning model Mnb to perform inference on the block region Bynb of the frame By, and generates the block region Cynb of the high definition frame. First, the inference unit 452 reads out the learning model Mnb stored in the RAM 103 in step S1809. Next, the inference unit 452 inputs the block region Bynb held in the RAM 103 into the CNN of the learning model Mnb, and generates the high-frequency component expected in the block region Bynb when the inference target frame By is enlarged to the resolution XA. The inference unit 452 generates the block region Cynb of the high-definition frame by adding the generated high-frequency component to the block region Bynb linearly enlarged based on the ratio between the resolution XB and the resolution XA, and stores it in the RAM 103. Note that the process from high-frequency component inference to high-definition image generation performed for the block region Bynb is similar to the inference process shown in fig. 8.
In step S1811, the control unit 101 holds, in the RAM 103, a block region Cynb of the high-definition frame Cy obtained by linearly enlarging, based on the ratio between the resolution XA and the resolution XB, each block region Bynb whose "presence or absence of a motion vector" in the motion vector information is "no". Note that the method of linear enlargement is not limited as long as the enlargement can be performed based on the ratio between the resolution XA and the resolution XB.
In step S1812, the control unit 101 determines whether the above-described processing is completed for all the block areas Bynb. In the case where the control unit 101 determines that the processing is not completed (no in step S1812), the processing advances to step S1807, and the incomplete block area Bynb is processed. When the control unit 101 determines that the processing is completed (yes in step S1812), the processing proceeds to step S1813. In step S1813, the control unit 101 reads out the block areas Cynb held in the RAM 103 in step S1810 and step S1811, combines the block areas based on the coordinate position information of the corresponding block areas Bynb, and generates the partial area Cyn of the high-definition frame. The generated partial area Cyn is held in the RAM 103. In step S1608 of fig. 16, the local area Cyn generated as described above is used as the local area Cyn1705.
As described above, according to the eighth embodiment, learning is performed using a motion vector of a region having a high similarity with a reference/referred inferred region. Therefore, even for a moving image in which the subject moves a lot, a higher definition image can be generated.
Ninth embodiment
In a ninth embodiment, the solution to the problem according to the sixth embodiment described in the seventh embodiment, which is different from the solutions of the seventh and eighth embodiments, is described.
Next, differences between the ninth embodiment and the sixth embodiment will be mainly described.
High definition moving image generation processing
The difference between the ninth embodiment and the sixth embodiment is only the processing of steps S1605 and S1606 in the flowchart of the high-definition moving image generation processing shown in fig. 16. Therefore, the processing of steps S1605 and S1606 according to the ninth embodiment will be described below.
In step S1605, the control unit 101 selects, from the paired frames of the frame groups UA and UB, the local areas corresponding to the same coordinate position as the local area Byn of the inference target frame By (referred to as UAn5 and UBn5), and holds them in the RAM 103. In addition, the control unit 101 holds, in the RAM 103, the eight areas adjacent to UBn5 and having the same size as UBn5. In a similar manner, the control unit 101 holds, in the RAM 103, the eight areas adjacent to UAn5 and having the same size as UAn5. An example of region selection for a frame included in the frame group UB is shown in fig. 19. Note that, in the present embodiment, the region having the same position coordinates as the local area Byn and the eight adjacent regions are selected for the inference target region. However, the selection method and the number of areas are not limited thereto.
Next, the control unit 101 evaluates the similarity between the local area Byn of the inference target frame By and each of UBn1 to UBn9, and obtains similarity evaluation values. Then, the control unit 101 determines the number of learning times for each of UBn1 to UBn9 based on the similarity evaluation values, and holds it as learning information in the RAM 103. Note that the learning information includes, for example, "information for identifying UBn1 to UBn9", "the similarity evaluation value with the local area Byn", and "the number of learning times". In the case where the similarity evaluation value with the local area Byn in the learning information is smaller than a threshold value set in advance in the system, the control unit 101 updates the number of learning times in the learning information to 0. For the regions whose similarity evaluation value is equal to or greater than the threshold value, the number of learning times is determined using the ratio of the similarity evaluation values among those regions, and the learning information is updated. In this example, the similarity evaluation values of UBn4, UBn5, and UBn6 are equal to or greater than the threshold value, and the ratio among them is 2:5:3. Further, the total number of learning times is set to 1000. In this example, the numbers of learning times in the learning information of UBn4 to UBn6 are therefore 200, 500, and 300, respectively. Note that in the method for determining the number of learning times according to the present embodiment, the number of learning times is linearly allocated to the regions whose similarity evaluation value is greater than the threshold value. However, the method is not limited thereto.
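The allocation of learning counts can be sketched as follows, assuming the linear (proportional) allocation described above; the helper name and rounding behavior are assumptions.

```python
# Sketch of the learning-count allocation of the ninth embodiment: regions below the
# similarity threshold get 0 iterations, and the remaining total (1000 here) is split
# in proportion to the similarity scores of the surviving regions.
def allocate_learning_counts(similarities, threshold, total=1000):
    # similarities: e.g. {"UBn1": 0.1, ..., "UBn9": 0.4}
    kept = {name: s for name, s in similarities.items() if s >= threshold}
    denom = sum(kept.values())
    counts = {name: 0 for name in similarities}
    for name, s in kept.items():
        counts[name] = round(total * s / denom) if denom > 0 else 0
    return counts

# With surviving scores in the ratio 2:5:3 and total=1000, this yields
# 200, 500, and 300 iterations, matching the example above.
```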
In step S1606, the learning unit 451 generates the learning model Mn using, as teacher data, pairs each consisting of an image of a local region indicated by the learning information (one of UBn1 to UBn9) and the image of the corresponding local region in the frame group UA (one of UAn1 to UAn9). The learning unit 451 performs, for each piece of teacher data, the number of learning iterations indicated by the learning information using the learning model generation function shown in fig. 8, and generates the learning model Mn. The generated learning model Mn is stored in the RAM 103.
The processing from step S1607 is the same as that in the sixth embodiment, and therefore a description thereof is omitted.
As described above, according to the ninth embodiment, a plurality of regions having high similarity to the inferred region are used in learning according to the similarity to the inferred region. Therefore, even for a moving image in which the subject moves a lot, a higher definition image can be generated.
As described above, according to the sixth to ninth embodiments, a local area can be determined from a high-definition target image, and the amount of information used in learning of a learning model can be reduced. Further, according to the sixth to ninth embodiments, it is possible to select a local area of teacher data having high correlation with a local area determined from a high-definition target image and use it for learning of a learning model. Accordingly, the high-frequency component of the high-definition target image can be inferred with higher accuracy, thereby enabling a highly accurate high-definition image to be obtained. In other words, the accuracy of moving image super-resolution imaging for making a moving image have high definition can be improved.
Tenth embodiment
The tenth embodiment described below is an example of changing the learning process of the learning model according to the first embodiment to reduce the learning processing load. In the method of the first embodiment, a learning model is generated for each inference target frame, and super-resolution performance is improved via the inference processing. However, with this method, as many learning models as there are inference target frames must be generated, which tends to increase the learning processing load. In the learning process of the tenth embodiment, therefore, the amount of movement from the previous frame is detected for each inference target frame, and for an inference target frame with little movement, the learning model M used for the previous inference target frame is reused. In this way, the number of times the learning model is generated is reduced, and the learning processing load is reduced.
The difference between the tenth embodiment and the first embodiment is the processing of the flowchart of the high-definition moving image generation processing shown in fig. 7. Therefore, differences from the processing of the first embodiment will be mainly described below.
Fig. 20 is a flowchart showing the high definition moving image generation process according to the tenth embodiment. Steps that perform processing similar to that of the first embodiment are given the same reference numerals as in the first embodiment (fig. 7). In step S2001, the control unit 101 determines whether the inference target frame By read out in step S701 involves movement with respect to the previous inference target frame. The control unit 101 calculates the difference between the inference target frame By and the previous inference target frame, determines that "there is movement" if the difference is greater than a threshold value, and determines that "there is no movement" if the difference is equal to or less than the threshold value. The difference between the two frames may be, for example, the similarity between the inference target frame By and the previous inference target frame obtained using SSIM. In the case where the obtained similarity is higher than a specific threshold value, it is determined that "there is no movement". Note that peak signal-to-noise ratio (PSNR), signal-to-noise ratio (SNR), mean square error (MSE), or the like may be used for the similarity evaluation. In the case where the control unit 101 determines that "there is movement" (yes in step S2001), the process proceeds to step S702, and the learning model M is generated via processing (steps S702 to S704) similar to that of the first embodiment. On the other hand, when the control unit 101 determines in step S2001 that "there is no movement" (no in step S2001), the process advances to step S2002. In step S2002, the control unit 101 determines that the learning model M used for the previous inference target frame is to be used in the inference processing of step S705 for the current inference target frame By. In step S705, the inference unit 452 infers the high-frequency component using the learning model M generated in step S704 or the learning model M determined to be used in step S2002. Further, the inference unit 452 uses the inferred high-frequency component to generate the frame Cy obtained by making the inference target frame By have high definition.
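A hedged sketch of the movement check and model reuse in steps S2001 and S2002, assuming grayscale frames and an SSIM threshold chosen by the system; the helper names are illustrative.

```python
# Sketch of step S2001: reuse the previous frame's learning model when SSIM indicates
# that the inference target frame has not moved relative to the previous one.
from skimage.metrics import structural_similarity

def needs_new_model(frame_by, previous_frame, ssim_threshold=0.95):
    if previous_frame is None:
        return True
    similarity = structural_similarity(frame_by, previous_frame, data_range=255)
    return similarity <= ssim_threshold      # low similarity -> "there is movement"

def model_for_frame(frame_by, previous_frame, previous_model, generate_model):
    if needs_new_model(frame_by, previous_frame):
        return generate_model(frame_by)      # steps S702 to S704
    return previous_model                    # step S2002: reuse learning model M
```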
As described above, according to the tenth embodiment, since the learning model M for the immediately preceding inference target frame is used for the inference target frame of "no movement", the number of times of performing the process to generate the learning model can be reduced. Therefore, according to the tenth embodiment, it is possible to reduce the learning processing load while maintaining the super-resolution performance.
Note that the above-described embodiment is an example that can be applied to the structure of the first embodiment. However, it is apparent that this can also be applied to the structures of the second to fifth embodiments in a similar manner. Further, the method of the tenth embodiment may be applied to the structures of the sixth to ninth embodiments. In this case, it is sufficient to generate a learning model for each local region in the case where it is determined that there is a movement between the estimation target frame By and the immediately preceding estimation target frame. On the other hand, in the case where it is determined that there is no movement between the estimation target frame By and the previous estimation target frame, the learning model for the previous estimation target frame is used as the learning model for all the local areas. In other words, the configuration for judging whether to use the learning model for each frame unit according to the tenth embodiment is obviously also applicable to the structures of the sixth to ninth embodiments. Further, in the tenth embodiment, the estimation target frame is subjected to the similarity evaluation, but such limitation is not intended. For example, the sixth embodiment to the ninth embodiment may have the following configuration: the similarity between the local region of the inference target frame By and the local region of the immediately preceding inference target frame is calculated, and it is determined whether or not a learning model is used for each local region. Since it can be determined whether or not to generate (update) the learning model for each local area, accurate updating of the learning model can be achieved while reducing the learning processing load.
Other embodiments
Embodiments of the present invention can also be realized by supplying software (a program) that performs the functions of the above embodiments to a system or apparatus via a network or various storage media, and by causing a computer (a central processing unit (CPU), a micro processing unit (MPU), or the like) of the system or apparatus to read out and execute the program.
Although the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims (30)

1. An image processing apparatus that uses a first image group to make an image of a second image group, which has fewer high-frequency components than an image of the first image group, have high definition, the image processing apparatus comprising:
a calculation section for calculating a similarity between a current image selected as a high-definition target from the second image group and a previous image preceding the current image as a high-definition target;
a selection unit configured to select teacher data to be used in learning from a plurality of teacher data using an image included in the first image group as one of image pairs based on the current image;
a model generation section for generating a learning model for making the current image have high definition using the selected teacher data;
an inference section that infers a high-frequency component of the current image using the learning model generated by the model generation section in a case where the similarity is equal to or smaller than a threshold value, and infers a high-frequency component of the current image using a learning model for making the previous image have high definition in a case where the similarity is greater than the threshold value; and
an image generating section for generating a high-definition image based on the current image and the inferred high-frequency component.
2. The image processing apparatus according to claim 1, further comprising:
a first obtaining section for obtaining, as candidates of the teacher data, a pair including a first image selected from the first image group and a third image related to the first image having fewer high-frequency components than the first image,
wherein the selecting means selects teacher data to be used in the learning from among candidates of the teacher data.
3. The image processing apparatus according to claim 2,
wherein the first obtaining means obtains the candidate of the teacher data by obtaining the third image from the second image group.
4. The image processing apparatus according to claim 3,
wherein the first obtaining means obtains, as the third image, an image whose image capturing time is the same as that of the first image from the second image group.
5. The image processing apparatus according to claim 3,
wherein the first obtaining means obtains, as the third image, an image whose imaging time difference from the first image is smaller than a predetermined threshold from the second image group.
6. The image processing apparatus according to claim 3,
wherein the first obtaining means obtains, as the third image, an image having the highest similarity to the first image from the second image group.
7. The image processing apparatus according to claim 6,
wherein the first obtaining means determines a similarity between an image obtained by reducing the first image to the resolution of the second image group and the image of the second image group.
8. The image processing apparatus according to claim 2,
wherein the first obtaining means obtains, as the third image, an image of which the size of the first image is reduced and the resolution is lower.
9. The image processing apparatus according to claim 8,
wherein the third image is an image obtained by reducing the first image to the resolution of the second image group.
10. The image processing apparatus according to claim 2,
wherein the selecting means selects, as teacher data to be used in the learning, candidates of teacher data including an image whose imaging time difference from the current image is smaller than a predetermined threshold.
11. The image processing apparatus according to claim 2,
wherein the selecting means selects, from among the candidates of teacher data, teacher data including an image having a similarity with the current image greater than a predetermined threshold value as teacher data to be used in the learning.
12. The image processing apparatus according to claim 1,
wherein the inference means controls updating of the parameter via back propagation in the learning based on teacher data to be used in the learning and the current image.
13. The image processing apparatus according to claim 12,
wherein the inference section determines a coefficient based on teacher data to be used in the learning and the current image, and controls an update amount of the parameter via the back propagation based on the coefficient.
14. The image processing apparatus according to claim 12,
wherein the inference section determines a coefficient based on teacher data to be used in the learning and the current image, and controls the number of repetitions of updating of the parameter via the back propagation based on the coefficient.
15. The image processing apparatus according to claim 13,
wherein the inference section determines the coefficient based on a difference between an image capturing time of an image of teacher data to be used in the learning and an image capturing time of the current image.
16. The image processing apparatus according to claim 13,
wherein the inference section determines the coefficient based on a similarity between an image of teacher data to be used in the learning and the current image.
17. The image processing apparatus according to claim 1, further comprising:
a second obtaining section for obtaining an image pair corresponding to a local area extracted from the current image from the teacher data selected by the selecting section,
wherein the model generating section generates a learning model of the local region using the image pair obtained by the second obtaining section,
wherein the inference means infers a high-frequency component of a local region of the current image by using the learning model generated by the model generation section in a case where the similarity is equal to or smaller than the threshold value, and using a learning model for making a region corresponding to the local region of the previous image have high definition in a case where the similarity is larger than the threshold value, and
wherein the image generating section generates a high-definition image of the local area using the high-frequency component of the local area and the image of the local area of the current image, and combines the high-definition images generated for the respective local areas.
18. The image processing apparatus according to claim 17,
wherein the calculation section calculates a similarity between the current image and the previous image for each local area, and
wherein the inference section uses the learning model generated by the model generation section for a local region where the similarity is equal to or smaller than the threshold value, uses the learning model for making a region of the previous image have high definition for a local region where the similarity is larger than the threshold value, and infers a high-frequency component of the local region of the current image.
19. The image processing apparatus according to claim 17,
wherein the second obtaining means obtains an image pair of an area corresponding to the same coordinate position as the partial area from the teacher data selected by the selecting means.
20. The image processing apparatus according to claim 19,
wherein the image generating section generates the high-definition image by combining the high-definition images of the respective local areas based on the coordinate positions of the local areas.
21. The image processing apparatus according to claim 17,
wherein the second obtaining means obtains an image pair having the highest similarity with the image of the local area from the plurality of image pairs extracted from the teacher data selected by the selecting means.
22. The image processing apparatus according to claim 17,
wherein the second obtaining means obtains an image pair corresponding to the local area from the teacher data based on a motion vector set for a block included in the local area as a motion compensation unit or based on a motion vector referencing a block included in the local area.
23. The image processing apparatus according to claim 17,
wherein the second obtaining means obtains a plurality of image pairs corresponding to a plurality of areas from the selected teacher data based on the position of the partial area, and
wherein the model generating section determines the number of times each of the plurality of image pairs is to be used for learning when generating the learning model, based on the similarity between the image of the local area of the current image and each of the plurality of image pairs.
24. The image processing apparatus according to claim 23,
wherein the plurality of areas includes a first area corresponding to the position of the local area and a second area adjacent to the first area.
25. The image processing apparatus according to claim 23,
wherein the model generating section does not use, in the learning, an image pair whose similarity to the image of the local area is equal to or smaller than a threshold value.
26. The image processing apparatus according to claim 1,
wherein the first image group and the second image group are two image groups obtained by performing different image processing on one image captured by one image sensor included in one image capturing apparatus.
27. The image processing apparatus according to claim 1,
wherein the first image group and the second image group are image groups captured by two different image sensors.
28. The image processing apparatus according to claim 1,
wherein the first image group has a lower frame rate than the second image group.
29. An image processing method that uses a first image group to make an image of a second image group, which has fewer high-frequency components than an image of the first image group, have high definition, the image processing method comprising:
calculating a similarity between a current image selected from the second image group as a high-definition target and a previous image that was the high-definition target immediately before the current image;
selecting, based on the current image, teacher data to be used in learning from a plurality of teacher data each of which uses an image included in the first image group as one image of an image pair;
generating a learning model for making the current image have high definition using the selected teacher data;
inferring a high-frequency component of the current image using the learning model generated in the generating in a case where the similarity is equal to or less than a threshold, and inferring a high-frequency component of the current image using a learning model for making the previous image have high definition in a case where the similarity is greater than the threshold; and
generating a high-definition image based on the current image and the high-frequency component inferred in the inferring.
30. A storage medium storing a program for causing a computer to function as a part of the image processing apparatus according to any one of claims 1 to 28.
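
To make the frame-level gating recited in claim 29 easier to follow, the sketch below shows one possible reading in Python with NumPy. Everything in it is an illustrative assumption rather than the applicant's implementation: the inverse-mean-absolute-difference similarity, the fixed threshold, the patch-wise least-squares regressor standing in for the "learning model", and the names `PATCH`, `fit_model`, `infer_high_freq`, and `enhance_frame` are all hypothetical. Teacher pairs are assumed to be same-sized (low-definition, high-definition) grayscale images derived from the first image group, with sides that are multiples of the patch size.

```python
import numpy as np

PATCH = 8  # patch size for the toy per-frame model (assumed)

def similarity(current, previous):
    # Higher means more similar; 1 / (1 + mean absolute difference) is an assumption.
    return 1.0 / (1.0 + np.abs(current.astype(np.float64) - previous).mean())

def extract_patches(img, patch=PATCH):
    # Non-overlapping patches flattened to vectors (image sides assumed multiples of patch).
    h, w = img.shape
    return np.array([img[y:y + patch, x:x + patch].ravel()
                     for y in range(0, h - patch + 1, patch)
                     for x in range(0, w - patch + 1, patch)])

def fit_model(teacher_pairs):
    # Least-squares map from a low-definition patch to its high-frequency residual
    # (high-definition patch minus low-definition patch); stands in for the claimed
    # "learning model" generated from the selected teacher data.
    X, Y = [], []
    for lo, hi in teacher_pairs:
        lo = lo.astype(np.float64)
        hi = hi.astype(np.float64)
        X.append(extract_patches(lo))
        Y.append(extract_patches(hi - lo))
    W, *_ = np.linalg.lstsq(np.vstack(X), np.vstack(Y), rcond=None)
    return W

def infer_high_freq(model, image, patch=PATCH):
    # Apply the patch regressor to every patch of the current image.
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            p = image[y:y + patch, x:x + patch].ravel()
            out[y:y + patch, x:x + patch] = (p @ model).reshape(patch, patch)
    return out

def enhance_frame(current, previous, teacher_pairs, prev_model, threshold=0.05):
    # Claim 29 gating: reuse the previous frame's model when the frames are
    # similar (similarity > threshold), otherwise learn a new model.
    if (previous is not None and prev_model is not None
            and similarity(current, previous) > threshold):
        model = prev_model
    else:
        model = fit_model(teacher_pairs)
    cur = current.astype(np.float64)
    return np.clip(cur + infer_high_freq(model, cur), 0, 255), model
```

A caller would feed the images of the second image group through `enhance_frame` in order, passing the returned model to the next call so that it can be reused for similar frames.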
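Claims 17 to 20 refine the same idea per local area. The sketch below reuses `similarity`, `fit_model`, and `infer_high_freq` from the previous sketch and is again only an assumed illustration: the fixed tile size, the dictionary of per-area models keyed by tile coordinates, and the co-located cropping of the teacher pairs are choices made here for brevity, not details taken from the application. Image sides are assumed to be multiples of the tile size, and the tile size a multiple of `PATCH`.

```python
def enhance_by_local_areas(current, previous, teacher_frame_pairs,
                           prev_area_models, tile=32, threshold=0.05):
    # Per-local-area variant (claims 17, 18, 19, and 20): similarity is judged
    # per tile, teacher image pairs are cropped at the same coordinates, and
    # the per-tile high-definition results are stitched back by position.
    h, w = current.shape
    out = np.zeros((h, w), dtype=np.float64)
    area_models = {}
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            cur_area = current[y:y + tile, x:x + tile].astype(np.float64)
            reuse = (previous is not None
                     and (y, x) in prev_area_models
                     and similarity(cur_area, previous[y:y + tile, x:x + tile]) > threshold)
            if reuse:
                model = prev_area_models[(y, x)]          # similar area: keep old model
            else:
                area_pairs = [(lo[y:y + tile, x:x + tile], hi[y:y + tile, x:x + tile])
                              for lo, hi in teacher_frame_pairs]
                model = fit_model(area_pairs)             # dissimilar area: relearn
            area_models[(y, x)] = model
            out[y:y + tile, x:x + tile] = np.clip(
                cur_area + infer_high_freq(model, cur_area), 0, 255)
    return out, area_models
```

A weighting step in the spirit of claims 23 and 25, in which pairs more similar to the local area are used more often and pairs at or below a similarity threshold are skipped, could be slotted in before `fit_model`, but is omitted here to keep the sketch short.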
Application CN202310662284.XA, filed 2023-06-06 (priority date 2023-06-06): Video calling device based on face recognition. Published as CN116546155A; legal status: Withdrawn.

Priority Applications (1)

Application Number: CN202310662284.XA
Priority Date: 2023-06-06
Filing Date: 2023-06-06
Title: Video calling device based on face recognition

Publications (1)

Publication Number: CN116546155A
Publication Date: 2023-08-04

Family

ID=87445383

Family Applications (1)

Application Number: CN202310662284.XA (published as CN116546155A)
Title: Video calling device based on face recognition
Priority Date: 2023-06-06
Filing Date: 2023-06-06

Country Status (1)

Country: CN
Link: CN116546155A (en)

Legal Events

PB01 - Publication
SE01 - Entry into force of request for substantive examination
WW01 - Invention patent application withdrawn after publication (application publication date: 20230804)