WO2024047994A1 - Input information generation device, image processing device, input information generation method, learning device, program, and learning method for noise reduction device


Info

Publication number
WO2024047994A1
WO2024047994A1 (PCT/JP2023/021204)
Authority
WO
WIPO (PCT)
Prior art keywords
image
video
input
unit
images
Prior art date
Application number
PCT/JP2023/021204
Other languages
French (fr)
Japanese (ja)
Inventor
宏 能地
ピヤワト スワンウイタヤ
Original Assignee
LeapMind株式会社
Priority date
Filing date
Publication date
Priority claimed from JP2022137834A external-priority patent/JP2024033913A/en
Priority claimed from JP2022137843A external-priority patent/JP2024033920A/en
Application filed by LeapMind株式会社
Publication of WO2024047994A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/72 Data preparation, e.g. statistical preprocessing of image or video features
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/14 Picture signal circuitry for video frequency region
    • H04N5/21 Circuitry for suppressing or minimising disturbance, e.g. moiré or halo

Definitions

  • the present invention relates to an input information generation device, an image processing device, an input information generation method, a learning device, a program, and a learning method for a noise reduction device.
  • This application claims priority based on Japanese Patent Application No. 2022-137834 filed in Japan on August 31, 2022, and Japanese Patent Application No. 2022-137843 filed in Japan on August 31, 2022, the entire contents of which are incorporated herein by reference.
  • When capturing an image with an imaging device, if the amount of surrounding light is not sufficient or if settings of the imaging device such as shutter speed, aperture, or ISO sensitivity are inappropriate, the captured image may be of low quality.
  • the present invention aims to provide a technology that can convert a low-quality video to a high-quality video using lightweight calculations.
  • One aspect of the present invention is an input information generation device including: an image acquisition unit that acquires, as input images, a plurality of frames including at least a target frame for which input information is to be generated, among frames constituting a moving image; an input conversion unit that converts pixel values of the input images into input data with a smaller number of bits than the number of bits representing the pixel values of the input images; a synthesis unit that combines the plurality of converted input data into one composite data; and an output unit that outputs the combined composite data.
  • One aspect of the present invention is the input information generation device according to (1) above, in which the moving image is a color moving image, the image acquisition unit acquires pixel values of each color from one frame as a plurality of different images, and the input conversion unit converts each of the plurality of images acquired from one frame into the input data.
  • One aspect of the present invention is the input information generation device according to (1) or (2) above, in which the image acquisition unit acquires, among the frames constituting the video, images of a plurality of frames consecutively adjacent before and after the target frame.
  • One aspect of the present invention is the input information generation device according to any one of (1) to (3) above, in which, after the output unit outputs the composite data regarding the target frame, the image acquisition unit sets a frame adjacent to the target frame as the target frame and acquires a plurality of frames including at least the target frame as the input images.
  • One aspect of the present invention is the input information generation device described above, further including an integrating unit that integrates a plurality of adjacent frames, which are frames other than the target frame among the plurality of frames acquired by the image acquisition unit, into one integrated frame by performing an operation based on the pixel values of the adjacent frames, in which the input conversion unit converts the pixel values of the target frame into the input data and further converts the pixel values of the integrated frame into the input data, and the combining unit combines the plurality of input data converted based on the target frame and the plurality of input data converted based on the integrated frame into one composite data.
  • One aspect of the present invention is the input information generation device according to (5) above, in which the integrating unit sets the average value of the pixel values of the plurality of adjacent frames as the pixel value of the integrated frame.
  • One aspect of the present invention is the input information generation device according to (6) above, in which the integrating unit calculates the pixel value of the integrated frame by calculating a weighted average of the plurality of adjacent frames according to the temporal distance from the target frame.
  • One aspect of the present invention is the input information generation device according to (6) above, in which the integrating unit excludes frames with large brightness changes, among the frames constituting the video, from the targets for calculating the average value.
  • the input information generation device includes an average value temporary storage unit that stores an average value of pixel values of a predetermined frame among the frames constituting the moving image. Furthermore, the integrating unit calculates the pixel value of the integrated frame by calculation based on the value stored in the average value temporary storage unit and the target frame.
  • In one aspect, the input information generation device further includes an imaging condition acquisition unit that acquires the imaging conditions of the moving image, and an adjustment unit that adjusts the average value stored in the average value temporary storage unit according to the acquired imaging conditions.
  • In one aspect, the input information generation device further includes a comparison unit that compares the value stored in the average value temporary storage unit with the pixel value of the target frame. When the comparison by the comparison unit shows that the difference is less than or equal to a predetermined value, the integrating unit calculates the pixel value of the integrated frame as a moving average based on the value stored in the average value temporary storage unit and the target frame, and when the difference is not less than or equal to the predetermined value, the integrating unit sets the pixel value of the target frame as the pixel value of the integrated frame.
  • One aspect of the present invention is the input information generation device according to (5) above, in which the integrating unit uses frames randomly selected from among the adjacent frames acquired by the image acquisition unit as the integrated frame.
  • One aspect of the present invention is an image processing apparatus including the input information generation device according to any one of (1) to (12) above, and a convolutional neural network that uses the composite data output by the input information generation device as input information.
  • One aspect of the present invention is an input information generation method including: an image acquisition step of acquiring, as input images, a plurality of frames including at least a target frame for which input information is to be generated, among frames constituting a moving image; an input conversion step of converting pixel values of the input images into input data with a smaller number of bits than the number of bits representing the pixel values of the input images; a synthesis step of combining the plurality of converted input data into one composite data; and an output step of outputting the combined composite data.
  • (A1) One aspect of the present invention is a learning device including: an image acquisition unit that acquires first image information including at least one image, and second image information that captures the same subject as the subject captured in the image included in the first image information and that includes at least one image of lower image quality than the image included in the first image information; a video information generation unit that cuts out a plurality of images at different positions that are part of the acquired first image information and combines the plurality of cut-out images to generate first video information, and cuts out a plurality of images at different positions that are part of the acquired second image information and combines the plurality of cut-out images to generate second video information; and a learning unit that learns to infer a high-quality video from a low-quality video based on teacher data that includes the first video information and the second video information generated by the video information generation unit.
  • One aspect of the present invention is the learning device according to (A1) above, in which the second image information includes a plurality of images that capture the same subject as the subject captured in the image included in the first image information, each with different noise superimposed, and the video information generation unit generates the second video information by cutting out different portions from each of the plurality of images included in the second image information.
  • One aspect of the present invention is the learning device according to (A1) or (A2) above, in which the plurality of images included in the second image information are images captured at different times that are close to each other.
  • One aspect of the present invention is the learning device according to any one of (A1) to (A3) above, in which the video information generation unit generates the first video information by cutting out different portions from one image included in the first image information.
  • One aspect of the present invention is the learning device according to any one of (A1) to (A4) above, in which the video information generation unit cuts out the plurality of images at positions shifted in a predetermined direction so that the cut-out images are at different positions.
  • In one aspect, the video information generation unit cuts out the plurality of images at positions shifted by a predetermined number of bits in a predetermined direction.
  • One aspect of the present invention is the learning device according to (A6) above, in which the predetermined direction in which the video information generation unit cuts out the images is calculated by affine transformation.
  • One aspect of the present invention is the learning device according to (A6) above, further including a trajectory vector acquisition unit that acquires a trajectory vector, in which the predetermined direction in which the video information generation unit cuts out the images is calculated based on the acquired trajectory vector.
  • One aspect of the present invention is a learning device including: an image acquisition unit that acquires image information including at least one image; a cutting unit that cuts out a plurality of images at different positions that are part of the acquired image information; a first video information generation unit that combines the plurality of cut-out images to generate first video information; a noise superposition unit that superimposes noise on each of the plurality of images cut out by the cutting unit; a second video information generation unit that combines the plurality of images on which noise has been superimposed to generate second video information; and a learning unit that learns to infer a high-quality video from a low-quality video based on teacher data that includes the first video information generated by the first video information generation unit and the second video information generated by the second video information generation unit.
  • One aspect of the present invention is a program that causes a computer to execute: an image acquisition step of acquiring first image information including at least one image, and second image information that captures the same subject as the subject captured in the image included in the first image information and that includes at least one image of lower image quality than the image included in the first image information; a video information generation step of cutting out a plurality of images at different positions that are part of the acquired first image information, combining the plurality of cut-out images to generate first video information, cutting out a plurality of images at different positions that are part of the acquired second image information, and combining the plurality of cut-out images to generate second video information; and a learning step of learning to infer a high-quality video from a low-quality video based on teacher data including the first video information and the second video information generated by the video information generation step.
  • One aspect of the present invention is a learning method including: an image acquisition step of acquiring first image information including at least one image, and second image information that captures the same subject as the subject captured in the image included in the first image information and that includes at least one image of lower image quality than the image included in the first image information; a video information generation step of cutting out a plurality of images at different positions that are part of the acquired first image information, combining the plurality of cut-out images to generate first video information, cutting out a plurality of images at different positions that are part of the acquired second image information, and combining the plurality of cut-out images to generate second video information; and a learning step of learning to infer a high-quality video from a low-quality video based on teacher data including the generated first video information and second video information.
  • FIG. 1 is a block diagram illustrating an example of a functional configuration of a high-quality video generation system according to a first embodiment.
  • FIG. 1 is a diagram illustrating an example of a convolutional neural network according to the first embodiment.
  • FIG. 3 is a diagram illustrating frames constituting a moving image according to the first embodiment.
  • FIG. 2 is a diagram for explaining an overview of an input information generation method according to the first embodiment.
  • FIG. 2 is a block diagram illustrating an example of a functional configuration of input information generation according to the first embodiment.
  • FIG. 2 is a block diagram illustrating an example of a functional configuration of an input conversion section according to the first embodiment.
  • FIG. 7 is a diagram for explaining an overview of an input information generation method according to a second embodiment.
  • FIG. 3 is a block diagram illustrating an example of a functional configuration of input information generation according to a second embodiment.
  • FIG. 7 is a block diagram illustrating an example of a functional configuration of input information generation according to a third embodiment.
  • FIG. 7 is a block diagram illustrating an example of a functional configuration of input information generation according to a fourth embodiment.
  • A diagram for explaining an overview of a learning system according to a fifth embodiment.
  • A diagram showing an example of a functional configuration of a learning device according to the fifth embodiment.
  • FIG. 12 is a diagram for explaining an example of the position of an image cut out from a high-quality image by the learning device according to the fifth embodiment.
  • FIG. 12 is a diagram for explaining an example of the position of an image cut out from a low-quality image by the learning device according to the fifth embodiment.
  • FIG. 12 is a diagram for explaining an example of a direction in which a learning device according to a fifth embodiment cuts out.
  • FIG. 12 is a diagram illustrating an example of a functional configuration of a learning device according to a fifth embodiment when the learning device generates a moving image based on a trajectory vector.
  • FIG. 12 is a diagram for explaining an example of the position of an image cut out from a still image when a learning device according to a modification of the fifth embodiment generates a moving image based on a trajectory vector.
  • FIG. 12 is a flowchart illustrating an example of a series of operations of a learning method of a noise reduction device according to a modification of the fifth embodiment.
  • A diagram for explaining an overview of a learning system according to a sixth embodiment.
  • A diagram showing an example of a functional configuration of a video information generation unit according to the sixth embodiment.
  • the input information generation device, image processing device, and input information generation method according to the present embodiment receive low-quality video information with superimposed noise as input and generate high-quality video information from which noise has been removed.
  • Low-quality videos include videos with low image quality.
  • High-quality videos include videos with high image quality.
  • An example of a high-quality moving image is a moving image with high image quality captured by low ISO sensitivity and long exposure.
  • An example of a low-quality moving image is a moving image with low image quality captured by high ISO sensitivity and short exposure.
  • In the following, image quality deterioration due to noise will be described as an example of a low-quality video, but the present embodiment is widely applicable to factors other than noise that degrade the quality of a video.
  • Factors that can degrade video quality include a decrease in resolution or color shift due to optical aberrations, a decrease in resolution due to camera shake or subject shake, uneven black levels due to dark current or circuits, ghosts and flare due to high-brightness subjects, and signal level abnormalities.
  • noise includes streak-like noise that occurs in the horizontal or vertical direction of an image, noise that occurs in a fixed pattern in an image, and the like. Further, noise specific to moving images, such as flicker-like noise that fluctuates between consecutive frames, may be included.
  • the input information generation device, image processing device, and input information generation method according to the present embodiment improve the quality of a video by performing image processing on each frame included in the video to improve the image quality of each frame.
  • a video captured by an imaging device may be used, or a video prepared in advance may be used.
  • a low-quality video may be referred to as a low-image-quality video or a noise video.
  • a high-quality video may be referred to as a high-image-quality video.
  • the video targeted by the input information generation device, image processing device, and input information generation method according to the present embodiment may be a video captured by a CCD camera using a CCD (Charge Coupled Device) image sensor.
  • the video targeted by the input information generation device, image processing device, and input information generation method according to the present embodiment may also be a video captured by a CMOS camera using a CMOS (Complementary Metal Oxide Semiconductor) image sensor.
  • the video targeted by the input information generation device, image processing device, and input information generation method according to the present embodiment may be a color video or a monochrome video.
  • the video targeted by the input information generation device, image processing device, and input information generation method according to the present embodiment may be a video captured by an infrared camera using an infrared sensor or the like that acquires non-visible light components.
  • FIG. 1 is a block diagram showing an example of the functional configuration of a high-quality video generation system according to the first embodiment.
  • An example of the functional configuration of the high-quality video generation system 1 will be described with reference to the same figure.
  • the high-quality video generation system 1 includes an imaging device 100, an input information generation device 10, and a convolutional neural network 200 (hereinafter referred to as "CNN 200") as its functions.
  • the input information generation device 10 and CNN 200 perform image processing on each frame constituting the moving image captured by the imaging device 100.
  • the input information generation device 10 and the CNN 200 include trained models that have been trained in advance.
  • a configuration including the input information generation device 10 and CNN 200 may be referred to as an image processing device 2.
  • the high-quality moving image generation system 1 may be configured to include an encoding unit that compresses and encodes the output of the image processing device 2, and a predetermined memory that holds the results compressed and encoded by the encoding unit.
  • the imaging device 100 images a moving image.
  • the moving image captured by the imaging device 100 is a low-quality moving image that is subject to quality improvement.
  • the imaging device 100 may be, for example, a surveillance camera installed in a dark (low amount of light) location.
  • the imaging device 100 images a low-quality moving image due to insufficient light, for example.
  • the imaging device 100 outputs the captured moving image to the input information generation device 10.
  • a moving image captured by the imaging device 100 becomes an input to the image processing device 2. Therefore, the video output from the imaging device 100 to the input information generation device 10 may be referred to as video information IM.
  • both the imaging device 100 and the image processing device 2 may exist within a housing of a smartphone, a tablet terminal, or the like. That is, the high-quality video generation system 1 may exist as an element constituting an edge device. Further, the imaging device 100 may be connected to the image processing device 2 via a predetermined communication network. That is, the high-quality video generation system 1 may exist by having components connected to each other via a predetermined communication network. Further, the imaging device 100 may be configured to include a plurality of lenses and a plurality of image sensors respectively corresponding to the plurality of lenses. As a specific example of such a configuration, the imaging device 100 may include a plurality of lenses and image sensors so as to acquire images with different angles of view.
  • the images acquired from the respective image sensors can be said to be spatially adjacent to each other.
  • the high-quality video generation system 1 is applicable not only to a plurality of temporally adjacent images such as a video, but also to a plurality of spatially adjacent images.
  • the input information generation device 10 acquires video information IM from the imaging device 100.
  • the input information generation device 10 generates input information IN based on the acquired video information IM.
  • Input information IN is generated for each frame that constitutes moving image information IM.
  • the input information IN may be generated based on a target frame and other frames determined based on the target frame.
  • the other frame determined based on the frame may be a frame temporally adjacent to the target frame.
  • the CNN 200 is a convolutional neural network that uses the data output by the input information generation device 10 as input information IN. An example of the CNN 200 will be described with reference to FIG. 2.
  • FIG. 2 is a diagram showing an example of the CNN 200 according to the first embodiment. Details of the CNN 200 will be explained in detail with reference to the figure.
  • CNN 200 is a neural network with a multilayer structure.
  • the CNN 200 is a multilayer network including an input layer 210 to which input information IN is input, a convolution layer 220 to perform convolution operations, a pooling layer 230 to perform pooling, and an output layer 240. In at least a portion of the CNN 200, the convolution layer 220 and the pooling layer 230 are alternately connected.
  • The CNN 200 is a model widely used for image recognition and video recognition.
  • the CNN 200 may further include layers having other functions, such as a fully connected layer.
  • the pooling layer 230 may include a quantization layer that performs a quantization operation to reduce the number of bits of the operation result of the convolution layer 220. Specifically, when the result of the convolution operation in the convolution layer 220 is 16 bits, the quantization layer performs an operation to reduce the number of bits of that result to 8 bits or less.
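  • As a rough illustration only (not the patent's implementation), such a bit-reducing quantization step might look like the following sketch; the right-shift scaling is an assumption.

```python
import numpy as np

def quantization_layer(conv_out: np.ndarray) -> np.ndarray:
    """Hypothetical quantization layer: reduce 16-bit convolution results to
    8 bits or less. A simple arithmetic right shift by 8 bits is assumed here;
    the actual scaling used by the quantization layer is not specified."""
    return np.clip(conv_out.astype(np.int32) >> 8, 0, 255).astype(np.uint8)
```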
  • the CNN 200 may adopt a configuration in which the outputs of each of the plurality of convolutional layers 220 and pooling layers 230 included in the CNN 200 are used as intermediate outputs and inputs to other layers.
  • the CNN 200 may configure a U-net by using the outputs of the plurality of convolutional layers 220 and pooling layers 230 included in the CNN 200 as intermediate outputs and inputs to other layers.
  • the CNN 200 includes an encoder section that extracts a feature amount by a convolution operation, and a decoder section that performs a deconvolution operation based on the extracted feature amount.
  • Input information IN is input to the input layer 210.
  • Input information IN is generated based on the input image.
  • the input image is a frame image that constitutes a moving image.
  • the input information generation device 10 according to this embodiment generates input information IN from an input image.
  • the elements of the input information IN may be, for example, 2-bit unsigned integers (0, 1, 2, 3).
  • the elements of the input data may be, for example, 4-bit or 8-bit integers.
  • the convolution layer 220 performs a convolution operation on the input information IN input to the input layer 210.
  • the convolution layer 220 performs a convolution operation on the low-bit input information IN.
  • the convolution layer 220 outputs predetermined output data to the pooling layer 230 as a result of performing a predetermined convolution operation.
  • the pooling layer 230 extracts a representative value of a certain area based on the result of the convolution operation performed by the convolution layer 220. Specifically, the pooling layer 230 compresses the output data of the convolution layer 220 by performing an operation such as average pooling or MAX pooling on the output data of the convolution operation output by the convolution layer 220.
  • the output layer 240 is a layer that outputs the results of the CNN 200.
  • the output layer 240 may output the results of the CNN 200 using, for example, an identity function or a softmax function.
  • the layer provided before the output layer 240 may be the convolution layer 220, the pooling layer 230, or another layer.
  • FIG. 3 is a diagram showing frames constituting a moving image according to the first embodiment.
  • frames used by the input information generation device 10 to generate input information IN will be described.
  • the figure shows a plurality of consecutive frames constituting a moving image.
  • Frames F1 to F7 shown in the figure are examples of a plurality of consecutive frames constituting a moving image.
  • each frame is a RAW image that has not been compressed and encoded, and each pixel is expressed with 12 or 14 bits.
  • the number of pixels in each frame is the number of pixels necessary to satisfy a predetermined video format such as 1920x1080 or 4096x2160.
  • the processing target of the CNN 200 will be described as a RAW image, but the processing target is not limited to this. If the image to be processed contains sufficient signal components, the image that has been subjected to processing such as compression encoding may be used as the target.
  • the input information generation device 10 generates input information IN based on a target frame TF, which is a target frame, and an adjacent frame AF, which is a frame adjacent to the target frame TF.
  • the adjacent frame AF is, for example, a frame that is consecutively adjacent before or after the target frame TF.
  • two frames before and after the target frame TF are set as adjacent frames AF. That is, when the target frame TF is the frame F4, the frame F2, the frame F3, the frame F5, and the frame F6 are the adjacent frames AF.
  • the number of adjacent frames AF is not limited to this example, and may be one frame before and after the target frame TF, or three frames before and after the target frame TF. Further, the adjacent frames AF are not limited to the example of adjacent frames before and after the target frame TF, but may be only frames adjacent to either the front or the rear of the target frame TF, for example. Furthermore, the adjacent frame AF does not need to be continuous with the target frame TF; for example, when frame F4 is the target frame TF, the adjacent frame AF may be frames F2, F6, etc. that are not continuous with frame F4.
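  • For illustration only, a minimal sketch of how a target frame TF and its adjacent frames AF might be selected from a frame buffer, assuming two consecutive adjacent frames on each side as in the example above (the helper name and the edge handling are assumptions):

```python
def select_frames(frames, target_idx, n_adjacent=2):
    """Return (target frame TF, adjacent frames AF) for a given target index.

    frames: list of frame arrays making up the video.
    n_adjacent: number of consecutive frames taken before and after the target.
    Frames falling outside the video are skipped; this edge handling is an
    assumption, since the text does not specify it.
    """
    target = frames[target_idx]
    adjacent = [
        frames[i]
        for i in range(target_idx - n_adjacent, target_idx + n_adjacent + 1)
        if i != target_idx and 0 <= i < len(frames)
    ]
    return target, adjacent
```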
  • FIG. 4 is a diagram for explaining an overview of the input information generation method according to the first embodiment.
  • a method for generating input information IN by the input information generation device 10 will be described with reference to the same figure.
  • the figure shows a frame at time t-2, a frame at time t-1, a frame at time t, a frame at time t+1, and a frame at time t+2.
  • the frame at time t corresponds to the above-described target frame TF
  • the frames at time t-2, time t-1, time t+1, and time t+2 correspond to adjacent frames AF.
  • Since each frame includes a large number of pixels, a circuit that processes the entire frame at once becomes large-scale. Therefore, when processing is performed in the CNN 200, it is preferable to divide each frame into predetermined sizes. In this embodiment, as an example, a case will be described in which each frame is divided into a plurality of patches each having a size of 256×256.
  • Each frame is configured to include image data of 4 channels: R (Red), G (Green) × 2 channels, and B (Blue).
  • the input information generation device 10 performs quantization and vectorization of each channel, converting each channel into, for example, nine channels of data.
  • the nine-channel data may be data in which pixel values are quantized using different threshold values.
  • the input information generation device 10 combines the data of the five frames (the target frame and the four adjacent frames), each consisting of 4 channels × 9 = 36 channels, and outputs the combined 180-channel data to the input layer of the CNN 200 as input information IN.
  • the input information generation device 10 may generate the input information IN based on three-channel data including RGB, for example. Further, in the illustrated example, a case has been described in which 9-channel data is generated by performing quantization and vectorization based on image data, but the aspect of the present embodiment is not limited to this example. It is preferable that the number of data to be generated is a number that allows efficient calculation after synthesis.
  • the input information generation device 10 may generate M channels of data (M is a natural number of 1 or more) for each of the N channels (N is a natural number of 1 or more) of image data constituting one frame.
  • the value of N × M is preferably a value close to a multiple of 32 (or 64).
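  • The channel accounting described above (5 frames × 4 Bayer channels × 9 quantized channels = 180 channels) can be sketched roughly as follows; the threshold values, the numpy-based form, and the helper names are assumptions, not the patent's implementation:

```python
import numpy as np

# Placeholder thresholds: nine independent 2-bit quantizations (three sorted
# thresholds each) per Bayer channel. The actual threshold values are not
# given in the text and would normally be predefined or learned.
THRESHOLD_SETS = [np.array([512, 1536, 3072]) + 100 * i for i in range(9)]

def frame_to_input_channels(bayer_channels):
    """Convert one frame's four Bayer channels (R, G, G, B) of 12-bit data
    into 4 x 9 = 36 channels of 2-bit input data (values 0..3)."""
    planes = []
    for ch in bayer_channels:                 # 4 channels per frame
        for th in THRESHOLD_SETS:             # 9 quantizations per channel
            planes.append(np.digitize(ch, th).astype(np.uint8))
    return np.stack(planes)                   # shape: (36, H, W)

def build_input_information(frames_bayer):
    """Concatenate the converted data of five frames (target frame plus four
    adjacent frames) into one composite array of 5 x 36 = 180 channels."""
    return np.concatenate([frame_to_input_channels(f) for f in frames_bayer])
```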
  • FIG. 5 is a block diagram illustrating an example of a functional configuration for generating input information according to the first embodiment.
  • the input information generation device 10 includes an image acquisition section 11, an input conversion section 12, a composition section 13, and an output section 14.
  • the input information generation device 10 includes a CPU (Central Processing Unit) (not shown), a storage device such as a ROM (Read only memory) or a RAM (Random access memory), etc., which are connected via a bus.
  • the input information generation device 10 functions as a device including an image acquisition section 11, an input conversion section 12, a composition section 13, and an output section 14 by executing an input information generation program.
  • the input information generation device 10 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field-Programmable Gate Array).
  • the input information generation program may be recorded on a computer-readable recording medium.
  • the computer-readable recording medium is, for example, a portable medium such as a flexible disk, magneto-optical disk, ROM, or CD-ROM, or a storage device such as a hard disk built into a computer system.
  • the input information generation program may be transmitted via a telecommunications line.
  • a memory in which moving image information IM of a moving image captured by the imaging device 100 is stored is referred to as a first memory M1
  • a memory in which input information IN generated by the input information generation device 10 is stored is referred to as a second memory M2.
  • the first memory M1 and the second memory M2 are storage devices such as ROM or RAM.
  • the image acquisition unit 11 acquires image information IMD that includes the input image used for processing from among the video information IM stored in the first memory M1. Specifically, the image acquisition unit 11 acquires, as input images, a plurality of frames including at least a target frame TF for which input information IN is to be generated, among a plurality of frames constituting a moving image. For example, the image acquisition unit 11 acquires an adjacent frame AF as an input image in addition to the target frame TF.
  • the adjacent frames AF may be a plurality of frames that are consecutively adjacent to each other before and after the target frame TF.
  • the image acquisition unit 11 acquires pixel values of each color from one frame as a plurality of different images. For example, if the imaging device 100 uses an image sensor employing a Bayer array, the image acquisition unit 11 acquires four channels of RGGB image information from one frame. The pixel value of the image acquired by the image acquisition unit 11 includes multi-bit elements.
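  • As a minimal sketch only, splitting a Bayer-pattern RAW frame into four RGGB planes might look like the following; the RGGB pixel layout is an assumption, since the actual sensor layout is not specified:

```python
import numpy as np

def split_bayer_rggb(raw: np.ndarray):
    """Split a Bayer RAW frame (H x W) into four half-resolution planes.
    Assumes an RGGB layout (R at even rows/even columns, B at odd/odd);
    the actual sensor layout is not specified in the text."""
    r  = raw[0::2, 0::2]
    g1 = raw[0::2, 1::2]
    g2 = raw[1::2, 0::2]
    b  = raw[1::2, 1::2]
    return [r, g1, g2, b]
```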
  • the input conversion unit 12 acquires image information IMD from the image acquisition unit 11.
  • the input conversion unit 12 converts the pixel values of the input images of the multiple frames included in the image information IMD into low-bit input data IND based on comparisons with multiple threshold values. Since the input image is a RAW image whose pixel values include multi-bit (for example, 12-bit or 14-bit) elements, the input conversion unit 12 converts the pixel values of the input image, based on the plurality of threshold values, into input data IND with a smaller number of bits (for example, 2 bits or 1 bit, i.e., 8 bits or less).
  • the input conversion unit 12 outputs the converted input data IND to the synthesis unit 13.
  • the input conversion unit 12 performs conversion for each color. That is, the input conversion unit 12 converts each of the plurality of images acquired from one frame into input data IND.
  • FIG. 6 is a block diagram showing an example of the functional configuration of the input conversion section according to the first embodiment.
  • the input conversion section 12 includes a plurality of conversion sections 121 and a threshold storage section 122.
  • the input conversion unit 12 includes a conversion unit 121-1, a conversion unit 121-2, ..., a conversion unit 121-n (n is a natural number of 1 or more) as an example of the plurality of conversion units 121. Equipped with.
  • the number of converters 121 included in the input converter 12 may be the number of input data IND that the input converter 12 generates from one channel of input images. That is, when the input converter 12 converts a one-channel input image into nine-channel input data IND, the input converter 12 includes nine converters 121, converters 121-1 to 121-9.
  • the image data of the input image has a matrix-like data structure in which pixel data is multivalued, with each element in the x-axis direction and y-axis direction having more than 8 bits.
  • each element is quantized and becomes low-bit input data (for example, 2 bits or 1 bit, i.e., 8 bits or less).
  • the conversion unit 121 compares each element of the input image with a predetermined threshold.
  • the conversion unit 121 quantizes each element of the input image based on the comparison result.
  • the conversion unit 121 quantizes, for example, a 12-bit input image into a 2-bit or 1-bit value.
  • the conversion unit 121 may perform quantization by comparison with a number of threshold values corresponding to the number of bits after conversion. For example, for conversion to 1 bit, one threshold value is sufficient, and for conversion to 2 bits, three threshold values may be used. In other words, one threshold value may be used when the quantization performed by the conversion unit 121 is 1-bit quantization, and three threshold values may be used when it is 2-bit quantization. Note that if a large number of threshold values would be required, such as for 8-bit quantization, quantization may be performed using a function, a table, or the like instead of threshold values.
  • Each conversion unit 121 performs quantization on the same element using independent thresholds.
  • the input converter 12 outputs a vector including elements corresponding to the number of converters 121 as the calculation result (input data IND) for one channel of input.
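  • A rough sketch of a single conversion unit and of applying several conversion units with independent thresholds to the same channel is shown below; the threshold values are placeholders and the numpy-based form is an assumption:

```python
import numpy as np

def conversion_unit(channel: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """One conversion unit 121: quantize every element of a channel by
    comparison with sorted thresholds. One threshold yields 1-bit output
    (0 or 1); three thresholds yield 2-bit output (0 to 3)."""
    return np.digitize(channel, thresholds).astype(np.uint8)

def input_conversion(channel: np.ndarray, per_unit_thresholds) -> np.ndarray:
    """Apply n conversion units with independent thresholds to the same
    channel and stack the results into a vector of n low-bit planes
    (the input data IND for that channel)."""
    return np.stack([conversion_unit(channel, th) for th in per_unit_thresholds])

# Example: nine conversion units, each performing 2-bit quantization with
# three placeholder thresholds on 12-bit pixel values.
example_thresholds = [np.array([256, 1024, 3072]) + 64 * i for i in range(9)]
```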
  • the bit precision of the converted result, which is the output of the conversion unit 121, may be changed as appropriate based on the bit precision of the input image.
  • the threshold storage unit 122 stores a plurality of threshold values used in calculations performed by the conversion unit 121.
  • the threshold value stored in the threshold storage unit 122 is a predetermined value, and is set corresponding to each of the plurality of conversion units 121. Note that each threshold value may be a learning target parameter, and may be determined and updated in the learning step.
  • the mode of the input converter 12 is not limited to this.
  • When the input image is image data that includes elements of three or more channels including color components, the conversion units 121 may be divided into a plurality of groups corresponding to the channels, and the corresponding elements may be input to and converted by each group.
  • Further, some kind of conversion processing may be applied in advance to the elements to be input to a predetermined conversion unit 121, or the conversion unit 121 to which the elements are input may be switched depending on the presence or absence of such pre-processing.
  • the number of conversion units 121 does not need to be fixed, and may be determined as appropriate depending on the structure of the neural network or hardware information. Note that if it is necessary to compensate for a decrease in calculation accuracy due to quantization by the conversion units 121, it is preferable to set the number of conversion units 121 to be greater than or equal to the bit precision of each element of the input image. More generally, it is preferable to set the number of conversion units 121 to be greater than the difference in bit precision of the input image before and after quantization. Specifically, when quantizing an input image whose pixel values are represented by 8 bits to 1 bit, it is preferable to set the number of conversion units 121 to 7 or more (for example, 16 or 32), corresponding to the 7-bit difference.
  • the combining unit 13 combines (concats) the plurality of converted input data IND into one data.
  • Data obtained by combining a plurality of input data is also referred to as composite data CD.
  • the combining process by the combining unit 13 may be a process of arranging (or connecting) a plurality of input data IND into one data.
  • the output section 14 outputs the composite data CD synthesized by the synthesis section 13.
  • the composite data CD may be temporarily stored in the second memory M2.
  • the composite data CD is, in other words, the input information IN input to the input layer 210 of the CNN 200.
  • After generating the input information IN for the target frame TF, the input information generation device 10 generates input information IN for the next frame after the target frame TF.
  • the next frame may be a frame temporally continuous with the target frame TF. That is, after the output unit 14 outputs the composite data CD regarding the target frame TF, the image acquisition unit 11 shifts the target frame TF by one frame and sets a frame adjacent to the previous target frame TF as the new target frame TF. In this way, the input information generation device 10 acquires a plurality of frames including at least the target frame TF as input images and generates the composite data CD.
  • the input information generation device 10 generates input information IN for all frames included in the video information IM. Note that, in the example described above, the input information generation device 10 generates input information IN for all frames included in the video information IM, but the aspect of the present embodiment is not limited to this example. For example, the input information generation device 10 may generate input information IN every predetermined number of frames. Furthermore, although the high-quality video generation system 1 converts the video into a high-quality video based on the video information IM, the output format is not limited to the video format. For example, the high-quality video generation system 1 may generate still images from videos. That is, the high-quality video generation system 1 can apply this embodiment even when extracting frames included in the video information IM to generate still images, by using the frames extracted from the video as target frames TF.
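  • A minimal sketch of sliding the target frame TF over the whole video, assuming the hypothetical build_input_information helper from the earlier sketch and a window of two adjacent frames on each side:

```python
def generate_all_input_information(frames_bayer, n_adjacent=2):
    """Slide the target frame TF over the whole video and build the composite
    data (input information IN) for every frame, using the hypothetical
    build_input_information helper from the earlier sketch. Near the edges of
    the video the window simply shrinks, which is an assumption."""
    results = []
    for t in range(len(frames_bayer)):
        lo = max(0, t - n_adjacent)
        hi = min(len(frames_bayer), t + n_adjacent + 1)
        results.append(build_input_information(frames_bayer[lo:hi]))
    return results
```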
  • As described above, by including the image acquisition unit 11, the input information generation device 10 acquires, as input images, a plurality of frames including at least the target frame TF for which the input information IN is to be generated, among the frames constituting the moving image.
  • By including the input conversion unit 12, the input information generation device 10 converts the pixel values of the acquired input images of the plurality of frames into input data IND with a smaller number of bits (for example, 2 bits or 1 bit, i.e., 8 bits or less) than the number of bits representing the pixel values of the input image (for example, 12 bits).
  • The input information generation device 10 also includes the synthesizing section 13, which synthesizes the plurality of converted input data IND into one composite data CD, and the output section 14, which outputs the synthesized composite data CD. That is, according to the present embodiment, the input information generation device 10 generates input information IN by combining a plurality of input data IND obtained from a plurality of images including at least the target frame TF. The generated input information IN is input to the input layer 210 of the CNN 200.
  • the input information IN is information with lower bits than the input image in each element.
  • the CNN 200 can process low-bit information. Therefore, according to this embodiment, the processing of the CNN 200 can be reduced in weight.
  • the input information IN is information generated based on a plurality of images. Therefore, when improving the quality of a video based on the input information IN, it becomes possible to perform processing that also takes into account multiple frames adjacent to the target frame TF, making it possible to remove noise with high precision. Therefore, according to this embodiment, it is possible to convert a low-quality video into a high-quality video with lightweight calculations. Note that the processing of the CNN 200 is not limited to noise removal.
  • the moving image information IM includes, for example, pixel values of each color of RGB.
  • the image acquisition unit 11 acquires pixel values of each color from one frame as a plurality of different images, and the input conversion unit 12 converts each of the plurality of images acquired from one frame into different input data IND. Therefore, according to this embodiment, more accurate image processing can be performed, and noise can be removed with even higher accuracy.
  • the image acquisition unit 11 acquires images of a plurality of frames consecutively adjacent to each of the front and rear of the target frame TF, among the plurality of frames constituting the moving image.
  • the input information IN is generated based on the information of the frames adjacent before and after the target frame TF, so that more accurate image processing can be performed. Therefore, according to this embodiment, noise can be removed with high accuracy.
  • After the composite data CD regarding the target frame TF is output, the image acquisition unit 11 sets a frame adjacent to the target frame TF as the target frame TF and acquires a plurality of frames including at least the target frame TF as input images. That is, the input information generation device 10 generates input information IN for each of the plurality of frames included in the video by shifting the target frame TF one after another. Therefore, according to this embodiment, a low-quality video can be converted into a high-quality video.
  • the input information generation device 10 reads a plurality of frames including a target frame TF and an adjacent frame AF, and performs quantization on each of the plurality of frames. Therefore, according to the input information generation device 10 according to the first embodiment, the number of frames to be quantized is large, and the calculation load on the first layer becomes large. The second embodiment attempts to solve this problem and further lighten the calculation load.
  • FIG. 7 is a diagram for explaining an overview of the input information generation method according to the second embodiment.
  • a method for generating input information IN by the input information generation device 10A according to the second embodiment will be described with reference to the same figure.
  • a frame at time t is shown as the target frame TF.
  • a frame at time t-2, a frame at time t-1, a frame at time t+1, and a frame at time t+2 are shown as adjacent frames AF.
  • Each frame is configured to include image data of four channels of RGGB.
  • the input information generation device 10A performs quantization and vectorization of each channel for the target frame TF.
  • the input information generation device 10A performs quantization and vectorization of each channel on the average image of the adjacent frames AF. That is, the second embodiment differs from the first embodiment in that the average image of the adjacent frames AF is quantized and vectorized, instead of quantizing and vectorizing each adjacent frame AF individually.
  • the input information generation device 10A converts each of the target frame TF and the average image of the adjacent frame AF into 64 channels of data, so a total of 128 channels of data is generated.
  • the averaging process may be performed by calculating a simple average of pixel values. Furthermore, the input information generation device 10A may generate an average image for each color by calculating the average for each color. That is, the input information generation device 10A may generate an average image based on the R images of the adjacent frames AF from time t-2 to time t+2, an average image based on the G images of the adjacent frames AF from time t-2 to time t+2, and an average image based on the B images of the adjacent frames AF from time t-2 to time t+2.
  • quantization and vectorization are performed after calculating the average of adjacent frames AF, but the aspect of the present embodiment is not limited to this example.
  • the input information generation device 10A may be configured to calculate the average after performing quantization and vectorization, for example.
  • input information IN is generated by combining these.
  • 128 channels of data are generated.
  • the combined input information IN is smaller than the amount of information according to the first embodiment.
  • the amount of data representing the target frame TF is increased compared to the first embodiment.
  • FIG. 8 is a block diagram illustrating an example of a functional configuration for generating input information according to the second embodiment.
  • An example of the functional configuration of the input information generation device 10A will be described with reference to the same figure.
  • the input information generation device 10A differs from the input information generation device 10 in that it further includes an integration section 15, an input conversion section 12A instead of the input conversion section 12, and a synthesis section 13A instead of the synthesis section 13.
  • the same components as the input information generation device 10 may be given the same reference numerals and the description thereof may be omitted.
  • the integrating unit 15 acquires information regarding adjacent frames AF from the image acquiring unit 11.
  • the adjacent frame AF is a frame other than the target frame TF among the plurality of frames acquired by the image acquisition unit 11.
  • the integrating unit 15 performs a process of integrating a plurality of adjacent frames AF into one integrated frame IF by performing calculations based on the pixel values of the adjacent frames AF.
  • the integrating unit 15 outputs the integrated frame IF obtained as a result of the calculation to the input converting unit 12A.
  • the integrating unit 15 may perform the integrating process by, for example, obtaining a simple average of adjacent frames AF.
  • the integrating unit 15 takes, for example, the average value of the pixel values of the plurality of adjacent frames AF as the pixel value of the integrated frame IF.
  • the aspect of the integration process of the integration unit 15 is not limited to the example of calculating a simple average.
  • the integrating unit 15 may perform the integrating process by calculating a weighted average, for example.
  • the weighted average is calculated according to the temporal distance from the target frame TF.
  • For example, the integrating unit 15 may reduce the weight of the frames at time t-2 and time t+2, which are temporally distant from time t, by multiplying their pixel values by 0.7 (a value smaller than 1), and increase the weight of the frames at time t-1 and time t+1 by multiplying their pixel values by 1.3 (a value greater than 1).
  • the integrating unit 15 may calculate the pixel value of the integrated frame IF by calculating a weighted average of the plurality of adjacent frames AF according to the temporal distance from the target frame TF. By integrating using a weighted average, the degree of contribution of the adjacent frames AF to the target frame TF can be reflected in the integrated frame IF. Note that even when the temporal distance from the target frame TF is the same, the weight to be multiplied may differ depending on whether the adjacent frame is before or after the target frame TF. For example, if the adjacent frame AF is before the target frame TF, the weight may be increased, and if the adjacent frame AF is after the target frame TF, the weight may be decreased.
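  • A rough sketch of the integration step is shown below, covering both the simple average and the weighted average according to temporal distance; the 0.7 / 1.3 weights mirror the example above, and the normalization by the weight sum is an assumption:

```python
import numpy as np

def integrate_adjacent_frames(adjacent_frames, weights=None):
    """Integrate the adjacent frames AF into one integrated frame IF.

    With weights=None a simple average is taken; otherwise a weighted average
    reflecting the temporal distance from the target frame is computed
    (np.average normalizes by the sum of the weights, which is an assumption).
    """
    stack = np.stack([f.astype(np.float32) for f in adjacent_frames])
    if weights is None:
        return stack.mean(axis=0)
    return np.average(stack, axis=0, weights=np.asarray(weights, dtype=np.float32))

# Frames at t-2, t-1, t+1, t+2: distant frames weighted 0.7, near frames 1.3,
# mirroring the example values in the text.
# integrated = integrate_adjacent_frames([f_m2, f_m1, f_p1, f_p2],
#                                        weights=[0.7, 1.3, 1.3, 0.7])
```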
  • the input conversion unit 12A converts the pixel value of the target frame TF acquired from the image acquisition unit 11 into input data IND based on comparison with a plurality of threshold values. Further, the input conversion unit 12A converts the pixel values of the integrated frame IF obtained from the integration unit 15 into input data based on comparison with a plurality of threshold values. The input conversion section 12A outputs the converted input data IND to the synthesis section 13A.
  • the synthesis unit 13A synthesizes a plurality of input data IND converted based on the target frame TF and a plurality of input data IND converted based on the integrated frame IF into one composite data CD.
  • the combined input information IN is smaller than the amount of information according to the first embodiment.
  • the input information generation device 10A further includes the integrating unit 15, thereby performing calculations based on the pixel values of adjacent frames AF, and converting a plurality of adjacent frames AF into one integrated frame IF. Integrate.
  • the input conversion unit 12A converts the pixel value of the target frame TF into input data IND based on comparison with a plurality of threshold values, and further converts the pixel value of the integrated frame IF into input data IND based on comparison with a plurality of threshold values. Convert to IND.
  • the synthesizing unit 13A synthesizes a plurality of input data IND converted based on the target frame TF and a plurality of input data IND converted based on the integrated frame IF into one composite data CD. That is, the input information generation device 10A does not quantize and vectorize the adjacent frame AF like the target frame TF, but calculates an integrated frame IF based on a plurality of adjacent frames AF, and quantizes the integrated frame IF. and vectorization. Therefore, the input information IN synthesized by the synthesizing section 13A is smaller than the amount of information according to the first embodiment, and the calculation load on the first layer can be reduced.
  • the integrating unit 15 sets the average value of the pixel values of the plurality of adjacent frames AF as the pixel value of the integrated frame IF. That is, the integrating unit 15 takes the simple average of the adjacent frames AF as the integrated frame IF. Therefore, according to the present embodiment, input information IN having a smaller amount of information than the input information IN according to the first embodiment can be generated by a simple calculation, and the calculation load on the first layer can be reduced.
  • the integrating unit 15 calculates the pixel value of the integrated frame IF by calculating a weighted average of the plurality of adjacent frames AF according to the temporal distance from the target frame TF. Therefore, according to the present embodiment, the integrated frame IF can be generated in consideration of the degree of contribution of the adjacent frames AF to the target frame TF. Therefore, by using the input information IN generated by the input information generation device 10A, the CNN 200 can perform image processing with higher accuracy.
  • the input information generation device 10A includes the integrating unit 15 to calculate the average value of adjacent frames AF.
  • calculating the average value for each frame causes duplication of processing and is not efficient. Therefore, the third embodiment attempts to solve this problem and further lighten the calculation load.
  • FIG. 9 is a block diagram illustrating an example of a functional configuration for generating input information according to the third embodiment.
  • An example of the functional configuration of the input information generation device 10B will be described with reference to the same figure.
• the input information generation device 10B differs from the input information generation device 10A in that it further includes an average value temporary storage section 16, an imaging condition acquisition section 17, and an adjustment section 18, and includes an integration section 15B instead of the integration section 15.
  • the same components as the input information generation device 10A may be given the same reference numerals and the description thereof may be omitted.
  • the average value temporary storage unit 16 stores the average value of the pixel values of a predetermined frame among the frames making up the moving image.
  • the value stored in the average value temporary storage section 16 is calculated by the integration section 15B.
  • the integration unit 15B acquires information about the target frame TF from the image acquisition unit 11, and acquires the stored value SV from the average value temporary storage unit 16.
  • the integrating unit 15B calculates the pixel value of the integrated frame IF by calculation based on the target frame TF and the stored value SV, which is the value stored in the average value temporary storage unit 16.
  • the integrating unit 15B stores the calculated value in the average value temporary storage unit 16 as the calculated value CV. That is, the value stored in the average value temporary storage unit 16 is updated every time the integration unit 15B calculates a new target frame TF.
  • the input information generation device 10B calculates a moving average based on the target frame TF. Note that regarding the calculation for the first frame, since the stored value SV does not yet exist in the average value temporary storage unit 16, the integrating unit 15B may perform the calculation based only on the target frame TF.
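• A minimal sketch of this moving-average integration is shown below; the smoothing factor alpha is a hypothetical parameter, and the specification only requires that the stored value SV be updated from the target frame TF every frame.

```python
import numpy as np

class MovingAverageIntegrator:
    """Sketch of the integration unit 15B: keeps one stored value SV (the running
    average) instead of all adjacent frames. `alpha` is a hypothetical smoothing
    factor chosen for illustration."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.stored_value = None  # contents of the average value temporary storage unit 16

    def integrate(self, target_frame):
        tf = target_frame.astype(np.float64)
        if self.stored_value is None:      # first frame: no SV exists yet, use TF only
            self.stored_value = tf
        else:                              # moving average of SV and TF
            self.stored_value = (1 - self.alpha) * self.stored_value + self.alpha * tf
        return self.stored_value           # pixel values of the integrated frame IF (also new CV)
```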
  • the imaging condition acquisition unit 17 acquires video imaging conditions from the imaging device 100.
  • the video imaging conditions acquired by the imaging condition acquisition unit 17 may be, for example, settings of the imaging device such as shutter speed, aperture, or ISO sensitivity. Further, the video imaging conditions acquired by the imaging condition acquisition unit 17 may include other information regarding the operation and drive of the imaging device 100.
  • the adjustment unit 18 adjusts the value (average value) stored in the average value temporary storage unit 16 according to the imaging condition acquired by the imaging condition acquisition unit 17.
• if the imaging conditions of the imaging device 100 change, the relationship between the pixel values of the target frame TF and the past moving average value also changes.
• for example, if the ISO sensitivity is doubled due to a change in the settings of the imaging device 100, the pixel values of the target frame TF become brighter than the past moving average value, so the moving average value suddenly appears relatively dark. Therefore, when the ISO sensitivity is doubled due to a change in the settings of the imaging device 100, by doubling the value stored in the average value temporary storage unit 16, the integrating unit 15B can continue calculating the moving average value while still utilizing the values stored in the average value temporary storage unit 16.
• if the shooting scene of the moving image changes, the CNN 200 may not be able to properly perform image processing on the target frame TF. Therefore, when the shooting scene of the moving image has changed, it is preferable not to compute a moving average with past frames.
• the adjustment unit 18 may therefore be configured to reset the value stored in the average value temporary storage unit 16 when the shooting scene of the video changes, so that the integration unit 15B starts calculating a new moving average value. Whether or not the shooting scene of the video has changed may be determined based on, for example, whether the power button of the imaging device 100 is turned on or off, or whether the shooting button or the stop button is turned on or off.
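• A sketch of these adjustments (scaling the stored average for a gain change, and resetting it on a scene change) is shown below, reusing the MovingAverageIntegrator sketch above; the linear scaling by the ISO ratio is an assumption made for illustration.

```python
def adjust_for_gain_change(stored_value, old_iso, new_iso):
    """Sketch of the adjustment unit 18: scale the stored average so it stays
    comparable with frames captured under the new ISO sensitivity
    (e.g. ISO doubled -> stored value doubled). Linear scaling is an assumption."""
    return stored_value * (new_iso / old_iso)

def reset_on_scene_change(integrator):
    """When the shooting scene changes, discard SV so a new moving average starts."""
    integrator.stored_value = None
```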
• during video shooting, a frame with a large change in brightness may be inserted. Possible causes of such a frame include a case where a light source enters the frame due to a change in the imaging angle, a case where a car's headlights are reflected, and the like.
  • the integrating unit 15B may exclude a frame with a large luminance change among the plurality of frames from the targets for calculating the average value. By excluding frames with large brightness changes from the average value calculation target, it is possible to prevent the moving average value from being dragged by frames with large brightness changes.
• whether the brightness change is large may be determined by comparing the pixel values of the immediately preceding target frame TF with the pixel values of the target frame TF to be calculated, and checking whether the difference is less than or equal to a threshold value.
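• A minimal sketch of this exclusion check is shown below; comparing the mean pixel values of the two frames and the threshold value are illustrative choices.

```python
import numpy as np

def has_large_luminance_change(prev_frame, target_frame, threshold=30.0):
    """Sketch of the exclusion check: compare the mean pixel value of the previous
    target frame with that of the current target frame; if the difference exceeds
    a threshold (the value here is illustrative), the frame is excluded from the
    moving-average calculation."""
    diff = abs(float(np.mean(target_frame)) - float(np.mean(prev_frame)))
    return diff > threshold
```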
  • the input information generation device 10B further includes the average value temporary storage unit 16 to store the average value of the pixel values of a predetermined frame among the frames constituting the moving image. Further, according to the input information generation device 10B, the integrating unit 15B calculates the pixel value of the integrated frame IF by calculation based on the value stored in the average value temporary storage unit 16 and the target frame TF. That is, the input information generation device 10B calculates the pixel value of the integrated frame IF based on the target frame TF and the stored moving average value. Therefore, the input information generation device 10B has a lighter calculation load than the input information generation device 10A. Therefore, according to this embodiment, the calculation load can be further reduced.
• the integrating unit 15B excludes frames with large luminance changes among the plurality of frames making up the video from the targets for calculating the average value. Therefore, according to the present embodiment, when the brightness suddenly changes greatly, excluding that frame from the moving-average calculation prevents the pixel values of the integrated frame IF from being dragged by the pixel values of the frame whose brightness has changed greatly.
• the input information generation device 10B further includes the imaging condition acquisition unit 17 to acquire the imaging conditions of the moving image, and further includes the adjustment unit 18 to adjust the average value stored in the average value temporary storage section 16 according to the acquired imaging conditions. Therefore, according to this embodiment, the average value can be adjusted in accordance with changes in the imaging conditions, and even if the imaging conditions change, the calculation of the moving average value can continue.
  • the input information generation device 10B calculates a moving average by including the average value temporary storage section 16.
  • the integrating unit 15B calculates the moving average of the entire image.
• the input information generation device 10B generates input data IN based on the target frame TF and the moving average, and the CNN 200 removes noise from the moving image based on the generated input data IN. Since the integrating unit 15B calculates a moving average of the entire image, if a moving subject is photographed in part of the video, afterimages may occur in the part where the moving subject is photographed as a result of noise removal, which can be a problem.
  • the fourth embodiment attempts to solve this problem.
  • FIG. 10 is a block diagram illustrating an example of a functional configuration for generating input information according to the fourth embodiment.
  • An example of the functional configuration of an input information generation device 10C according to the fourth embodiment will be described with reference to the same figure.
  • the input information generation device 10C is different from the input information generation device 10B in that it further includes a comparison section 19 and includes an integration section 15C instead of the integration section 15B.
• like the input information generation device 10B, the input information generation device 10C may or may not include the imaging condition acquisition section 17 and the adjustment section 18. In the illustrated example, the input information generation device 10C does not include the imaging condition acquisition section 17 and the adjustment section 18.
  • the same components as the input information generation device 10B may be given the same reference numerals and the description thereof may be omitted.
  • the comparison unit 19 acquires the stored value SV from the average value temporary storage unit 16 and acquires the target frame TF from the integration unit 15C.
  • the comparison unit 19 compares the acquired stored value SV stored in the average value temporary storage unit 16 and the pixel value of the target frame TF.
  • the comparison unit 19 may compare the entire image, each pixel, or each patch composed of a plurality of pixels.
  • the comparison unit 19 outputs the comparison result to the integration unit 15C as a comparison result CR.
  • the comparison result CR may include a difference between pixel values, or may include information about the result of comparing the difference with a predetermined threshold.
• the integration unit 15C acquires the comparison result CR from the comparison unit 19 and acquires the stored value SV from the average value temporary storage unit 16. Based on the comparison result CR obtained by the comparison unit 19, if the difference is less than or equal to a predetermined value, the integration unit 15C calculates a moving average based on the stored value SV stored in the average value temporary storage unit 16 and the target frame TF, and sets the calculated value as the pixel value of the integrated frame IF. If the difference exceeds the predetermined value, the integration unit 15C sets the pixel value of the target frame TF as the pixel value of the integrated frame IF.
  • the integration process by the integration unit 15C may be performed for the entire image, for each pixel, or for each patch made up of multiple pixels.
• the pixel value of the integrated frame IF is therefore the moving average value at locations where the difference is less than or equal to the predetermined value (that is, locations where the movement is small), and the pixel value of the target frame TF at locations where the difference exceeds the predetermined value (that is, locations where the movement is large).
  • the integrating unit 15C stores the calculated result in the average value temporary storage unit 16 as a calculated value CV.
• as a result, a subject with large movement is not incorporated into the average image, while a background with small movement is incorporated into the average image.
  • the calculation performed by the input information generation device 10C can also be called a selective average.
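• A minimal sketch of this per-pixel selective average is shown below; alpha and diff_threshold are illustrative parameters, and the comparison is done per pixel here although, as noted above, it could equally be done per patch or for the entire image.

```python
import numpy as np

def selective_average(stored_value, target_frame, alpha=0.2, diff_threshold=20.0):
    """Sketch of the integration unit 15C: per-pixel selective average.
    Where |SV - TF| <= diff_threshold (small movement) the moving average is used;
    elsewhere (large movement) the pixel value of TF is used directly."""
    sv = stored_value.astype(np.float64)
    tf = target_frame.astype(np.float64)
    moving_avg = (1 - alpha) * sv + alpha * tf
    small_motion = np.abs(sv - tf) <= diff_threshold      # comparison result CR per pixel
    integrated = np.where(small_motion, moving_avg, tf)   # integrated frame IF
    return integrated                                      # also stored back as CV
```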
• instead of making a binary choice between incorporating and not incorporating a frame into the average image, the integrating unit 15C may weight the contribution with a coefficient. For example, if the difference is less than or equal to a predetermined value based on the comparison result CR obtained by the comparison unit 19, the integrating unit 15C may calculate a moving average based on the stored value SV multiplied by a predetermined coefficient (for example, 0.9) and the target frame TF, and use the result as the pixel value of the integrated frame IF.
• conversely, if the difference exceeds the predetermined value, the integrating section 15C may calculate a moving average based on the stored value SV multiplied by a predetermined coefficient (for example, 0.1) and the target frame TF, and use the result as the pixel value of the integrated frame IF.
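• One possible reading of this coefficient variant is sketched below; the exact combination of the weighted SV and TF is an assumption, and the coefficients 0.9 and 0.1 are the illustrative values mentioned above.

```python
import numpy as np

def coefficient_weighted_average(stored_value, target_frame, small_motion_mask,
                                 coeff_small=0.9, coeff_large=0.1):
    """Sketch of the coefficient variant: instead of a hard include/exclude choice,
    SV is multiplied by a larger coefficient (e.g. 0.9) where motion is small and a
    smaller coefficient (e.g. 0.1) where motion is large before averaging with TF."""
    sv = stored_value.astype(np.float64)
    tf = target_frame.astype(np.float64)
    coeff = np.where(small_motion_mask, coeff_small, coeff_large)
    return (coeff * sv + tf) / 2.0   # simple average of the weighted SV and TF (one possible reading)
```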
• the input information generation device 10C further includes the comparison unit 19, which compares the stored value SV stored in the average value temporary storage unit 16 with the pixel values of the target frame TF. If the difference is less than or equal to a predetermined value as a result of the comparison by the comparison unit 19, the integration unit 15C calculates a moving average based on the stored value SV stored in the average value temporary storage unit 16 and the target frame TF, and uses it as the pixel value of the integrated frame IF. If the difference exceeds the predetermined value as a result of the comparison by the comparison unit 19, the integration unit 15C sets the pixel value of the target frame TF as the pixel value of the integrated frame IF.
  • the pixel value of the integrated frame IF is determined by distinguishing between a subject with large movement and a background with small movement, and selectively performing averaging processing.
  • the input information generation device 10C generates input data IN based on the target frame TF and the integrated frame IF. Therefore, according to the present embodiment, since the pixel values of the adjacent frame AF are not reflected in the pixel values of the integrated frame IF in areas where there is large movement, it is possible to prevent the problem of occurrence of afterimages.
  • any one of the frames adjacent to the target frame TF may be specified by a predetermined algorithm and used as the integrated frame IF.
  • the predetermined algorithm may be one that randomly identifies one frame among frames adjacent to the target frame TF.
  • the integrating unit 15 sets a randomly specified frame among the adjacent frames AF acquired by the image acquiring unit 11 as an integrated frame IF.
• the aspects described above are not limited to any one of the first to fourth embodiments; any of the first to fourth embodiments may be selectively used based on predetermined conditions.
  • the predetermined conditions may be video shooting conditions, shooting mode, exposure conditions, type of subject, etc.
• when training the CNN 200, as in any one of the first to fourth embodiments, it is preferable to perform learning using not only the target frame TF but also the adjacent frames AF. Calculations related to learning do not necessarily need to be executed in the image processing device 2; results such as parameters learned in advance in a dedicated learning device may be included in the CNN 200 as a trained model.
  • a learning model is trained using a combination of a noise image on which noise is superimposed and a high-quality image as training data.
  • the training data is created by capturing images of the same object with different exposure settings using an imaging device to obtain a high-quality image and a noise image.
  • machine learning requires a large amount of training data, and creating training data by capturing images using a camera is time-consuming. Therefore, a technique is known in which training data is created by adding random noise to a high-quality image (for example, see Japanese Patent Application Laid-Open No. 2021-071936). It is known to use such conventional techniques to create training data for inferring a high quality image from a low quality image by adding random noise to a high quality image.
  • the present invention aims to provide a technology that can generate training data for inferring a high-quality video from a low-quality video.
  • a learning model is trained to infer a high-quality video from which noise is removed by inputting low-quality video information with superimposed noise.
• here, low-quality videos include videos of low image quality, and high-quality videos include videos of high image quality.
  • Teacher data used for learning by the learning device, program, and learning method of the noise reduction device according to the present embodiment is generated from a still image of a subject.
• the still images of the subject may be a single high-quality image, or a combination of multiple images of the same subject (one or more high-quality images and one or more low-quality images).
  • a plurality of images of the same subject may be captured under different imaging conditions.
• the images captured of the subject may be any other combination as long as at least one image is included.
  • a high-quality image can be, for example, a high-quality image captured by low ISO sensitivity and long exposure.
  • a high quality image may be referred to as GT (Ground Truth).
  • An example of a low-quality image is a low-quality image captured by high ISO sensitivity and short exposure.
• in the present embodiment, image quality degradation due to noise will be described as an example of what makes an image low quality, but the present embodiment is widely applicable to factors other than noise that degrade image quality. Factors that reduce image quality include reduced resolution or color shift due to optical aberrations, reduced resolution due to camera shake or subject blur, uneven black levels due to dark current or circuitry, ghosts and flare caused by high-brightness objects, and signal level abnormalities.
• in the following description, a low-quality image may also be referred to as a noise image, and a high-quality image may also be referred to as GT. Similarly, a low-quality video may also be referred to as a noise video, and a high-quality video may also be referred to as GT.
  • Images targeted by the learning device may be still images or frames included in a video.
  • the data format may be a format that has not undergone compression encoding processing, such as a Raw format, or a format that has undergone compression encoding processing, such as a Jpeg format or an MPEG format.
  • the image targeted by the learning device according to the present embodiment may be an image captured by a CCD camera using a CCD (Charge Coupled Devices) image sensor. Further, the image targeted by the learning device according to the present embodiment may be an image captured by a CMOS camera using a complementary metal oxide semiconductor (CMOS) image sensor. Further, the image targeted by the learning device according to the present embodiment may be a color image or a monochrome image. Further, the image targeted by the learning device according to the present embodiment may be an image captured by an infrared camera using an infrared sensor or the like to obtain a non-visible light component.
  • FIG. 11 is a diagram for explaining an overview of the learning system according to the fifth embodiment.
  • An overview of the learning system 1001 will be explained with reference to the same figure.
  • a learning system 1001 shown in the figure is an example of a configuration in the learning stage of machine learning.
  • the learning system 1001 trains the learning model 1040 using teacher data TD generated based on images captured by the imaging device 1020.
  • the learning system 1001 is equipped with an imaging device 1020 to capture a high-quality image 1031 and a low-quality image 1032.
  • the high-quality image 1031 and the low-quality image 1032 are images of the same subject.
  • the high-quality image 1031 and the low-quality image 1032 are captured at the same angle of view and imaging angle, but with different settings such as ISO sensitivity and exposure time.
• although one high-quality image 1031 is shown, there may be a plurality of high-quality images 1031. Likewise, although a plurality of low-quality images 1032 are shown, there may be only one low-quality image 1032.
  • the plurality of low-quality images 1032 are different images captured with different settings such as ISO sensitivity and exposure time.
  • the imaging device 1020 may be, for example, a smartphone having communication means, a tablet terminal, or the like. Further, the imaging device 1020 may be a surveillance camera or the like having communication means.
  • the learning system 1001 generates a high-quality video 1033 from a high-quality image 1031 and a low-quality video 1034 from a low-quality image 1032.
  • the high-quality video 1033 is preferably generated from one high-quality image 1031
  • the low-quality video 1034 is preferably generated from a plurality of low-quality images 1032.
  • a high-quality video 1033 and a low-quality video 1034 generated from a high-quality image 1031 and a low-quality image 1032 captured from the same subject are associated with each other.
  • a high-quality video 1033 and a low-quality video 1034 that correspond to each other are input to the learning model 1040 for learning as teacher data TD.
  • the high-quality video 1033 and low-quality video 1034 that correspond to each other may be temporarily stored in a predetermined storage device for later learning. That is, the learning system 1001 may generate a plurality of teacher data TD in advance before learning that is performed later. Further, the high quality image 1031 and the low quality image 1032 captured by the imaging device 1020 may be temporarily stored in a predetermined storage device. In this case, the learning system 1001 may store a plurality of combinations of mutually corresponding high-quality images 1031 and low-quality images 1032, and generate teacher data TD during learning.
  • the learning model 1040 is trained using the teacher data TD generated by the learning system 1001. Specifically, the learning model 1040 is trained to infer high quality videos from low quality videos. In other words, the learning model 1040 after learning infers a high-quality video using a low-quality video as input, and outputs the inference result. That is, the learned model 1040 after learning may be used in a noise reduction device for removing noise from a low-quality video.
  • the high-quality image 1031 and low-quality image 1032 captured by the imaging device 1020 are stored in a predetermined storage device that temporarily stores information.
  • the predetermined storage device may be provided in the imaging device 1020, or may be provided in a cloud server or the like. That is, the learning system 1001 may be configured as an edge device or may include an edge device and a cloud server. Furthermore, the learning of the learning model 1040 may also utilize a GPU or the like provided on the server.
  • FIG. 12 is a diagram showing an example of the functional configuration of the learning device according to the fifth embodiment.
  • the functional configuration of the learning device 1010 will be explained with reference to the same figure.
  • the learning device 1010 is used to implement the learning system 1001 described above.
  • the learning device 1010 generates a high-quality video 1033 and a low-quality video 1034 based on the high-quality image 1031 and low-quality image 1032 captured by the imaging device 1020.
  • the learning device 1010 causes the learning model 1040 to learn using the generated high-quality video 1033 and low-quality video 1034 as teacher data TD.
  • the learning device 1010 includes an image acquisition section 1011, a video information generation section 1012, and a learning section 1013.
  • the learning device 1010 includes a CPU (Central Processing Unit), a storage device such as a ROM (Read only memory) or a RAM (Random access memory), etc., which are connected via a bus (not shown).
  • the learning device 1010 functions as a device including an image acquisition section 1011, a video information generation section 1012, and a learning section 1013 by executing a learning program.
  • the learning device 1010 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field-Programmable Gate Array).
  • the learning program may be recorded on a computer-readable recording medium.
  • the computer-readable recording medium is, for example, a portable medium such as a flexible disk, magneto-optical disk, ROM, or CD-ROM, or a storage device such as a hard disk built into a computer system.
  • the learning program may be transmitted via a telecommunications line.
  • the image acquisition unit 1011 acquires image information I from the imaging device 1020.
  • Image information I includes first image information I1 and second image information I2.
  • the first image information I1 includes at least one high-quality image 1031.
  • the second image information I2 includes at least one low-quality image 1032.
  • the same subject as the subject captured in the high-quality image 1031 included in the first image information I1 is captured in the low-quality image 1032 included in the second image information I2.
  • the image included in the second image information I2 has lower image quality than the image included in the first image information I1.
  • the image acquisition unit 1011 outputs the acquired image information I to the video information generation unit 1012.
• the video information generation unit 1012 generates video information M by cutting out a plurality of parts of the images included in the image information I and connecting the cut-out images as frame images at a predetermined time interval (which can also be called a frame rate).
  • the frame rate may be, for example, 60 [FPS (frames per second)].
  • the position of the image cut out by the video information generation unit 1012 may be different for each frame.
  • the size of the cut out images may be fixed, and the video information generation unit 1012 may cut out a plurality of images at positions moved by a predetermined number of pixels (bit number) in a predetermined direction.
  • the size of the image to be cut out may be fixed to 256 pixels x 256 pixels.
• the video information generation unit 1012 may cut out an image at a position shifted by, for example, 10 pixels for each frame. If the amount of shift is too large, the amount of change in the image for each frame becomes too large, resulting in an unnatural moving image, so it is preferable to set an upper limit so that the shift does not exceed a predetermined amount. It is preferable to determine the amount of shift and the limit based on the shooting angle of view, shooting resolution, focal length of the optical system, distance to the subject, shooting frame rate, and the like. Furthermore, since the speed of a falling subject increases due to acceleration, the amount of shift may be increased for frames temporally farther from the target image.
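• A minimal sketch of this crop-and-shift generation is shown below; the crop size, per-frame shift, and upper limit are the illustrative values mentioned above, and the diagonal shift direction is an assumption made for this example.

```python
import numpy as np

def generate_frames_from_still(image, n_frames=5, crop=256, shift=10, max_shift=40):
    """Sketch of the video information generation: cut 256x256 crops out of one
    still image at positions shifted by `shift` pixels per frame (capped at
    `max_shift`) and treat the sequence of crops as frame images."""
    h, w = image.shape[:2]
    frames = []
    for i in range(n_frames):
        off = min(i * shift, max_shift)            # limit so the apparent motion stays natural
        y, x = off, off                            # shift diagonally; any direction is possible
        if y + crop > h or x + crop > w:
            break                                  # stop when the crop would leave the image
        frames.append(image[y:y + crop, x:x + crop].copy())
    return frames                                  # connected at a fixed frame rate (e.g. 60 FPS)
```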
  • the video information generation unit 1012 generates first video information M1 from the images included in the first image information I1, and generates second video information M2 from the images included in the second image information I2. That is, the video information generation unit 1012 cuts out a plurality of images at different positions that are part of the first image information I1, and generates the first video information M1 by combining the plurality of cut out images. Further, the video information generation unit 1012 cuts out a plurality of images at different positions that are part of the second image information I2, and generates the second video information M2 by combining the plurality of cut out images. Generating a moving image by combining a plurality of images may mean converting the plurality of images into a file format that displays them at predetermined time intervals depending on the frame rate. The video information generation unit 1012 outputs information including the generated first video information M1 and second video information M2 as video information M to the learning unit 1013.
  • the sizes of the plurality of images cut out by the video information generation unit 1012 and the cutout positions may be arbitrarily determined. However, it is preferable that the position to be cut out from the image included in the first image information I1 and the position to be cut out from the image included in the second image information I2 are approximately the same position. This is because the first video information M1, which is a high-quality video, and the second video information M2, which is a low-quality video, should be of the same subject.
  • the learning unit 1013 acquires video information M from the video information generation unit 1012.
  • the learning unit 1013 causes the learning model 1040 to learn by inputting the acquired video information M to the learning model 1040 as teacher data TD.
  • the learning model 1040 is trained to infer high quality videos from low quality videos. That is, the learning unit 1013 causes learning to infer a high-quality video from a low-quality video based on the teacher data TD that includes the first video information M1 and the second video information M2 generated by the video information generation unit 1012. .
  • the learning model 1040 can also be said to be trained to reason to remove noise from the input video.
  • a method of generating a high-quality video from a high-quality image (described with reference to FIG. 13) and a method of generating a low-quality video from a low-quality image (described with reference to FIG. 14) will be explained.
• note that a high-quality video may be generated from a high-quality image, and a low-quality video may be generated from a low-quality image, using methods similar to each other. That is, a low-quality video may be generated by the method described with reference to FIG. 13, and a high-quality video may be generated by the method described with reference to FIG. 14.
  • FIG. 13 is a diagram for explaining an example of the position of an image cut out from a high-quality image by the learning device according to the fifth embodiment.
  • An example of the position of an image cut out from a high-quality image by the learning device 1010 will be described with reference to the same figure.
  • FIG. 13A shows an image I-11 that is an example of an image included in the first image information I1.
  • FIG. 13(B) shows an example of a case where a plurality of images are cut out from image I-11 shown in FIG. 13(A) as image I-12.
  • the ball B which is the subject, is captured in the image I-11.
  • the video information generation unit 1012 generates a video from the still image I-11 by cutting out a plurality of images from the image I-11 and temporally connecting the cut-out images.
  • the image I-12 shown in FIG. 13(B) shows a plurality of cut-out images CI, which are images cut out by the video information generation unit 1012. Specifically, cut-out images CI-11 to cut-out images CI-15 are shown as examples of images cut out by the video information generation unit 1012. When cutout images CI-11 to cutout images CI-15 are not distinguished, they may be simply written as cutout images CI.
  • the cutout images CI-11 to CI-15 are each shifted by a predetermined number of pixels in the vertical and horizontal directions.
• in the video generated by temporally connecting the cutout images, the cutout image CI-11 is displayed at time t1, the cutout image CI-12 at time t2, the cutout image CI-13 at time t3, the cutout image CI-14 at time t4, and the cutout image CI-15 at time t5.
• it is suitable to determine the shift direction and shift amount of the image cut out by the video information generation unit 1012 based on shooting conditions such as the shooting angle of view, shooting resolution, focal length of the optical system, distance to the subject, and shooting frame rate. Furthermore, when simulating a falling object, since the speed increases due to acceleration, it is preferable to gradually increase the shift amount.
  • the high-quality video (first video information M1) generated by the learning device 1010 is a high-quality video without superimposed noise. Therefore, it is ideal that noise is not superimposed on an image that is a still image for generating a moving image. Further, ideally, each frame of a high-quality video generated from an image without superimposed noise should also be free from superimposed noise. Therefore, it is preferable that the video information generation unit 1012 generates a video from a single image on which no noise is superimposed. That is, it is preferable that the video information generation unit 1012 generates the first video information M1 by cutting out different parts from one high-quality image included in the first image information I1.
  • FIG. 14 is a diagram for explaining an example of the position of an image cut out from a low-quality image by the learning device according to the fifth embodiment.
  • An example of the position of an image cut out from a low-quality image by the learning device 1010 will be described with reference to the same figure.
  • the learning device 1010 cuts out images of different frames from a plurality of low-quality images. Images I-21 to I-25, which are different images, are shown in FIGS. 14(A) to 14(E), respectively.
  • the learning device 1010 cuts out images of different frames from images I-21 to I-25.
• the compositions of images I-21 to I-25, which are low-quality images, are similar to that of image I-11 shown in FIG. 13(A). That is, the ball B is imaged at the same position in images I-21 to I-25. Images I-21 to I-25 differ from image I-11 in that different noises are superimposed on them. Images I-21 to I-25 may have different noises superimposed on them, for example, by using different imaging conditions during capture.
• the video information generation unit 1012 cuts out a cutout image CI-21 from the image I-21, a cutout image CI-22 from the image I-22, a cutout image CI-23 from the image I-23, a cutout image CI-24 from the image I-24, and a cutout image CI-25 from the image I-25.
  • the cutout images CI-21 to CI-25 are each shifted by a predetermined number of pixels in the vertical and horizontal directions.
• in the generated video, the cutout image CI-21 is displayed at time t1, the cutout image CI-22 at time t2, the cutout image CI-23 at time t3, the cutout image CI-24 at time t4, and the cutout image CI-25 at time t5. Since different noises are superimposed on each of the cutout images CI-21 to CI-25, different noise is superimposed on the generated moving image at each point in time.
• the low-quality video (second video information M2) generated by the learning device 1010 is a low-quality video on which noise is superimposed. If a video were created by cutting out multiple different positions from a single image with superimposed noise, the same noise would be included at every moment (in other words, the noise would not change over time), which may not be appropriate as a low-quality video. Therefore, in this embodiment, a low-quality video is generated by cutting out from a plurality of different low-quality images. The same subject as the subject captured in the high-quality image is captured in each of the different low-quality images.
• the second image information I2 includes a plurality of images in which the same subject as the subject captured in the image included in the first image information I1 is captured, with different noise superimposed on each image.
  • the plurality of images included in the second image information I2 may be images captured at different times close to each other.
• the video information generation unit 1012 generates the second video information M2 by cutting out different parts from each of the plurality of images included in the second image information I2. Note that it is not necessary to prepare as many low-quality images as there are frames; images may be cut out multiple times from the plurality of images as long as the same image is not used for consecutive frames. The order in which the plurality of images are cut from may be random.
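• As a minimal sketch (assuming the cut positions have already been decided and the low-quality images are NumPy arrays), the following picks a low-quality image per frame while avoiding consecutive reuse of the same image, and crops it at the given position; all parameters are illustrative.

```python
import numpy as np

def generate_noisy_video(noisy_images, positions, crop=256, rng=None):
    """Sketch of second video information M2 generation: the i-th frame is cut from
    a randomly chosen low-quality image (not the same image as the previous frame
    when more than one is available) at the position given for that frame."""
    rng = rng or np.random.default_rng()
    frames = []
    prev_idx = -1
    for (y, x) in positions:
        idx = int(rng.integers(len(noisy_images)))
        if idx == prev_idx and len(noisy_images) > 1:     # avoid reusing the same image twice in a row
            idx = (idx + 1) % len(noisy_images)
        prev_idx = idx
        frames.append(noisy_images[idx][y:y + crop, x:x + crop].copy())
    return frames
```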
  • FIG. 15 is a diagram for explaining an example of the direction in which the learning device according to the fifth embodiment cuts out.
  • an example was described in which a position moved by a predetermined number of pixels in both the vertical and horizontal directions is cut out.
  • the video information generation unit 1012 may cut out positions moved in other directions.
  • Another example of the direction in which the video information generation unit 1012 cuts out the cutout image CI will be described with reference to FIGS. 15(A) to 15(C).
  • FIG. 15(A) shows image I-31.
  • FIG. 15(A) is an example of a case where a position moved only in the lateral direction (horizontal direction) is extracted.
  • the video information generation unit 1012 fixes the y-coordinate in the vertical direction and changes only the x-coordinate in the horizontal direction, thereby cutting out the cut-out images CI at a plurality of different positions. By cutting out the image in this way, it is possible to generate a moving image in which the subject moves laterally (horizontally). Similarly, the video information generation unit 1012 may cut out the cutout image CI at a position moved only in the vertical direction (vertical direction).
  • the video information generation unit 1012 may cut out the cutout image CI at a position moved in both the vertical and horizontal directions. In this case, the amount of movement in the vertical direction and the amount of movement in the lateral direction may be different from each other.
  • FIG. 15(B) shows image I-32.
  • FIG. 15(B) is an example of a case where a position moved in the rotational direction is extracted.
• the video information generation unit 1012 cuts out the cutout images CI at a plurality of different positions by moving the cutout position along an arc with a rotation center O and a radius r.
  • the video information generation unit 1012 cuts out a position rotated counterclockwise. By cutting out in this way, it is possible to generate a moving image in which the subject moves in the rotational direction.
  • the position of the center of rotation O and the size of the radius r may differ from frame to frame.
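• A possible sketch of computing such rotational cut-out positions is shown below; the start angle and the angular step per frame are hypothetical parameters chosen for illustration.

```python
import math

def arc_positions(center_y, center_x, radius, n_frames, start_deg=0.0, step_deg=5.0):
    """Sketch of the rotational cut-out: the corner of each crop moves
    counter-clockwise along an arc of radius r around the rotation centre O."""
    positions = []
    for i in range(n_frames):
        theta = math.radians(start_deg + i * step_deg)   # counter-clockwise per frame
        y = int(round(center_y - radius * math.sin(theta)))
        x = int(round(center_x + radius * math.cos(theta)))
        positions.append((y, x))
    return positions
```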
  • FIG. 15(C) shows image I-33.
  • FIG. 15C is an example of enlarging and reducing the cutting position.
  • the size of the cutout image CI is constant. Therefore, the video information generation unit 1012 enlarges or reduces the image I and cuts it out while maintaining the size of the cut-out image CI.
• for example, when the size of the cutout image CI is fixed at 256 pixels x 256 pixels, the video information generation unit 1012 enlarges or reduces the image I so that the cutout fits within the size of the cutout image CI. By cutting out the image in this way, it is possible to generate a moving image in which the subject appears to be zoomed in or zoomed out.
  • the cutout positions described with reference to FIGS. 15(A) to 15(C) are examples of this embodiment, and the video information generation unit 1012 can generate a video by cutting out and connecting other different positions. Information may be generated.
  • the video information generation unit 1012 may cut out the cutout image CI, for example, by combining the cutout methods described with reference to FIGS. 15(A) to 15(C). In this case, it is possible to generate a moving image in which, for example, the moving image is horizontally or vertically moved and then rotated, or moved and then enlarged or reduced.
  • the movement of the cutout position as described above may be calculated by affine transformation. That is, the predetermined direction in which the video information generation unit 1012 cuts out an image can also be described as being calculated by affine transformation.
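• One way to express these motions uniformly is to update the cut position with an affine transformation, as in the following sketch; the matrix and translation vector are illustrative parameters, not values from the specification.

```python
import numpy as np

def affine_positions(start_yx, n_frames, matrix, translation):
    """Sketch of computing cut-out positions by an affine transformation:
    p_{i+1} = A @ p_i + t. With A = identity this is pure translation, with a
    rotation matrix it is rotation, and with a scaled identity it is a zoom-like motion."""
    a = np.asarray(matrix, dtype=np.float64)       # 2x2 matrix A
    t = np.asarray(translation, dtype=np.float64)  # 2-vector t
    p = np.asarray(start_yx, dtype=np.float64)
    positions = [tuple(np.round(p).astype(int))]
    for _ in range(n_frames - 1):
        p = a @ p + t
        positions.append(tuple(np.round(p).astype(int)))
    return positions
```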
  • the video information generation unit 1012 may generate a video by cutting out a part of the image and then moving it.
• for example, the video information generation unit 1012 cuts out an image of 256 pixels x 256 pixels and generates a plurality of images by moving the cut-out image in a predetermined direction.
  • the video information generation unit 1012 generates a video by connecting the cut out images. That is, the video information generation unit 1012 may cut out a plurality of images at different positions by shifting the plurality of cut out images in a predetermined direction. Note that by moving the image after cutting it out, an area where no data exists will occur around the image. However, by predefining the peripheral portion of the image as the margin, it is possible to exclude it from the range of the image to be learned, and to prevent problems from occurring in the later learning stage.
  • the video information generation unit 1012 generates a video by cutting out an image that has been moved in a direction calculated by some method such as affine transformation.
  • the learning device 1010 can generate a moving image by cutting out an image in which the object is moved in a direction based on the trajectory of the actual movement, and can generate training data that is more effective for machine learning.
  • An example of such a case will be described as a modification of the fifth embodiment with reference to FIGS. 16 and 17.
  • the video information generation unit 1012 may generate the video after performing correction to add pseudo subject blur to the still image for which the video is to be created.
  • subject blurring may be added by performing a predetermined averaging process in the shift direction or by performing a process to lower the resolution.
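• A simple sketch of adding such pseudo subject blur by averaging along the shift direction is shown below; the number of averaging taps and the per-frame shift are assumptions, and np.roll is used only as a simplification (it wraps around at the image border).

```python
import numpy as np

def add_motion_blur(image, shift_per_frame=(0, 10), taps=5):
    """Sketch of pseudo subject blur: average the image with copies of itself shifted
    along the per-frame motion direction, approximating the blur a moving subject
    would show during the exposure."""
    dy, dx = shift_per_frame
    acc = np.zeros_like(image, dtype=np.float64)
    for k in range(taps):
        frac = k / max(taps - 1, 1)
        sy, sx = int(round(dy * frac)), int(round(dx * frac))
        acc += np.roll(np.roll(image.astype(np.float64), sy, axis=0), sx, axis=1)
    return (acc / taps).astype(image.dtype)
```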
  • FIG. 16 is a diagram illustrating an example of the functional configuration of a learning device according to a modification of the fifth embodiment when the learning device generates a moving image based on a trajectory vector.
  • An example of the functional configuration of a learning device 1010A according to a modification of the fifth embodiment will be described with reference to the same figure.
  • a learning system 1001A according to a modification of the fifth embodiment differs from the learning system 1001 in that it further includes a trajectory vector generation device 1050.
  • the learning device 1010A differs from the learning device 1010 in that it further includes a trajectory vector acquisition unit 1014.
  • the learning device 1010A differs from the learning device 1010 in that the learning device 1010A includes a video information generating section 1012A instead of the video information generating section 1012.
  • the same components as the learning device 1010 may be given the same reference numerals and the description thereof may be omitted.
  • the trajectory vector generation device 1050 acquires information regarding the trajectory of the object captured in the video. Video information is input to the trajectory vector generation device 1050, and the trajectory vector generation device 1050 analyzes the trajectory of the object imaged based on the input video information. Trajectory vector generation device 1050 outputs the analyzed result as trajectory vector TV.
  • the trajectory vector TV indicates the trajectory of the object captured in the video information.
  • Trajectory vector generation device 1050 acquires trajectory vector TV from video information using, for example, conventional technology such as optical flow. Note that the trajectory vector TV may include coordinate information indicating the trajectory of the movement of the object in addition to or in place of the vector information.
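• For example, the trajectory vector TV could be estimated with dense optical flow, as in the following sketch using OpenCV's Farneback method; reducing the flow field to a single mean vector per frame pair is a simplification made here for illustration.

```python
import cv2
import numpy as np

def trajectory_vector_from_video(gray_frames):
    """Sketch of the trajectory vector generation device 1050: estimate dense optical
    flow between consecutive grayscale frames and take the mean flow per frame pair
    as the trajectory vector TV. Parameter values are common defaults for Farneback."""
    tv = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        tv.append(flow.reshape(-1, 2).mean(axis=0))   # mean (dx, dy) for this frame pair
    return np.asarray(tv)
```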
  • the trajectory vector acquisition unit 1014 acquires the trajectory vector TV from the trajectory vector generation device 1050.
  • the trajectory vector acquisition unit 1014 outputs the acquired trajectory vector TV to the video information generation unit 1012A.
  • the moving image for which the trajectory vector TV has been acquired by the trajectory vector generation device 1050 and the image acquired by the image acquisition unit 1011 may have a predetermined relationship.
  • the image acquisition unit 1011 may acquire, as an image, one frame of a video whose trajectory vector TV has been acquired by the trajectory vector generation device 1050.
• however, the present embodiment is not limited to this example, and the video whose trajectory vector TV is acquired by the trajectory vector generation device 1050 and the image acquired by the image acquisition unit 1011 do not need to have a predetermined relationship.
  • the video information generation unit 1012A acquires image information I from the image acquisition unit 1011 and acquires the trajectory vector TV from the trajectory vector acquisition unit 1014.
  • the video information generation unit 1012A generates video information based on the acquired image information I and trajectory vector TV.
  • the video information generation unit 1012A determines the cutting direction of the cutout image CI and the amount of shift per frame based on the trajectory indicated by the trajectory vector TV. That is, the predetermined direction in which the video information generation unit 1012A cuts out the image is calculated based on the acquired trajectory vector TV.
  • FIG. 17 is a diagram for explaining an example of the position of an image cut out from a still image when a learning device according to a modification of the fifth embodiment generates a moving image based on a trajectory vector.
  • An example of the position coordinates of the cut-out image CI in the case of generating a moving image based on the trajectory vector TV will be described with reference to the same figure.
  • FIG. 17A shows an image I-41 that is an example of an image included in the first image information I1.
  • FIG. 17B shows an example of a plurality of cut out images CI cut out from the image I-41.
  • the image I-41 shows a trajectory vector TV that is the trajectory of the ball B, which is the subject.
  • the trajectory vector TV represents a vector in which the ball B falls from the upper right direction in the figure to the lower center direction, bounces at the lower center point, and then moves toward the upper left direction in the figure.
• the video information generation unit 1012A cuts out the cutout images CI at position coordinates based on the trajectory vector TV shown in the image I-41, and temporally connects the cutout images, thereby generating a video from the image I-41, which is a still image.
  • FIG. 17(B) shows an example of a cut-out image CI, which is an image cut out by the video information generation unit 1012.
  • cutout images CI-41 to cutout images CI-49 are shown.
  • the cutout images CI-41 to CI-49 are located at coordinates based on the trajectory vector TV. That is, the cutout image CI-41 is located in the upper right direction in the figure, and the cutout position moves toward the center and lower in the figure as it approaches the cutout image CI-45. Further, the cutout position moves toward the upper left in the figure from cutout image CI-45 to cutout image CI-49.
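• A minimal sketch of placing cut positions along the trajectory vector TV is shown below; accumulating the per-frame (dx, dy) components of TV onto a starting position and clamping to the image bounds are assumptions made for illustration.

```python
import numpy as np

def positions_from_trajectory(start_yx, trajectory_vector, crop=256, image_shape=None):
    """Sketch of how the video information generation unit 1012A could place the
    cut-out images along TV: each per-frame (dx, dy) of TV is added to the current
    cut position. Clamping to the image bounds is an added safeguard."""
    y, x = float(start_yx[0]), float(start_yx[1])
    positions = [(int(round(y)), int(round(x)))]
    for dx, dy in trajectory_vector:
        y += dy
        x += dx
        if image_shape is not None:
            h, w = image_shape[:2]
            y = min(max(y, 0), h - crop)
            x = min(max(x, 0), w - crop)
        positions.append((int(round(y)), int(round(x))))
    return positions
```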
  • FIG. 18 is a flowchart illustrating an example of a series of operations of the learning method of the noise reduction device according to the fifth embodiment. An example of a series of operations of the learning method of the noise reduction device using the learning device 1010 will be described with reference to the same figure.
  • Step S110 First, the image acquisition unit 1011 acquires an image.
  • the image acquisition unit 1011 acquires first image information I1 that includes a high-quality image and second image information I2 that includes a low-quality image.
  • the step of acquiring an image by the image acquisition unit 1011 may be referred to as an image acquisition step or an image acquisition process.
  • Step S130 the video information generation unit 1012 cuts out a part of the acquired image.
  • the video information generation unit 1012 cuts out a plurality of cut images CI from the acquired image.
  • the video information generation unit 1012 cuts out a plurality of cutout images CI from each of the high quality image included in the first image information I1 and the low quality image included in the second image information I2. Note that it is preferable that the position coordinates cut out from each of the high-quality image included in the first image information I1 and the low-quality image included in the second image information I2 are the same.
• when the high-quality image included in the first image information I1 and the low-quality image included in the second image information I2 are captured at different times, it is preferable to determine the position coordinates to be cut out from each of them by taking into account the deviation caused by the time difference. More specifically, it is preferable to shift the position coordinates to be cut out from the high-quality image included in the first image information I1 or the low-quality image included in the second image information I2 in the direction that reduces the amount of deviation caused by the time difference.
  • Step S150 the video information generation unit 1012 connects the cut out images to generate a video.
  • the video information generation unit 1012 generates a high-quality video by connecting multiple images cut out from high-quality images, and generates a low-quality video by connecting multiple images cut out from low-quality images.
• the step of generating video information in step S130 and step S150 may be referred to as a video information generation step or a video information generation process.
  • Step S170 the learning unit 1013 uses the combination of the generated high-quality video and low-quality video as teacher data TD and learns to infer a high-quality video from a low-quality video. This step may be referred to as a learning step or a learning process.
  • the learning device 1010 includes the image acquisition unit 1011 to acquire the first image information I1 and the second image information I2.
• the first image information I1 includes at least one image, and the second image information I2 includes at least one image that captures the same subject as the subject captured in the image included in the first image information I1 and that has lower image quality than the images included in the first image information I1.
• the learning device 1010 includes the video information generation unit 1012, which cuts out a plurality of images at different positions that are part of the first image information I1, and generates the first video information M1 by combining the plurality of cut-out images.
• the learning device 1010 includes the video information generation unit 1012, which cuts out a plurality of images at different positions that are part of the second image information I2 and combines the plurality of cut-out images to generate the second video information M2. Further, the learning device 1010 includes the learning unit 1013, which trains the learning model to infer a high-quality video from a low-quality video based on the teacher data TD that includes the first video information M1 and the second video information M2 generated by the video information generation unit 1012. That is, according to the present embodiment, the learning device 1010 does not need to acquire training data consisting of low-quality videos and high-quality videos by shooting videos, as was conventionally required, and can generate the training data from still images. Therefore, according to this embodiment, training data for inferring a high-quality video from a low-quality video can be generated easily.
  • the learning device 1010 can generate a plurality of different moving images from the same still image. Therefore, according to this embodiment, since a huge amount of teacher data TD is generated, it is not necessary to prepare a huge amount of still images, and many moving images can be generated from a small number of still images. Therefore, according to this embodiment, the time required to capture images for use in learning can be shortened.
• the second image information I2 includes a plurality of images in which the same subject as the subject captured in the image included in the first image information I1 is captured, each with different noise superimposed.
• the video information generation unit 1012 generates the second video information M2 by cutting out different parts from each of the plurality of images included in the second image information I2. That is, according to the present embodiment, a low-quality moving image with superimposed noise is generated based on a plurality of different low-quality images with superimposed noise. Therefore, the second video information M2 generated according to the present embodiment has different noise superimposed on each frame, and reproduces a noisy low-quality video more faithfully.
  • the plurality of images included in the second image information I2 are images taken at different times that are close to each other. That is, low-quality images for generating a low-quality video are captured at close times.
  • the close time may be, for example, 1/60th of a second.
• in a moving image, noise peculiar to moving images, which has a temporal component, may be superimposed. Images captured at different times that are close to each other contain this noise peculiar to moving images. Therefore, according to the present embodiment, since the learning device 1010 generates a moving image based on images captured at different times that are close to each other, it can reproduce the noise peculiar to moving images having a temporal component.
  • the video information generation unit 1012 generates the first video information M1 by cutting out a different part from one image included in the first image information I1. That is, according to this embodiment, a high-quality video is generated based on one image. Therefore, according to this embodiment, it is possible to easily generate a high-quality moving image without having to capture many high-quality images.
  • the video information generation unit 1012 cuts out a plurality of images at different positions by shifting the plurality of cut out images by different amounts in a predetermined direction. That is, according to this embodiment, the learning device 1010 cuts out the image and then shifts it in a predetermined direction. In other words, after cutting out an image, the learning device 1010 performs processing based on the small image that has been cut out, without requiring processing based on the large image. Therefore, according to this embodiment, the learning device 1010 can lighten the processing.
  • the video information generation unit 1012 cuts out a plurality of images at positions shifted by a predetermined number of bits in a predetermined direction.
  • the video information generation unit 1012 generates a video by connecting the cut out images. That is, the subject imaged in the video generated by the video information generation unit 1012 appears to move in a predetermined direction in the video. Therefore, according to this embodiment, a moving image can be easily generated from a still image.
  • the predetermined direction in which the video information generation unit 1012 cuts out an image is calculated by affine transformation.
  • the predetermined direction in which the video information generation unit 1012 cuts out the image is the direction in which the subject moves in the video. Therefore, according to this embodiment, the learning device 1010 can generate a video in which the subject moves in various directions.
  • the learning device 1010 further includes the trajectory vector acquisition unit 1014 to acquire the trajectory vector TV. Further, the predetermined direction in which the video information generation unit 1012 cuts out the image is calculated based on the acquired trajectory vector TV.
  • the trajectory vector TV is information regarding a vector indicating the trajectory along which a subject actually moves in an actually captured moving image. Therefore, according to this embodiment, a video can be generated based on the trajectory of the subject's actual movement (a sketch of deriving crop offsets from such a vector follows).
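Where a trajectory vector TV is available, the per-frame crop offsets can be derived from it instead of a fixed shift. The following is a hedged sketch of one possible way to do this; spacing the crop positions evenly along the vector is an assumption, not something stated in the present disclosure.

```python
import numpy as np

def offsets_from_trajectory(trajectory_vector, n_frames):
    """trajectory_vector: (dy, dx) total displacement observed in a real clip.
    Returns one integer (dy, dx) crop offset per generated frame."""
    tv = np.asarray(trajectory_vector, dtype=float)
    steps = np.linspace(0.0, 1.0, n_frames)        # evenly spaced along the trajectory
    return [tuple(int(v) for v in np.round(tv * s)) for s in steps]

print(offsets_from_trajectory((4, 10), 5))  # [(0, 0), (1, 2), (2, 5), (3, 8), (4, 10)]
```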
  • FIG. 19 is a diagram for explaining an overview of the learning system according to the sixth embodiment.
  • An overview of a learning system 1001B according to the sixth embodiment will be described with reference to the same figure.
  • the same components as those in the fifth embodiment may be given the same reference numerals and the description thereof may be omitted.
  • the imaging device 1020 captures a high-quality image 1031.
  • the low-quality image 1032 is generated based on the high-quality image 1031 by the learning device 1010B according to the sixth embodiment.
  • the low-quality image 1032 is generated, for example, by subjecting the high-quality image 1031 to image processing and superimposing noise. That is, according to the present embodiment, the imaging device 1020 captures only the high-quality image 1031 and does not need to capture the low-quality image 1032.
  • FIG. 20 is a diagram illustrating an example of the functional configuration of the video information generation section according to the sixth embodiment.
  • the video information generation unit 1012B included in the learning device 1010B will be described with reference to the same figure.
  • a learning device 1010B according to the sixth embodiment differs from the learning device 1010 in that it includes a video information generating section 1012B instead of the video information generating section 1012.
  • the video information generation section 1012B includes a cutting section 1121, a noise superimposition section 1123, a first video information generation section 1125, and a second video information generation section 1127.
  • the cutout unit 1121 acquires an image from the image acquisition unit 1011.
  • the learning device 1010B acquires a high-quality image from the imaging device 1020.
  • accordingly, the cutout unit 1121 acquires a high-quality image from the image acquisition unit 1011.
  • the cutout unit 1121 cuts out a plurality of cutout images CI that are part of the acquired high-quality image and have different positional coordinates.
  • the cutout unit 1121 outputs the cutout image CI to the first moving image information generation unit 1125 and the noise superimposition unit 1123.
  • the noise superimposition unit 1123 acquires the cutout image CI cut out by the cutout unit 1121.
  • the noise superimposition unit 1123 superimposes noise on the acquired cutout image CI.
  • the noise superimposition unit 1123 obtains a plurality of cutout images CI obtained by cutting out a plurality of position coordinates, and superimposes noise on each of the plurality of obtained cutout images CI.
  • the noise superimposed by the noise superimposing unit 1123 may be modeled in advance.
  • the modeled noises include shot noise due to fluctuations in the number of photons, noise that occurs when the light incident on the image sensor is converted into electrons, noise that occurs when the converted electrons are converted into analog voltage values, and noise that occurs when the analog voltage values are converted into a digital signal (an illustrative toy model of these noise sources is sketched below).
  • the intensity of the superimposed noise may be adjusted by a predetermined method. It is preferable that the noise superimposition unit 1123 superimposes different noises on each of the plurality of cut-out images CI.
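A toy version of such a noise model is sketched below. The gain, read-noise level, and bit depth are illustrative assumptions; the present disclosure does not fix these parameters.

```python
import numpy as np

def superimpose_noise(clean, gain=0.01, read_sigma=2.0, adc_bits=12,
                      rng=np.random.default_rng()):
    """clean: linear-intensity image in digital numbers (e.g. 0..4095 for 12 bits)."""
    photons = rng.poisson(np.clip(clean, 0, None) / gain)       # shot noise (photon counting)
    signal = photons * gain
    signal = signal + rng.normal(0.0, read_sigma, clean.shape)  # electron/analog-chain noise
    max_dn = 2 ** adc_bits - 1
    return np.clip(np.round(signal), 0, max_dn)                 # A/D conversion (quantization)
```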
  • the noise superimposition unit 1123 outputs the image after superimposing noise to the second moving image information generation unit 1127 as a noise image NI.
  • the first video information generation unit 1125 acquires a plurality of cut out images CI from the cut out unit 1121.
  • the first video information generation unit 1125 generates first video information M1 by combining the plurality of cut out images.
  • the first video information generation unit 1125 outputs the generated first video information M1 to the learning unit 1013.
  • the second video information generation unit 1127 acquires a plurality of noise images NI from the noise superimposition unit 1123.
  • the second video information generation unit 1127 generates second video information M2 by combining a plurality of noise images NI on which noise is superimposed.
  • the second video information generation unit 1127 outputs the generated second video information M2 to the learning unit 1013.
  • the learning unit 1013 acquires the first video information M1 from the first video information generation unit 1125 and the second video information M2 from the second video information generation unit 1127.
  • the learning unit 1013 trains the learning model 1040 based on the first video information M1 and the second video information M2 generated by the video information generation unit 1012B.
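Putting the pieces together, the sixth-embodiment flow can be summarised by the following sketch, which reuses the video_from_still and superimpose_noise helpers shown earlier (both are assumptions for illustration): one high-quality still yields a clean clip M1 and a per-frame independently noised clip M2, forming one teacher-data sample.

```python
import numpy as np

def make_training_pair(high_quality_image, n_frames=5, crop_hw=(256, 256),
                       shift_per_frame=(0, 2)):
    """high_quality_image: RAW plane in digital numbers; returns (M1, M2)."""
    clean_clip = video_from_still(high_quality_image, crop_hw, n_frames, shift_per_frame)
    # Superimpose *different* noise on every frame so the clip mimics a real noisy video.
    noisy_clip = np.stack([superimpose_noise(frame) for frame in clean_clip])
    return clean_clip, noisy_clip   # target and input for the noise reduction model
```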
  • the learning device 1010B includes the image acquisition unit 1011 to acquire image information I including at least one high-quality image. Further, the learning device 1010B includes a video information generation unit 1012B to generate both high-quality videos and low-quality videos from high-quality images.
  • the video information generation unit 1012B includes a cutting unit 1121 to cut out a plurality of images at different positions that are part of the acquired image information I. Furthermore, the video information generation unit 1012B includes a noise superimposition unit 1123 to superimpose noise on each of the plurality of images cut out by the cutout unit 1121.
  • the video information generation unit 1012B includes the first video information generation unit 1125 and the second video information generation unit 1127. The first video information generation unit 1125 generates the first video information M1, which is a high-quality video, by combining the plurality of images cut out by the cutout unit 1121, and the second video information generation unit 1127 generates the second video information M2 by combining the plurality of images on which noise has been superimposed by the noise superimposition unit 1123.
  • therefore, the learning device 1010B generates both a high-quality video and a low-quality video from one high-quality image, namely the first video information M1 generated by the first video information generation unit 1125 and the second video information M2 generated by the second video information generation unit 1127, and trains the learning model 1040 to infer a high-quality video from the low-quality video.
  • inferring a high-quality video from a low-quality video is noise removal. Therefore, according to the present embodiment, it is possible to easily learn the noise removal model without requiring time to acquire the teacher data TD.
  • in the present embodiment, a high-quality video is generated from a high-quality image, a low-quality image is generated by superimposing noise on the high-quality image, and a low-quality video is generated based on the generated low-quality image.
  • the learning device 1010 may create the teacher data TD based only on low-quality images. That is, a low-quality video may be generated from a low-quality image, a high-quality image may be generated by further removing noise from the low-quality video, and a high-quality video may be generated based on the generated high-quality image.
  • the number of images used to generate the moving image may be one or multiple.
  • the learning device 1010 and learning device 1010A described in the fifth embodiment and the learning device 1010B described in the sixth embodiment are examples used for learning a learning model 1040 that infers a high-quality video from a low-quality video.
  • the learning model 1040 may be configured to have, after inferring a high-quality video from the low-quality video, a function of detecting a specific subject such as a person in the high-quality video, or a function of recognizing characters on signboards and the like in the high-quality video. That is, the high-quality video inferred by the learning model 1040 is not limited to a video for viewing, and may also be used for purposes such as object detection.
  • the ideal training data is a video that includes as much of the expected movement of the subject as possible.
  • SYMBOLS: 1...High-quality video generation system, 2...Image processing device, 10...Input information generation device, 11...Image acquisition section, 12...Input conversion section, 13...Composition section, 14...Output section, 15...Integration section, 16...Average value temporary storage section, 17...Imaging condition acquisition section, 18...Adjustment section, 19...Comparison section, 100...Imaging device, 200...CNN, 210...Input layer, 220...Convolution layer, 230...Pooling layer, 240...Output layer, IM...Video information, IN...Input information, TF...Target frame, AF...Adjacent frame, IF...Integrated frame, M1...First memory, M2...Second memory, IMD...Image information, IND...Input data, CD...Synthetic data, SV...Stored value, CV...Calculated value, CR...Comparison result, 1001...Learning system, 1010...Learning device, 1011...Image acquisition section, 1012...Video information generation section, 1013...Learning section, 1014...Trajectory vector acquisition unit, 1020...Imaging device, 1031...High-quality image, 1032...Low-quality image, 1033...High-quality video, 1034...Low-quality video, 1040...Learning model, 1050...Trajectory vector generation device, TD...Teacher data, I...Image information, I1...First image information, I2...Second image information, M...Video information, M1...First video information, M2...Second video information, TV...Trajectory vector, 1121...Cutting section, 1123...Noise superimposition unit, 1125...First video information generation unit, 1127...Second video information generation unit

Abstract

This input information generating device comprises: an image acquiring unit that acquires, as an input image, a plurality of frames that include at least a target frame that serves as a target for the generation of input information among the frames that constitute a video; an input converting unit that converts pixel values of the input image of the plurality of acquired frames into a plurality of pieces of two-bit input data; a synthesizing unit that synthesizes the plurality of pieces of the converted input data into one piece of synthesis data; and an output unit that outputs the synthesized synthesis data.

Description

Input information generation device, image processing device, input information generation method, learning device, program, and learning method for noise reduction device
The present invention relates to an input information generation device, an image processing device, an input information generation method, a learning device, a program, and a learning method for a noise reduction device.
This application claims priority based on Japanese Patent Application No. 2022-137834, filed in Japan on August 31, 2022, and Japanese Patent Application No. 2022-137843, filed in Japan on August 31, 2022, and all contents described in those applications are incorporated herein by reference.
When capturing an image with an imaging device, if the amount of surrounding light is not sufficient or if the settings of the imaging device such as shutter speed, aperture, or ISO sensitivity are inappropriate, the image may be of low quality. There is a technology that converts a low-quality image that has already been captured into a high-quality image through image processing. For example, there is a technique for processing a low-quality image into a high-quality image using machine learning (see, for example, Patent Document 1). In such technical fields, the highest priority requirement is to improve the quality of images.
US Patent No. 10623756
Here, it is conceivable to convert a low-quality video into a high-quality video by applying the above-mentioned conventional technology to a video and increasing the image quality of each frame that makes up the video. When increasing the image quality of a moving image captured by an imaging device in real time, a problem may arise in which the frame rate of the moving image is sacrificed if time is required for image processing. That is, when converting a low-quality video into a high-quality video, it is not possible to give priority only to increasing the image quality of frame images, and it is required to maintain the frame rate through lightweight image processing. Furthermore, processing to improve the image quality of moving images is sometimes performed on edge devices, and in consideration of the processing capabilities of edge devices, there has been a strong demand for lightweight image processing.
Therefore, the present invention aims to provide a technology that can convert a low-quality video into a high-quality video using lightweight calculations.
(1) One aspect of the present invention is an input information generation device comprising: an image acquisition unit that acquires, as input images, a plurality of frames including at least a target frame for which input information is to be generated, among the frames constituting a moving image; an input conversion unit that converts the pixel values of the acquired plurality of frames of input images into input data having a smaller number of bits than the number of bits representing the pixel values of the input images; a synthesis unit that combines the plurality of pieces of converted input data into one piece of composite data; and an output unit that outputs the combined composite data.
(2) One aspect of the present invention is the input information generation device according to (1) above, wherein the moving image is a color moving image, the image acquisition unit acquires the pixel values of each color from one frame as a plurality of different images, and the input conversion unit converts each of the plurality of images acquired from one frame into the input data.
(3) One aspect of the present invention is the input information generation device according to (1) or (2) above, wherein the image acquisition unit acquires images of a plurality of frames that are consecutively adjacent to the target frame, both before and after it, among the frames constituting the moving image.
(4) One aspect of the present invention is the input information generation device according to any one of (1) to (3) above, wherein, after the output unit outputs the composite data for the target frame, the image acquisition unit sets a frame adjacent to the target frame as the new target frame and acquires, as the input images, a plurality of frames including at least that target frame.
(5) One aspect of the present invention is the input information generation device according to any one of (1) to (4) above, further comprising an integration unit that integrates a plurality of adjacent frames, which are the frames other than the target frame among the plurality of frames acquired by the image acquisition unit, into one integrated frame by performing an operation based on the pixel values of the adjacent frames, wherein the input conversion unit converts the pixel values of the target frame into the input data and further converts the pixel values of the integrated frame into the input data, and the synthesis unit combines the plurality of pieces of input data converted from the target frame and the plurality of pieces of input data converted from the integrated frame into one piece of the composite data.
(6) One aspect of the present invention is the input information generation device according to (5) above, wherein the integration unit sets the average value of the pixel values of the plurality of adjacent frames as the pixel value of the integrated frame.
(7) One aspect of the present invention is the input information generation device according to (6) above, wherein the integration unit calculates the pixel value of the integrated frame by calculating a weighted average of the plurality of adjacent frames, weighted according to their temporal distance from the target frame.
(8) One aspect of the present invention is the input information generation device according to (6) above, wherein the integration unit excludes frames with large brightness changes, among the frames constituting the moving image, from the frames used in calculating the average value.
(9) One aspect of the present invention is the input information generation device according to (5) above, further comprising an average value temporary storage unit that stores the average value of the pixel values of predetermined frames among the frames constituting the moving image, wherein the integration unit calculates the pixel value of the integrated frame by an operation based on the value stored in the average value temporary storage unit and the target frame.
(10) One aspect of the present invention is the input information generation device according to (9) above, further comprising an imaging condition acquisition unit that acquires imaging conditions of the moving image, and an adjustment unit that adjusts the average value stored in the average value temporary storage unit according to the acquired imaging conditions.
(11) One aspect of the present invention is the input information generation device according to (9) above, further comprising a comparison unit that compares the value stored in the average value temporary storage unit with the pixel values of the target frame, wherein, when the difference obtained by the comparison unit is less than or equal to a predetermined value, the integration unit calculates the pixel value of the integrated frame as a moving average based on the value stored in the average value temporary storage unit and the target frame, and when the difference is not less than or equal to the predetermined value, the integration unit sets the pixel value of the target frame as the pixel value of the integrated frame. (A sketch of this stored-average update appears after aspect (A11) below.)
(12) One aspect of the present invention is the input information generation device according to (5) above, wherein the integration unit uses a frame randomly selected from among the adjacent frames acquired by the image acquisition unit as the integrated frame.
(13) One aspect of the present invention is an image processing device comprising the input information generation device according to any one of (1) to (12) above, and a convolutional neural network that uses the composite data output by the input information generation device as input information.
(14) One aspect of the present invention is an input information generation method comprising: an image acquisition step of acquiring, as input images, a plurality of frames including at least a target frame for which input information is to be generated, among the frames constituting a moving image; an input conversion step of converting the pixel values of the acquired plurality of frames of input images into input data having a smaller number of bits than the number of bits representing the pixel values of the input images; a synthesis step of combining the plurality of pieces of converted input data into one piece of composite data; and an output step of outputting the combined composite data.
(A1) One aspect of the present invention is a learning device comprising: an image acquisition unit that acquires first image information including at least one image, and second image information including at least one image of lower image quality than the image included in the first image information, in which the same subject as the subject captured in the image included in the first image information is captured; a video information generation unit that cuts out a plurality of images at different positions that are part of the acquired first image information and combines the plurality of cut-out images to generate first video information, and cuts out a plurality of images at different positions that are part of the acquired second image information and combines the plurality of cut-out images to generate second video information; and a learning unit that performs training so that a high-quality video is inferred from a low-quality video, based on teacher data including the first video information and the second video information generated by the video information generation unit.
(A2) One aspect of the present invention is the learning device according to (A1) above, wherein the second image information includes a plurality of images in which the same subject as the subject captured in the image included in the first image information is captured, each with mutually different noise superimposed, and the video information generation unit generates the second video information by cutting out a different part from each of the plurality of images included in the second image information.
(A3) One aspect of the present invention is the learning device according to (A1) or (A2) above, wherein the plurality of images included in the second image information are images captured at different times that are close to each other.
(A4) One aspect of the present invention is the learning device according to any one of (A1) to (A3) above, wherein the video information generation unit generates the first video information by cutting out different parts from one image included in the first image information.
(A5) One aspect of the present invention is the learning device according to any one of (A1) to (A4) above, wherein the video information generation unit cuts out a plurality of images at different positions by shifting the plurality of cut-out images in a predetermined direction.
(A6) One aspect of the present invention is the learning device according to any one of (A1) to (A5) above, wherein the video information generation unit cuts out a plurality of images at positions shifted by a predetermined number of bits in a predetermined direction.
(A7) One aspect of the present invention is the learning device according to (A6) above, wherein the predetermined direction in which the video information generation unit cuts out the images is calculated by affine transformation.
(A8) One aspect of the present invention is the learning device according to (A6) above, further comprising a trajectory vector acquisition unit that acquires a trajectory vector, wherein the predetermined direction in which the video information generation unit cuts out the images is calculated based on the acquired trajectory vector.
(A9) One aspect of the present invention is a learning device comprising: an image acquisition unit that acquires image information including at least one image; a cutout unit that cuts out a plurality of images at different positions that are part of the acquired image information; a first video information generation unit that combines the plurality of cut-out images to generate first video information; a noise superimposition unit that superimposes noise on each of the plurality of images cut out by the cutout unit; a second video information generation unit that combines the plurality of images on which noise has been superimposed by the noise superimposition unit to generate second video information; and a learning unit that performs training so that a high-quality video is inferred from a low-quality video, based on teacher data including the first video information generated by the first video information generation unit and the second video information generated by the second video information generation unit.
(A10) One aspect of the present invention is a program that causes a computer to execute: an image acquisition step of acquiring first image information including at least one image, and second image information including at least one image of lower image quality than the image included in the first image information, in which the same subject as the subject captured in the image included in the first image information is captured; a video information generation step of cutting out a plurality of images at different positions that are part of the acquired first image information and combining the plurality of cut-out images to generate first video information, and cutting out a plurality of images at different positions that are part of the acquired second image information and combining the plurality of cut-out images to generate second video information; and a learning step of performing training so that a high-quality video is inferred from a low-quality video, based on teacher data including the first video information and the second video information generated by the video information generation step.
(A11) One aspect of the present invention is a learning method for a noise reduction device, comprising: an image acquisition step of acquiring first image information including at least one image, and second image information including at least one image of lower image quality than the image included in the first image information, in which the same subject as the subject captured in the image included in the first image information is captured; a video information generation step of cutting out a plurality of images at different positions that are part of the acquired first image information and combining the plurality of cut-out images to generate first video information, and cutting out a plurality of images at different positions that are part of the acquired second image information and combining the plurality of cut-out images to generate second video information; and a learning step of performing training so that a high-quality video is inferred from a low-quality video, based on teacher data including the first video information and the second video information generated by the video information generation step.
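The stored-average behaviour described in aspects (9) to (11) above can be made concrete with the following minimal sketch. It is one possible realization under assumed parameters; the blend weight alpha and the difference threshold are illustrative and are not taken from the claims.

```python
import numpy as np

class IntegratedFrameBuffer:
    """Keeps a running average of past frames (the 'average value temporary storage')."""

    def __init__(self, alpha=0.2, diff_threshold=8.0):
        self.alpha = alpha                    # weight of the newest frame in the moving average
        self.diff_threshold = diff_threshold  # switch-over point between blending and resetting
        self.stored_average = None

    def update(self, target_frame):
        if self.stored_average is None:
            self.stored_average = target_frame.astype(np.float32)
            return self.stored_average
        diff = float(np.mean(np.abs(target_frame - self.stored_average)))
        if diff <= self.diff_threshold:       # scene is stable: blend into the moving average
            self.stored_average = (1 - self.alpha) * self.stored_average + self.alpha * target_frame
        else:                                 # large change (e.g. a brightness jump): restart
            self.stored_average = target_frame.astype(np.float32)
        return self.stored_average            # used as the integrated frame
```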
 本発明によれば、軽量な演算で低品質動画を高品質動画に変換することができる。 According to the present invention, it is possible to convert a low-quality video to a high-quality video with lightweight calculations.
FIG. 1 is a block diagram illustrating an example of the functional configuration of a high-quality video generation system according to a first embodiment.
FIG. 2 is a diagram illustrating an example of a convolutional neural network according to the first embodiment.
FIG. 3 is a diagram illustrating frames constituting a moving image according to the first embodiment.
FIG. 4 is a diagram for explaining an overview of an input information generation method according to the first embodiment.
FIG. 5 is a block diagram illustrating an example of the functional configuration of input information generation according to the first embodiment.
FIG. 6 is a block diagram illustrating an example of the functional configuration of an input conversion section according to the first embodiment.
FIG. 7 is a diagram for explaining an overview of an input information generation method according to a second embodiment.
FIG. 8 is a block diagram illustrating an example of the functional configuration of input information generation according to the second embodiment.
FIG. 9 is a block diagram illustrating an example of the functional configuration of input information generation according to a third embodiment.
FIG. 10 is a block diagram illustrating an example of the functional configuration of input information generation according to a fourth embodiment.
FIG. 11 is a diagram for explaining an overview of a learning system according to a fifth embodiment.
FIG. 12 is a diagram illustrating an example of the functional configuration of a learning device according to the fifth embodiment.
FIG. 13 is a diagram for explaining an example of the positions of images cut out from a high-quality image by the learning device according to the fifth embodiment.
FIG. 14 is a diagram for explaining an example of the positions of images cut out from a low-quality image by the learning device according to the fifth embodiment.
FIG. 15 is a diagram for explaining an example of the direction in which the learning device according to the fifth embodiment cuts out images.
FIG. 16 is a diagram illustrating an example of the functional configuration of the learning device according to the fifth embodiment when the learning device generates a moving image based on a trajectory vector.
FIG. 17 is a diagram for explaining an example of the positions of images cut out from a still image when a learning device according to a modification of the fifth embodiment generates a moving image based on a trajectory vector.
FIG. 18 is a flowchart illustrating an example of a series of operations of a learning method of a noise reduction device according to the modification of the fifth embodiment.
FIG. 19 is a diagram for explaining an overview of a learning system according to a sixth embodiment.
FIG. 20 is a diagram illustrating an example of the functional configuration of a video information generation section according to the sixth embodiment.
 以下、本発明の態様に係る入力情報生成装置、画像処理装置及び入力情報生成方法について、好適な実施の形態を掲げ、添付の図面を参照しながら詳細に説明する。なお、本発明の態様は、これらの実施の形態に限定されるものではなく、多様な変更または改良を加えたものも含まれる。つまり、以下に記載した構成要素には、当業者が容易に想定できるもの、実質的に同一のものが含まれ、以下に記載した構成要素は適宜組み合わせることが可能である。また、本発明の要旨を逸脱しない範囲で構成要素の種々の省略、置換または変更を行うことができる。また、以下の図面においては、各構成をわかりやすくするために、各構造における縮尺および数等を、実際の構造における縮尺および数等と異ならせる場合がある。 Hereinafter, preferred embodiments of an input information generation device, an image processing device, and an input information generation method according to aspects of the present invention will be described in detail with reference to the accompanying drawings. Note that the aspects of the present invention are not limited to these embodiments, and also include those with various changes or improvements. That is, the components described below include those that can be easily assumed by those skilled in the art and are substantially the same, and the components described below can be combined as appropriate. Further, various omissions, substitutions, or changes of the constituent elements can be made without departing from the gist of the present invention. Further, in the following drawings, in order to make each structure easier to understand, the scale, number, etc. of each structure may be different from the scale, number, etc. of the actual structure.
 まず、本実施形態の前提となる事項について説明する。本実施形態に係る入力情報生成装置、画像処理装置及び入力情報生成方法は、ノイズが重畳した低品質な動画情報を入力として、ノイズを取り除いた高品質な動画情報を生成する。低品質動画には低画質動画が含まれ、高品質動画には高画質動画が含まれる。高品質動画とは、一例として、低ISO感度、長秒露光により撮像される画質の高い動画を例示することができる。低品質動画とは、一例として、高ISO感度、短秒露光により撮像される画質の低い動画を例示することができる。 First, the premises of this embodiment will be explained. The input information generation device, image processing device, and input information generation method according to the present embodiment receive low-quality video information with superimposed noise as input and generate high-quality video information from which noise has been removed. Low-quality videos include low-quality videos, and high-quality videos include high-quality videos. An example of a high-quality moving image is a moving image with high image quality captured by low ISO sensitivity and long exposure. An example of a low-quality moving image is a moving image with low image quality captured by high ISO sensitivity and short exposure.
In the following description, image quality deterioration due to noise will be described as an example of a low-quality video, but the present embodiment is widely applicable to factors other than noise that degrade the quality of a video. Examples of factors that degrade video quality include a decrease in resolution or color shift due to optical aberration, a decrease in resolution due to camera shake or subject blur, uneven black levels due to dark current or circuitry, ghosts and flare due to high-brightness subjects, and signal level abnormalities. In addition to random noise that occurs at each pixel, noise includes streak-like noise that occurs in the horizontal or vertical direction of an image, noise that occurs in a fixed pattern in an image, and the like. Noise specific to moving images, such as flicker-like noise that fluctuates between consecutive frames, may also be included. The input information generation device, image processing device, and input information generation method according to the present embodiment improve the image quality of each frame by processing each frame included in a video individually, thereby improving the quality of the video as a whole.
 なお、高品質化の対象となる入力動画は、撮像装置によって撮像された動画が用いられてもよいし、予め用意されていた動画が用いられてもよい。以下の説明において、低品質動画を低画質動画又はノイズ動画と記載する場合がある。また、以下の説明において、高品質動画を高画質動画と記載する場合がある。 Note that as the input video to be improved in quality, a video captured by an imaging device may be used, or a video prepared in advance may be used. In the following description, a low-quality video may be referred to as a low-quality video or a noise video. Furthermore, in the following description, a high-quality video may be referred to as a high-quality video.
The video targeted by the input information generation device, image processing device, and input information generation method according to the present embodiment may be a video captured by a CCD camera using a CCD (Charge Coupled Device) image sensor, or a video captured by a CMOS camera using a CMOS (Complementary Metal Oxide Semiconductor) image sensor. The targeted video may be a color video or a monochrome video. The targeted video may also be a video captured by acquiring non-visible light components, for example with an infrared camera using an infrared sensor.
[First embodiment]
First, a first embodiment will be described with reference to FIGS. 1 to 6.
FIG. 1 is a block diagram showing an example of the functional configuration of a high-quality video generation system according to the first embodiment. An example of the functional configuration of the high-quality video generation system 1 will be described with reference to the same figure. The high-quality video generation system 1 includes an imaging device 100, an input information generation device 10, and a convolutional neural network 200 (hereinafter referred to as "CNN 200") as its functions. The input information generation device 10 and CNN 200 perform image processing on each frame constituting the moving image captured by the imaging device 100. The input information generation device 10 and the CNN 200 include trained models that have been trained in advance. In the following description, a configuration including the input information generation device 10 and CNN 200 may be referred to as an image processing device 2. Note that the high-quality moving image generation system 1 may be configured to include an encoding unit that compresses and encodes the output of the image processing device 2, and a predetermined memory that holds the results compressed and encoded by the encoder.
 撮像装置100は、動画を撮像する。撮像装置100により撮像される動画は、高品質化の対象となる低品質動画である。撮像装置100は、例えば暗い(光量の少ない)箇所に設置された監視カメラ等であってもよい。撮像装置100は、例えば光量不足による低品質動画を撮像する。撮像装置100は、撮像した動画を、入力情報生成装置10に出力する。撮像装置100により撮像された動画は、画像処理装置2への入力となる。したがって、撮像装置100から入力情報生成装置10に出力される動画を、動画情報IMと記載する場合がある。 The imaging device 100 images a moving image. The moving image captured by the imaging device 100 is a low-quality moving image that is subject to quality improvement. The imaging device 100 may be, for example, a surveillance camera installed in a dark (low amount of light) location. The imaging device 100 images a low-quality moving image due to insufficient light, for example. The imaging device 100 outputs the captured moving image to the input information generation device 10. A moving image captured by the imaging device 100 becomes an input to the image processing device 2 . Therefore, the video output from the imaging device 100 to the input information generation device 10 may be referred to as video information IM.
Note that both the imaging device 100 and the image processing device 2 may exist within a housing of a smartphone, a tablet terminal, or the like. That is, the high-quality video generation system 1 may exist as an element constituting an edge device. Further, the imaging device 100 may be connected to the image processing device 2 via a predetermined communication network. That is, the high-quality video generation system 1 may exist by having components connected to each other via a predetermined communication network.
Further, the imaging device 100 may be configured to include a plurality of lenses and a plurality of image sensors respectively corresponding to the plurality of lenses. As a specific example of such a configuration, the imaging device 100 may include a plurality of lenses and image sensors so as to acquire images with different angles of view. According to the imaging device 100 configured in this way, the images acquired from the respective image sensors can be said to be spatially adjacent to each other. The high-quality video generation system 1 is applicable not only to a plurality of temporally adjacent images such as a video, but also to a plurality of spatially adjacent images.
 入力情報生成装置10は、撮像装置100から動画情報IMを取得する。入力情報生成装置10は、取得した動画情報IMに基づいて入力情報INを生成する。入力情報INは、動画情報IMを構成するフレームごとに生成される。入力情報INは、対象となるフレームと、当該フレームに基づいて決定される他のフレームに基づいて生成されてもよい。当該フレームに基づいて決定される他のフレームとは、対象となるフレームに時間的に隣接するフレームであってもよい。 The input information generation device 10 acquires video information IM from the imaging device 100. The input information generation device 10 generates input information IN based on the acquired video information IM. Input information IN is generated for each frame that constitutes moving image information IM. The input information IN may be generated based on a target frame and other frames determined based on the target frame. The other frame determined based on the frame may be a frame temporally adjacent to the target frame.
 CNN200は、入力情報生成装置10により出力されたデータを入力情報INとする畳み込みニューラルネットワークである。CNN200の一例について、図2を参照しながら説明する。 The CNN 200 is a convolutional neural network that uses the data output by the input information generation device 10 as input information IN. An example of the CNN 200 will be described with reference to FIG. 2.
 図2は、第1の実施形態に係るCNN200の一例を示す図である。同図を参照しながら、CNN200の詳細について詳細に説明する。CNN200は、多層構造を有するニューラルネットワークである。CNN200は、入力情報INが入力される入力層210と、畳み込み演算を行う畳み込み層220と、プーリングを行うプーリング層230と、出力層240とを含む多層構造のネットワークである。CNN200の少なくとも一部において、畳み込み層220とプーリング層230とは交互に連結される。CNN200は、画像認識や動画認識に広く使われるモデルである。CNN200は、全結合層などの他の機能を有する層(レイヤ)をさらに有してもよい。なお、プーリング層230には畳み込み層220の演算結果を低ビット化するための量子化演算を行う量子化層を含んでもよい。具体的には、量子化層は、畳み込み層220における畳み込み演算の結果が16ビットである場合に、量子化層において畳み込み演算の結果を8ビット以下にビット数を削減する演算を行う。 FIG. 2 is a diagram showing an example of the CNN 200 according to the first embodiment. Details of the CNN 200 will be explained in detail with reference to the figure. CNN 200 is a neural network with a multilayer structure. The CNN 200 is a multilayer network including an input layer 210 to which input information IN is input, a convolution layer 220 to perform convolution operations, a pooling layer 230 to perform pooling, and an output layer 240. In at least a portion of the CNN 200, the convolution layer 220 and the pooling layer 230 are alternately connected. CNN200 is a model widely used for image recognition and video recognition. The CNN 200 may further include layers having other functions, such as a fully connected layer. Note that the pooling layer 230 may include a quantization layer that performs a quantization operation to reduce the number of bits to the operation result of the convolution layer 220. Specifically, when the result of the convolution operation in the convolution layer 220 is 16 bits, the quantization layer performs an operation to reduce the number of bits of the result of the convolution operation in the quantization layer to 8 bits or less.
Note that the CNN 200 may adopt a configuration in which the outputs of the plurality of convolution layers 220 and pooling layers 230 included in the CNN 200 are used as intermediate outputs that serve as inputs to other layers. As another embodiment, the CNN 200 may form a U-net by using the outputs of the plurality of convolution layers 220 and pooling layers 230 included in the CNN 200 as intermediate outputs that serve as inputs to other layers. In this case, the CNN 200 includes an encoder section that extracts feature amounts by convolution operations, and a decoder section that performs deconvolution operations based on the extracted feature amounts.
 入力層210には、入力情報INが入力される。入力情報INは、入力画像に基づき生成される。当該入力画像は、動画を構成するフレーム画像である。本実施形態に係る入力情報生成装置10は、入力画像から入力情報INを生成するものである。本実施形態において、入力情報INの要素は、例えば2ビットの符号なし整数(0,1,2,3)であってもよい。また、入力データの要素は、例えば、4ビットや8ビットの整数でもよい。 Input information IN is input to the input layer 210. Input information IN is generated based on the input image. The input image is a frame image that constitutes a moving image. The input information generation device 10 according to this embodiment generates input information IN from an input image. In this embodiment, the elements of the input information IN may be, for example, 2-bit unsigned integers (0, 1, 2, 3). Furthermore, the elements of the input data may be, for example, 4-bit or 8-bit integers.
 畳み込み層220は、入力層210に入力された入力情報INに対して畳み込み演算を行う。畳み込み層220は、低ビットの入力情報INに対して畳み込み演算を行う。畳み込み層220は、所定の畳み込み演算を行った結果、プーリング層230に対して所定の出力データを出力する。 The convolution layer 220 performs a convolution operation on the input information IN input to the input layer 210. The convolution layer 220 performs a convolution operation on the low-bit input information IN. The convolution layer 220 outputs predetermined output data to the pooling layer 230 as a result of performing a predetermined convolution operation.
 プーリング層は、畳み込み層220により畳み込み演算が行われた結果に基づき、ある領域の代表値を抽出する。具体的には、プーリング層230は、畳み込み層220により出力された畳み込み演算の出力データに対して、平均プーリングやMAXプーリング等の演算を実施して、畳み込み層220の出力データを圧縮する。 The pooling layer extracts a representative value of a certain area based on the result of the convolution operation performed by the convolution layer 220. Specifically, the pooling layer 230 compresses the output data of the convolution layer 220 by performing an operation such as average pooling or MAX pooling on the output data of the convolution operation output by the convolution layer 220.
 出力層240は、CNN200の結果を出力する層である。出力層240は、例えば、恒等関数やソフトマックス関数等によりCNN200の結果を出力してもよい。出力層240の前段に備えられるレイヤは、畳み込み層220であってもよいし、プーリング層230であってもよいし、その他のレイヤであってもよい。 The output layer 240 is a layer that outputs the results of the CNN 200. The output layer 240 may output the results of the CNN 200 using, for example, an identity function or a softmax function. The layer provided before the output layer 240 may be the convolution layer 220, the pooling layer 230, or another layer.
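As a rough illustration only, a CNN of this kind with activations re-quantized to a low bit width could be sketched in PyTorch as follows. The actual network structure, channel counts, and quantization scheme are not specified here, so the values below are assumptions; the naive rounding shown would also need a straight-through estimator to be trainable and is included only to show where the quantization layer sits.

```python
import torch
import torch.nn as nn

class QuantizeActivation(nn.Module):
    """Clamps activations to [0, max_val] and rounds them onto 2**bits levels."""
    def __init__(self, bits=8, max_val=1.0):
        super().__init__()
        self.levels = 2 ** bits - 1
        self.max_val = max_val

    def forward(self, x):
        x = torch.clamp(x, 0.0, self.max_val)
        return torch.round(x / self.max_val * self.levels) / self.levels * self.max_val

class TinyDenoiser(nn.Module):
    """Convolution -> quantized-activation blocks, ending in a 4-plane RAW output."""
    def __init__(self, in_channels=180, out_channels=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            QuantizeActivation(bits=8),   # keep intermediate results low-bit
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            QuantizeActivation(bits=8),
            nn.Conv2d(64, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, x):                 # x: (N, 180, 256, 256) composite data
        return self.body(x)
```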
FIG. 3 is a diagram showing frames constituting a moving image according to the first embodiment. With reference to the figure, frames used by the input information generation device 10 to generate input information IN will be described. The figure shows a plurality of consecutive frames constituting a moving image. Frames F1 to F7 shown in the figure are examples of a plurality of consecutive frames constituting a moving image.
Note that each frame is a RAW image that has not been compressed and encoded, and each pixel is expressed with 12 or 14 bits. The number of pixels in each frame is the number of pixels necessary to satisfy a predetermined video format such as 1920x1080 or 4096x2160. In this embodiment, the processing target of the CNN 200 will be described as a RAW image, but the processing target is not limited to this. If the image to be processed contains sufficient signal components, the image that has been subjected to processing such as compression encoding may be used as the target.
 入力情報生成装置10は、対象となるフレームである対象フレームTFと、対象フレームTFに隣接するフレームである隣接フレームAFとに基づき、入力情報INを生成する。隣接フレームAFは、例えば対象フレームTFの前又は後に連続して隣接するフレームである。図示する一例では、対象フレームTFの前後2フレームずつを隣接フレームAFとしている。すなわち、対象フレームTFをフレームF4とした場合、フレームF2、フレームF3、フレームF5及びフレームF6が隣接フレームAFとなる。 The input information generation device 10 generates input information IN based on a target frame TF, which is a target frame, and an adjacent frame AF, which is a frame adjacent to the target frame TF. The adjacent frame AF is, for example, a frame that is consecutively adjacent before or after the target frame TF. In the illustrated example, two frames before and after the target frame TF are set as adjacent frames AF. That is, when the target frame TF is the frame F4, the frame F2, the frame F3, the frame F5, and the frame F6 are the adjacent frames AF.
 なお、隣接フレームAFの枚数はこの一例に限定されず、対象フレームTFの前後1フレームずつや3フレームずつ等であってもよい。また、隣接フレームAFは、対象フレームTFの前後に隣接する場合の一例に限定されず、例えば対象フレームTFの前又は後のいずれか一方に隣接するフレームのみであってもよい。また、隣接フレームAFは、対象フレームTFと連続している必要はなく、例えばフレームF4を対象フレームTFとした場合、フレームF4とは連続していないフレームF2及びフレームF6等であってもよい。 Note that the number of adjacent frames AF is not limited to this example, and may be one frame before and after the target frame TF, or three frames before and after the target frame TF. Further, the adjacent frames AF are not limited to the example of adjacent frames before and after the target frame TF, but may be only frames adjacent to either the front or the rear of the target frame TF, for example. Furthermore, the adjacent frame AF does not need to be continuous with the target frame TF; for example, when frame F4 is the target frame TF, the adjacent frame AF may be frames F2, F6, etc. that are not continuous with frame F4.
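For illustration, selecting the input frame set can be written as a small helper. The clamping at the clip boundaries is an assumption; the text above only states that the window need not be symmetric or contiguous.

```python
def frame_window(num_frames, target_index, radius=2):
    """Return the indices of the target frame and its neighbours, clamped to the clip."""
    indices = [target_index + d for d in range(-radius, radius + 1)]
    return [min(max(i, 0), num_frames - 1) for i in indices]

print(frame_window(7, 3))  # target F4 (index 3) -> [1, 2, 3, 4, 5], i.e. F2, F3, F4, F5, F6
print(frame_window(7, 0))  # target F1: edge frames are repeated -> [0, 0, 0, 1, 2]
```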
FIG. 4 is a diagram for explaining an overview of the input information generation method according to the first embodiment. A method for generating input information IN by the input information generation device 10 will be described with reference to the same figure. The figure shows a frame at time t-2, a frame at time t-1, a frame at time t, a frame at time t+1, and a frame at time t+2. The frame at time t corresponds to the above-described target frame TF, and the frames at time t-2, time t-1, time t+1, and time t+2 correspond to adjacent frames AF. Note that since each frame includes a large number of pixels, a circuit that processes the entire frame at the same time becomes large-scale. Therefore, when processing is performed in the CNN 200, it is preferable to divide each frame into predetermined sizes. In this embodiment, as an example, a case where each frame is divided into a plurality of patches, each with a size of 256×256, will be described.
 各フレームは、R(赤;Red)、G(緑;Green)×2チャネル、B(青;Blue)の4チャネルの画像データを含んで構成される。入力情報生成装置10は、各チャネルの量子化及びベクトル化を行う。入力情報生成装置10は、例えば各チャネルの画像データから、9チャネルのベクトル化されたデータを生成する。すなわち、入力情報生成装置10は、1つのフレームから4チャネル×9チャネル=36チャネルのデータを生成する。9チャネルのデータとは、互いに異なる閾値を用いて画素値が量子化されたものであってもよい。入力情報生成装置10は、対象フレームTF及び隣接フレームAF(図視する一例では合計5フレーム)から生成された5フレーム×36チャネル=180チャネルのデータを合成(concat)する。入力情報生成装置10は、合成した180チャネルのデータを入力情報INとして、CNN200の入力層に対して出力する。 Each frame is configured to include image data of 4 channels: R (Red), G (Green) × 2 channels, and B (Blue). The input information generation device 10 performs quantization and vectorization of each channel. The input information generation device 10 generates nine channels of vectorized data from, for example, image data of each channel. That is, the input information generation device 10 generates 4 channels×9 channels=36 channels of data from one frame. The nine-channel data may be data in which pixel values are quantized using different threshold values. The input information generation device 10 combines (concatenates) data of 5 frames x 36 channels = 180 channels generated from the target frame TF and the adjacent frames AF (5 frames in total in the illustrated example). The input information generation device 10 outputs the combined 180 channel data to the input layer of the CNN 200 as input information IN.
Note that in the illustrated example, a case has been described in which the input information IN is generated using the four channels of image data constituting one frame, but the aspect of the present embodiment is not limited to this example. The input information generation device 10 may, for example, generate the input information IN based on three channels of data containing RGB. Also, in the illustrated example, a case has been described in which nine channels of data are generated by performing quantization and vectorization based on the image data, but the aspect of the present embodiment is not limited to this example. The number of channels of data generated is preferably a number that allows efficient calculation after synthesis. The input information generation device 10 may, for example, generate N channels × M channels of data by generating M channels (M is a natural number of 1 or more) of data for each of the N channels (N is a natural number of 1 or more) of image data constituting one frame. The value of N × M is preferably close to a multiple of 32 (or 64).
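The channel expansion and concatenation described above can be sketched as follows in Python/NumPy. The sketch assumes the same M thresholds are shared by all channels and frames, whereas the embodiment allows a separate threshold per conversion unit; the function names are illustrative.

import numpy as np

def quantize_channel(channel, thresholds):
    # One (H, W) channel becomes len(thresholds) binary planes
    # (1-bit quantization against each threshold).
    return (channel[None, :, :] > thresholds[:, None, None]).astype(np.uint8)

def build_input_information(frames, thresholds):
    # frames: list of (C, H, W) arrays, e.g. 5 frames of 4 RGGB channels.
    # thresholds: NumPy array of shape (M,), e.g. M = 9.
    planes = []
    for frame in frames:
        for channel in frame:
            planes.append(quantize_channel(channel, thresholds))
    # Concatenate along the channel axis: e.g. 5 x 4 x 9 = 180 channels.
    return np.concatenate(planes, axis=0)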
FIG. 5 is a block diagram illustrating an example of a functional configuration for input information generation according to the first embodiment. An example of the functional configuration of the input information generation device 10 will be described with reference to this figure. The input information generation device 10 includes an image acquisition unit 11, an input conversion unit 12, a synthesis unit 13, and an output unit 14. The input information generation device 10 includes a CPU (Central Processing Unit), not shown, and storage devices such as a ROM (Read only memory) or a RAM (Random access memory), which are connected via a bus. The input information generation device 10 functions as a device including the image acquisition unit 11, the input conversion unit 12, the synthesis unit 13, and the output unit 14 by executing an input information generation program.
Note that all or part of the functions of the input information generation device 10 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field-Programmable Gate Array). The input information generation program may be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The input information generation program may also be transmitted via a telecommunications line.
In the illustrated example, the memory in which the video information IM of the video captured by the imaging device 100 is stored is referred to as the first memory M1, and the memory in which the input information IN generated by the input information generation device 10 is stored is referred to as the second memory M2. The first memory M1 and the second memory M2 are storage devices such as a ROM or a RAM.
The image acquisition unit 11 acquires, from the video information IM stored in the first memory M1, the image information IMD that includes the input images used for processing. Specifically, the image acquisition unit 11 acquires, as input images, a plurality of frames including at least the target frame TF for which the input information IN is to be generated, among the plurality of frames constituting the video. As an example, the image acquisition unit 11 acquires adjacent frames AF as input images in addition to the target frame TF. The adjacent frames AF may be a plurality of frames consecutively adjacent to the target frame TF on each of its front and rear sides.
Note that when the video captured by the imaging device 100 is a color video, the image acquisition unit 11 acquires the pixel values of each color from one frame as a plurality of different images. For example, if the imaging device 100 uses an image sensor employing a Bayer array, the image acquisition unit 11 acquires four channels of RGGB image information from one frame. The pixel values of the images acquired by the image acquisition unit 11 contain multi-bit elements.
The input conversion unit 12 acquires the image information IMD from the image acquisition unit 11. The input conversion unit 12 converts the pixel values of the multiple frames of input images included in the image information IMD into low-bit input data IND based on comparisons with a plurality of thresholds. The input images are RAW images whose pixel values contain multi-bit elements (for example, 12 bits or 14 bits), so the pixel values are converted, based on a plurality of thresholds, into input data IND with a number of bits (for example, 2 bits or 1 bit) equal to or less than the number of bits representing the pixel values of the input images (for example, 8 bits). The input conversion unit 12 outputs the converted input data IND to the synthesis unit 13.
Note that when the image information IMD includes image information for each of the RGB colors, the input conversion unit 12 performs the conversion for each of them. That is, the input conversion unit 12 converts each of the plurality of images acquired from one frame into input data IND.
FIG. 6 is a block diagram showing an example of the functional configuration of the input conversion unit according to the first embodiment. The details of the functions of the input conversion unit 12 will be described with reference to this figure. As shown in the figure, the input conversion unit 12 includes a plurality of conversion units 121 and a threshold storage unit 122. In the illustrated example, the input conversion unit 12 includes, as the plurality of conversion units 121, a conversion unit 121-1, a conversion unit 121-2, ..., and a conversion unit 121-n (n is a natural number of 1 or more). The number of conversion units 121 included in the input conversion unit 12 may be the number of input data IND that the input conversion unit 12 generates from one channel of the input image. That is, when the input conversion unit 12 converts a one-channel input image into nine channels of input data IND, the input conversion unit 12 includes nine conversion units 121, namely conversion units 121-1 to 121-9.
In the illustrated example, the image data of the input image has a matrix-like data structure in which the pixel data are multi-valued, with each element in the x-axis direction and the y-axis direction having more than 8 bits. When this image data is converted by the input conversion unit 12, each element is quantized into low-bit input data (for example, 2 bits or 1 bit, which is 8 bits or less).
The conversion unit 121 compares each element of the input image with a predetermined threshold. The conversion unit 121 quantizes each element of the input image based on the comparison result. The conversion unit 121 quantizes, for example, a 12-bit input image into a 2-bit or 1-bit value. The conversion unit 121 may perform the quantization by comparing each element with a number of thresholds corresponding to the number of bits after conversion. For example, a single threshold is sufficient for conversion to 1 bit, while three thresholds may be used for conversion to 2 bits. In other words, one threshold may be used when the quantization performed by the conversion unit 121 is 1-bit quantization, and three thresholds may be used when it is 2-bit quantization. Note that when a large number of thresholds would be required, for example for 8-bit quantization, the quantization may be performed using a function, a table, or the like instead of thresholds.
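As a minimal sketch of the threshold comparison described above (1-bit quantization with one threshold, 2-bit quantization with three thresholds), assuming NumPy arrays of pixel values; the concrete threshold values are placeholders.

import numpy as np

def quantize_1bit(x, t):
    # A single threshold maps each element to {0, 1}.
    return (x > t).astype(np.uint8)

def quantize_2bit(x, t1, t2, t3):
    # Three thresholds (t1 < t2 < t3) map each element to {0, 1, 2, 3}.
    return ((x > t1).astype(np.uint8)
            + (x > t2).astype(np.uint8)
            + (x > t3).astype(np.uint8))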
Each conversion unit 121 quantizes the same element using its own independent threshold. In other words, for one channel of input, the input conversion unit 12 outputs a vector containing as many elements as there are conversion units 121 as the calculation result (input data IND). Note that the bit precision of the converted result output by the conversion units 121 may be changed as appropriate based on, for example, the bit precision of the input image.
The threshold storage unit 122 stores the plurality of thresholds used in the calculations performed by the conversion units 121. The thresholds stored in the threshold storage unit 122 are predetermined values and are set in correspondence with each of the plurality of conversion units 121. Note that each threshold may be a parameter to be learned, and may be determined and updated in the learning step.
Note that the illustrated example shows a case in which the same element of the input image is input to the plurality of conversion units 121, but the aspect of the input conversion unit 12 is not limited to this. For example, when the input image is image data containing elements of three or more channels including color components, the conversion units 121 may be divided into a plurality of corresponding groups, and the corresponding elements may be input to each group and converted. Also, even for elements other than color components, some conversion processing may be applied in advance to the elements input to a given conversion unit 121, and which conversion unit an element is input to may be switched depending on whether such preprocessing has been performed. Furthermore, the conversion processing does not have to be performed on all elements of the input image; for example, it may be performed only on specific elements of the input image, such as elements corresponding to a specific color.
Note that the number of conversion units 121 does not have to be fixed, and may be determined as appropriate according to the structure of the neural network or the hardware information. When it is necessary to compensate for the decrease in calculation accuracy caused by the quantization in the conversion units 121, it is preferable to set the number of conversion units 121 to be equal to or greater than the bit precision of each element of the input image. More generally, it is preferable to set the number of conversion units 121 to be equal to or greater than the difference in bit precision of the input image before and after quantization. Specifically, when an input image whose pixel values are represented by 8 bits is quantized to 1 bit, the number of conversion units 121 is preferably set to 7 or more (for example, 16 or 32), corresponding to the difference of 7 bits.
Returning to FIG. 5, the synthesis unit 13 combines (concatenates) the plurality of converted input data IND into one piece of data. The data obtained by combining the plurality of input data is also referred to as the composite data CD. The combining process by the synthesis unit 13 may be a process of arranging (or connecting) the plurality of input data IND into one piece of data.
The output unit 14 outputs the composite data CD synthesized by the synthesis unit 13. The composite data CD may be temporarily stored in the second memory M2. The composite data CD is, in other words, the input information IN input to the input layer 210 of the CNN 200.
After generating the input information IN for the target frame TF, the input information generation device 10 generates the input information IN for the frame following that target frame TF. The following frame may be a frame temporally continuous with the target frame TF. That is, after the output unit 14 outputs the composite data CD for the target frame TF, the image acquisition unit 11 shifts the target frame TF by one frame and acquires the frame adjacent to the target frame TF as the new target frame TF. In this way, the input information generation device 10 acquires a plurality of frames including at least the target frame TF as input images and generates the composite data CD.
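The frame-by-frame shift of the target frame TF can be sketched as the sliding window below; the window radius, the clamping at the ends of the video, and the callback name build_input are assumptions for illustration only.

def generate_for_all_frames(frames, build_input, radius=2):
    # Shift the target frame one position at a time over the whole video.
    results = []
    for t in range(len(frames)):
        lo = max(0, t - radius)
        hi = min(len(frames), t + radius + 1)
        neighbors = [frames[i] for i in range(lo, hi) if i != t]
        results.append(build_input(frames[t], neighbors))
    return results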
The input information generation device 10 generates the input information IN for all frames included in the video information IM. Note that in the example described above, a case has been described in which the input information generation device 10 generates the input information IN for all frames included in the video information IM, but the aspect of the present embodiment is not limited to this example. The input information generation device 10 may, for example, generate the input information IN every predetermined number of frames. Furthermore, although the high-quality video generation system 1 converts the video into a high-quality video based on the video information IM, the output format is not limited to a video format. For example, the high-quality video generation system 1 may generate still images from the video. That is, by using frames extracted from the video as target frames TF, the present embodiment can also be applied to a case where frames included in the video information IM are extracted to generate still images.
[Summary of the first embodiment]
According to the embodiment described above, the input information generation device 10, by including the image acquisition unit 11, acquires as input images a plurality of frames including at least the target frame TF for which the input information IN is to be generated, among the frames constituting the video. By including the input conversion unit 12, the input information generation device 10 converts the pixel values of the acquired multiple frames of input images, based on comparisons with a plurality of thresholds, into input data IND with a smaller number of bits (for example, 2 bits or 1 bit, which is 8 bits or less) than the number of bits representing the pixel values of the input images (for example, 12 bits). Furthermore, by including the synthesis unit 13, the input information generation device 10 combines the plurality of converted input data IND into one composite data CD, and by including the output unit 14, it outputs the synthesized composite data CD. That is, according to the present embodiment, the input information generation device 10 generates the input information IN by combining a plurality of input data IND obtained from a plurality of images including at least the target frame TF. The generated input information IN is input to the input layer 210 of the CNN 200.
Here, the input information IN is information whose elements have fewer bits than those of the input image. By processing the input information IN instead of the input image, the CNN 200 can process low-bit information. Therefore, according to the present embodiment, the processing of the CNN 200 can be made lighter. In addition, the input information IN is information generated based on a plurality of images. Therefore, when the quality of the video is improved based on the input information IN, processing that also takes into account the plurality of frames adjacent to the target frame TF can be performed, so noise can be removed with high accuracy. Thus, according to the present embodiment, a low-quality video can be converted into a high-quality video with lightweight calculations. Note that the processing of the CNN 200 is not limited to noise removal.
Furthermore, according to the embodiment described above, since the video to be processed is a color video, the video information IM includes, for example, pixel values of each of the RGB colors. The image acquisition unit 11 acquires the pixel values of each color from one frame as a plurality of different images, and the input conversion unit 12 converts each of the plurality of images acquired from one frame into different input data IND. Therefore, according to the present embodiment, more accurate image processing can be performed, and noise can be removed with even higher accuracy.
Furthermore, according to the embodiment described above, the image acquisition unit 11 acquires, among the plurality of frames constituting the video, the images of a plurality of frames consecutively adjacent to the target frame TF on each of its front and rear sides. According to the present embodiment, since the input information IN is generated based on the information of the frames adjacent before and after the target frame TF, more accurate image processing can be performed. Therefore, according to the present embodiment, noise can be removed with high accuracy.
Furthermore, according to the embodiment described above, after the output unit 14 outputs the composite data CD for the target frame TF, the image acquisition unit 11 takes a frame adjacent to that target frame TF as the new target frame TF and acquires a plurality of frames including at least the target frame TF as input images. That is, the input information generation device 10 generates the input information IN for each of the plurality of frames included in the video by shifting the target frame TF one after another. Therefore, according to the present embodiment, a low-quality video can be converted into a high-quality video.
[Second embodiment]
Next, a second embodiment will be described with reference to FIGS. 7 and 8. First, the problem to be solved in the second embodiment will be explained. The input information generation device 10 according to the first embodiment reads a plurality of frames including the target frame TF and the adjacent frames AF, and performs quantization on each of the plurality of frames. Therefore, with the input information generation device 10 according to the first embodiment, the number of frames to be quantized is large, and the calculation load of the first layer becomes large. The second embodiment aims to solve this problem and further lighten the calculation load.
FIG. 7 is a diagram for explaining an overview of the input information generation method according to the second embodiment. A method for generating the input information IN by the input information generation device 10A according to the second embodiment will be described with reference to this figure. In the figure, the frame at time t is shown as the target frame TF, and the frames at time t-2, time t-1, time t+1, and time t+2 are shown as the adjacent frames AF.
Each frame is configured to include four channels of RGGB image data. The input information generation device 10A performs quantization and vectorization of each channel for the target frame TF. In addition, the input information generation device 10A performs quantization and vectorization of each channel for the average image of the adjacent frames AF. That is, the second embodiment differs from the first embodiment in that quantization and vectorization are performed on the average image of the adjacent frames AF, rather than on each adjacent frame AF individually.
In the illustrated example, each of the four channels of RGGB image data is converted into 16 channels of data. Therefore, 4 channels × 16 channels = 64 channels of data are generated from one frame. The input information generation device 10A converts the target frame TF and the average image of the adjacent frames AF into 64 channels of data each, so a total of 128 channels of data are generated.
The averaging process may be performed by taking a simple average of the pixel values. The input information generation device 10A may also generate an average image for each color by taking the average for each color. That is, the input information generation device 10A may generate an average image based on the R images of the adjacent frames AF from time t-2 to time t+2, an average image based on the G images of the adjacent frames AF from time t-2 to time t+2, and an average image based on the B images of the adjacent frames AF from time t-2 to time t+2.
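A minimal sketch of the per-color simple average of the adjacent frames, assuming each frame is a (C, H, W) NumPy array whose channels correspond to the colors:

import numpy as np

def average_adjacent_frames(adjacent_frames):
    # adjacent_frames: list of (C, H, W) arrays, e.g. the frames at t-2, t-1, t+1, t+2.
    # Averaging along the frame axis keeps the channels separate, so each color
    # is averaged only with the same color of the other frames.
    return np.mean(np.stack(adjacent_frames, axis=0), axis=0)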
Note that in the illustrated example, quantization and vectorization are performed after taking the average of the adjacent frames AF, but the aspect of the present embodiment is not limited to this example. The input information generation device 10A may be configured, for example, to take the average after performing quantization and vectorization.
After the target frame TF is quantized and vectorized and the average image of the adjacent frames AF is quantized and vectorized, the input information IN is generated by combining these. In the illustrated example, 128 channels of data are generated. The combined input information IN contains less information than in the first embodiment. On the other hand, the amount of data representing the target frame TF is larger than in the first embodiment.
FIG. 8 is a block diagram illustrating an example of a functional configuration for input information generation according to the second embodiment. An example of the functional configuration of the input information generation device 10A will be described with reference to this figure. The input information generation device 10A differs from the input information generation device 10 in that it further includes an integration unit 15, and includes an input conversion unit 12A instead of the input conversion unit 12 and a synthesis unit 13A instead of the synthesis unit 13. In the description of the input information generation device 10A, components similar to those of the input information generation device 10 may be given the same reference numerals and their description may be omitted.
The integration unit 15 acquires information regarding the adjacent frames AF from the image acquisition unit 11. The adjacent frames AF are the frames other than the target frame TF among the plurality of frames acquired by the image acquisition unit 11. The integration unit 15 performs a process of integrating the plurality of adjacent frames AF into one integrated frame IF by performing calculations based on the pixel values of the adjacent frames AF. The integration unit 15 outputs the integrated frame IF obtained as a result of the calculation to the input conversion unit 12A.
The integration unit 15 may perform the integration process by, for example, taking a simple average of the adjacent frames AF. In this case, the integration unit 15 takes, for example, the average of the pixel values of the plurality of adjacent frames AF as the pixel values of the integrated frame IF.
Note that the integration process of the integration unit 15 according to the present embodiment is not limited to the case of taking a simple average. The integration unit 15 may, for example, perform the integration process by calculating a weighted average. The weighted average is calculated according to the temporal distance from the target frame TF. For example, the integration unit 15 may reduce the weight of the frames at time t-2 and time t+2, which are temporally far from time t, by multiplying their pixel values by 0.7, which is smaller than 1, and increase the weight of the frames at time t-1 and time t+1, which are temporally close to time t, by multiplying their pixel values by 1.3, which is larger than 1. That is, the integration unit 15 may calculate the pixel values of the integrated frame IF by calculating a weighted average of the plurality of adjacent frames AF according to their temporal distance from the target frame TF. By integrating with a weighted average, the degree of contribution of each adjacent frame AF to the target frame TF can be reflected in the integrated frame IF. Note that even when the temporal distance from the target frame TF is the same, the magnitude of the weight may differ depending on whether the adjacent frame AF is before or after the target frame TF. For example, the weight may be increased when the adjacent frame AF is before the target frame TF, and decreased when the adjacent frame AF is after the target frame TF.
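A minimal sketch of the weighted average, using the 0.7 / 1.3 weights given as an example above; dividing by the sum of the weights is an assumption added here so that the result stays on the same scale as the pixel values.

def weighted_average(adjacent_by_offset):
    # adjacent_by_offset: dict mapping the temporal offset from the target frame
    # to the frame array, e.g. {-2: f0, -1: f1, +1: f3, +2: f4}.
    weights = {-2: 0.7, -1: 1.3, +1: 1.3, +2: 0.7}
    total = sum(weights[k] * v for k, v in adjacent_by_offset.items())
    norm = sum(weights[k] for k in adjacent_by_offset)
    return total / norm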
The input conversion unit 12A converts the pixel values of the target frame TF acquired from the image acquisition unit 11 into input data IND based on comparisons with a plurality of thresholds. The input conversion unit 12A further converts the pixel values of the integrated frame IF acquired from the integration unit 15 into input data based on comparisons with a plurality of thresholds. The input conversion unit 12A outputs the converted input data IND to the synthesis unit 13A.
The synthesis unit 13A combines the plurality of input data IND converted based on the target frame TF and the plurality of input data IND converted based on the integrated frame IF into one composite data CD. The combined input information IN contains less information than in the first embodiment.
[Summary of the second embodiment]
According to the embodiment described above, the input information generation device 10A, by further including the integration unit 15, performs calculations based on the pixel values of the adjacent frames AF and integrates the plurality of adjacent frames AF into one integrated frame IF. The input conversion unit 12A converts the pixel values of the target frame TF into input data IND based on comparisons with a plurality of thresholds, and further converts the pixel values of the integrated frame IF into input data IND based on comparisons with a plurality of thresholds. The synthesis unit 13A combines the plurality of input data IND converted based on the target frame TF and the plurality of input data IND converted based on the integrated frame IF into one composite data CD. That is, rather than quantizing and vectorizing each adjacent frame AF in the same way as the target frame TF, the input information generation device 10A obtains an integrated frame IF based on the plurality of adjacent frames AF, and quantizes and vectorizes the integrated frame IF. Therefore, the input information IN synthesized by the synthesis unit 13A contains less information than in the first embodiment, and the calculation load of the first layer can be reduced.
Furthermore, according to the embodiment described above, the integration unit 15 takes the average of the pixel values of the plurality of adjacent frames AF as the pixel values of the integrated frame IF. That is, the integration unit 15 takes the simple average of the adjacent frames AF as the integrated frame IF. Therefore, according to the present embodiment, input information IN with a smaller amount of information than the input information IN of the first embodiment can be generated by a simple calculation, and the calculation load of the first layer can be reduced.
Furthermore, according to the embodiment described above, the integration unit 15 calculates the pixel values of the integrated frame IF by calculating a weighted average of the plurality of adjacent frames AF according to their temporal distance from the target frame TF. Therefore, according to the present embodiment, the integrated frame IF can be generated in consideration of the degree of contribution of each adjacent frame AF to the target frame TF. Accordingly, by using the input information IN generated by the input information generation device 10A, the CNN 200 can perform image processing with higher accuracy.
[Third embodiment]
Next, a third embodiment will be described with reference to FIG. 9. First, the problem to be solved in the third embodiment will be explained. The input information generation device 10A according to the second embodiment calculates the average value of the adjacent frames AF by including the integration unit 15. When the target frame TF is shifted one frame at a time, calculating the average value anew for every frame duplicates processing and is not efficient. The third embodiment aims to solve this problem and further lighten the calculation load.
FIG. 9 is a block diagram illustrating an example of a functional configuration for input information generation according to the third embodiment. An example of the functional configuration of the input information generation device 10B will be described with reference to this figure. The input information generation device 10B differs from the input information generation device 10A in that it further includes an average value temporary storage unit 16, an imaging condition acquisition unit 17, and an adjustment unit 18, and includes an integration unit 15B instead of the integration unit 15. In the description of the input information generation device 10B, components similar to those of the input information generation device 10A may be given the same reference numerals and their description may be omitted.
The average value temporary storage unit 16 stores the average value of the pixel values of predetermined frames among the frames constituting the video. The value stored in the average value temporary storage unit 16 is calculated by the integration unit 15B. The integration unit 15B acquires information about the target frame TF from the image acquisition unit 11 and acquires the stored value SV from the average value temporary storage unit 16. The integration unit 15B calculates the pixel values of the integrated frame IF by a calculation based on the target frame TF and the stored value SV, which is the value stored in the average value temporary storage unit 16. The integration unit 15B stores the calculated value in the average value temporary storage unit 16 as the calculated value CV. That is, the value stored in the average value temporary storage unit 16 is updated every time the integration unit 15B performs the calculation for a new target frame TF. By repeating such calculations, the input information generation device 10B computes a moving average based on the target frames TF. Note that for the calculation for the first frame, since no stored value SV yet exists in the average value temporary storage unit 16, the integration unit 15B may perform the calculation based only on the target frame TF.
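The update of the stored value SV can be sketched as a running average as follows; the exponential decay factor alpha is an assumption, since the text does not fix the exact moving-average formula.

import numpy as np

class RunningAverage:
    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.value = None  # stored value SV; None until the first target frame

    def update(self, target_frame):
        frame = target_frame.astype(np.float32)
        if self.value is None:
            # First frame: no stored value SV yet, so use the target frame itself.
            self.value = frame
        else:
            # Blend the stored value SV with the new target frame (calculated value CV).
            self.value = (1.0 - self.alpha) * self.value + self.alpha * frame
        return self.value  # used as the integrated frame IF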
The imaging condition acquisition unit 17 acquires the imaging conditions of the video from the imaging device 100. The imaging conditions of the video acquired by the imaging condition acquisition unit 17 may be, for example, settings of the imaging device such as shutter speed, aperture, or ISO sensitivity. The imaging conditions of the video acquired by the imaging condition acquisition unit 17 may also include other information regarding the operation and driving of the imaging device 100.
The adjustment unit 18 adjusts the value (average value) stored in the average value temporary storage unit 16 according to the imaging conditions acquired by the imaging condition acquisition unit 17. Here, when the imaging conditions of the imaging device 100 change, the relationship between the pixel values of the target frame TF and the past moving average value changes. For example, if the ISO sensitivity is doubled by a change in the settings of the imaging device 100, the pixel values of the target frame TF become brighter than the past moving average value, so the moving average value suddenly appears dark in comparison. Therefore, when the ISO sensitivity is doubled by a change in the settings of the imaging device 100, doubling the value stored in the average value temporary storage unit 16 allows the integration unit 15B to continue calculating the moving average value while making use of the value stored in the average value temporary storage unit 16.
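A minimal sketch of the adjustment described above, assuming the gain change can be expressed as the ratio of the old and new ISO sensitivities; the function and parameter names are illustrative.

def adjust_stored_average(stored_value, old_iso, new_iso):
    # When the ISO sensitivity doubles, the stored average is doubled as well,
    # so the moving average stays on the same brightness scale as the new frames.
    return stored_value * (new_iso / old_iso)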
Furthermore, when the shooting scene of the video has switched, calculating a moving average with the past frames may prevent the CNN 200 from appropriately performing image processing on the target frame TF. Therefore, when the shooting scene of the video has switched, it is preferable not to calculate a moving average with the past frames. The adjustment unit 18 may be configured to reset the value stored in the average value temporary storage unit 16 when the shooting scene of the video switches, so that the integration unit 15B starts calculating a new moving average value. Whether the shooting scene of the video has changed may be determined based on, for example, the ON/OFF of the power button of the imaging device 100 or the ON/OFF of the shooting button or the stop button.
Note that among the plurality of frames included in the video, a frame with a large luminance change may be inserted. Possible causes for such a frame include a case where a light source is captured due to a change in the imaging angle, or a case where the headlights of a car are reflected in the image. In such a case, the integration unit 15B may exclude frames with a large luminance change from the frames used to calculate the average value. By excluding frames with a large luminance change from the average value calculation, the moving average value can be prevented from being dragged by such frames. As an example of determining whether the luminance change is large, the pixel values of the immediately preceding target frame TF may be compared with the pixel values of the target frame TF to be processed to determine whether the difference is equal to or less than a threshold.
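A minimal sketch of the exclusion test, assuming the comparison is made on the mean absolute difference between the two frames; the text only requires comparing a difference against a threshold, so the exact criterion is an assumption.

import numpy as np

def has_large_luminance_change(prev_target, current_target, threshold):
    # Compare the current target frame against the immediately preceding one;
    # True means the frame should be excluded from the moving-average update.
    diff = np.mean(np.abs(current_target.astype(np.float32)
                          - prev_target.astype(np.float32)))
    return diff > threshold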
[Summary of the third embodiment]
According to the embodiment described above, the input information generation device 10B, by further including the average value temporary storage unit 16, stores the average value of the pixel values of predetermined frames among the frames constituting the video. Furthermore, in the input information generation device 10B, the integration unit 15B calculates the pixel values of the integrated frame IF by a calculation based on the value stored in the average value temporary storage unit 16 and the target frame TF. That is, the input information generation device 10B calculates the pixel values of the integrated frame IF based on the target frame TF and the stored moving average value. Therefore, the input information generation device 10B has a lighter calculation load than the input information generation device 10A. Thus, according to the present embodiment, the calculation load can be further reduced.
Furthermore, according to the embodiment described above, the integration unit 15B excludes frames with a large luminance change among the plurality of frames constituting the video from the frames used to calculate the average value. Therefore, according to the present embodiment, when the luminance suddenly changes greatly, excluding that frame from the calculation of the moving average value prevents the pixel values of the integrated frame IF from being dragged by the pixel values of the frame with the sudden large luminance change.
Furthermore, according to the embodiment described above, the input information generation device 10B further includes the imaging condition acquisition unit 17, which acquires the imaging conditions of the video, and the adjustment unit 18, which adjusts the average value stored in the average value temporary storage unit 16 according to the acquired imaging conditions. Therefore, according to the present embodiment, the average value can be adjusted in accordance with changes in the imaging conditions. Thus, according to the present embodiment, the moving average value can continue to be calculated even when the imaging conditions change.
[Fourth embodiment]
Next, a fourth embodiment will be described with reference to FIG. 10. First, the problem to be solved in the fourth embodiment will be explained. The input information generation device 10B according to the third embodiment calculates a moving average by including the average value temporary storage unit 16. According to the third embodiment, the integration unit 15B calculates the moving average of the entire image. The input information generation device 10B generates the input data IN based on the target frame TF and the moving average, and the CNN 200 removes noise from the video based on the generated input data IN. Since the integration unit 15B calculates the moving average of the entire image, when a moving subject is captured in part of the video, a problem may arise in which, as a result of the noise removal, an afterimage occurs in the part where the moving subject is captured. The fourth embodiment attempts to solve this problem.
FIG. 10 is a block diagram illustrating an example of a functional configuration for input information generation according to the fourth embodiment. An example of the functional configuration of an input information generation device 10C according to the fourth embodiment will be described with reference to this figure. The input information generation device 10C differs from the input information generation device 10B in that it further includes a comparison unit 19 and includes an integration unit 15C instead of the integration unit 15B. The input information generation device 10C may include the imaging condition acquisition unit 17 and the adjustment unit 18 in the same way as the input information generation device 10B, but does not have to include them. The illustrated example describes a case where the input information generation device 10C does not include the imaging condition acquisition unit 17 and the adjustment unit 18. In the description of the input information generation device 10C, components similar to those of the input information generation device 10B may be given the same reference numerals and their description may be omitted.
The comparison unit 19 acquires the stored value SV from the average value temporary storage unit 16 and acquires the target frame TF from the integration unit 15C. The comparison unit 19 compares the acquired stored value SV stored in the average value temporary storage unit 16 with the pixel values of the target frame TF. The comparison unit 19 may compare the entire image, compare pixel by pixel, or compare patch by patch, where a patch consists of a plurality of pixels. The comparison unit 19 outputs the result of the comparison to the integration unit 15C as the comparison result CR. The comparison result CR may include the difference between the pixel values, or may include information about the result of comparing that difference with a predetermined threshold.
The integration unit 15C acquires the comparison result CR from the comparison unit 19 and acquires the stored value SV from the average value temporary storage unit 16. Based on the comparison result CR obtained by the comparison unit 19, when the difference is equal to or less than a predetermined value, the integration unit 15C calculates a moving average based on the stored value SV stored in the average value temporary storage unit 16 and the target frame TF, and takes the calculated value as the pixel value of the integrated frame IF. When, based on the comparison result CR, the difference is not equal to or less than the predetermined value, the integration unit 15C takes the pixel value of the target frame TF as the pixel value of the integrated frame IF.
The integration process by the integration unit 15C may be performed for the entire image, pixel by pixel, or patch by patch, where a patch consists of a plurality of pixels. When the integration process is performed pixel by pixel or patch by patch, the pixel values of the integrated frame IF are the moving average values at locations where the difference is equal to or less than the predetermined value (that is, locations with little motion), and the pixel values of the target frame TF at locations where the difference is not equal to or less than the predetermined value (that is, locations with large motion). The integration unit 15C stores the calculated result in the average value temporary storage unit 16 as the calculated value CV. According to the input information generation device 10C, subjects with large motion are not incorporated into the average image, while backgrounds with little motion are incorporated into the average image. The calculation performed by the input information generation device 10C can also be called a selective average.
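A minimal per-pixel sketch of the selective average, assuming the stored value SV and the target frame are NumPy arrays of the same shape; the 0.5 / 0.5 blend used for the low-motion pixels is an assumption standing in for the moving-average update.

import numpy as np

def selective_average(stored_value, target_frame, diff_threshold):
    stored = stored_value.astype(np.float32)
    target = target_frame.astype(np.float32)
    # Small difference = little motion: keep averaging with the stored value.
    # Large difference = large motion: take the target frame as-is.
    low_motion = np.abs(target - stored) <= diff_threshold
    blended = 0.5 * stored + 0.5 * target
    return np.where(low_motion, blended, target)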
Note that as a modification of the process performed by the integration unit 15C, instead of the binary choice of incorporating or not incorporating a value into the average image, a coefficient may be applied. For example, based on the comparison result CR obtained by the comparison unit 19, when the difference is equal to or less than the predetermined value, the integration unit 15C may calculate a moving average based on the value obtained by multiplying the stored value SV stored in the average value temporary storage unit 16 by a predetermined coefficient (for example, 0.9) and the target frame TF, and use it as the pixel values of the integrated frame IF. When the difference is not equal to or less than the predetermined value, the integration unit 15C may calculate a moving average based on the value obtained by multiplying the stored value SV stored in the average value temporary storage unit 16 by a predetermined coefficient (for example, 0.1) and the target frame TF, and use it as the pixel values of the integrated frame IF.
[Summary of the fourth embodiment]
According to the embodiment described above, the input information generation device 10C, by further including the comparison unit 19, compares the stored value SV stored in the average value temporary storage unit 16 with the pixel values of the target frame TF. When the difference obtained by the comparison in the comparison unit 19 is equal to or less than a predetermined value, the integration unit 15C calculates the pixel values of the integrated frame IF by calculating a moving average based on the stored value SV stored in the average value temporary storage unit 16 and the target frame TF. When the difference obtained by the comparison in the comparison unit 19 is not equal to or less than the predetermined value, the integration unit 15C takes the pixel values of the target frame TF as the pixel values of the integrated frame IF. That is, the input information generation device 10C distinguishes between subjects with large motion and backgrounds with little motion, and determines the pixel values of the integrated frame IF by selectively performing the averaging process. The input information generation device 10C generates the input data IN based on the target frame TF and the integrated frame IF. Therefore, according to the present embodiment, since the pixel values of the adjacent frames AF are not reflected in the pixel values of the integrated frame IF at locations with large motion, the problem of afterimages can be suppressed.
Note that the second to fourth embodiments described above require a predetermined calculation to determine the pixel values of the integrated frame IF, which can make the processing heavy. To make the processing even lighter, one of the frames adjacent to the target frame TF (for example, two frames before and two frames after) may be selected by a predetermined algorithm and used as the integrated frame IF. The predetermined algorithm may be one that randomly selects one of the frames adjacent to the target frame TF. In this case, the integration unit 15 uses the randomly selected frame among the adjacent frames AF acquired by the image acquisition unit 11 as the integrated frame IF.
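A sketch of this lightweight variant, assuming the adjacent frames are available as a simple list; the random choice stands in for the "predetermined algorithm" mentioned above.

```python
import random

def pick_integrated_frame(adjacent_frames):
    """Lightweight variant: use one randomly chosen adjacent frame directly
    as the integrated frame IF, skipping the averaging calculations."""
    return random.choice(adjacent_frames)
```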
(Other examples)
 Note that in the embodiments described above, an example was shown in which the combining unit 13 and the integration unit 15 combine and integrate a plurality of acquired frames. However, the present embodiment is not limited to this example, and the target of combining or integration is not limited to the frames themselves. As another example, the combining and integration processing may be applied to intermediate outputs of at least a part of the CNN 200. More specifically, the intermediate outputs generated when the target frame TF and the adjacent frames AF are processed by the CNN 200 may be the target of combining or integration. Alternatively, the intermediate output generated when the adjacent frames AF are processed by the CNN 200 and the result of vectorizing the target frame TF may be the target of combining or integration. The present embodiment is not limited to the examples described above; the target of combining or integration broadly includes any information obtained as a result of processing based on the frames.
Note that the aspect of the present embodiment is not limited to any one of the first to fourth embodiments described above; any of the first to fourth embodiments may be selectively used based on a predetermined condition. The predetermined condition may be, for example, the shooting conditions of the video, the shooting mode, the exposure conditions, or the type of subject.
So far, examples have been shown in which calculations are performed using the adjacent frames AF in order to improve the quality of the target frame TF. When training the trained model included in the CNN 200 as well, it is preferable to perform the training using not only the target frame TF but also the adjacent frames AF, as in any one of the first to fourth embodiments. The calculations related to training do not necessarily have to be executed in the image processing device 2; results such as parameters learned in advance in a dedicated learning device may be included in the CNN 200 as a trained model.
Although the modes for carrying out the present invention have been described above using embodiments, the present invention is not limited to these embodiments in any way, and various modifications and substitutions can be made without departing from the spirit of the present invention.
Next, with reference to FIGS. 11 to 20, preferred embodiments of the learning device, the program, and the learning method for a noise reduction device according to aspects of the present invention will be described in detail with reference to the accompanying drawings. Aspects of the present invention are not limited to these embodiments and include those to which various changes or improvements are added. That is, the components described below include those that a person skilled in the art could easily conceive of and those that are substantially the same, and the components described below can be combined as appropriate. Various omissions, substitutions, or changes of the components can also be made without departing from the gist of the present invention. In the following drawings, in order to make each configuration easier to understand, the scale, number, and the like of each structure may differ from those of the actual structure.
First, the background art of the present invention and the problem to be solved by the invention will be described.
Conventionally, there have been techniques that use machine learning to process a low-quality image into a high-quality image. In this technical field, a learning model is trained using combinations of a noise image on which noise is superimposed and a high-quality image as training data. The training data is created by capturing the same object with an imaging device under different exposure settings to obtain a high-quality image and a noise image. It is generally known that machine learning requires a large amount of training data, and creating training data by capturing images with a camera is laborious. A technique is therefore known in which training data is created by adding random noise to a high-quality image (see, for example, Japanese Patent Application Laid-Open No. 2021-071936). Using such conventional techniques, it is known to create training data for inferring a high-quality image from a low-quality image by adding random noise to a high-quality image.
Here, it is known that when processing a low-quality video into a high-quality video, a large amount of training data for machine learning is required, just as in the case of still images described above. In the case of videos, however, it is very difficult to shoot the same object under different settings and prepare a high-quality video and a low-quality video in which the same subject is captured. It is conceivable to apply the conventional technique described above and generate a low-quality video by superimposing noise on each frame of a high-quality video shot in advance, but this is also very difficult because of problems such as the enormous storage capacity required.
The present invention therefore aims to provide a technique capable of generating training data for inferring a high-quality video from a low-quality video.
Next, the premises of the present embodiment will be described. The learning device, the program, and the learning method for a noise reduction device according to the present embodiment train a learning model so that it takes low-quality video information with superimposed noise as input and infers a high-quality video from which the noise has been removed. Low-quality videos include low-image-quality videos, and high-quality videos include high-image-quality videos. The training data used for learning by the learning device, the program, and the learning method for a noise reduction device according to the present embodiment is generated from still images obtained by capturing a subject. A still image of a subject may be a single high-quality image, or a plurality of images of the same subject (a combination of one or more high-quality images and one or more low-quality images). The plurality of images of the same subject may be captured under mutually different imaging conditions. The images of the subject may also be other images including at least one image. An example of a high-quality image is an image of high quality captured with a low ISO sensitivity and a long exposure. In the following description, a high-quality image may be referred to as GT (Ground Truth). An example of a low-quality image is an image of low quality captured with a high ISO sensitivity and a short exposure.
In the following description, image quality degradation due to noise will be described as an example of a low-quality image, but the present embodiment is widely applicable to factors other than noise that degrade image quality. Examples of factors that degrade image quality include reduced resolution or color shift due to optical aberrations, reduced resolution due to camera shake or subject blur, non-uniform black levels caused by dark current or circuitry, ghosts and flare caused by high-luminance subjects, and signal level abnormalities.
Note that images prepared in advance may be used to generate the training data. In the following description, a low-quality image may be referred to as a low-image-quality image or a noise image, and a high-quality image may be referred to as a high-image-quality image or GT. Similarly, a low-quality video may be referred to as a low-image-quality video or a noise video, and a high-quality video may be referred to as a high-image-quality video or GT.
The images handled by the learning device according to the present embodiment may be still images or frames included in a video. The data format may be a format without compression encoding, such as a RAW format, or a format with compression encoding, such as a JPEG or MPEG format. In the following, unless otherwise specified, the case where the image is a still image in RAW format will be described.
The images handled by the learning device according to the present embodiment may be images captured by a CCD camera using a CCD (Charge Coupled Devices) image sensor, or images captured by a CMOS camera using a CMOS (Complementary Metal Oxide Semiconductor) image sensor. They may be color images or monochrome images. They may also be images captured by acquiring non-visible light components, such as with an infrared camera using an infrared sensor.
[Fifth embodiment]
 First, a fifth embodiment will be described with reference to FIGS. 11 to 18.
 FIG. 11 is a diagram for explaining an overview of the learning system according to the fifth embodiment. An overview of the learning system 1001 will be described with reference to this figure. The learning system 1001 shown in the figure is an example of a configuration in the learning stage of machine learning. The learning system 1001 trains the learning model 1040 using training data TD generated based on images captured by the imaging device 1020.
The learning system 1001 includes the imaging device 1020 and thereby captures a high-quality image 1031 and low-quality images 1032. The high-quality image 1031 and the low-quality images 1032 are images in which the same subject is captured. For example, the high-quality image 1031 and the low-quality images 1032 are captured at the same angle of view and imaging angle, with different settings such as ISO sensitivity and exposure time. It is preferable that there be one high-quality image 1031, but there may be more than one. It is preferable that there be a plurality of low-quality images 1032, but there may be only one. The plurality of low-quality images 1032 are preferably different images captured with different settings such as ISO sensitivity and exposure time. The imaging device 1020 may be, for example, a smartphone or a tablet terminal having communication means, or a surveillance camera or the like having communication means.
The learning system 1001 generates a high-quality video 1033 from the high-quality image 1031 and a low-quality video 1034 from the low-quality images 1032. The high-quality video 1033 is preferably generated from a single high-quality image 1031, and the low-quality video 1034 is preferably generated from a plurality of low-quality images 1032. The high-quality video 1033 and the low-quality video 1034, generated from the high-quality image 1031 and the low-quality images 1032 in which the same subject is captured, are associated with each other. The mutually corresponding high-quality video 1033 and low-quality video 1034 are input to the learning model 1040 as training data TD for learning.
Note that the mutually corresponding high-quality video 1033 and low-quality video 1034 may be temporarily stored in a predetermined storage device for learning performed later. That is, the learning system 1001 may generate a plurality of sets of training data TD in advance before the learning performed later. The high-quality image 1031 and the low-quality images 1032 captured by the imaging device 1020 may also be temporarily stored in a predetermined storage device. In this case, the learning system 1001 may store a plurality of combinations of mutually corresponding high-quality images 1031 and low-quality images 1032, and generate the training data TD at the time of learning.
The learning model 1040 is trained using the training data TD generated by the learning system 1001. Specifically, the learning model 1040 is trained to infer a high-quality video from a low-quality video. In other words, the trained learning model 1040 takes a low-quality video as input, infers a high-quality video, and outputs the inference result. The trained learning model 1040 may therefore be used in a noise reduction device for removing noise from a low-quality video.
Note that the high-quality image 1031 and the low-quality images 1032 captured by the imaging device 1020 are stored in a predetermined storage device that temporarily stores information. The predetermined storage device may be provided in the imaging device 1020 or in a cloud server or the like. That is, the learning system 1001 may be configured on an edge device, or may be configured to include an edge device and a cloud server. A GPU or the like provided on a server may also be used for training the learning model 1040.
FIG. 12 is a diagram showing an example of the functional configuration of the learning device according to the fifth embodiment. The functional configuration of the learning device 1010 will be described with reference to this figure. The learning device 1010 is used to realize the learning system 1001 described above. The learning device 1010 generates the high-quality video 1033 and the low-quality video 1034 based on the high-quality image 1031 and the low-quality images 1032 captured by the imaging device 1020, and trains the learning model 1040 using the generated high-quality video 1033 and low-quality video 1034 as training data TD. The learning device 1010 includes an image acquisition unit 1011, a video information generation unit 1012, and a learning unit 1013. The learning device 1010 includes a CPU (Central Processing Unit), storage devices such as a ROM (Read Only Memory) and a RAM (Random Access Memory), and the like, connected by a bus (not shown). By executing a learning program, the learning device 1010 functions as a device including the image acquisition unit 1011, the video information generation unit 1012, and the learning unit 1013.
Note that all or some of the functions of the learning device 1010 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field-Programmable Gate Array). The learning program may be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The learning program may also be transmitted via a telecommunications line.
The image acquisition unit 1011 acquires image information I from the imaging device 1020. The image information I includes first image information I1 and second image information I2. The first image information I1 includes at least one high-quality image 1031. The second image information I2 includes at least one low-quality image 1032. The low-quality image 1032 included in the second image information I2 captures the same subject as the subject captured in the high-quality image 1031 included in the first image information I1, and the images included in the second image information I2 have lower image quality than the images included in the first image information I1. The image acquisition unit 1011 outputs the acquired image information I to the video information generation unit 1012.
The video information generation unit 1012 generates video information M by cutting out a plurality of portions of the images included in the image information I and connecting the cut-out images as frame images at a predetermined time interval (which can also be called a frame rate). The frame rate may be, for example, 60 [FPS (frames per second)]. The position of the image cut out by the video information generation unit 1012 may differ for each frame. For example, the size of the cut-out images may be fixed, and the video information generation unit 1012 may cut out a plurality of images at positions shifted by a predetermined number of pixels in a predetermined direction. Specifically, the size of the cut-out image may be fixed at 256 pixels x 256 pixels, and the video information generation unit 1012 may cut out images at positions shifted by 10 pixels for each frame. If the shift amount is too large, the amount of change between frames becomes too large and the resulting video becomes unnatural, so it is preferable to set a limit (upper limit) so that the shift does not exceed a predetermined amount. The shift amount and this limit are preferably determined based on the shooting angle of view, the shooting resolution, the focal length of the optical system, the distance to the subject, the shooting frame rate, and the like. For a subject such as a falling object, the speed increases with acceleration, so the shift amount may be increased for frames temporally farther from the target image.
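A sketch of this crop-and-shift procedure in Python/NumPy, assuming an H x W(x C) array that is larger than the patch. The 256x256 patch size and the 10-pixel step are the example figures given above; the clamping of the window to the image bounds and the optional upper limit on the total shift are assumptions about how the limit might be enforced.

```python
import numpy as np

def crop_shifted_patches(image, num_frames, start_xy=(0, 0),
                         step_xy=(10, 0), patch=256, max_shift=None):
    """Cut `num_frames` patches from one still image, shifting the crop
    window by `step_xy` pixels per frame (example values from the text).
    Assumes the image is larger than the patch size."""
    h, w = image.shape[:2]
    x0, y0 = start_xy
    dx, dy = step_xy
    frames = []
    for n in range(num_frames):
        sx, sy = n * dx, n * dy
        if max_shift is not None:            # optional upper limit on the shift
            sx, sy = min(sx, max_shift), min(sy, max_shift)
        x = min(max(x0 + sx, 0), w - patch)  # keep the window inside the image
        y = min(max(y0 + sy, 0), h - patch)
        frames.append(image[y:y + patch, x:x + patch].copy())
    return frames  # play back at e.g. 60 fps to obtain the pseudo video
```

Increasing the per-frame step for later frames (instead of a constant `step_xy`) would approximate the accelerating motion of a falling subject mentioned above.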
The video information generation unit 1012 generates first video information M1 from the images included in the first image information I1, and generates second video information M2 from the images included in the second image information I2. That is, the video information generation unit 1012 cuts out a plurality of images at different positions, each being a part of the first image information I1, and combines the cut-out images to generate the first video information M1. Similarly, the video information generation unit 1012 cuts out a plurality of images at different positions, each being a part of the second image information I2, and combines the cut-out images to generate the second video information M2. Combining a plurality of images to generate a video may mean converting the plurality of images into a file format in which they are displayed at predetermined time intervals according to the frame rate. The video information generation unit 1012 outputs information including the generated first video information M1 and second video information M2 to the learning unit 1013 as video information M.
Here, the sizes of the plurality of images cut out by the video information generation unit 1012 and the cut-out positions may be determined arbitrarily. However, it is preferable that the positions cut out from the images included in the first image information I1 and the positions cut out from the images included in the second image information I2 be substantially the same, because the first video information M1, which is a high-quality video, and the second video information M2, which is a low-quality video, should capture the same subject.
The learning unit 1013 acquires the video information M from the video information generation unit 1012. The learning unit 1013 trains the learning model 1040 by inputting the acquired video information M to the learning model 1040 as training data TD. The learning model 1040 is trained to infer a high-quality video from a low-quality video. That is, the learning unit 1013 performs training to infer a high-quality video from a low-quality video based on the training data TD including the first video information M1 and the second video information M2 generated by the video information generation unit 1012. The learning model 1040 can also be said to be trained to infer a video from which the noise in the input video has been removed.
Next, with reference to FIGS. 13 to 15, the images that the learning device 1010 cuts out from the images captured by the imaging device 1020 will be described. In the following description, the method of generating a high-quality video from a high-quality image (the method described with reference to FIG. 13) and the method of generating a low-quality video from low-quality images (the method described with reference to FIG. 14) are described as being different from each other, but the present embodiment is not limited to this example. Instead, a high-quality video may be generated from a high-quality image and a low-quality video may be generated from low-quality images using the same method. That is, a low-quality video may be generated by the method described with reference to FIG. 13, and a high-quality video may be generated by the method described with reference to FIG. 14.
FIG. 13 is a diagram for explaining an example of the positions of images cut out from a high-quality image by the learning device according to the fifth embodiment. An example of the positions of images that the learning device 1010 cuts out from a high-quality image will be described with reference to this figure. FIG. 13(A) shows an image I-11, which is an example of an image included in the first image information I1. FIG. 13(B) shows, as image I-12, an example in which a plurality of images are cut out from the image I-11 shown in FIG. 13(A).
As shown in FIG. 13(A), the ball B, which is the subject, is captured in the image I-11. The video information generation unit 1012 cuts out a plurality of images from the image I-11 and connects the cut-out images in time, thereby generating a video from the still image I-11.
The image I-12 shown in FIG. 13(B) shows a plurality of cut-out images CI, which are images cut out by the video information generation unit 1012. Specifically, cut-out images CI-11 to CI-15 are shown as examples of the images cut out by the video information generation unit 1012. When the cut-out images CI-11 to CI-15 are not distinguished, they may simply be referred to as cut-out images CI.
The cut-out images CI-11 to CI-15 are each shifted by a predetermined number of pixels in the vertical and horizontal directions. According to the first video information M1 generated by the video information generation unit 1012, image C-11 is displayed at a time t1, image C-12 at a time t2, image C-13 at a time t3, image C-14 at a time t4, and image C-15 at a time t5. By connecting different cut-out images CI in time in this way, it is possible to generate a video in which the ball B, the subject in the still image, appears to be moving. When the video information generation unit 1012 generates a video with a frame rate of 60 [fps], the interval between the times may be 1/60 of a second.
The shift direction and shift amount of the images cut out by the video information generation unit 1012 are preferably determined based on shooting conditions such as the shooting angle of view, the shooting resolution, the focal length of the optical system, the distance to the subject, and the shooting frame rate. When simulating a falling subject, the speed increases with acceleration, so it is preferable to gradually change (increase) the shift amount.
Here, the high-quality video (first video information M1) generated by the learning device 1010 is a high-image-quality video without superimposed noise. Ideally, therefore, the still image from which the video is generated has no superimposed noise, and each frame of the high-quality video generated from such an image also has no superimposed noise. The video information generation unit 1012 therefore preferably generates the video from a single image on which no noise is superimposed. That is, the video information generation unit 1012 preferably generates the first video information M1 by cutting out different portions from a single high-quality image included in the first image information I1.
FIG. 14 is a diagram for explaining an example of the positions of images cut out from low-quality images by the learning device according to the fifth embodiment. An example of the positions of images that the learning device 1010 cuts out from low-quality images will be described with reference to this figure. The learning device 1010 cuts out the image of a different frame from each of a plurality of low-quality images. FIGS. 14(A) to 14(E) show images I-21 to I-25, which are different images. The learning device 1010 cuts out the image of a different frame from each of the images I-21 to I-25.
The compositions of the images I-21 to I-25, which are low-quality images, are the same as that of the image I-11 shown in FIG. 13(A). That is, the ball B is captured at the same position in the images I-21 to I-25. The images I-21 to I-25 differ from the image I-11 in that mutually different noise is superimposed on them. Mutually different noise may be superimposed on the images I-21 to I-25, for example, by using different imaging conditions at the time of capture.
The video information generation unit 1012 cuts out a cut-out image CI-21 from the image I-21, a cut-out image CI-22 from the image I-22, a cut-out image CI-23 from the image I-23, a cut-out image CI-24 from the image I-24, and a cut-out image CI-25 from the image I-25. The cut-out images CI-21 to CI-25 are each shifted by a predetermined number of pixels in the vertical and horizontal directions. According to the second video information M2 generated by the video information generation unit 1012, image C-21 is displayed at a time t1, image C-22 at a time t2, image C-23 at a time t3, image C-24 at a time t4, and image C-25 at a time t5. Since different noise is superimposed on each of the cut-out images CI-21 to CI-25, different noise is also superimposed on the generated video at each point in time.
Here, the low-quality video (second video information M2) generated by the learning device 1010 is a low-image-quality video on which noise is superimposed. If a plurality of different positions are cut out from a single noise-superimposed image and made into a video, the same noise is included at every moment (in other words, the noise does not change over time), so the result may not be suitable as a low-quality video. Therefore, in the present embodiment, a low-quality video is generated by cutting out images from a plurality of different low-quality images. Each of the different low-quality images captures the same subject as the subject captured in the high-quality image. That is, the second image information I2 includes a plurality of images in which the same subject as the subject captured in the images included in the first image information I1 is captured, with mutually different noise superimposed on each. The plurality of images included in the second image information I2 may be images captured at different but close times. The video information generation unit 1012 generates the second video information M2 by cutting out a different portion from each of the plurality of images included in the second image information.
 Note that it is not necessary to prepare, for example, as many low-quality images as there are frames; images may be cut out a plurality of times from a plurality of images, as long as the same image is not used for consecutive frames. The order in which images are cut out from the plurality of images may be random.
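A sketch of how the low-quality frames might be assembled so that the noise changes from frame to frame, assuming the crop positions have already been computed (for example by the routine shown earlier) and match those used for the high-quality video. Reusing captures in a random order when there are fewer stills than frames follows the note above; the exact selection policy is an assumption.

```python
import random

def crop_noisy_frames(noisy_images, crop_positions, patch=256):
    """Build low-quality frames: each frame is cropped from a different noisy
    still so that the noise differs over time, while the crop positions match
    those used for the high-quality video."""
    frames = []
    last = None
    for n, (x, y) in enumerate(crop_positions):
        if n < len(noisy_images):
            idx = n                                   # one capture per frame when available
        else:
            choices = [i for i in range(len(noisy_images)) if i != last]
            idx = random.choice(choices)              # reuse captures, avoiding immediate repeats
        last = idx
        frames.append(noisy_images[idx][y:y + patch, x:x + patch].copy())
    return frames
```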
FIG. 15 is a diagram for explaining examples of the directions in which the learning device according to the fifth embodiment cuts out images. In the example described with reference to FIGS. 13 and 14, positions shifted by a predetermined number of pixels in both the vertical and horizontal directions are cut out. However, the video information generation unit 1012 may cut out positions shifted in other directions. Other examples of the directions in which the video information generation unit 1012 cuts out the cut-out images CI will be described with reference to FIGS. 15(A) to 15(C).
FIG. 15(A) shows an image I-31 and is an example in which positions shifted only in the horizontal direction are cut out. In this case, the video information generation unit 1012 fixes the y coordinate in the vertical direction and changes only the x coordinate in the horizontal direction, thereby cutting out the cut-out images CI at a plurality of different positions. By cutting out images in this way, it is possible to generate a video in which the subject moves horizontally. Similarly, the video information generation unit 1012 may cut out the cut-out images CI at positions shifted only in the vertical direction, which makes it possible to generate a video in which the subject moves vertically.
 As shown in FIGS. 13 and 14, the video information generation unit 1012 may also cut out the cut-out images CI at positions shifted in both the vertical and horizontal directions. In this case, the vertical shift amount and the horizontal shift amount may differ from each other.
FIG. 15(B) shows an image I-32 and is an example in which positions shifted in a rotational direction are cut out. In this case, the video information generation unit 1012 cuts out the cut-out images CI at a plurality of different positions by moving the cut-out position along an arc with rotation center O and radius r. In the example shown in the figure, the video information generation unit 1012 cuts out positions rotated counterclockwise. By cutting out images in this way, it is possible to generate a video in which the subject moves in a rotational direction. The position of the rotation center O and the size of the radius r may differ for each frame.
FIG. 15(C) shows an image I-33 and is an example in which the cut-out position is enlarged and reduced. In the present embodiment, the size of the cut-out image CI is preferably constant. The video information generation unit 1012 therefore enlarges or reduces the image I and cuts it out while keeping the size of the cut-out image CI. When the size of the cut-out image CI is fixed at 256 pixels x 256 pixels, the video information generation unit 1012 enlarges and reduces the image I so that the cut-out fits within that size. By cutting out images in this way, it is possible to generate a video in which the subject appears to be zoomed in or zoomed out.
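One possible way to realize such a zoom-style sequence, assuming OpenCV is available and the source image is large enough that the fixed 256x256 window always fits inside the rescaled image; the per-frame scale factors are illustrative.

```python
import cv2
import numpy as np

def zoom_crop_frames(image, scales, patch=256):
    """Zoom-in/out sequence: rescale the source image per frame and always
    crop a fixed-size window around the image centre."""
    frames = []
    for s in scales:                        # e.g. np.linspace(1.0, 1.5, num_frames)
        h, w = image.shape[:2]
        scaled = cv2.resize(image, (int(w * s), int(h * s)),
                            interpolation=cv2.INTER_LINEAR)
        cy, cx = scaled.shape[0] // 2, scaled.shape[1] // 2
        y, x = cy - patch // 2, cx - patch // 2
        frames.append(scaled[y:y + patch, x:x + patch].copy())
    return frames
```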
Note that the cut-out positions described with reference to FIGS. 15(A) to 15(C) are examples of the present embodiment, and the video information generation unit 1012 may generate video information by cutting out and connecting other different positions. For example, the video information generation unit 1012 may cut out the cut-out images CI by combining the cut-out methods described with reference to FIGS. 15(A) to 15(C). In this case, it is possible to generate a video in which, for example, horizontal or vertical movement is followed by rotational movement, or movement is followed by enlargement or reduction.
Note that the movement of the cut-out position as described above may be calculated by an affine transformation. In other words, the predetermined direction in which the video information generation unit 1012 cuts out images can be described as being calculated by an affine transformation.
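A sketch of describing the crop-window motion with an affine transform: here a pure rotation of the crop centre about a pivot point, which reproduces the arc-shaped motion of FIG. 15(B); translation and scaling terms could be added to the same matrix form. The function and parameter names are illustrative.

```python
import numpy as np

def affine_crop_centers(center0, pivot, num_frames, deg_per_frame):
    """Crop-window centres generated by an affine (here: pure rotation)
    transform about a pivot point O, as in the arc-shaped motion example."""
    c0 = np.array(center0, dtype=np.float64)
    p = np.array(pivot, dtype=np.float64)
    centers = []
    for n in range(num_frames):
        theta = np.deg2rad(n * deg_per_frame)
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        centers.append(tuple(rot @ (c0 - p) + p))  # rotate the centre about the pivot
    return centers  # feed these centres to the patch-cropping routine shown earlier
```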
Note that instead of changing the cut-out position as described above, the video information generation unit 1012 may generate a video by cutting out a portion of the image and then moving it. In this case, the video information generation unit 1012 cuts out an image of 256 pixels x 256 pixels, generates a plurality of images by moving the cut-out image in a predetermined direction, and generates a video by connecting them. That is, the video information generation unit 1012 may obtain a plurality of images at different positions by shifting the cut-out images in a predetermined direction.
 Note that moving the image after cutting it out produces regions around the image where no data exists. However, by defining the peripheral portion of the image in advance as a margin, those regions can be excluded from the range of the image to be learned so that no problem arises in the later learning stage.
In the description above, an example was described in which the video information generation unit 1012 generates a video by cutting out images moved in directions calculated by some method such as an affine transformation. In actual videos, however, the subject often does not move in such calculated directions and more often moves randomly. The learning device 1010 can therefore generate a video by cutting out images moved in directions based on the trajectory along which an object actually moves, and can thereby generate training data that is more effective for machine learning. An example of such a case will be described as a modification of the fifth embodiment with reference to FIGS. 16 and 17.
Here, in bright scenes such as under clear skies, it is common to increase the shutter speed in order to maintain the exposure. As a result, moving subjects lose their smoothness and the video becomes jerky. Similarly, when a video is created from a still image with a high sense of resolution, the result may be an unnatural, jerky video with little smoothness. For this reason, the video information generation unit 1012 may generate the video after correcting the still image from which the video is created by adding pseudo subject blur. As an example, subject blur may be added by performing a predetermined averaging process along the shift direction or by performing a process that lowers the resolution.
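One possible way to add the pseudo subject blur mentioned above is to average several copies of the frame displaced along the per-frame shift direction; the tap count and the use of np.roll are assumptions, not a method prescribed by the text.

```python
import numpy as np

def add_motion_blur_along_shift(frame, shift_xy, taps=5):
    """Approximate subject blur by averaging copies of the frame displaced
    along the per-frame shift direction (a simple box blur along the motion)."""
    dx, dy = shift_xy
    acc = np.zeros_like(frame, dtype=np.float32)
    for k in range(taps):
        t = k / max(taps - 1, 1)                  # 0 .. 1 along the shift
        ox, oy = int(round(t * dx)), int(round(t * dy))
        acc += np.roll(np.roll(frame, oy, axis=0), ox, axis=1).astype(np.float32)
    return (acc / taps).astype(frame.dtype)
```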
FIG. 16 is a diagram showing an example of the functional configuration of the learning device according to a modification of the fifth embodiment, in which the learning device generates a video based on trajectory vectors. An example of the functional configuration of a learning device 1010A according to the modification of the fifth embodiment will be described with reference to this figure. A learning system 1001A according to the modification of the fifth embodiment differs from the learning system 1001 in that it further includes a trajectory vector generation device 1050. The learning device 1010A differs from the learning device 1010 in that it further includes a trajectory vector acquisition unit 1014 and in that it includes a video information generation unit 1012A instead of the video information generation unit 1012. In the description of the learning device 1010A, configurations similar to those of the learning device 1010 are given the same reference numerals and their description may be omitted.
The trajectory vector generation device 1050 acquires information about the trajectory of an object captured in a video. Video information is input to the trajectory vector generation device 1050, and the trajectory vector generation device 1050 analyzes the trajectory of the object captured in the input video information and outputs the analysis result as a trajectory vector TV. The trajectory vector TV indicates the trajectory of the object captured in the video information. The trajectory vector generation device 1050 acquires the trajectory vector TV from the video information using a conventional technique such as optical flow.
 Note that the trajectory vector TV may include, in addition to or instead of the vector information, coordinate information indicating the trajectory along which the object moved.
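The trajectory vector is said to be obtained with conventional techniques such as optical flow. The sketch below uses OpenCV's Farneback dense optical flow as one such technique and averages the flow field into a single per-frame (dx, dy) vector; reducing the flow field to one vector per frame is an assumption about how a trajectory vector TV might be derived.

```python
import cv2
import numpy as np

def trajectory_vectors(gray_frames):
    """Estimate a per-frame motion vector from a video using dense optical
    flow (Farneback), then average the flow field into one (dx, dy) vector."""
    vectors = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        vectors.append(flow.reshape(-1, 2).mean(axis=0))  # mean (dx, dy) per frame
    return np.array(vectors)  # plays the role of the trajectory vector TV
```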
The trajectory vector acquisition unit 1014 acquires the trajectory vector TV from the trajectory vector generation device 1050 and outputs it to the video information generation unit 1012A. Note that the video from which the trajectory vector TV was acquired by the trajectory vector generation device 1050 and the image acquired by the image acquisition unit 1011 may have a predetermined relationship. In this case, for example, the image acquisition unit 1011 may acquire, as the image, one frame of the video from which the trajectory vector TV was acquired by the trajectory vector generation device 1050.
 However, the present embodiment is not limited to this example, and the video from which the trajectory vector TV was acquired by the trajectory vector generation device 1050 and the image acquired by the image acquisition unit 1011 may have no particular relationship.
The video information generation unit 1012A acquires the image information I from the image acquisition unit 1011 and the trajectory vector TV from the trajectory vector acquisition unit 1014, and generates video information based on them. The video information generation unit 1012A determines the cut-out direction of the cut-out images CI and the shift amount per frame based on the trajectory indicated by the trajectory vector TV. That is, the predetermined direction in which the video information generation unit 1012A cuts out images is calculated based on the acquired trajectory vector TV.
FIG. 17 is a diagram for explaining an example of the positions of images cut out from a still image when the learning device according to the modification of the fifth embodiment generates a video based on a trajectory vector. An example of the position coordinates of the cut-out images CI when a video is generated based on the trajectory vector TV will be described with reference to this figure. FIG. 17(A) shows an image I-41, which is an example of an image included in the first image information I1. FIG. 17(B) shows an example of a plurality of cut-out images CI cut out from the image I-41.
As shown in FIG. 17(A), the image I-41 shows the trajectory vector TV, which is the trajectory of the ball B, the subject. The trajectory vector TV represents a trajectory in which the ball B falls from the upper right of the figure toward the lower center, bounces at the lower center, and then moves toward the upper left of the figure. The video information generation unit 1012A cuts out the cut-out images CI at position coordinates based on the trajectory vector TV shown in the image I-41 and connects the cut-out images in time, thereby generating a video from the still image I-41.
FIG. 17(B) shows examples of the cut-out images CI cut out by the video information generation unit 1012. Specifically, cut-out images CI-41 to CI-49 are shown as examples of the images cut out by the video information generation unit 1012. The cut-out images CI-41 to CI-49 are located at coordinates based on the trajectory vector TV. That is, the cut-out image CI-41 is located toward the upper right of the figure, and the cut-out position moves toward the lower center of the figure up to the cut-out image CI-45. The cut-out position then moves toward the upper left of the figure from the cut-out image CI-45 to the cut-out image CI-49.
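A sketch of turning the trajectory vector TV into concrete crop coordinates: the per-frame motion vectors are accumulated from a starting position and clamped so the fixed-size window stays inside the image. The clamping and rounding details are assumptions.

```python
import numpy as np

def crop_positions_from_trajectory(start_xy, trajectory, image_size, patch=256):
    """Convert per-frame motion vectors (the trajectory vector TV) into
    top-left crop coordinates, clamped so the patch stays inside the image."""
    h, w = image_size
    pos = np.asarray(start_xy, dtype=np.float64)
    steps = np.vstack([[0.0, 0.0], np.asarray(trajectory, dtype=np.float64)])
    positions = []
    for step in steps:
        pos = pos + step
        x = int(np.clip(np.round(pos[0]), 0, w - patch))
        y = int(np.clip(np.round(pos[1]), 0, h - patch))
        positions.append((x, y))
    return positions  # one (x, y) crop position per frame
```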
FIG. 18 is a flowchart showing an example of a series of operations of the learning method for a noise reduction device according to the fifth embodiment. An example of a series of operations of the learning method for a noise reduction device using the learning device 1010 will be described with reference to this figure.
(Step S110) First, the image acquisition unit 1011 acquires images. The image acquisition unit 1011 acquires the first image information I1 including a high-quality image and the second image information I2 including low-quality images. The step of acquiring images by the image acquisition unit 1011 may be referred to as an image acquisition step or an image acquisition process.
(Step S130) Next, the video information generation unit 1012 cuts out portions of the acquired images. The video information generation unit 1012 cuts out a plurality of cut-out images CI from each of the high-quality image included in the first image information I1 and the low-quality images included in the second image information I2. The position coordinates cut out from the high-quality image included in the first image information I1 and from the low-quality images included in the second image information I2 are preferably the same. However, if there is a time difference between the timing at which the high-quality image included in the first image information I1 was acquired and the timing at which the low-quality images included in the second image information I2 were acquired, the subject in the cut-out images may be displaced because of that time difference. In such a case, the position coordinates to be cut out from the high-quality image included in the first image information I1 and from the low-quality images included in the second image information I2 are preferably determined in consideration of the displacement caused by the time difference. More specifically, it is preferable to change the position coordinates cut out from the high-quality image included in the first image information I1 or from the low-quality images included in the second image information I2 in a direction that reduces the amount of displacement caused by the time difference.
(Step S150) Next, the video information generation unit 1012 connects the cutout images to generate videos. The video information generation unit 1012 generates a high-quality video by connecting the plurality of images cut out from the high-quality image, and generates a low-quality video by connecting the plurality of images cut out from the low-quality image. The steps of generating video information in step S130 and step S150 may be referred to as a video information generation step or a video information generation process.
(Step S170) Finally, the learning unit 1013 uses the combination of the generated high-quality video and low-quality video as teacher data TD and trains the model to infer a high-quality video from a low-quality video. This step may be referred to as a learning step or a learning process.
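A compact Python sketch of steps S110 to S170 follows. It is an assumption-based illustration only: the patch size, the helper crop_positions_along_trajectory from the earlier sketch, and the train_denoiser call are hypothetical and not part of the described embodiment.

    import numpy as np

    def build_teacher_data(high_img, low_imgs, trajectory, num_frames=8, size=64):
        # Step S130: cut out patches at trajectory-based positions
        origins = crop_positions_along_trajectory(trajectory, num_frames)
        high_patches = [high_img[y:y + size, x:x + size] for x, y in origins]
        # using a different low-quality still per frame keeps the noise
        # realization different from frame to frame
        low_patches = [low_imgs[i % len(low_imgs)][y:y + size, x:x + size]
                       for i, (x, y) in enumerate(origins)]
        # Step S150: connect the cutouts into videos (frame axis first)
        high_video = np.stack(high_patches)   # corresponds to first video information M1
        low_video = np.stack(low_patches)     # corresponds to second video information M2
        return high_video, low_video

    # Step S170: train on the (low, high) pair, for example
    # model = train_denoiser(inputs=low_video, targets=high_video)  # hypothetical trainer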
[Summary of the fifth embodiment]
According to the embodiment described above, the learning device 1010 includes the image acquisition unit 1011 and thereby acquires the first image information I1 and the second image information I2. The first image information I1 includes at least one image, and the second image information I2 includes at least one image in which the same subject as the subject captured in the image included in the first image information I1 is captured at lower image quality than the image included in the first image information I1. The learning device 1010 also includes the video information generation unit 1012, which cuts out a plurality of images at different positions from part of the first image information I1 and combines the plurality of cutout images to generate the first video information M1. Similarly, the video information generation unit 1012 cuts out a plurality of images at different positions from part of the second image information I2 and combines the plurality of cutout images to generate the second video information M2. The learning device 1010 further includes the learning unit 1013, which trains the model to infer a high-quality video from a low-quality video based on teacher data TD that includes the first video information M1 and the second video information M2 generated by the video information generation unit 1012. That is, according to the present embodiment, the learning device 1010 can generate teacher data containing a low-quality video and a high-quality video from still images, without having to capture videos as was conventionally required. Therefore, according to the present embodiment, teacher data for inferring a high-quality video from a low-quality video can be generated easily.
According to the present embodiment, the learning device 1010 can also generate a plurality of different videos from the same still image. Therefore, generating a large amount of teacher data TD does not require preparing a correspondingly large number of still images; many videos can be generated from a small number of still images. Accordingly, the present embodiment can shorten the time required to capture images for use in learning.
According to the embodiment described above, the second image information I2 includes a plurality of images in which the same subject as the subject captured in the image included in the first image information I1 is captured, each with different noise superimposed. The video information generation unit 1012 generates the second video information M2 by cutting out a different part from each of the plurality of images included in the second image information I2. That is, according to the present embodiment, a noise-superimposed low-quality video is generated from a plurality of different noise-superimposed low-quality images. Therefore, the second video information M2 generated according to the present embodiment has different noise superimposed on each frame, and a noise-superimposed low-quality video can be reproduced and generated more faithfully.
According to the embodiment described above, the plurality of images included in the second image information I2 are images captured at different, closely spaced times. That is, the low-quality images for generating the low-quality video are captured at close times; a close time may be, for example, one sixtieth of a second. Here, unlike a still image, a video may have video-specific noise with a temporal component superimposed on it, and images captured at different, closely spaced times contain this video-specific noise. Therefore, according to the present embodiment, since the learning device 1010 generates a video based on images captured at different, closely spaced times, video-specific noise having a temporal component can be reproduced in the generated video.
According to the embodiment described above, the video information generation unit 1012 generates the first video information M1 by cutting out different parts from a single image included in the first image information I1. That is, according to the present embodiment, the high-quality video is generated from a single image. Therefore, according to the present embodiment, a high-quality video can be generated easily without capturing many high-quality images.
According to the embodiment described above, the video information generation unit 1012 cuts out a plurality of images at different positions by shifting the cutout positions by different amounts in a predetermined direction. That is, according to the present embodiment, the learning device 1010 cuts out an image and then shifts it in a predetermined direction. In other words, after cutting out the images, the learning device 1010 does not need to process the large source image; it performs processing based on the small cutout images. Therefore, according to the present embodiment, the learning device 1010 can lighten the processing load.
According to the embodiment described above, the video information generation unit 1012 cuts out a plurality of images at positions shifted by a predetermined number of bits in a predetermined direction, and generates a video by connecting the cutout images. That is, the subject captured in the video generated by the video information generation unit 1012 appears to move in the predetermined direction within the video. Therefore, according to the present embodiment, a video can be generated easily from a still image.
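As an illustration only, the following Python sketch produces cutout origins that advance by a fixed step per frame in one direction, so that the subject appears to move across the generated clip; the step size, the normalization of the direction vector, and the function name are assumptions.

    import numpy as np

    def shifted_cutout_origins(start, direction, step, num_frames):
        # start: (x, y) origin of the first cutout
        # direction: vector giving the predetermined direction of motion
        # step: shift amount applied per frame
        start = np.asarray(start, dtype=float)
        direction = np.asarray(direction, dtype=float)
        direction = direction / np.linalg.norm(direction)
        frames = np.arange(num_frames)[:, None]
        return (start + frames * step * direction).round().astype(int)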
According to the embodiment described above, the predetermined direction in which the video information generation unit 1012 cuts out images is calculated by an affine transformation. The predetermined direction in which the video information generation unit 1012 cuts out images is, in other words, the direction in which the subject moves in the video. Therefore, according to the present embodiment, the learning device 1010 can generate videos in which the subject moves in various directions.
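One way such a direction could be obtained from an affine transformation is sketched below in Python; the particular rotation-plus-scaling matrix, the angle, and the idea of transforming a set of base cutout origins are assumptions used only to illustrate applying an affine map, and do not reproduce the embodiment's actual calculation.

    import numpy as np

    def affine_transform_origins(origins, angle_deg=15.0, scale=1.0, translation=(0.0, 0.0)):
        # Apply a 2D affine map (rotation, scaling, translation) to cutout
        # origins so that the apparent motion direction of the subject changes.
        origins = np.asarray(origins, dtype=float)   # (N, 2) array of (x, y)
        theta = np.deg2rad(angle_deg)
        a = scale * np.array([[np.cos(theta), -np.sin(theta)],
                              [np.sin(theta),  np.cos(theta)]])
        t = np.asarray(translation, dtype=float)
        return (origins @ a.T + t).round().astype(int)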
According to the embodiment described above, the learning device 1010 further includes the trajectory vector acquisition unit 1014 and thereby acquires the trajectory vector TV. The predetermined direction in which the video information generation unit 1012 cuts out images is calculated based on the acquired trajectory vector TV. The trajectory vector TV is information about a vector indicating the trajectory along which a subject actually moves in an actually captured video. Therefore, according to the present embodiment, a video can be generated based on the trajectory along which a subject actually moves.
[Sixth embodiment]
Next, a sixth embodiment will be described with reference to FIGS. 19 and 20. Whereas the fifth embodiment required both high-quality images and low-quality images to create the teacher data TD, the sixth embodiment differs from the fifth embodiment in that it requires only high-quality images.
FIG. 19 is a diagram for explaining an overview of the learning system according to the sixth embodiment. An overview of a learning system 1001B according to the sixth embodiment will be described with reference to this figure. In the description of the figure, components similar to those of the fifth embodiment are given the same reference signs and their description may be omitted. In the sixth embodiment, the imaging device 1020 captures a high-quality image 1031. A low-quality image 1032 is generated from the high-quality image 1031 by the learning device 1010B according to the sixth embodiment; for example, the low-quality image 1032 is generated by applying image processing to the high-quality image 1031 to superimpose noise on it. That is, according to the present embodiment, the imaging device 1020 captures only the high-quality image 1031 and does not need to capture the low-quality image 1032.
FIG. 20 is a diagram showing an example of the functional configuration of the video information generation unit according to the sixth embodiment. The video information generation unit 1012B included in the learning device 1010B will be described with reference to this figure. The learning device 1010B according to the sixth embodiment differs from the learning device 1010 in that it includes a video information generation unit 1012B instead of the video information generation unit 1012. The video information generation unit 1012B includes a cutout unit 1121, a noise superimposition unit 1123, a first video information generation unit 1125, and a second video information generation unit 1127.
The cutout unit 1121 acquires an image from the image acquisition unit 1011. In the present embodiment, since the learning device 1010B acquires a high-quality image from the imaging device 1020, the cutout unit 1121 acquires the high-quality image from the image acquisition unit 1011. The cutout unit 1121 cuts out a plurality of cutout images CI, which are parts of the acquired high-quality image at different position coordinates, and outputs the cutout images CI to the first video information generation unit 1125 and the noise superimposition unit 1123.
The noise superimposition unit 1123 acquires the cutout images CI cut out by the cutout unit 1121 and superimposes noise on them. The noise superimposition unit 1123 acquires the plurality of cutout images CI cut out at the plurality of position coordinates and superimposes noise on each of the acquired cutout images CI. The noise superimposed by the noise superimposition unit 1123 may be modeled in advance. Examples of modeled noise include shot noise due to fluctuations in the number of photons, noise generated when light incident on the image sensor is converted into electrons, noise generated when the converted electrons are converted into analog voltage values, and noise generated when the converted analog voltage values are converted into digital signals. The intensity of the superimposed noise may be adjusted by a predetermined method. It is preferable that the noise superimposition unit 1123 superimposes different noise on each of the plurality of cutout images CI. The noise superimposition unit 1123 outputs the noise-superimposed images to the second video information generation unit 1127 as noise images NI.
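Sensor noise of the kinds listed above is often approximated with a Poisson-Gaussian model. The sketch below (Python, numpy) is such an approximation and not the embodiment's actual noise model: shot noise is drawn from a Poisson distribution on an electron-count estimate, read noise is added as Gaussian noise, and the result is quantized to emulate analog-to-digital conversion. The gain, read_sigma, and bit-depth values are placeholder assumptions whose strength could be adjusted as described.

    import numpy as np

    def superimpose_noise(clean, gain=4.0, read_sigma=2.0, bits=10, rng=None):
        # clean: cutout image CI with pixel values normalized to [0, 1]
        rng = np.random.default_rng() if rng is None else rng
        full_scale = 2 ** bits - 1
        electrons = clean * full_scale / gain
        shot = rng.poisson(electrons) * gain              # photon shot noise
        read = rng.normal(0.0, read_sigma, clean.shape)   # readout / conversion noise
        noisy = np.clip(np.round(shot + read), 0, full_scale)  # A/D quantization
        return noisy / full_scale                         # noise image NI in [0, 1]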
The first video information generation unit 1125 acquires the plurality of cutout images CI from the cutout unit 1121, combines the plurality of cutout images to generate the first video information M1, and outputs the generated first video information M1 to the learning unit 1013.
The second video information generation unit 1127 acquires the plurality of noise images NI from the noise superimposition unit 1123, combines the plurality of noise-superimposed noise images NI to generate the second video information M2, and outputs the generated second video information M2 to the learning unit 1013.
The learning unit 1013 acquires the first video information M1 from the first video information generation unit 1125 and the second video information M2 from the second video information generation unit 1127, and trains the learning model 1040 based on the first video information M1 and the second video information M2 generated by the video information generation unit 1012B.
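Putting the cutout unit, the noise superimposition unit, and the two video information generation units together, a minimal Python sketch of how a pair corresponding to (first video information M1, second video information M2) could be produced from a single high-quality image is shown below; it reuses the helper functions shifted_cutout_origins and superimpose_noise sketched earlier, which are illustrative assumptions rather than the actual units 1121, 1123, 1125, and 1127.

    import numpy as np

    def make_video_pair(high_img, start, direction, step, num_frames=8, size=64, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        origins = shifted_cutout_origins(start, direction, step, num_frames)
        clean_patches = [high_img[y:y + size, x:x + size] for x, y in origins]
        # a fresh noise realization for every frame, as the embodiment prefers
        noisy_patches = [superimpose_noise(p, rng=rng) for p in clean_patches]
        m1 = np.stack(clean_patches)   # high-quality video (first video information M1)
        m2 = np.stack(noisy_patches)   # low-quality video (second video information M2)
        return m1, m2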
[Summary of the sixth embodiment]
According to the embodiment described above, the learning device 1010B includes the image acquisition unit 1011 and thereby acquires image information I including at least one high-quality image. The learning device 1010B also includes the video information generation unit 1012B and thereby generates both a high-quality video and a low-quality video from the high-quality image. The video information generation unit 1012B includes the cutout unit 1121, which cuts out a plurality of images at different positions from part of the acquired image information I, and the noise superimposition unit 1123, which superimposes noise on each of the plurality of images cut out by the cutout unit 1121. The video information generation unit 1012B further includes the first video information generation unit 1125, which combines the plurality of images cut out by the cutout unit 1121 to generate the first video information M1, a high-quality video, and the second video information generation unit 1127, which combines the plurality of noise-superimposed images to generate the second video information M2. The learning device 1010B also includes the learning unit 1013, which trains the model to infer a high-quality video from a low-quality video based on teacher data TD including the first video information M1 generated by the first video information generation unit 1125 and the second video information M2 generated by the second video information generation unit 1127. That is, according to the learning device 1010B, a high-quality video and a low-quality video are generated from a single high-quality image, and the learning model 1040 is trained to infer the high-quality video from the low-quality video. Inferring a high-quality video from a low-quality video is, in other words, noise removal. Therefore, according to the present embodiment, a noise removal model can be trained easily without spending time acquiring the teacher data TD.
In the sixth embodiment, a high-quality video is generated from a high-quality image, a low-quality image is generated by superimposing noise on the high-quality image, and a low-quality video is generated based on the generated low-quality image. However, the present embodiment is not limited to this example. For example, as a modification of the present embodiment, the learning device 1010 may create the teacher data TD based only on low-quality images. That is, a low-quality video may be generated from a low-quality image, a high-quality image may be generated by removing noise from the low-quality video, and a high-quality video may be generated based on the generated high-quality image. The number of images used to generate a video may be one or more than one.
The learning device 1010 and the learning device 1010A described in the fifth embodiment and the learning device 1010B described in the sixth embodiment have been described as examples used to train the learning model 1040 that infers a high-quality video from a low-quality video, but the present invention is not limited to these examples. For example, the learning model 1040 may be configured to have a function of detecting a specific subject such as a person in the high-quality video after inferring the high-quality video from the low-quality video, or may be configured to have a function of recognizing characters on signs, signboards, and the like in the high-quality video. That is, the high-quality video inferred by the learning model 1040 is not limited to a video for viewing, and may be used for purposes such as object detection.
Conventionally, to improve the generalization performance of a learning model, it has been preferable to include as many of the expected scenes as possible in the teacher data. In other words, ideal teacher data is a video that includes as many of the expected subject movements as possible. On the other hand, acquiring such teacher data by actual shooting is difficult and requires enormous cost and time. By using the present embodiment to train a learning model, the cost and time required to collect teacher data can be greatly reduced, and the generalization performance of the learning model can be improved.
Although the modes for carrying out the present invention have been described above using embodiments, the present invention is not limited to these embodiments in any way, and various modifications and substitutions can be made without departing from the spirit of the present invention.
DESCRIPTION OF REFERENCE SIGNS 1: high-quality video generation system, 2: image processing device, 10: input information generation device, 11: image acquisition unit, 12: input conversion unit, 13: synthesis unit, 14: output unit, 15: integration unit, 16: average value temporary storage unit, 17: imaging condition acquisition unit, 18: adjustment unit, 19: comparison unit, 100: imaging device, 200: CNN, 210: input layer, 220: convolution layer, 230: pooling layer, 240: output layer, IM: video information, IN: input information, TF: target frame, AF: adjacent frame, IF: integrated frame, M1: first memory, M2: second memory, IMD: image information, IND: input data, CD: composite data, SV: stored value, CV: calculated value, CR: comparison result, 1001: learning system, 1010: learning device, 1011: image acquisition unit, 1012: video information generation unit, 1013: learning unit, 1014: trajectory vector acquisition unit, 1020: imaging device, 1031: high-quality image, 1032: low-quality image, 1033: high-quality video, 1034: low-quality video, 1040: learning model, 1050: trajectory vector generation device, TD: teacher data, I: image information, I1: first image information, I2: second image information, M: video information, M1: first video information, M2: second video information, TV: trajectory vector, 1121: cutout unit, 1123: noise superimposition unit, 1125: first video information generation unit, 1127: second video information generation unit

Claims (25)

  1.  An input information generation device comprising:
      an image acquisition unit that acquires, as input images, a plurality of frames including at least a target frame for which input information is to be generated, among frames constituting a video;
      an input conversion unit that converts pixel values of the acquired plurality of frames of the input images into input data having a smaller number of bits than the number of bits representing the pixel values of the input images;
      a synthesis unit that synthesizes the plurality of pieces of converted input data into one piece of composite data; and
      an output unit that outputs the synthesized composite data.
  2.  The input information generation device according to claim 1, wherein
      the video is a color video,
      the image acquisition unit acquires the pixel values of each color from one frame as a plurality of different images, and
      the input conversion unit converts each of the plurality of images acquired from the one frame into the input data.
  3.  The input information generation device according to claim 1, wherein the image acquisition unit acquires, among the frames constituting the video, images of a plurality of frames successively adjacent to the target frame both before and after the target frame.
  4.  The input information generation device according to claim 1, wherein, after the output unit outputs the composite data for the target frame, the image acquisition unit takes a frame adjacent to that target frame as a new target frame and acquires, as the input images, a plurality of frames including at least the new target frame.
  5.  The input information generation device according to claim 1, further comprising an integration unit that integrates, among the plurality of frames acquired by the image acquisition unit, a plurality of adjacent frames, which are frames other than the target frame, into one integrated frame by performing an operation based on pixel values of the adjacent frames, wherein
      the input conversion unit converts the pixel values of the target frame into the input data and further converts the pixel values of the integrated frame into the input data, and
      the synthesis unit synthesizes the plurality of pieces of input data converted based on the target frame and the plurality of pieces of input data converted based on the integrated frame into one piece of composite data.
  6.  The input information generation device according to claim 5, wherein the integration unit sets an average value of the pixel values of the plurality of adjacent frames as the pixel value of the integrated frame.
  7.  The input information generation device according to claim 6, wherein the integration unit calculates the pixel value of the integrated frame by calculating a weighted average of the plurality of adjacent frames according to their temporal distance from the target frame.
  8.  The input information generation device according to claim 6, wherein the integration unit excludes, among the frames constituting the video, frames with a large luminance change from the calculation of the average value.
  9.  The input information generation device according to claim 5, further comprising an average value temporary storage unit that stores an average value of pixel values of predetermined frames among the frames constituting the video, wherein the integration unit calculates the pixel value of the integrated frame by an operation based on the value stored in the average value temporary storage unit and the target frame.
  10.  The input information generation device according to claim 9, further comprising:
      an imaging condition acquisition unit that acquires an imaging condition of the video; and
      an adjustment unit that adjusts the average value stored in the average value temporary storage unit according to the acquired imaging condition.
  11.  The input information generation device according to claim 9, further comprising a comparison unit that compares the value stored in the average value temporary storage unit with the pixel value of the target frame, wherein, when the difference resulting from the comparison by the comparison unit is equal to or less than a predetermined value, the integration unit calculates the pixel value of the integrated frame by calculating a moving average based on the value stored in the average value temporary storage unit and the target frame, and when the difference is not equal to or less than the predetermined value, the integration unit sets the pixel value of the target frame as the pixel value of the integrated frame.
  12.  The input information generation device according to claim 5, wherein the integration unit sets a randomly selected frame among the adjacent frames acquired by the image acquisition unit as the integrated frame.
  13.  An image processing device comprising:
      the input information generation device according to any one of claims 1 to 12; and
      a convolutional neural network that takes, as input information, the composite data output by the input information generation device.
  14.  An input information generation method comprising:
      an image acquisition step of acquiring, as input images, a plurality of frames including at least a target frame for which input information is to be generated, among frames constituting a video;
      an input conversion step of converting pixel values of the acquired plurality of frames of the input images into input data having a smaller number of bits than the number of bits representing the pixel values of the input images;
      a synthesis step of synthesizing the plurality of pieces of converted input data into one piece of composite data; and
      an output step of outputting the synthesized composite data.
  15.  A learning device comprising:
      an image acquisition unit that acquires first image information including at least one image, and second image information including at least one image in which the same subject as the subject captured in the image included in the first image information is captured at lower image quality than the image included in the first image information;
      a video information generation unit that cuts out a plurality of images at different positions from part of the acquired first image information and combines the plurality of cutout images to generate first video information, and cuts out a plurality of images at different positions from part of the acquired second image information and combines the plurality of cutout images to generate second video information; and
      a learning unit that performs training to infer a high-quality video from a low-quality video based on teacher data including the first video information and the second video information generated by the video information generation unit.
  16.  The learning device according to claim 15, wherein
      the second image information includes a plurality of images in which the same subject as the subject captured in the image included in the first image information is captured, each with different noise superimposed, and
      the video information generation unit generates the second video information by cutting out a different part from each of the plurality of images included in the second image information.
  17.  The learning device according to claim 16, wherein the plurality of images included in the second image information are images captured at different, closely spaced times.
  18.  The learning device according to claim 15 or 16, wherein the video information generation unit generates the first video information by cutting out different parts from one image included in the first image information.
  19.  The learning device according to claim 15 or 16, wherein the video information generation unit cuts out a plurality of images at different positions by shifting the cutout images in a predetermined direction.
  20.  The learning device according to claim 15 or 16, wherein the video information generation unit cuts out a plurality of images at positions shifted by a predetermined number of bits in a predetermined direction.
  21.  The learning device according to claim 19, wherein the predetermined direction in which the video information generation unit cuts out images is calculated by an affine transformation.
  22.  The learning device according to claim 19, further comprising a trajectory vector acquisition unit that acquires a trajectory vector, wherein the predetermined direction in which the video information generation unit cuts out images is calculated based on the acquired trajectory vector.
  23.  A learning device comprising:
      an image acquisition unit that acquires image information including at least one image;
      a cutout unit that cuts out a plurality of images at different positions from part of the acquired image information;
      a first video information generation unit that combines the plurality of cutout images to generate first video information;
      a noise superimposition unit that superimposes noise on each of the plurality of images cut out by the cutout unit;
      a second video information generation unit that combines the plurality of images on which noise has been superimposed by the noise superimposition unit to generate second video information; and
      a learning unit that performs training to infer a high-quality video from a low-quality video based on teacher data including the first video information generated by the first video information generation unit and the second video information generated by the second video information generation unit.
  24.  A program causing a computer to execute:
      an image acquisition step of acquiring first image information including at least one image, and second image information including at least one image in which the same subject as the subject captured in the image included in the first image information is captured at lower image quality than the image included in the first image information;
      a video information generation step of cutting out a plurality of images at different positions from part of the acquired first image information and combining the plurality of cutout images to generate first video information, and cutting out a plurality of images at different positions from part of the acquired second image information and combining the plurality of cutout images to generate second video information; and
      a learning step of performing training to infer a high-quality video from a low-quality video based on teacher data including the first video information and the second video information generated in the video information generation step.
  25.  A learning method for a noise reduction device, comprising:
      an image acquisition step of acquiring first image information including at least one image, and second image information including at least one image in which the same subject as the subject captured in the image included in the first image information is captured at lower image quality than the image included in the first image information;
      a video information generation step of cutting out a plurality of images at different positions from part of the acquired first image information and combining the plurality of cutout images to generate first video information, and cutting out a plurality of images at different positions from part of the acquired second image information and combining the plurality of cutout images to generate second video information; and
      a learning step of performing training to infer a high-quality video from a low-quality video based on teacher data including the first video information and the second video information generated in the video information generation step.
PCT/JP2023/021204 2022-08-31 2023-06-07 Input information generation device, image processing device, input information generation method, learning device, program, and learning method for noise reduction device WO2024047994A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2022-137843 2022-08-31
JP2022137834A JP2024033913A (en) 2022-08-31 2022-08-31 Input information generation device, image processing device, and input information generation method
JP2022-137834 2022-08-31
JP2022137843A JP2024033920A (en) 2022-08-31 2022-08-31 Learning device, program, and learning method for noise reduction device

Publications (1)

Publication Number Publication Date
WO2024047994A1 true WO2024047994A1 (en) 2024-03-07

Family

ID=90099234

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/021204 WO2024047994A1 (en) 2022-08-31 2023-06-07 Input information generation device, image processing device, input information generation method, learning device, program, and learning method for noise reduction device

Country Status (1)

Country Link
WO (1) WO2024047994A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011171843A (en) * 2010-02-16 2011-09-01 Fujifilm Corp Image processing method and apparatus, and program
JP2019106059A (en) * 2017-12-13 2019-06-27 日立オートモティブシステムズ株式会社 Calculation system, server, and on-vehicle device
JP2020010331A (en) * 2018-07-03 2020-01-16 株式会社ユビタス Method for improving image quality
CN113055674A (en) * 2021-03-24 2021-06-29 电子科技大学 Compressed video quality enhancement method based on two-stage multi-frame cooperation


Similar Documents

Publication Publication Date Title
CN110728648B (en) Image fusion method and device, electronic equipment and readable storage medium
US8189960B2 (en) Image processing apparatus, image processing method, program and recording medium
US10021313B1 (en) Image adjustment techniques for multiple-frame images
US8077214B2 (en) Signal processing apparatus, signal processing method, program and recording medium
JP4898761B2 (en) Apparatus and method for correcting image blur of digital image using object tracking
KR101663227B1 (en) Method and apparatus for processing image
US20140218550A1 (en) Image capturing device and image processing method thereof
CN105141841B (en) Picture pick-up device and its method
CN104935826A (en) Method for processing a video sequence, corresponding device, computer program and non-transitory computer-readable medium
US11184553B1 (en) Image signal processing in multi-camera system
CN113170158B (en) Video encoder and encoding method
JP4916378B2 (en) Imaging apparatus, image processing apparatus, image file, and gradation correction method
CN116711317A (en) High dynamic range technique selection for image processing
JP2010220207A (en) Image processing apparatus and image processing program
US20180197282A1 (en) Method and device for producing a digital image
WO2020090176A1 (en) Image processing device and image processing method
JP4879363B1 (en) Image processing system
Jeelani et al. Expanding synthetic real-world degradations for blind video super resolution
WO2024047994A1 (en) Input information generation device, image processing device, input information generation method, learning device, program, and learning method for noise reduction device
CN112750092A (en) Training data acquisition method, image quality enhancement model and method and electronic equipment
EP4167134A1 (en) System and method for maximizing inference accuracy using recaptured datasets
CN109218602B (en) Image acquisition device, image processing method and electronic device
JP5202277B2 (en) Imaging device
JP2024033913A (en) Input information generation device, image processing device, and input information generation method
JP4462017B2 (en) Defect detection and correction apparatus, imaging apparatus, and defect detection and correction method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23859764

Country of ref document: EP

Kind code of ref document: A1