WO2024047994A1 - Input information generation device, image processing device, input information generation method, learning device, program, and learning method for noise reduction device


Info

Publication number
WO2024047994A1
WO2024047994A1 (PCT/JP2023/021204)
Authority
WO
WIPO (PCT)
Prior art keywords
image
video
input
unit
images
Prior art date
Application number
PCT/JP2023/021204
Other languages
French (fr)
Japanese (ja)
Inventor
宏 能地
ピヤワト スワンウイタヤ
Original Assignee
LeapMind株式会社
Priority date
Filing date
Publication date
Priority claimed from JP2022137834A external-priority patent/JP2024033913A/en
Priority claimed from JP2022137843A external-priority patent/JP2024033920A/en
Application filed by LeapMind株式会社
Publication of WO2024047994A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/72 Data preparation, e.g. statistical preprocessing of image or video features
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/14 Picture signal circuitry for video frequency region
    • H04N5/21 Circuitry for suppressing or minimising disturbance, e.g. moiré or halo

Definitions

  • the present invention relates to an input information generation device, an image processing device, an input information generation method, a learning device, a program, and a learning method for a noise reduction device.
  • This application claims priority based on Japanese Patent Application No. 2022-137834 filed in Japan on August 31, 2022, and Japanese Patent Application No. 2022-137843 filed in Japan on August 31, 2022, the entire contents of which are incorporated herein by reference.
  • When capturing an image with an imaging device, if the amount of surrounding light is not sufficient or if settings of the imaging device such as shutter speed, aperture, or ISO sensitivity are inappropriate, the captured image may be of low quality.
  • the present invention aims to provide a technology that can convert a low-quality video to a high-quality video using lightweight calculations.
  • One aspect of the present invention is an input information generation device including: an image acquisition unit that acquires, as input images, a plurality of frames including at least a target frame for which input information is to be generated, among frames constituting a moving image; an input conversion unit that converts pixel values of the input images into input data with a smaller number of bits than the number of bits representing the pixel values of the input images; a synthesis unit that combines the plurality of converted input data into one composite data; and an output unit that outputs the combined composite data.
  • One aspect of the present invention is the input information generation device according to (1) above, in which the moving image is a color moving image, the image acquisition unit acquires pixel values of each color from one frame as a plurality of different images, and the input conversion unit converts each of the plurality of images acquired from one frame into the input data.
  • One aspect of the present invention is the input information generation device according to (1) or (2) above, in which the image acquisition unit acquires, among the frames constituting the video, images of a plurality of frames consecutively adjacent before and after the target frame.
  • One aspect of the present invention is the input information generation device according to any one of (1) to (3) above, in which, after the output unit outputs the composite data regarding the target frame, the image acquisition unit sets a frame adjacent to the target frame as the target frame and acquires a plurality of frames including at least the target frame as the input images.
  • One aspect of the present invention is the input information generation device described above, further including an integrating unit that integrates a plurality of adjacent frames, which are frames other than the target frame among the plurality of frames acquired by the image acquisition unit, into one integrated frame by performing an operation based on the pixel values of the adjacent frames, in which the input conversion unit converts the pixel values of the target frame into the input data and further converts the pixel values of the integrated frame into the input data, and the combining unit combines the plurality of input data converted based on the target frame and the plurality of input data converted based on the integrated frame into one composite data.
  • One aspect of the present invention is the input information generation device according to (5) above, in which the integrating unit sets the average value of the pixel values of the plurality of adjacent frames as the pixel value of the integrated frame.
  • One aspect of the present invention is the input information generation device according to (6) above, in which the integrating unit calculates the pixel value of the integrated frame by calculating a weighted average of the plurality of adjacent frames according to the temporal distance from the target frame.
  • One aspect of the present invention is the input information generation device according to (6) above, in which the integrating unit excludes frames with large brightness changes, among the frames constituting the video, from the targets for calculating the average value.
  • the input information generation device includes an average value temporary storage unit that stores an average value of pixel values of a predetermined frame among the frames constituting the moving image. Furthermore, the integrating unit calculates the pixel value of the integrated frame by calculation based on the value stored in the average value temporary storage unit and the target frame.
  • In one aspect, the input information generation device further includes an imaging condition acquisition unit that acquires the imaging conditions of the moving image, and an adjustment unit that adjusts the average value stored in the average value temporary storage unit according to the acquired imaging conditions.
  • In one aspect, the input information generation device further includes a comparison unit that compares the value stored in the average value temporary storage unit with the pixel value of the target frame. When the comparison by the comparison unit shows that the difference is less than or equal to a predetermined value, the integrating unit calculates the pixel value of the integrated frame as a moving average based on the value stored in the average value temporary storage unit and the target frame, and when the difference is not less than or equal to the predetermined value, the integrating unit sets the pixel value of the target frame as the pixel value of the integrated frame.
  • One aspect of the present invention is the input information generation device according to (5) above, in which the integrating unit uses frames randomly selected from among the adjacent frames acquired by the image acquisition unit as the integrated frame.
  • One aspect of the present invention is an image processing apparatus including the input information generation device according to any one of (1) to (12) above, and a convolutional neural network that uses the composite data output by the input information generation device as input information.
  • One aspect of the present invention is an input information generation method including: an image acquisition step of acquiring, as input images, a plurality of frames including at least a target frame for which input information is to be generated, among frames constituting a moving image; an input conversion step of converting pixel values of the input images into input data with a smaller number of bits than the number of bits representing the pixel values of the input images; a synthesis step of combining the plurality of converted input data into one composite data; and an output step of outputting the combined composite data.
  • (A1) One aspect of the present invention is a learning device including: an image acquisition unit that acquires first image information including at least one image, and second image information that captures the same subject as the subject captured in the image included in the first image information and that includes at least one image of lower image quality than the image included in the first image information; a video information generation unit that cuts out a plurality of images at different positions that are part of the acquired first image information and combines the plurality of cut-out images to generate first video information, and cuts out a plurality of images at different positions that are part of the acquired second image information and combines the plurality of cut-out images to generate second video information; and a learning unit that learns to infer a high-quality video from a low-quality video based on teacher data that includes the first video information and the second video information generated by the video information generation unit.
  • One aspect of the present invention is the learning device according to (A1) above, in which the second image information includes a plurality of images that capture the same subject as the subject captured in the image included in the first image information, each with different noise superimposed, and the video information generation unit generates the second video information by cutting out different portions from each of the plurality of images included in the second image information.
  • One aspect of the present invention is the learning device according to (A1) or (A2) above, in which the plurality of images included in the second image information are images captured at different times that are close to each other.
  • One aspect of the present invention is the learning device according to any one of (A1) to (A3) above, in which the video information generation unit generates the first video information by cutting out different portions from one image included in the first image information.
  • One aspect of the present invention is the learning device according to any one of (A1) to (A4) above, in which the video information generation unit cuts out the plurality of images at positions shifted in a predetermined direction so that the cut-out images are at different positions.
  • In one aspect, the video information generation unit cuts out the plurality of images at positions shifted by a predetermined number of bits in a predetermined direction.
  • One aspect of the present invention is the learning device according to (A6) above, in which the predetermined direction in which the video information generation unit cuts out the images is calculated by affine transformation.
  • One aspect of the present invention is the learning device according to (A6) above, further including a trajectory vector acquisition unit that acquires a trajectory vector, in which the predetermined direction in which the video information generation unit cuts out the images is calculated based on the acquired trajectory vector.
  • One aspect of the present invention is a learning device including: an image acquisition unit that acquires image information including at least one image; a cutting unit that cuts out a plurality of images at different positions that are part of the acquired image information; a first video information generation unit that combines the plurality of cut-out images to generate first video information; a noise superposition unit that superimposes noise on each of the plurality of images cut out by the cutting unit; a second video information generation unit that combines the plurality of images on which noise has been superimposed to generate second video information; and a learning unit that learns to infer a high-quality video from a low-quality video based on teacher data that includes the first video information generated by the first video information generation unit and the second video information generated by the second video information generation unit.
  • One aspect of the present invention is a program that causes a computer to execute: an image acquisition step of acquiring first image information including at least one image, and second image information that captures the same subject as the subject captured in the image included in the first image information and that includes at least one image of lower image quality than the image included in the first image information; a video information generation step of cutting out a plurality of images at different positions that are part of the acquired first image information, combining the plurality of cut-out images to generate first video information, cutting out a plurality of images at different positions that are part of the acquired second image information, and combining the plurality of cut-out images to generate second video information; and a learning step of learning to infer a high-quality video from a low-quality video based on teacher data including the first video information and the second video information generated by the video information generation step.
  • One aspect of the present invention is a learning method including: an image acquisition step of acquiring first image information including at least one image, and second image information that captures the same subject as the subject captured in the image included in the first image information and that includes at least one image of lower image quality than the image included in the first image information; a video information generation step of cutting out a plurality of images at different positions that are part of the acquired first image information, combining the plurality of cut-out images to generate first video information, cutting out a plurality of images at different positions that are part of the acquired second image information, and combining the plurality of cut-out images to generate second video information; and a learning step of learning to infer a high-quality video from a low-quality video based on teacher data including the generated first video information and second video information.
  • FIG. 1 is a block diagram illustrating an example of a functional configuration of a high-quality video generation system according to a first embodiment.
  • FIG. 1 is a diagram illustrating an example of a convolutional neural network according to the first embodiment.
  • FIG. 3 is a diagram illustrating frames constituting a moving image according to the first embodiment.
  • FIG. 2 is a diagram for explaining an overview of an input information generation method according to the first embodiment.
  • FIG. 2 is a block diagram illustrating an example of a functional configuration of input information generation according to the first embodiment.
  • FIG. 2 is a block diagram illustrating an example of a functional configuration of an input conversion section according to the first embodiment.
  • FIG. 7 is a diagram for explaining an overview of an input information generation method according to a second embodiment.
  • FIG. 3 is a block diagram illustrating an example of a functional configuration of input information generation according to a second embodiment.
  • FIG. 7 is a block diagram illustrating an example of a functional configuration of input information generation according to a third embodiment.
  • FIG. 7 is a block diagram illustrating an example of a functional configuration of input information generation according to a fourth embodiment.
  • A diagram for explaining an overview of a learning system according to a fifth embodiment.
  • A diagram showing an example of a functional configuration of a learning device according to the fifth embodiment.
  • FIG. 12 is a diagram for explaining an example of the position of an image cut out from a high-quality image by the learning device according to the fifth embodiment.
  • FIG. 12 is a diagram for explaining an example of the position of an image cut out from a low-quality image by the learning device according to the fifth embodiment.
  • FIG. 12 is a diagram for explaining an example of a direction in which a learning device according to a fifth embodiment cuts out.
  • FIG. 12 is a diagram illustrating an example of a functional configuration of a learning device according to a fifth embodiment when the learning device generates a moving image based on a trajectory vector.
  • FIG. 12 is a diagram for explaining an example of the position of an image cut out from a still image when a learning device according to a modification of the fifth embodiment generates a moving image based on a trajectory vector.
  • FIG. 12 is a flowchart illustrating an example of a series of operations of a learning method of a noise reduction device according to a modification of the fifth embodiment.
  • A diagram for explaining an overview of a learning system according to a sixth embodiment.
  • A diagram showing an example of a functional configuration of a video information generation unit according to the sixth embodiment.
  • the input information generation device, image processing device, and input information generation method according to the present embodiment receive low-quality video information with superimposed noise as input and generate high-quality video information from which noise has been removed.
  • Low-quality videos include videos with low image quality.
  • High-quality videos include videos with high image quality.
  • An example of a high-quality moving image is a moving image with high image quality captured by low ISO sensitivity and long exposure.
  • An example of a low-quality moving image is a moving image with low image quality captured by high ISO sensitivity and short exposure.
  • In the following, image quality deterioration due to noise will be described as an example of a low-quality video, but the present embodiment is widely applicable to factors other than noise that degrade the quality of a video.
  • Factors that can degrade video quality include a decrease in resolution or color shift due to optical aberrations, a decrease in resolution due to camera shake or subject shake, uneven black levels due to dark current or circuits, ghosts and flare due to high-brightness subjects, and signal level abnormalities.
  • noise includes streak-like noise that occurs in the horizontal or vertical direction of an image, noise that occurs in a fixed pattern in an image, and the like. Further, noise specific to moving images, such as flicker-like noise that fluctuates between consecutive frames, may be included.
  • the input information generation device, image processing device, and input information generation method according to the present embodiment improve the quality of a video by performing image processing on each frame included in the video to improve the image quality of each frame.
  • a video captured by an imaging device may be used, or a video prepared in advance may be used.
  • a low-quality video may be referred to as a low-image-quality video or a noise video.
  • a high-quality video may be referred to as a high-image-quality video.
  • the video targeted by the input information generation device, image processing device, and input information generation method according to the present embodiment may be a video captured by a CCD camera using a CCD (Charge Coupled Device) image sensor.
  • the video targeted by the input information generation device, image processing device, and input information generation method according to the present embodiment may also be a video captured by a CMOS camera using a CMOS (Complementary Metal Oxide Semiconductor) image sensor.
  • the video targeted by the input information generation device, image processing device, and input information generation method according to the present embodiment may be a color video or a monochrome video.
  • the video targeted by the input information generation device, image processing device, and input information generation method according to the present embodiment may be a video captured by an infrared camera using an infrared sensor or the like that acquires non-visible light components.
  • FIG. 1 is a block diagram showing an example of the functional configuration of a high-quality video generation system according to the first embodiment.
  • An example of the functional configuration of the high-quality video generation system 1 will be described with reference to the same figure.
  • the high-quality video generation system 1 includes an imaging device 100, an input information generation device 10, and a convolutional neural network 200 (hereinafter referred to as "CNN 200") as its functions.
  • the input information generation device 10 and CNN 200 perform image processing on each frame constituting the moving image captured by the imaging device 100.
  • the input information generation device 10 and the CNN 200 include trained models that have been trained in advance.
  • a configuration including the input information generation device 10 and CNN 200 may be referred to as an image processing device 2.
  • the high-quality moving image generation system 1 may be configured to include an encoding unit that compresses and encodes the output of the image processing device 2, and a predetermined memory that holds the results compressed and encoded by the encoding unit.
  • the imaging device 100 images a moving image.
  • the moving image captured by the imaging device 100 is a low-quality moving image that is subject to quality improvement.
  • the imaging device 100 may be, for example, a surveillance camera installed in a dark (low amount of light) location.
  • the imaging device 100 images a low-quality moving image due to insufficient light, for example.
  • the imaging device 100 outputs the captured moving image to the input information generation device 10.
  • a moving image captured by the imaging device 100 becomes an input to the image processing device 2. Therefore, the video output from the imaging device 100 to the input information generation device 10 may be referred to as video information IM.
  • both the imaging device 100 and the image processing device 2 may exist within a housing of a smartphone, a tablet terminal, or the like. That is, the high-quality video generation system 1 may exist as an element constituting an edge device. Further, the imaging device 100 may be connected to the image processing device 2 via a predetermined communication network. That is, the high-quality video generation system 1 may exist by having components connected to each other via a predetermined communication network. Further, the imaging device 100 may be configured to include a plurality of lenses and a plurality of image sensors respectively corresponding to the plurality of lenses. As a specific example of such a configuration, the imaging device 100 may include a plurality of lenses and image sensors so as to acquire images with different angles of view.
  • the images acquired from the respective image sensors can be said to be spatially adjacent to each other.
  • the high-quality video generation system 1 is applicable not only to a plurality of temporally adjacent images such as a video, but also to a plurality of spatially adjacent images.
  • the input information generation device 10 acquires video information IM from the imaging device 100.
  • the input information generation device 10 generates input information IN based on the acquired video information IM.
  • Input information IN is generated for each frame that constitutes moving image information IM.
  • the input information IN may be generated based on a target frame and other frames determined based on the target frame.
  • the other frame determined based on the frame may be a frame temporally adjacent to the target frame.
  • the CNN 200 is a convolutional neural network that uses the data output by the input information generation device 10 as input information IN. An example of the CNN 200 will be described with reference to FIG. 2.
  • FIG. 2 is a diagram showing an example of the CNN 200 according to the first embodiment. Details of the CNN 200 will be explained in detail with reference to the figure.
  • CNN 200 is a neural network with a multilayer structure.
  • the CNN 200 is a multilayer network including an input layer 210 to which input information IN is input, a convolution layer 220 to perform convolution operations, a pooling layer 230 to perform pooling, and an output layer 240. In at least a portion of the CNN 200, the convolution layer 220 and the pooling layer 230 are alternately connected.
  • The CNN 200 is a model widely used for image recognition and video recognition.
  • the CNN 200 may further include layers having other functions, such as a fully connected layer.
  • the pooling layer 230 may include a quantization layer that performs a quantization operation to reduce the number of bits of the operation result of the convolution layer 220. Specifically, when the result of the convolution operation in the convolution layer 220 is 16 bits, the quantization layer performs an operation to reduce the number of bits of that result to 8 bits or less.
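  • As a rough illustration only (not the patent's implementation), such a bit-reducing quantization step might look like the following sketch; the right-shift scaling is an assumption.

```python
import numpy as np

def quantization_layer(conv_out: np.ndarray) -> np.ndarray:
    """Hypothetical quantization layer: reduce 16-bit convolution results to
    8 bits or less. A simple arithmetic right shift by 8 bits is assumed here;
    the actual scaling used by the quantization layer is not specified."""
    return np.clip(conv_out.astype(np.int32) >> 8, 0, 255).astype(np.uint8)
```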
  • the CNN 200 may adopt a configuration in which the outputs of each of the plurality of convolutional layers 220 and pooling layers 230 included in the CNN 200 are used as intermediate outputs and inputs to other layers.
  • the CNN 200 may configure a U-net by using the outputs of the plurality of convolutional layers 220 and pooling layers 230 included in the CNN 200 as intermediate outputs and inputs to other layers.
  • the CNN 200 includes an encoder section that extracts a feature amount by a convolution operation, and a decoder section that performs a deconvolution operation based on the extracted feature amount.
  • Input information IN is input to the input layer 210.
  • Input information IN is generated based on the input image.
  • the input image is a frame image that constitutes a moving image.
  • the input information generation device 10 according to this embodiment generates input information IN from an input image.
  • the elements of the input information IN may be, for example, 2-bit unsigned integers (0, 1, 2, 3).
  • the elements of the input data may be, for example, 4-bit or 8-bit integers.
  • the convolution layer 220 performs a convolution operation on the input information IN input to the input layer 210.
  • the convolution layer 220 performs a convolution operation on the low-bit input information IN.
  • the convolution layer 220 outputs predetermined output data to the pooling layer 230 as a result of performing a predetermined convolution operation.
  • the pooling layer 230 extracts a representative value of a certain area based on the result of the convolution operation performed by the convolution layer 220. Specifically, the pooling layer 230 compresses the output data of the convolution layer 220 by performing an operation such as average pooling or MAX pooling on the output data of the convolution operation output by the convolution layer 220.
  • the output layer 240 is a layer that outputs the results of the CNN 200.
  • the output layer 240 may output the results of the CNN 200 using, for example, an identity function or a softmax function.
  • the layer provided before the output layer 240 may be the convolution layer 220, the pooling layer 230, or another layer.
  • FIG. 3 is a diagram showing frames constituting a moving image according to the first embodiment.
  • frames used by the input information generation device 10 to generate input information IN will be described.
  • the figure shows a plurality of consecutive frames constituting a moving image.
  • Frames F1 to F7 shown in the figure are examples of a plurality of consecutive frames constituting a moving image.
  • each frame is a RAW image that has not been compressed and encoded, and each pixel is expressed with 12 or 14 bits.
  • the number of pixels in each frame is the number of pixels necessary to satisfy a predetermined video format such as 1920x1080 or 4096x2160.
  • the processing target of the CNN 200 will be described as a RAW image, but the processing target is not limited to this. If the image to be processed contains sufficient signal components, the image that has been subjected to processing such as compression encoding may be used as the target.
  • the input information generation device 10 generates input information IN based on a target frame TF, which is a target frame, and an adjacent frame AF, which is a frame adjacent to the target frame TF.
  • the adjacent frame AF is, for example, a frame that is consecutively adjacent before or after the target frame TF.
  • two frames before and after the target frame TF are set as adjacent frames AF. That is, when the target frame TF is the frame F4, the frame F2, the frame F3, the frame F5, and the frame F6 are the adjacent frames AF.
  • the number of adjacent frames AF is not limited to this example, and may be one frame before and after the target frame TF, or three frames before and after the target frame TF. Further, the adjacent frames AF are not limited to the example of adjacent frames before and after the target frame TF, but may be only frames adjacent to either the front or the rear of the target frame TF, for example. Furthermore, the adjacent frame AF does not need to be continuous with the target frame TF; for example, when frame F4 is the target frame TF, the adjacent frame AF may be frames F2, F6, etc. that are not continuous with frame F4.
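  • For illustration only, a minimal sketch of how a target frame TF and its adjacent frames AF might be selected from a frame buffer, assuming two consecutive adjacent frames on each side as in the example above (the helper name and the edge handling are assumptions):

```python
def select_frames(frames, target_idx, n_adjacent=2):
    """Return (target frame TF, adjacent frames AF) for a given target index.

    frames: list of frame arrays making up the video.
    n_adjacent: number of consecutive frames taken before and after the target.
    Frames falling outside the video are skipped; this edge handling is an
    assumption, since the text does not specify it.
    """
    target = frames[target_idx]
    adjacent = [
        frames[i]
        for i in range(target_idx - n_adjacent, target_idx + n_adjacent + 1)
        if i != target_idx and 0 <= i < len(frames)
    ]
    return target, adjacent
```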
  • FIG. 4 is a diagram for explaining an overview of the input information generation method according to the first embodiment.
  • a method for generating input information IN by the input information generation device 10 will be described with reference to the same figure.
  • the figure shows a frame at time t-2, a frame at time t-1, a frame at time t, a frame at time t+1, and a frame at time t+2.
  • the frame at time t corresponds to the above-described target frame TF
  • the frames at time t-2, time t-1, time t+1, and time t+2 correspond to adjacent frames AF.
  • Since each frame includes a large number of pixels, a circuit that processes the entire frame at once becomes large-scale. Therefore, when processing is performed in the CNN 200, it is preferable to divide each frame into predetermined sizes. In this embodiment, as an example, a case will be described in which each frame is divided into a plurality of patches each having a size of 256×256.
  • Each frame is configured to include image data of 4 channels: R (Red), G (Green) × 2 channels, and B (Blue).
  • the input information generation device 10 performs quantization and vectorization of each channel, converting each channel into, for example, nine channels of data.
  • the nine-channel data may be data in which pixel values are quantized using different threshold values.
  • the input information generation device 10 combines the data of the five frames (the target frame and the four adjacent frames), each consisting of 4 channels × 9 = 36 channels, and outputs the combined 180-channel data to the input layer of the CNN 200 as input information IN.
  • the input information generation device 10 may generate the input information IN based on three-channel data including RGB, for example. Further, in the illustrated example, a case has been described in which 9-channel data is generated by performing quantization and vectorization based on image data, but the aspect of the present embodiment is not limited to this example. It is preferable that the number of data to be generated is a number that allows efficient calculation after synthesis.
  • the input information generation device 10 may generate M channels of data (M is a natural number of 1 or more) for each of the N channels (N is a natural number of 1 or more) of image data constituting one frame.
  • the value of N × M is preferably a value close to a multiple of 32 (or 64).
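  • The channel accounting described above (5 frames × 4 Bayer channels × 9 quantized channels = 180 channels) can be sketched roughly as follows; the threshold values, the numpy-based form, and the helper names are assumptions, not the patent's implementation:

```python
import numpy as np

# Placeholder thresholds: nine independent 2-bit quantizations (three sorted
# thresholds each) per Bayer channel. The actual threshold values are not
# given in the text and would normally be predefined or learned.
THRESHOLD_SETS = [np.array([512, 1536, 3072]) + 100 * i for i in range(9)]

def frame_to_input_channels(bayer_channels):
    """Convert one frame's four Bayer channels (R, G, G, B) of 12-bit data
    into 4 x 9 = 36 channels of 2-bit input data (values 0..3)."""
    planes = []
    for ch in bayer_channels:                 # 4 channels per frame
        for th in THRESHOLD_SETS:             # 9 quantizations per channel
            planes.append(np.digitize(ch, th).astype(np.uint8))
    return np.stack(planes)                   # shape: (36, H, W)

def build_input_information(frames_bayer):
    """Concatenate the converted data of five frames (target frame plus four
    adjacent frames) into one composite array of 5 x 36 = 180 channels."""
    return np.concatenate([frame_to_input_channels(f) for f in frames_bayer])
```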
  • FIG. 5 is a block diagram illustrating an example of a functional configuration for generating input information according to the first embodiment.
  • the input information generation device 10 includes an image acquisition section 11, an input conversion section 12, a composition section 13, and an output section 14.
  • the input information generation device 10 includes a CPU (Central Processing Unit) (not shown), a storage device such as a ROM (Read only memory) or a RAM (Random access memory), etc., which are connected via a bus.
  • the input information generation device 10 functions as a device including an image acquisition section 11, an input conversion section 12, a composition section 13, and an output section 14 by executing an input information generation program.
  • the input information generation device 10 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field-Programmable Gate Array).
  • the input information generation program may be recorded on a computer-readable recording medium.
  • the computer-readable recording medium is, for example, a portable medium such as a flexible disk, magneto-optical disk, ROM, or CD-ROM, or a storage device such as a hard disk built into a computer system.
  • the input information generation program may be transmitted via a telecommunications line.
  • a memory in which moving image information IM of a moving image captured by the imaging device 100 is stored is referred to as a first memory M1
  • a memory in which input information IN generated by the input information generation device 10 is stored is referred to as a second memory M2.
  • the first memory M1 and the second memory M2 are storage devices such as ROM or RAM.
  • the image acquisition unit 11 acquires image information IMD that includes the input image used for processing from among the video information IM stored in the first memory M1. Specifically, the image acquisition unit 11 acquires, as input images, a plurality of frames including at least a target frame TF for which input information IN is to be generated, among a plurality of frames constituting a moving image. For example, the image acquisition unit 11 acquires an adjacent frame AF as an input image in addition to the target frame TF.
  • the adjacent frames AF may be a plurality of frames that are consecutively adjacent to each other before and after the target frame TF.
  • the image acquisition unit 11 acquires pixel values of each color from one frame as a plurality of different images. For example, if the imaging device 100 uses an image sensor employing a Bayer array, the image acquisition unit 11 acquires four channels of RGGB image information from one frame. The pixel value of the image acquired by the image acquisition unit 11 includes multi-bit elements.
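  • As a minimal sketch only, splitting a Bayer-pattern RAW frame into four RGGB planes might look like the following; the RGGB pixel layout is an assumption, since the actual sensor layout is not specified:

```python
import numpy as np

def split_bayer_rggb(raw: np.ndarray):
    """Split a Bayer RAW frame (H x W) into four half-resolution planes.
    Assumes an RGGB layout (R at even rows/even columns, B at odd/odd);
    the actual sensor layout is not specified in the text."""
    r  = raw[0::2, 0::2]
    g1 = raw[0::2, 1::2]
    g2 = raw[1::2, 0::2]
    b  = raw[1::2, 1::2]
    return [r, g1, g2, b]
```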
  • the input conversion unit 12 acquires image information IMD from the image acquisition unit 11.
  • the input conversion unit 12 converts the pixel values of the input images of the multiple frames included in the image information IMD into low-bit input data IND based on comparisons with multiple threshold values. Since the input image is a RAW image whose pixel values include multi-bit (for example, 12-bit or 14-bit) elements, the input conversion unit 12 converts the pixel values of the input image, based on the plurality of threshold values, into input data IND with a smaller number of bits (for example, 2 bits or 1 bit, i.e., 8 bits or less).
  • the input conversion unit 12 outputs the converted input data IND to the synthesis unit 13.
  • the input conversion unit 12 performs conversion for each color. That is, the input conversion unit 12 converts each of the plurality of images acquired from one frame into input data IND.
  • FIG. 6 is a block diagram showing an example of the functional configuration of the input conversion section according to the first embodiment.
  • the input conversion section 12 includes a plurality of conversion sections 121 and a threshold storage section 122.
  • the input conversion unit 12 includes a conversion unit 121-1, a conversion unit 121-2, ..., a conversion unit 121-n (n is a natural number of 1 or more) as an example of the plurality of conversion units 121. Equipped with.
  • the number of converters 121 included in the input converter 12 may be the number of input data IND that the input converter 12 generates from one channel of input images. That is, when the input converter 12 converts a one-channel input image into nine-channel input data IND, the input converter 12 includes nine converters 121, converters 121-1 to 121-9.
  • the image data of the input image has a matrix-like data structure in which pixel data is multivalued, with each element in the x-axis direction and y-axis direction having more than 8 bits.
  • each element is quantized and becomes low-bit input data (for example, 2 bits or 1 bit, i.e., 8 bits or less).
  • the conversion unit 121 compares each element of the input image with a predetermined threshold.
  • the conversion unit 121 quantizes each element of the input image based on the comparison result.
  • the conversion unit 121 quantizes, for example, a 12-bit input image into a 2-bit or 1-bit value.
  • the conversion unit 121 may perform quantization by comparison with a number of threshold values corresponding to the number of bits after conversion. For example, for conversion to 1 bit, one threshold value is sufficient, and for conversion to 2 bits, three threshold values may be used. In other words, one threshold value may be used when the quantization performed by the conversion unit 121 is 1-bit quantization, and three threshold values may be used when it is 2-bit quantization. Note that if a large number of threshold values would be required, such as for 8-bit quantization, quantization may be performed using a function, a table, or the like instead of threshold values.
  • Each conversion unit 121 performs quantization on the same element using independent thresholds.
  • the input converter 12 outputs a vector including elements corresponding to the number of converters 121 as the calculation result (input data IND) for one channel of input.
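  • A rough sketch of a single conversion unit and of applying several conversion units with independent thresholds to the same channel is shown below; the threshold values are placeholders and the numpy-based form is an assumption:

```python
import numpy as np

def conversion_unit(channel: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """One conversion unit 121: quantize every element of a channel by
    comparison with sorted thresholds. One threshold yields 1-bit output
    (0 or 1); three thresholds yield 2-bit output (0 to 3)."""
    return np.digitize(channel, thresholds).astype(np.uint8)

def input_conversion(channel: np.ndarray, per_unit_thresholds) -> np.ndarray:
    """Apply n conversion units with independent thresholds to the same
    channel and stack the results into a vector of n low-bit planes
    (the input data IND for that channel)."""
    return np.stack([conversion_unit(channel, th) for th in per_unit_thresholds])

# Example: nine conversion units, each performing 2-bit quantization with
# three placeholder thresholds on 12-bit pixel values.
example_thresholds = [np.array([256, 1024, 3072]) + 64 * i for i in range(9)]
```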
  • the bit precision of the converted result, which is the output of the conversion unit 121, may be changed as appropriate based on the bit precision of the input image.
  • the threshold storage unit 122 stores a plurality of threshold values used in calculations performed by the conversion unit 121.
  • the threshold value stored in the threshold storage unit 122 is a predetermined value, and is set corresponding to each of the plurality of conversion units 121. Note that each threshold value may be a learning target parameter, and may be determined and updated in the learning step.
  • the mode of the input converter 12 is not limited to this.
  • When the input image is image data that includes elements of three or more channels including color components, the conversion units 121 may be divided into a plurality of groups corresponding to the channels, and the corresponding elements may be input to and converted by each group.
  • Further, some kind of conversion processing may be applied in advance to the elements to be input to a predetermined conversion unit 121, or the conversion unit 121 to which the elements are input may be switched depending on the presence or absence of such pre-processing.
  • the number of conversion units 121 does not need to be fixed, and may be determined as appropriate depending on the structure of the neural network or hardware information. Note that if it is necessary to compensate for a decrease in calculation accuracy due to quantization by the conversion units 121, it is preferable to set the number of conversion units 121 to be greater than or equal to the bit precision of each element of the input image. More generally, it is preferable to set the number of conversion units 121 to be greater than the difference in bit precision of the input image before and after quantization. Specifically, when quantizing an input image whose pixel values are represented by 8 bits to 1 bit, it is preferable to set the number of conversion units 121 to 7 or more (for example, 16 or 32), corresponding to the 7-bit difference.
  • the combining unit 13 combines (concats) the plurality of converted input data IND into one data.
  • Data obtained by combining a plurality of input data is also referred to as composite data CD.
  • the combining process by the combining unit 13 may be a process of arranging (or connecting) a plurality of input data IND into one data.
  • the output section 14 outputs the composite data CD synthesized by the synthesis section 13.
  • the composite data CD may be temporarily stored in the second memory M2.
  • the composite data CD is, in other words, the input information IN input to the input layer 210 of the CNN 200.
  • After generating the input information IN for the target frame TF, the input information generation device 10 generates input information IN for the next frame after the target frame TF.
  • the next frame may be a frame temporally continuous with the target frame TF. That is, after the output unit 14 outputs the composite data CD regarding the target frame TF, the image acquisition unit 11 shifts the target frame TF by one frame and sets a frame adjacent to the previous target frame TF as the new target frame TF. In this way, the input information generation device 10 acquires a plurality of frames including at least the target frame TF as input images and generates the composite data CD.
  • the input information generation device 10 generates input information IN for all frames included in the video information IM. Note that, in the example described above, the input information generation device 10 generates input information IN for all frames included in the video information IM, but the aspect of the present embodiment is not limited to this example. For example, the input information generation device 10 may generate input information IN every predetermined number of frames. Furthermore, although the high-quality video generation system 1 converts the video into a high-quality video based on the video information IM, the output format is not limited to the video format. For example, the high-quality video generation system 1 may generate still images from videos. That is, the high-quality video generation system 1 can apply this embodiment even when extracting frames included in the video information IM to generate still images, by using the frames extracted from the video as target frames TF.
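  • A minimal sketch of sliding the target frame TF over the whole video, assuming the hypothetical build_input_information helper from the earlier sketch and a window of two adjacent frames on each side:

```python
def generate_all_input_information(frames_bayer, n_adjacent=2):
    """Slide the target frame TF over the whole video and build the composite
    data (input information IN) for every frame, using the hypothetical
    build_input_information helper from the earlier sketch. Near the edges of
    the video the window simply shrinks, which is an assumption."""
    results = []
    for t in range(len(frames_bayer)):
        lo = max(0, t - n_adjacent)
        hi = min(len(frames_bayer), t + n_adjacent + 1)
        results.append(build_input_information(frames_bayer[lo:hi]))
    return results
```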
  • As described above, by including the image acquisition unit 11, the input information generation device 10 acquires, as input images, a plurality of frames including at least the target frame TF for which the input information IN is to be generated, among the frames constituting the moving image.
  • By including the input conversion unit 12, the input information generation device 10 converts the pixel values of the acquired input images of the plurality of frames into input data IND with a smaller number of bits (for example, 2 bits or 1 bit, i.e., 8 bits or less) than the number of bits representing the pixel values of the input image (for example, 12 bits).
  • The input information generation device 10 also includes the synthesizing section 13, which synthesizes the plurality of converted input data IND into one composite data CD, and the output section 14, which outputs the synthesized composite data CD. That is, according to the present embodiment, the input information generation device 10 generates input information IN by combining a plurality of input data IND obtained from a plurality of images including at least the target frame TF. The generated input information IN is input to the input layer 210 of the CNN 200.
  • the input information IN is information with lower bits than the input image in each element.
  • the CNN 200 can process low-bit information. Therefore, according to this embodiment, the processing of the CNN 200 can be reduced in weight.
  • the input information IN is information generated based on a plurality of images. Therefore, when improving the quality of a video based on the input information IN, it becomes possible to perform processing that also takes into account multiple frames adjacent to the target frame TF, making it possible to remove noise with high precision. Therefore, according to this embodiment, it is possible to convert a low-quality video into a high-quality video with lightweight calculations. Note that the processing of the CNN 200 is not limited to noise removal.
  • the moving image information IM includes, for example, pixel values of each color of RGB.
  • the image acquisition unit 11 acquires pixel values of each color from one frame as a plurality of different images, and the input conversion unit 12 converts each of the plurality of images acquired from one frame into different input data IND. Therefore, according to this embodiment, more accurate image processing can be performed, and noise can be removed with even higher accuracy.
  • the image acquisition unit 11 acquires images of a plurality of frames consecutively adjacent to each of the front and rear of the target frame TF, among the plurality of frames constituting the moving image.
  • the input information IN is generated based on the information of the frames adjacent before and after the target frame TF, so that more accurate image processing can be performed. Therefore, according to this embodiment, noise can be removed with high accuracy.
  • After the composite data CD regarding the target frame TF is output, the image acquisition unit 11 sets a frame adjacent to the target frame TF as the target frame TF and acquires a plurality of frames including at least the target frame TF as input images. That is, the input information generation device 10 generates input information IN for each of the plurality of frames included in the video by shifting the target frame TF one after another. Therefore, according to this embodiment, a low-quality video can be converted into a high-quality video.
  • the input information generation device 10 reads a plurality of frames including a target frame TF and an adjacent frame AF, and performs quantization on each of the plurality of frames. Therefore, according to the input information generation device 10 according to the first embodiment, the number of frames to be quantized is large, and the calculation load on the first layer becomes large. The second embodiment attempts to solve this problem and further lighten the calculation load.
  • FIG. 7 is a diagram for explaining an overview of the input information generation method according to the second embodiment.
  • a method for generating input information IN by the input information generation device 10A according to the second embodiment will be described with reference to the same figure.
  • a frame at time t is shown as the target frame TF.
  • a frame at time t-2, a frame at time t-1, a frame at time t+1, and a frame at time t+2 are shown as adjacent frames AF.
  • Each frame is configured to include image data of four channels of RGGB.
  • the input information generation device 10A performs quantization and vectorization of each channel for the target frame TF.
  • the input information generation device 10A performs quantization and vectorization of each channel on the average image of the adjacent frames AF. That is, the second embodiment differs from the first embodiment in that the average image of the adjacent frames AF is quantized and vectorized, instead of quantizing and vectorizing each adjacent frame AF individually.
  • the input information generation device 10A converts each of the target frame TF and the average image of the adjacent frame AF into 64 channels of data, so a total of 128 channels of data is generated.
  • the averaging process may be performed by calculating a simple average of pixel values. Furthermore, the input information generation device 10A may generate an average image for each color by calculating the average for each color. That is, the input information generation device 10A may generate an average image based on the R images of the adjacent frames AF from time t-2 to time t+2, an average image based on the G images of the adjacent frames AF from time t-2 to time t+2, and an average image based on the B images of the adjacent frames AF from time t-2 to time t+2.
  • quantization and vectorization are performed after calculating the average of adjacent frames AF, but the aspect of the present embodiment is not limited to this example.
  • the input information generation device 10A may be configured to calculate the average after performing quantization and vectorization, for example.
  • input information IN is generated by combining these.
  • 128 channels of data are generated.
  • the combined input information IN is smaller than the amount of information according to the first embodiment.
  • the amount of data representing the target frame TF is increased compared to the first embodiment.
  • FIG. 8 is a block diagram illustrating an example of a functional configuration for generating input information according to the second embodiment.
  • An example of the functional configuration of the input information generation device 10A will be described with reference to the same figure.
  • the input information generation device 10A differs from the input information generation device 10 in that it further includes an integration section 15, an input conversion section 12A instead of the input conversion section 12, and a synthesis section 13A instead of the synthesis section 13.
  • the same components as the input information generation device 10 may be given the same reference numerals and the description thereof may be omitted.
  • the integrating unit 15 acquires information regarding adjacent frames AF from the image acquiring unit 11.
  • the adjacent frame AF is a frame other than the target frame TF among the plurality of frames acquired by the image acquisition unit 11.
  • the integrating unit 15 performs a process of integrating a plurality of adjacent frames AF into one integrated frame IF by performing calculations based on the pixel values of the adjacent frames AF.
  • the integrating unit 15 outputs the integrated frame IF obtained as a result of the calculation to the input converting unit 12A.
  • the integrating unit 15 may perform the integrating process by, for example, obtaining a simple average of adjacent frames AF.
  • the integrating unit 15 takes, for example, the average value of the pixel values of the plurality of adjacent frames AF as the pixel value of the integrated frame IF.
  • the aspect of the integration process of the integration unit 15 is not limited to the example of calculating a simple average.
  • the integrating unit 15 may perform the integrating process by calculating a weighted average, for example.
  • the weighted average is calculated according to the temporal distance from the target frame TF.
  • For example, the integrating unit 15 may reduce the weight of the frames at time t-2 and time t+2, which are temporally distant from time t, by multiplying their pixel values by 0.7 (a value smaller than 1), and increase the weight of the frames at time t-1 and time t+1 by multiplying their pixel values by 1.3 (a value greater than 1).
  • the integrating unit 15 may calculate the pixel value of the integrated frame IF by calculating a weighted average of the plurality of adjacent frames AF according to the temporal distance from the target frame TF. By integrating using a weighted average, the degree of contribution of the adjacent frames AF to the target frame TF can be reflected in the integrated frame IF. Note that even when the temporal distance from the target frame TF is the same, the weight to be multiplied may differ depending on whether the adjacent frame is before or after the target frame TF. For example, if the adjacent frame AF is before the target frame TF, the weight may be increased, and if the adjacent frame AF is after the target frame TF, the weight may be decreased.
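  • A rough sketch of the integration step is shown below, covering both the simple average and the weighted average according to temporal distance; the 0.7 / 1.3 weights mirror the example above, and the normalization by the weight sum is an assumption:

```python
import numpy as np

def integrate_adjacent_frames(adjacent_frames, weights=None):
    """Integrate the adjacent frames AF into one integrated frame IF.

    With weights=None a simple average is taken; otherwise a weighted average
    reflecting the temporal distance from the target frame is computed
    (np.average normalizes by the sum of the weights, which is an assumption).
    """
    stack = np.stack([f.astype(np.float32) for f in adjacent_frames])
    if weights is None:
        return stack.mean(axis=0)
    return np.average(stack, axis=0, weights=np.asarray(weights, dtype=np.float32))

# Frames at t-2, t-1, t+1, t+2: distant frames weighted 0.7, near frames 1.3,
# mirroring the example values in the text.
# integrated = integrate_adjacent_frames([f_m2, f_m1, f_p1, f_p2],
#                                        weights=[0.7, 1.3, 1.3, 0.7])
```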
  • the input conversion unit 12A converts the pixel value of the target frame TF acquired from the image acquisition unit 11 into input data IND based on comparison with a plurality of threshold values. Further, the input conversion unit 12A converts the pixel values of the integrated frame IF obtained from the integration unit 15 into input data based on comparison with a plurality of threshold values. The input conversion section 12A outputs the converted input data IND to the synthesis section 13A.
  • the synthesis unit 13A synthesizes a plurality of input data IND converted based on the target frame TF and a plurality of input data IND converted based on the integrated frame IF into one composite data CD.
  • the combined input information IN is smaller than the amount of information according to the first embodiment.
  • the input information generation device 10A further includes the integrating unit 15, thereby performing calculations based on the pixel values of adjacent frames AF, and converting a plurality of adjacent frames AF into one integrated frame IF. Integrate.
  • the input conversion unit 12A converts the pixel value of the target frame TF into input data IND based on comparison with a plurality of threshold values, and further converts the pixel value of the integrated frame IF into input data IND based on comparison with a plurality of threshold values. Convert to IND.
  • the synthesizing unit 13A synthesizes a plurality of input data IND converted based on the target frame TF and a plurality of input data IND converted based on the integrated frame IF into one composite data CD. That is, the input information generation device 10A does not quantize and vectorize the adjacent frame AF like the target frame TF, but calculates an integrated frame IF based on a plurality of adjacent frames AF, and quantizes the integrated frame IF. and vectorization. Therefore, the input information IN synthesized by the synthesizing section 13A is smaller than the amount of information according to the first embodiment, and the calculation load on the first layer can be reduced.
  • the integrating unit 15 sets the average value of the pixel values of the plurality of adjacent frames AF as the pixel value of the integrated frame IF. That is, the integrating unit 15 takes the simple average of the adjacent frames AF as the integrated frame IF. Therefore, according to the present embodiment, input information IN having a smaller amount of information than the input information IN according to the first embodiment can be generated by a simple calculation, and the calculation load on the first layer can be reduced.
  • the integrating unit 15 calculates the pixel value of the integrated frame IF by calculating a weighted average of the plurality of adjacent frames AF according to the temporal distance from the target frame TF. Therefore, according to the present embodiment, the integrated frame IF can be generated in consideration of the degree of contribution of the adjacent frames AF to the target frame TF. Therefore, by using the input information IN generated by the input information generation device 10A, the CNN 200 can perform image processing with higher accuracy.
  • the input information generation device 10A includes the integrating unit 15 to calculate the average value of adjacent frames AF.
  • calculating the average value for each frame causes duplication of processing and is not efficient. Therefore, the third embodiment attempts to solve this problem and further lighten the calculation load.
  • FIG. 9 is a block diagram illustrating an example of a functional configuration for generating input information according to the third embodiment.
  • An example of the functional configuration of the input information generation device 10B will be described with reference to the same figure.
• the input information generation device 10B differs from the input information generation device 10A in that it further includes an average value temporary storage section 16, an imaging condition acquisition section 17, and an adjustment section 18, and includes an integration section 15B instead of the integration section 15.
  • the same components as the input information generation device 10A may be given the same reference numerals and the description thereof may be omitted.
  • the average value temporary storage unit 16 stores the average value of the pixel values of a predetermined frame among the frames making up the moving image.
  • the value stored in the average value temporary storage section 16 is calculated by the integration section 15B.
  • the integration unit 15B acquires information about the target frame TF from the image acquisition unit 11, and acquires the stored value SV from the average value temporary storage unit 16.
  • the integrating unit 15B calculates the pixel value of the integrated frame IF by calculation based on the target frame TF and the stored value SV, which is the value stored in the average value temporary storage unit 16.
  • the integrating unit 15B stores the calculated value in the average value temporary storage unit 16 as the calculated value CV. That is, the value stored in the average value temporary storage unit 16 is updated every time the integration unit 15B calculates a new target frame TF.
  • the input information generation device 10B calculates a moving average based on the target frame TF. Note that regarding the calculation for the first frame, since the stored value SV does not yet exist in the average value temporary storage unit 16, the integrating unit 15B may perform the calculation based only on the target frame TF.
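• A minimal sketch of this moving-average integration is shown below; the smoothing factor alpha is a hypothetical parameter, and the specification only requires that the stored value SV be updated from the target frame TF every frame.

```python
import numpy as np

class MovingAverageIntegrator:
    """Sketch of the integration unit 15B: keeps one stored value SV (the running
    average) instead of all adjacent frames. `alpha` is a hypothetical smoothing
    factor chosen for illustration."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.stored_value = None  # contents of the average value temporary storage unit 16

    def integrate(self, target_frame):
        tf = target_frame.astype(np.float64)
        if self.stored_value is None:      # first frame: no SV exists yet, use TF only
            self.stored_value = tf
        else:                              # moving average of SV and TF
            self.stored_value = (1 - self.alpha) * self.stored_value + self.alpha * tf
        return self.stored_value           # pixel values of the integrated frame IF (also new CV)
```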
  • the imaging condition acquisition unit 17 acquires video imaging conditions from the imaging device 100.
  • the video imaging conditions acquired by the imaging condition acquisition unit 17 may be, for example, settings of the imaging device such as shutter speed, aperture, or ISO sensitivity. Further, the video imaging conditions acquired by the imaging condition acquisition unit 17 may include other information regarding the operation and drive of the imaging device 100.
  • the adjustment unit 18 adjusts the value (average value) stored in the average value temporary storage unit 16 according to the imaging condition acquired by the imaging condition acquisition unit 17.
• if the imaging conditions of the imaging device 100 change, the relationship between the pixel values of the target frame TF and the past moving average value also changes.
• for example, if the ISO sensitivity is doubled due to a change in the settings of the imaging device 100, the pixel values of the target frame TF become brighter than the past moving average value, so the moving average value suddenly appears relatively dark. Therefore, when the ISO sensitivity is doubled due to a change in the settings of the imaging device 100, by doubling the value stored in the average value temporary storage unit 16, the integrating unit 15B can continue calculating the moving average value while still utilizing the values stored in the average value temporary storage unit 16.
• if the shooting scene of the moving image changes, the CNN 200 may not be able to properly perform image processing on the target frame TF. Therefore, when the shooting scene of the moving image has changed, it is preferable not to compute a moving average with past frames.
• the adjustment unit 18 may therefore be configured to reset the value stored in the average value temporary storage unit 16 when the shooting scene of the video changes, so that the integration unit 15B starts calculating a new moving average value. Whether or not the shooting scene of the video has changed may be determined based on, for example, whether the power button of the imaging device 100 is turned on or off, or whether the shooting button or the stop button is turned on or off.
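• A sketch of these adjustments (scaling the stored average for a gain change, and resetting it on a scene change) is shown below, reusing the MovingAverageIntegrator sketch above; the linear scaling by the ISO ratio is an assumption made for illustration.

```python
def adjust_for_gain_change(stored_value, old_iso, new_iso):
    """Sketch of the adjustment unit 18: scale the stored average so it stays
    comparable with frames captured under the new ISO sensitivity
    (e.g. ISO doubled -> stored value doubled). Linear scaling is an assumption."""
    return stored_value * (new_iso / old_iso)

def reset_on_scene_change(integrator):
    """When the shooting scene changes, discard SV so a new moving average starts."""
    integrator.stored_value = None
```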
• during video shooting, a frame with a large change in brightness may be inserted. Possible causes of such a frame include a case where a light source enters the frame due to a change in the imaging angle, a case where a car's headlights are reflected, and the like.
  • the integrating unit 15B may exclude a frame with a large luminance change among the plurality of frames from the targets for calculating the average value. By excluding frames with large brightness changes from the average value calculation target, it is possible to prevent the moving average value from being dragged by frames with large brightness changes.
• whether the brightness change is large may be determined by comparing the pixel values of the immediately preceding target frame TF with the pixel values of the target frame TF to be calculated, and checking whether the difference is less than or equal to a threshold value.
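• A minimal sketch of this exclusion check is shown below; comparing the mean pixel values of the two frames and the threshold value are illustrative choices.

```python
import numpy as np

def has_large_luminance_change(prev_frame, target_frame, threshold=30.0):
    """Sketch of the exclusion check: compare the mean pixel value of the previous
    target frame with that of the current target frame; if the difference exceeds
    a threshold (the value here is illustrative), the frame is excluded from the
    moving-average calculation."""
    diff = abs(float(np.mean(target_frame)) - float(np.mean(prev_frame)))
    return diff > threshold
```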
  • the input information generation device 10B further includes the average value temporary storage unit 16 to store the average value of the pixel values of a predetermined frame among the frames constituting the moving image. Further, according to the input information generation device 10B, the integrating unit 15B calculates the pixel value of the integrated frame IF by calculation based on the value stored in the average value temporary storage unit 16 and the target frame TF. That is, the input information generation device 10B calculates the pixel value of the integrated frame IF based on the target frame TF and the stored moving average value. Therefore, the input information generation device 10B has a lighter calculation load than the input information generation device 10A. Therefore, according to this embodiment, the calculation load can be further reduced.
• the integrating unit 15B excludes frames with large luminance changes among the plurality of frames making up the video from the targets for calculating the average value. Therefore, according to the present embodiment, when the brightness suddenly changes greatly, excluding that frame from the moving-average calculation prevents the pixel values of the integrated frame IF from being dragged by the pixel values of the frame whose brightness has changed greatly.
• the input information generation device 10B further includes the imaging condition acquisition unit 17 to acquire the imaging conditions of the moving image, and further includes the adjustment unit 18 to adjust the average value stored in the average value temporary storage section 16 according to the acquired imaging conditions. Therefore, according to this embodiment, the average value can be adjusted in accordance with changes in the imaging conditions, and even if the imaging conditions change, the calculation of the moving average value can continue.
  • the input information generation device 10B calculates a moving average by including the average value temporary storage section 16.
  • the integrating unit 15B calculates the moving average of the entire image.
• the input information generation device 10B generates input data IN based on the target frame TF and the moving average, and the CNN 200 removes noise from the moving image based on the generated input data IN. Since the integrating unit 15B calculates a moving average of the entire image, if a moving subject is photographed in part of the video, afterimages may occur in the part where the moving subject is photographed as a result of noise removal, which can be a problem.
  • the fourth embodiment attempts to solve this problem.
  • FIG. 10 is a block diagram illustrating an example of a functional configuration for generating input information according to the fourth embodiment.
  • An example of the functional configuration of an input information generation device 10C according to the fourth embodiment will be described with reference to the same figure.
  • the input information generation device 10C is different from the input information generation device 10B in that it further includes a comparison section 19 and includes an integration section 15C instead of the integration section 15B.
• like the input information generation device 10B, the input information generation device 10C may or may not include the imaging condition acquisition section 17 and the adjustment section 18. In the illustrated example, the input information generation device 10C does not include the imaging condition acquisition section 17 and the adjustment section 18.
  • the same components as the input information generation device 10B may be given the same reference numerals and the description thereof may be omitted.
  • the comparison unit 19 acquires the stored value SV from the average value temporary storage unit 16 and acquires the target frame TF from the integration unit 15C.
  • the comparison unit 19 compares the acquired stored value SV stored in the average value temporary storage unit 16 and the pixel value of the target frame TF.
  • the comparison unit 19 may compare the entire image, each pixel, or each patch composed of a plurality of pixels.
  • the comparison unit 19 outputs the comparison result to the integration unit 15C as a comparison result CR.
  • the comparison result CR may include a difference between pixel values, or may include information about the result of comparing the difference with a predetermined threshold.
• the integration unit 15C acquires the comparison result CR from the comparison unit 19 and acquires the stored value SV from the average value temporary storage unit 16. Based on the comparison result CR obtained by the comparison unit 19, if the difference is less than or equal to a predetermined value, the integration unit 15C calculates a moving average based on the stored value SV stored in the average value temporary storage unit 16 and the target frame TF, and sets the calculated value as the pixel value of the integrated frame IF. If the difference exceeds the predetermined value, the integration unit 15C sets the pixel value of the target frame TF as the pixel value of the integrated frame IF.
  • the integration process by the integration unit 15C may be performed for the entire image, for each pixel, or for each patch made up of multiple pixels.
• the pixel value of the integrated frame IF is therefore the moving average value at locations where the difference is less than or equal to the predetermined value (that is, locations where the movement is small), and the pixel value of the target frame TF at locations where the difference exceeds the predetermined value (that is, locations where the movement is large).
  • the integrating unit 15C stores the calculated result in the average value temporary storage unit 16 as a calculated value CV.
• as a result, a subject with large movement is not incorporated into the average image, while a background with small movement is incorporated into the average image.
  • the calculation performed by the input information generation device 10C can also be called a selective average.
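• A minimal sketch of this per-pixel selective average is shown below; alpha and diff_threshold are illustrative parameters, and the comparison is done per pixel here although, as noted above, it could equally be done per patch or for the entire image.

```python
import numpy as np

def selective_average(stored_value, target_frame, alpha=0.2, diff_threshold=20.0):
    """Sketch of the integration unit 15C: per-pixel selective average.
    Where |SV - TF| <= diff_threshold (small movement) the moving average is used;
    elsewhere (large movement) the pixel value of TF is used directly."""
    sv = stored_value.astype(np.float64)
    tf = target_frame.astype(np.float64)
    moving_avg = (1 - alpha) * sv + alpha * tf
    small_motion = np.abs(sv - tf) <= diff_threshold      # comparison result CR per pixel
    integrated = np.where(small_motion, moving_avg, tf)   # integrated frame IF
    return integrated                                      # also stored back as CV
```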
• instead of making a binary choice between incorporating and not incorporating a frame into the average image, the integrating unit 15C may weight the contribution with a coefficient. For example, if the difference is less than or equal to a predetermined value based on the comparison result CR obtained by the comparison unit 19, the integrating unit 15C may calculate a moving average based on the stored value SV multiplied by a predetermined coefficient (for example, 0.9) and the target frame TF, and use the result as the pixel value of the integrated frame IF.
• conversely, if the difference exceeds the predetermined value, the integrating section 15C may calculate a moving average based on the stored value SV multiplied by a predetermined coefficient (for example, 0.1) and the target frame TF, and use the result as the pixel value of the integrated frame IF.
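• One possible reading of this coefficient variant is sketched below; the exact combination of the weighted SV and TF is an assumption, and the coefficients 0.9 and 0.1 are the illustrative values mentioned above.

```python
import numpy as np

def coefficient_weighted_average(stored_value, target_frame, small_motion_mask,
                                 coeff_small=0.9, coeff_large=0.1):
    """Sketch of the coefficient variant: instead of a hard include/exclude choice,
    SV is multiplied by a larger coefficient (e.g. 0.9) where motion is small and a
    smaller coefficient (e.g. 0.1) where motion is large before averaging with TF."""
    sv = stored_value.astype(np.float64)
    tf = target_frame.astype(np.float64)
    coeff = np.where(small_motion_mask, coeff_small, coeff_large)
    return (coeff * sv + tf) / 2.0   # simple average of the weighted SV and TF (one possible reading)
```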
• the input information generation device 10C further includes the comparison unit 19, which compares the stored value SV stored in the average value temporary storage unit 16 with the pixel values of the target frame TF. If the difference is less than or equal to a predetermined value as a result of the comparison by the comparison unit 19, the integration unit 15C calculates a moving average based on the stored value SV stored in the average value temporary storage unit 16 and the target frame TF, and uses it as the pixel value of the integrated frame IF. If the difference exceeds the predetermined value as a result of the comparison by the comparison unit 19, the integration unit 15C sets the pixel value of the target frame TF as the pixel value of the integrated frame IF.
  • the pixel value of the integrated frame IF is determined by distinguishing between a subject with large movement and a background with small movement, and selectively performing averaging processing.
  • the input information generation device 10C generates input data IN based on the target frame TF and the integrated frame IF. Therefore, according to the present embodiment, since the pixel values of the adjacent frame AF are not reflected in the pixel values of the integrated frame IF in areas where there is large movement, it is possible to prevent the problem of occurrence of afterimages.
  • any one of the frames adjacent to the target frame TF may be specified by a predetermined algorithm and used as the integrated frame IF.
  • the predetermined algorithm may be one that randomly identifies one frame among frames adjacent to the target frame TF.
  • the integrating unit 15 sets a randomly specified frame among the adjacent frames AF acquired by the image acquiring unit 11 as an integrated frame IF.
• the aspects described above are not limited to any one of the first to fourth embodiments; any of the first to fourth embodiments may be selectively used based on predetermined conditions.
  • the predetermined conditions may be video shooting conditions, shooting mode, exposure conditions, type of subject, etc.
• when training the CNN 200, as in any one of the first to fourth embodiments, it is preferable to perform learning using not only the target frame TF but also the adjacent frames AF. Calculations related to learning do not necessarily need to be executed in the image processing device 2; results such as parameters learned in advance in a dedicated learning device may be included in the CNN 200 as a trained model.
  • a learning model is trained using a combination of a noise image on which noise is superimposed and a high-quality image as training data.
  • the training data is created by capturing images of the same object with different exposure settings using an imaging device to obtain a high-quality image and a noise image.
  • machine learning requires a large amount of training data, and creating training data by capturing images using a camera is time-consuming. Therefore, a technique is known in which training data is created by adding random noise to a high-quality image (for example, see Japanese Patent Application Laid-Open No. 2021-071936). It is known to use such conventional techniques to create training data for inferring a high quality image from a low quality image by adding random noise to a high quality image.
  • the present invention aims to provide a technology that can generate training data for inferring a high-quality video from a low-quality video.
  • a learning model is trained to infer a high-quality video from which noise is removed by inputting low-quality video information with superimposed noise.
• here, low-quality videos include videos of low image quality, and high-quality videos include videos of high image quality.
  • Teacher data used for learning by the learning device, program, and learning method of the noise reduction device according to the present embodiment is generated from a still image of a subject.
• the still images of the subject may be a single high-quality image, or a combination of multiple images of the same subject (one or more high-quality images and one or more low-quality images).
  • a plurality of images of the same subject may be captured under different imaging conditions.
• the images captured of the subject may be any other combination as long as at least one image is included.
  • a high-quality image can be, for example, a high-quality image captured by low ISO sensitivity and long exposure.
  • a high quality image may be referred to as GT (Ground Truth).
  • An example of a low-quality image is a low-quality image captured by high ISO sensitivity and short exposure.
• in the present embodiment, image quality degradation due to noise will be described as an example of what makes an image low quality, but the present embodiment is widely applicable to factors other than noise that degrade image quality. Factors that reduce image quality include reduced resolution or color shift due to optical aberrations, reduced resolution due to camera shake or subject blur, uneven black levels due to dark current or circuitry, ghosts and flare caused by high-brightness objects, and signal level abnormalities.
• in the following description, a low-quality image may also be referred to as a noise image, and a high-quality image may also be referred to as GT. Similarly, a low-quality video may also be referred to as a noise video, and a high-quality video may also be referred to as GT.
  • Images targeted by the learning device may be still images or frames included in a video.
  • the data format may be a format that has not undergone compression encoding processing, such as a Raw format, or a format that has undergone compression encoding processing, such as a Jpeg format or an MPEG format.
  • the image targeted by the learning device according to the present embodiment may be an image captured by a CCD camera using a CCD (Charge Coupled Devices) image sensor. Further, the image targeted by the learning device according to the present embodiment may be an image captured by a CMOS camera using a complementary metal oxide semiconductor (CMOS) image sensor. Further, the image targeted by the learning device according to the present embodiment may be a color image or a monochrome image. Further, the image targeted by the learning device according to the present embodiment may be an image captured by an infrared camera using an infrared sensor or the like to obtain a non-visible light component.
  • FIG. 11 is a diagram for explaining an overview of the learning system according to the fifth embodiment.
  • An overview of the learning system 1001 will be explained with reference to the same figure.
  • a learning system 1001 shown in the figure is an example of a configuration in the learning stage of machine learning.
  • the learning system 1001 trains the learning model 1040 using teacher data TD generated based on images captured by the imaging device 1020.
  • the learning system 1001 is equipped with an imaging device 1020 to capture a high-quality image 1031 and a low-quality image 1032.
  • the high-quality image 1031 and the low-quality image 1032 are images of the same subject.
  • the high-quality image 1031 and the low-quality image 1032 are captured at the same angle of view and imaging angle, but with different settings such as ISO sensitivity and exposure time.
• although one high-quality image 1031 is shown, there may be a plurality of high-quality images 1031. Likewise, although a plurality of low-quality images 1032 are shown, there may be only one low-quality image 1032.
  • the plurality of low-quality images 1032 are different images captured with different settings such as ISO sensitivity and exposure time.
  • the imaging device 1020 may be, for example, a smartphone having communication means, a tablet terminal, or the like. Further, the imaging device 1020 may be a surveillance camera or the like having communication means.
  • the learning system 1001 generates a high-quality video 1033 from a high-quality image 1031 and a low-quality video 1034 from a low-quality image 1032.
  • the high-quality video 1033 is preferably generated from one high-quality image 1031
  • the low-quality video 1034 is preferably generated from a plurality of low-quality images 1032.
  • a high-quality video 1033 and a low-quality video 1034 generated from a high-quality image 1031 and a low-quality image 1032 captured from the same subject are associated with each other.
  • a high-quality video 1033 and a low-quality video 1034 that correspond to each other are input to the learning model 1040 for learning as teacher data TD.
  • the high-quality video 1033 and low-quality video 1034 that correspond to each other may be temporarily stored in a predetermined storage device for later learning. That is, the learning system 1001 may generate a plurality of teacher data TD in advance before learning that is performed later. Further, the high quality image 1031 and the low quality image 1032 captured by the imaging device 1020 may be temporarily stored in a predetermined storage device. In this case, the learning system 1001 may store a plurality of combinations of mutually corresponding high-quality images 1031 and low-quality images 1032, and generate teacher data TD during learning.
  • the learning model 1040 is trained using the teacher data TD generated by the learning system 1001. Specifically, the learning model 1040 is trained to infer high quality videos from low quality videos. In other words, the learning model 1040 after learning infers a high-quality video using a low-quality video as input, and outputs the inference result. That is, the learned model 1040 after learning may be used in a noise reduction device for removing noise from a low-quality video.
  • the high-quality image 1031 and low-quality image 1032 captured by the imaging device 1020 are stored in a predetermined storage device that temporarily stores information.
  • the predetermined storage device may be provided in the imaging device 1020, or may be provided in a cloud server or the like. That is, the learning system 1001 may be configured as an edge device or may include an edge device and a cloud server. Furthermore, the learning of the learning model 1040 may also utilize a GPU or the like provided on the server.
  • FIG. 12 is a diagram showing an example of the functional configuration of the learning device according to the fifth embodiment.
  • the functional configuration of the learning device 1010 will be explained with reference to the same figure.
  • the learning device 1010 is used to implement the learning system 1001 described above.
  • the learning device 1010 generates a high-quality video 1033 and a low-quality video 1034 based on the high-quality image 1031 and low-quality image 1032 captured by the imaging device 1020.
  • the learning device 1010 causes the learning model 1040 to learn using the generated high-quality video 1033 and low-quality video 1034 as teacher data TD.
  • the learning device 1010 includes an image acquisition section 1011, a video information generation section 1012, and a learning section 1013.
  • the learning device 1010 includes a CPU (Central Processing Unit), a storage device such as a ROM (Read only memory) or a RAM (Random access memory), etc., which are connected via a bus (not shown).
  • the learning device 1010 functions as a device including an image acquisition section 1011, a video information generation section 1012, and a learning section 1013 by executing a learning program.
  • the learning device 1010 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field-Programmable Gate Array).
  • the learning program may be recorded on a computer-readable recording medium.
  • the computer-readable recording medium is, for example, a portable medium such as a flexible disk, magneto-optical disk, ROM, or CD-ROM, or a storage device such as a hard disk built into a computer system.
  • the learning program may be transmitted via a telecommunications line.
  • the image acquisition unit 1011 acquires image information I from the imaging device 1020.
  • Image information I includes first image information I1 and second image information I2.
  • the first image information I1 includes at least one high-quality image 1031.
  • the second image information I2 includes at least one low-quality image 1032.
  • the same subject as the subject captured in the high-quality image 1031 included in the first image information I1 is captured in the low-quality image 1032 included in the second image information I2.
  • the image included in the second image information I2 has lower image quality than the image included in the first image information I1.
  • the image acquisition unit 1011 outputs the acquired image information I to the video information generation unit 1012.
• the video information generation unit 1012 generates video information M by cutting out a plurality of parts of the images included in the image information I and connecting the cut-out images as frame images at a predetermined time interval (which can also be called a frame rate).
  • the frame rate may be, for example, 60 [FPS (frames per second)].
  • the position of the image cut out by the video information generation unit 1012 may be different for each frame.
  • the size of the cut out images may be fixed, and the video information generation unit 1012 may cut out a plurality of images at positions moved by a predetermined number of pixels (bit number) in a predetermined direction.
  • the size of the image to be cut out may be fixed to 256 pixels x 256 pixels.
• the video information generation unit 1012 may cut out an image at a position shifted by, for example, 10 pixels for each frame. If the amount of shift is too large, the amount of change in the image for each frame becomes too large, resulting in an unnatural moving image, so it is preferable to set an upper limit so that the shift does not exceed a predetermined amount. It is preferable to determine the amount of shift and the limit based on the shooting angle of view, shooting resolution, focal length of the optical system, distance to the subject, shooting frame rate, and the like. Furthermore, since the speed of a falling subject increases due to acceleration, the amount of shift may be increased for frames temporally farther from the target image.
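• A minimal sketch of this crop-and-shift generation is shown below; the crop size, per-frame shift, and upper limit are the illustrative values mentioned above, and the diagonal shift direction is an assumption made for this example.

```python
import numpy as np

def generate_frames_from_still(image, n_frames=5, crop=256, shift=10, max_shift=40):
    """Sketch of the video information generation: cut 256x256 crops out of one
    still image at positions shifted by `shift` pixels per frame (capped at
    `max_shift`) and treat the sequence of crops as frame images."""
    h, w = image.shape[:2]
    frames = []
    for i in range(n_frames):
        off = min(i * shift, max_shift)            # limit so the apparent motion stays natural
        y, x = off, off                            # shift diagonally; any direction is possible
        if y + crop > h or x + crop > w:
            break                                  # stop when the crop would leave the image
        frames.append(image[y:y + crop, x:x + crop].copy())
    return frames                                  # connected at a fixed frame rate (e.g. 60 FPS)
```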
  • the video information generation unit 1012 generates first video information M1 from the images included in the first image information I1, and generates second video information M2 from the images included in the second image information I2. That is, the video information generation unit 1012 cuts out a plurality of images at different positions that are part of the first image information I1, and generates the first video information M1 by combining the plurality of cut out images. Further, the video information generation unit 1012 cuts out a plurality of images at different positions that are part of the second image information I2, and generates the second video information M2 by combining the plurality of cut out images. Generating a moving image by combining a plurality of images may mean converting the plurality of images into a file format that displays them at predetermined time intervals depending on the frame rate. The video information generation unit 1012 outputs information including the generated first video information M1 and second video information M2 as video information M to the learning unit 1013.
  • the sizes of the plurality of images cut out by the video information generation unit 1012 and the cutout positions may be arbitrarily determined. However, it is preferable that the position to be cut out from the image included in the first image information I1 and the position to be cut out from the image included in the second image information I2 are approximately the same position. This is because the first video information M1, which is a high-quality video, and the second video information M2, which is a low-quality video, should be of the same subject.
  • the learning unit 1013 acquires video information M from the video information generation unit 1012.
  • the learning unit 1013 causes the learning model 1040 to learn by inputting the acquired video information M to the learning model 1040 as teacher data TD.
  • the learning model 1040 is trained to infer high quality videos from low quality videos. That is, the learning unit 1013 causes learning to infer a high-quality video from a low-quality video based on the teacher data TD that includes the first video information M1 and the second video information M2 generated by the video information generation unit 1012. .
  • the learning model 1040 can also be said to be trained to reason to remove noise from the input video.
  • a method of generating a high-quality video from a high-quality image (described with reference to FIG. 13) and a method of generating a low-quality video from a low-quality image (described with reference to FIG. 14) will be explained.
• note that a high-quality video may be generated from a high-quality image, and a low-quality video may be generated from a low-quality image, using methods similar to each other. That is, a low-quality video may be generated by the method described with reference to FIG. 13, and a high-quality video may be generated by the method described with reference to FIG. 14.
  • FIG. 13 is a diagram for explaining an example of the position of an image cut out from a high-quality image by the learning device according to the fifth embodiment.
  • An example of the position of an image cut out from a high-quality image by the learning device 1010 will be described with reference to the same figure.
  • FIG. 13A shows an image I-11 that is an example of an image included in the first image information I1.
  • FIG. 13(B) shows an example of a case where a plurality of images are cut out from image I-11 shown in FIG. 13(A) as image I-12.
  • the ball B which is the subject, is captured in the image I-11.
  • the video information generation unit 1012 generates a video from the still image I-11 by cutting out a plurality of images from the image I-11 and temporally connecting the cut-out images.
  • the image I-12 shown in FIG. 13(B) shows a plurality of cut-out images CI, which are images cut out by the video information generation unit 1012. Specifically, cut-out images CI-11 to cut-out images CI-15 are shown as examples of images cut out by the video information generation unit 1012. When cutout images CI-11 to cutout images CI-15 are not distinguished, they may be simply written as cutout images CI.
  • the cutout images CI-11 to CI-15 are each shifted by a predetermined number of pixels in the vertical and horizontal directions.
• in the video generated by temporally connecting the cutout images, the cutout image CI-11 is displayed at time t1, the cutout image CI-12 at time t2, the cutout image CI-13 at time t3, the cutout image CI-14 at time t4, and the cutout image CI-15 at time t5.
• it is suitable to determine the shift direction and shift amount of the image cut out by the video information generation unit 1012 based on shooting conditions such as the shooting angle of view, shooting resolution, focal length of the optical system, distance to the subject, and shooting frame rate. Furthermore, when simulating a falling object, since the speed increases due to acceleration, it is preferable to gradually increase the shift amount.
  • the high-quality video (first video information M1) generated by the learning device 1010 is a high-quality video without superimposed noise. Therefore, it is ideal that noise is not superimposed on an image that is a still image for generating a moving image. Further, ideally, each frame of a high-quality video generated from an image without superimposed noise should also be free from superimposed noise. Therefore, it is preferable that the video information generation unit 1012 generates a video from a single image on which no noise is superimposed. That is, it is preferable that the video information generation unit 1012 generates the first video information M1 by cutting out different parts from one high-quality image included in the first image information I1.
  • FIG. 14 is a diagram for explaining an example of the position of an image cut out from a low-quality image by the learning device according to the fifth embodiment.
  • An example of the position of an image cut out from a low-quality image by the learning device 1010 will be described with reference to the same figure.
  • the learning device 1010 cuts out images of different frames from a plurality of low-quality images. Images I-21 to I-25, which are different images, are shown in FIGS. 14(A) to 14(E), respectively.
  • the learning device 1010 cuts out images of different frames from images I-21 to I-25.
• the compositions of images I-21 to I-25, which are low-quality images, are similar to that of image I-11 shown in FIG. 13(A). That is, the ball B is imaged at the same position in images I-21 to I-25. Images I-21 to I-25 differ from image I-11 in that different noises are superimposed on them. Images I-21 to I-25 may have different noises superimposed on them, for example, by using different imaging conditions during capture.
• the video information generation unit 1012 cuts out a cutout image CI-21 from the image I-21, a cutout image CI-22 from the image I-22, a cutout image CI-23 from the image I-23, a cutout image CI-24 from the image I-24, and a cutout image CI-25 from the image I-25.
  • the cutout images CI-21 to CI-25 are each shifted by a predetermined number of pixels in the vertical and horizontal directions.
• in the generated video, the cutout image CI-21 is displayed at time t1, the cutout image CI-22 at time t2, the cutout image CI-23 at time t3, the cutout image CI-24 at time t4, and the cutout image CI-25 at time t5. Since different noises are superimposed on each of the cutout images CI-21 to CI-25, different noise is superimposed on the generated moving image at each point in time.
• the low-quality video (second video information M2) generated by the learning device 1010 is a low-quality video on which noise is superimposed. If a video were created by cutting out multiple different positions from a single image with superimposed noise, the same noise would be included at every moment (in other words, the noise would not change over time), which may not be appropriate as a low-quality video. Therefore, in this embodiment, a low-quality video is generated by cutting out from a plurality of different low-quality images. The same subject as the subject captured in the high-quality image is captured in each of the different low-quality images.
• the second image information I2 includes a plurality of images in which the same subject as the subject captured in the image included in the first image information I1 is captured, with different noise superimposed on each image.
  • the plurality of images included in the second image information I2 may be images captured at different times close to each other.
• the video information generation unit 1012 generates the second video information M2 by cutting out different parts from each of the plurality of images included in the second image information I2. Note that it is not necessary to prepare as many low-quality images as there are frames; images may be cut out multiple times from the plurality of images as long as the same image is not used for consecutive frames. The order in which the plurality of images are cut from may be random.
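• As a minimal sketch (assuming the cut positions have already been decided and the low-quality images are NumPy arrays), the following picks a low-quality image per frame while avoiding consecutive reuse of the same image, and crops it at the given position; all parameters are illustrative.

```python
import numpy as np

def generate_noisy_video(noisy_images, positions, crop=256, rng=None):
    """Sketch of second video information M2 generation: the i-th frame is cut from
    a randomly chosen low-quality image (not the same image as the previous frame
    when more than one is available) at the position given for that frame."""
    rng = rng or np.random.default_rng()
    frames = []
    prev_idx = -1
    for (y, x) in positions:
        idx = int(rng.integers(len(noisy_images)))
        if idx == prev_idx and len(noisy_images) > 1:     # avoid reusing the same image twice in a row
            idx = (idx + 1) % len(noisy_images)
        prev_idx = idx
        frames.append(noisy_images[idx][y:y + crop, x:x + crop].copy())
    return frames
```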
  • FIG. 15 is a diagram for explaining an example of the direction in which the learning device according to the fifth embodiment cuts out.
  • an example was described in which a position moved by a predetermined number of pixels in both the vertical and horizontal directions is cut out.
  • the video information generation unit 1012 may cut out positions moved in other directions.
  • Another example of the direction in which the video information generation unit 1012 cuts out the cutout image CI will be described with reference to FIGS. 15(A) to 15(C).
  • FIG. 15(A) shows image I-31.
  • FIG. 15(A) is an example of a case where a position moved only in the lateral direction (horizontal direction) is extracted.
  • the video information generation unit 1012 fixes the y-coordinate in the vertical direction and changes only the x-coordinate in the horizontal direction, thereby cutting out the cut-out images CI at a plurality of different positions. By cutting out the image in this way, it is possible to generate a moving image in which the subject moves laterally (horizontally). Similarly, the video information generation unit 1012 may cut out the cutout image CI at a position moved only in the vertical direction (vertical direction).
  • the video information generation unit 1012 may cut out the cutout image CI at a position moved in both the vertical and horizontal directions. In this case, the amount of movement in the vertical direction and the amount of movement in the lateral direction may be different from each other.
  • FIG. 15(B) shows image I-32.
  • FIG. 15(B) is an example of a case where a position moved in the rotational direction is extracted.
• the video information generation unit 1012 cuts out the cutout images CI at a plurality of different positions by moving the cutout position along an arc with a rotation center O and a radius r.
  • the video information generation unit 1012 cuts out a position rotated counterclockwise. By cutting out in this way, it is possible to generate a moving image in which the subject moves in the rotational direction.
  • the position of the center of rotation O and the size of the radius r may differ from frame to frame.
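• A possible sketch of computing such rotational cut-out positions is shown below; the start angle and the angular step per frame are hypothetical parameters chosen for illustration.

```python
import math

def arc_positions(center_y, center_x, radius, n_frames, start_deg=0.0, step_deg=5.0):
    """Sketch of the rotational cut-out: the corner of each crop moves
    counter-clockwise along an arc of radius r around the rotation centre O."""
    positions = []
    for i in range(n_frames):
        theta = math.radians(start_deg + i * step_deg)   # counter-clockwise per frame
        y = int(round(center_y - radius * math.sin(theta)))
        x = int(round(center_x + radius * math.cos(theta)))
        positions.append((y, x))
    return positions
```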
  • FIG. 15(C) shows image I-33.
  • FIG. 15C is an example of enlarging and reducing the cutting position.
  • the size of the cutout image CI is constant. Therefore, the video information generation unit 1012 enlarges or reduces the image I and cuts it out while maintaining the size of the cut-out image CI.
• for example, when the size of the cutout image CI is fixed at 256 pixels x 256 pixels, the video information generation unit 1012 enlarges or reduces the image I so that the cutout fits within the size of the cutout image CI. By cutting out the image in this way, it is possible to generate a moving image in which the subject appears to be zoomed in or zoomed out.
  • the cutout positions described with reference to FIGS. 15(A) to 15(C) are examples of this embodiment, and the video information generation unit 1012 can generate a video by cutting out and connecting other different positions. Information may be generated.
  • the video information generation unit 1012 may cut out the cutout image CI, for example, by combining the cutout methods described with reference to FIGS. 15(A) to 15(C). In this case, it is possible to generate a moving image in which, for example, the moving image is horizontally or vertically moved and then rotated, or moved and then enlarged or reduced.
  • the movement of the cutout position as described above may be calculated by affine transformation. That is, the predetermined direction in which the video information generation unit 1012 cuts out an image can also be described as being calculated by affine transformation.
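• One way to express these motions uniformly is to update the cut position with an affine transformation, as in the following sketch; the matrix and translation vector are illustrative parameters, not values from the specification.

```python
import numpy as np

def affine_positions(start_yx, n_frames, matrix, translation):
    """Sketch of computing cut-out positions by an affine transformation:
    p_{i+1} = A @ p_i + t. With A = identity this is pure translation, with a
    rotation matrix it is rotation, and with a scaled identity it is a zoom-like motion."""
    a = np.asarray(matrix, dtype=np.float64)       # 2x2 matrix A
    t = np.asarray(translation, dtype=np.float64)  # 2-vector t
    p = np.asarray(start_yx, dtype=np.float64)
    positions = [tuple(np.round(p).astype(int))]
    for _ in range(n_frames - 1):
        p = a @ p + t
        positions.append(tuple(np.round(p).astype(int)))
    return positions
```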
  • the video information generation unit 1012 may generate a video by cutting out a part of the image and then moving it.
• for example, the video information generation unit 1012 cuts out an image of 256 pixels x 256 pixels and generates a plurality of images by moving the cut-out image in a predetermined direction.
  • the video information generation unit 1012 generates a video by connecting the cut out images. That is, the video information generation unit 1012 may cut out a plurality of images at different positions by shifting the plurality of cut out images in a predetermined direction. Note that by moving the image after cutting it out, an area where no data exists will occur around the image. However, by predefining the peripheral portion of the image as the margin, it is possible to exclude it from the range of the image to be learned, and to prevent problems from occurring in the later learning stage.
  • the video information generation unit 1012 generates a video by cutting out an image that has been moved in a direction calculated by some method such as affine transformation.
  • the learning device 1010 can generate a moving image by cutting out an image in which the object is moved in a direction based on the trajectory of the actual movement, and can generate training data that is more effective for machine learning.
  • An example of such a case will be described as a modification of the fifth embodiment with reference to FIGS. 16 and 17.
  • the video information generation unit 1012 may generate the video after performing correction to add pseudo subject blur to the still image for which the video is to be created.
  • subject blurring may be added by performing a predetermined averaging process in the shift direction or by performing a process to lower the resolution.
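• A simple sketch of adding such pseudo subject blur by averaging along the shift direction is shown below; the number of averaging taps and the per-frame shift are assumptions, and np.roll is used only as a simplification (it wraps around at the image border).

```python
import numpy as np

def add_motion_blur(image, shift_per_frame=(0, 10), taps=5):
    """Sketch of pseudo subject blur: average the image with copies of itself shifted
    along the per-frame motion direction, approximating the blur a moving subject
    would show during the exposure."""
    dy, dx = shift_per_frame
    acc = np.zeros_like(image, dtype=np.float64)
    for k in range(taps):
        frac = k / max(taps - 1, 1)
        sy, sx = int(round(dy * frac)), int(round(dx * frac))
        acc += np.roll(np.roll(image.astype(np.float64), sy, axis=0), sx, axis=1)
    return (acc / taps).astype(image.dtype)
```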
  • FIG. 16 is a diagram illustrating an example of the functional configuration of a learning device according to a modification of the fifth embodiment when the learning device generates a moving image based on a trajectory vector.
  • An example of the functional configuration of a learning device 1010A according to a modification of the fifth embodiment will be described with reference to the same figure.
  • a learning system 1001A according to a modification of the fifth embodiment differs from the learning system 1001 in that it further includes a trajectory vector generation device 1050.
  • the learning device 1010A differs from the learning device 1010 in that it further includes a trajectory vector acquisition unit 1014.
  • the learning device 1010A differs from the learning device 1010 in that the learning device 1010A includes a video information generating section 1012A instead of the video information generating section 1012.
  • the same components as the learning device 1010 may be given the same reference numerals and the description thereof may be omitted.
  • the trajectory vector generation device 1050 acquires information regarding the trajectory of the object captured in the video. Video information is input to the trajectory vector generation device 1050, and the trajectory vector generation device 1050 analyzes the trajectory of the object imaged based on the input video information. Trajectory vector generation device 1050 outputs the analyzed result as trajectory vector TV.
  • the trajectory vector TV indicates the trajectory of the object captured in the video information.
  • Trajectory vector generation device 1050 acquires trajectory vector TV from video information using, for example, conventional technology such as optical flow. Note that the trajectory vector TV may include coordinate information indicating the trajectory of the movement of the object in addition to or in place of the vector information.
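• For example, the trajectory vector TV could be estimated with dense optical flow, as in the following sketch using OpenCV's Farneback method; reducing the flow field to a single mean vector per frame pair is a simplification made here for illustration.

```python
import cv2
import numpy as np

def trajectory_vector_from_video(gray_frames):
    """Sketch of the trajectory vector generation device 1050: estimate dense optical
    flow between consecutive grayscale frames and take the mean flow per frame pair
    as the trajectory vector TV. Parameter values are common defaults for Farneback."""
    tv = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        tv.append(flow.reshape(-1, 2).mean(axis=0))   # mean (dx, dy) for this frame pair
    return np.asarray(tv)
```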
  • the trajectory vector acquisition unit 1014 acquires the trajectory vector TV from the trajectory vector generation device 1050.
  • the trajectory vector acquisition unit 1014 outputs the acquired trajectory vector TV to the video information generation unit 1012A.
  • the moving image for which the trajectory vector TV has been acquired by the trajectory vector generation device 1050 and the image acquired by the image acquisition unit 1011 may have a predetermined relationship.
  • the image acquisition unit 1011 may acquire, as an image, one frame of a video whose trajectory vector TV has been acquired by the trajectory vector generation device 1050.
• however, the present embodiment is not limited to this example, and the video whose trajectory vector TV is acquired by the trajectory vector generation device 1050 and the image acquired by the image acquisition unit 1011 do not need to have a predetermined relationship.
  • the video information generation unit 1012A acquires image information I from the image acquisition unit 1011 and acquires the trajectory vector TV from the trajectory vector acquisition unit 1014.
  • the video information generation unit 1012A generates video information based on the acquired image information I and trajectory vector TV.
  • the video information generation unit 1012A determines the cutting direction of the cutout image CI and the amount of shift per frame based on the trajectory indicated by the trajectory vector TV. That is, the predetermined direction in which the video information generation unit 1012A cuts out the image is calculated based on the acquired trajectory vector TV.
  • FIG. 17 is a diagram for explaining an example of the position of an image cut out from a still image when a learning device according to a modification of the fifth embodiment generates a moving image based on a trajectory vector.
  • An example of the position coordinates of the cut-out image CI in the case of generating a moving image based on the trajectory vector TV will be described with reference to the same figure.
  • FIG. 17A shows an image I-41 that is an example of an image included in the first image information I1.
  • FIG. 17B shows an example of a plurality of cut out images CI cut out from the image I-41.
  • the image I-41 shows a trajectory vector TV that is the trajectory of the ball B, which is the subject.
  • the trajectory vector TV represents a vector in which the ball B falls from the upper right direction in the figure to the lower center direction, bounces at the lower center point, and then moves toward the upper left direction in the figure.
• the video information generation unit 1012A cuts out the cutout images CI at position coordinates based on the trajectory vector TV shown in the image I-41, and temporally connects the cutout images, thereby generating a video from the image I-41, which is a still image.
  • FIG. 17(B) shows an example of a cut-out image CI, which is an image cut out by the video information generation unit 1012.
  • cutout images CI-41 to cutout images CI-49 are shown.
  • the cutout images CI-41 to CI-49 are located at coordinates based on the trajectory vector TV. That is, the cutout image CI-41 is located in the upper right direction in the figure, and the cutout position moves toward the center and lower in the figure as it approaches the cutout image CI-45. Further, the cutout position moves toward the upper left in the figure from cutout image CI-45 to cutout image CI-49.
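• A minimal sketch of placing cut positions along the trajectory vector TV is shown below; accumulating the per-frame (dx, dy) components of TV onto a starting position and clamping to the image bounds are assumptions made for illustration.

```python
import numpy as np

def positions_from_trajectory(start_yx, trajectory_vector, crop=256, image_shape=None):
    """Sketch of how the video information generation unit 1012A could place the
    cut-out images along TV: each per-frame (dx, dy) of TV is added to the current
    cut position. Clamping to the image bounds is an added safeguard."""
    y, x = float(start_yx[0]), float(start_yx[1])
    positions = [(int(round(y)), int(round(x)))]
    for dx, dy in trajectory_vector:
        y += dy
        x += dx
        if image_shape is not None:
            h, w = image_shape[:2]
            y = min(max(y, 0), h - crop)
            x = min(max(x, 0), w - crop)
        positions.append((int(round(y)), int(round(x))))
    return positions
```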
  • FIG. 18 is a flowchart illustrating an example of a series of operations of the learning method of the noise reduction device according to the fifth embodiment. An example of a series of operations of the learning method of the noise reduction device using the learning device 1010 will be described with reference to the same figure.
  • Step S110 First, the image acquisition unit 1011 acquires an image.
  • the image acquisition unit 1011 acquires first image information I1 that includes a high-quality image and second image information I2 that includes a low-quality image.
  • the step of acquiring an image by the image acquisition unit 1011 may be referred to as an image acquisition step or an image acquisition process.
  • Step S130 the video information generation unit 1012 cuts out a part of the acquired image.
  • the video information generation unit 1012 cuts out a plurality of cut images CI from the acquired image.
  • the video information generation unit 1012 cuts out a plurality of cutout images CI from each of the high quality image included in the first image information I1 and the low quality image included in the second image information I2. Note that it is preferable that the position coordinates cut out from each of the high-quality image included in the first image information I1 and the low-quality image included in the second image information I2 are the same.
• when the high-quality image included in the first image information I1 and the low-quality image included in the second image information I2 are captured at different times, it is preferable to determine the position coordinates to be cut out from each of them by taking into account the deviation caused by the time difference. More specifically, it is preferable to shift the position coordinates to be cut out from the high-quality image included in the first image information I1 or the low-quality image included in the second image information I2 in the direction that reduces the amount of deviation caused by the time difference.
  • Step S150 the video information generation unit 1012 connects the cut out images to generate a video.
  • the video information generation unit 1012 generates a high-quality video by connecting multiple images cut out from high-quality images, and generates a low-quality video by connecting multiple images cut out from low-quality images.
• the step of generating video information in step S130 and step S150 may be referred to as a video information generation step or a video information generation process.
  • Step S170 the learning unit 1013 uses the combination of the generated high-quality video and low-quality video as teacher data TD and learns to infer a high-quality video from a low-quality video. This step may be referred to as a learning step or a learning process.
  • the learning device 1010 includes the image acquisition unit 1011 to acquire the first image information I1 and the second image information I2.
• the first image information I1 includes at least one image, and the second image information I2 includes at least one image that captures the same subject as the subject captured in the image included in the first image information I1 and that has lower image quality than the images included in the first image information I1.
• the learning device 1010 includes the video information generation unit 1012, which cuts out a plurality of images at different positions that are part of the first image information I1, and generates the first video information M1 by combining the plurality of cut-out images.
• the learning device 1010 includes the video information generation unit 1012, which cuts out a plurality of images at different positions that are part of the second image information I2 and combines the plurality of cut-out images to generate the second video information M2. Further, the learning device 1010 includes the learning unit 1013, which trains the learning model to infer a high-quality video from a low-quality video based on the teacher data TD that includes the first video information M1 and the second video information M2 generated by the video information generation unit 1012. That is, according to the present embodiment, the learning device 1010 does not need to acquire training data consisting of low-quality videos and high-quality videos by shooting videos, as was conventionally required, and can generate the training data from still images. Therefore, according to this embodiment, training data for inferring a high-quality video from a low-quality video can be generated easily.
  • the learning device 1010 can generate a plurality of different moving images from the same still image. Therefore, according to this embodiment, since a huge amount of teacher data TD is generated, it is not necessary to prepare a huge amount of still images, and many moving images can be generated from a small number of still images. Therefore, according to this embodiment, the time required to capture images for use in learning can be shortened.
• the second image information I2 includes a plurality of images in which the same subject as the subject captured in the image included in the first image information I1 is captured, each with different noise superimposed.
• the video information generation unit 1012 generates the second video information M2 by cutting out different parts from each of the plurality of images included in the second image information I2. That is, according to the present embodiment, a low-quality moving image with superimposed noise is generated based on a plurality of different low-quality images with superimposed noise. Therefore, the second video information M2 generated according to the present embodiment has different noise superimposed on each frame, and reproduces a noisy low-quality video more faithfully.
  • the plurality of images included in the second image information I2 are images taken at different times that are close to each other. That is, low-quality images for generating a low-quality video are captured at close times.
  • the close time may be, for example, 1/60th of a second.
• in a moving image, noise peculiar to moving images, which has a temporal component, may be superimposed. Images captured at different times that are close to each other contain this noise peculiar to moving images. Therefore, according to the present embodiment, since the learning device 1010 generates a moving image based on images captured at different times that are close to each other, it can reproduce the noise peculiar to moving images having a temporal component.
  • the video information generation unit 1012 generates the first video information M1 by cutting out a different part from one image included in the first image information I1. That is, according to this embodiment, a high-quality video is generated based on one image. Therefore, according to this embodiment, it is possible to easily generate a high-quality moving image without having to capture many high-quality images.
  • the video information generation unit 1012 cuts out a plurality of images at different positions by shifting the plurality of cut out images by different amounts in a predetermined direction. That is, according to this embodiment, the learning device 1010 cuts out the image and then shifts it in a predetermined direction. In other words, after cutting out an image, the learning device 1010 performs processing based on the small image that has been cut out, without requiring processing based on the large image. Therefore, according to this embodiment, the learning device 1010 can lighten the processing.
  • the video information generation unit 1012 cuts out a plurality of images at positions shifted by a predetermined number of bits in a predetermined direction.
  • the video information generation unit 1012 generates a video by connecting the cut out images. That is, the subject imaged in the video generated by the video information generation unit 1012 appears to move in a predetermined direction in the video. Therefore, according to this embodiment, a moving image can be easily generated from a still image.
  • the predetermined direction in which the video information generation unit 1012 cuts out an image is calculated by affine transformation.
  • the predetermined direction in which the video information generation unit 1012 cuts out the image is the direction in which the subject moves in the video. Therefore, according to this embodiment, the learning device 1010 can generate a video in which the subject moves in various directions.
  • the learning device 1010 further includes the trajectory vector acquisition unit 1014 to acquire the trajectory vector TV. Further, the predetermined direction in which the video information generation unit 1012 cuts out the image is calculated based on the acquired trajectory vector TV.
  • the trajectory vector TV is information regarding a vector indicating the trajectory along which a subject actually moves in an actually captured moving image. Therefore, according to this embodiment, a video can be generated based on the trajectory of the subject's actual movement (a sketch of deriving crop offsets from such a vector follows).
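Where a trajectory vector TV is available, the per-frame crop offsets can be derived from it instead of a fixed shift. The following is a hedged sketch of one possible way to do this; spacing the crop positions evenly along the vector is an assumption, not something stated in the present disclosure.

```python
import numpy as np

def offsets_from_trajectory(trajectory_vector, n_frames):
    """trajectory_vector: (dy, dx) total displacement observed in a real clip.
    Returns one integer (dy, dx) crop offset per generated frame."""
    tv = np.asarray(trajectory_vector, dtype=float)
    steps = np.linspace(0.0, 1.0, n_frames)        # evenly spaced along the trajectory
    return [tuple(int(v) for v in np.round(tv * s)) for s in steps]

print(offsets_from_trajectory((4, 10), 5))  # [(0, 0), (1, 2), (2, 5), (3, 8), (4, 10)]
```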
  • FIG. 19 is a diagram for explaining an overview of the learning system according to the sixth embodiment.
  • An overview of a learning system 1001B according to the sixth embodiment will be described with reference to the same figure.
  • the same components as those in the fifth embodiment may be given the same reference numerals and the description thereof may be omitted.
  • the imaging device 1020 captures a high-quality image 1031.
  • the low-quality image 1032 is generated based on the high-quality image 1031 by the learning device 1010B according to the sixth embodiment.
  • the low-quality image 1032 is generated, for example, by subjecting the high-quality image 1031 to image processing and superimposing noise. That is, according to the present embodiment, the imaging device 1020 captures only the high-quality image 1031 and does not need to capture the low-quality image 1032.
  • FIG. 20 is a diagram illustrating an example of the functional configuration of the video information generation section according to the sixth embodiment.
  • the video information generation unit 1012B included in the learning device 1010B will be described with reference to the same figure.
  • a learning device 1010B according to the sixth embodiment differs from the learning device 1010 in that it includes a video information generating section 1012B instead of the video information generating section 1012.
  • the video information generation section 1012B includes a cutting section 1121, a noise superimposition section 1123, a first video information generation section 1125, and a second video information generation section 1127.
  • the cutout unit 1121 acquires an image from the image acquisition unit 1011.
  • the learning device 1010B acquires a high-quality image from the imaging device 1020.
  • accordingly, the cutout unit 1121 acquires a high-quality image from the image acquisition unit 1011.
  • the cutout unit 1121 cuts out a plurality of cutout images CI that are part of the acquired high-quality image and have different positional coordinates.
  • the cutout unit 1121 outputs the cutout image CI to the first moving image information generation unit 1125 and the noise superimposition unit 1123.
  • the noise superimposition unit 1123 acquires the cutout image CI cut out by the cutout unit 1121.
  • the noise superimposition unit 1123 superimposes noise on the acquired cutout image CI.
  • the noise superimposition unit 1123 obtains a plurality of cutout images CI obtained by cutting out a plurality of position coordinates, and superimposes noise on each of the plurality of obtained cutout images CI.
  • the noise superimposed by the noise superimposing unit 1123 may be modeled in advance.
  • the modeled noises include shot noise due to fluctuations in the number of photons, noise that occurs when the light incident on the image sensor is converted into electrons, noise that occurs when the converted electrons are converted into analog voltage values, and noise that occurs when the analog voltage values are converted into a digital signal (an illustrative toy model of these noise sources is sketched below).
  • the intensity of the superimposed noise may be adjusted by a predetermined method. It is preferable that the noise superimposition unit 1123 superimposes different noises on each of the plurality of cut-out images CI.
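A toy version of such a noise model is sketched below. The gain, read-noise level, and bit depth are illustrative assumptions; the present disclosure does not fix these parameters.

```python
import numpy as np

def superimpose_noise(clean, gain=0.01, read_sigma=2.0, adc_bits=12,
                      rng=np.random.default_rng()):
    """clean: linear-intensity image in digital numbers (e.g. 0..4095 for 12 bits)."""
    photons = rng.poisson(np.clip(clean, 0, None) / gain)       # shot noise (photon counting)
    signal = photons * gain
    signal = signal + rng.normal(0.0, read_sigma, clean.shape)  # electron/analog-chain noise
    max_dn = 2 ** adc_bits - 1
    return np.clip(np.round(signal), 0, max_dn)                 # A/D conversion (quantization)
```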
  • the noise superimposition unit 1123 outputs the image after superimposing noise to the second moving image information generation unit 1127 as a noise image NI.
  • the first video information generation unit 1125 acquires a plurality of cut out images CI from the cut out unit 1121.
  • the first video information generation unit 1125 generates first video information M1 by combining the plurality of cut out images.
  • the first video information generation unit 1125 outputs the generated first video information M1 to the learning unit 1013.
  • the second video information generation unit 1127 acquires a plurality of noise images NI from the noise superimposition unit 1123.
  • the second video information generation unit 1127 generates second video information M2 by combining a plurality of noise images NI on which noise is superimposed.
  • the second video information generation unit 1127 outputs the generated second video information M2 to the learning unit 1013.
  • the learning unit 1013 acquires the first video information M1 from the first video information generation unit 1125 and the second video information M2 from the second video information generation unit 1127.
  • the learning unit 1013 trains the learning model 1040 based on the first video information M1 and the second video information M2 generated by the video information generation unit 1012B.
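Putting the pieces together, the sixth-embodiment flow can be summarised by the following sketch, which reuses the video_from_still and superimpose_noise helpers shown earlier (both are assumptions for illustration): one high-quality still yields a clean clip M1 and a per-frame independently noised clip M2, forming one teacher-data sample.

```python
import numpy as np

def make_training_pair(high_quality_image, n_frames=5, crop_hw=(256, 256),
                       shift_per_frame=(0, 2)):
    """high_quality_image: RAW plane in digital numbers; returns (M1, M2)."""
    clean_clip = video_from_still(high_quality_image, crop_hw, n_frames, shift_per_frame)
    # Superimpose *different* noise on every frame so the clip mimics a real noisy video.
    noisy_clip = np.stack([superimpose_noise(frame) for frame in clean_clip])
    return clean_clip, noisy_clip   # target and input for the noise reduction model
```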
  • the learning device 1010B includes the image acquisition unit 1011 to acquire image information I including at least one high-quality image. Further, the learning device 1010B includes a video information generation unit 1012B to generate both high-quality videos and low-quality videos from high-quality images.
  • the video information generation unit 1012B includes a cutting unit 1121 to cut out a plurality of images at different positions that are part of the acquired image information I. Furthermore, the video information generation unit 1012B includes a noise superimposition unit 1123 to superimpose noise on each of the plurality of images cut out by the cutout unit 1121.
  • the video information generation unit 1012B includes the first video information generation unit 1125 and the second video information generation unit 1127. The first video information generation unit 1125 generates the first video information M1, which is a high-quality video, by combining the plurality of images cut out by the cutout unit 1121, and the second video information generation unit 1127 generates the second video information M2 by combining the plurality of images on which noise has been superimposed by the noise superimposition unit 1123.
  • therefore, the learning device 1010B generates both a high-quality video and a low-quality video from one high-quality image, namely the first video information M1 generated by the first video information generation unit 1125 and the second video information M2 generated by the second video information generation unit 1127, and trains the learning model 1040 to infer a high-quality video from the low-quality video.
  • inferring a high-quality video from a low-quality video is noise removal. Therefore, according to the present embodiment, it is possible to easily learn the noise removal model without requiring time to acquire the teacher data TD.
  • in the present embodiment, a high-quality video is generated from a high-quality image, a low-quality image is generated by superimposing noise on the high-quality image, and a low-quality video is generated based on the generated low-quality image.
  • the learning device 1010 may create the teacher data TD based only on low-quality images. That is, a low-quality video may be generated from a low-quality image, a high-quality image may be generated by further removing noise from the low-quality video, and a high-quality video may be generated based on the generated high-quality image.
  • the number of images used to generate the moving image may be one or multiple.
  • the learning device 1010 and learning device 1010A described in the fifth embodiment and the learning device 1010B described in the sixth embodiment are examples used for learning a learning model 1040 that infers a high-quality video from a low-quality video.
  • the learning model 1040 may be configured to have, after inferring a high-quality video from the low-quality video, a function of detecting a specific subject such as a person in the high-quality video, or a function of recognizing characters on signboards and the like in the high-quality video. That is, the high-quality video inferred by the learning model 1040 is not limited to a video for viewing, and may also be used for purposes such as object detection.
  • the ideal training data is a video that includes as much of the expected movement of the subject as possible.
  • SYMBOLS: 1...High-quality video generation system, 2...Image processing device, 10...Input information generation device, 11...Image acquisition section, 12...Input conversion section, 13...Composition section, 14...Output section, 15...Integration section, 16...Average value temporary storage section, 17...Imaging condition acquisition section, 18...Adjustment section, 19...Comparison section, 100...Imaging device, 200...CNN, 210...Input layer, 220...Convolution layer, 230...Pooling layer, 240...Output layer, IM...Video information, IN...Input information, TF...Target frame, AF...Adjacent frame, IF...Integrated frame, M1...First memory, M2...Second memory, IMD...Image information, IND...Input data, CD...Synthetic data, SV...Stored value, CV...Calculated value, CR...Comparison result, 1001...Learning system, 1010...Learning device, 1011...Image acquisition section, 1012...Video information generation section, 1013...Learning section, 1014...Trajectory vector acquisition unit, 1020...Imaging device, 1031...High-quality image, 1032...Low-quality image, 1033...High-quality video, 1034...Low-quality video, 1040...Learning model, 1050...Trajectory vector generation device, TD...Teacher data, I...Image information, I1...First image information, I2...Second image information, M...Video information, M1...First video information, M2...Second video information, TV...Trajectory vector, 1121...Cutting section, 1123...Noise superimposition unit, 1125...First video information generation unit, 1127...Second video information generation unit

Abstract

This input information generating device comprises: an image acquiring unit that acquires, as an input image, a plurality of frames that include at least a target frame that serves as a target for the generation of input information among the frames that constitute a video; an input converting unit that converts pixel values of the input image of the plurality of acquired frames into a plurality of pieces of two-bit input data; a synthesizing unit that synthesizes the plurality of pieces of the converted input data into one piece of synthesis data; and an output unit that outputs the synthesized synthesis data.

Description

Input information generation device, image processing device, input information generation method, learning device, program, and learning method for noise reduction device
The present invention relates to an input information generation device, an image processing device, an input information generation method, a learning device, a program, and a learning method for a noise reduction device.
This application claims priority based on Japanese Patent Application No. 2022-137834, filed in Japan on August 31, 2022, and Japanese Patent Application No. 2022-137843, filed in Japan on August 31, 2022, and all contents described in those applications are incorporated herein by reference.
When capturing an image with an imaging device, if the amount of surrounding light is not sufficient or if the settings of the imaging device such as shutter speed, aperture, or ISO sensitivity are inappropriate, the image may be of low quality. There is a technology that converts a low-quality image that has already been captured into a high-quality image through image processing. For example, there is a technique for processing a low-quality image into a high-quality image using machine learning (see, for example, Patent Document 1). In such technical fields, the highest priority requirement is to improve the quality of images.
US Patent No. 10623756
Here, it is conceivable to convert a low-quality video into a high-quality video by applying the above-mentioned conventional technology to a video and increasing the image quality of each frame that makes up the video. When increasing the image quality of a moving image captured by an imaging device in real time, a problem may arise in which the frame rate of the moving image is sacrificed if time is required for image processing. That is, when converting a low-quality video into a high-quality video, it is not possible to give priority only to increasing the image quality of frame images, and it is required to maintain the frame rate through lightweight image processing. Furthermore, processing to improve the image quality of moving images is sometimes performed on edge devices, and in consideration of the processing capabilities of edge devices, there has been a strong demand for lightweight image processing.
Therefore, the present invention aims to provide a technology that can convert a low-quality video into a high-quality video using lightweight calculations.
(1) One aspect of the present invention is an input information generation device comprising: an image acquisition unit that acquires, as input images, a plurality of frames including at least a target frame for which input information is to be generated, among the frames constituting a moving image; an input conversion unit that converts the pixel values of the acquired plurality of frames of input images into input data having a smaller number of bits than the number of bits representing the pixel values of the input images; a synthesis unit that combines the plurality of pieces of converted input data into one piece of composite data; and an output unit that outputs the combined composite data.
(2) One aspect of the present invention is the input information generation device according to (1) above, wherein the moving image is a color moving image, the image acquisition unit acquires the pixel values of each color from one frame as a plurality of different images, and the input conversion unit converts each of the plurality of images acquired from one frame into the input data.
(3) One aspect of the present invention is the input information generation device according to (1) or (2) above, wherein the image acquisition unit acquires images of a plurality of frames that are consecutively adjacent to the target frame, both before and after it, among the frames constituting the moving image.
(4) One aspect of the present invention is the input information generation device according to any one of (1) to (3) above, wherein, after the output unit outputs the composite data for the target frame, the image acquisition unit sets a frame adjacent to the target frame as the new target frame and acquires, as the input images, a plurality of frames including at least that target frame.
(5) One aspect of the present invention is the input information generation device according to any one of (1) to (4) above, further comprising an integration unit that integrates a plurality of adjacent frames, which are the frames other than the target frame among the plurality of frames acquired by the image acquisition unit, into one integrated frame by performing an operation based on the pixel values of the adjacent frames, wherein the input conversion unit converts the pixel values of the target frame into the input data and further converts the pixel values of the integrated frame into the input data, and the synthesis unit combines the plurality of pieces of input data converted from the target frame and the plurality of pieces of input data converted from the integrated frame into one piece of the composite data.
(6) One aspect of the present invention is the input information generation device according to (5) above, wherein the integration unit sets the average value of the pixel values of the plurality of adjacent frames as the pixel value of the integrated frame.
(7) One aspect of the present invention is the input information generation device according to (6) above, wherein the integration unit calculates the pixel value of the integrated frame by calculating a weighted average of the plurality of adjacent frames, weighted according to their temporal distance from the target frame.
(8) One aspect of the present invention is the input information generation device according to (6) above, wherein the integration unit excludes frames with large brightness changes, among the frames constituting the moving image, from the frames used in calculating the average value.
(9) One aspect of the present invention is the input information generation device according to (5) above, further comprising an average value temporary storage unit that stores the average value of the pixel values of predetermined frames among the frames constituting the moving image, wherein the integration unit calculates the pixel value of the integrated frame by an operation based on the value stored in the average value temporary storage unit and the target frame.
(10) One aspect of the present invention is the input information generation device according to (9) above, further comprising an imaging condition acquisition unit that acquires imaging conditions of the moving image, and an adjustment unit that adjusts the average value stored in the average value temporary storage unit according to the acquired imaging conditions.
(11) One aspect of the present invention is the input information generation device according to (9) above, further comprising a comparison unit that compares the value stored in the average value temporary storage unit with the pixel values of the target frame, wherein, when the difference obtained by the comparison unit is less than or equal to a predetermined value, the integration unit calculates the pixel value of the integrated frame as a moving average based on the value stored in the average value temporary storage unit and the target frame, and when the difference is not less than or equal to the predetermined value, the integration unit sets the pixel value of the target frame as the pixel value of the integrated frame. (A sketch of this stored-average update appears after aspect (A11) below.)
(12) One aspect of the present invention is the input information generation device according to (5) above, wherein the integration unit uses a frame randomly selected from among the adjacent frames acquired by the image acquisition unit as the integrated frame.
(13) One aspect of the present invention is an image processing device comprising the input information generation device according to any one of (1) to (12) above, and a convolutional neural network that uses the composite data output by the input information generation device as input information.
(14) One aspect of the present invention is an input information generation method comprising: an image acquisition step of acquiring, as input images, a plurality of frames including at least a target frame for which input information is to be generated, among the frames constituting a moving image; an input conversion step of converting the pixel values of the acquired plurality of frames of input images into input data having a smaller number of bits than the number of bits representing the pixel values of the input images; a synthesis step of combining the plurality of pieces of converted input data into one piece of composite data; and an output step of outputting the combined composite data.
(A1) One aspect of the present invention is a learning device comprising: an image acquisition unit that acquires first image information including at least one image, and second image information including at least one image of lower image quality than the image included in the first image information, in which the same subject as the subject captured in the image included in the first image information is captured; a video information generation unit that cuts out a plurality of images at different positions that are part of the acquired first image information and combines the plurality of cut-out images to generate first video information, and cuts out a plurality of images at different positions that are part of the acquired second image information and combines the plurality of cut-out images to generate second video information; and a learning unit that performs training so that a high-quality video is inferred from a low-quality video, based on teacher data including the first video information and the second video information generated by the video information generation unit.
(A2) One aspect of the present invention is the learning device according to (A1) above, wherein the second image information includes a plurality of images in which the same subject as the subject captured in the image included in the first image information is captured, each with mutually different noise superimposed, and the video information generation unit generates the second video information by cutting out a different part from each of the plurality of images included in the second image information.
(A3) One aspect of the present invention is the learning device according to (A1) or (A2) above, wherein the plurality of images included in the second image information are images captured at different times that are close to each other.
(A4) One aspect of the present invention is the learning device according to any one of (A1) to (A3) above, wherein the video information generation unit generates the first video information by cutting out different parts from one image included in the first image information.
(A5) One aspect of the present invention is the learning device according to any one of (A1) to (A4) above, wherein the video information generation unit cuts out a plurality of images at different positions by shifting the plurality of cut-out images in a predetermined direction.
(A6) One aspect of the present invention is the learning device according to any one of (A1) to (A5) above, wherein the video information generation unit cuts out a plurality of images at positions shifted by a predetermined number of bits in a predetermined direction.
(A7) One aspect of the present invention is the learning device according to (A6) above, wherein the predetermined direction in which the video information generation unit cuts out the images is calculated by affine transformation.
(A8) One aspect of the present invention is the learning device according to (A6) above, further comprising a trajectory vector acquisition unit that acquires a trajectory vector, wherein the predetermined direction in which the video information generation unit cuts out the images is calculated based on the acquired trajectory vector.
(A9) One aspect of the present invention is a learning device comprising: an image acquisition unit that acquires image information including at least one image; a cutout unit that cuts out a plurality of images at different positions that are part of the acquired image information; a first video information generation unit that combines the plurality of cut-out images to generate first video information; a noise superimposition unit that superimposes noise on each of the plurality of images cut out by the cutout unit; a second video information generation unit that combines the plurality of images on which noise has been superimposed by the noise superimposition unit to generate second video information; and a learning unit that performs training so that a high-quality video is inferred from a low-quality video, based on teacher data including the first video information generated by the first video information generation unit and the second video information generated by the second video information generation unit.
(A10) One aspect of the present invention is a program that causes a computer to execute: an image acquisition step of acquiring first image information including at least one image, and second image information including at least one image of lower image quality than the image included in the first image information, in which the same subject as the subject captured in the image included in the first image information is captured; a video information generation step of cutting out a plurality of images at different positions that are part of the acquired first image information and combining the plurality of cut-out images to generate first video information, and cutting out a plurality of images at different positions that are part of the acquired second image information and combining the plurality of cut-out images to generate second video information; and a learning step of performing training so that a high-quality video is inferred from a low-quality video, based on teacher data including the first video information and the second video information generated by the video information generation step.
(A11) One aspect of the present invention is a learning method for a noise reduction device, comprising: an image acquisition step of acquiring first image information including at least one image, and second image information including at least one image of lower image quality than the image included in the first image information, in which the same subject as the subject captured in the image included in the first image information is captured; a video information generation step of cutting out a plurality of images at different positions that are part of the acquired first image information and combining the plurality of cut-out images to generate first video information, and cutting out a plurality of images at different positions that are part of the acquired second image information and combining the plurality of cut-out images to generate second video information; and a learning step of performing training so that a high-quality video is inferred from a low-quality video, based on teacher data including the first video information and the second video information generated by the video information generation step.
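The stored-average behaviour described in aspects (9) to (11) above can be made concrete with the following minimal sketch. It is one possible realization under assumed parameters; the blend weight alpha and the difference threshold are illustrative and are not taken from the claims.

```python
import numpy as np

class IntegratedFrameBuffer:
    """Keeps a running average of past frames (the 'average value temporary storage')."""

    def __init__(self, alpha=0.2, diff_threshold=8.0):
        self.alpha = alpha                    # weight of the newest frame in the moving average
        self.diff_threshold = diff_threshold  # switch-over point between blending and resetting
        self.stored_average = None

    def update(self, target_frame):
        if self.stored_average is None:
            self.stored_average = target_frame.astype(np.float32)
            return self.stored_average
        diff = float(np.mean(np.abs(target_frame - self.stored_average)))
        if diff <= self.diff_threshold:       # scene is stable: blend into the moving average
            self.stored_average = (1 - self.alpha) * self.stored_average + self.alpha * target_frame
        else:                                 # large change (e.g. a brightness jump): restart
            self.stored_average = target_frame.astype(np.float32)
        return self.stored_average            # used as the integrated frame
```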
 本発明によれば、軽量な演算で低品質動画を高品質動画に変換することができる。 According to the present invention, it is possible to convert a low-quality video to a high-quality video with lightweight calculations.
FIG. 1 is a block diagram illustrating an example of the functional configuration of a high-quality video generation system according to a first embodiment.
FIG. 2 is a diagram illustrating an example of a convolutional neural network according to the first embodiment.
FIG. 3 is a diagram illustrating frames constituting a moving image according to the first embodiment.
FIG. 4 is a diagram for explaining an overview of an input information generation method according to the first embodiment.
FIG. 5 is a block diagram illustrating an example of the functional configuration of input information generation according to the first embodiment.
FIG. 6 is a block diagram illustrating an example of the functional configuration of an input conversion section according to the first embodiment.
FIG. 7 is a diagram for explaining an overview of an input information generation method according to a second embodiment.
FIG. 8 is a block diagram illustrating an example of the functional configuration of input information generation according to the second embodiment.
FIG. 9 is a block diagram illustrating an example of the functional configuration of input information generation according to a third embodiment.
FIG. 10 is a block diagram illustrating an example of the functional configuration of input information generation according to a fourth embodiment.
FIG. 11 is a diagram for explaining an overview of a learning system according to a fifth embodiment.
FIG. 12 is a diagram illustrating an example of the functional configuration of a learning device according to the fifth embodiment.
FIG. 13 is a diagram for explaining an example of the positions of images cut out from a high-quality image by the learning device according to the fifth embodiment.
FIG. 14 is a diagram for explaining an example of the positions of images cut out from a low-quality image by the learning device according to the fifth embodiment.
FIG. 15 is a diagram for explaining an example of the direction in which the learning device according to the fifth embodiment cuts out images.
FIG. 16 is a diagram illustrating an example of the functional configuration of the learning device according to the fifth embodiment when the learning device generates a moving image based on a trajectory vector.
FIG. 17 is a diagram for explaining an example of the positions of images cut out from a still image when a learning device according to a modification of the fifth embodiment generates a moving image based on a trajectory vector.
FIG. 18 is a flowchart illustrating an example of a series of operations of a learning method of a noise reduction device according to the modification of the fifth embodiment.
FIG. 19 is a diagram for explaining an overview of a learning system according to a sixth embodiment.
FIG. 20 is a diagram illustrating an example of the functional configuration of a video information generation section according to the sixth embodiment.
 以下、本発明の態様に係る入力情報生成装置、画像処理装置及び入力情報生成方法について、好適な実施の形態を掲げ、添付の図面を参照しながら詳細に説明する。なお、本発明の態様は、これらの実施の形態に限定されるものではなく、多様な変更または改良を加えたものも含まれる。つまり、以下に記載した構成要素には、当業者が容易に想定できるもの、実質的に同一のものが含まれ、以下に記載した構成要素は適宜組み合わせることが可能である。また、本発明の要旨を逸脱しない範囲で構成要素の種々の省略、置換または変更を行うことができる。また、以下の図面においては、各構成をわかりやすくするために、各構造における縮尺および数等を、実際の構造における縮尺および数等と異ならせる場合がある。 Hereinafter, preferred embodiments of an input information generation device, an image processing device, and an input information generation method according to aspects of the present invention will be described in detail with reference to the accompanying drawings. Note that the aspects of the present invention are not limited to these embodiments, and also include those with various changes or improvements. That is, the components described below include those that can be easily assumed by those skilled in the art and are substantially the same, and the components described below can be combined as appropriate. Further, various omissions, substitutions, or changes of the constituent elements can be made without departing from the gist of the present invention. Further, in the following drawings, in order to make each structure easier to understand, the scale, number, etc. of each structure may be different from the scale, number, etc. of the actual structure.
 まず、本実施形態の前提となる事項について説明する。本実施形態に係る入力情報生成装置、画像処理装置及び入力情報生成方法は、ノイズが重畳した低品質な動画情報を入力として、ノイズを取り除いた高品質な動画情報を生成する。低品質動画には低画質動画が含まれ、高品質動画には高画質動画が含まれる。高品質動画とは、一例として、低ISO感度、長秒露光により撮像される画質の高い動画を例示することができる。低品質動画とは、一例として、高ISO感度、短秒露光により撮像される画質の低い動画を例示することができる。 First, the premises of this embodiment will be explained. The input information generation device, image processing device, and input information generation method according to the present embodiment receive low-quality video information with superimposed noise as input and generate high-quality video information from which noise has been removed. Low-quality videos include low-quality videos, and high-quality videos include high-quality videos. An example of a high-quality moving image is a moving image with high image quality captured by low ISO sensitivity and long exposure. An example of a low-quality moving image is a moving image with low image quality captured by high ISO sensitivity and short exposure.
In the following description, image quality deterioration due to noise will be described as an example of a low-quality video, but the present embodiment is widely applicable to factors other than noise that degrade the quality of a video. Examples of factors that degrade video quality include a decrease in resolution or color shift due to optical aberration, a decrease in resolution due to camera shake or subject blur, uneven black levels due to dark current or circuitry, ghosts and flare due to high-brightness subjects, and signal level abnormalities. In addition to random noise that occurs at each pixel, noise includes streak-like noise that occurs in the horizontal or vertical direction of an image, noise that occurs in a fixed pattern in an image, and the like. Noise specific to moving images, such as flicker-like noise that fluctuates between consecutive frames, may also be included. The input information generation device, image processing device, and input information generation method according to the present embodiment improve the image quality of each frame by processing each frame included in a video individually, thereby improving the quality of the video as a whole.
 なお、高品質化の対象となる入力動画は、撮像装置によって撮像された動画が用いられてもよいし、予め用意されていた動画が用いられてもよい。以下の説明において、低品質動画を低画質動画又はノイズ動画と記載する場合がある。また、以下の説明において、高品質動画を高画質動画と記載する場合がある。 Note that as the input video to be improved in quality, a video captured by an imaging device may be used, or a video prepared in advance may be used. In the following description, a low-quality video may be referred to as a low-quality video or a noise video. Furthermore, in the following description, a high-quality video may be referred to as a high-quality video.
The video targeted by the input information generation device, image processing device, and input information generation method according to the present embodiment may be a video captured by a CCD camera using a CCD (Charge Coupled Device) image sensor, or a video captured by a CMOS camera using a CMOS (Complementary Metal Oxide Semiconductor) image sensor. The targeted video may be a color video or a monochrome video. The targeted video may also be a video captured by acquiring non-visible light components, for example with an infrared camera using an infrared sensor.
[First embodiment]
First, a first embodiment will be described with reference to FIGS. 1 to 6.
FIG. 1 is a block diagram showing an example of the functional configuration of a high-quality video generation system according to the first embodiment. An example of the functional configuration of the high-quality video generation system 1 will be described with reference to the same figure. The high-quality video generation system 1 includes an imaging device 100, an input information generation device 10, and a convolutional neural network 200 (hereinafter referred to as "CNN 200") as its functions. The input information generation device 10 and CNN 200 perform image processing on each frame constituting the moving image captured by the imaging device 100. The input information generation device 10 and the CNN 200 include trained models that have been trained in advance. In the following description, a configuration including the input information generation device 10 and CNN 200 may be referred to as an image processing device 2. Note that the high-quality moving image generation system 1 may be configured to include an encoding unit that compresses and encodes the output of the image processing device 2, and a predetermined memory that holds the results compressed and encoded by the encoder.
 撮像装置100は、動画を撮像する。撮像装置100により撮像される動画は、高品質化の対象となる低品質動画である。撮像装置100は、例えば暗い(光量の少ない)箇所に設置された監視カメラ等であってもよい。撮像装置100は、例えば光量不足による低品質動画を撮像する。撮像装置100は、撮像した動画を、入力情報生成装置10に出力する。撮像装置100により撮像された動画は、画像処理装置2への入力となる。したがって、撮像装置100から入力情報生成装置10に出力される動画を、動画情報IMと記載する場合がある。 The imaging device 100 images a moving image. The moving image captured by the imaging device 100 is a low-quality moving image that is subject to quality improvement. The imaging device 100 may be, for example, a surveillance camera installed in a dark (low amount of light) location. The imaging device 100 images a low-quality moving image due to insufficient light, for example. The imaging device 100 outputs the captured moving image to the input information generation device 10. A moving image captured by the imaging device 100 becomes an input to the image processing device 2 . Therefore, the video output from the imaging device 100 to the input information generation device 10 may be referred to as video information IM.
Note that both the imaging device 100 and the image processing device 2 may exist within a housing of a smartphone, a tablet terminal, or the like. That is, the high-quality video generation system 1 may exist as an element constituting an edge device. Further, the imaging device 100 may be connected to the image processing device 2 via a predetermined communication network. That is, the high-quality video generation system 1 may exist by having components connected to each other via a predetermined communication network.
Further, the imaging device 100 may be configured to include a plurality of lenses and a plurality of image sensors respectively corresponding to the plurality of lenses. As a specific example of such a configuration, the imaging device 100 may include a plurality of lenses and image sensors so as to acquire images with different angles of view. According to the imaging device 100 configured in this way, the images acquired from the respective image sensors can be said to be spatially adjacent to each other. The high-quality video generation system 1 is applicable not only to a plurality of temporally adjacent images such as a video, but also to a plurality of spatially adjacent images.
 入力情報生成装置10は、撮像装置100から動画情報IMを取得する。入力情報生成装置10は、取得した動画情報IMに基づいて入力情報INを生成する。入力情報INは、動画情報IMを構成するフレームごとに生成される。入力情報INは、対象となるフレームと、当該フレームに基づいて決定される他のフレームに基づいて生成されてもよい。当該フレームに基づいて決定される他のフレームとは、対象となるフレームに時間的に隣接するフレームであってもよい。 The input information generation device 10 acquires video information IM from the imaging device 100. The input information generation device 10 generates input information IN based on the acquired video information IM. Input information IN is generated for each frame that constitutes moving image information IM. The input information IN may be generated based on a target frame and other frames determined based on the target frame. The other frame determined based on the frame may be a frame temporally adjacent to the target frame.
 CNN200は、入力情報生成装置10により出力されたデータを入力情報INとする畳み込みニューラルネットワークである。CNN200の一例について、図2を参照しながら説明する。 The CNN 200 is a convolutional neural network that uses the data output by the input information generation device 10 as input information IN. An example of the CNN 200 will be described with reference to FIG. 2.
 図2は、第1の実施形態に係るCNN200の一例を示す図である。同図を参照しながら、CNN200の詳細について詳細に説明する。CNN200は、多層構造を有するニューラルネットワークである。CNN200は、入力情報INが入力される入力層210と、畳み込み演算を行う畳み込み層220と、プーリングを行うプーリング層230と、出力層240とを含む多層構造のネットワークである。CNN200の少なくとも一部において、畳み込み層220とプーリング層230とは交互に連結される。CNN200は、画像認識や動画認識に広く使われるモデルである。CNN200は、全結合層などの他の機能を有する層(レイヤ)をさらに有してもよい。なお、プーリング層230には畳み込み層220の演算結果を低ビット化するための量子化演算を行う量子化層を含んでもよい。具体的には、量子化層は、畳み込み層220における畳み込み演算の結果が16ビットである場合に、量子化層において畳み込み演算の結果を8ビット以下にビット数を削減する演算を行う。 FIG. 2 is a diagram showing an example of the CNN 200 according to the first embodiment. Details of the CNN 200 will be explained in detail with reference to the figure. CNN 200 is a neural network with a multilayer structure. The CNN 200 is a multilayer network including an input layer 210 to which input information IN is input, a convolution layer 220 to perform convolution operations, a pooling layer 230 to perform pooling, and an output layer 240. In at least a portion of the CNN 200, the convolution layer 220 and the pooling layer 230 are alternately connected. CNN200 is a model widely used for image recognition and video recognition. The CNN 200 may further include layers having other functions, such as a fully connected layer. Note that the pooling layer 230 may include a quantization layer that performs a quantization operation to reduce the number of bits to the operation result of the convolution layer 220. Specifically, when the result of the convolution operation in the convolution layer 220 is 16 bits, the quantization layer performs an operation to reduce the number of bits of the result of the convolution operation in the quantization layer to 8 bits or less.
Note that the CNN 200 may adopt a configuration in which the outputs of the plurality of convolution layers 220 and pooling layers 230 included in the CNN 200 are used as intermediate outputs that serve as inputs to other layers. As another embodiment, the CNN 200 may form a U-net by using the outputs of the plurality of convolution layers 220 and pooling layers 230 included in the CNN 200 as intermediate outputs that serve as inputs to other layers. In this case, the CNN 200 includes an encoder section that extracts feature amounts by convolution operations, and a decoder section that performs deconvolution operations based on the extracted feature amounts.
 入力層210には、入力情報INが入力される。入力情報INは、入力画像に基づき生成される。当該入力画像は、動画を構成するフレーム画像である。本実施形態に係る入力情報生成装置10は、入力画像から入力情報INを生成するものである。本実施形態において、入力情報INの要素は、例えば2ビットの符号なし整数(0,1,2,3)であってもよい。また、入力データの要素は、例えば、4ビットや8ビットの整数でもよい。 Input information IN is input to the input layer 210. Input information IN is generated based on the input image. The input image is a frame image that constitutes a moving image. The input information generation device 10 according to this embodiment generates input information IN from an input image. In this embodiment, the elements of the input information IN may be, for example, 2-bit unsigned integers (0, 1, 2, 3). Furthermore, the elements of the input data may be, for example, 4-bit or 8-bit integers.
 畳み込み層220は、入力層210に入力された入力情報INに対して畳み込み演算を行う。畳み込み層220は、低ビットの入力情報INに対して畳み込み演算を行う。畳み込み層220は、所定の畳み込み演算を行った結果、プーリング層230に対して所定の出力データを出力する。 The convolution layer 220 performs a convolution operation on the input information IN input to the input layer 210. The convolution layer 220 performs a convolution operation on the low-bit input information IN. The convolution layer 220 outputs predetermined output data to the pooling layer 230 as a result of performing a predetermined convolution operation.
 プーリング層は、畳み込み層220により畳み込み演算が行われた結果に基づき、ある領域の代表値を抽出する。具体的には、プーリング層230は、畳み込み層220により出力された畳み込み演算の出力データに対して、平均プーリングやMAXプーリング等の演算を実施して、畳み込み層220の出力データを圧縮する。 The pooling layer extracts a representative value of a certain area based on the result of the convolution operation performed by the convolution layer 220. Specifically, the pooling layer 230 compresses the output data of the convolution layer 220 by performing an operation such as average pooling or MAX pooling on the output data of the convolution operation output by the convolution layer 220.
 出力層240は、CNN200の結果を出力する層である。出力層240は、例えば、恒等関数やソフトマックス関数等によりCNN200の結果を出力してもよい。出力層240の前段に備えられるレイヤは、畳み込み層220であってもよいし、プーリング層230であってもよいし、その他のレイヤであってもよい。 The output layer 240 is a layer that outputs the results of the CNN 200. The output layer 240 may output the results of the CNN 200 using, for example, an identity function or a softmax function. The layer provided before the output layer 240 may be the convolution layer 220, the pooling layer 230, or another layer.
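As a rough illustration only, a CNN of this kind with activations re-quantized to a low bit width could be sketched in PyTorch as follows. The actual network structure, channel counts, and quantization scheme are not specified here, so the values below are assumptions; the naive rounding shown would also need a straight-through estimator to be trainable and is included only to show where the quantization layer sits.

```python
import torch
import torch.nn as nn

class QuantizeActivation(nn.Module):
    """Clamps activations to [0, max_val] and rounds them onto 2**bits levels."""
    def __init__(self, bits=8, max_val=1.0):
        super().__init__()
        self.levels = 2 ** bits - 1
        self.max_val = max_val

    def forward(self, x):
        x = torch.clamp(x, 0.0, self.max_val)
        return torch.round(x / self.max_val * self.levels) / self.levels * self.max_val

class TinyDenoiser(nn.Module):
    """Convolution -> quantized-activation blocks, ending in a 4-plane RAW output."""
    def __init__(self, in_channels=180, out_channels=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            QuantizeActivation(bits=8),   # keep intermediate results low-bit
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            QuantizeActivation(bits=8),
            nn.Conv2d(64, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, x):                 # x: (N, 180, 256, 256) composite data
        return self.body(x)
```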
FIG. 3 is a diagram showing frames constituting a moving image according to the first embodiment. With reference to the figure, frames used by the input information generation device 10 to generate input information IN will be described. The figure shows a plurality of consecutive frames constituting a moving image. Frames F1 to F7 shown in the figure are examples of a plurality of consecutive frames constituting a moving image.
Note that each frame is a RAW image that has not been compressed and encoded, and each pixel is expressed with 12 or 14 bits. The number of pixels in each frame is the number of pixels necessary to satisfy a predetermined video format such as 1920x1080 or 4096x2160. In this embodiment, the processing target of the CNN 200 will be described as a RAW image, but the processing target is not limited to this. If the image to be processed contains sufficient signal components, the image that has been subjected to processing such as compression encoding may be used as the target.
 入力情報生成装置10は、対象となるフレームである対象フレームTFと、対象フレームTFに隣接するフレームである隣接フレームAFとに基づき、入力情報INを生成する。隣接フレームAFは、例えば対象フレームTFの前又は後に連続して隣接するフレームである。図示する一例では、対象フレームTFの前後2フレームずつを隣接フレームAFとしている。すなわち、対象フレームTFをフレームF4とした場合、フレームF2、フレームF3、フレームF5及びフレームF6が隣接フレームAFとなる。 The input information generation device 10 generates input information IN based on a target frame TF, which is a target frame, and an adjacent frame AF, which is a frame adjacent to the target frame TF. The adjacent frame AF is, for example, a frame that is consecutively adjacent before or after the target frame TF. In the illustrated example, two frames before and after the target frame TF are set as adjacent frames AF. That is, when the target frame TF is the frame F4, the frame F2, the frame F3, the frame F5, and the frame F6 are the adjacent frames AF.
 なお、隣接フレームAFの枚数はこの一例に限定されず、対象フレームTFの前後1フレームずつや3フレームずつ等であってもよい。また、隣接フレームAFは、対象フレームTFの前後に隣接する場合の一例に限定されず、例えば対象フレームTFの前又は後のいずれか一方に隣接するフレームのみであってもよい。また、隣接フレームAFは、対象フレームTFと連続している必要はなく、例えばフレームF4を対象フレームTFとした場合、フレームF4とは連続していないフレームF2及びフレームF6等であってもよい。 Note that the number of adjacent frames AF is not limited to this example, and may be one frame before and after the target frame TF, or three frames before and after the target frame TF. Further, the adjacent frames AF are not limited to the example of adjacent frames before and after the target frame TF, but may be only frames adjacent to either the front or the rear of the target frame TF, for example. Furthermore, the adjacent frame AF does not need to be continuous with the target frame TF; for example, when frame F4 is the target frame TF, the adjacent frame AF may be frames F2, F6, etc. that are not continuous with frame F4.
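For illustration, selecting the input frame set can be written as a small helper. The clamping at the clip boundaries is an assumption; the text above only states that the window need not be symmetric or contiguous.

```python
def frame_window(num_frames, target_index, radius=2):
    """Return the indices of the target frame and its neighbours, clamped to the clip."""
    indices = [target_index + d for d in range(-radius, radius + 1)]
    return [min(max(i, 0), num_frames - 1) for i in indices]

print(frame_window(7, 3))  # target F4 (index 3) -> [1, 2, 3, 4, 5], i.e. F2, F3, F4, F5, F6
print(frame_window(7, 0))  # target F1: edge frames are repeated -> [0, 0, 0, 1, 2]
```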
FIG. 4 is a diagram for explaining an overview of the input information generation method according to the first embodiment. A method for generating input information IN by the input information generation device 10 will be described with reference to the same figure. The figure shows a frame at time t-2, a frame at time t-1, a frame at time t, a frame at time t+1, and a frame at time t+2. The frame at time t corresponds to the above-described target frame TF, and the frames at time t-2, time t-1, time t+1, and time t+2 correspond to adjacent frames AF. Note that since each frame includes a large number of pixels, a circuit that processes the entire frame at the same time becomes large-scale. Therefore, when processing is performed in the CNN 200, it is preferable to divide each frame into predetermined sizes. In this embodiment, as an example, a case where each frame is divided into a plurality of patches, each with a size of 256×256, will be described.
 各フレームは、R(赤;Red)、G(緑;Green)×2チャネル、B(青;Blue)の4チャネルの画像データを含んで構成される。入力情報生成装置10は、各チャネルの量子化及びベクトル化を行う。入力情報生成装置10は、例えば各チャネルの画像データから、9チャネルのベクトル化されたデータを生成する。すなわち、入力情報生成装置10は、1つのフレームから4チャネル×9チャネル=36チャネルのデータを生成する。9チャネルのデータとは、互いに異なる閾値を用いて画素値が量子化されたものであってもよい。入力情報生成装置10は、対象フレームTF及び隣接フレームAF(図視する一例では合計5フレーム)から生成された5フレーム×36チャネル=180チャネルのデータを合成(concat)する。入力情報生成装置10は、合成した180チャネルのデータを入力情報INとして、CNN200の入力層に対して出力する。 Each frame is configured to include image data of 4 channels: R (Red), G (Green) × 2 channels, and B (Blue). The input information generation device 10 performs quantization and vectorization of each channel. The input information generation device 10 generates nine channels of vectorized data from, for example, image data of each channel. That is, the input information generation device 10 generates 4 channels×9 channels=36 channels of data from one frame. The nine-channel data may be data in which pixel values are quantized using different threshold values. The input information generation device 10 combines (concatenates) data of 5 frames x 36 channels = 180 channels generated from the target frame TF and the adjacent frames AF (5 frames in total in the illustrated example). The input information generation device 10 outputs the combined 180 channel data to the input layer of the CNN 200 as input information IN.
Note that in the illustrated example, a case has been described in which the input information IN is generated using the four channels of image data constituting one frame, but the aspect of the present embodiment is not limited to this example. The input information generation device 10 may, for example, generate the input information IN based on three channels of data containing RGB. Also, in the illustrated example, a case has been described in which nine channels of data are generated by performing quantization and vectorization based on the image data, but the aspect of the present embodiment is not limited to this example. The number of channels of data generated is preferably a number that allows efficient calculation after synthesis. The input information generation device 10 may, for example, generate N channels × M channels of data by generating M channels (M is a natural number of 1 or more) of data for each of the N channels (N is a natural number of 1 or more) of image data constituting one frame. The value of N × M is preferably close to a multiple of 32 (or 64).
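The channel expansion and concatenation described above can be sketched as follows in Python/NumPy. The sketch assumes the same M thresholds are shared by all channels and frames, whereas the embodiment allows a separate threshold per conversion unit; the function names are illustrative.

import numpy as np

def quantize_channel(channel, thresholds):
    # One (H, W) channel becomes len(thresholds) binary planes
    # (1-bit quantization against each threshold).
    return (channel[None, :, :] > thresholds[:, None, None]).astype(np.uint8)

def build_input_information(frames, thresholds):
    # frames: list of (C, H, W) arrays, e.g. 5 frames of 4 RGGB channels.
    # thresholds: NumPy array of shape (M,), e.g. M = 9.
    planes = []
    for frame in frames:
        for channel in frame:
            planes.append(quantize_channel(channel, thresholds))
    # Concatenate along the channel axis: e.g. 5 x 4 x 9 = 180 channels.
    return np.concatenate(planes, axis=0)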
FIG. 5 is a block diagram illustrating an example of a functional configuration for input information generation according to the first embodiment. An example of the functional configuration of the input information generation device 10 will be described with reference to this figure. The input information generation device 10 includes an image acquisition unit 11, an input conversion unit 12, a synthesis unit 13, and an output unit 14. The input information generation device 10 includes a CPU (Central Processing Unit), not shown, and storage devices such as a ROM (Read only memory) or a RAM (Random access memory), which are connected via a bus. The input information generation device 10 functions as a device including the image acquisition unit 11, the input conversion unit 12, the synthesis unit 13, and the output unit 14 by executing an input information generation program.
Note that all or part of the functions of the input information generation device 10 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field-Programmable Gate Array). The input information generation program may be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The input information generation program may also be transmitted via a telecommunications line.
In the illustrated example, the memory in which the video information IM of the video captured by the imaging device 100 is stored is referred to as the first memory M1, and the memory in which the input information IN generated by the input information generation device 10 is stored is referred to as the second memory M2. The first memory M1 and the second memory M2 are storage devices such as a ROM or a RAM.
The image acquisition unit 11 acquires, from the video information IM stored in the first memory M1, the image information IMD that includes the input images used for processing. Specifically, the image acquisition unit 11 acquires, as input images, a plurality of frames including at least the target frame TF for which the input information IN is to be generated, among the plurality of frames constituting the video. As an example, the image acquisition unit 11 acquires adjacent frames AF as input images in addition to the target frame TF. The adjacent frames AF may be a plurality of frames consecutively adjacent to the target frame TF on each of its front and rear sides.
Note that when the video captured by the imaging device 100 is a color video, the image acquisition unit 11 acquires the pixel values of each color from one frame as a plurality of different images. For example, if the imaging device 100 uses an image sensor employing a Bayer array, the image acquisition unit 11 acquires four channels of RGGB image information from one frame. The pixel values of the images acquired by the image acquisition unit 11 contain multi-bit elements.
The input conversion unit 12 acquires the image information IMD from the image acquisition unit 11. The input conversion unit 12 converts the pixel values of the multiple frames of input images included in the image information IMD into low-bit input data IND based on comparisons with a plurality of thresholds. The input images are RAW images whose pixel values contain multi-bit elements (for example, 12 bits or 14 bits), so the pixel values are converted, based on a plurality of thresholds, into input data IND with a number of bits (for example, 2 bits or 1 bit) equal to or less than the number of bits representing the pixel values of the input images (for example, 8 bits). The input conversion unit 12 outputs the converted input data IND to the synthesis unit 13.
Note that when the image information IMD includes image information for each of the RGB colors, the input conversion unit 12 performs the conversion for each of them. That is, the input conversion unit 12 converts each of the plurality of images acquired from one frame into input data IND.
FIG. 6 is a block diagram showing an example of the functional configuration of the input conversion unit according to the first embodiment. The details of the functions of the input conversion unit 12 will be described with reference to this figure. As shown in the figure, the input conversion unit 12 includes a plurality of conversion units 121 and a threshold storage unit 122. In the illustrated example, the input conversion unit 12 includes, as the plurality of conversion units 121, a conversion unit 121-1, a conversion unit 121-2, ..., and a conversion unit 121-n (n is a natural number of 1 or more). The number of conversion units 121 included in the input conversion unit 12 may be the number of input data IND that the input conversion unit 12 generates from one channel of the input image. That is, when the input conversion unit 12 converts a one-channel input image into nine channels of input data IND, the input conversion unit 12 includes nine conversion units 121, namely conversion units 121-1 to 121-9.
In the illustrated example, the image data of the input image has a matrix-like data structure in which the pixel data are multi-valued, with each element in the x-axis direction and the y-axis direction having more than 8 bits. When this image data is converted by the input conversion unit 12, each element is quantized into low-bit input data (for example, 2 bits or 1 bit, which is 8 bits or less).
The conversion unit 121 compares each element of the input image with a predetermined threshold. The conversion unit 121 quantizes each element of the input image based on the comparison result. The conversion unit 121 quantizes, for example, a 12-bit input image into a 2-bit or 1-bit value. The conversion unit 121 may perform the quantization by comparing each element with a number of thresholds corresponding to the number of bits after conversion. For example, a single threshold is sufficient for conversion to 1 bit, while three thresholds may be used for conversion to 2 bits. In other words, one threshold may be used when the quantization performed by the conversion unit 121 is 1-bit quantization, and three thresholds may be used when it is 2-bit quantization. Note that when a large number of thresholds would be required, for example for 8-bit quantization, the quantization may be performed using a function, a table, or the like instead of thresholds.
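As a minimal sketch of the threshold comparison described above (1-bit quantization with one threshold, 2-bit quantization with three thresholds), assuming NumPy arrays of pixel values; the concrete threshold values are placeholders.

import numpy as np

def quantize_1bit(x, t):
    # A single threshold maps each element to {0, 1}.
    return (x > t).astype(np.uint8)

def quantize_2bit(x, t1, t2, t3):
    # Three thresholds (t1 < t2 < t3) map each element to {0, 1, 2, 3}.
    return ((x > t1).astype(np.uint8)
            + (x > t2).astype(np.uint8)
            + (x > t3).astype(np.uint8))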
Each conversion unit 121 quantizes the same element using its own independent threshold. In other words, for one channel of input, the input conversion unit 12 outputs a vector containing as many elements as there are conversion units 121 as the calculation result (input data IND). Note that the bit precision of the converted result output by the conversion units 121 may be changed as appropriate based on, for example, the bit precision of the input image.
The threshold storage unit 122 stores the plurality of thresholds used in the calculations performed by the conversion units 121. The thresholds stored in the threshold storage unit 122 are predetermined values and are set in correspondence with each of the plurality of conversion units 121. Note that each threshold may be a parameter to be learned, and may be determined and updated in the learning step.
Note that the illustrated example shows a case in which the same element of the input image is input to the plurality of conversion units 121, but the aspect of the input conversion unit 12 is not limited to this. For example, when the input image is image data containing elements of three or more channels including color components, the conversion units 121 may be divided into a plurality of corresponding groups, and the corresponding elements may be input to each group and converted. Also, even for elements other than color components, some conversion processing may be applied in advance to the elements input to a given conversion unit 121, and which conversion unit an element is input to may be switched depending on whether such preprocessing has been performed. Furthermore, the conversion processing does not have to be performed on all elements of the input image; for example, it may be performed only on specific elements of the input image, such as elements corresponding to a specific color.
Note that the number of conversion units 121 does not have to be fixed, and may be determined as appropriate according to the structure of the neural network or the hardware information. When it is necessary to compensate for the decrease in calculation accuracy caused by the quantization in the conversion units 121, it is preferable to set the number of conversion units 121 to be equal to or greater than the bit precision of each element of the input image. More generally, it is preferable to set the number of conversion units 121 to be equal to or greater than the difference in bit precision of the input image before and after quantization. Specifically, when an input image whose pixel values are represented by 8 bits is quantized to 1 bit, the number of conversion units 121 is preferably set to 7 or more (for example, 16 or 32), corresponding to the difference of 7 bits.
Returning to FIG. 5, the synthesis unit 13 combines (concatenates) the plurality of converted input data IND into one piece of data. The data obtained by combining the plurality of input data is also referred to as the composite data CD. The combining process by the synthesis unit 13 may be a process of arranging (or connecting) the plurality of input data IND into one piece of data.
The output unit 14 outputs the composite data CD synthesized by the synthesis unit 13. The composite data CD may be temporarily stored in the second memory M2. The composite data CD is, in other words, the input information IN input to the input layer 210 of the CNN 200.
After generating the input information IN for the target frame TF, the input information generation device 10 generates the input information IN for the frame following that target frame TF. The following frame may be a frame temporally continuous with the target frame TF. That is, after the output unit 14 outputs the composite data CD for the target frame TF, the image acquisition unit 11 shifts the target frame TF by one frame and acquires the frame adjacent to the target frame TF as the new target frame TF. In this way, the input information generation device 10 acquires a plurality of frames including at least the target frame TF as input images and generates the composite data CD.
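The frame-by-frame shift of the target frame TF can be sketched as the sliding window below; the window radius, the clamping at the ends of the video, and the callback name build_input are assumptions for illustration only.

def generate_for_all_frames(frames, build_input, radius=2):
    # Shift the target frame one position at a time over the whole video.
    results = []
    for t in range(len(frames)):
        lo = max(0, t - radius)
        hi = min(len(frames), t + radius + 1)
        neighbors = [frames[i] for i in range(lo, hi) if i != t]
        results.append(build_input(frames[t], neighbors))
    return results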
The input information generation device 10 generates the input information IN for all frames included in the video information IM. Note that in the example described above, a case has been described in which the input information generation device 10 generates the input information IN for all frames included in the video information IM, but the aspect of the present embodiment is not limited to this example. The input information generation device 10 may, for example, generate the input information IN every predetermined number of frames. Furthermore, although the high-quality video generation system 1 converts the video into a high-quality video based on the video information IM, the output format is not limited to a video format. For example, the high-quality video generation system 1 may generate still images from the video. That is, by using frames extracted from the video as target frames TF, the present embodiment can also be applied to a case where frames included in the video information IM are extracted to generate still images.
[Summary of the first embodiment]
According to the embodiment described above, the input information generation device 10, by including the image acquisition unit 11, acquires as input images a plurality of frames including at least the target frame TF for which the input information IN is to be generated, among the frames constituting the video. By including the input conversion unit 12, the input information generation device 10 converts the pixel values of the acquired multiple frames of input images, based on comparisons with a plurality of thresholds, into input data IND with a smaller number of bits (for example, 2 bits or 1 bit, which is 8 bits or less) than the number of bits representing the pixel values of the input images (for example, 12 bits). Furthermore, by including the synthesis unit 13, the input information generation device 10 combines the plurality of converted input data IND into one composite data CD, and by including the output unit 14, it outputs the synthesized composite data CD. That is, according to the present embodiment, the input information generation device 10 generates the input information IN by combining a plurality of input data IND obtained from a plurality of images including at least the target frame TF. The generated input information IN is input to the input layer 210 of the CNN 200.
Here, the input information IN is information whose elements have fewer bits than those of the input image. By processing the input information IN instead of the input image, the CNN 200 can process low-bit information. Therefore, according to the present embodiment, the processing of the CNN 200 can be made lighter. In addition, the input information IN is information generated based on a plurality of images. Therefore, when the quality of the video is improved based on the input information IN, processing that also takes into account the plurality of frames adjacent to the target frame TF can be performed, so noise can be removed with high accuracy. Thus, according to the present embodiment, a low-quality video can be converted into a high-quality video with lightweight calculations. Note that the processing of the CNN 200 is not limited to noise removal.
Furthermore, according to the embodiment described above, since the video to be processed is a color video, the video information IM includes, for example, pixel values of each of the RGB colors. The image acquisition unit 11 acquires the pixel values of each color from one frame as a plurality of different images, and the input conversion unit 12 converts each of the plurality of images acquired from one frame into different input data IND. Therefore, according to the present embodiment, more accurate image processing can be performed, and noise can be removed with even higher accuracy.
Furthermore, according to the embodiment described above, the image acquisition unit 11 acquires, among the plurality of frames constituting the video, the images of a plurality of frames consecutively adjacent to the target frame TF on each of its front and rear sides. According to the present embodiment, since the input information IN is generated based on the information of the frames adjacent before and after the target frame TF, more accurate image processing can be performed. Therefore, according to the present embodiment, noise can be removed with high accuracy.
Furthermore, according to the embodiment described above, after the output unit 14 outputs the composite data CD for the target frame TF, the image acquisition unit 11 takes a frame adjacent to that target frame TF as the new target frame TF and acquires a plurality of frames including at least the target frame TF as input images. That is, the input information generation device 10 generates the input information IN for each of the plurality of frames included in the video by shifting the target frame TF one after another. Therefore, according to the present embodiment, a low-quality video can be converted into a high-quality video.
[Second embodiment]
Next, a second embodiment will be described with reference to FIGS. 7 and 8. First, the problem to be solved in the second embodiment will be explained. The input information generation device 10 according to the first embodiment reads a plurality of frames including the target frame TF and the adjacent frames AF, and performs quantization on each of the plurality of frames. Therefore, with the input information generation device 10 according to the first embodiment, the number of frames to be quantized is large, and the calculation load of the first layer becomes large. The second embodiment aims to solve this problem and further lighten the calculation load.
FIG. 7 is a diagram for explaining an overview of the input information generation method according to the second embodiment. A method for generating the input information IN by the input information generation device 10A according to the second embodiment will be described with reference to this figure. In the figure, the frame at time t is shown as the target frame TF, and the frames at time t-2, time t-1, time t+1, and time t+2 are shown as the adjacent frames AF.
Each frame is configured to include four channels of RGGB image data. The input information generation device 10A performs quantization and vectorization of each channel for the target frame TF. In addition, the input information generation device 10A performs quantization and vectorization of each channel for the average image of the adjacent frames AF. That is, the second embodiment differs from the first embodiment in that quantization and vectorization are performed on the average image of the adjacent frames AF, rather than on each adjacent frame AF individually.
In the illustrated example, each of the four channels of RGGB image data is converted into 16 channels of data. Therefore, 4 channels × 16 channels = 64 channels of data are generated from one frame. The input information generation device 10A converts the target frame TF and the average image of the adjacent frames AF into 64 channels of data each, so a total of 128 channels of data are generated.
The averaging process may be performed by taking a simple average of the pixel values. The input information generation device 10A may also generate an average image for each color by taking the average for each color. That is, the input information generation device 10A may generate an average image based on the R images of the adjacent frames AF from time t-2 to time t+2, an average image based on the G images of the adjacent frames AF from time t-2 to time t+2, and an average image based on the B images of the adjacent frames AF from time t-2 to time t+2.
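A minimal sketch of the per-color simple average of the adjacent frames, assuming each frame is a (C, H, W) NumPy array whose channels correspond to the colors:

import numpy as np

def average_adjacent_frames(adjacent_frames):
    # adjacent_frames: list of (C, H, W) arrays, e.g. the frames at t-2, t-1, t+1, t+2.
    # Averaging along the frame axis keeps the channels separate, so each color
    # is averaged only with the same color of the other frames.
    return np.mean(np.stack(adjacent_frames, axis=0), axis=0)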
Note that in the illustrated example, quantization and vectorization are performed after taking the average of the adjacent frames AF, but the aspect of the present embodiment is not limited to this example. The input information generation device 10A may be configured, for example, to take the average after performing quantization and vectorization.
After the target frame TF is quantized and vectorized and the average image of the adjacent frames AF is quantized and vectorized, the input information IN is generated by combining these. In the illustrated example, 128 channels of data are generated. The combined input information IN contains less information than in the first embodiment. On the other hand, the amount of data representing the target frame TF is larger than in the first embodiment.
FIG. 8 is a block diagram illustrating an example of a functional configuration for input information generation according to the second embodiment. An example of the functional configuration of the input information generation device 10A will be described with reference to this figure. The input information generation device 10A differs from the input information generation device 10 in that it further includes an integration unit 15, and includes an input conversion unit 12A instead of the input conversion unit 12 and a synthesis unit 13A instead of the synthesis unit 13. In the description of the input information generation device 10A, components similar to those of the input information generation device 10 may be given the same reference numerals and their description may be omitted.
The integration unit 15 acquires information regarding the adjacent frames AF from the image acquisition unit 11. The adjacent frames AF are the frames other than the target frame TF among the plurality of frames acquired by the image acquisition unit 11. The integration unit 15 performs a process of integrating the plurality of adjacent frames AF into one integrated frame IF by performing calculations based on the pixel values of the adjacent frames AF. The integration unit 15 outputs the integrated frame IF obtained as a result of the calculation to the input conversion unit 12A.
The integration unit 15 may perform the integration process by, for example, taking a simple average of the adjacent frames AF. In this case, the integration unit 15 takes, for example, the average of the pixel values of the plurality of adjacent frames AF as the pixel values of the integrated frame IF.
Note that the integration process of the integration unit 15 according to the present embodiment is not limited to the case of taking a simple average. The integration unit 15 may, for example, perform the integration process by calculating a weighted average. The weighted average is calculated according to the temporal distance from the target frame TF. For example, the integration unit 15 may reduce the weight of the frames at time t-2 and time t+2, which are temporally far from time t, by multiplying their pixel values by 0.7, which is smaller than 1, and increase the weight of the frames at time t-1 and time t+1, which are temporally close to time t, by multiplying their pixel values by 1.3, which is larger than 1. That is, the integration unit 15 may calculate the pixel values of the integrated frame IF by calculating a weighted average of the plurality of adjacent frames AF according to their temporal distance from the target frame TF. By integrating with a weighted average, the degree of contribution of each adjacent frame AF to the target frame TF can be reflected in the integrated frame IF. Note that even when the temporal distance from the target frame TF is the same, the magnitude of the weight may differ depending on whether the adjacent frame AF is before or after the target frame TF. For example, the weight may be increased when the adjacent frame AF is before the target frame TF, and decreased when the adjacent frame AF is after the target frame TF.
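A minimal sketch of the weighted average, using the 0.7 / 1.3 weights given as an example above; dividing by the sum of the weights is an assumption added here so that the result stays on the same scale as the pixel values.

def weighted_average(adjacent_by_offset):
    # adjacent_by_offset: dict mapping the temporal offset from the target frame
    # to the frame array, e.g. {-2: f0, -1: f1, +1: f3, +2: f4}.
    weights = {-2: 0.7, -1: 1.3, +1: 1.3, +2: 0.7}
    total = sum(weights[k] * v for k, v in adjacent_by_offset.items())
    norm = sum(weights[k] for k in adjacent_by_offset)
    return total / norm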
The input conversion unit 12A converts the pixel values of the target frame TF acquired from the image acquisition unit 11 into input data IND based on comparisons with a plurality of thresholds. The input conversion unit 12A further converts the pixel values of the integrated frame IF acquired from the integration unit 15 into input data based on comparisons with a plurality of thresholds. The input conversion unit 12A outputs the converted input data IND to the synthesis unit 13A.
The synthesis unit 13A combines the plurality of input data IND converted based on the target frame TF and the plurality of input data IND converted based on the integrated frame IF into one composite data CD. The combined input information IN contains less information than in the first embodiment.
[Summary of the second embodiment]
According to the embodiment described above, the input information generation device 10A, by further including the integration unit 15, performs calculations based on the pixel values of the adjacent frames AF and integrates the plurality of adjacent frames AF into one integrated frame IF. The input conversion unit 12A converts the pixel values of the target frame TF into input data IND based on comparisons with a plurality of thresholds, and further converts the pixel values of the integrated frame IF into input data IND based on comparisons with a plurality of thresholds. The synthesis unit 13A combines the plurality of input data IND converted based on the target frame TF and the plurality of input data IND converted based on the integrated frame IF into one composite data CD. That is, rather than quantizing and vectorizing each adjacent frame AF in the same way as the target frame TF, the input information generation device 10A obtains an integrated frame IF based on the plurality of adjacent frames AF, and quantizes and vectorizes the integrated frame IF. Therefore, the input information IN synthesized by the synthesis unit 13A contains less information than in the first embodiment, and the calculation load of the first layer can be reduced.
Furthermore, according to the embodiment described above, the integration unit 15 takes the average of the pixel values of the plurality of adjacent frames AF as the pixel values of the integrated frame IF. That is, the integration unit 15 takes the simple average of the adjacent frames AF as the integrated frame IF. Therefore, according to the present embodiment, input information IN with a smaller amount of information than the input information IN of the first embodiment can be generated by a simple calculation, and the calculation load of the first layer can be reduced.
Furthermore, according to the embodiment described above, the integration unit 15 calculates the pixel values of the integrated frame IF by calculating a weighted average of the plurality of adjacent frames AF according to their temporal distance from the target frame TF. Therefore, according to the present embodiment, the integrated frame IF can be generated in consideration of the degree of contribution of each adjacent frame AF to the target frame TF. Accordingly, by using the input information IN generated by the input information generation device 10A, the CNN 200 can perform image processing with higher accuracy.
[Third embodiment]
Next, a third embodiment will be described with reference to FIG. 9. First, the problem to be solved in the third embodiment will be explained. The input information generation device 10A according to the second embodiment calculates the average value of the adjacent frames AF by including the integration unit 15. When the target frame TF is shifted one frame at a time, calculating the average value anew for every frame duplicates processing and is not efficient. The third embodiment aims to solve this problem and further lighten the calculation load.
FIG. 9 is a block diagram illustrating an example of a functional configuration for input information generation according to the third embodiment. An example of the functional configuration of the input information generation device 10B will be described with reference to this figure. The input information generation device 10B differs from the input information generation device 10A in that it further includes an average value temporary storage unit 16, an imaging condition acquisition unit 17, and an adjustment unit 18, and includes an integration unit 15B instead of the integration unit 15. In the description of the input information generation device 10B, components similar to those of the input information generation device 10A may be given the same reference numerals and their description may be omitted.
The average value temporary storage unit 16 stores the average value of the pixel values of predetermined frames among the frames constituting the video. The value stored in the average value temporary storage unit 16 is calculated by the integration unit 15B. The integration unit 15B acquires information about the target frame TF from the image acquisition unit 11 and acquires the stored value SV from the average value temporary storage unit 16. The integration unit 15B calculates the pixel values of the integrated frame IF by a calculation based on the target frame TF and the stored value SV, which is the value stored in the average value temporary storage unit 16. The integration unit 15B stores the calculated value in the average value temporary storage unit 16 as the calculated value CV. That is, the value stored in the average value temporary storage unit 16 is updated every time the integration unit 15B performs the calculation for a new target frame TF. By repeating such calculations, the input information generation device 10B computes a moving average based on the target frames TF. Note that for the calculation for the first frame, since no stored value SV yet exists in the average value temporary storage unit 16, the integration unit 15B may perform the calculation based only on the target frame TF.
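The update of the stored value SV can be sketched as a running average as follows; the exponential decay factor alpha is an assumption, since the text does not fix the exact moving-average formula.

import numpy as np

class RunningAverage:
    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.value = None  # stored value SV; None until the first target frame

    def update(self, target_frame):
        frame = target_frame.astype(np.float32)
        if self.value is None:
            # First frame: no stored value SV yet, so use the target frame itself.
            self.value = frame
        else:
            # Blend the stored value SV with the new target frame (calculated value CV).
            self.value = (1.0 - self.alpha) * self.value + self.alpha * frame
        return self.value  # used as the integrated frame IF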
The imaging condition acquisition unit 17 acquires the imaging conditions of the video from the imaging device 100. The imaging conditions of the video acquired by the imaging condition acquisition unit 17 may be, for example, settings of the imaging device such as shutter speed, aperture, or ISO sensitivity. The imaging conditions of the video acquired by the imaging condition acquisition unit 17 may also include other information regarding the operation and driving of the imaging device 100.
The adjustment unit 18 adjusts the value (average value) stored in the average value temporary storage unit 16 according to the imaging conditions acquired by the imaging condition acquisition unit 17. Here, when the imaging conditions of the imaging device 100 change, the relationship between the pixel values of the target frame TF and the past moving average value changes. For example, if the ISO sensitivity is doubled by a change in the settings of the imaging device 100, the pixel values of the target frame TF become brighter than the past moving average value, so the moving average value suddenly appears dark in comparison. Therefore, when the ISO sensitivity is doubled by a change in the settings of the imaging device 100, doubling the value stored in the average value temporary storage unit 16 allows the integration unit 15B to continue calculating the moving average value while making use of the value stored in the average value temporary storage unit 16.
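A minimal sketch of the adjustment described above, assuming the gain change can be expressed as the ratio of the old and new ISO sensitivities; the function and parameter names are illustrative.

def adjust_stored_average(stored_value, old_iso, new_iso):
    # When the ISO sensitivity doubles, the stored average is doubled as well,
    # so the moving average stays on the same brightness scale as the new frames.
    return stored_value * (new_iso / old_iso)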
Furthermore, when the shooting scene of the video has switched, calculating a moving average with the past frames may prevent the CNN 200 from appropriately performing image processing on the target frame TF. Therefore, when the shooting scene of the video has switched, it is preferable not to calculate a moving average with the past frames. The adjustment unit 18 may be configured to reset the value stored in the average value temporary storage unit 16 when the shooting scene of the video switches, so that the integration unit 15B starts calculating a new moving average value. Whether the shooting scene of the video has changed may be determined based on, for example, the ON/OFF of the power button of the imaging device 100 or the ON/OFF of the shooting button or the stop button.
Note that among the plurality of frames included in the video, a frame with a large luminance change may be inserted. Possible causes for such a frame include a case where a light source is captured due to a change in the imaging angle, or a case where the headlights of a car are reflected in the image. In such a case, the integration unit 15B may exclude frames with a large luminance change from the frames used to calculate the average value. By excluding frames with a large luminance change from the average value calculation, the moving average value can be prevented from being dragged by such frames. As an example of determining whether the luminance change is large, the pixel values of the immediately preceding target frame TF may be compared with the pixel values of the target frame TF to be processed to determine whether the difference is equal to or less than a threshold.
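A minimal sketch of the exclusion test, assuming the comparison is made on the mean absolute difference between the two frames; the text only requires comparing a difference against a threshold, so the exact criterion is an assumption.

import numpy as np

def has_large_luminance_change(prev_target, current_target, threshold):
    # Compare the current target frame against the immediately preceding one;
    # True means the frame should be excluded from the moving-average update.
    diff = np.mean(np.abs(current_target.astype(np.float32)
                          - prev_target.astype(np.float32)))
    return diff > threshold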
[Summary of the third embodiment]
According to the embodiment described above, the input information generation device 10B, by further including the average value temporary storage unit 16, stores the average value of the pixel values of predetermined frames among the frames constituting the video. Furthermore, in the input information generation device 10B, the integration unit 15B calculates the pixel values of the integrated frame IF by a calculation based on the value stored in the average value temporary storage unit 16 and the target frame TF. That is, the input information generation device 10B calculates the pixel values of the integrated frame IF based on the target frame TF and the stored moving average value. Therefore, the input information generation device 10B has a lighter calculation load than the input information generation device 10A. Thus, according to the present embodiment, the calculation load can be further reduced.
Furthermore, according to the embodiment described above, the integration unit 15B excludes frames with a large luminance change among the plurality of frames constituting the video from the frames used to calculate the average value. Therefore, according to the present embodiment, when the luminance suddenly changes greatly, excluding that frame from the calculation of the moving average value prevents the pixel values of the integrated frame IF from being dragged by the pixel values of the frame with the sudden large luminance change.
Furthermore, according to the embodiment described above, the input information generation device 10B further includes the imaging condition acquisition unit 17, which acquires the imaging conditions of the video, and the adjustment unit 18, which adjusts the average value stored in the average value temporary storage unit 16 according to the acquired imaging conditions. Therefore, according to the present embodiment, the average value can be adjusted in accordance with changes in the imaging conditions. Thus, according to the present embodiment, the moving average value can continue to be calculated even when the imaging conditions change.
[Fourth embodiment]
Next, a fourth embodiment will be described with reference to FIG. 10. First, the problem to be solved in the fourth embodiment will be explained. The input information generation device 10B according to the third embodiment calculates a moving average by including the average value temporary storage unit 16. According to the third embodiment, the integration unit 15B calculates the moving average of the entire image. The input information generation device 10B generates the input data IN based on the target frame TF and the moving average, and the CNN 200 removes noise from the video based on the generated input data IN. Since the integration unit 15B calculates the moving average of the entire image, when a moving subject is captured in part of the video, a problem may arise in which, as a result of the noise removal, an afterimage occurs in the part where the moving subject is captured. The fourth embodiment attempts to solve this problem.
FIG. 10 is a block diagram illustrating an example of a functional configuration for input information generation according to the fourth embodiment. An example of the functional configuration of an input information generation device 10C according to the fourth embodiment will be described with reference to this figure. The input information generation device 10C differs from the input information generation device 10B in that it further includes a comparison unit 19 and includes an integration unit 15C instead of the integration unit 15B. The input information generation device 10C may include the imaging condition acquisition unit 17 and the adjustment unit 18 in the same way as the input information generation device 10B, but does not have to include them. The illustrated example describes a case where the input information generation device 10C does not include the imaging condition acquisition unit 17 and the adjustment unit 18. In the description of the input information generation device 10C, components similar to those of the input information generation device 10B may be given the same reference numerals and their description may be omitted.
The comparison unit 19 acquires the stored value SV from the average value temporary storage unit 16 and acquires the target frame TF from the integration unit 15C. The comparison unit 19 compares the acquired stored value SV stored in the average value temporary storage unit 16 with the pixel values of the target frame TF. The comparison unit 19 may compare the entire image, compare pixel by pixel, or compare patch by patch, where a patch consists of a plurality of pixels. The comparison unit 19 outputs the result of the comparison to the integration unit 15C as the comparison result CR. The comparison result CR may include the difference between the pixel values, or may include information about the result of comparing that difference with a predetermined threshold.
The integration unit 15C acquires the comparison result CR from the comparison unit 19 and acquires the stored value SV from the average value temporary storage unit 16. Based on the comparison result CR obtained by the comparison unit 19, when the difference is equal to or less than a predetermined value, the integration unit 15C calculates a moving average based on the stored value SV stored in the average value temporary storage unit 16 and the target frame TF, and takes the calculated value as the pixel value of the integrated frame IF. When, based on the comparison result CR, the difference is not equal to or less than the predetermined value, the integration unit 15C takes the pixel value of the target frame TF as the pixel value of the integrated frame IF.
The integration process by the integration unit 15C may be performed for the entire image, pixel by pixel, or patch by patch, where a patch consists of a plurality of pixels. When the integration process is performed pixel by pixel or patch by patch, the pixel values of the integrated frame IF are the moving average values at locations where the difference is equal to or less than the predetermined value (that is, locations with little motion), and the pixel values of the target frame TF at locations where the difference is not equal to or less than the predetermined value (that is, locations with large motion). The integration unit 15C stores the calculated result in the average value temporary storage unit 16 as the calculated value CV. According to the input information generation device 10C, subjects with large motion are not incorporated into the average image, while backgrounds with little motion are incorporated into the average image. The calculation performed by the input information generation device 10C can also be called a selective average.
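A minimal per-pixel sketch of the selective average, assuming the stored value SV and the target frame are NumPy arrays of the same shape; the 0.5 / 0.5 blend used for the low-motion pixels is an assumption standing in for the moving-average update.

import numpy as np

def selective_average(stored_value, target_frame, diff_threshold):
    stored = stored_value.astype(np.float32)
    target = target_frame.astype(np.float32)
    # Small difference = little motion: keep averaging with the stored value.
    # Large difference = large motion: take the target frame as-is.
    low_motion = np.abs(target - stored) <= diff_threshold
    blended = 0.5 * stored + 0.5 * target
    return np.where(low_motion, blended, target)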
Note that as a modification of the process performed by the integration unit 15C, instead of the binary choice of incorporating or not incorporating a value into the average image, a coefficient may be applied. For example, based on the comparison result CR obtained by the comparison unit 19, when the difference is equal to or less than the predetermined value, the integration unit 15C may calculate a moving average based on the value obtained by multiplying the stored value SV stored in the average value temporary storage unit 16 by a predetermined coefficient (for example, 0.9) and the target frame TF, and use it as the pixel values of the integrated frame IF. When the difference is not equal to or less than the predetermined value, the integration unit 15C may calculate a moving average based on the value obtained by multiplying the stored value SV stored in the average value temporary storage unit 16 by a predetermined coefficient (for example, 0.1) and the target frame TF, and use it as the pixel values of the integrated frame IF.
[Summary of the fourth embodiment]
According to the embodiment described above, the input information generation device 10C, by further including the comparison unit 19, compares the stored value SV stored in the average value temporary storage unit 16 with the pixel values of the target frame TF. When the difference obtained by the comparison in the comparison unit 19 is equal to or less than a predetermined value, the integration unit 15C calculates the pixel values of the integrated frame IF by calculating a moving average based on the stored value SV stored in the average value temporary storage unit 16 and the target frame TF. When the difference obtained by the comparison in the comparison unit 19 is not equal to or less than the predetermined value, the integration unit 15C takes the pixel values of the target frame TF as the pixel values of the integrated frame IF. That is, the input information generation device 10C distinguishes between subjects with large motion and backgrounds with little motion, and determines the pixel values of the integrated frame IF by selectively performing the averaging process. The input information generation device 10C generates the input data IN based on the target frame TF and the integrated frame IF. Therefore, according to the present embodiment, since the pixel values of the adjacent frames AF are not reflected in the pixel values of the integrated frame IF at locations with large motion, the problem of afterimages can be suppressed.
Note that the second to fourth embodiments described above require a predetermined calculation to determine the pixel values of the integrated frame IF, which can make the processing heavy. To make the processing even lighter, one of the frames adjacent to the target frame TF (for example, two frames before and two frames after) may be selected by a predetermined algorithm and used as the integrated frame IF. The predetermined algorithm may be one that randomly selects one of the frames adjacent to the target frame TF. In this case, the integration unit 15 uses the randomly selected frame among the adjacent frames AF acquired by the image acquisition unit 11 as the integrated frame IF.
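A sketch of this lightweight variant, assuming the adjacent frames are available as a simple list; the random choice stands in for the "predetermined algorithm" mentioned above.

```python
import random

def pick_integrated_frame(adjacent_frames):
    """Lightweight variant: use one randomly chosen adjacent frame directly
    as the integrated frame IF, skipping the averaging calculations."""
    return random.choice(adjacent_frames)
```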
(Other examples)
 Note that in the embodiments described above, an example was shown in which the combining unit 13 and the integration unit 15 combine and integrate a plurality of acquired frames. However, the present embodiment is not limited to this example, and the target of combining or integration is not limited to the frames themselves. As another example, the combining and integration processing may be applied to intermediate outputs of at least a part of the CNN 200. More specifically, the intermediate outputs generated when the target frame TF and the adjacent frames AF are processed by the CNN 200 may be the target of combining or integration. Alternatively, the intermediate output generated when the adjacent frames AF are processed by the CNN 200 and the result of vectorizing the target frame TF may be the target of combining or integration. The present embodiment is not limited to the examples described above; the target of combining or integration broadly includes any information obtained as a result of processing based on the frames.
Note that the aspect of the present embodiment is not limited to any one of the first to fourth embodiments described above; any of the first to fourth embodiments may be selectively used based on a predetermined condition. The predetermined condition may be, for example, the shooting conditions of the video, the shooting mode, the exposure conditions, or the type of subject.
So far, examples have been shown in which calculations are performed using the adjacent frames AF in order to improve the quality of the target frame TF. When training the trained model included in the CNN 200 as well, it is preferable to perform the training using not only the target frame TF but also the adjacent frames AF, as in any one of the first to fourth embodiments. The calculations related to training do not necessarily have to be executed in the image processing device 2; results such as parameters learned in advance in a dedicated learning device may be included in the CNN 200 as a trained model.
Although the modes for carrying out the present invention have been described above using embodiments, the present invention is not limited to these embodiments in any way, and various modifications and substitutions can be made without departing from the spirit of the present invention.
Next, with reference to FIGS. 11 to 20, preferred embodiments of the learning device, the program, and the learning method for a noise reduction device according to aspects of the present invention will be described in detail with reference to the accompanying drawings. Aspects of the present invention are not limited to these embodiments and include those to which various changes or improvements are added. That is, the components described below include those that a person skilled in the art could easily conceive of and those that are substantially the same, and the components described below can be combined as appropriate. Various omissions, substitutions, or changes of the components can also be made without departing from the gist of the present invention. In the following drawings, in order to make each configuration easier to understand, the scale, number, and the like of each structure may differ from those of the actual structure.
First, the background art of the present invention and the problem to be solved by the invention will be described.
Conventionally, there have been techniques that use machine learning to process a low-quality image into a high-quality image. In this technical field, a learning model is trained using combinations of a noise image on which noise is superimposed and a high-quality image as training data. The training data is created by capturing the same object with an imaging device under different exposure settings to obtain a high-quality image and a noise image. It is generally known that machine learning requires a large amount of training data, and creating training data by capturing images with a camera is laborious. A technique is therefore known in which training data is created by adding random noise to a high-quality image (see, for example, Japanese Patent Application Laid-Open No. 2021-071936). Using such conventional techniques, it is known to create training data for inferring a high-quality image from a low-quality image by adding random noise to a high-quality image.
Here, it is known that when processing a low-quality video into a high-quality video, a large amount of training data for machine learning is required, just as in the case of still images described above. In the case of videos, however, it is very difficult to shoot the same object under different settings and prepare a high-quality video and a low-quality video in which the same subject is captured. It is conceivable to apply the conventional technique described above and generate a low-quality video by superimposing noise on each frame of a high-quality video shot in advance, but this is also very difficult because of problems such as the enormous storage capacity required.
The present invention therefore aims to provide a technique capable of generating training data for inferring a high-quality video from a low-quality video.
Next, the premises of the present embodiment will be described. The learning device, the program, and the learning method for a noise reduction device according to the present embodiment train a learning model so that it takes low-quality video information with superimposed noise as input and infers a high-quality video from which the noise has been removed. Low-quality videos include low-image-quality videos, and high-quality videos include high-image-quality videos. The training data used for learning by the learning device, the program, and the learning method for a noise reduction device according to the present embodiment is generated from still images obtained by capturing a subject. A still image of a subject may be a single high-quality image, or a plurality of images of the same subject (a combination of one or more high-quality images and one or more low-quality images). The plurality of images of the same subject may be captured under mutually different imaging conditions. The images of the subject may also be other images including at least one image. An example of a high-quality image is an image of high quality captured with a low ISO sensitivity and a long exposure. In the following description, a high-quality image may be referred to as GT (Ground Truth). An example of a low-quality image is an image of low quality captured with a high ISO sensitivity and a short exposure.
In the following description, image quality degradation due to noise will be described as an example of a low-quality image, but the present embodiment is widely applicable to factors other than noise that degrade image quality. Examples of factors that degrade image quality include reduced resolution or color shift due to optical aberrations, reduced resolution due to camera shake or subject blur, non-uniform black levels caused by dark current or circuitry, ghosts and flare caused by high-luminance subjects, and signal level abnormalities.
Note that images prepared in advance may be used to generate the training data. In the following description, a low-quality image may be referred to as a low-image-quality image or a noise image, and a high-quality image may be referred to as a high-image-quality image or GT. Similarly, a low-quality video may be referred to as a low-image-quality video or a noise video, and a high-quality video may be referred to as a high-image-quality video or GT.
The images handled by the learning device according to the present embodiment may be still images or frames included in a video. The data format may be a format without compression encoding, such as a RAW format, or a format with compression encoding, such as a JPEG or MPEG format. In the following, unless otherwise specified, the case where the image is a still image in RAW format will be described.
The images handled by the learning device according to the present embodiment may be images captured by a CCD camera using a CCD (Charge Coupled Devices) image sensor, or images captured by a CMOS camera using a CMOS (Complementary Metal Oxide Semiconductor) image sensor. They may be color images or monochrome images. They may also be images captured by acquiring non-visible light components, such as with an infrared camera using an infrared sensor.
[Fifth embodiment]
 First, a fifth embodiment will be described with reference to FIGS. 11 to 18.
 FIG. 11 is a diagram for explaining an overview of the learning system according to the fifth embodiment. An overview of the learning system 1001 will be described with reference to this figure. The learning system 1001 shown in the figure is an example of a configuration in the learning stage of machine learning. The learning system 1001 trains the learning model 1040 using training data TD generated based on images captured by the imaging device 1020.
The learning system 1001 includes the imaging device 1020 and thereby captures a high-quality image 1031 and low-quality images 1032. The high-quality image 1031 and the low-quality images 1032 are images in which the same subject is captured. For example, the high-quality image 1031 and the low-quality images 1032 are captured at the same angle of view and imaging angle, with different settings such as ISO sensitivity and exposure time. It is preferable that there be one high-quality image 1031, but there may be more than one. It is preferable that there be a plurality of low-quality images 1032, but there may be only one. The plurality of low-quality images 1032 are preferably different images captured with different settings such as ISO sensitivity and exposure time. The imaging device 1020 may be, for example, a smartphone or a tablet terminal having communication means, or a surveillance camera or the like having communication means.
The learning system 1001 generates a high-quality video 1033 from the high-quality image 1031 and a low-quality video 1034 from the low-quality images 1032. The high-quality video 1033 is preferably generated from a single high-quality image 1031, and the low-quality video 1034 is preferably generated from a plurality of low-quality images 1032. The high-quality video 1033 and the low-quality video 1034, generated from the high-quality image 1031 and the low-quality images 1032 in which the same subject is captured, are associated with each other. The mutually corresponding high-quality video 1033 and low-quality video 1034 are input to the learning model 1040 as training data TD for learning.
Note that the mutually corresponding high-quality video 1033 and low-quality video 1034 may be temporarily stored in a predetermined storage device for learning performed later. That is, the learning system 1001 may generate a plurality of sets of training data TD in advance before the learning performed later. The high-quality image 1031 and the low-quality images 1032 captured by the imaging device 1020 may also be temporarily stored in a predetermined storage device. In this case, the learning system 1001 may store a plurality of combinations of mutually corresponding high-quality images 1031 and low-quality images 1032, and generate the training data TD at the time of learning.
The learning model 1040 is trained using the training data TD generated by the learning system 1001. Specifically, the learning model 1040 is trained to infer a high-quality video from a low-quality video. In other words, the trained learning model 1040 takes a low-quality video as input, infers a high-quality video, and outputs the inference result. The trained learning model 1040 may therefore be used in a noise reduction device for removing noise from a low-quality video.
Note that the high-quality image 1031 and the low-quality images 1032 captured by the imaging device 1020 are stored in a predetermined storage device that temporarily stores information. The predetermined storage device may be provided in the imaging device 1020 or in a cloud server or the like. That is, the learning system 1001 may be configured on an edge device, or may be configured to include an edge device and a cloud server. A GPU or the like provided on a server may also be used for training the learning model 1040.
FIG. 12 is a diagram showing an example of the functional configuration of the learning device according to the fifth embodiment. The functional configuration of the learning device 1010 will be described with reference to this figure. The learning device 1010 is used to realize the learning system 1001 described above. The learning device 1010 generates the high-quality video 1033 and the low-quality video 1034 based on the high-quality image 1031 and the low-quality images 1032 captured by the imaging device 1020, and trains the learning model 1040 using the generated high-quality video 1033 and low-quality video 1034 as training data TD. The learning device 1010 includes an image acquisition unit 1011, a video information generation unit 1012, and a learning unit 1013. The learning device 1010 includes a CPU (Central Processing Unit), storage devices such as a ROM (Read Only Memory) and a RAM (Random Access Memory), and the like, connected by a bus (not shown). By executing a learning program, the learning device 1010 functions as a device including the image acquisition unit 1011, the video information generation unit 1012, and the learning unit 1013.
Note that all or some of the functions of the learning device 1010 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field-Programmable Gate Array). The learning program may be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The learning program may also be transmitted via a telecommunications line.
The image acquisition unit 1011 acquires image information I from the imaging device 1020. The image information I includes first image information I1 and second image information I2. The first image information I1 includes at least one high-quality image 1031. The second image information I2 includes at least one low-quality image 1032. The low-quality image 1032 included in the second image information I2 captures the same subject as the subject captured in the high-quality image 1031 included in the first image information I1, and the images included in the second image information I2 have lower image quality than the images included in the first image information I1. The image acquisition unit 1011 outputs the acquired image information I to the video information generation unit 1012.
The video information generation unit 1012 generates video information M by cutting out a plurality of portions of the images included in the image information I and connecting the cut-out images as frame images at a predetermined time interval (which can also be called a frame rate). The frame rate may be, for example, 60 [FPS (frames per second)]. The position of the image cut out by the video information generation unit 1012 may differ for each frame. For example, the size of the cut-out images may be fixed, and the video information generation unit 1012 may cut out a plurality of images at positions shifted by a predetermined number of pixels in a predetermined direction. Specifically, the size of the cut-out image may be fixed at 256 pixels x 256 pixels, and the video information generation unit 1012 may cut out images at positions shifted by 10 pixels for each frame. If the shift amount is too large, the amount of change between frames becomes too large and the resulting video becomes unnatural, so it is preferable to set a limit (upper limit) so that the shift does not exceed a predetermined amount. The shift amount and this limit are preferably determined based on the shooting angle of view, the shooting resolution, the focal length of the optical system, the distance to the subject, the shooting frame rate, and the like. For a subject such as a falling object, the speed increases with acceleration, so the shift amount may be increased for frames temporally farther from the target image.
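A sketch of this crop-and-shift procedure in Python/NumPy, assuming an H x W(x C) array that is larger than the patch. The 256x256 patch size and the 10-pixel step are the example figures given above; the clamping of the window to the image bounds and the optional upper limit on the total shift are assumptions about how the limit might be enforced.

```python
import numpy as np

def crop_shifted_patches(image, num_frames, start_xy=(0, 0),
                         step_xy=(10, 0), patch=256, max_shift=None):
    """Cut `num_frames` patches from one still image, shifting the crop
    window by `step_xy` pixels per frame (example values from the text).
    Assumes the image is larger than the patch size."""
    h, w = image.shape[:2]
    x0, y0 = start_xy
    dx, dy = step_xy
    frames = []
    for n in range(num_frames):
        sx, sy = n * dx, n * dy
        if max_shift is not None:            # optional upper limit on the shift
            sx, sy = min(sx, max_shift), min(sy, max_shift)
        x = min(max(x0 + sx, 0), w - patch)  # keep the window inside the image
        y = min(max(y0 + sy, 0), h - patch)
        frames.append(image[y:y + patch, x:x + patch].copy())
    return frames  # play back at e.g. 60 fps to obtain the pseudo video
```

Increasing the per-frame step for later frames (instead of a constant `step_xy`) would approximate the accelerating motion of a falling subject mentioned above.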
The video information generation unit 1012 generates first video information M1 from the images included in the first image information I1, and generates second video information M2 from the images included in the second image information I2. That is, the video information generation unit 1012 cuts out a plurality of images at different positions, each being a part of the first image information I1, and combines the cut-out images to generate the first video information M1. Similarly, the video information generation unit 1012 cuts out a plurality of images at different positions, each being a part of the second image information I2, and combines the cut-out images to generate the second video information M2. Combining a plurality of images to generate a video may mean converting the plurality of images into a file format in which they are displayed at predetermined time intervals according to the frame rate. The video information generation unit 1012 outputs information including the generated first video information M1 and second video information M2 to the learning unit 1013 as video information M.
Here, the sizes of the plurality of images cut out by the video information generation unit 1012 and the cut-out positions may be determined arbitrarily. However, it is preferable that the positions cut out from the images included in the first image information I1 and the positions cut out from the images included in the second image information I2 be substantially the same, because the first video information M1, which is a high-quality video, and the second video information M2, which is a low-quality video, should capture the same subject.
The learning unit 1013 acquires the video information M from the video information generation unit 1012. The learning unit 1013 trains the learning model 1040 by inputting the acquired video information M to the learning model 1040 as training data TD. The learning model 1040 is trained to infer a high-quality video from a low-quality video. That is, the learning unit 1013 performs training to infer a high-quality video from a low-quality video based on the training data TD including the first video information M1 and the second video information M2 generated by the video information generation unit 1012. The learning model 1040 can also be said to be trained to infer a video from which the noise in the input video has been removed.
Next, with reference to FIGS. 13 to 15, the images that the learning device 1010 cuts out from the images captured by the imaging device 1020 will be described. In the following description, the method of generating a high-quality video from a high-quality image (the method described with reference to FIG. 13) and the method of generating a low-quality video from low-quality images (the method described with reference to FIG. 14) are described as being different from each other, but the present embodiment is not limited to this example. Instead, a high-quality video may be generated from a high-quality image and a low-quality video may be generated from low-quality images using the same method. That is, a low-quality video may be generated by the method described with reference to FIG. 13, and a high-quality video may be generated by the method described with reference to FIG. 14.
FIG. 13 is a diagram for explaining an example of the positions of images cut out from a high-quality image by the learning device according to the fifth embodiment. An example of the positions of images that the learning device 1010 cuts out from a high-quality image will be described with reference to this figure. FIG. 13(A) shows an image I-11, which is an example of an image included in the first image information I1. FIG. 13(B) shows, as image I-12, an example in which a plurality of images are cut out from the image I-11 shown in FIG. 13(A).
As shown in FIG. 13(A), the ball B, which is the subject, is captured in the image I-11. The video information generation unit 1012 cuts out a plurality of images from the image I-11 and connects the cut-out images in time, thereby generating a video from the still image I-11.
The image I-12 shown in FIG. 13(B) shows a plurality of cut-out images CI, which are images cut out by the video information generation unit 1012. Specifically, cut-out images CI-11 to CI-15 are shown as examples of the images cut out by the video information generation unit 1012. When the cut-out images CI-11 to CI-15 are not distinguished, they may simply be referred to as cut-out images CI.
The cut-out images CI-11 to CI-15 are each shifted by a predetermined number of pixels in the vertical and horizontal directions. According to the first video information M1 generated by the video information generation unit 1012, image C-11 is displayed at a time t1, image C-12 at a time t2, image C-13 at a time t3, image C-14 at a time t4, and image C-15 at a time t5. By connecting different cut-out images CI in time in this way, it is possible to generate a video in which the ball B, the subject in the still image, appears to be moving. When the video information generation unit 1012 generates a video with a frame rate of 60 [fps], the interval between the times may be 1/60 of a second.
The shift direction and shift amount of the images cut out by the video information generation unit 1012 are preferably determined based on shooting conditions such as the shooting angle of view, the shooting resolution, the focal length of the optical system, the distance to the subject, and the shooting frame rate. When simulating a falling subject, the speed increases with acceleration, so it is preferable to gradually change (increase) the shift amount.
Here, the high-quality video (first video information M1) generated by the learning device 1010 is a high-image-quality video without superimposed noise. Ideally, therefore, the still image from which the video is generated has no superimposed noise, and each frame of the high-quality video generated from such an image also has no superimposed noise. The video information generation unit 1012 therefore preferably generates the video from a single image on which no noise is superimposed. That is, the video information generation unit 1012 preferably generates the first video information M1 by cutting out different portions from a single high-quality image included in the first image information I1.
FIG. 14 is a diagram for explaining an example of the positions of images cut out from low-quality images by the learning device according to the fifth embodiment. An example of the positions of images that the learning device 1010 cuts out from low-quality images will be described with reference to this figure. The learning device 1010 cuts out the image of a different frame from each of a plurality of low-quality images. FIGS. 14(A) to 14(E) show images I-21 to I-25, which are different images. The learning device 1010 cuts out the image of a different frame from each of the images I-21 to I-25.
The compositions of the images I-21 to I-25, which are low-quality images, are the same as that of the image I-11 shown in FIG. 13(A). That is, the ball B is captured at the same position in the images I-21 to I-25. The images I-21 to I-25 differ from the image I-11 in that mutually different noise is superimposed on them. Mutually different noise may be superimposed on the images I-21 to I-25, for example, by using different imaging conditions at the time of capture.
The video information generation unit 1012 cuts out a cut-out image CI-21 from the image I-21, a cut-out image CI-22 from the image I-22, a cut-out image CI-23 from the image I-23, a cut-out image CI-24 from the image I-24, and a cut-out image CI-25 from the image I-25. The cut-out images CI-21 to CI-25 are each shifted by a predetermined number of pixels in the vertical and horizontal directions. According to the second video information M2 generated by the video information generation unit 1012, image C-21 is displayed at a time t1, image C-22 at a time t2, image C-23 at a time t3, image C-24 at a time t4, and image C-25 at a time t5. Since different noise is superimposed on each of the cut-out images CI-21 to CI-25, different noise is also superimposed on the generated video at each point in time.
Here, the low-quality video (second video information M2) generated by the learning device 1010 is a low-image-quality video on which noise is superimposed. If a plurality of different positions are cut out from a single noise-superimposed image and made into a video, the same noise is included at every moment (in other words, the noise does not change over time), so the result may not be suitable as a low-quality video. Therefore, in the present embodiment, a low-quality video is generated by cutting out images from a plurality of different low-quality images. Each of the different low-quality images captures the same subject as the subject captured in the high-quality image. That is, the second image information I2 includes a plurality of images in which the same subject as the subject captured in the images included in the first image information I1 is captured, with mutually different noise superimposed on each. The plurality of images included in the second image information I2 may be images captured at different but close times. The video information generation unit 1012 generates the second video information M2 by cutting out a different portion from each of the plurality of images included in the second image information.
 Note that it is not necessary to prepare, for example, as many low-quality images as there are frames; images may be cut out a plurality of times from a plurality of images, as long as the same image is not used for consecutive frames. The order in which images are cut out from the plurality of images may be random.
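A sketch of how the low-quality frames might be assembled so that the noise changes from frame to frame, assuming the crop positions have already been computed (for example by the routine shown earlier) and match those used for the high-quality video. Reusing captures in a random order when there are fewer stills than frames follows the note above; the exact selection policy is an assumption.

```python
import random

def crop_noisy_frames(noisy_images, crop_positions, patch=256):
    """Build low-quality frames: each frame is cropped from a different noisy
    still so that the noise differs over time, while the crop positions match
    those used for the high-quality video."""
    frames = []
    last = None
    for n, (x, y) in enumerate(crop_positions):
        if n < len(noisy_images):
            idx = n                                   # one capture per frame when available
        else:
            choices = [i for i in range(len(noisy_images)) if i != last]
            idx = random.choice(choices)              # reuse captures, avoiding immediate repeats
        last = idx
        frames.append(noisy_images[idx][y:y + patch, x:x + patch].copy())
    return frames
```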
FIG. 15 is a diagram for explaining examples of the directions in which the learning device according to the fifth embodiment cuts out images. In the example described with reference to FIGS. 13 and 14, positions shifted by a predetermined number of pixels in both the vertical and horizontal directions are cut out. However, the video information generation unit 1012 may cut out positions shifted in other directions. Other examples of the directions in which the video information generation unit 1012 cuts out the cut-out images CI will be described with reference to FIGS. 15(A) to 15(C).
FIG. 15(A) shows an image I-31 and is an example in which positions shifted only in the horizontal direction are cut out. In this case, the video information generation unit 1012 fixes the y coordinate in the vertical direction and changes only the x coordinate in the horizontal direction, thereby cutting out the cut-out images CI at a plurality of different positions. By cutting out images in this way, it is possible to generate a video in which the subject moves horizontally. Similarly, the video information generation unit 1012 may cut out the cut-out images CI at positions shifted only in the vertical direction, which makes it possible to generate a video in which the subject moves vertically.
 As shown in FIGS. 13 and 14, the video information generation unit 1012 may also cut out the cut-out images CI at positions shifted in both the vertical and horizontal directions. In this case, the vertical shift amount and the horizontal shift amount may differ from each other.
FIG. 15(B) shows an image I-32 and is an example in which positions shifted in a rotational direction are cut out. In this case, the video information generation unit 1012 cuts out the cut-out images CI at a plurality of different positions by moving the cut-out position along an arc with rotation center O and radius r. In the example shown in the figure, the video information generation unit 1012 cuts out positions rotated counterclockwise. By cutting out images in this way, it is possible to generate a video in which the subject moves in a rotational direction. The position of the rotation center O and the size of the radius r may differ for each frame.
FIG. 15(C) shows an image I-33 and is an example in which the cut-out position is enlarged and reduced. In the present embodiment, the size of the cut-out image CI is preferably constant. The video information generation unit 1012 therefore enlarges or reduces the image I and cuts it out while keeping the size of the cut-out image CI. When the size of the cut-out image CI is fixed at 256 pixels x 256 pixels, the video information generation unit 1012 enlarges and reduces the image I so that the cut-out fits within that size. By cutting out images in this way, it is possible to generate a video in which the subject appears to be zoomed in or zoomed out.
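One possible way to realize such a zoom-style sequence, assuming OpenCV is available and the source image is large enough that the fixed 256x256 window always fits inside the rescaled image; the per-frame scale factors are illustrative.

```python
import cv2
import numpy as np

def zoom_crop_frames(image, scales, patch=256):
    """Zoom-in/out sequence: rescale the source image per frame and always
    crop a fixed-size window around the image centre."""
    frames = []
    for s in scales:                        # e.g. np.linspace(1.0, 1.5, num_frames)
        h, w = image.shape[:2]
        scaled = cv2.resize(image, (int(w * s), int(h * s)),
                            interpolation=cv2.INTER_LINEAR)
        cy, cx = scaled.shape[0] // 2, scaled.shape[1] // 2
        y, x = cy - patch // 2, cx - patch // 2
        frames.append(scaled[y:y + patch, x:x + patch].copy())
    return frames
```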
Note that the cut-out positions described with reference to FIGS. 15(A) to 15(C) are examples of the present embodiment, and the video information generation unit 1012 may generate video information by cutting out and connecting other different positions. For example, the video information generation unit 1012 may cut out the cut-out images CI by combining the cut-out methods described with reference to FIGS. 15(A) to 15(C). In this case, it is possible to generate a video in which, for example, horizontal or vertical movement is followed by rotational movement, or movement is followed by enlargement or reduction.
Note that the movement of the cut-out position as described above may be calculated by an affine transformation. In other words, the predetermined direction in which the video information generation unit 1012 cuts out images can be described as being calculated by an affine transformation.
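A sketch of describing the crop-window motion with an affine transform: here a pure rotation of the crop centre about a pivot point, which reproduces the arc-shaped motion of FIG. 15(B); translation and scaling terms could be added to the same matrix form. The function and parameter names are illustrative.

```python
import numpy as np

def affine_crop_centers(center0, pivot, num_frames, deg_per_frame):
    """Crop-window centres generated by an affine (here: pure rotation)
    transform about a pivot point O, as in the arc-shaped motion example."""
    c0 = np.array(center0, dtype=np.float64)
    p = np.array(pivot, dtype=np.float64)
    centers = []
    for n in range(num_frames):
        theta = np.deg2rad(n * deg_per_frame)
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        centers.append(tuple(rot @ (c0 - p) + p))  # rotate the centre about the pivot
    return centers  # feed these centres to the patch-cropping routine shown earlier
```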
Note that instead of changing the cut-out position as described above, the video information generation unit 1012 may generate a video by cutting out a portion of the image and then moving it. In this case, the video information generation unit 1012 cuts out an image of 256 pixels x 256 pixels, generates a plurality of images by moving the cut-out image in a predetermined direction, and generates a video by connecting them. That is, the video information generation unit 1012 may obtain a plurality of images at different positions by shifting the cut-out images in a predetermined direction.
 Note that moving the image after cutting it out produces regions around the image where no data exists. However, by defining the peripheral portion of the image in advance as a margin, those regions can be excluded from the range of the image to be learned so that no problem arises in the later learning stage.
In the description above, an example was described in which the video information generation unit 1012 generates a video by cutting out images moved in directions calculated by some method such as an affine transformation. In actual videos, however, the subject often does not move in such calculated directions and more often moves randomly. The learning device 1010 can therefore generate a video by cutting out images moved in directions based on the trajectory along which an object actually moves, and can thereby generate training data that is more effective for machine learning. An example of such a case will be described as a modification of the fifth embodiment with reference to FIGS. 16 and 17.
Here, in bright scenes such as under clear skies, it is common to increase the shutter speed in order to maintain the exposure. As a result, moving subjects lose their smoothness and the video becomes jerky. Similarly, when a video is created from a still image with a high sense of resolution, the result may be an unnatural, jerky video with little smoothness. For this reason, the video information generation unit 1012 may generate the video after correcting the still image from which the video is created by adding pseudo subject blur. As an example, subject blur may be added by performing a predetermined averaging process along the shift direction or by performing a process that lowers the resolution.
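One possible way to add the pseudo subject blur mentioned above is to average several copies of the frame displaced along the per-frame shift direction; the tap count and the use of np.roll are assumptions, not a method prescribed by the text.

```python
import numpy as np

def add_motion_blur_along_shift(frame, shift_xy, taps=5):
    """Approximate subject blur by averaging copies of the frame displaced
    along the per-frame shift direction (a simple box blur along the motion)."""
    dx, dy = shift_xy
    acc = np.zeros_like(frame, dtype=np.float32)
    for k in range(taps):
        t = k / max(taps - 1, 1)                  # 0 .. 1 along the shift
        ox, oy = int(round(t * dx)), int(round(t * dy))
        acc += np.roll(np.roll(frame, oy, axis=0), ox, axis=1).astype(np.float32)
    return (acc / taps).astype(frame.dtype)
```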
FIG. 16 is a diagram showing an example of the functional configuration of the learning device according to a modification of the fifth embodiment, in which the learning device generates a video based on trajectory vectors. An example of the functional configuration of a learning device 1010A according to the modification of the fifth embodiment will be described with reference to this figure. A learning system 1001A according to the modification of the fifth embodiment differs from the learning system 1001 in that it further includes a trajectory vector generation device 1050. The learning device 1010A differs from the learning device 1010 in that it further includes a trajectory vector acquisition unit 1014 and in that it includes a video information generation unit 1012A instead of the video information generation unit 1012. In the description of the learning device 1010A, configurations similar to those of the learning device 1010 are given the same reference numerals and their description may be omitted.
The trajectory vector generation device 1050 acquires information about the trajectory of an object captured in a video. Video information is input to the trajectory vector generation device 1050, and the trajectory vector generation device 1050 analyzes the trajectory of the object captured in the input video information and outputs the analysis result as a trajectory vector TV. The trajectory vector TV indicates the trajectory of the object captured in the video information. The trajectory vector generation device 1050 acquires the trajectory vector TV from the video information using a conventional technique such as optical flow.
 Note that the trajectory vector TV may include, in addition to or instead of the vector information, coordinate information indicating the trajectory along which the object moved.
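The trajectory vector is said to be obtained with conventional techniques such as optical flow. The sketch below uses OpenCV's Farneback dense optical flow as one such technique and averages the flow field into a single per-frame (dx, dy) vector; reducing the flow field to one vector per frame is an assumption about how a trajectory vector TV might be derived.

```python
import cv2
import numpy as np

def trajectory_vectors(gray_frames):
    """Estimate a per-frame motion vector from a video using dense optical
    flow (Farneback), then average the flow field into one (dx, dy) vector."""
    vectors = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        vectors.append(flow.reshape(-1, 2).mean(axis=0))  # mean (dx, dy) per frame
    return np.array(vectors)  # plays the role of the trajectory vector TV
```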
The trajectory vector acquisition unit 1014 acquires the trajectory vector TV from the trajectory vector generation device 1050 and outputs it to the video information generation unit 1012A. Note that the video from which the trajectory vector TV was acquired by the trajectory vector generation device 1050 and the image acquired by the image acquisition unit 1011 may have a predetermined relationship. In this case, for example, the image acquisition unit 1011 may acquire, as the image, one frame of the video from which the trajectory vector TV was acquired by the trajectory vector generation device 1050.
 However, the present embodiment is not limited to this example, and the video from which the trajectory vector TV was acquired by the trajectory vector generation device 1050 and the image acquired by the image acquisition unit 1011 may have no particular relationship.
The video information generation unit 1012A acquires the image information I from the image acquisition unit 1011 and the trajectory vector TV from the trajectory vector acquisition unit 1014, and generates video information based on them. The video information generation unit 1012A determines the cut-out direction of the cut-out images CI and the shift amount per frame based on the trajectory indicated by the trajectory vector TV. That is, the predetermined direction in which the video information generation unit 1012A cuts out images is calculated based on the acquired trajectory vector TV.
FIG. 17 is a diagram for explaining an example of the positions of images cut out from a still image when the learning device according to the modification of the fifth embodiment generates a video based on a trajectory vector. An example of the position coordinates of the cut-out images CI when a video is generated based on the trajectory vector TV will be described with reference to this figure. FIG. 17(A) shows an image I-41, which is an example of an image included in the first image information I1. FIG. 17(B) shows an example of a plurality of cut-out images CI cut out from the image I-41.
As shown in FIG. 17(A), the image I-41 shows the trajectory vector TV, which is the trajectory of the ball B, the subject. The trajectory vector TV represents a trajectory in which the ball B falls from the upper right of the figure toward the lower center, bounces at the lower center, and then moves toward the upper left of the figure. The video information generation unit 1012A cuts out the cut-out images CI at position coordinates based on the trajectory vector TV shown in the image I-41 and connects the cut-out images in time, thereby generating a video from the still image I-41.
FIG. 17(B) shows examples of the cut-out images CI cut out by the video information generation unit 1012. Specifically, cut-out images CI-41 to CI-49 are shown as examples of the images cut out by the video information generation unit 1012. The cut-out images CI-41 to CI-49 are located at coordinates based on the trajectory vector TV. That is, the cut-out image CI-41 is located toward the upper right of the figure, and the cut-out position moves toward the lower center of the figure up to the cut-out image CI-45. The cut-out position then moves toward the upper left of the figure from the cut-out image CI-45 to the cut-out image CI-49.
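A sketch of turning the trajectory vector TV into concrete crop coordinates: the per-frame motion vectors are accumulated from a starting position and clamped so the fixed-size window stays inside the image. The clamping and rounding details are assumptions.

```python
import numpy as np

def crop_positions_from_trajectory(start_xy, trajectory, image_size, patch=256):
    """Convert per-frame motion vectors (the trajectory vector TV) into
    top-left crop coordinates, clamped so the patch stays inside the image."""
    h, w = image_size
    pos = np.asarray(start_xy, dtype=np.float64)
    steps = np.vstack([[0.0, 0.0], np.asarray(trajectory, dtype=np.float64)])
    positions = []
    for step in steps:
        pos = pos + step
        x = int(np.clip(np.round(pos[0]), 0, w - patch))
        y = int(np.clip(np.round(pos[1]), 0, h - patch))
        positions.append((x, y))
    return positions  # one (x, y) crop position per frame
```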
FIG. 18 is a flowchart showing an example of a series of operations of the learning method for a noise reduction device according to the fifth embodiment. An example of a series of operations of the learning method for a noise reduction device using the learning device 1010 will be described with reference to this figure.
(Step S110) First, the image acquisition unit 1011 acquires images. The image acquisition unit 1011 acquires the first image information I1 including a high-quality image and the second image information I2 including low-quality images. The step of acquiring images by the image acquisition unit 1011 may be referred to as an image acquisition step or an image acquisition process.
(Step S130) Next, the video information generation unit 1012 cuts out portions of the acquired images. The video information generation unit 1012 cuts out a plurality of cut-out images CI from each of the high-quality image included in the first image information I1 and the low-quality images included in the second image information I2. The position coordinates cut out from the high-quality image included in the first image information I1 and from the low-quality images included in the second image information I2 are preferably the same. However, if there is a time difference between the timing at which the high-quality image included in the first image information I1 was acquired and the timing at which the low-quality images included in the second image information I2 were acquired, the subject in the cut-out images may be displaced because of that time difference. In such a case, the position coordinates to be cut out from the high-quality image included in the first image information I1 and from the low-quality images included in the second image information I2 are preferably determined in consideration of the displacement caused by the time difference. More specifically, it is preferable to change the position coordinates cut out from the high-quality image included in the first image information I1 or from the low-quality images included in the second image information I2 in a direction that reduces the amount of displacement caused by the time difference.
(Step S150) Next, the video information generation unit 1012 connects the cutout images to generate videos. The video information generation unit 1012 generates a high-quality video by connecting the plurality of images cut out from the high-quality image, and generates a low-quality video by connecting the plurality of images cut out from the low-quality image. The steps of generating video information in step S130 and step S150 may be referred to as a video information generation step or a video information generation process.
(Step S170) Finally, the learning unit 1013 uses the combination of the generated high-quality video and low-quality video as teacher data TD and trains the model to infer a high-quality video from a low-quality video. This step may be referred to as a learning step or a learning process.
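A compact Python sketch of steps S110 to S170 follows. It is an assumption-based illustration only: the patch size, the helper crop_positions_along_trajectory from the earlier sketch, and the train_denoiser call are hypothetical and not part of the described embodiment.

    import numpy as np

    def build_teacher_data(high_img, low_imgs, trajectory, num_frames=8, size=64):
        # Step S130: cut out patches at trajectory-based positions
        origins = crop_positions_along_trajectory(trajectory, num_frames)
        high_patches = [high_img[y:y + size, x:x + size] for x, y in origins]
        # using a different low-quality still per frame keeps the noise
        # realization different from frame to frame
        low_patches = [low_imgs[i % len(low_imgs)][y:y + size, x:x + size]
                       for i, (x, y) in enumerate(origins)]
        # Step S150: connect the cutouts into videos (frame axis first)
        high_video = np.stack(high_patches)   # corresponds to first video information M1
        low_video = np.stack(low_patches)     # corresponds to second video information M2
        return high_video, low_video

    # Step S170: train on the (low, high) pair, for example
    # model = train_denoiser(inputs=low_video, targets=high_video)  # hypothetical trainer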
[Summary of the fifth embodiment]
According to the embodiment described above, the learning device 1010 includes the image acquisition unit 1011 and thereby acquires the first image information I1 and the second image information I2. The first image information I1 includes at least one image, and the second image information I2 includes at least one image in which the same subject as the subject captured in the image included in the first image information I1 is captured at lower image quality than the image included in the first image information I1. The learning device 1010 also includes the video information generation unit 1012, which cuts out a plurality of images at different positions from part of the first image information I1 and combines the plurality of cutout images to generate the first video information M1. Similarly, the video information generation unit 1012 cuts out a plurality of images at different positions from part of the second image information I2 and combines the plurality of cutout images to generate the second video information M2. The learning device 1010 further includes the learning unit 1013, which trains the model to infer a high-quality video from a low-quality video based on teacher data TD that includes the first video information M1 and the second video information M2 generated by the video information generation unit 1012. That is, according to the present embodiment, the learning device 1010 can generate teacher data containing a low-quality video and a high-quality video from still images, without having to capture videos as was conventionally required. Therefore, according to the present embodiment, teacher data for inferring a high-quality video from a low-quality video can be generated easily.
According to the present embodiment, the learning device 1010 can also generate a plurality of different videos from the same still image. Therefore, generating a large amount of teacher data TD does not require preparing a correspondingly large number of still images; many videos can be generated from a small number of still images. Accordingly, the present embodiment can shorten the time required to capture images for use in learning.
According to the embodiment described above, the second image information I2 includes a plurality of images in which the same subject as the subject captured in the image included in the first image information I1 is captured, each with different noise superimposed. The video information generation unit 1012 generates the second video information M2 by cutting out a different part from each of the plurality of images included in the second image information I2. That is, according to the present embodiment, a noise-superimposed low-quality video is generated from a plurality of different noise-superimposed low-quality images. Therefore, the second video information M2 generated according to the present embodiment has different noise superimposed on each frame, and a noise-superimposed low-quality video can be reproduced and generated more faithfully.
According to the embodiment described above, the plurality of images included in the second image information I2 are images captured at different, closely spaced times. That is, the low-quality images for generating the low-quality video are captured at close times; a close time may be, for example, one sixtieth of a second. Here, unlike a still image, a video may have video-specific noise with a temporal component superimposed on it, and images captured at different, closely spaced times contain this video-specific noise. Therefore, according to the present embodiment, since the learning device 1010 generates a video based on images captured at different, closely spaced times, video-specific noise having a temporal component can be reproduced in the generated video.
According to the embodiment described above, the video information generation unit 1012 generates the first video information M1 by cutting out different parts from a single image included in the first image information I1. That is, according to the present embodiment, the high-quality video is generated from a single image. Therefore, according to the present embodiment, a high-quality video can be generated easily without capturing many high-quality images.
According to the embodiment described above, the video information generation unit 1012 cuts out a plurality of images at different positions by shifting the cutout positions by different amounts in a predetermined direction. That is, according to the present embodiment, the learning device 1010 cuts out an image and then shifts it in a predetermined direction. In other words, after cutting out the images, the learning device 1010 does not need to process the large source image; it performs processing based on the small cutout images. Therefore, according to the present embodiment, the learning device 1010 can lighten the processing load.
According to the embodiment described above, the video information generation unit 1012 cuts out a plurality of images at positions shifted by a predetermined number of bits in a predetermined direction, and generates a video by connecting the cutout images. That is, the subject captured in the video generated by the video information generation unit 1012 appears to move in the predetermined direction within the video. Therefore, according to the present embodiment, a video can be generated easily from a still image.
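As an illustration only, the following Python sketch produces cutout origins that advance by a fixed step per frame in one direction, so that the subject appears to move across the generated clip; the step size, the normalization of the direction vector, and the function name are assumptions.

    import numpy as np

    def shifted_cutout_origins(start, direction, step, num_frames):
        # start: (x, y) origin of the first cutout
        # direction: vector giving the predetermined direction of motion
        # step: shift amount applied per frame
        start = np.asarray(start, dtype=float)
        direction = np.asarray(direction, dtype=float)
        direction = direction / np.linalg.norm(direction)
        frames = np.arange(num_frames)[:, None]
        return (start + frames * step * direction).round().astype(int)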
According to the embodiment described above, the predetermined direction in which the video information generation unit 1012 cuts out images is calculated by an affine transformation. The predetermined direction in which the video information generation unit 1012 cuts out images is, in other words, the direction in which the subject moves in the video. Therefore, according to the present embodiment, the learning device 1010 can generate videos in which the subject moves in various directions.
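One way such a direction could be obtained from an affine transformation is sketched below in Python; the particular rotation-plus-scaling matrix, the angle, and the idea of transforming a set of base cutout origins are assumptions used only to illustrate applying an affine map, and do not reproduce the embodiment's actual calculation.

    import numpy as np

    def affine_transform_origins(origins, angle_deg=15.0, scale=1.0, translation=(0.0, 0.0)):
        # Apply a 2D affine map (rotation, scaling, translation) to cutout
        # origins so that the apparent motion direction of the subject changes.
        origins = np.asarray(origins, dtype=float)   # (N, 2) array of (x, y)
        theta = np.deg2rad(angle_deg)
        a = scale * np.array([[np.cos(theta), -np.sin(theta)],
                              [np.sin(theta),  np.cos(theta)]])
        t = np.asarray(translation, dtype=float)
        return (origins @ a.T + t).round().astype(int)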
According to the embodiment described above, the learning device 1010 further includes the trajectory vector acquisition unit 1014 and thereby acquires the trajectory vector TV. The predetermined direction in which the video information generation unit 1012 cuts out images is calculated based on the acquired trajectory vector TV. The trajectory vector TV is information about a vector indicating the trajectory along which a subject actually moves in an actually captured video. Therefore, according to the present embodiment, a video can be generated based on the trajectory along which a subject actually moves.
[Sixth embodiment]
Next, a sixth embodiment will be described with reference to FIGS. 19 and 20. Whereas the fifth embodiment required both high-quality images and low-quality images to create the teacher data TD, the sixth embodiment differs from the fifth embodiment in that it requires only high-quality images.
FIG. 19 is a diagram for explaining an overview of the learning system according to the sixth embodiment. An overview of a learning system 1001B according to the sixth embodiment will be described with reference to this figure. In the description of the figure, components similar to those of the fifth embodiment are given the same reference signs and their description may be omitted. In the sixth embodiment, the imaging device 1020 captures a high-quality image 1031. A low-quality image 1032 is generated from the high-quality image 1031 by the learning device 1010B according to the sixth embodiment; for example, the low-quality image 1032 is generated by applying image processing to the high-quality image 1031 to superimpose noise on it. That is, according to the present embodiment, the imaging device 1020 captures only the high-quality image 1031 and does not need to capture the low-quality image 1032.
FIG. 20 is a diagram showing an example of the functional configuration of the video information generation unit according to the sixth embodiment. The video information generation unit 1012B included in the learning device 1010B will be described with reference to this figure. The learning device 1010B according to the sixth embodiment differs from the learning device 1010 in that it includes a video information generation unit 1012B instead of the video information generation unit 1012. The video information generation unit 1012B includes a cutout unit 1121, a noise superimposition unit 1123, a first video information generation unit 1125, and a second video information generation unit 1127.
The cutout unit 1121 acquires an image from the image acquisition unit 1011. In the present embodiment, since the learning device 1010B acquires a high-quality image from the imaging device 1020, the cutout unit 1121 acquires the high-quality image from the image acquisition unit 1011. The cutout unit 1121 cuts out a plurality of cutout images CI, which are parts of the acquired high-quality image at different position coordinates, and outputs the cutout images CI to the first video information generation unit 1125 and the noise superimposition unit 1123.
The noise superimposition unit 1123 acquires the cutout images CI cut out by the cutout unit 1121 and superimposes noise on them. The noise superimposition unit 1123 acquires the plurality of cutout images CI cut out at the plurality of position coordinates and superimposes noise on each of the acquired cutout images CI. The noise superimposed by the noise superimposition unit 1123 may be modeled in advance. Examples of modeled noise include shot noise due to fluctuations in the number of photons, noise generated when light incident on the image sensor is converted into electrons, noise generated when the converted electrons are converted into analog voltage values, and noise generated when the converted analog voltage values are converted into digital signals. The intensity of the superimposed noise may be adjusted by a predetermined method. It is preferable that the noise superimposition unit 1123 superimposes different noise on each of the plurality of cutout images CI. The noise superimposition unit 1123 outputs the noise-superimposed images to the second video information generation unit 1127 as noise images NI.
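Sensor noise of the kinds listed above is often approximated with a Poisson-Gaussian model. The sketch below (Python, numpy) is such an approximation and not the embodiment's actual noise model: shot noise is drawn from a Poisson distribution on an electron-count estimate, read noise is added as Gaussian noise, and the result is quantized to emulate analog-to-digital conversion. The gain, read_sigma, and bit-depth values are placeholder assumptions whose strength could be adjusted as described.

    import numpy as np

    def superimpose_noise(clean, gain=4.0, read_sigma=2.0, bits=10, rng=None):
        # clean: cutout image CI with pixel values normalized to [0, 1]
        rng = np.random.default_rng() if rng is None else rng
        full_scale = 2 ** bits - 1
        electrons = clean * full_scale / gain
        shot = rng.poisson(electrons) * gain              # photon shot noise
        read = rng.normal(0.0, read_sigma, clean.shape)   # readout / conversion noise
        noisy = np.clip(np.round(shot + read), 0, full_scale)  # A/D quantization
        return noisy / full_scale                         # noise image NI in [0, 1]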
The first video information generation unit 1125 acquires the plurality of cutout images CI from the cutout unit 1121, combines the plurality of cutout images to generate the first video information M1, and outputs the generated first video information M1 to the learning unit 1013.
The second video information generation unit 1127 acquires the plurality of noise images NI from the noise superimposition unit 1123, combines the plurality of noise-superimposed noise images NI to generate the second video information M2, and outputs the generated second video information M2 to the learning unit 1013.
The learning unit 1013 acquires the first video information M1 from the first video information generation unit 1125 and the second video information M2 from the second video information generation unit 1127, and trains the learning model 1040 based on the first video information M1 and the second video information M2 generated by the video information generation unit 1012B.
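Putting the cutout unit, the noise superimposition unit, and the two video information generation units together, a minimal Python sketch of how a pair corresponding to (first video information M1, second video information M2) could be produced from a single high-quality image is shown below; it reuses the helper functions shifted_cutout_origins and superimpose_noise sketched earlier, which are illustrative assumptions rather than the actual units 1121, 1123, 1125, and 1127.

    import numpy as np

    def make_video_pair(high_img, start, direction, step, num_frames=8, size=64, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        origins = shifted_cutout_origins(start, direction, step, num_frames)
        clean_patches = [high_img[y:y + size, x:x + size] for x, y in origins]
        # a fresh noise realization for every frame, as the embodiment prefers
        noisy_patches = [superimpose_noise(p, rng=rng) for p in clean_patches]
        m1 = np.stack(clean_patches)   # high-quality video (first video information M1)
        m2 = np.stack(noisy_patches)   # low-quality video (second video information M2)
        return m1, m2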
[Summary of the sixth embodiment]
According to the embodiment described above, the learning device 1010B includes the image acquisition unit 1011 and thereby acquires image information I including at least one high-quality image. The learning device 1010B also includes the video information generation unit 1012B and thereby generates both a high-quality video and a low-quality video from the high-quality image. The video information generation unit 1012B includes the cutout unit 1121, which cuts out a plurality of images at different positions from part of the acquired image information I, and the noise superimposition unit 1123, which superimposes noise on each of the plurality of images cut out by the cutout unit 1121. The video information generation unit 1012B further includes the first video information generation unit 1125, which combines the plurality of images cut out by the cutout unit 1121 to generate the first video information M1, a high-quality video, and the second video information generation unit 1127, which combines the plurality of noise-superimposed images to generate the second video information M2. The learning device 1010B also includes the learning unit 1013, which trains the model to infer a high-quality video from a low-quality video based on teacher data TD including the first video information M1 generated by the first video information generation unit 1125 and the second video information M2 generated by the second video information generation unit 1127. That is, according to the learning device 1010B, a high-quality video and a low-quality video are generated from a single high-quality image, and the learning model 1040 is trained to infer the high-quality video from the low-quality video. Inferring a high-quality video from a low-quality video is, in other words, noise removal. Therefore, according to the present embodiment, a noise removal model can be trained easily without spending time acquiring the teacher data TD.
In the sixth embodiment, a high-quality video is generated from a high-quality image, a low-quality image is generated by superimposing noise on the high-quality image, and a low-quality video is generated based on the generated low-quality image. However, the present embodiment is not limited to this example. For example, as a modification of the present embodiment, the learning device 1010 may create the teacher data TD based only on low-quality images. That is, a low-quality video may be generated from a low-quality image, a high-quality image may be generated by removing noise from the low-quality video, and a high-quality video may be generated based on the generated high-quality image. The number of images used to generate a video may be one or more than one.
The learning device 1010 and the learning device 1010A described in the fifth embodiment and the learning device 1010B described in the sixth embodiment have been described as examples used to train the learning model 1040 that infers a high-quality video from a low-quality video, but the present invention is not limited to these examples. For example, the learning model 1040 may be configured to have a function of detecting a specific subject such as a person in the high-quality video after inferring the high-quality video from the low-quality video, or may be configured to have a function of recognizing characters on signs, signboards, and the like in the high-quality video. That is, the high-quality video inferred by the learning model 1040 is not limited to a video for viewing, and may be used for purposes such as object detection.
Conventionally, to improve the generalization performance of a learning model, it has been preferable to include as many of the expected scenes as possible in the teacher data. In other words, ideal teacher data is a video that includes as many of the expected subject movements as possible. On the other hand, acquiring such teacher data by actual shooting is difficult and requires enormous cost and time. By using the present embodiment to train a learning model, the cost and time required to collect teacher data can be greatly reduced, and the generalization performance of the learning model can be improved.
Although the modes for carrying out the present invention have been described above using embodiments, the present invention is not limited to these embodiments in any way, and various modifications and substitutions can be made without departing from the spirit of the present invention.
DESCRIPTION OF REFERENCE SIGNS 1: high-quality video generation system, 2: image processing device, 10: input information generation device, 11: image acquisition unit, 12: input conversion unit, 13: synthesis unit, 14: output unit, 15: integration unit, 16: average value temporary storage unit, 17: imaging condition acquisition unit, 18: adjustment unit, 19: comparison unit, 100: imaging device, 200: CNN, 210: input layer, 220: convolution layer, 230: pooling layer, 240: output layer, IM: video information, IN: input information, TF: target frame, AF: adjacent frame, IF: integrated frame, M1: first memory, M2: second memory, IMD: image information, IND: input data, CD: composite data, SV: stored value, CV: calculated value, CR: comparison result, 1001: learning system, 1010: learning device, 1011: image acquisition unit, 1012: video information generation unit, 1013: learning unit, 1014: trajectory vector acquisition unit, 1020: imaging device, 1031: high-quality image, 1032: low-quality image, 1033: high-quality video, 1034: low-quality video, 1040: learning model, 1050: trajectory vector generation device, TD: teacher data, I: image information, I1: first image information, I2: second image information, M: video information, M1: first video information, M2: second video information, TV: trajectory vector, 1121: cutout unit, 1123: noise superimposition unit, 1125: first video information generation unit, 1127: second video information generation unit

Claims (25)

  1.  An input information generation device comprising:
      an image acquisition unit that acquires, as input images, a plurality of frames including at least a target frame for which input information is to be generated, among frames constituting a video;
      an input conversion unit that converts pixel values of the acquired plurality of frames of the input images into input data having a smaller number of bits than the number of bits representing the pixel values of the input images;
      a synthesis unit that synthesizes the plurality of pieces of converted input data into one piece of composite data; and
      an output unit that outputs the synthesized composite data.
  2.  The input information generation device according to claim 1, wherein
      the video is a color video,
      the image acquisition unit acquires the pixel values of each color from one frame as a plurality of different images, and
      the input conversion unit converts each of the plurality of images acquired from the one frame into the input data.
  3.  The input information generation device according to claim 1, wherein the image acquisition unit acquires, among the frames constituting the video, images of a plurality of frames successively adjacent to the target frame both before and after the target frame.
  4.  The input information generation device according to claim 1, wherein, after the output unit outputs the composite data for the target frame, the image acquisition unit takes a frame adjacent to that target frame as a new target frame and acquires, as the input images, a plurality of frames including at least the new target frame.
  5.  The input information generation device according to claim 1, further comprising an integration unit that integrates, among the plurality of frames acquired by the image acquisition unit, a plurality of adjacent frames, which are frames other than the target frame, into one integrated frame by performing an operation based on pixel values of the adjacent frames, wherein
      the input conversion unit converts the pixel values of the target frame into the input data and further converts the pixel values of the integrated frame into the input data, and
      the synthesis unit synthesizes the plurality of pieces of input data converted based on the target frame and the plurality of pieces of input data converted based on the integrated frame into one piece of composite data.
  6.  The input information generation device according to claim 5, wherein the integration unit sets an average value of the pixel values of the plurality of adjacent frames as the pixel value of the integrated frame.
  7.  The input information generation device according to claim 6, wherein the integration unit calculates the pixel value of the integrated frame by calculating a weighted average of the plurality of adjacent frames according to their temporal distance from the target frame.
  8.  The input information generation device according to claim 6, wherein the integration unit excludes, among the frames constituting the video, frames with a large luminance change from the calculation of the average value.
  9.  The input information generation device according to claim 5, further comprising an average value temporary storage unit that stores an average value of pixel values of predetermined frames among the frames constituting the video, wherein the integration unit calculates the pixel value of the integrated frame by an operation based on the value stored in the average value temporary storage unit and the target frame.
  10.  The input information generation device according to claim 9, further comprising:
      an imaging condition acquisition unit that acquires an imaging condition of the video; and
      an adjustment unit that adjusts the average value stored in the average value temporary storage unit according to the acquired imaging condition.
  11.  The input information generation device according to claim 9, further comprising a comparison unit that compares the value stored in the average value temporary storage unit with the pixel value of the target frame, wherein, when the difference resulting from the comparison by the comparison unit is equal to or less than a predetermined value, the integration unit calculates the pixel value of the integrated frame by calculating a moving average based on the value stored in the average value temporary storage unit and the target frame, and when the difference is not equal to or less than the predetermined value, the integration unit sets the pixel value of the target frame as the pixel value of the integrated frame.
  12.  The input information generation device according to claim 5, wherein the integration unit sets a randomly selected frame among the adjacent frames acquired by the image acquisition unit as the integrated frame.
  13.  An image processing device comprising:
      the input information generation device according to any one of claims 1 to 12; and
      a convolutional neural network that takes, as input information, the composite data output by the input information generation device.
  14.  An input information generation method comprising:
      an image acquisition step of acquiring, as input images, a plurality of frames including at least a target frame for which input information is to be generated, among frames constituting a video;
      an input conversion step of converting pixel values of the acquired plurality of frames of the input images into input data having a smaller number of bits than the number of bits representing the pixel values of the input images;
      a synthesis step of synthesizing the plurality of pieces of converted input data into one piece of composite data; and
      an output step of outputting the synthesized composite data.
  15.  A learning device comprising:
      an image acquisition unit that acquires first image information including at least one image, and second image information including at least one image in which the same subject as the subject captured in the image included in the first image information is captured at lower image quality than the image included in the first image information;
      a video information generation unit that cuts out a plurality of images at different positions from part of the acquired first image information and combines the plurality of cutout images to generate first video information, and cuts out a plurality of images at different positions from part of the acquired second image information and combines the plurality of cutout images to generate second video information; and
      a learning unit that performs training to infer a high-quality video from a low-quality video based on teacher data including the first video information and the second video information generated by the video information generation unit.
  16.  The learning device according to claim 15, wherein
      the second image information includes a plurality of images in which the same subject as the subject captured in the image included in the first image information is captured, each with different noise superimposed, and
      the video information generation unit generates the second video information by cutting out a different part from each of the plurality of images included in the second image information.
  17.  The learning device according to claim 16, wherein the plurality of images included in the second image information are images captured at different, closely spaced times.
  18.  The learning device according to claim 15 or 16, wherein the video information generation unit generates the first video information by cutting out different parts from one image included in the first image information.
  19.  The learning device according to claim 15 or 16, wherein the video information generation unit cuts out a plurality of images at different positions by shifting the cutout images in a predetermined direction.
  20.  The learning device according to claim 15 or 16, wherein the video information generation unit cuts out a plurality of images at positions shifted by a predetermined number of bits in a predetermined direction.
  21.  The learning device according to claim 19, wherein the predetermined direction in which the video information generation unit cuts out images is calculated by an affine transformation.
  22.  The learning device according to claim 19, further comprising a trajectory vector acquisition unit that acquires a trajectory vector, wherein the predetermined direction in which the video information generation unit cuts out images is calculated based on the acquired trajectory vector.
  23.  A learning device comprising:
      an image acquisition unit that acquires image information including at least one image;
      a cutout unit that cuts out a plurality of images at different positions from part of the acquired image information;
      a first video information generation unit that combines the plurality of cutout images to generate first video information;
      a noise superimposition unit that superimposes noise on each of the plurality of images cut out by the cutout unit;
      a second video information generation unit that combines the plurality of images on which noise has been superimposed by the noise superimposition unit to generate second video information; and
      a learning unit that performs training to infer a high-quality video from a low-quality video based on teacher data including the first video information generated by the first video information generation unit and the second video information generated by the second video information generation unit.
  24.  A program causing a computer to execute:
      an image acquisition step of acquiring first image information including at least one image, and second image information including at least one image in which the same subject as the subject captured in the image included in the first image information is captured at lower image quality than the image included in the first image information;
      a video information generation step of cutting out a plurality of images at different positions from part of the acquired first image information and combining the plurality of cutout images to generate first video information, and cutting out a plurality of images at different positions from part of the acquired second image information and combining the plurality of cutout images to generate second video information; and
      a learning step of performing training to infer a high-quality video from a low-quality video based on teacher data including the first video information and the second video information generated in the video information generation step.
  25.  A learning method for a noise reduction device, comprising:
      an image acquisition step of acquiring first image information including at least one image, and second image information including at least one image in which the same subject as the subject captured in the image included in the first image information is captured at lower image quality than the image included in the first image information;
      a video information generation step of cutting out a plurality of images at different positions from part of the acquired first image information and combining the plurality of cutout images to generate first video information, and cutting out a plurality of images at different positions from part of the acquired second image information and combining the plurality of cutout images to generate second video information; and
      a learning step of performing training to infer a high-quality video from a low-quality video based on teacher data including the first video information and the second video information generated in the video information generation step.
PCT/JP2023/021204 2022-08-31 2023-06-07 Input information generation device, image processing device, input information generation method, learning device, program, and learning method for noise reduction device WO2024047994A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2022-137843 2022-08-31
JP2022137834A JP2024033913A (en) 2022-08-31 2022-08-31 Input information generation device, image processing device, and input information generation method
JP2022-137834 2022-08-31
JP2022137843A JP2024033920A (en) 2022-08-31 2022-08-31 Learning device, program, and learning method for noise reduction device

Publications (1)

Publication Number Publication Date
WO2024047994A1 true WO2024047994A1 (en) 2024-03-07

Family

ID=90099234

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/021204 WO2024047994A1 (en) 2022-08-31 2023-06-07 Input information generation device, image processing device, input information generation method, learning device, program, and learning method for noise reduction device

Country Status (1)

Country Link
WO (1) WO2024047994A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011171843A (en) * 2010-02-16 2011-09-01 Fujifilm Corp Image processing method and apparatus, and program
JP2019106059A (en) * 2017-12-13 2019-06-27 日立オートモティブシステムズ株式会社 Calculation system, server, and on-vehicle device
JP2020010331A (en) * 2018-07-03 2020-01-16 株式会社ユビタス Method for improving image quality
CN113055674A (en) * 2021-03-24 2021-06-29 电子科技大学 Compressed video quality enhancement method based on two-stage multi-frame cooperation


Similar Documents

Publication Publication Date Title
CN110728648B (en) Image fusion method and device, electronic equipment and readable storage medium
US8189960B2 (en) Image processing apparatus, image processing method, program and recording medium
US10021313B1 (en) Image adjustment techniques for multiple-frame images
US8077214B2 (en) Signal processing apparatus, signal processing method, program and recording medium
JP4898761B2 (en) Apparatus and method for correcting image blur of digital image using object tracking
KR101663227B1 (en) Method and apparatus for processing image
US20140218550A1 (en) Image capturing device and image processing method thereof
CN105141841B (en) Picture pick-up device and its method
CN104935826A (en) Method for processing a video sequence, corresponding device, computer program and non-transitory computer-readable medium
US11184553B1 (en) Image signal processing in multi-camera system
CN113170158B (en) Video encoder and encoding method
JP4916378B2 (en) Imaging apparatus, image processing apparatus, image file, and gradation correction method
CN116711317A (en) High dynamic range technique selection for image processing
JP2010220207A (en) Image processing apparatus and image processing program
US20180197282A1 (en) Method and device for producing a digital image
WO2020090176A1 (en) Image processing device and image processing method
JP4879363B1 (en) Image processing system
Jeelani et al. Expanding synthetic real-world degradations for blind video super resolution
WO2024047994A1 (en) Input information generation device, image processing device, input information generation method, learning device, program, and learning method for noise reduction device
CN112750092A (en) Training data acquisition method, image quality enhancement model and method and electronic equipment
EP4167134A1 (en) System and method for maximizing inference accuracy using recaptured datasets
CN109218602B (en) Image acquisition device, image processing method and electronic device
JP5202277B2 (en) Imaging device
JP2024033913A (en) Input information generation device, image processing device, and input information generation method
JP4462017B2 (en) Defect detection and correction apparatus, imaging apparatus, and defect detection and correction method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23859764

Country of ref document: EP

Kind code of ref document: A1