WO2020059581A1 - Image processing device, image processing method, and image processing program - Google Patents

Image processing device, image processing method, and image processing program

Info

Publication number
WO2020059581A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame group
group
frame
unit
data
Prior art date
Application number
PCT/JP2019/035631
Other languages
French (fr)
Japanese (ja)
Inventor
忍 工藤
翔太 折橋
正樹 北原
清水 淳
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to US17/273,366 priority Critical patent/US11516515B2/en
Priority to JP2020548382A priority patent/JP7104352B2/en
Publication of WO2020059581A1 publication Critical patent/WO2020059581A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/423Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation characterised by memory arrangements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/80Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/12Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/154Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/177Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a group of pictures [GOP]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/423Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation characterised by memory arrangements
    • H04N19/426Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation characterised by memory arrangements using memory downsizing methods
    • H04N19/428Recompression, e.g. by spatial or temporal decimation

Definitions

  • the present invention relates to an image processing device, an image processing method, and an image processing program.
  • Priority is claimed on Japanese Patent Application No. 2018-174982 filed on September 19, 2018, the content of which is incorporated herein by reference.
  • One of the methods for encoding an image is a method using an auto encoder (self-encoder).
  • the image referred to here includes a still image and a moving image (hereinafter, referred to as “video”).
  • the auto-encoder is a three-layer neural network including an input layer (encoder), a hidden layer, and an output layer (decoder).
  • Auto-encoders are designed to encode input data into encoded data with an encoder and restore the encoded data into input data with a decoder.
  • the encoder and the decoder are constructed by an arbitrary arithmetic unit.
  • the encoder is configured by a plurality of arithmetic units that perform a convolution operation
  • the decoder is configured by a plurality of arithmetic units that perform an inverse operation to the convolution operation by the encoder.
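  • As an illustration of this structure, the following is a minimal PyTorch sketch of such an auto-encoder (an assumption chosen for readability, not the configuration claimed here): the encoder reduces the input with strided convolutions, and the decoder restores it with the corresponding transposed (inverse) convolutions.

```python
# Hypothetical sketch of a convolutional auto-encoder; layer sizes are assumptions.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: convolution operations that reduce the spatial dimensions.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
        )
        # Decoder: inverse operations to the encoder's convolutions.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        code = self.encoder(x)     # encoded data
        return self.decoder(code)  # restored input
```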
  • In computation by a neural network, expressive ability and performance are expected to improve as the number of parameters increases.
  • However, when the input data is, for example, a high-resolution image, increasing the number of parameters makes the memory capacity required for the computation enormous. Therefore, it is not realistic to improve expressive ability and performance by increasing the number of parameters.
  • Therefore, a method is conceivable in which the input data is divided into a plurality of pieces of data of a computable size, each piece is processed by a neural network, and the output decoded data are combined to restore the original input data.
  • However, in this method, the divided pieces of data are processed independently of one another. Therefore, the restored input data does not maintain continuity between adjacent decoded data, particularly at the boundaries of the divided image, and is likely to be an unnatural image.
  • The above-described prior art has the problem of lacking random accessibility.
  • Random accessibility refers to the property that desired data can easily be obtained even when the data is accessed discretely.
  • In the related art, for example, when the input data is video data, encoding and decoding are performed sequentially from the beginning of the video data. In this case, even when only the decoded data at a desired position of the video data is wanted, the decoded data at that position cannot be obtained unless decoding is performed sequentially from the beginning of the video data.
  • the above-described conventional technology has a problem of lack of parallelism. In the related art, since arithmetic processing is performed recursively, it is difficult to perform parallel processing. For this reason, it is difficult for the conventional technology to efficiently perform arithmetic processing using a distributed processing system or the like.
  • the present invention has been made in view of such a situation, and an object of the present invention is to provide a technology capable of performing encoding and decoding with random access to image data and parallelism.
  • One embodiment of the present invention is an image processing device that performs correction for each frame group composed of a predetermined number of frames into which video data has been divided, the device including a decoding unit that obtains a corrected frame group by performing correction on a second frame group, which is a frame group temporally continuous with a first frame group, using a feature amount of the first frame group. The decoding unit performs the correction so that the subjective image quality based on the relationship between the second frame group and a frame group temporally later than the second frame group is increased, and so that a predetermined classifier determines that a frame group obtained by combining the second frame group and the frame group temporally later than the second frame group is the same as a frame group obtained by combining the corrected frame group and a corrected frame group obtained by correcting the frame group temporally later than the second frame group.
  • One aspect of the present invention is the image processing device described above, wherein, in the correction, the decoding unit gives a heavier weight to the feature amount of a frame the temporally closer the frame is to the frame group temporally later than the second frame group.
  • One embodiment of the present invention is an image processing device that performs correction for each frame group composed of a predetermined number of frames into which video data is divided, the device including a decoding unit that obtains a corrected frame group by performing correction on a second frame group using the feature amount of a first frame group, which is a frame group temporally earlier than and temporally continuous with the second frame group, and the feature amount of a third frame group, which is a frame group temporally later than and temporally continuous with the second frame group.
  • One aspect of the present invention is the above-described image processing device, wherein the decoding unit performs the correction so that the subjective image quality based on the relationship between the corrected frame group and the first frame group and on the relationship between the corrected frame group and the third frame group is increased.
  • One embodiment of the present invention is the above-described image processing device, wherein the decoding unit performs the correction based on parameter values updated by a learning process based on frame groups obtained by dividing video data different from the video data.
  • One embodiment of the present invention is the above-described image processing device, wherein the learning process includes: a step of acquiring sample data including at least three temporally continuous frame groups; a step of inputting each frame group to a first learning model to obtain the feature amount of the frame group; a step of inputting the feature amount of the frame group to a second learning model to obtain a corrected frame group corresponding to the frame group; a step of calculating a loss value based on the sample data, the feature amounts of the frame groups, the corrected frame groups, and a predetermined loss function; and a step of updating the parameter values using the loss value.
  • One aspect of the present invention is an image processing device that performs correction for each partial data group composed of a predetermined number of pieces of partial data into which data has been divided, the device including a decoding unit that obtains a corrected partial data group by performing correction on a second partial data group, which is a partial data group temporally continuous with a first partial data group, using a feature amount of the first partial data group. The decoding unit performs the correction so that the subjective image quality based on the relationship between the second partial data group and a partial data group temporally later than the second partial data group is increased, and so that a predetermined classifier determines that a partial data group obtained by combining the second partial data group and the partial data group temporally later than the second partial data group is the same as a partial data group obtained by combining the corrected partial data group and a corrected partial data group obtained by correcting the partial data group temporally later than the second partial data group.
  • One aspect of the present invention is an image processing method for performing correction for each frame group composed of a predetermined number of frames into which video data is divided, the method including a step of obtaining a corrected frame group by performing correction on a second frame group, which is a frame group temporally continuous with a first frame group, using a feature amount of the first frame group, the correction being performed so that the subjective image quality based on the relationship between the second frame group and a frame group temporally later than the second frame group is increased, and so that a predetermined classifier determines that a frame group obtained by combining the second frame group and the frame group temporally later than the second frame group is the same as a frame group obtained by combining the corrected frame group and a corrected frame group obtained by correcting the frame group temporally later than the second frame group.
  • Another embodiment of the present invention is an image processing program for causing a computer to function as the above image processing device.
  • According to the present invention, encoding and decoding with random accessibility and parallelism can be performed on image data.
  • FIG. 1 is an overall configuration diagram of a video encoding / decoding system 1 according to a first embodiment.
  • FIG. 2 is a configuration diagram of an encoding unit 120 of the video encoding / decoding system 1 according to the first embodiment.
  • FIG. 3 is a configuration diagram of a decoding unit 210 of the video encoding / decoding system 1 according to the first embodiment.
  • FIG. 4 is a configuration diagram of a decoding unit of a video encoding / decoding system according to a conventional technique.
  • FIG. 5 is a flowchart illustrating an operation of the video encoding device 10 according to the first embodiment.
  • FIG. 6 is a configuration diagram of a dimension compression unit 121 of the video encoding / decoding system 1 according to the first embodiment.
  • FIG. 7 is a flowchart illustrating an operation of the video decoding device 20 according to the first embodiment.
  • FIG. 8 is a configuration diagram of a dimension expansion unit 212 of the video encoding / decoding system 1 according to the first embodiment.
  • FIG. 9 is a configuration diagram of a correction unit 214 of the video encoding / decoding system 1 according to the first embodiment.
  • FIG. 10 is a schematic diagram for explaining a learning process performed by the video encoding / decoding system 1 according to the first embodiment.
  • FIG. 11 is a configuration diagram of a decoding unit 210a of a video encoding / decoding system according to a second embodiment.
  • FIG. 12 is a flowchart illustrating an operation of the video decoding device according to the second embodiment.
  • FIG. 13 is a schematic diagram for explaining a learning process performed by the video encoding / decoding system according to the second embodiment.
  • FIG. 14 is a schematic diagram for explaining a learning process performed by a video encoding / decoding system according to the related art.
  • FIG. 15 is a schematic diagram for explaining a learning process performed by a video encoding / decoding system according to the related art.
  • Hereinafter, a video encoding / decoding system 1 that encodes and decodes video data will be described.
  • The system is also applicable to encoding and decoding image data other than video data.
  • FIG. 1 is an overall configuration diagram of a video encoding / decoding system 1 (image processing device) according to the first embodiment.
  • the video encoding / decoding system 1 acquires input video data to be encoded, and outputs decoded video data corresponding to the input video data.
  • the video encoding / decoding system 1 includes a video encoding device 10 and a video decoding device 20.
  • the video encoding device 10 includes a video dividing unit 110 and an encoding unit 120.
  • the video division unit 110 acquires input video data.
  • the input video data is composed of a plurality of temporally continuous frames.
  • the video dividing unit 110 generates a plurality of input frame groups by dividing a plurality of continuous frames constituting the obtained input video data by a predetermined number of frames.
  • the video division unit 110 sequentially outputs the generated plurality of input frame groups to the encoding unit 120.
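  • For illustration, the division into input frame groups could be sketched as follows; this is a hypothetical helper, since the description only requires division by a predetermined number of frames.

```python
import numpy as np

def divide_video(frames: np.ndarray, n: int) -> list:
    """Divide video data of shape (Z, Y, X) into frame groups of n frames each."""
    return [frames[i:i + n] for i in range(0, len(frames), n)]

# Example: 1 second of 30 fps video split into groups of 5 frames.
video = np.zeros((30, 720, 1280), dtype=np.uint8)
groups = divide_video(video, 5)  # 6 input frame groups S_1 ... S_6
```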
  • FIG. 2 shows the configuration of encoding section 120.
  • the encoding unit 120 includes a dimensional compression unit 121 and a quantization / entropy encoding unit 122.
  • the dimensional compression unit 121 acquires the input frame group output from the video division unit 110.
  • the dimension compression unit 121 generates a compressed frame group by compressing the acquired input frame group so as to reduce the number of dimensions.
  • the dimensional compression unit 121 outputs the generated compressed frame group to the quantization / entropy encoding unit 122.
  • the quantization / entropy encoding unit 122 acquires the compressed frame group output from the dimensional compression unit 121.
  • The quantization / entropy coding unit 122 performs quantization and entropy coding on each of the values of the compressed frames constituting the obtained compressed frame group. Then, the quantization / entropy encoding unit 122 generates encoded data by connecting the quantized and entropy-encoded compressed frame groups.
  • the quantization / entropy coding unit 122 outputs the generated coded data to a decoding unit 210 of the video decoding device 20, which will be described later.
  • the video decoding device 20 includes a decoding unit 210 and a video combining unit 220.
  • FIG. 3 shows the configuration of the decoding unit 210.
  • the decoding unit 210 includes an entropy decoding unit 211, a dimension expansion unit 212, an intermediate data memory 213, and a correction unit 214.
  • the entropy decoding unit 211 acquires the encoded data output from the quantization / entropy encoding unit 122 of the encoding unit 120.
  • the entropy decoding unit 211 generates entropy decoded data by performing entropy decoding on the obtained encoded data.
  • the entropy decoding unit 211 outputs the generated entropy decoding data to the dimension expansion unit 212.
  • The dimension expansion unit 212 generates expanded decoded data by expanding the entropy decoded data output from the entropy decoding unit 211 until its number of dimensions becomes the same as that of the input frame group (before compression by the dimension compression unit 121).
  • the dimension expansion unit 212 outputs the generated expanded decoded data to the intermediate data memory 213 and the correction unit 214.
  • the intermediate data memory 213 acquires and stores the decompressed decoded data output from the dimension decompression unit 212. Note that the decompressed decoded data stored in the intermediate data memory 213 is hereinafter referred to as “intermediate data”. The intermediate data is output to the correction unit 214 as needed.
  • the intermediate data memory 213 is a volatile recording medium such as a RAM (Random Access Memory).
  • the correction unit 214 acquires the expanded decoded data output from the dimension expansion unit 212. Further, the correction unit 214 acquires the intermediate data stored in the intermediate data memory 213. The correction unit 214 generates a decoded frame group by correcting the decompressed decoded data using the intermediate data. The correction unit 214 outputs the generated decoded frame group to the video combining unit 220.
  • the video combining unit 220 acquires the decoded frame group output from the decoding unit 210.
  • the video combining unit 220 generates decoded video data by combining the acquired decoded frame groups.
  • the video combining unit 220 outputs the generated decoded video data as final output data.
  • FIG. 4 shows the configuration of the decoding unit of the video encoding / decoding system according to the related art in order to explain the difference from the related art.
  • The difference between the configuration of the decoding unit 210 according to the first embodiment and the configuration of the decoding unit in the related art is that the decoding unit in the related art does not include a correction unit, whereas the decoding unit 210 according to the first embodiment includes the correction unit 214.
  • the dimension expansion unit of the decoding unit in the related art acquires the entropy decoded data output from the entropy decoding unit.
  • the dimension expansion unit in the related art expands the number of dimensions of the acquired entropy decoded data using the intermediate data stored in the intermediate data memory to generate a decoded frame group.
  • the correction unit 214 obtains the expanded decoded data from the dimension expansion unit 212 and obtains the intermediate data from the intermediate data memory 213. Then, the correcting unit 214 generates a decoded frame group by correcting the decompressed decoded data using the intermediate data.
  • FIG. 5 is a flowchart illustrating the operation of the video encoding device 10 according to the first embodiment.
  • the video dividing unit 110 acquires input video data S (x, y, z) in the horizontal direction x, the vertical direction y, and the time direction z.
  • The video dividing unit 110 generates a plurality of input frame groups Si (x, y, z) by dividing the obtained input video data S (x, y, z) into groups of N frames (step S101).
  • the dimensions of x, y, and z are X, Y, and Z, respectively.
  • I is an index indicating the number of the input frame group.
  • Note that the number of frames in each frame group does not necessarily have to be the same.
  • For example, frame groups composed of N frames and frame groups composed of L frames (L is a positive number different from N) may be mixed.
  • For example, a configuration may be employed in which the input video data S (x, y, z) is divided alternately into N frames and L frames, so that frame groups of N frames and frame groups of L frames are generated alternately.
  • The dimension compression unit 121 of the encoding unit 120 generates a compressed frame group by compressing each input frame group Si (x, y, z) to the number of dimensions (X', Y', N') (step S102).
  • Here, the number of dimensions (X', Y', N') satisfies X' × Y' × N' < X × Y × N.
  • the dimension compression unit 121 is configured by a neural network (combination of convolution operation, downsampling, and non-linear conversion) as shown in FIG. 6, for example.
  • FIG. 6 is a configuration diagram of the dimensional compression unit 121 of the video encoding / decoding system 1 according to the first embodiment.
  • the dimensional compression unit 121 is configured by constituent units (first-layer constituent unit 121a-1 to M-th-layer constituent unit 121a-M) composed of M layers.
  • Each component includes a convolutional layer unit c1, a downsampling unit c2, and a non-linear conversion unit c3.
  • The convolution layer section c1 of the first layer configuration section 121a-1 acquires the input frame group output from the video division section 110.
  • the convolution layer section c1 of the first layer configuration section 121a-1 performs a convolution operation on the acquired input frame group.
  • the convolution layer unit c1 outputs the group of frames on which the convolution operation has been performed to the downsampling unit c2.
  • the downsampling unit c2 of the first layer configuration unit 121a-1 acquires the frame group output from the convolutional layer unit c1.
  • the downsampling unit c2 compresses the acquired frame group so as to reduce the number of dimensions.
  • the down-sampling unit c2 outputs the compressed frame group to the nonlinear conversion unit c3.
  • the nonlinear conversion unit c3 of the first layer configuration unit 121a-1 acquires the frame group output from the downsampling unit c2.
  • the non-linear conversion unit c3 performs a non-linear conversion process on the acquired frame group.
  • the non-linear conversion unit c3 outputs the frame group subjected to the non-linear conversion processing to the convolution layer unit c1 of the next layer component (second layer component).
  • By repeating the above processing over the M layers, the dimensional compression unit 121 converts the input frame group input from the video division unit 110 into a compressed frame group with a reduced number of dimensions, and outputs it to the quantization / entropy encoding section 122.
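  • A minimal PyTorch sketch of this M-layer structure is shown below. It assumes 3-D convolutions over (z, y, x), average pooling as the downsampling, and ReLU as the non-linear conversion; the channel counts and number of layers are illustrative assumptions, not values from the description.

```python
import torch.nn as nn

class DimensionalCompressionUnit(nn.Module):
    """Sketch of the M-layer structure of FIG. 6. Each layer applies a
    convolution (c1), downsampling (c2), and a non-linear conversion (c3).
    Channel counts, M, and the pooling choice are assumptions."""

    def __init__(self, m_layers: int = 3, channels: int = 32):
        super().__init__()
        layers, in_ch = [], 1  # assume a single-channel (luminance) input
        for _ in range(m_layers):
            layers += [
                nn.Conv3d(in_ch, channels, kernel_size=3, padding=1),  # convolution layer unit c1
                nn.AvgPool3d(kernel_size=2),                           # downsampling unit c2
                nn.ReLU(),                                             # non-linear conversion unit c3
            ]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, s_i):
        # s_i: input frame group of shape (batch, 1, N, Y, X).
        # Output: compressed frame group with X' * Y' * N' < X * Y * N.
        return self.net(s_i)
```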
  • Next, the quantization / entropy coding unit 122 of the coding unit 120 performs quantization and entropy coding on each compressed frame group. Then, the quantization / entropy encoding unit 122 generates encoded data by connecting the quantized and entropy-encoded compressed frame groups (step S103).
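  • The description does not fix the quantizer or entropy coder to a particular method. As a concrete stand-in sketch, uniform quantization followed by a general-purpose DEFLATE coder (zlib, whose Huffman stage performs entropy coding) could look like this; the function names and step size are assumptions.

```python
import numpy as np
import zlib

def quantize_and_encode(compressed: np.ndarray, step: float = 0.05) -> bytes:
    """Uniform quantization followed by a general-purpose entropy coder."""
    q = np.round(compressed / step).astype(np.int16)  # quantization
    return zlib.compress(q.tobytes())                 # entropy coding (DEFLATE)

def decode_and_dequantize(data: bytes, shape: tuple, step: float = 0.05) -> np.ndarray:
    """Inverse of the above: entropy decoding followed by dequantization."""
    q = np.frombuffer(zlib.decompress(data), dtype=np.int16).reshape(shape)
    return q.astype(np.float32) * step
```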
  • the operation of the video encoding device 10 shown in the flowchart of FIG. 5 ends.
  • FIG. 7 is a flowchart illustrating the operation of the video decoding device 20 according to the first embodiment.
  • the entropy decoding unit 211 of the decoding unit 210 acquires encoded data.
  • the entropy decoding unit 211 generates entropy decoded data by performing entropy decoding on the obtained encoded data (step S111).
  • the dimension expansion unit 212 of the decoding unit 210 generates expanded decoded data by restoring the generated entropy decoded data to the original number of dimensions (before the number of dimensions is reduced by the dimensional compression unit 121). (Step S112).
  • the dimension expansion unit 212 is configured by a neural network (combination of deconvolution operation and non-linear conversion) as shown in FIG. 8, for example.
  • FIG. 8 is a configuration diagram of the dimension expansion unit 212 of the video encoding / decoding system 1 according to the first embodiment.
  • the dimensional extension unit 212 is configured by a configuration unit composed of M layers (first layer configuration unit 212a-1 to Mth layer configuration unit 212a-M). Each component includes a deconvolution layer unit c4 and a non-linear conversion unit c5.
  • the deconvolution layer unit c4 of the first layer configuration unit 212a-1 acquires the entropy decoded frame group output from the entropy decoding unit 211.
  • the deconvolution layer unit c4 performs a deconvolution operation on the obtained entropy-decoded frame group.
  • the deconvolution layer unit c4 outputs the group of frames on which the deconvolution operation has been performed to the nonlinear conversion unit c5.
  • the nonlinear conversion unit c5 of the first layer configuration unit 212a-1 acquires the frame group output from the deconvolution layer unit c4.
  • the non-linear conversion unit c5 performs a non-linear conversion process on the acquired frame group.
  • the non-linear conversion unit c5 outputs the frame group on which the non-linear conversion processing has been performed to the deconvolution layer unit c4 of the configuration unit (second layer configuration unit) of the next layer.
  • By repeating the above processing over the M layers, the dimension expansion unit 212 converts the entropy decoded frame group output from the entropy decoding unit 211 into expanded decoded data in which the number of dimensions has been restored, and outputs the data to the intermediate data memory 213 and the correction unit 214.
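  • A minimal PyTorch sketch of this M-layer structure is shown below, assuming transposed 3-D convolutions as the deconvolution and ReLU as the non-linear conversion; channel counts and M are illustrative assumptions.

```python
import torch.nn as nn

class DimensionExpansionUnit(nn.Module):
    """Sketch of the M-layer structure of FIG. 8. Each layer applies a
    deconvolution (c4) and a non-linear conversion (c5)."""

    def __init__(self, m_layers: int = 3, channels: int = 32):
        super().__init__()
        layers, in_ch = [], channels
        for i in range(m_layers):
            out_ch = 1 if i == m_layers - 1 else channels
            layers += [
                # deconvolution layer unit c4: doubles each of (z, y, x)
                nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2),
                # non-linear conversion unit c5 (omitted on the last layer)
                nn.ReLU() if i < m_layers - 1 else nn.Identity(),
            ]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, entropy_decoded):
        # Restores the number of dimensions reduced by the compression unit.
        return self.net(entropy_decoded)
```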
  • the intermediate data memory 213 of the decoding unit 210 stores the intermediate data Mi, which is the decompressed decoded data generated in step S112 (step S113).
  • the correction unit 214 of the decoding unit 210 corrects the expanded decoded data acquired from the dimensional expansion unit 212 using the intermediate data Mi stored in the intermediate data memory 213.
  • Specifically, the correction unit 214 corrects the expanded decoded data to be corrected using the intermediate data Mi-1, which is the intermediate data stored in the intermediate data memory 213 before the intermediate data corresponding to that expanded decoded data.
  • That is, the correction unit 214 corrects the decompressed decoded data corresponding to the intermediate data Mi using the intermediate data Mi-1, which is the intermediate data immediately preceding the intermediate data Mi in the time direction.
  • the number of intermediate data used for correction may be two or more.
  • For example, the correction unit 214 corrects the decompressed decoded data corresponding to the intermediate data Mi by combining it with the intermediate data Mi-1 along the z (time) dimension.
  • the correction unit 214 generates a decoded frame group by performing the above-described processing on all the decompressed decoded data (step S114).
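  • For illustration, the combination of the data to be corrected with the preceding intermediate data along the z dimension could be sketched as follows; the tensor layout (batch, channel, z, y, x) is an assumption.

```python
import torch

def correction_input(expanded_i: torch.Tensor, m_prev: torch.Tensor) -> torch.Tensor:
    """Combine the data to be corrected with the intermediate data M_{i-1}
    along the z (time) dimension."""
    return torch.cat([m_prev, expanded_i], dim=2)  # dim=2 is the z axis
```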
  • the reason why the correction process is performed by the correction unit 214 is as follows. Since encoding is performed for each frame group composed of frames in the time direction z, subjective continuity may not be ensured between frame groups that are temporally close to or adjacent to each other. Therefore, in order to ensure continuity, a correction process is performed on the decompressed decoded data using intermediate data temporally close to or adjacent to the decompressed decoded data. By providing continuity, the subjective image quality of the decoded video obtained by combining the frame groups is improved.
  • the video combining unit 220 generates decoded video data by combining the generated decoded frame groups (Step S115). Thus, the operation of the video decoding device 20 shown in the flowchart of FIG. 7 ends.
  • the correction unit 214 is configured by, for example, a neural network (combination of convolution operation and non-linear conversion and scaling processing) as shown in FIG.
  • FIG. 9 is a configuration diagram of the correction unit 214 of the video encoding / decoding system 1 according to the first embodiment.
  • the correction unit 214 includes a configuration unit including M layers (first layer configuration unit 214a-1 to Mth layer configuration unit 214a-M) and a scaling unit 214b.
  • Each component is configured by a convolutional layer unit c6 and a non-linear conversion unit c7.
  • the convolutional layer section c6 of the first layer configuration section 214a-1 acquires the expanded decoded data output from the dimensional expansion section 212 and the intermediate data stored in the intermediate data memory 213.
  • the convolution layer unit c6 performs a convolution operation on the obtained decompressed decoded data.
  • the convolution layer unit c6 outputs the frame group on which the convolution operation has been performed to the nonlinear conversion unit c7.
  • the nonlinear conversion unit c7 of the first layer configuration unit 214a-1 acquires the frame group output from the convolution layer unit c6.
  • The non-linear conversion unit c7 performs non-linear conversion processing on the acquired frame group.
  • The non-linear conversion unit c7 outputs the frame group subjected to the non-linear conversion processing. Data obtained by adding the frame group output from the non-linear conversion unit c7 and the temporally immediately preceding intermediate data is input to the convolution layer unit c6 of the configuration unit of the next layer (second layer configuration unit).
  • the correction unit 214 performs scaling by a scaling unit 214b on a frame group obtained by repeating the above processing from the first layer to the Mth layer. Through the above processing, the correction unit 214 corrects the decompressed decoded data output from the dimensional decompression unit 212 with the intermediate data stored in the intermediate data memory 213, and outputs a decoded frame group that is the corrected decompressed decoded data to the video. Output to the combining unit 220.
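  • The following is a minimal PyTorch sketch of this correction structure. It assumes the re-injection of the preceding intermediate data between layers is an elementwise addition and that the scaling is a learnable scalar; these choices and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CorrectionUnit(nn.Module):
    """Sketch of FIG. 9: M layers of convolution (c6) + non-linear
    conversion (c7), with the temporally preceding intermediate data
    added back between layers, followed by scaling (214b)."""

    def __init__(self, m_layers: int = 3, channels: int = 1):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=3, padding=1),  # c6
                nn.ReLU(),                                                # c7
            )
            for _ in range(m_layers)
        ])
        self.scale = nn.Parameter(torch.ones(1))  # scaling unit 214b (assumed learnable)

    def forward(self, expanded, m_prev):
        # expanded: decompressed decoded data; m_prev: intermediate data
        # M_{i-1}; both of shape (batch, channels, z, y, x).
        h = expanded
        for layer in self.layers:
            h = layer(h) + m_prev  # re-inject the preceding intermediate data
        return self.scale * h      # decoded frame group
```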
  • FIG. 10 is a schematic diagram for explaining a learning process performed by the video encoding / decoding system 1 according to the first embodiment.
  • In the learning process, a data set including three temporally continuous input frame groups is input as one piece of sample data.
  • In temporal order, these three input frame groups are referred to as S1 (x, y, z), S2 (x, y, z) (first frame group), and S3 (x, y, z) (second frame group).
  • the process A is performed on each of the input frame groups S1 (x, y, z), S2 (x, y, z), and S3 (x, y, z).
  • the processing A is a dimension compression processing, a quantization / entropy encoding processing, an entropy decoding processing, and a dimension expansion processing.
  • intermediate data is generated.
  • The intermediate data generated from the input frame groups S1 (x, y, z), S2 (x, y, z), and S3 (x, y, z) are referred to as M1 (x, y, z), M2 (x, y, z) (the feature amount of the first frame group), and M3 (x, y, z) (the feature amount of the second frame group), respectively.
  • Next, M1 (x, y, z) and M2 (x, y, z), and M2 (x, y, z) and M3 (x, y, z), are each treated as a set.
  • That is, the decompressed decoded data corresponding to the intermediate data M2 (x, y, z) is corrected using the intermediate data M1 (x, y, z), and the decompressed decoded data corresponding to the intermediate data M3 (x, y, z) is corrected using the intermediate data M2 (x, y, z).
  • two decoded frame groups are generated.
  • each decoded frame group is referred to as R2 (x, y, z) and R3 (x, y, z) (corrected frame group).
  • The loss value loss is calculated using a loss function defined by the following equations (1) to (3).

Restoration error 1 = Σ_x Σ_y Σ_z diff(S2(x, y, z), R2(x, y, z)) + Σ_x Σ_y Σ_z diff(S3(x, y, z), R3(x, y, z)) … (1)

Restoration error 2 = Σ_x Σ_y Σ_z w(z) · diff(M2(x, y, z), R2(x, y, z)) + Σ_x Σ_y Σ_z w(z) · diff(M3(x, y, z), R3(x, y, z)) … (2)

loss = restoration error 1 + restoration error 2 + GAN error + FM error … (3)
  • diff (a, b) is a function (for example, a square error or the like) for measuring the distance between a and b.
  • concat() is an operation for connecting each input in the time direction.
  • GAN (x) is a discriminator that determines whether the input video x is a true video and outputs the probability.
  • the discriminator is constructed by a neural network.
  • FM (a, b) is the sum of the errors (for example, squared errors) between the values of the intermediate layers of the neural network when a and b are each input to the discriminator.
  • Then, the parameter values of each unit are updated by backpropagation or the like. Learning is performed by repeating the above series of steps a fixed number of times using a plurality of sample data, or by repeating it until the loss value converges. Note that the configuration of the loss functions shown in equations (1) to (3) is merely an example; a loss function in which only some of the errors are calculated, or a loss function to which a different error term is added, may also be used.
  • The flow of the learning process in the first embodiment is as follows. 1. Three temporally continuous input frame groups are prepared as one sample. 2. Each sample is input to the neural network serving as an auto-encoder (encoder / decoder) to obtain intermediate data. 3. The decoded video data corresponding to S2 (x, y, z) and S3 (x, y, z) are obtained by the neural network for correction. 4. The loss is calculated by adding the following values 1) to 4): 1) the restoration error between S2 (x, y, z) and R2 (x, y, z) and the restoration error between S3 (x, y, z) and R3 (x, y, z); 2) the weighted restoration error between M2 (x, y, z) and R2 (x, y, z) and between M3 (x, y, z) and R3 (x, y, z); 3) the GAN error (a binary cross-entropy error when R2 (x, y, z) and R3 (x, y, z) are input to the neural network that performs the identification processing); 4) the FM error (the error of the feature amounts of the hidden layers when S2 (x, y, z) and S3 (x, y, z), and R2 (x, y, z) and R3 (x, y, z), are input to the neural network that performs the identification processing). 5. Each neural network is updated by the backpropagation method.
  • the identification processing is processing for identifying whether or not a video based on the input video data is a true video.
  • The weighted restoration error of 2) is a term calculated so that the corrected frame group is temporally continuous with the adjacent frame group.
  • The GAN error of 3) and the FM error of 4) are terms calculated so that the video based on the decoded video data becomes a more natural output.
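  • Under the assumptions that diff() is the squared error, that w(z) is a per-frame weight, and that the discriminator outputs a probability, the loss of equations (1) to (3) could be sketched as follows (the FM term is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def loss_fn(S2, S3, R2, R3, M2, M3, w, discriminator):
    """Sketch of equations (1) to (3). Tensors have shape
    (batch, channel, z, y, x); w has shape (z,)."""
    wz = w.view(1, 1, -1, 1, 1)  # broadcast the per-frame weight w(z)

    # Restoration error 1 ... (1)
    rest1 = F.mse_loss(R2, S2, reduction="sum") + F.mse_loss(R3, S3, reduction="sum")
    # Restoration error 2 ... (2)
    rest2 = (wz * (R2 - M2) ** 2).sum() + (wz * (R3 - M3) ** 2).sum()

    # GAN error: binary cross-entropy when the corrected frame groups,
    # connected in the time direction, are input to the discriminator.
    fake = discriminator(torch.cat([R2, R3], dim=2))
    gan = F.binary_cross_entropy(fake, torch.ones_like(fake))

    return rest1 + rest2 + gan  # ... (3); the FM term is omitted here
```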
  • In this way, from the three temporally continuous input frame groups S1 (x, y, z), S2 (x, y, z), and S3 (x, y, z), the intermediate data M1 (x, y, z), M2 (x, y, z), and M3 (x, y, z) and the decoded frame groups R2 (x, y, z) and R3 (x, y, z) are generated, and learning is performed so that the combination of R2 (x, y, z) and R3 (x, y, z) becomes natural (i.e., has continuity).
  • The configuration is not limited to one in which a data set including three temporally continuous input frame groups is input; a configuration in which a data set including four or more temporally continuous input frame groups is input may also be adopted.
  • For example, from four temporally continuous input frame groups S1 (x, y, z), S2 (x, y, z), S3 (x, y, z), and S4 (x, y, z), the intermediate data M1 (x, y, z), M2 (x, y, z), M3 (x, y, z), and M4 (x, y, z) and the decoded frame groups R2 (x, y, z), R3 (x, y, z), and R4 (x, y, z) may be generated, and learning may be performed so that the combination of R2 (x, y, z), R3 (x, y, z), and R4 (x, y, z) becomes natural (i.e., has continuity).
  • the video encoding / decoding system 1 stores encoded data in the intermediate data memory 213 as intermediate data instead of decoding the encoded data directly into decoded video data. Then, the video encoding / decoding system 1 performs a correction process on the encoded data to be processed using surrounding data (intermediate data) that is temporally continuous, and decodes the encoded data. As a result, continuity between temporally continuous surrounding data and encoded data to be processed is maintained.
  • the video encoding / decoding system 1 can perform encoding and decoding with random access to image data and parallelism.
  • The video encoding / decoding system 1 performs learning using the restoration error 2 as described above. Therefore, for example, when M2 (x, y, z) shown in FIG. 10 is corrected to R2 (x, y, z), the frames of R2 (x, y, z) close to R3 (x, y, z) are kept almost unchanged, so that continuity between R2 (x, y, z) and R3 (x, y, z) is maintained. That is, the constraint condition is that the subjective image quality based on the relationship between S2 (x, y, z) and S3 (x, y, z), which is the input frame group temporally later than S2 (x, y, z), is increased. Thereby, according to the video encoding / decoding system 1, R2 (x, y, z) is corrected so as to be continuous with R3 (x, y, z), and the image quality is improved.
  • In addition, the neural network serving as the auto-encoder (the first learning model; the dimensional compression unit 121 and the dimension expansion unit 212) and the neural network for correction (the second learning model; the correction unit 214) are separate neural networks, and their learning processes are performed separately, so that the learning process is stabilized.
  • The overall configuration of the video encoding / decoding system according to the second embodiment and the configuration of its encoding unit are the same as the overall configuration of the video encoding / decoding system 1 and the configuration of the encoding unit 120 according to the first embodiment described with reference to FIGS. 1 and 2, and a description thereof will be omitted.
  • the video encoding / decoding system 1 according to the first embodiment differs from the video encoding / decoding system according to the second embodiment described below in the configuration of the decoding unit included in the video decoding device.
  • FIG. 11 illustrates a configuration of the decoding unit 210a included in the video decoding device of the video encoding / decoding system according to the second embodiment.
  • The decoding unit 210a includes an entropy decoding unit 211, a dimension expansion unit 212, an intermediate data memory 213, a correction unit 214, and a correction processing changeover switch 215.
  • The difference between the decoding unit 210a according to the second embodiment and the decoding unit 210 according to the first embodiment is that the decoding unit 210a further includes the correction processing changeover switch 215 in addition to the functional configuration of the decoding unit 210.
  • The dimension expansion unit 212 outputs the generated decompressed decoded data to the intermediate data memory 213 and the correction processing changeover switch 215, respectively.
  • the correction process changeover switch 215 acquires the expanded decoded data output from the dimensional expansion unit 212.
  • the correction processing changeover switch 215 switches whether to output the obtained decompressed decoded data as a decoded frame group to the video combining unit or to the correcting unit 214.
  • the correction unit 214 acquires the decompressed decoded data output from the correction processing changeover switch 215. Further, the correction unit 214 acquires the intermediate data stored in the intermediate data memory 213. The correction unit 214 generates a decoded frame group by correcting the decompressed decoded data using the intermediate data. The correction unit 214 outputs the generated decoded frame group to the video combining unit 220.
  • the operation of the video encoding device according to the second embodiment is the same as the operation of the video encoding device 10 according to the first embodiment described with reference to FIG. Therefore, description of the operation of the video encoding device according to the second embodiment will be omitted.
  • FIG. 12 is a flowchart illustrating the operation of the video decoding device according to the second embodiment.
  • the entropy decoding unit 211 of the decoding unit 210a acquires the encoded data.
  • the entropy decoding unit 211 generates entropy decoded data by performing entropy decoding on the acquired encoded data (step S211).
  • the dimension expansion unit 212 of the decoding unit 210a generates expanded decoded data by restoring the generated entropy decoded data to the original number of dimensions (before the number of dimensions is reduced by the dimensional compression unit) ( Step S212).
  • the intermediate data memory 213 of the decoding unit 210a stores the intermediate data Mi, which is the decompressed decoded data generated in step S212 (step S213).
  • the correction processing changeover switch 215 of the decoding unit 210a checks the value of the index i indicating the number of the input frame group by referring to the expanded decoded data generated by the dimension expanding unit 212. If the value of i is an odd number (step S214: YES), the correction processing changeover switch 215 outputs the obtained decompressed decoded data as it is to the video combining unit as a decoded frame group.
  • The video combining unit generates decoded video data by combining the generated decoded frame groups (step S216). This is the end of the operation of the video decoding device shown in the flowchart of FIG. 12.
  • step S214 if the value of i is an even number (step S214: NO), the correction process changeover switch 215 outputs the obtained decompressed decoded data to the correction unit 214 of the decoding unit 210a.
  • the correction unit 214 corrects the decompressed and decoded data obtained via the correction processing changeover switch 215 using the intermediate data Mi stored in the intermediate data memory 213.
  • Note that, conversely, the correction processing changeover switch 215 may output the decompressed decoded data as it is to the video combining unit as a decoded frame group when the value of i is an even number, and output the decompressed decoded data to the correction unit 214 when the value of i is an odd number.
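  • The switching logic itself is simple; a sketch is shown below, with hypothetical container types for the intermediate data memory and the video combining unit, and assuming that the neighboring intermediate data Mi-1 and Mi+1 are already stored when an even-indexed group is corrected.

```python
def route_frame_group(i, expanded, correction_unit, memory, outputs):
    """Sketch of the correction processing changeover switch 215.
    `memory` maps the index i to intermediate data Mi; `outputs` is a
    list consumed by the video combining unit (both hypothetical)."""
    if i % 2 == 1:
        outputs.append(expanded)  # odd index: output as-is (step S214: YES)
    else:
        corrected = correction_unit(expanded, memory[i - 1], memory[i + 1])
        outputs.append(corrected)  # even index: corrected decoded frame group
```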
  • The purpose of the correction processing changeover switch 215 performing the correction process on every other piece of acquired decompressed decoded data is as follows.
  • In the first embodiment, the subjective image quality is improved by correcting the frame group (Mi) to be corrected so that it is temporally continuous with the temporally preceding frame group (Mi-1).
  • However, the temporally preceding frame group (Mi-1) is itself further corrected based on the frame group (Mi-2) preceding it. Therefore, the temporally preceding frame group (Mi-1) differs from what it was at the time the frame group to be corrected (Mi) referred to it, and temporal continuity of the final output is not guaranteed.
  • In contrast, the second embodiment has a configuration in which frame groups to be corrected and frame groups not to be corrected alternate in time.
  • In this configuration, the frame groups before and after a frame group to be corrected do not change after the point in time when they are referred to, so that temporal continuity is ensured.
  • In the second embodiment, the correction unit 214 corrects the expanded decoded data (second frame group) to be corrected using the intermediate data Mi-1 (first frame group) and the intermediate data Mi+1 (third frame group).
  • the intermediate data Mi-1 is intermediate data stored in the intermediate data memory 213 prior to the intermediate data Mi corresponding to the decompressed decoded data.
  • the intermediate data Mi + 1 is intermediate data stored in the intermediate data memory 213 after the intermediate data Mi corresponding to the decompressed decoded data.
  • That is, the correction unit 214 corrects the decompressed decoded data corresponding to the intermediate data Mi using the intermediate data Mi-1, which is the intermediate data immediately preceding the intermediate data Mi in the time direction, and the intermediate data Mi+1, which is the intermediate data immediately following the intermediate data Mi in the time direction.
  • the number of intermediate data used for correction may be three or more.
  • For example, the correction unit 214 corrects the decompressed decoded data corresponding to the intermediate data Mi by combining it with the intermediate data Mi-1 and the intermediate data Mi+1 along the z (time) dimension.
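  • For illustration, the combination along the z dimension in the second embodiment could be sketched as follows; the tensor layout (batch, channel, z, y, x) is an assumption.

```python
import torch

def correction_input_2nd(expanded_i: torch.Tensor,
                         m_prev: torch.Tensor,
                         m_next: torch.Tensor) -> torch.Tensor:
    """Combine M_{i-1} and M_{i+1} with the data to be corrected along
    the z (time) dimension."""
    return torch.cat([m_prev, expanded_i, m_next], dim=2)  # dim=2 is the z axis
```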
  • the correction unit 214 generates a decoded frame group by performing the above-described processing on all the decompressed decoded data (step S215).
  • The video combining unit generates decoded video data by combining the generated decoded frame groups (step S216). This is the end of the operation of the video decoding device shown in the flowchart of FIG. 12.
  • FIG. 13 is a schematic diagram illustrating a learning process performed by the video encoding / decoding system according to the second embodiment.
  • In the learning process of the second embodiment, a data set including three temporally continuous input frame groups is input as one piece of sample data.
  • these three input frame groups are referred to as S1 (x, y, z), S2 (x, y, z), and S3 (x, y, z) in order of time.
  • the process A is performed on each of the input frame groups S1 (x, y, z), S2 (x, y, z), and S3 (x, y, z).
  • the process A is a dimensional compression process, a quantization / entropy encoding process, an entropy decoding process, and a dimensional expansion process.
  • intermediate data is generated.
  • The intermediate data generated from the input frame groups S1 (x, y, z), S2 (x, y, z), and S3 (x, y, z) are referred to as M1 (x, y, z), M2 (x, y, z), and M3 (x, y, z), respectively.
  • Next, correction is performed using M1 (x, y, z), M2 (x, y, z), and M3 (x, y, z) as a set.
  • That is, the decompressed decoded data corresponding to the intermediate data M2 (x, y, z) is corrected using the intermediate data M1 (x, y, z) and the intermediate data M3 (x, y, z).
  • a decoded frame group is generated.
  • the generated decoded frame group is defined as R2 (x, y, z).
  • The loss value loss is calculated using a loss function defined by the following equations (4) and (5).

Restoration error 1 = Σ_x Σ_y Σ_z diff(S1(x, y, z), M1(x, y, z)) + Σ_x Σ_y Σ_z diff(S3(x, y, z), M3(x, y, z)) … (5)
  • diff (a, b) is a function (for example, a square error or the like) for measuring the distance between a and b.
  • concat () is an operation of connecting each input in the time direction.
  • GAN (x) is a discriminator that determines whether the input video x is a true video and outputs the probability.
  • the discriminator is constructed by a neural network.
  • FM (a, b) is the sum of the errors (for example, squared errors) between the values of the intermediate layers of the neural network when a and b are each input to the discriminator.
  • Then, the parameter values of each unit are updated by backpropagation or the like. Learning is performed by repeating the above series of steps a fixed number of times using a plurality of sample data, or by repeating it until the loss value converges. Note that the configuration of the loss functions shown in equations (4) and (5) is an example; a loss function in which only some of the errors are calculated, or a loss function to which a different error term is added, may also be used.
  • the video encoding / decoding system can perform encoding and decoding with random access to image data and parallelism.
  • The video encoding / decoding system 1 according to the first embodiment corrects each input frame group independently. Therefore, in the video encoding / decoding system 1 according to the first embodiment, although each input is corrected so as to be temporally continuous with the previous output, how the previous output itself has been corrected is unknown. For this reason, in the video encoding / decoding system 1 according to the first embodiment, it may not be possible to reliably ensure that the corrected decoded frame groups have continuity.
  • In the video encoding / decoding system according to the second embodiment, learning is performed such that, for frame groups whose index value is odd (or even), the decompressed decoded data itself is used as the decoded frame group, and frame groups whose index value is not odd (or even) are corrected so as to be continuous with those frame groups. As a result, the outputs before and after a frame group to be corrected do not change. Therefore, the video encoding / decoding system according to the second embodiment can ensure that a corrected decoded frame group and the decoded frame groups temporally adjacent before and after it have continuity.
  • a part or all of the video encoding / decoding system in the above-described embodiment may be realized by a computer.
  • a program for realizing this function may be recorded on a computer-readable recording medium, and the program recorded on this recording medium may be read and executed by a computer system.
  • the “computer system” includes an OS and hardware such as peripheral devices.
  • the “computer-readable recording medium” refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, and a CD-ROM, and a storage device such as a hard disk built in a computer system.
  • a "computer-readable recording medium” refers to a communication line for transmitting a program via a network such as the Internet or a communication line such as a telephone line, and dynamically holds the program for a short time.
  • a program may include a program that holds a program for a certain period of time, such as a volatile memory in a computer system serving as a server or a client in that case.
  • Further, the program may realize a part of the functions described above, may realize the functions described above in combination with a program already recorded in the computer system, or may be realized using hardware such as a PLD (Programmable Logic Device) or an FPGA (Field Programmable Gate Array).
  • Reference Signs List: 1 video encoding / decoding system; 10 video encoding device; 20 video decoding device; 110 video dividing unit; 120 encoding unit; 121 dimensional compression unit; 122 quantization / entropy encoding unit; 210 decoding unit; 211 entropy decoding unit; 212 dimension expansion unit; 213 intermediate data memory; 214 correction unit; 220 video combining unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

An image processing device which performs correction for each frame group composed of a predetermined number of frames into which picture data is divided is provided with a decoding unit which, using a feature quantity of a first frame group, performs correction with respect to a second frame group that is a frame group temporally successive to the first frame group, thereby obtaining a corrected frame group. The decoding unit performs the correction so that the subjective image quality based on the relationship between the second frame group and a frame group temporally subsequent to the second frame group is increased, and so that a predetermined identifying unit identifies that a frame group in which the second frame group and the frame group temporally subsequent to the second frame group are combined is identical to a frame group in which the corrected frame group and a corrected frame group obtained by correcting the frame group temporally subsequent to the second frame group are combined.

Description

Image processing apparatus, image processing method, and image processing program

The present invention relates to an image processing device, an image processing method, and an image processing program.
Priority is claimed on Japanese Patent Application No. 2018-174982, filed on September 19, 2018, the content of which is incorporated herein by reference.
One method of encoding an image uses an autoencoder (self-encoder). The term “image” here includes both still images and moving images (hereinafter referred to as “video”). An autoencoder is a three-layer neural network consisting of an input layer (encoder), a hidden layer, and an output layer (decoder). An autoencoder is designed so that the encoder encodes input data into encoded data and the decoder restores the encoded data to the input data. The encoder and the decoder are constructed from arbitrary arithmetic units. For example, when the input data is an image, the encoder is constructed from a plurality of units that perform convolution operations, and the decoder from a plurality of units that perform the inverse of the encoder's convolution operations.
In computation by a neural network, increasing the number of parameters is expected to improve expressive power and performance. However, when the input data is, for example, a high-resolution image, increasing the number of parameters makes the memory capacity required for the computation enormous. It is therefore not realistic to improve expressive power and performance by increasing the number of parameters.
Therefore, as shown in FIG. 14, for example, a method is conceivable in which the input data is divided into a plurality of pieces of data of a size that can be processed, each piece is processed by the neural network, and the output decoded data are combined to restore the original input data. In this method, however, each piece of divided data is processed independently of the others. As a result, the restored input data does not maintain continuity between adjacent pieces of decoded data, particularly at the boundaries of the divided images, and is likely to look unnatural.
In contrast, as shown in FIG. 15, for example, there is a conventional technique in which the surrounding decoded data, together with the data to be processed, is recursively input to the encoder, the decoder, or both. By recursively inputting the surrounding decoded data in this way, the continuity between the data to be processed and the surrounding decoded data is taken into account, and more natural restored data can be obtained.
However, the above conventional technique has the problem of lacking random accessibility. Random accessibility here refers to the property that desired data can easily be obtained even when the data is accessed discretely. In the conventional technique, when the input data is video data, for example, encoding and decoding are performed in order from the beginning of the video data. In this case, even when only the decoded data at a desired position in the video data is wanted, the decoded data at that position cannot be obtained unless decoding is performed in order from the beginning of the video data.
The above conventional technique also has the problem of lacking parallelism. Because the conventional technique performs its arithmetic processing recursively, parallel processing is difficult. It is therefore difficult for the conventional technique to perform the arithmetic processing efficiently using a distributed processing system or the like.
The present invention has been made in view of such circumstances, and an object of the present invention is to provide a technique capable of performing encoding and decoding of image data with random accessibility and parallelism.
One aspect of the present invention is an image processing apparatus that performs correction for each frame group consisting of a predetermined number of frames into which video data is divided, the apparatus including a decoding unit that obtains a corrected frame group by performing correction, using a feature amount of a first frame group, on a second frame group that is a frame group temporally continuous with the first frame group, wherein the decoding unit performs the correction so that the subjective image quality based on the relationship between the second frame group and a frame group temporally later than the second frame group is increased, and so that a predetermined classifier identifies, as being identical, a frame group in which the second frame group and the frame group temporally later than the second frame group are combined and a frame group in which the corrected frame group and a corrected frame group obtained by correcting the frame group temporally later than the second frame group are combined.
Another aspect of the present invention is the above image processing apparatus, wherein the decoding unit gives a heavier weight in the correction to the feature amount of a frame that is temporally later with respect to the second frame group.
Another aspect of the present invention is an image processing apparatus that performs correction for each frame group consisting of a predetermined number of frames into which video data is divided, the apparatus including a decoding unit that obtains a corrected frame group by performing correction on a second frame group using a feature amount of a first frame group, which is a frame group temporally earlier than and temporally continuous with the second frame group, and a feature amount of a third frame group, which is a frame group temporally later than and temporally continuous with the second frame group, wherein the decoding unit performs the correction so that the subjective image quality based on the relationship between the corrected frame group and the first frame group and the relationship between the corrected frame group and the third frame group is increased.
Another aspect of the present invention is the above image processing apparatus, wherein the decoding unit performs the correction based on parameter values updated by learning processing based on frame groups into which video data different from the video data is divided.
Another aspect of the present invention is the above image processing apparatus, wherein the learning processing includes a step of acquiring sample data consisting of at least three temporally consecutive frame groups, a step of inputting the sample data to a first learning model to obtain a feature amount of each frame group, a step of inputting the feature amounts of the frame groups to a second learning model to obtain the corrected frame groups corresponding to the frame groups, a step of calculating a loss value based on the sample data, the feature amounts of the frame groups, the corrected frame groups, and a predetermined loss function, and a step of updating the parameter values using the loss value.
Another aspect of the present invention is an image processing apparatus that performs correction for each partial data group consisting of a predetermined number of pieces of partial data into which data is divided, the apparatus including a decoding unit that obtains a corrected partial data group by performing correction, using a feature amount of a first partial data group, on a second partial data group that is a partial data group temporally continuous with the first partial data group, wherein the decoding unit performs the correction so that the subjective image quality based on the relationship between the second partial data group and a partial data group temporally later than the second partial data group is increased, and so that a predetermined classifier identifies, as being identical, a partial data group in which the second partial data group and the partial data group temporally later than the second partial data group are combined and a partial data group in which the corrected partial data group and a corrected partial data group obtained by correcting the partial data group temporally later than the second partial data group are combined.
Another aspect of the present invention is an image processing method for performing correction for each frame group consisting of a predetermined number of frames into which video data is divided, the method including a step of obtaining a corrected frame group by performing correction, using a feature amount of a first frame group, on a second frame group that is a frame group temporally continuous with the first frame group, and a step of performing the correction so that the subjective image quality based on the relationship between the second frame group and a frame group temporally later than the second frame group is increased, and so that a predetermined classifier identifies, as being identical, a frame group in which the second frame group and the frame group temporally later than the second frame group are combined and a frame group in which the corrected frame group and a corrected frame group obtained by correcting the frame group temporally later than the second frame group are combined.
Another aspect of the present invention is an image processing program for causing a computer to function as the above image processing apparatus.
According to the present invention, encoding and decoding of image data with random accessibility and parallelism can be performed.
FIG. 1 is an overall configuration diagram of a video encoding/decoding system 1 according to a first embodiment.
FIG. 2 is a configuration diagram of an encoding unit 120 of the video encoding/decoding system 1 according to the first embodiment.
FIG. 3 is a configuration diagram of a decoding unit 210 of the video encoding/decoding system 1 according to the first embodiment.
FIG. 4 is a configuration diagram of a decoding unit of a video encoding/decoding system according to a conventional technique.
FIG. 5 is a flowchart illustrating an operation of the video encoding device 10 according to the first embodiment.
FIG. 6 is a configuration diagram of a dimensional compression unit 121 of the video encoding/decoding system 1 according to the first embodiment.
FIG. 7 is a flowchart illustrating an operation of the video decoding device 20 according to the first embodiment.
FIG. 8 is a configuration diagram of a dimension expansion unit 212 of the video encoding/decoding system 1 according to the first embodiment.
FIG. 9 is a configuration diagram of a correction unit 214 of the video encoding/decoding system 1 according to the first embodiment.
FIG. 10 is a schematic diagram for explaining a learning process performed by the video encoding/decoding system 1 according to the first embodiment.
FIG. 11 is a configuration diagram of a decoding unit 210a of a video encoding/decoding system according to a second embodiment.
FIG. 12 is a flowchart illustrating an operation of a video decoding device according to the second embodiment.
FIG. 13 is a schematic diagram for explaining a learning process performed by the video encoding/decoding system according to the second embodiment.
FIG. 14 is a schematic diagram for explaining a learning process performed by a video encoding/decoding system according to the related art.
FIG. 15 is a schematic diagram for explaining a learning process performed by a video encoding/decoding system according to the related art.
<First embodiment>
Hereinafter, a first embodiment of the present invention will be described with reference to the drawings.
Hereinafter, a video encoding/decoding system 1 that encodes and decodes video data will be described. However, the system is also applicable to encoding and decoding image data other than video data.
[Configuration of video encoding/decoding system]
Hereinafter, the configuration of the video encoding/decoding system 1 will be described.
FIG. 1 is an overall configuration diagram of the video encoding/decoding system 1 (image processing apparatus) according to the first embodiment. As shown in FIG. 1, the video encoding/decoding system 1 acquires input video data to be encoded and outputs decoded video data corresponding to the input video data. The video encoding/decoding system 1 includes a video encoding device 10 and a video decoding device 20.
The video encoding device 10 includes a video dividing unit 110 and an encoding unit 120. The video dividing unit 110 acquires input video data, which consists of a plurality of temporally consecutive frames. The video dividing unit 110 generates a plurality of input frame groups by dividing the consecutive frames constituting the acquired input video data into groups of a predetermined number of frames, and outputs the generated input frame groups in order to the encoding unit 120.
FIG. 2 shows the configuration of the encoding unit 120. As shown in FIG. 2, the encoding unit 120 includes a dimensional compression unit 121 and a quantization/entropy encoding unit 122.
The dimensional compression unit 121 acquires the input frame group output from the video dividing unit 110. The dimensional compression unit 121 generates a compressed frame group by compressing the acquired input frame group so as to reduce its number of dimensions, and outputs the generated compressed frame group to the quantization/entropy encoding unit 122.
The quantization/entropy encoding unit 122 acquires the compressed frame group output from the dimensional compression unit 121 and performs quantization and entropy encoding on the values of each compressed frame constituting the acquired compressed frame group. The quantization/entropy encoding unit 122 then generates encoded data by concatenating the quantized and entropy-encoded compressed frame groups, and outputs the generated encoded data to a decoding unit 210, described later, of the video decoding device 20.
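As a rough illustration of this stage only, the following Python sketch pairs a uniform scalar quantizer with a placeholder entropy coder. The step size, the use of NumPy, and the function names are assumptions; the embodiment does not fix a particular quantizer or entropy coding method, so any lossless coder could stand in for the placeholder.

```python
import numpy as np

def quantize(compressed: np.ndarray, step: float = 1.0 / 255) -> np.ndarray:
    """Uniform scalar quantization of a compressed frame group (assumed scheme)."""
    return np.round(compressed / step).astype(np.int32)

def entropy_encode(symbols: np.ndarray) -> bytes:
    """Placeholder: a real implementation would apply a lossless entropy coder
    (e.g., arithmetic coding) to the quantized symbols."""
    return symbols.tobytes()

# Encoded data: quantized and entropy-coded compressed frame groups, concatenated.
# encoded = b"".join(entropy_encode(quantize(c)) for c in compressed_frame_groups)
```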
Returning to FIG. 1, the description will be continued.
The video decoding device 20 includes a decoding unit 210 and a video combining unit 220.
FIG. 3 shows the configuration of the decoding unit 210. As shown in FIG. 3, the decoding unit 210 includes an entropy decoding unit 211, a dimension expansion unit 212, an intermediate data memory 213, and a correction unit 214.
The entropy decoding unit 211 acquires the encoded data output from the quantization/entropy encoding unit 122 of the encoding unit 120, generates entropy-decoded data by entropy-decoding the acquired encoded data, and outputs the generated entropy-decoded data to the dimension expansion unit 212.
The dimension expansion unit 212 generates expanded decoded data by expanding the entropy-decoded data output from the entropy decoding unit 211 until it has the same number of dimensions as the above-described input frame group (before compression by the dimensional compression unit 121). The dimension expansion unit 212 outputs the generated expanded decoded data to the intermediate data memory 213 and the correction unit 214.
The intermediate data memory 213 acquires and stores the expanded decoded data output from the dimension expansion unit 212. The expanded decoded data stored in the intermediate data memory 213 is hereinafter referred to as “intermediate data”. The intermediate data is output to the correction unit 214 as needed. The intermediate data memory 213 is a volatile recording medium such as a RAM (Random Access Memory).
The correction unit 214 acquires the expanded decoded data output from the dimension expansion unit 212 and the intermediate data stored in the intermediate data memory 213. The correction unit 214 generates a decoded frame group by correcting the expanded decoded data using the intermediate data, and outputs the generated decoded frame group to the video combining unit 220.
Returning to FIG. 1, the description will be continued.
The video combining unit 220 acquires the decoded frame groups output from the decoding unit 210, generates decoded video data by combining the acquired decoded frame groups, and outputs the generated decoded video data as the final output data.
To explain the difference from the conventional technique, FIG. 4 shows the configuration of the decoding unit of a video encoding/decoding system according to the conventional technique. As shown in FIGS. 3 and 4, the difference between the configuration of the decoding unit 210 according to the first embodiment and that of the conventional decoding unit is that the decoding unit 210 according to the first embodiment includes the correction unit 214, whereas the conventional decoding unit does not.
The dimension expansion unit of the conventional decoding unit acquires the entropy-decoded data output from the entropy decoding unit, expands the number of dimensions of the acquired entropy-decoded data using the intermediate data stored in the intermediate data memory, and generates a decoded frame group.
In the decoding unit 210 according to the first embodiment, by contrast, the correction unit 214 acquires the expanded decoded data from the dimension expansion unit 212 and the intermediate data from the intermediate data memory 213, as described above, and generates a decoded frame group by correcting the expanded decoded data using the intermediate data.
[Operation of video encoding device]
Hereinafter, an example of the operation of the video encoding device 10 will be described.
FIG. 5 is a flowchart illustrating the operation of the video encoding device 10 according to the first embodiment.
The video dividing unit 110 acquires input video data S(x, y, z), where x is the horizontal direction, y is the vertical direction, and z is the time direction. The video dividing unit 110 generates a plurality of input frame groups Si(x, y, z) by dividing the acquired input video data S(x, y, z) into groups of N frames (step S101). Here, the numbers of dimensions of x, y, and z are X, Y, and Z, respectively, and i is an index representing the number of the input frame group.
The sizes of the frame groups do not necessarily have to be the same. For example, frame groups of N frames and frame groups of L frames (L being a positive number different from N) may be mixed. Also, for example, the input video data S(x, y, z) may be divided alternately into N frames and L frames so that frame groups of N frames and frame groups of L frames are generated alternately.
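As a minimal sketch of the division in step S101, the following assumes the video is held as a NumPy array of shape (X, Y, Z) and a fixed group size N; the function name is hypothetical.

```python
import numpy as np

def split_video(video: np.ndarray, n: int) -> list[np.ndarray]:
    """Split input video data S(x, y, z) along the time axis z into
    input frame groups Si of N frames each (step S101)."""
    z = video.shape[2]
    return [video[:, :, i:i + n] for i in range(0, z, n)]

# Example: a 128x128 video of 30 frames split into groups of N = 10 frames.
groups = split_video(np.zeros((128, 128, 30)), n=10)
assert len(groups) == 3 and groups[0].shape == (128, 128, 10)
```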
The dimensional compression unit 121 of the encoding unit 120 generates a compressed frame group by compressing each input frame group Si(x, y, z) to the dimensions (X', Y', N') (step S102). The dimensions (X', Y', N') satisfy X'*Y'*N' < X*Y*N.
The dimensional compression unit 121 is configured by a neural network (a combination of convolution operations, downsampling, and non-linear transformations), for example as shown in FIG. 6.
FIG. 6 is a configuration diagram of the dimensional compression unit 121 of the video encoding/decoding system 1 according to the first embodiment. As shown in FIG. 6, the dimensional compression unit 121 is composed of M layers of constituent units (first-layer unit 121a-1 to M-th-layer unit 121a-M). Each constituent unit is composed of a convolutional layer unit c1, a downsampling unit c2, and a non-linear transformation unit c3.
The convolutional layer unit c1 of the first-layer unit 121a-1 acquires the input frame group output from the video dividing unit 110, performs a convolution operation on it, and outputs the resulting frame group to the downsampling unit c2.
The downsampling unit c2 of the first-layer unit 121a-1 acquires the frame group output from the convolutional layer unit c1, compresses it so as to reduce its number of dimensions, and outputs the compressed frame group to the non-linear transformation unit c3.
The non-linear transformation unit c3 of the first-layer unit 121a-1 acquires the frame group output from the downsampling unit c2, performs a non-linear transformation on it, and outputs the transformed frame group to the convolutional layer unit c1 of the next-layer unit (second-layer unit).
By repeating the above processing from the first layer to the M-th layer, the dimensional compression unit 121 converts the input frame group input from the video dividing unit 110 into a compressed frame group with a reduced number of dimensions and outputs it to the quantization/entropy encoding unit 122.
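As one concrete reading of this structure, the PyTorch sketch below stacks M layers, each consisting of a convolution (c1), a downsampling step (c2), and a non-linear transformation (c3). The channel counts, kernel sizes, pooling-based downsampling, and the choice of ReLU are illustrative assumptions, not details fixed by the embodiment.

```python
import torch
import torch.nn as nn

class DimensionalCompressor(nn.Module):
    """Sketch of the dimensional compression unit 121: M layers of
    convolution (c1), downsampling (c2), and non-linearity (c3)."""
    def __init__(self, m_layers: int = 3, channels: int = 32):
        super().__init__()
        layers = []
        in_ch = 1
        for _ in range(m_layers):
            layers += [
                nn.Conv3d(in_ch, channels, kernel_size=3, padding=1),  # c1
                nn.AvgPool3d(kernel_size=2),                           # c2: halves N, Y, X
                nn.ReLU(),                                             # c3
            ]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, frame_group: torch.Tensor) -> torch.Tensor:
        # frame_group: (batch, 1, N, Y, X) -> compressed frame group with
        # reduced dimensions (X', Y', N').
        return self.net(frame_group)
```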
Returning to FIG. 5, the description will be continued.
The quantization/entropy encoding unit 122 of the encoding unit 120 performs quantization and entropy encoding on each compressed frame group, and then generates encoded data by concatenating the quantized and entropy-encoded compressed frame groups (step S103).
This concludes the operation of the video encoding device 10 shown in the flowchart of FIG. 5.
[Operation of video decoding device]
Hereinafter, an example of the operation of the video decoding device 20 will be described.
FIG. 7 is a flowchart illustrating the operation of the video decoding device 20 according to the first embodiment.
The entropy decoding unit 211 of the decoding unit 210 acquires the encoded data and generates entropy-decoded data by performing entropy decoding on the acquired encoded data (step S111).
The dimension expansion unit 212 of the decoding unit 210 generates expanded decoded data by restoring the generated entropy-decoded data to its original number of dimensions (before the number of dimensions was reduced by the dimensional compression unit 121) (step S112).
The dimension expansion unit 212 is configured by a neural network (a combination of deconvolution operations and non-linear transformations), for example as shown in FIG. 8.
FIG. 8 is a configuration diagram of the dimension expansion unit 212 of the video encoding/decoding system 1 according to the first embodiment. As shown in FIG. 8, the dimension expansion unit 212 is composed of M layers of constituent units (first-layer unit 212a-1 to M-th-layer unit 212a-M). Each constituent unit is composed of a deconvolution layer unit c4 and a non-linear transformation unit c5.
The deconvolution layer unit c4 of the first-layer unit 212a-1 acquires the entropy-decoded frame group output from the entropy decoding unit 211, performs a deconvolution operation on it, and outputs the resulting frame group to the non-linear transformation unit c5.
The non-linear transformation unit c5 of the first-layer unit 212a-1 acquires the frame group output from the deconvolution layer unit c4, performs a non-linear transformation on it, and outputs the transformed frame group to the deconvolution layer unit c4 of the next-layer unit (second-layer unit).
By repeating the above processing from the first layer to the M-th layer, the dimension expansion unit 212 converts the entropy-decoded frame group output from the entropy decoding unit 211 into expanded data whose number of dimensions has been restored, and outputs it to the intermediate data memory 213 and the correction unit 214.
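Mirroring the encoder sketch above, the following is one possible PyTorch reading of the M deconvolution-plus-non-linearity layers. The kernel size, stride, and channel counts are assumptions, chosen so that each layer doubles the dimensions that each compression layer halved.

```python
import torch
import torch.nn as nn

class DimensionExpander(nn.Module):
    """Sketch of the dimension expansion unit 212: M layers, each a
    deconvolution (c4) followed by a non-linear transformation (c5)."""
    def __init__(self, m_layers: int = 3, channels: int = 32):
        super().__init__()
        layers = []
        in_ch = channels
        for i in range(m_layers):
            out_ch = 1 if i == m_layers - 1 else channels
            layers += [
                nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2),  # c4
                nn.ReLU(),                                                   # c5
            ]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, entropy_decoded: torch.Tensor) -> torch.Tensor:
        # Each layer doubles N, Y, and X, undoing the reduction performed by
        # the M layers of the dimensional compression unit.
        return self.net(entropy_decoded)
```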
Returning to FIG. 7, the description will be continued.
The intermediate data memory 213 of the decoding unit 210 stores the intermediate data Mi, which is the expanded decoded data generated in step S112 (step S113).
The correction unit 214 of the decoding unit 210 corrects the expanded decoded data acquired from the dimension expansion unit 212 using the intermediate data stored in the intermediate data memory 213.
Here, the correction unit 214 corrects the expanded decoded data to be corrected using intermediate data Mi-1, that is, intermediate data stored in the intermediate data memory 213 earlier than the intermediate data corresponding to that expanded decoded data. For example, the correction unit 214 corrects the expanded decoded data corresponding to the intermediate data Mi using the intermediate data Mi-1, the intermediate data immediately preceding Mi in the time direction. Two or more pieces of intermediate data may be used for the correction.
The correction unit 214 corrects the expanded decoded data corresponding to the intermediate data Mi by combining it with the intermediate data Mi-1 in the z-direction dimension. The correction unit 214 generates the decoded frame groups by performing the above processing on all the expanded decoded data (step S114).
The reason the correction processing is performed by the correction unit 214 is as follows. Because encoding is performed for each frame group composed of frames in the time direction z, subjective continuity may not be ensured between frame groups that are temporally close or adjacent to each other. Therefore, in order to ensure continuity, the expanded decoded data is corrected using intermediate data that is temporally close or adjacent to it. Providing this continuity improves the subjective image quality of the decoded video obtained by combining the frame groups.
The video combining unit 220 generates decoded video data by combining the generated decoded frame groups (step S115).
This concludes the operation of the video decoding device 20 shown in the flowchart of FIG. 7.
The correction unit 214 is configured by a neural network (a combination of convolution operations and non-linear transformations, plus scaling processing), for example as shown in FIG. 9.
FIG. 9 is a configuration diagram of the correction unit 214 of the video encoding/decoding system 1 according to the first embodiment. As shown in FIG. 9, the correction unit 214 is composed of M layers of constituent units (first-layer unit 214a-1 to M-th-layer unit 214a-M) and a scaling unit 214b. Each constituent unit is composed of a convolutional layer unit c6 and a non-linear transformation unit c7.
The convolutional layer unit c6 of the first-layer unit 214a-1 acquires the expanded decoded data output from the dimension expansion unit 212 and the intermediate data stored in the intermediate data memory 213. The convolutional layer unit c6 performs a convolution operation on the acquired expanded decoded data and outputs the resulting frame group to the non-linear transformation unit c7.
The non-linear transformation unit c7 of the first-layer unit 214a-1 acquires the frame group output from the convolutional layer unit c6, performs a non-linear transformation on it, and outputs the transformed frame group. Data obtained by adding the frame group output from the non-linear transformation unit c7 and the temporally preceding intermediate data is input to the convolutional layer unit c6 of the next-layer unit (second-layer unit).
The correction unit 214 performs scaling, with the scaling unit 214b, on the frame group obtained by repeating the above processing from the first layer to the M-th layer. Through the above processing, the correction unit 214 corrects the expanded decoded data output from the dimension expansion unit 212 using the intermediate data stored in the intermediate data memory 213, and outputs the decoded frame group, which is the corrected expanded decoded data, to the video combining unit 220.
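The sketch below is one hedged reading of this correction network: M convolution (c6) plus non-linearity (c7) stages, the preceding intermediate data re-injected between stages, and a final scaling step (214b). The channel counts, the learnable-gain scaling, and the use of channel concatenation as a stand-in for the z-direction combination described above are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class CorrectionUnit(nn.Module):
    """Sketch of the correction unit 214 (assumed hyperparameters)."""
    def __init__(self, m_layers: int = 3, channels: int = 32):
        super().__init__()
        # First stage takes the expanded decoded data combined with Mi-1.
        self.first = nn.Sequential(nn.Conv3d(2, channels, 3, padding=1), nn.ReLU())
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU())
            for _ in range(m_layers - 1)
        ])
        self.proj = nn.Conv3d(channels, 1, 3, padding=1)  # back to one channel
        self.scale = nn.Parameter(torch.ones(1))          # scaling unit 214b

    def forward(self, expanded: torch.Tensor, prev_intermediate: torch.Tensor):
        # expanded, prev_intermediate: (batch, 1, N, Y, X).
        # Channel concatenation stands in for the z-direction combination.
        h = self.first(torch.cat([expanded, prev_intermediate], dim=1))
        for block in self.blocks:
            # Add the preceding intermediate data back in before each stage
            # (broadcast over the channel dimension).
            h = block(h + prev_intermediate)
        return self.scale * self.proj(h)
```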
[Learning process]
Hereinafter, the learning processing of the neural networks of the dimensional compression unit 121, the dimension expansion unit 212, and the correction unit 214 will be described.
The learning processing of the neural networks of the dimensional compression unit 121, the dimension expansion unit 212, and the correction unit 214 is performed simultaneously.
FIG. 10 is a schematic diagram for explaining the learning processing performed by the video encoding/decoding system 1 according to the first embodiment.
As shown in FIG. 10, first, a data set in which three temporally consecutive input frame groups form one piece of sample data is input as input data. Hereinafter, these three input frame groups are referred to, in temporal order, as S1(x,y,z), S2(x,y,z) (first frame group), and S3(x,y,z) (second frame group).
Next, processing A is performed on each of the input frame groups S1(x,y,z), S2(x,y,z), and S3(x,y,z). Processing A here consists of the dimensional compression processing, the quantization/entropy encoding processing, the entropy decoding processing, and the dimension expansion processing. Intermediate data is thereby generated for each group. Hereinafter, the intermediate data generated from the input frame groups S1(x,y,z), S2(x,y,z), and S3(x,y,z) are referred to as M1(x,y,z), M2(x,y,z) (feature amount of the first frame group), and M3(x,y,z) (feature amount of the second frame group), respectively.
Next, as shown in FIG. 10, correction is performed with M1(x,y,z) and M2(x,y,z) as one set and M2(x,y,z) and M3(x,y,z) as another. Specifically, correction is performed with the expanded decoded data corresponding to the intermediate data M1(x,y,z) and the intermediate data M2(x,y,z) as one set, and with the expanded decoded data corresponding to the intermediate data M2(x,y,z) and the intermediate data M3(x,y,z) as the other. Two decoded frame groups are thereby generated, hereinafter referred to as R2(x,y,z) and R3(x,y,z) (corrected frame groups).
Next, the loss value loss is calculated using the loss function defined by the following equations (1) to (3).
loss = restoration error 1 + restoration error 2 + GAN(concat(R2, R3)) + FM(concat(S2, S3), concat(R2, R3))   ...(1)

restoration error 1 = ΣxΣyΣz(diff(S2(x,y,z), R2(x,y,z))) + ΣxΣyΣz(diff(S3(x,y,z), R3(x,y,z)))   ...(2)

restoration error 2 = ΣxΣyΣz(w(z)*diff(M2(x,y,z), R2(x,y,z))) + ΣxΣyΣz(w(z)*diff(M3(x,y,z), R3(x,y,z)))   ...(3)
Here, diff(a, b) is a function that measures the distance between a and b (for example, the squared error). w(z) is a weight coefficient according to the time direction z, set so that the larger the index z, the heavier the weight; that is, intermediate data corresponding to an input frame group temporally later than the input frame group to be encoded is weighted more heavily in the correction. For example, w(z) = z or w(z) = z^2 is used.
concat() is an operation that concatenates its inputs in the time direction. GAN(x) is a classifier that determines whether the input video x is a true video and outputs the probability; this classifier is constructed as a neural network. FM(a, b) is the error sum (for example, the squared error) over the values of the hidden layers of that neural network when a and b are each input to the classifier.
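As a hedged sketch of equations (1) to (3), the following assumes tensors of shape (X, Y, Z), the squared error as diff, w(z) = z as the example weight, and a discriminator object whose probability output and hidden-layer `features` accessor are assumed interfaces rather than anything fixed by the embodiment.

```python
import torch
import torch.nn.functional as F

def diff(a, b):
    """diff(a, b): the squared error, one choice the text allows."""
    return (a - b) ** 2

def loss_fn(S2, S3, M2, M3, R2, R3, discriminator):
    """Sketch of equations (1)-(3); tensors are assumed to have shape (X, Y, Z)."""
    # Restoration error 1: equation (2).
    err1 = diff(S2, R2).sum() + diff(S3, R3).sum()

    # Restoration error 2: equation (3), with w(z) = z so that temporally
    # later frames are weighted more heavily.
    w = torch.arange(1, S2.shape[-1] + 1, dtype=S2.dtype).view(1, 1, -1)
    err2 = (w * diff(M2, R2)).sum() + (w * diff(M3, R3)).sum()

    # GAN term: binary cross-entropy when the corrected groups, concatenated
    # in the time direction, are presented to the classifier as "true" video.
    fake = torch.cat([R2, R3], dim=-1)
    real = torch.cat([S2, S3], dim=-1)
    p = discriminator(fake)
    gan = F.binary_cross_entropy(p, torch.ones_like(p))

    # FM term: error between the classifier's hidden-layer features for the
    # real and the corrected inputs (features() is an assumed accessor).
    fm = diff(discriminator.features(real), discriminator.features(fake)).sum()

    return err1 + err2 + gan + fm
```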
Next, using the calculated loss value, the parameter values of each unit are updated by the error backpropagation method or the like. Taking the above series of operations as one iteration, learning is performed by repeating it a fixed number of times using a plurality of pieces of sample data, or by repeating it until the loss value converges. The composition of the loss function shown in equations (1) to (3) is one example; a loss function in which only some of the above errors are calculated, or to which different error terms are added, may also be used.
As described above, the flow of the learning processing in the first embodiment is as follows; a sketch of one iteration is given after this list.
1. Prepare three consecutive input frame groups as one sample.
2. Input each sample to the neural network serving as an autoencoder (encoder/decoder) to obtain intermediate data.
3. Obtain the decoded frame groups corresponding to S2(x,y,z) and S3(x,y,z) by the neural network for correction.
4. Perform the loss calculation by adding the following values 1) to 4):
 1) The restoration error between S2(x,y,z) and R2(x,y,z) and the restoration error between S3(x,y,z) and R3(x,y,z).
 2) The weighted restoration error between M2(x,y,z) and R2(x,y,z) and the weighted restoration error between M3(x,y,z) and R3(x,y,z).
 3) The GAN error (the binary cross-entropy error when R2(x,y,z) and R3(x,y,z) are input to the neural network that performs the identification processing).
 4) The FM error (the error between the hidden-layer feature amounts when S2(x,y,z) and S3(x,y,z), and R2(x,y,z) and R3(x,y,z), are each input to the neural network that performs the identification processing).
5. Update each neural network by the error backpropagation method.
Here, the identification processing is processing for identifying whether a video based on the input video data is a true video.
The weighted restoration error of 2) is a term calculated so that a frame group is made continuous with the temporally following adjacent frame group. The GAN error of 3) and the FM error of 4) are terms calculated so that the video based on the decoded video data becomes a more natural output.
As described above, here, from three temporally consecutive input frame groups S1(x,y,z), S2(x,y,z), and S3(x,y,z), the intermediate data M1(x,y,z), M2(x,y,z), and M3(x,y,z) and the corrected frame groups R2(x,y,z) and R3(x,y,z) are generated, and learning is performed so that R2(x,y,z) + R3(x,y,z) becomes natural (that is, has continuity).
However, the configuration is not limited to one in which a data set consisting of three temporally consecutive input frame groups is input; a data set consisting of four or more temporally consecutive input frame groups may be input.
For example, from four temporally consecutive input frame groups S1(x,y,z), S2(x,y,z), S3(x,y,z), and S4(x,y,z), the intermediate data M1(x,y,z), M2(x,y,z), M3(x,y,z), and M4(x,y,z) and the corrected frame groups R2(x,y,z), R3(x,y,z), and R4(x,y,z) may be generated, and learning may be performed so that R2(x,y,z) + R3(x,y,z) + R4(x,y,z) becomes natural (that is, has continuity).
As described above, the video encoding/decoding system 1 according to the first embodiment does not decode the encoded data directly into decoded video data, but stores it in the intermediate data memory 213 as intermediate data. The video encoding/decoding system 1 then performs correction processing on the encoded data to be processed using temporally consecutive surrounding data (intermediate data) and decodes it. Continuity is thereby maintained between the temporally consecutive surrounding data and the encoded data to be processed.
Moreover, in the video encoding/decoding system 1 according to the first embodiment, the data required to decode the encoded data to be processed is only a small amount of surrounding data (in the first embodiment, only the temporally preceding intermediate data). Accordingly, the video encoding/decoding system 1 can perform encoding and decoding of image data with random accessibility and parallelism.
The video encoding/decoding system 1 according to the first embodiment also performs learning using restoration error 2, as described above. Therefore, when M2(x,y,z) shown in FIG. 10 is corrected to R2(x,y,z), for example, the constraint is such that no change occurs in the frames close to R3(x,y,z), in order to maintain the continuity between R2(x,y,z) and R3(x,y,z). That is, the constraint is such that the subjective image quality based on the relationship between S2(x,y,z) and S3(x,y,z), the input frame group temporally later than S2(x,y,z), is increased. According to the video encoding/decoding system 1, the correction is thus performed so that R2(x,y,z) is continuous with R3(x,y,z), which improves the image quality.
Furthermore, in the video encoding/decoding system 1 according to the first embodiment, the neural network serving as the autoencoder (the dimensional compression unit 121 and the dimension expansion unit 212) (first learning model) and the neural network for ensuring continuity (the correction unit 214) (second learning model) are separate neural networks whose learning processing is performed separately, so the learning processing is stable.
<Second embodiment>
Hereinafter, a second embodiment of the present invention will be described with reference to the drawings.
Hereinafter, a video encoding/decoding system according to the second embodiment will be described. The overall configuration of the video encoding/decoding system according to the second embodiment and the configuration of its encoding unit are the same as those of the video encoding/decoding system 1 according to the first embodiment described with reference to FIGS. 1 and 2, and their description is therefore omitted. The video encoding/decoding system 1 according to the first embodiment and the video encoding/decoding system according to the second embodiment described below differ in the configuration of the decoding unit of the video decoding device.
FIG. 11 shows the configuration of the decoding unit 210a of the video decoding device of the video encoding/decoding system according to the second embodiment. Functional blocks having the same functional configuration as in the first embodiment are given the same reference signs, and their description is omitted. As shown in FIG. 11, the decoding unit 210a includes an entropy decoding unit 211, a dimension expansion unit 212, an intermediate data memory 213, a correction unit 214, and a correction processing changeover switch 215.
The difference between the decoding unit 210a according to the second embodiment and the decoding unit 210 according to the first embodiment is that, in addition to the functional configuration of the decoding unit 210, the decoding unit 210a further includes the correction processing changeover switch 215.
The dimension expansion unit 212 outputs the generated expanded decoded data to the intermediate data memory 213 and the correction processing changeover switch 215.
The correction processing changeover switch 215 acquires the expanded decoded data output from the dimension expansion unit 212, and switches between outputting the acquired expanded decoded data as-is to the video combining unit as a decoded frame group and outputting it to the correction unit 214.
The correction unit 214 acquires the expanded decoded data output from the correction processing changeover switch 215 and the intermediate data stored in the intermediate data memory 213. The correction unit 214 generates a decoded frame group by correcting the expanded decoded data using the intermediate data, and outputs the generated decoded frame group to the video combining unit 220.
The operation of the video encoding device according to the second embodiment is the same as that of the video encoding device 10 according to the first embodiment described with reference to FIG. 5, and its description is therefore omitted.
[Operation of video decoding device]
Hereinafter, an example of the operation of the video decoding device according to the second embodiment will be described.
FIG. 12 is a flowchart illustrating the operation of the video decoding device according to the second embodiment.
The entropy decoding unit 211 of the decoding unit 210a acquires the encoded data and generates entropy-decoded data by performing entropy decoding on the acquired encoded data (step S211).
The dimension expansion unit 212 of the decoding unit 210a generates expanded decoded data by restoring the generated entropy-decoded data to its original number of dimensions (before the number of dimensions was reduced by the dimensional compression unit) (step S212).
The intermediate data memory 213 of the decoding unit 210a stores the intermediate data Mi, which is the expanded decoded data generated in step S212 (step S213).
The correction process changeover switch 215 of the decoding unit 210a refers to the expanded decoded data generated by the dimension expansion unit 212 and checks the value of the index i indicating the number of the input frame group. If the value of i is odd (step S214: YES), the correction process changeover switch 215 outputs the acquired expanded decoded data as-is to the video combining unit as a decoded frame group.
The video combining unit generates decoded video data by combining the generated decoded frame groups (step S216).
This concludes the operation of the video decoding device 20 shown in the flowchart of FIG. 12.
On the other hand, if the value of i is even (step S214: NO), the correction process changeover switch 215 outputs the acquired expanded decoded data to the correction unit 214 of the decoding unit 210a. The correction unit 214 corrects the expanded decoded data acquired via the correction process changeover switch 215 using the intermediate data stored in the intermediate data memory 213.
Alternatively, the correction process changeover switch 215 may be configured to output the expanded decoded data as-is to the video combining unit as a decoded frame group when the value of i is even, and to output the expanded decoded data to the correction unit 214 when the value of i is odd.
As described above, the correction process changeover switch 215 applies the correction process to every other piece of acquired expanded decoded data. The purpose of this is as follows.
In the first embodiment, subjective image quality is improved by correcting the frame group to be corrected (Mi) so as to be temporally continuous with the temporally preceding frame group (Mi-1). However, the temporally preceding frame group (Mi-1) is itself corrected based on the frame group preceding it (Mi-2). As a result, the temporally preceding frame group (Mi-1) ends up differing from the frame group that was referred to when the frame group to be corrected (Mi) was processed, so temporal continuity of the final output is not guaranteed.
In contrast, in the second embodiment, corrected frame groups and uncorrected frame groups alternate. Consequently, after a frame group to be corrected has been corrected, the frame groups before and after it do not change from the point in time at which they were referred to, so temporal continuity is ensured.
Here, the correction unit 214 corrects the expanded decoded data to be corrected (the second frame group) using the intermediate data Mi-1 (the first frame group) and the intermediate data Mi+1 (the third frame group). The intermediate data Mi-1 is intermediate data stored in the intermediate data memory 213 before the intermediate data Mi corresponding to the expanded decoded data, and the intermediate data Mi+1 is intermediate data stored in the intermediate data memory 213 after the intermediate data Mi. For example, the correction unit 214 corrects the expanded decoded data corresponding to the intermediate data Mi using the intermediate data Mi-1, which immediately precedes the intermediate data Mi in the temporal direction, and the intermediate data Mi+1, which immediately follows it. Three or more pieces of intermediate data may be used for the correction.
The correction unit 214 corrects the expanded decoded data corresponding to the intermediate data Mi by combining the intermediate data Mi-1 and the intermediate data Mi+1 with it along the z-direction dimension. The correction unit 214 generates decoded frame groups by performing the above processing on all the expanded decoded data (step S215).
The video combining unit generates decoded video data by combining the generated decoded frame groups (step S216).
This concludes the operation of the video decoding device 20 shown in the flowchart of FIG. 12.
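As a concrete illustration of the z-direction combination performed by the correction unit 214, the following sketch assumes that frame groups are held as numpy arrays of shape (x, y, z) and that correction_net is a hypothetical learned mapping that returns an array of the same shape as the frame group being corrected.

import numpy as np

def correct_frame_group(m_prev, m_cur, m_next, correction_net):
    # Combine Mi-1, Mi, and Mi+1 along the z-direction dimension (axis 2),
    # then let the learned network map the combined data to a decoded frame group.
    combined = np.concatenate([m_prev, m_cur, m_next], axis=2)
    return correction_net(combined)  # same (x, y, z) shape as m_cur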
[Learning Process]
The learning process for the neural networks of the dimensional compression unit, the dimension expansion unit, and the correction unit 214 according to the second embodiment is described below. The learning processes for these three neural networks are performed simultaneously.
FIG. 13 is a schematic diagram illustrating the learning process performed by the video encoding/decoding system according to the second embodiment.
As shown in FIG. 13, a data set in which three temporally consecutive input frame groups constitute one sample of data is first input as input data. These three input frame groups are hereinafter denoted, in temporal order, S1(x,y,z), S2(x,y,z), and S3(x,y,z).
Next, processing A is performed on each of the input frame groups S1(x,y,z), S2(x,y,z), and S3(x,y,z). As described above, processing A consists of dimensional compression, quantization/entropy encoding, entropy decoding, and dimensional expansion. Intermediate data is thereby generated for each input frame group. The intermediate data generated from S1(x,y,z), S2(x,y,z), and S3(x,y,z) is hereinafter denoted M1(x,y,z), M2(x,y,z), and M3(x,y,z), respectively.
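A minimal sketch of processing A as a composition of the four stages named above is shown below; each stage is a hypothetical placeholder callable, and only the order of operations is taken from the description.

def process_a(frame_group, dim_compress, quantize_entropy_encode,
              entropy_decode, dim_expand):
    latent = dim_compress(frame_group)      # dimensional compression
    bits = quantize_entropy_encode(latent)  # quantization / entropy encoding
    latent_hat = entropy_decode(bits)       # entropy decoding
    return dim_expand(latent_hat)           # dimensional expansion -> intermediate data M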
Next, as shown in FIG. 13, correction is performed using M1(x,y,z), M2(x,y,z), and M3(x,y,z) as a set. Specifically, the expanded decoded data corresponding to the intermediate data M2(x,y,z) is corrected using the intermediate data M1(x,y,z) and the intermediate data M3(x,y,z) as a set, thereby generating a decoded frame group. The generated decoded frame group is hereinafter denoted R2(x,y,z).
Next, a loss value loss is calculated using the loss function defined by equations (4) and (5) below.
  loss = restoration error 1
       + GAN(concat(M1, R2, M3))
       + FM(concat(S1, S2, S3), concat(M1, R2, M3))    ... (4)
  restoration error 1 = ΣxΣyΣz diff(S1(x,y,z), M1(x,y,z))
                      + ΣxΣyΣz diff(S3(x,y,z), M3(x,y,z))    ... (5)
Here, diff(a,b) is a function that measures the distance between a and b (for example, a squared error). concat() is an operation that connects its inputs in the time direction. GAN(x) is a discriminator that determines whether the input video x is a true video and outputs the probability; the discriminator is constructed as a neural network. FM(a,b) is the sum of errors (for example, squared errors) between the values of the intermediate layers of that discriminator's neural network when a and b are input to it, respectively.
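Read literally, equations (4) and (5) could be computed as in the following sketch, which assumes PyTorch tensors of shape (x, y, z) for each frame group. Here gan and features are hypothetical stand-ins for the discriminator and its intermediate-layer outputs, and the squared error is only one admissible choice of diff.

import torch

def diff(a, b):
    return ((a - b) ** 2).sum()         # one choice of distance: squared error

def concat(*groups):
    return torch.cat(groups, dim=2)     # connect the inputs in the time direction

def loss_fn(S1, S2, S3, M1, R2, M3, gan, features):
    restoration_error_1 = diff(S1, M1) + diff(S3, M3)       # equation (5)
    adversarial = gan(concat(M1, R2, M3))                   # GAN term of equation (4)
    fm = diff(features(concat(S1, S2, S3)),
              features(concat(M1, R2, M3)))                 # feature-matching term FM
    return restoration_error_1 + adversarial + fm           # equation (4)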
Next, using the calculated loss value, the parameter values of each unit are updated by backpropagation or the like. Taking the above sequence as one iteration, learning is performed by repeating it a fixed number of times using multiple samples of data, or by repeating it until the loss value converges. The composition of the loss function shown in equations (4) and (5) is one example; a loss function in which only some of the above errors are calculated, or to which different error terms are added, may also be used.
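One iteration of the learning described above might then look like the following sketch, reusing loss_fn from the previous sketch. Here process_a is assumed to have its four stages already bound, the callables correct, gan, and features are hypothetical modules, and the optimizer is assumed to hold the parameters of all units being trained simultaneously.

def train_step(S1, S2, S3, process_a, correct, gan, features,
               loss_fn, optimizer):
    M1, M2, M3 = process_a(S1), process_a(S2), process_a(S3)  # processing A
    R2 = correct(M1, M2, M3)               # correction of the middle frame group
    loss = loss_fn(S1, S2, S3, M1, R2, M3, gan, features)
    optimizer.zero_grad()
    loss.backward()                        # backpropagation of the loss value
    optimizer.step()                       # update the parameter values of each unit
    return loss.item()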
With the above configuration, the video encoding/decoding system according to the second embodiment can perform encoding and decoding with random accessibility to image data and with parallelism.
As described above, the video encoding/decoding system 1 according to the first embodiment corrects each input frame group independently. Each input is therefore corrected so as to be temporally continuous with the previous output, but how that previous output is itself corrected is unknown. Consequently, in the video encoding/decoding system 1 according to the first embodiment, it may not be possible to reliably ensure that the corrected decoded frame groups are continuous with one another.
In contrast, as explained above, the video encoding/decoding system according to the second embodiment learns so that, for frame groups whose index value is odd (or even), the expanded decoded data itself becomes the decoded frame group, and performs correction so that the remaining frame groups are continuous with them. Since the outputs before and after a frame group subject to correction do not change, the video encoding/decoding system according to the second embodiment can ensure that a corrected decoded frame group is continuous with the decoded frame groups temporally adjacent to it before and after.
Part or all of the video encoding/decoding system in the embodiments described above may be realized by a computer. In that case, it may be realized by recording a program for realizing these functions on a computer-readable recording medium, and having a computer system read and execute the program recorded on the recording medium. The term "computer system" here includes an OS and hardware such as peripheral devices. The term "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The term "computer-readable recording medium" may further include a medium that dynamically holds the program for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as the volatile memory inside a computer system serving as the server or client in that case. The program may realize only some of the functions described above, may realize the functions described above in combination with a program already recorded in the computer system, or may be realized using hardware such as a PLD (Programmable Logic Device) or an FPGA (Field Programmable Gate Array).
Embodiments of the present invention have been described above with reference to the drawings, but these embodiments are merely examples of the present invention, and the present invention is not limited to them. Accordingly, additions, omissions, substitutions, and other modifications of the components may be made without departing from the technical idea and scope of the present invention.
Reference Signs List
1    Video encoding/decoding system
10   Video encoding device
20   Video decoding device
110  Video dividing unit
120  Encoding unit
121  Dimensional compression unit
122  Entropy encoding unit
210  Decoding unit
211  Entropy decoding unit
212  Dimension expansion unit
213  Intermediate data memory
214  Correction unit
220  Video combining unit

Claims (8)

  1.  An image processing apparatus that performs correction for each frame group consisting of a predetermined number of frames into which video data is divided, the apparatus comprising:
     a decoding unit that obtains a corrected frame group by performing correction, using a feature amount of a first frame group, on a second frame group that is a frame group temporally continuous with the first frame group,
     wherein the decoding unit performs the correction so that subjective image quality based on a relationship between the second frame group and a frame group temporally later than the second frame group becomes high, and so that a frame group in which the second frame group and the frame group temporally later than the second frame group are combined, and a frame group in which the corrected frame group and a corrected frame group obtained by correcting the frame group temporally later than the second frame group are combined, are identified as being identical by a predetermined discriminator.
  2.  The image processing apparatus according to claim 1, wherein, in the correction, the decoding unit assigns a heavier weight to a feature amount the temporally later the corresponding frame is relative to the second frame group.
  3.  An image processing apparatus that performs correction for each frame group consisting of a predetermined number of frames into which video data is divided, the apparatus comprising:
     a decoding unit that obtains a corrected frame group by performing correction on a second frame group using a feature amount of a first frame group, which is a frame group temporally preceding the second frame group and temporally continuous with the second frame group, and a feature amount of a third frame group, which is a frame group temporally later than the second frame group and temporally continuous with the second frame group,
     wherein the decoding unit performs the correction so that subjective image quality based on a relationship between the corrected frame group and the first frame group and on a relationship between the corrected frame group and the third frame group becomes high.
  4.  The image processing apparatus according to claim 1 or claim 2, wherein the decoding unit performs the correction based on parameter values updated by a learning process based on frame groups into which video data different from the video data is divided.
  5.  The image processing apparatus according to claim 4, wherein the learning process comprises:
     a step of acquiring sample data consisting of at least three temporally consecutive frame groups;
     a step of inputting each of the sample data into a first learning model to obtain a feature amount of each of the frame groups;
     a step of inputting the feature amounts of the frame groups into a second learning model to obtain the corrected frame group corresponding to the frame groups;
     a step of calculating a loss value based on the sample data, the feature amounts of the frame groups, the corrected frame group, and a predetermined loss function; and
     a step of updating the parameter values using the loss value.
  6.  An image processing apparatus that performs correction for each partial data group consisting of a predetermined number of pieces of partial data into which data is divided, the apparatus comprising:
     a decoding unit that obtains a corrected partial data group by performing correction, using a feature amount of a first partial data group, on a second partial data group that is a partial data group temporally continuous with the first partial data group,
     wherein the decoding unit performs the correction so that subjective image quality based on a relationship between the second partial data group and a partial data group temporally later than the second partial data group becomes high, and so that a partial data group in which the second partial data group and the partial data group temporally later than the second partial data group are combined, and a partial data group in which the corrected partial data group and a corrected partial data group obtained by correcting the partial data group temporally later than the second partial data group are combined, are identified as being identical by a predetermined discriminator.
  7.  An image processing method for performing correction for each frame group consisting of a predetermined number of frames into which video data is divided, the method comprising:
     a step of obtaining a corrected frame group by performing correction, using a feature amount of a first frame group, on a second frame group that is a frame group temporally continuous with the first frame group; and
     a step of performing the correction so that subjective image quality based on a relationship between the second frame group and a frame group temporally later than the second frame group becomes high, and so that a frame group in which the second frame group and the frame group temporally later than the second frame group are combined, and a frame group in which the corrected frame group and a corrected frame group obtained by correcting the frame group temporally later than the second frame group are combined, are identified as being identical by a predetermined discriminator.
  8.  An image processing program for causing a computer to function as the image processing apparatus according to any one of claims 1 to 5.
PCT/JP2019/035631 2018-09-19 2019-09-11 Image processing device, image processing method, and image processing program WO2020059581A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/273,366 US11516515B2 (en) 2018-09-19 2019-09-11 Image processing apparatus, image processing method and image processing program
JP2020548382A JP7104352B2 (en) 2018-09-19 2019-09-11 Image processing device, image processing method and image processing program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-174982 2018-09-19
JP2018174982 2018-09-19

Publications (1)

Publication Number Publication Date
WO2020059581A1 true WO2020059581A1 (en) 2020-03-26

Family

ID=69887000

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/035631 WO2020059581A1 (en) 2018-09-19 2019-09-11 Image processing device, image processing method, and image processing program

Country Status (3)

Country Link
US (1) US11516515B2 (en)
JP (1) JP7104352B2 (en)
WO (1) WO2020059581A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008177648A (en) * 2007-01-16 2008-07-31 Nippon Hoso Kyokai <Nhk> Motion picture data decoding device and program

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2336269B (en) * 1997-12-08 2002-01-16 Sony Corp Encoder and encoding method
US7136508B2 (en) * 2000-11-09 2006-11-14 Minolta Co., Ltd. Image processing apparatus, method, and program for processing a moving image
US7680326B2 (en) * 2004-03-18 2010-03-16 Fujifilm Corporation Method, system, and program for correcting the image quality of a moving image
JP4440051B2 (en) * 2004-09-08 2010-03-24 キヤノン株式会社 Image encoding apparatus and method, computer program, and computer-readable storage medium
JP4867235B2 (en) * 2004-10-26 2012-02-01 ソニー株式会社 Information processing apparatus, information processing method, recording medium, and program
JP4618098B2 (en) * 2005-11-02 2011-01-26 ソニー株式会社 Image processing system
US20080002771A1 (en) * 2006-06-30 2008-01-03 Nokia Corporation Video segment motion categorization
JP4747975B2 (en) * 2006-07-14 2011-08-17 ソニー株式会社 Image processing apparatus and method, program, and recording medium
JP5643574B2 (en) * 2010-08-26 2014-12-17 キヤノン株式会社 Image processing apparatus and image processing method
US10545651B2 (en) * 2013-07-15 2020-01-28 Fox Broadcasting Company, Llc Providing bitmap image format files from media
US10679145B2 (en) * 2015-08-07 2020-06-09 Nec Corporation System and method for balancing computation with communication in parallel learning
US11586960B2 (en) * 2017-05-09 2023-02-21 Visa International Service Association Autonomous learning platform for novel feature discovery
US11528720B2 (en) * 2018-03-27 2022-12-13 Nokia Solutions And Networks Oy Method and apparatus for facilitating resource pairing using a deep Q-network
US11019355B2 (en) * 2018-04-03 2021-05-25 Electronics And Telecommunications Research Institute Inter-prediction method and apparatus using reference frame generated based on deep learning
US10798394B2 (en) * 2018-06-27 2020-10-06 Avago Technologies International Sales Pte. Limited Low complexity affine merge mode for versatile video coding
US11526953B2 (en) * 2019-06-25 2022-12-13 Iqvia Inc. Machine learning techniques for automatic evaluation of clinical trial data
CN110464326B (en) * 2019-08-19 2022-05-10 上海联影医疗科技股份有限公司 Scanning parameter recommendation method, system, device and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008177648A (en) * 2007-01-16 2008-07-31 Nippon Hoso Kyokai <Nhk> Motion picture data decoding device and program

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AKBARI, MOHAMMAD ET AL.: "Semi-Recurrent CNN-Based VAE-GAN for Sequential Data Generation", 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 13 September 2018 (2018-09-13), pages 2321-2325, XP033401002 *
MAHASSENI, BEHROOZ ET AL.: "Unsupervised Video Summarization with Adversarial LSTM Networks", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 9 November 2017 (2017-11-09), pages 2982-2991, XP033249644 *
XIE, JIANWEN ET AL.: "Synthesizing Dynamic Patterns by Spatial-Temporal Generative ConvNet", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 9 November 2017 (2017-11-09), pages 1061-1069 *

Also Published As

Publication number Publication date
JP7104352B2 (en) 2022-07-21
US11516515B2 (en) 2022-11-29
US20210344967A1 (en) 2021-11-04
JPWO2020059581A1 (en) 2021-05-20

Similar Documents

Publication Publication Date Title
US11568571B2 (en) Techniques and apparatus for lossless lifting for attribute coding
CN109451308B (en) Video compression processing method and device, electronic equipment and storage medium
WO2018068532A1 (en) Image encoding and decoding devices, image processing system, image encoding and decoding methods, and training method
US11917205B2 (en) Techniques and apparatus for scalable lifting for point-cloud attribute coding
US11671576B2 (en) Method and apparatus for inter-channel prediction and transform for point-cloud attribute coding
US11551334B2 (en) Techniques and apparatus for coarse granularity scalable lifting for point-cloud attribute coding
JP7434604B2 (en) Content-adaptive online training using image replacement in neural image compression
Ayzik et al. Deep image compression using decoder side information
WO2020261314A1 (en) Image encoding method and image decoding method
JP2019028746A (en) Network coefficient compressing device, network coefficient compressing method and program
CN111641826A (en) Method, device and system for encoding and decoding data
JP7041380B2 (en) Coding systems, learning methods, and programs
CN113747163A (en) Image coding and decoding method and compression method based on context reorganization modeling
JP2023532397A (en) Content-adaptive online training method, apparatus and computer program for post-filtering
Mahmud An improved data compression method for general data
WO2020059581A1 (en) Image processing device, image processing method, and image processing program
JP7274427B2 (en) Method and device for encoding and decoding data streams representing at least one image
US20230026190A1 (en) Signaling of coding tree unit block partitioning in neural network model compression
John Discrete cosine transform in JPEG compression
US11350134B2 (en) Encoding apparatus, image interpolating apparatus and encoding program
Siddeq et al. DCT and DST based Image Compression for 3D Reconstruction
WO2020230188A1 (en) Encoding device, encoding method and program
WO2019225337A1 (en) Encoding device, decoding device, encoding method, decoding method, encoding program and decoding program
JP2024518766A (en) Online training-based encoder tuning in neural image compression
JP6317272B2 (en) Video encoded stream generation method, video encoded stream generation apparatus, and video encoded stream generation program

Legal Events

Code  Title / Description
121   Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19861681; Country of ref document: EP; Kind code of ref document: A1)
ENP   Entry into the national phase (Ref document number: 2020548382; Country of ref document: JP; Kind code of ref document: A)
NENP  Non-entry into the national phase (Ref country code: DE)
122   Ep: pct application non-entry in european phase (Ref document number: 19861681; Country of ref document: EP; Kind code of ref document: A1)