WO2020059581A1 - Image processing device, image processing method, and image processing program - Google Patents

Image processing device, image processing method, and image processing program

Info

Publication number
WO2020059581A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame group
group
frame
unit
data
Prior art date
Application number
PCT/JP2019/035631
Other languages
French (fr)
Japanese (ja)
Inventor
忍 工藤
翔太 折橋
正樹 北原
清水 淳
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to US17/273,366 priority Critical patent/US11516515B2/en
Priority to JP2020548382A priority patent/JP7104352B2/en
Publication of WO2020059581A1 publication Critical patent/WO2020059581A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/423Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation characterised by memory arrangements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/80Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/12Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/154Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/177Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a group of pictures [GOP]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/423Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation characterised by memory arrangements
    • H04N19/426Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation characterised by memory arrangements using memory downsizing methods
    • H04N19/428Recompression, e.g. by spatial or temporal decimation

Definitions

  • the present invention relates to an image processing device, an image processing method, and an image processing program.
  • Priority is claimed on Japanese Patent Application No. 2018-174982 filed on September 19, 2018, the content of which is incorporated herein by reference.
  • One of the methods for encoding an image is a method using an auto encoder (self-encoder).
  • the image referred to here includes a still image and a moving image (hereinafter, referred to as “video”).
  • the auto-encoder is a three-layer neural network including an input layer (encoder), a hidden layer, and an output layer (decoder).
  • Auto-encoders are designed to encode input data into encoded data with an encoder and restore the encoded data into input data with a decoder.
  • the encoder and the decoder are constructed by an arbitrary arithmetic unit.
  • the encoder is configured by a plurality of arithmetic units that perform a convolution operation
  • the decoder is configured by a plurality of arithmetic units that perform an inverse operation to the convolution operation by the encoder.
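  • As an illustration of this structure, the following is a minimal PyTorch sketch of such an auto-encoder (an assumption chosen for readability, not the configuration claimed here): the encoder reduces the input with strided convolutions, and the decoder restores it with the corresponding transposed (inverse) convolutions.

```python
# Hypothetical sketch of a convolutional auto-encoder; layer sizes are assumptions.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: convolution operations that reduce the spatial dimensions.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
        )
        # Decoder: inverse operations to the encoder's convolutions.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        code = self.encoder(x)     # encoded data
        return self.decoder(code)  # restored input
```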
  • In computation by a neural network, expressive ability and performance are expected to improve as the number of parameters increases.
  • However, when the input data is, for example, a high-resolution image, increasing the number of parameters makes the memory capacity required for the computation enormous. Therefore, it is not realistic to improve expressive ability and performance by increasing the number of parameters.
  • Therefore, a method is conceivable in which the input data is divided into a plurality of pieces of data of a computable size, each piece is processed by a neural network, and the output decoded data are combined to restore the original input data.
  • However, in this method, the divided pieces of data are processed independently of one another. Therefore, the restored input data does not maintain continuity between adjacent decoded data, particularly at the boundaries of the divided image, and is likely to be an unnatural image.
  • The above-described prior art has the problem of lacking random accessibility.
  • Random accessibility refers to the property that desired data can easily be obtained even when the data is accessed discretely.
  • In the related art, for example, when the input data is video data, encoding and decoding are performed sequentially from the beginning of the video data. In this case, even when only the decoded data at a desired position of the video data is wanted, the decoded data at that position cannot be obtained unless decoding is performed sequentially from the beginning of the video data.
  • the above-described conventional technology has a problem of lack of parallelism. In the related art, since arithmetic processing is performed recursively, it is difficult to perform parallel processing. For this reason, it is difficult for the conventional technology to efficiently perform arithmetic processing using a distributed processing system or the like.
  • the present invention has been made in view of such a situation, and an object of the present invention is to provide a technology capable of performing encoding and decoding with random access to image data and parallelism.
  • One embodiment of the present invention is an image processing device that performs correction for each frame group composed of a predetermined number of frames into which video data has been divided, the device including a decoding unit that obtains a corrected frame group by performing correction on a second frame group, which is a frame group temporally continuous with a first frame group, using a feature amount of the first frame group. The decoding unit performs the correction so that the subjective image quality based on the relationship between the second frame group and a frame group temporally later than the second frame group is increased, and so that a predetermined classifier determines that a frame group obtained by combining the second frame group and the frame group temporally later than the second frame group is the same as a frame group obtained by combining the corrected frame group and a corrected frame group obtained by correcting the frame group temporally later than the second frame group.
  • One aspect of the present invention is the image processing device described above, wherein, in the correction, the decoding unit gives a heavier weight to the feature amount of a frame the temporally closer the frame is to the frame group temporally later than the second frame group.
  • One embodiment of the present invention is an image processing device that performs correction for each frame group composed of a predetermined number of frames into which video data is divided, the device including a decoding unit that obtains a corrected frame group by performing correction on a second frame group using the feature amount of a first frame group, which is a frame group temporally earlier than and temporally continuous with the second frame group, and the feature amount of a third frame group, which is a frame group temporally later than and temporally continuous with the second frame group.
  • One aspect of the present invention is the above-described image processing device, wherein the decoding unit performs the correction so that the subjective image quality based on the relationship between the corrected frame group and the first frame group and on the relationship between the corrected frame group and the third frame group is increased.
  • One embodiment of the present invention is the above-described image processing device, wherein the decoding unit performs the correction based on parameter values updated by a learning process based on frame groups obtained by dividing video data different from the video data.
  • One embodiment of the present invention is the above-described image processing device, wherein the learning process includes: a step of acquiring sample data including at least three temporally continuous frame groups; a step of inputting each frame group to a first learning model to obtain the feature amount of the frame group; a step of inputting the feature amount of the frame group to a second learning model to obtain a corrected frame group corresponding to the frame group; a step of calculating a loss value based on the sample data, the feature amounts of the frame groups, the corrected frame groups, and a predetermined loss function; and a step of updating the parameter values using the loss value.
  • One aspect of the present invention is an image processing device that performs correction for each partial data group composed of a predetermined number of pieces of partial data into which data has been divided, the device including a decoding unit that obtains a corrected partial data group by performing correction on a second partial data group, which is a partial data group temporally continuous with a first partial data group, using a feature amount of the first partial data group. The decoding unit performs the correction so that the subjective image quality based on the relationship between the second partial data group and a partial data group temporally later than the second partial data group is increased, and so that a predetermined classifier determines that a partial data group obtained by combining the second partial data group and the partial data group temporally later than the second partial data group is the same as a partial data group obtained by combining the corrected partial data group and a corrected partial data group obtained by correcting the partial data group temporally later than the second partial data group.
  • One aspect of the present invention is an image processing method for performing correction for each frame group composed of a predetermined number of frames into which video data is divided, the method including a step of obtaining a corrected frame group by performing correction on a second frame group, which is a frame group temporally continuous with a first frame group, using a feature amount of the first frame group, the correction being performed so that the subjective image quality based on the relationship between the second frame group and a frame group temporally later than the second frame group is increased, and so that a predetermined classifier determines that a frame group obtained by combining the second frame group and the frame group temporally later than the second frame group is the same as a frame group obtained by combining the corrected frame group and a corrected frame group obtained by correcting the frame group temporally later than the second frame group.
  • Another embodiment of the present invention is an image processing program for causing a computer to function as the above image processing device.
  • According to the present invention, encoding and decoding with random accessibility and parallelism can be performed on image data.
  • FIG. 1 is an overall configuration diagram of a video encoding / decoding system 1 according to a first embodiment.
  • FIG. 2 is a configuration diagram of an encoding unit 120 of the video encoding / decoding system 1 according to the first embodiment.
  • FIG. 3 is a configuration diagram of a decoding unit 210 of the video encoding / decoding system 1 according to the first embodiment.
  • FIG. 4 is a configuration diagram of a decoding unit of a video encoding / decoding system according to a conventional technique.
  • FIG. 5 is a flowchart illustrating an operation of the video encoding device 10 according to the first embodiment.
  • FIG. 6 is a configuration diagram of a dimension compression unit 121 of the video encoding / decoding system 1 according to the first embodiment.
  • FIG. 7 is a flowchart illustrating an operation of the video decoding device 20 according to the first embodiment.
  • FIG. 8 is a configuration diagram of a dimension expansion unit 212 of the video encoding / decoding system 1 according to the first embodiment.
  • FIG. 9 is a configuration diagram of a correction unit 214 of the video encoding / decoding system 1 according to the first embodiment.
  • FIG. 10 is a schematic diagram for explaining a learning process performed by the video encoding / decoding system 1 according to the first embodiment.
  • FIG. 11 is a configuration diagram of a decoding unit 210a of a video encoding / decoding system according to a second embodiment.
  • FIG. 12 is a flowchart illustrating an operation of the video decoding device according to the second embodiment.
  • FIG. 13 is a schematic diagram for explaining a learning process performed by the video encoding / decoding system according to the second embodiment.
  • FIG. 14 is a schematic diagram for explaining a learning process performed by a video encoding / decoding system according to the related art.
  • FIG. 15 is a schematic diagram for explaining a learning process performed by a video encoding / decoding system according to the related art.
  • Hereinafter, a video encoding / decoding system 1 that encodes and decodes video data will be described.
  • The system is also applicable to encoding and decoding image data other than video data.
  • FIG. 1 is an overall configuration diagram of a video encoding / decoding system 1 (image processing device) according to the first embodiment.
  • the video encoding / decoding system 1 acquires input video data to be encoded, and outputs decoded video data corresponding to the input video data.
  • the video encoding / decoding system 1 includes a video encoding device 10 and a video decoding device 20.
  • the video encoding device 10 includes a video dividing unit 110 and an encoding unit 120.
  • the video division unit 110 acquires input video data.
  • the input video data is composed of a plurality of temporally continuous frames.
  • the video dividing unit 110 generates a plurality of input frame groups by dividing a plurality of continuous frames constituting the obtained input video data by a predetermined number of frames.
  • the video division unit 110 sequentially outputs the generated plurality of input frame groups to the encoding unit 120.
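  • For illustration, the division into input frame groups could be sketched as follows; this is a hypothetical helper, since the description only requires division by a predetermined number of frames.

```python
import numpy as np

def divide_video(frames: np.ndarray, n: int) -> list:
    """Divide video data of shape (Z, Y, X) into frame groups of n frames each."""
    return [frames[i:i + n] for i in range(0, len(frames), n)]

# Example: 1 second of 30 fps video split into groups of 5 frames.
video = np.zeros((30, 720, 1280), dtype=np.uint8)
groups = divide_video(video, 5)  # 6 input frame groups S_1 ... S_6
```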
  • FIG. 2 shows the configuration of encoding section 120.
  • the encoding unit 120 includes a dimensional compression unit 121 and a quantization / entropy encoding unit 122.
  • the dimensional compression unit 121 acquires the input frame group output from the video division unit 110.
  • the dimension compression unit 121 generates a compressed frame group by compressing the acquired input frame group so as to reduce the number of dimensions.
  • the dimensional compression unit 121 outputs the generated compressed frame group to the quantization / entropy encoding unit 122.
  • the quantization / entropy encoding unit 122 acquires the compressed frame group output from the dimensional compression unit 121.
  • The quantization / entropy coding unit 122 performs quantization and entropy coding on each of the values of the compressed frames constituting the obtained compressed frame group. Then, the quantization / entropy encoding unit 122 generates encoded data by connecting the quantized and entropy-encoded compressed frame groups.
  • the quantization / entropy coding unit 122 outputs the generated coded data to a decoding unit 210 of the video decoding device 20, which will be described later.
  • the video decoding device 20 includes a decoding unit 210 and a video combining unit 220.
  • FIG. 3 shows the configuration of the decoding unit 210.
  • the decoding unit 210 includes an entropy decoding unit 211, a dimension expansion unit 212, an intermediate data memory 213, and a correction unit 214.
  • the entropy decoding unit 211 acquires the encoded data output from the quantization / entropy encoding unit 122 of the encoding unit 120.
  • the entropy decoding unit 211 generates entropy decoded data by performing entropy decoding on the obtained encoded data.
  • the entropy decoding unit 211 outputs the generated entropy decoding data to the dimension expansion unit 212.
  • The dimension expansion unit 212 generates expanded decoded data by expanding the entropy decoded data output from the entropy decoding unit 211 until its number of dimensions becomes the same as that of the input frame group (before compression by the dimension compression unit 121).
  • the dimension expansion unit 212 outputs the generated expanded decoded data to the intermediate data memory 213 and the correction unit 214.
  • the intermediate data memory 213 acquires and stores the decompressed decoded data output from the dimension decompression unit 212. Note that the decompressed decoded data stored in the intermediate data memory 213 is hereinafter referred to as “intermediate data”. The intermediate data is output to the correction unit 214 as needed.
  • the intermediate data memory 213 is a volatile recording medium such as a RAM (Random Access Memory).
  • the correction unit 214 acquires the expanded decoded data output from the dimension expansion unit 212. Further, the correction unit 214 acquires the intermediate data stored in the intermediate data memory 213. The correction unit 214 generates a decoded frame group by correcting the decompressed decoded data using the intermediate data. The correction unit 214 outputs the generated decoded frame group to the video combining unit 220.
  • the video combining unit 220 acquires the decoded frame group output from the decoding unit 210.
  • the video combining unit 220 generates decoded video data by combining the acquired decoded frame groups.
  • the video combining unit 220 outputs the generated decoded video data as final output data.
  • FIG. 4 shows the configuration of the decoding unit of the video encoding / decoding system according to the related art in order to explain the difference from the related art.
  • The difference between the configuration of the decoding unit 210 according to the first embodiment and the configuration of the decoding unit in the related art is that the decoding unit in the related art does not include a correction unit, whereas the decoding unit 210 according to the first embodiment includes the correction unit 214.
  • the dimension expansion unit of the decoding unit in the related art acquires the entropy decoded data output from the entropy decoding unit.
  • the dimension expansion unit in the related art expands the number of dimensions of the acquired entropy decoded data using the intermediate data stored in the intermediate data memory to generate a decoded frame group.
  • the correction unit 214 obtains the expanded decoded data from the dimension expansion unit 212 and obtains the intermediate data from the intermediate data memory 213. Then, the correcting unit 214 generates a decoded frame group by correcting the decompressed decoded data using the intermediate data.
  • FIG. 5 is a flowchart illustrating the operation of the video encoding device 10 according to the first embodiment.
  • the video dividing unit 110 acquires input video data S (x, y, z) in the horizontal direction x, the vertical direction y, and the time direction z.
  • The video dividing unit 110 generates a plurality of input frame groups Si (x, y, z) by dividing the obtained input video data S (x, y, z) into groups of N frames (step S101).
  • the dimensions of x, y, and z are X, Y, and Z, respectively.
  • I is an index indicating the number of the input frame group.
  • Note that the number of frames in each frame group does not necessarily have to be the same.
  • For example, frame groups composed of N frames and frame groups composed of L frames (L is a positive number different from N) may be mixed.
  • For example, a configuration may be employed in which the input video data S (x, y, z) is divided alternately into N frames and L frames, so that frame groups of N frames and frame groups of L frames are generated alternately.
  • The dimension compression unit 121 of the encoding unit 120 generates a compressed frame group by compressing each input frame group Si (x, y, z) to the number of dimensions (X', Y', N') (step S102).
  • Here, the number of dimensions (X', Y', N') satisfies X' × Y' × N' < X × Y × N.
  • the dimension compression unit 121 is configured by a neural network (combination of convolution operation, downsampling, and non-linear conversion) as shown in FIG. 6, for example.
  • FIG. 6 is a configuration diagram of the dimensional compression unit 121 of the video encoding / decoding system 1 according to the first embodiment.
  • the dimensional compression unit 121 is configured by constituent units (first-layer constituent unit 121a-1 to M-th-layer constituent unit 121a-M) composed of M layers.
  • Each component includes a convolutional layer unit c1, a downsampling unit c2, and a non-linear conversion unit c3.
  • The convolution layer section c1 of the first layer configuration section 121a-1 acquires the input frame group output from the video division section 110.
  • the convolution layer section c1 of the first layer configuration section 121a-1 performs a convolution operation on the acquired input frame group.
  • the convolution layer unit c1 outputs the group of frames on which the convolution operation has been performed to the downsampling unit c2.
  • the downsampling unit c2 of the first layer configuration unit 121a-1 acquires the frame group output from the convolutional layer unit c1.
  • the downsampling unit c2 compresses the acquired frame group so as to reduce the number of dimensions.
  • the down-sampling unit c2 outputs the compressed frame group to the nonlinear conversion unit c3.
  • the nonlinear conversion unit c3 of the first layer configuration unit 121a-1 acquires the frame group output from the downsampling unit c2.
  • the non-linear conversion unit c3 performs a non-linear conversion process on the acquired frame group.
  • the non-linear conversion unit c3 outputs the frame group subjected to the non-linear conversion processing to the convolution layer unit c1 of the next layer component (second layer component).
  • By repeating the above processing over the M layers, the dimensional compression unit 121 converts the input frame group input from the video division unit 110 into a compressed frame group with a reduced number of dimensions, and outputs it to the quantization / entropy encoding section 122.
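  • A minimal PyTorch sketch of this M-layer structure is shown below. It assumes 3-D convolutions over (z, y, x), average pooling as the downsampling, and ReLU as the non-linear conversion; the channel counts and number of layers are illustrative assumptions, not values from the description.

```python
import torch.nn as nn

class DimensionalCompressionUnit(nn.Module):
    """Sketch of the M-layer structure of FIG. 6. Each layer applies a
    convolution (c1), downsampling (c2), and a non-linear conversion (c3).
    Channel counts, M, and the pooling choice are assumptions."""

    def __init__(self, m_layers: int = 3, channels: int = 32):
        super().__init__()
        layers, in_ch = [], 1  # assume a single-channel (luminance) input
        for _ in range(m_layers):
            layers += [
                nn.Conv3d(in_ch, channels, kernel_size=3, padding=1),  # convolution layer unit c1
                nn.AvgPool3d(kernel_size=2),                           # downsampling unit c2
                nn.ReLU(),                                             # non-linear conversion unit c3
            ]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, s_i):
        # s_i: input frame group of shape (batch, 1, N, Y, X).
        # Output: compressed frame group with X' * Y' * N' < X * Y * N.
        return self.net(s_i)
```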
  • Next, the quantization / entropy coding unit 122 of the coding unit 120 performs quantization and entropy coding on each compressed frame group. Then, the quantization / entropy encoding unit 122 generates encoded data by connecting the quantized and entropy-encoded compressed frame groups (step S103).
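  • The description does not fix the quantizer or entropy coder to a particular method. As a concrete stand-in sketch, uniform quantization followed by a general-purpose DEFLATE coder (zlib, whose Huffman stage performs entropy coding) could look like this; the function names and step size are assumptions.

```python
import numpy as np
import zlib

def quantize_and_encode(compressed: np.ndarray, step: float = 0.05) -> bytes:
    """Uniform quantization followed by a general-purpose entropy coder."""
    q = np.round(compressed / step).astype(np.int16)  # quantization
    return zlib.compress(q.tobytes())                 # entropy coding (DEFLATE)

def decode_and_dequantize(data: bytes, shape: tuple, step: float = 0.05) -> np.ndarray:
    """Inverse of the above: entropy decoding followed by dequantization."""
    q = np.frombuffer(zlib.decompress(data), dtype=np.int16).reshape(shape)
    return q.astype(np.float32) * step
```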
  • the operation of the video encoding device 10 shown in the flowchart of FIG. 5 ends.
  • FIG. 7 is a flowchart illustrating the operation of the video decoding device 20 according to the first embodiment.
  • the entropy decoding unit 211 of the decoding unit 210 acquires encoded data.
  • the entropy decoding unit 211 generates entropy decoded data by performing entropy decoding on the obtained encoded data (step S111).
  • the dimension expansion unit 212 of the decoding unit 210 generates expanded decoded data by restoring the generated entropy decoded data to the original number of dimensions (before the number of dimensions is reduced by the dimensional compression unit 121). (Step S112).
  • the dimension expansion unit 212 is configured by a neural network (combination of deconvolution operation and non-linear conversion) as shown in FIG. 8, for example.
  • FIG. 8 is a configuration diagram of the dimension expansion unit 212 of the video encoding / decoding system 1 according to the first embodiment.
  • the dimensional extension unit 212 is configured by a configuration unit composed of M layers (first layer configuration unit 212a-1 to Mth layer configuration unit 212a-M). Each component includes a deconvolution layer unit c4 and a non-linear conversion unit c5.
  • the deconvolution layer unit c4 of the first layer configuration unit 212a-1 acquires the entropy decoded frame group output from the entropy decoding unit 211.
  • the deconvolution layer unit c4 performs a deconvolution operation on the obtained entropy-decoded frame group.
  • the deconvolution layer unit c4 outputs the group of frames on which the deconvolution operation has been performed to the nonlinear conversion unit c5.
  • the nonlinear conversion unit c5 of the first layer configuration unit 212a-1 acquires the frame group output from the deconvolution layer unit c4.
  • the non-linear conversion unit c5 performs a non-linear conversion process on the acquired frame group.
  • the non-linear conversion unit c5 outputs the frame group on which the non-linear conversion processing has been performed to the deconvolution layer unit c4 of the configuration unit (second layer configuration unit) of the next layer.
  • By repeating the above processing over the M layers, the dimension expansion unit 212 converts the entropy decoded frame group output from the entropy decoding unit 211 into expanded decoded data in which the number of dimensions has been restored, and outputs the data to the intermediate data memory 213 and the correction unit 214.
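  • A minimal PyTorch sketch of this M-layer structure is shown below, assuming transposed 3-D convolutions as the deconvolution and ReLU as the non-linear conversion; channel counts and M are illustrative assumptions.

```python
import torch.nn as nn

class DimensionExpansionUnit(nn.Module):
    """Sketch of the M-layer structure of FIG. 8. Each layer applies a
    deconvolution (c4) and a non-linear conversion (c5)."""

    def __init__(self, m_layers: int = 3, channels: int = 32):
        super().__init__()
        layers, in_ch = [], channels
        for i in range(m_layers):
            out_ch = 1 if i == m_layers - 1 else channels
            layers += [
                # deconvolution layer unit c4: doubles each of (z, y, x)
                nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2),
                # non-linear conversion unit c5 (omitted on the last layer)
                nn.ReLU() if i < m_layers - 1 else nn.Identity(),
            ]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, entropy_decoded):
        # Restores the number of dimensions reduced by the compression unit.
        return self.net(entropy_decoded)
```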
  • the intermediate data memory 213 of the decoding unit 210 stores the intermediate data Mi, which is the decompressed decoded data generated in step S112 (step S113).
  • the correction unit 214 of the decoding unit 210 corrects the expanded decoded data acquired from the dimensional expansion unit 212 using the intermediate data Mi stored in the intermediate data memory 213.
  • Specifically, the correction unit 214 corrects the expanded decoded data to be corrected using the intermediate data Mi-1, which is the intermediate data stored in the intermediate data memory 213 before the intermediate data corresponding to that expanded decoded data.
  • That is, the correction unit 214 corrects the decompressed decoded data corresponding to the intermediate data Mi using the intermediate data Mi-1, which is the intermediate data immediately preceding the intermediate data Mi in the time direction.
  • the number of intermediate data used for correction may be two or more.
  • For example, the correction unit 214 corrects the decompressed decoded data corresponding to the intermediate data Mi by combining it with the intermediate data Mi-1 along the z (time) dimension.
  • the correction unit 214 generates a decoded frame group by performing the above-described processing on all the decompressed decoded data (step S114).
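  • For illustration, the combination of the data to be corrected with the preceding intermediate data along the z dimension could be sketched as follows; the tensor layout (batch, channel, z, y, x) is an assumption.

```python
import torch

def correction_input(expanded_i: torch.Tensor, m_prev: torch.Tensor) -> torch.Tensor:
    """Combine the data to be corrected with the intermediate data M_{i-1}
    along the z (time) dimension."""
    return torch.cat([m_prev, expanded_i], dim=2)  # dim=2 is the z axis
```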
  • the reason why the correction process is performed by the correction unit 214 is as follows. Since encoding is performed for each frame group composed of frames in the time direction z, subjective continuity may not be ensured between frame groups that are temporally close to or adjacent to each other. Therefore, in order to ensure continuity, a correction process is performed on the decompressed decoded data using intermediate data temporally close to or adjacent to the decompressed decoded data. By providing continuity, the subjective image quality of the decoded video obtained by combining the frame groups is improved.
  • the video combining unit 220 generates decoded video data by combining the generated decoded frame groups (Step S115). Thus, the operation of the video decoding device 20 shown in the flowchart of FIG. 7 ends.
  • the correction unit 214 is configured by, for example, a neural network (combination of convolution operation and non-linear conversion and scaling processing) as shown in FIG.
  • FIG. 9 is a configuration diagram of the correction unit 214 of the video encoding / decoding system 1 according to the first embodiment.
  • the correction unit 214 includes a configuration unit including M layers (first layer configuration unit 214a-1 to Mth layer configuration unit 214a-M) and a scaling unit 214b.
  • Each component is configured by a convolutional layer unit c6 and a non-linear conversion unit c7.
  • the convolutional layer section c6 of the first layer configuration section 214a-1 acquires the expanded decoded data output from the dimensional expansion section 212 and the intermediate data stored in the intermediate data memory 213.
  • the convolution layer unit c6 performs a convolution operation on the obtained decompressed decoded data.
  • the convolution layer unit c6 outputs the frame group on which the convolution operation has been performed to the nonlinear conversion unit c7.
  • the nonlinear conversion unit c7 of the first layer configuration unit 214a-1 acquires the frame group output from the convolution layer unit c6.
  • The non-linear conversion unit c7 performs non-linear conversion processing on the acquired frame group.
  • The non-linear conversion unit c7 outputs the frame group subjected to the non-linear conversion processing. Data obtained by adding the frame group output from the non-linear conversion unit c7 and the temporally immediately preceding intermediate data is input to the convolution layer unit c6 of the configuration unit of the next layer (second layer configuration unit).
  • the correction unit 214 performs scaling by a scaling unit 214b on a frame group obtained by repeating the above processing from the first layer to the Mth layer. Through the above processing, the correction unit 214 corrects the decompressed decoded data output from the dimensional decompression unit 212 with the intermediate data stored in the intermediate data memory 213, and outputs a decoded frame group that is the corrected decompressed decoded data to the video. Output to the combining unit 220.
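  • The following is a minimal PyTorch sketch of this correction structure. It assumes the re-injection of the preceding intermediate data between layers is an elementwise addition and that the scaling is a learnable scalar; these choices and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CorrectionUnit(nn.Module):
    """Sketch of FIG. 9: M layers of convolution (c6) + non-linear
    conversion (c7), with the temporally preceding intermediate data
    added back between layers, followed by scaling (214b)."""

    def __init__(self, m_layers: int = 3, channels: int = 1):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=3, padding=1),  # c6
                nn.ReLU(),                                                # c7
            )
            for _ in range(m_layers)
        ])
        self.scale = nn.Parameter(torch.ones(1))  # scaling unit 214b (assumed learnable)

    def forward(self, expanded, m_prev):
        # expanded: decompressed decoded data; m_prev: intermediate data
        # M_{i-1}; both of shape (batch, channels, z, y, x).
        h = expanded
        for layer in self.layers:
            h = layer(h) + m_prev  # re-inject the preceding intermediate data
        return self.scale * h      # decoded frame group
```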
  • FIG. 10 is a schematic diagram for explaining a learning process performed by the video encoding / decoding system 1 according to the first embodiment.
  • In the learning process, a data set including three temporally continuous input frame groups is input as one piece of sample data.
  • In temporal order, these three input frame groups are referred to as S1 (x, y, z), S2 (x, y, z) (first frame group), and S3 (x, y, z) (second frame group).
  • the process A is performed on each of the input frame groups S1 (x, y, z), S2 (x, y, z), and S3 (x, y, z).
  • the processing A is a dimension compression processing, a quantization / entropy encoding processing, an entropy decoding processing, and a dimension expansion processing.
  • intermediate data is generated.
  • The intermediate data generated from the input frame groups S1 (x, y, z), S2 (x, y, z), and S3 (x, y, z) are referred to as M1 (x, y, z), M2 (x, y, z) (the feature amount of the first frame group), and M3 (x, y, z) (the feature amount of the second frame group), respectively.
  • Next, M1 (x, y, z) and M2 (x, y, z), and M2 (x, y, z) and M3 (x, y, z), are each treated as a set.
  • That is, the decompressed decoded data corresponding to the intermediate data M2 (x, y, z) is corrected using the intermediate data M1 (x, y, z), and the decompressed decoded data corresponding to the intermediate data M3 (x, y, z) is corrected using the intermediate data M2 (x, y, z).
  • two decoded frame groups are generated.
  • each decoded frame group is referred to as R2 (x, y, z) and R3 (x, y, z) (corrected frame group).
  • The loss value loss is calculated using a loss function defined by the following equations (1) to (3).

Restoration error 1 = Σ_x Σ_y Σ_z diff(S2(x, y, z), R2(x, y, z)) + Σ_x Σ_y Σ_z diff(S3(x, y, z), R3(x, y, z)) … (1)

Restoration error 2 = Σ_x Σ_y Σ_z w(z) · diff(M2(x, y, z), R2(x, y, z)) + Σ_x Σ_y Σ_z w(z) · diff(M3(x, y, z), R3(x, y, z)) … (2)

loss = restoration error 1 + restoration error 2 + GAN error + FM error … (3)
  • diff (a, b) is a function (for example, a square error or the like) for measuring the distance between a and b.
  • concat() is an operation for connecting each input in the time direction.
  • GAN (x) is a discriminator that determines whether the input video x is a true video and outputs the probability.
  • the discriminator is constructed by a neural network.
  • FM (a, b) is the sum of the errors (for example, squared errors) between the values of the intermediate layers of the neural network when a and b are each input to the discriminator.
  • Then, the parameter values of each unit are updated by backpropagation or the like. Learning is performed by repeating the above series of steps a fixed number of times using a plurality of sample data, or by repeating it until the loss value converges. Note that the configuration of the loss functions shown in equations (1) to (3) is merely an example; a loss function in which only some of the errors are calculated, or a loss function to which a different error term is added, may also be used.
  • The flow of the learning process in the first embodiment is as follows. 1. Three temporally continuous input frame groups are prepared as one sample. 2. Each sample is input to the neural network serving as an auto-encoder (encoder / decoder) to obtain intermediate data. 3. The decoded video data corresponding to S2 (x, y, z) and S3 (x, y, z) are obtained by the neural network for correction. 4. The loss is calculated by adding the following values 1) to 4): 1) the restoration error between S2 (x, y, z) and R2 (x, y, z) and the restoration error between S3 (x, y, z) and R3 (x, y, z); 2) the weighted restoration error between M2 (x, y, z) and R2 (x, y, z) and between M3 (x, y, z) and R3 (x, y, z); 3) the GAN error (a binary cross-entropy error when R2 (x, y, z) and R3 (x, y, z) are input to the neural network that performs the identification processing); 4) the FM error (the error of the feature amounts of the hidden layers when S2 (x, y, z) and S3 (x, y, z), and R2 (x, y, z) and R3 (x, y, z), are input to the neural network that performs the identification processing). 5. Each neural network is updated by the backpropagation method.
  • the identification processing is processing for identifying whether or not a video based on the input video data is a true video.
  • The weighted restoration error of 2) is a term calculated so that the corrected frame group is temporally continuous with the adjacent frame group.
  • The GAN error of 3) and the FM error of 4) are terms calculated so that the video based on the decoded video data becomes a more natural output.
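  • Under the assumptions that diff() is the squared error, that w(z) is a per-frame weight, and that the discriminator outputs a probability, the loss of equations (1) to (3) could be sketched as follows (the FM term is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def loss_fn(S2, S3, R2, R3, M2, M3, w, discriminator):
    """Sketch of equations (1) to (3). Tensors have shape
    (batch, channel, z, y, x); w has shape (z,)."""
    wz = w.view(1, 1, -1, 1, 1)  # broadcast the per-frame weight w(z)

    # Restoration error 1 ... (1)
    rest1 = F.mse_loss(R2, S2, reduction="sum") + F.mse_loss(R3, S3, reduction="sum")
    # Restoration error 2 ... (2)
    rest2 = (wz * (R2 - M2) ** 2).sum() + (wz * (R3 - M3) ** 2).sum()

    # GAN error: binary cross-entropy when the corrected frame groups,
    # connected in the time direction, are input to the discriminator.
    fake = discriminator(torch.cat([R2, R3], dim=2))
    gan = F.binary_cross_entropy(fake, torch.ones_like(fake))

    return rest1 + rest2 + gan  # ... (3); the FM term is omitted here
```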
  • In this way, from the three temporally continuous input frame groups S1 (x, y, z), S2 (x, y, z), and S3 (x, y, z), the intermediate data M1 (x, y, z), M2 (x, y, z), and M3 (x, y, z) and the decoded frame groups R2 (x, y, z) and R3 (x, y, z) are generated, and learning is performed so that the combination of R2 (x, y, z) and R3 (x, y, z) becomes natural (i.e., has continuity).
  • The configuration is not limited to one in which a data set including three temporally continuous input frame groups is input; a configuration in which a data set including four or more temporally continuous input frame groups is input may also be adopted.
  • For example, from four temporally continuous input frame groups S1 (x, y, z), S2 (x, y, z), S3 (x, y, z), and S4 (x, y, z), the intermediate data M1 (x, y, z), M2 (x, y, z), M3 (x, y, z), and M4 (x, y, z) and the decoded frame groups R2 (x, y, z), R3 (x, y, z), and R4 (x, y, z) may be generated, and learning may be performed so that the combination of R2 (x, y, z), R3 (x, y, z), and R4 (x, y, z) becomes natural (i.e., has continuity).
  • the video encoding / decoding system 1 stores encoded data in the intermediate data memory 213 as intermediate data instead of decoding the encoded data directly into decoded video data. Then, the video encoding / decoding system 1 performs a correction process on the encoded data to be processed using surrounding data (intermediate data) that is temporally continuous, and decodes the encoded data. As a result, continuity between temporally continuous surrounding data and encoded data to be processed is maintained.
  • the video encoding / decoding system 1 can perform encoding and decoding with random access to image data and parallelism.
  • The video encoding / decoding system 1 performs learning using the restoration error 2 as described above. Therefore, for example, when M2 (x, y, z) shown in FIG. 10 is corrected to R2 (x, y, z), the frames of R2 (x, y, z) close to R3 (x, y, z) are kept almost unchanged, so that continuity between R2 (x, y, z) and R3 (x, y, z) is maintained. That is, the constraint condition is that the subjective image quality based on the relationship between S2 (x, y, z) and S3 (x, y, z), which is the input frame group temporally later than S2 (x, y, z), is increased. Thereby, according to the video encoding / decoding system 1, R2 (x, y, z) is corrected so as to be continuous with R3 (x, y, z), and the image quality is improved.
  • In addition, the neural network serving as the auto-encoder (the first learning model; the dimensional compression unit 121 and the dimension expansion unit 212) and the neural network for correction (the second learning model; the correction unit 214) are separate neural networks, and their learning processes are performed separately, so that the learning process is stabilized.
  • The overall configuration of the video encoding / decoding system according to the second embodiment and the configuration of its encoding unit are the same as the overall configuration of the video encoding / decoding system 1 and the configuration of the encoding unit 120 according to the first embodiment described with reference to FIGS. 1 and 2, and a description thereof will be omitted.
  • the video encoding / decoding system 1 according to the first embodiment differs from the video encoding / decoding system according to the second embodiment described below in the configuration of the decoding unit included in the video decoding device.
  • FIG. 11 illustrates a configuration of the decoding unit 210a included in the video decoding device of the video encoding / decoding system according to the second embodiment.
  • The decoding unit 210a includes an entropy decoding unit 211, a dimension expansion unit 212, an intermediate data memory 213, a correction unit 214, and a correction processing changeover switch 215.
  • The difference between the decoding unit 210a according to the second embodiment and the decoding unit 210 according to the first embodiment is that the decoding unit 210a further includes the correction processing changeover switch 215 in addition to the functional configuration of the decoding unit 210.
  • The dimension expansion unit 212 outputs the generated decompressed decoded data to the intermediate data memory 213 and the correction processing changeover switch 215, respectively.
  • the correction process changeover switch 215 acquires the expanded decoded data output from the dimensional expansion unit 212.
  • the correction processing changeover switch 215 switches whether to output the obtained decompressed decoded data as a decoded frame group to the video combining unit or to the correcting unit 214.
  • the correction unit 214 acquires the decompressed decoded data output from the correction processing changeover switch 215. Further, the correction unit 214 acquires the intermediate data stored in the intermediate data memory 213. The correction unit 214 generates a decoded frame group by correcting the decompressed decoded data using the intermediate data. The correction unit 214 outputs the generated decoded frame group to the video combining unit 220.
  • the operation of the video encoding device according to the second embodiment is the same as the operation of the video encoding device 10 according to the first embodiment described with reference to FIG. Therefore, description of the operation of the video encoding device according to the second embodiment will be omitted.
  • FIG. 12 is a flowchart illustrating the operation of the video decoding device according to the second embodiment.
  • the entropy decoding unit 211 of the decoding unit 210a acquires the encoded data.
  • the entropy decoding unit 211 generates entropy decoded data by performing entropy decoding on the acquired encoded data (step S211).
  • the dimension expansion unit 212 of the decoding unit 210a generates expanded decoded data by restoring the generated entropy decoded data to the original number of dimensions (before the number of dimensions is reduced by the dimensional compression unit) ( Step S212).
  • the intermediate data memory 213 of the decoding unit 210a stores the intermediate data Mi, which is the decompressed decoded data generated in step S212 (step S213).
  • the correction processing changeover switch 215 of the decoding unit 210a checks the value of the index i indicating the number of the input frame group by referring to the expanded decoded data generated by the dimension expanding unit 212. If the value of i is an odd number (step S214: YES), the correction processing changeover switch 215 outputs the obtained decompressed decoded data as it is to the video combining unit as a decoded frame group.
  • The video combining unit generates decoded video data by combining the generated decoded frame groups (step S216). This is the end of the operation of the video decoding device shown in the flowchart of FIG. 12.
  • step S214 if the value of i is an even number (step S214: NO), the correction process changeover switch 215 outputs the obtained decompressed decoded data to the correction unit 214 of the decoding unit 210a.
  • the correction unit 214 corrects the decompressed and decoded data obtained via the correction processing changeover switch 215 using the intermediate data Mi stored in the intermediate data memory 213.
  • Note that, conversely, the correction processing changeover switch 215 may output the decompressed decoded data as it is to the video combining unit as a decoded frame group when the value of i is an even number, and output the decompressed decoded data to the correction unit 214 when the value of i is an odd number.
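  • The switching logic itself is simple; a sketch is shown below, with hypothetical container types for the intermediate data memory and the video combining unit, and assuming that the neighboring intermediate data Mi-1 and Mi+1 are already stored when an even-indexed group is corrected.

```python
def route_frame_group(i, expanded, correction_unit, memory, outputs):
    """Sketch of the correction processing changeover switch 215.
    `memory` maps the index i to intermediate data Mi; `outputs` is a
    list consumed by the video combining unit (both hypothetical)."""
    if i % 2 == 1:
        outputs.append(expanded)  # odd index: output as-is (step S214: YES)
    else:
        corrected = correction_unit(expanded, memory[i - 1], memory[i + 1])
        outputs.append(corrected)  # even index: corrected decoded frame group
```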
  • The purpose of the correction processing changeover switch 215 performing the correction process on every other piece of acquired decompressed decoded data is as follows.
  • In the first embodiment, the subjective image quality is improved by correcting the frame group (Mi) to be corrected so that it is temporally continuous with the temporally preceding frame group (Mi-1).
  • However, the temporally preceding frame group (Mi-1) is itself further corrected based on the frame group (Mi-2) preceding it. Therefore, the temporally preceding frame group (Mi-1) differs from what it was at the time the frame group to be corrected (Mi) referred to it, and temporal continuity of the final output is not guaranteed.
  • In contrast, the second embodiment has a configuration in which frame groups to be corrected and frame groups not to be corrected alternate in time.
  • In this configuration, the frame groups before and after a frame group to be corrected do not change after the point in time when they are referred to, so that temporal continuity is ensured.
  • In the second embodiment, the correction unit 214 corrects the expanded decoded data (second frame group) to be corrected using the intermediate data Mi-1 (first frame group) and the intermediate data Mi+1 (third frame group).
  • the intermediate data Mi-1 is intermediate data stored in the intermediate data memory 213 prior to the intermediate data Mi corresponding to the decompressed decoded data.
  • the intermediate data Mi + 1 is intermediate data stored in the intermediate data memory 213 after the intermediate data Mi corresponding to the decompressed decoded data.
  • That is, the correction unit 214 corrects the decompressed decoded data corresponding to the intermediate data Mi using the intermediate data Mi-1, which is the intermediate data immediately preceding the intermediate data Mi in the time direction, and the intermediate data Mi+1, which is the intermediate data immediately following the intermediate data Mi in the time direction.
  • the number of intermediate data used for correction may be three or more.
  • For example, the correction unit 214 corrects the decompressed decoded data corresponding to the intermediate data Mi by combining it with the intermediate data Mi-1 and the intermediate data Mi+1 along the z (time) dimension.
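  • For illustration, the combination along the z dimension in the second embodiment could be sketched as follows; the tensor layout (batch, channel, z, y, x) is an assumption.

```python
import torch

def correction_input_2nd(expanded_i: torch.Tensor,
                         m_prev: torch.Tensor,
                         m_next: torch.Tensor) -> torch.Tensor:
    """Combine M_{i-1} and M_{i+1} with the data to be corrected along
    the z (time) dimension."""
    return torch.cat([m_prev, expanded_i, m_next], dim=2)  # dim=2 is the z axis
```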
  • the correction unit 214 generates a decoded frame group by performing the above-described processing on all the decompressed decoded data (step S215).
  • The video combining unit generates decoded video data by combining the generated decoded frame groups (step S216). This is the end of the operation of the video decoding device shown in the flowchart of FIG. 12.
  • FIG. 13 is a schematic diagram illustrating a learning process performed by the video encoding / decoding system according to the second embodiment.
  • In the learning process of the second embodiment, a data set including three temporally continuous input frame groups is input as one piece of sample data.
  • these three input frame groups are referred to as S1 (x, y, z), S2 (x, y, z), and S3 (x, y, z) in order of time.
  • the process A is performed on each of the input frame groups S1 (x, y, z), S2 (x, y, z), and S3 (x, y, z).
  • the process A is a dimensional compression process, a quantization / entropy encoding process, an entropy decoding process, and a dimensional expansion process.
  • intermediate data is generated.
  • The intermediate data generated from the input frame groups S1 (x, y, z), S2 (x, y, z), and S3 (x, y, z) are referred to as M1 (x, y, z), M2 (x, y, z), and M3 (x, y, z), respectively.
  • Next, correction is performed using M1 (x, y, z), M2 (x, y, z), and M3 (x, y, z) as a set.
  • That is, the decompressed decoded data corresponding to the intermediate data M2 (x, y, z) is corrected using the intermediate data M1 (x, y, z) and the intermediate data M3 (x, y, z).
  • a decoded frame group is generated.
  • the generated decoded frame group is defined as R2 (x, y, z).
  • The loss value loss is calculated using a loss function defined by the following equations (4) and (5).

Restoration error 1 = Σ_x Σ_y Σ_z diff(S1(x, y, z), M1(x, y, z)) + Σ_x Σ_y Σ_z diff(S3(x, y, z), M3(x, y, z)) … (5)
  • diff (a, b) is a function (for example, a square error or the like) for measuring the distance between a and b.
  • concat () is an operation of connecting each input in the time direction.
  • GAN (x) is a discriminator that determines whether the input video x is a true video and outputs the probability.
  • the discriminator is constructed by a neural network.
  • FM (a, b) is the sum of the errors (for example, squared errors) between the values of the intermediate layers of the neural network when a and b are each input to the discriminator.
  • Then, the parameter values of each unit are updated by backpropagation or the like. Learning is performed by repeating the above series of steps a fixed number of times using a plurality of sample data, or by repeating it until the loss value converges. Note that the configuration of the loss functions shown in equations (4) and (5) is an example; a loss function in which only some of the errors are calculated, or a loss function to which a different error term is added, may also be used.
  • the video encoding / decoding system can perform encoding and decoding with random access to image data and parallelism.
  • The video encoding / decoding system 1 according to the first embodiment corrects each input frame group independently. Therefore, in the video encoding / decoding system 1 according to the first embodiment, although each input is corrected so as to be temporally continuous with the previous output, how the previous output itself has been corrected is unknown. For this reason, in the video encoding / decoding system 1 according to the first embodiment, it may not be possible to reliably ensure that the corrected decoded frame groups have continuity.
  • In the video encoding / decoding system according to the second embodiment, learning is performed such that, for frame groups whose index value is odd (or even), the decompressed decoded data itself is used as the decoded frame group, and frame groups whose index value is not odd (or even) are corrected so as to be continuous with those frame groups. As a result, the outputs before and after a frame group to be corrected do not change. Therefore, the video encoding / decoding system according to the second embodiment can ensure that a corrected decoded frame group and the decoded frame groups temporally adjacent before and after it have continuity.
  • a part or all of the video encoding / decoding system in the above-described embodiment may be realized by a computer.
  • a program for realizing this function may be recorded on a computer-readable recording medium, and the program recorded on this recording medium may be read and executed by a computer system.
  • the “computer system” includes an OS and hardware such as peripheral devices.
  • the “computer-readable recording medium” refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, and a CD-ROM, and a storage device such as a hard disk built in a computer system.
  • a "computer-readable recording medium” refers to a communication line for transmitting a program via a network such as the Internet or a communication line such as a telephone line, and dynamically holds the program for a short time.
  • a program may include a program that holds a program for a certain period of time, such as a volatile memory in a computer system serving as a server or a client in that case.
  • Further, the program may realize a part of the functions described above, may realize the functions described above in combination with a program already recorded in the computer system, or may be realized using hardware such as a PLD (Programmable Logic Device) or an FPGA (Field Programmable Gate Array).
  • Reference Signs List: 1 video encoding / decoding system; 10 video encoding device; 20 video decoding device; 110 video dividing unit; 120 encoding unit; 121 dimensional compression unit; 122 quantization / entropy encoding unit; 210 decoding unit; 211 entropy decoding unit; 212 dimension expansion unit; 213 intermediate data memory; 214 correction unit; 220 video combining unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

An image processing device which performs correction for each frame group composed of a predetermined number of frames into which picture data is divided is provided with a decoding unit which, using a feature quantity of a first frame group, performs correction with respect to a second frame group that is a frame group temporally successive to the first frame group, thereby obtaining a corrected frame group. The decoding unit performs the correction so that the subjective image quality based on the relationship between the second frame group and a frame group temporally subsequent to the second frame group is increased, and so that a predetermined identifying unit identifies that a frame group in which the second frame group and the frame group temporally subsequent to the second frame group are combined is identical to a frame group in which the corrected frame group and a corrected frame group obtained by correcting the frame group temporally subsequent to the second frame group are combined.

Description

Image processing apparatus, image processing method, and image processing program

The present invention relates to an image processing device, an image processing method, and an image processing program.
Priority is claimed on Japanese Patent Application No. 2018-174982, filed on September 19, 2018, the content of which is incorporated herein by reference.
One method of encoding an image uses an autoencoder (self-encoder). The term “image” here includes both still images and moving images (hereinafter referred to as “video”). An autoencoder is a three-layer neural network consisting of an input layer (encoder), a hidden layer, and an output layer (decoder). An autoencoder is designed so that the encoder encodes input data into encoded data and the decoder restores the encoded data to the input data. The encoder and the decoder are constructed from arbitrary arithmetic units. For example, when the input data is an image, the encoder is constructed from a plurality of units that perform convolution operations, and the decoder from a plurality of units that perform the inverse of the encoder's convolution operations.
In computation by a neural network, increasing the number of parameters is expected to improve expressive power and performance. However, when the input data is, for example, a high-resolution image, increasing the number of parameters makes the memory capacity required for the computation enormous. It is therefore not realistic to improve expressive power and performance by increasing the number of parameters.
Therefore, as shown in FIG. 14, for example, a method is conceivable in which the input data is divided into a plurality of pieces of data of a size that can be processed, each piece is processed by the neural network, and the output decoded data are combined to restore the original input data. In this method, however, each piece of divided data is processed independently of the others. As a result, the restored input data does not maintain continuity between adjacent pieces of decoded data, particularly at the boundaries of the divided images, and is likely to look unnatural.
In contrast, as shown in FIG. 15, for example, there is a conventional technique in which the surrounding decoded data, together with the data to be processed, is recursively input to the encoder, the decoder, or both. By recursively inputting the surrounding decoded data in this way, the continuity between the data to be processed and the surrounding decoded data is taken into account, and more natural restored data can be obtained.
However, the above conventional technique has the problem of lacking random accessibility. Random accessibility here refers to the property that desired data can easily be obtained even when the data is accessed discretely. In the conventional technique, when the input data is video data, for example, encoding and decoding are performed in order from the beginning of the video data. In this case, even when only the decoded data at a desired position in the video data is wanted, the decoded data at that position cannot be obtained unless decoding is performed in order from the beginning of the video data.
The above conventional technique also has the problem of lacking parallelism. Because the conventional technique performs its arithmetic processing recursively, parallel processing is difficult. It is therefore difficult for the conventional technique to perform the arithmetic processing efficiently using a distributed processing system or the like.
The present invention has been made in view of such circumstances, and an object of the present invention is to provide a technique capable of performing encoding and decoding of image data with random accessibility and parallelism.
One aspect of the present invention is an image processing apparatus that performs correction for each frame group consisting of a predetermined number of frames into which video data is divided, the apparatus including a decoding unit that obtains a corrected frame group by performing correction, using a feature amount of a first frame group, on a second frame group that is a frame group temporally continuous with the first frame group, wherein the decoding unit performs the correction so that the subjective image quality based on the relationship between the second frame group and a frame group temporally later than the second frame group is increased, and so that a predetermined classifier identifies, as being identical, a frame group in which the second frame group and the frame group temporally later than the second frame group are combined and a frame group in which the corrected frame group and a corrected frame group obtained by correcting the frame group temporally later than the second frame group are combined.
Another aspect of the present invention is the above image processing apparatus, wherein the decoding unit gives a heavier weight in the correction to the feature amount of a frame that is temporally later with respect to the second frame group.
Another aspect of the present invention is an image processing apparatus that performs correction for each frame group consisting of a predetermined number of frames into which video data is divided, the apparatus including a decoding unit that obtains a corrected frame group by performing correction on a second frame group using a feature amount of a first frame group, which is a frame group temporally earlier than and temporally continuous with the second frame group, and a feature amount of a third frame group, which is a frame group temporally later than and temporally continuous with the second frame group, wherein the decoding unit performs the correction so that the subjective image quality based on the relationship between the corrected frame group and the first frame group and the relationship between the corrected frame group and the third frame group is increased.
Another aspect of the present invention is the above image processing apparatus, wherein the decoding unit performs the correction based on parameter values updated by learning processing based on frame groups into which video data different from the video data is divided.
Another aspect of the present invention is the above image processing apparatus, wherein the learning processing includes a step of acquiring sample data consisting of at least three temporally consecutive frame groups, a step of inputting the sample data to a first learning model to obtain a feature amount of each frame group, a step of inputting the feature amounts of the frame groups to a second learning model to obtain the corrected frame groups corresponding to the frame groups, a step of calculating a loss value based on the sample data, the feature amounts of the frame groups, the corrected frame groups, and a predetermined loss function, and a step of updating the parameter values using the loss value.
Another aspect of the present invention is an image processing apparatus that performs correction for each partial data group consisting of a predetermined number of pieces of partial data into which data is divided, the apparatus including a decoding unit that obtains a corrected partial data group by performing correction, using a feature amount of a first partial data group, on a second partial data group that is a partial data group temporally continuous with the first partial data group, wherein the decoding unit performs the correction so that the subjective image quality based on the relationship between the second partial data group and a partial data group temporally later than the second partial data group is increased, and so that a predetermined classifier identifies, as being identical, a partial data group in which the second partial data group and the partial data group temporally later than the second partial data group are combined and a partial data group in which the corrected partial data group and a corrected partial data group obtained by correcting the partial data group temporally later than the second partial data group are combined.
Another aspect of the present invention is an image processing method for performing correction for each frame group consisting of a predetermined number of frames into which video data is divided, the method including a step of obtaining a corrected frame group by performing correction, using a feature amount of a first frame group, on a second frame group that is a frame group temporally continuous with the first frame group, and a step of performing the correction so that the subjective image quality based on the relationship between the second frame group and a frame group temporally later than the second frame group is increased, and so that a predetermined classifier identifies, as being identical, a frame group in which the second frame group and the frame group temporally later than the second frame group are combined and a frame group in which the corrected frame group and a corrected frame group obtained by correcting the frame group temporally later than the second frame group are combined.
Another aspect of the present invention is an image processing program for causing a computer to function as the above image processing apparatus.
According to the present invention, encoding and decoding of image data with random accessibility and parallelism can be performed.
FIG. 1 is an overall configuration diagram of a video encoding/decoding system 1 according to a first embodiment.
FIG. 2 is a configuration diagram of an encoding unit 120 of the video encoding/decoding system 1 according to the first embodiment.
FIG. 3 is a configuration diagram of a decoding unit 210 of the video encoding/decoding system 1 according to the first embodiment.
FIG. 4 is a configuration diagram of a decoding unit of a video encoding/decoding system according to a conventional technique.
FIG. 5 is a flowchart illustrating an operation of the video encoding device 10 according to the first embodiment.
FIG. 6 is a configuration diagram of a dimensional compression unit 121 of the video encoding/decoding system 1 according to the first embodiment.
FIG. 7 is a flowchart illustrating an operation of the video decoding device 20 according to the first embodiment.
FIG. 8 is a configuration diagram of a dimension expansion unit 212 of the video encoding/decoding system 1 according to the first embodiment.
FIG. 9 is a configuration diagram of a correction unit 214 of the video encoding/decoding system 1 according to the first embodiment.
FIG. 10 is a schematic diagram for explaining a learning process performed by the video encoding/decoding system 1 according to the first embodiment.
FIG. 11 is a configuration diagram of a decoding unit 210a of a video encoding/decoding system according to a second embodiment.
FIG. 12 is a flowchart illustrating an operation of a video decoding device according to the second embodiment.
FIG. 13 is a schematic diagram for explaining a learning process performed by the video encoding/decoding system according to the second embodiment.
FIG. 14 is a schematic diagram for explaining a learning process performed by a video encoding/decoding system according to the related art.
FIG. 15 is a schematic diagram for explaining a learning process performed by a video encoding/decoding system according to the related art.
<First embodiment>
Hereinafter, a first embodiment of the present invention will be described with reference to the drawings.
Hereinafter, a video encoding/decoding system 1 that encodes and decodes video data will be described. However, the system is also applicable to encoding and decoding image data other than video data.
[Configuration of video encoding/decoding system]
Hereinafter, the configuration of the video encoding/decoding system 1 will be described.
FIG. 1 is an overall configuration diagram of the video encoding/decoding system 1 (image processing apparatus) according to the first embodiment. As shown in FIG. 1, the video encoding/decoding system 1 acquires input video data to be encoded and outputs decoded video data corresponding to the input video data. The video encoding/decoding system 1 includes a video encoding device 10 and a video decoding device 20.
The video encoding device 10 includes a video dividing unit 110 and an encoding unit 120. The video dividing unit 110 acquires input video data, which consists of a plurality of temporally consecutive frames. The video dividing unit 110 generates a plurality of input frame groups by dividing the consecutive frames constituting the acquired input video data into groups of a predetermined number of frames, and outputs the generated input frame groups in order to the encoding unit 120.
FIG. 2 shows the configuration of the encoding unit 120. As shown in FIG. 2, the encoding unit 120 includes a dimensional compression unit 121 and a quantization/entropy encoding unit 122.
The dimensional compression unit 121 acquires the input frame group output from the video dividing unit 110. The dimensional compression unit 121 generates a compressed frame group by compressing the acquired input frame group so as to reduce its number of dimensions, and outputs the generated compressed frame group to the quantization/entropy encoding unit 122.
The quantization/entropy encoding unit 122 acquires the compressed frame group output from the dimensional compression unit 121 and performs quantization and entropy encoding on the values of each compressed frame constituting the acquired compressed frame group. The quantization/entropy encoding unit 122 then generates encoded data by concatenating the quantized and entropy-encoded compressed frame groups, and outputs the generated encoded data to a decoding unit 210, described later, of the video decoding device 20.
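As a rough illustration of this stage only, the following Python sketch pairs a uniform scalar quantizer with a placeholder entropy coder. The step size, the use of NumPy, and the function names are assumptions; the embodiment does not fix a particular quantizer or entropy coding method, so any lossless coder could stand in for the placeholder.

```python
import numpy as np

def quantize(compressed: np.ndarray, step: float = 1.0 / 255) -> np.ndarray:
    """Uniform scalar quantization of a compressed frame group (assumed scheme)."""
    return np.round(compressed / step).astype(np.int32)

def entropy_encode(symbols: np.ndarray) -> bytes:
    """Placeholder: a real implementation would apply a lossless entropy coder
    (e.g., arithmetic coding) to the quantized symbols."""
    return symbols.tobytes()

# Encoded data: quantized and entropy-coded compressed frame groups, concatenated.
# encoded = b"".join(entropy_encode(quantize(c)) for c in compressed_frame_groups)
```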
Returning to FIG. 1, the description will be continued.
The video decoding device 20 includes a decoding unit 210 and a video combining unit 220.
FIG. 3 shows the configuration of the decoding unit 210. As shown in FIG. 3, the decoding unit 210 includes an entropy decoding unit 211, a dimension expansion unit 212, an intermediate data memory 213, and a correction unit 214.
The entropy decoding unit 211 acquires the encoded data output from the quantization/entropy encoding unit 122 of the encoding unit 120, generates entropy-decoded data by entropy-decoding the acquired encoded data, and outputs the generated entropy-decoded data to the dimension expansion unit 212.
The dimension expansion unit 212 generates expanded decoded data by expanding the entropy-decoded data output from the entropy decoding unit 211 until it has the same number of dimensions as the above-described input frame group (before compression by the dimensional compression unit 121). The dimension expansion unit 212 outputs the generated expanded decoded data to the intermediate data memory 213 and the correction unit 214.
The intermediate data memory 213 acquires and stores the expanded decoded data output from the dimension expansion unit 212. The expanded decoded data stored in the intermediate data memory 213 is hereinafter referred to as “intermediate data”. The intermediate data is output to the correction unit 214 as needed. The intermediate data memory 213 is a volatile recording medium such as a RAM (Random Access Memory).
The correction unit 214 acquires the expanded decoded data output from the dimension expansion unit 212 and the intermediate data stored in the intermediate data memory 213. The correction unit 214 generates a decoded frame group by correcting the expanded decoded data using the intermediate data, and outputs the generated decoded frame group to the video combining unit 220.
Returning to FIG. 1, the description will be continued.
The video combining unit 220 acquires the decoded frame groups output from the decoding unit 210, generates decoded video data by combining the acquired decoded frame groups, and outputs the generated decoded video data as the final output data.
To explain the difference from the conventional technique, FIG. 4 shows the configuration of the decoding unit of a video encoding/decoding system according to the conventional technique. As shown in FIGS. 3 and 4, the difference between the configuration of the decoding unit 210 according to the first embodiment and that of the conventional decoding unit is that the decoding unit 210 according to the first embodiment includes the correction unit 214, whereas the conventional decoding unit does not.
The dimension expansion unit of the conventional decoding unit acquires the entropy-decoded data output from the entropy decoding unit, expands the number of dimensions of the acquired entropy-decoded data using the intermediate data stored in the intermediate data memory, and generates a decoded frame group.
In the decoding unit 210 according to the first embodiment, by contrast, the correction unit 214 acquires the expanded decoded data from the dimension expansion unit 212 and the intermediate data from the intermediate data memory 213, as described above, and generates a decoded frame group by correcting the expanded decoded data using the intermediate data.
[Operation of video encoding device]
Hereinafter, an example of the operation of the video encoding device 10 will be described.
FIG. 5 is a flowchart illustrating the operation of the video encoding device 10 according to the first embodiment.
The video dividing unit 110 acquires input video data S(x, y, z), where x is the horizontal direction, y is the vertical direction, and z is the time direction. The video dividing unit 110 generates a plurality of input frame groups Si(x, y, z) by dividing the acquired input video data S(x, y, z) into groups of N frames (step S101). Here, the numbers of dimensions of x, y, and z are X, Y, and Z, respectively, and i is an index representing the number of the input frame group.
The sizes of the frame groups do not necessarily have to be the same. For example, frame groups of N frames and frame groups of L frames (L being a positive number different from N) may be mixed. Also, for example, the input video data S(x, y, z) may be divided alternately into N frames and L frames so that frame groups of N frames and frame groups of L frames are generated alternately.
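As a minimal sketch of the division in step S101, the following assumes the video is held as a NumPy array of shape (X, Y, Z) and a fixed group size N; the function name is hypothetical.

```python
import numpy as np

def split_video(video: np.ndarray, n: int) -> list[np.ndarray]:
    """Split input video data S(x, y, z) along the time axis z into
    input frame groups Si of N frames each (step S101)."""
    z = video.shape[2]
    return [video[:, :, i:i + n] for i in range(0, z, n)]

# Example: a 128x128 video of 30 frames split into groups of N = 10 frames.
groups = split_video(np.zeros((128, 128, 30)), n=10)
assert len(groups) == 3 and groups[0].shape == (128, 128, 10)
```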
The dimensional compression unit 121 of the encoding unit 120 generates a compressed frame group by compressing each input frame group Si(x, y, z) to the dimensions (X', Y', N') (step S102). The dimensions (X', Y', N') satisfy X'*Y'*N' < X*Y*N.
The dimensional compression unit 121 is configured by a neural network (a combination of convolution operations, downsampling, and non-linear transformations), for example as shown in FIG. 6.
FIG. 6 is a configuration diagram of the dimensional compression unit 121 of the video encoding/decoding system 1 according to the first embodiment. As shown in FIG. 6, the dimensional compression unit 121 is composed of M layers of constituent units (first-layer unit 121a-1 to M-th-layer unit 121a-M). Each constituent unit is composed of a convolutional layer unit c1, a downsampling unit c2, and a non-linear transformation unit c3.
The convolutional layer unit c1 of the first-layer unit 121a-1 acquires the input frame group output from the video dividing unit 110, performs a convolution operation on it, and outputs the resulting frame group to the downsampling unit c2.
The downsampling unit c2 of the first-layer unit 121a-1 acquires the frame group output from the convolutional layer unit c1, compresses it so as to reduce its number of dimensions, and outputs the compressed frame group to the non-linear transformation unit c3.
The non-linear transformation unit c3 of the first-layer unit 121a-1 acquires the frame group output from the downsampling unit c2, performs a non-linear transformation on it, and outputs the transformed frame group to the convolutional layer unit c1 of the next-layer unit (second-layer unit).
By repeating the above processing from the first layer to the M-th layer, the dimensional compression unit 121 converts the input frame group input from the video dividing unit 110 into a compressed frame group with a reduced number of dimensions and outputs it to the quantization/entropy encoding unit 122.
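As one concrete reading of this structure, the PyTorch sketch below stacks M layers, each consisting of a convolution (c1), a downsampling step (c2), and a non-linear transformation (c3). The channel counts, kernel sizes, pooling-based downsampling, and the choice of ReLU are illustrative assumptions, not details fixed by the embodiment.

```python
import torch
import torch.nn as nn

class DimensionalCompressor(nn.Module):
    """Sketch of the dimensional compression unit 121: M layers of
    convolution (c1), downsampling (c2), and non-linearity (c3)."""
    def __init__(self, m_layers: int = 3, channels: int = 32):
        super().__init__()
        layers = []
        in_ch = 1
        for _ in range(m_layers):
            layers += [
                nn.Conv3d(in_ch, channels, kernel_size=3, padding=1),  # c1
                nn.AvgPool3d(kernel_size=2),                           # c2: halves N, Y, X
                nn.ReLU(),                                             # c3
            ]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, frame_group: torch.Tensor) -> torch.Tensor:
        # frame_group: (batch, 1, N, Y, X) -> compressed frame group with
        # reduced dimensions (X', Y', N').
        return self.net(frame_group)
```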
Returning to FIG. 5, the description will be continued.
The quantization/entropy encoding unit 122 of the encoding unit 120 performs quantization and entropy encoding on each compressed frame group, and then generates encoded data by concatenating the quantized and entropy-encoded compressed frame groups (step S103).
This concludes the operation of the video encoding device 10 shown in the flowchart of FIG. 5.
[Operation of video decoding device]
Hereinafter, an example of the operation of the video decoding device 20 will be described.
FIG. 7 is a flowchart illustrating the operation of the video decoding device 20 according to the first embodiment.
The entropy decoding unit 211 of the decoding unit 210 acquires the encoded data and generates entropy-decoded data by performing entropy decoding on the acquired encoded data (step S111).
The dimension expansion unit 212 of the decoding unit 210 generates expanded decoded data by restoring the generated entropy-decoded data to its original number of dimensions (before the number of dimensions was reduced by the dimensional compression unit 121) (step S112).
The dimension expansion unit 212 is configured by a neural network (a combination of deconvolution operations and non-linear transformations), for example as shown in FIG. 8.
FIG. 8 is a configuration diagram of the dimension expansion unit 212 of the video encoding/decoding system 1 according to the first embodiment. As shown in FIG. 8, the dimension expansion unit 212 is composed of M layers of constituent units (first-layer unit 212a-1 to M-th-layer unit 212a-M). Each constituent unit is composed of a deconvolution layer unit c4 and a non-linear transformation unit c5.
The deconvolution layer unit c4 of the first-layer unit 212a-1 acquires the entropy-decoded frame group output from the entropy decoding unit 211, performs a deconvolution operation on it, and outputs the resulting frame group to the non-linear transformation unit c5.
The non-linear transformation unit c5 of the first-layer unit 212a-1 acquires the frame group output from the deconvolution layer unit c4, performs a non-linear transformation on it, and outputs the transformed frame group to the deconvolution layer unit c4 of the next-layer unit (second-layer unit).
By repeating the above processing from the first layer to the M-th layer, the dimension expansion unit 212 converts the entropy-decoded frame group output from the entropy decoding unit 211 into expanded data whose number of dimensions has been restored, and outputs it to the intermediate data memory 213 and the correction unit 214.
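Mirroring the encoder sketch above, the following is one possible PyTorch reading of the M deconvolution-plus-non-linearity layers. The kernel size, stride, and channel counts are assumptions, chosen so that each layer doubles the dimensions that each compression layer halved.

```python
import torch
import torch.nn as nn

class DimensionExpander(nn.Module):
    """Sketch of the dimension expansion unit 212: M layers, each a
    deconvolution (c4) followed by a non-linear transformation (c5)."""
    def __init__(self, m_layers: int = 3, channels: int = 32):
        super().__init__()
        layers = []
        in_ch = channels
        for i in range(m_layers):
            out_ch = 1 if i == m_layers - 1 else channels
            layers += [
                nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2),  # c4
                nn.ReLU(),                                                   # c5
            ]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, entropy_decoded: torch.Tensor) -> torch.Tensor:
        # Each layer doubles N, Y, and X, undoing the reduction performed by
        # the M layers of the dimensional compression unit.
        return self.net(entropy_decoded)
```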
Returning to FIG. 7, the description will be continued.
The intermediate data memory 213 of the decoding unit 210 stores the intermediate data Mi, which is the expanded decoded data generated in step S112 (step S113).
The correction unit 214 of the decoding unit 210 corrects the expanded decoded data acquired from the dimension expansion unit 212 using the intermediate data stored in the intermediate data memory 213.
Here, the correction unit 214 corrects the expanded decoded data to be corrected using intermediate data Mi-1, that is, intermediate data stored in the intermediate data memory 213 earlier than the intermediate data corresponding to that expanded decoded data. For example, the correction unit 214 corrects the expanded decoded data corresponding to the intermediate data Mi using the intermediate data Mi-1, the intermediate data immediately preceding Mi in the time direction. Two or more pieces of intermediate data may be used for the correction.
The correction unit 214 corrects the expanded decoded data corresponding to the intermediate data Mi by combining it with the intermediate data Mi-1 in the z-direction dimension. The correction unit 214 generates the decoded frame groups by performing the above processing on all the expanded decoded data (step S114).
The reason the correction processing is performed by the correction unit 214 is as follows. Because encoding is performed for each frame group composed of frames in the time direction z, subjective continuity may not be ensured between frame groups that are temporally close or adjacent to each other. Therefore, in order to ensure continuity, the expanded decoded data is corrected using intermediate data that is temporally close or adjacent to it. Providing this continuity improves the subjective image quality of the decoded video obtained by combining the frame groups.
The video combining unit 220 generates decoded video data by combining the generated decoded frame groups (step S115).
This concludes the operation of the video decoding device 20 shown in the flowchart of FIG. 7.
The correction unit 214 is configured by a neural network (a combination of convolution operations and non-linear transformations, plus scaling processing), for example as shown in FIG. 9.
FIG. 9 is a configuration diagram of the correction unit 214 of the video encoding/decoding system 1 according to the first embodiment. As shown in FIG. 9, the correction unit 214 is composed of M layers of constituent units (first-layer unit 214a-1 to M-th-layer unit 214a-M) and a scaling unit 214b. Each constituent unit is composed of a convolutional layer unit c6 and a non-linear transformation unit c7.
The convolutional layer unit c6 of the first-layer unit 214a-1 acquires the expanded decoded data output from the dimension expansion unit 212 and the intermediate data stored in the intermediate data memory 213. The convolutional layer unit c6 performs a convolution operation on the acquired expanded decoded data and outputs the resulting frame group to the non-linear transformation unit c7.
The non-linear transformation unit c7 of the first-layer unit 214a-1 acquires the frame group output from the convolutional layer unit c6, performs a non-linear transformation on it, and outputs the transformed frame group. Data obtained by adding the frame group output from the non-linear transformation unit c7 and the temporally preceding intermediate data is input to the convolutional layer unit c6 of the next-layer unit (second-layer unit).
The correction unit 214 performs scaling, with the scaling unit 214b, on the frame group obtained by repeating the above processing from the first layer to the M-th layer. Through the above processing, the correction unit 214 corrects the expanded decoded data output from the dimension expansion unit 212 using the intermediate data stored in the intermediate data memory 213, and outputs the decoded frame group, which is the corrected expanded decoded data, to the video combining unit 220.
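The sketch below is one hedged reading of this correction network: M convolution (c6) plus non-linearity (c7) stages, the preceding intermediate data re-injected between stages, and a final scaling step (214b). The channel counts, the learnable-gain scaling, and the use of channel concatenation as a stand-in for the z-direction combination described above are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class CorrectionUnit(nn.Module):
    """Sketch of the correction unit 214 (assumed hyperparameters)."""
    def __init__(self, m_layers: int = 3, channels: int = 32):
        super().__init__()
        # First stage takes the expanded decoded data combined with Mi-1.
        self.first = nn.Sequential(nn.Conv3d(2, channels, 3, padding=1), nn.ReLU())
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU())
            for _ in range(m_layers - 1)
        ])
        self.proj = nn.Conv3d(channels, 1, 3, padding=1)  # back to one channel
        self.scale = nn.Parameter(torch.ones(1))          # scaling unit 214b

    def forward(self, expanded: torch.Tensor, prev_intermediate: torch.Tensor):
        # expanded, prev_intermediate: (batch, 1, N, Y, X).
        # Channel concatenation stands in for the z-direction combination.
        h = self.first(torch.cat([expanded, prev_intermediate], dim=1))
        for block in self.blocks:
            # Add the preceding intermediate data back in before each stage
            # (broadcast over the channel dimension).
            h = block(h + prev_intermediate)
        return self.scale * self.proj(h)
```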
[Learning process]
Hereinafter, the learning processing of the neural networks of the dimensional compression unit 121, the dimension expansion unit 212, and the correction unit 214 will be described.
The learning processing of the neural networks of the dimensional compression unit 121, the dimension expansion unit 212, and the correction unit 214 is performed simultaneously.
FIG. 10 is a schematic diagram for explaining the learning processing performed by the video encoding/decoding system 1 according to the first embodiment.
As shown in FIG. 10, first, a data set in which three temporally consecutive input frame groups form one piece of sample data is input as input data. Hereinafter, these three input frame groups are referred to, in temporal order, as S1(x,y,z), S2(x,y,z) (first frame group), and S3(x,y,z) (second frame group).
Next, processing A is performed on each of the input frame groups S1(x,y,z), S2(x,y,z), and S3(x,y,z). Processing A here consists of the dimensional compression processing, the quantization/entropy encoding processing, the entropy decoding processing, and the dimension expansion processing. Intermediate data is thereby generated for each group. Hereinafter, the intermediate data generated from the input frame groups S1(x,y,z), S2(x,y,z), and S3(x,y,z) are referred to as M1(x,y,z), M2(x,y,z) (feature amount of the first frame group), and M3(x,y,z) (feature amount of the second frame group), respectively.
Next, as shown in FIG. 10, correction is performed with M1(x,y,z) and M2(x,y,z) as one set and M2(x,y,z) and M3(x,y,z) as another. Specifically, correction is performed with the expanded decoded data corresponding to the intermediate data M1(x,y,z) and the intermediate data M2(x,y,z) as one set, and with the expanded decoded data corresponding to the intermediate data M2(x,y,z) and the intermediate data M3(x,y,z) as the other. Two decoded frame groups are thereby generated, hereinafter referred to as R2(x,y,z) and R3(x,y,z) (corrected frame groups).
Next, the loss value loss is calculated using the loss function defined by the following equations (1) to (3).
loss = restoration error 1 + restoration error 2 + GAN(concat(R2, R3)) + FM(concat(S2, S3), concat(R2, R3))   ...(1)

restoration error 1 = ΣxΣyΣz(diff(S2(x,y,z), R2(x,y,z))) + ΣxΣyΣz(diff(S3(x,y,z), R3(x,y,z)))   ...(2)

restoration error 2 = ΣxΣyΣz(w(z)*diff(M2(x,y,z), R2(x,y,z))) + ΣxΣyΣz(w(z)*diff(M3(x,y,z), R3(x,y,z)))   ...(3)
Here, diff(a, b) is a function that measures the distance between a and b (for example, the squared error). w(z) is a weight coefficient according to the time direction z, set so that the larger the index z, the heavier the weight; that is, intermediate data corresponding to an input frame group temporally later than the input frame group to be encoded is weighted more heavily in the correction. For example, w(z) = z or w(z) = z^2 is used.
concat() is an operation that concatenates its inputs in the time direction. GAN(x) is a classifier that determines whether the input video x is a true video and outputs the probability; this classifier is constructed as a neural network. FM(a, b) is the error sum (for example, the squared error) over the values of the hidden layers of that neural network when a and b are each input to the classifier.
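As a hedged sketch of equations (1) to (3), the following assumes tensors of shape (X, Y, Z), the squared error as diff, w(z) = z as the example weight, and a discriminator object whose probability output and hidden-layer `features` accessor are assumed interfaces rather than anything fixed by the embodiment.

```python
import torch
import torch.nn.functional as F

def diff(a, b):
    """diff(a, b): the squared error, one choice the text allows."""
    return (a - b) ** 2

def loss_fn(S2, S3, M2, M3, R2, R3, discriminator):
    """Sketch of equations (1)-(3); tensors are assumed to have shape (X, Y, Z)."""
    # Restoration error 1: equation (2).
    err1 = diff(S2, R2).sum() + diff(S3, R3).sum()

    # Restoration error 2: equation (3), with w(z) = z so that temporally
    # later frames are weighted more heavily.
    w = torch.arange(1, S2.shape[-1] + 1, dtype=S2.dtype).view(1, 1, -1)
    err2 = (w * diff(M2, R2)).sum() + (w * diff(M3, R3)).sum()

    # GAN term: binary cross-entropy when the corrected groups, concatenated
    # in the time direction, are presented to the classifier as "true" video.
    fake = torch.cat([R2, R3], dim=-1)
    real = torch.cat([S2, S3], dim=-1)
    p = discriminator(fake)
    gan = F.binary_cross_entropy(p, torch.ones_like(p))

    # FM term: error between the classifier's hidden-layer features for the
    # real and the corrected inputs (features() is an assumed accessor).
    fm = diff(discriminator.features(real), discriminator.features(fake)).sum()

    return err1 + err2 + gan + fm
```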
Next, using the calculated loss value, the parameter values of each unit are updated by the error backpropagation method or the like. Taking the above series of operations as one iteration, learning is performed by repeating it a fixed number of times using a plurality of pieces of sample data, or by repeating it until the loss value converges. The composition of the loss function shown in equations (1) to (3) is one example; a loss function in which only some of the above errors are calculated, or to which different error terms are added, may also be used.
As described above, the flow of the learning processing in the first embodiment is as follows; a sketch of one iteration is given after this list.
1. Prepare three consecutive input frame groups as one sample.
2. Input each sample to the neural network serving as an autoencoder (encoder/decoder) to obtain intermediate data.
3. Obtain the decoded frame groups corresponding to S2(x,y,z) and S3(x,y,z) by the neural network for correction.
4. Perform the loss calculation by adding the following values 1) to 4):
 1) The restoration error between S2(x,y,z) and R2(x,y,z) and the restoration error between S3(x,y,z) and R3(x,y,z).
 2) The weighted restoration error between M2(x,y,z) and R2(x,y,z) and the weighted restoration error between M3(x,y,z) and R3(x,y,z).
 3) The GAN error (the binary cross-entropy error when R2(x,y,z) and R3(x,y,z) are input to the neural network that performs the identification processing).
 4) The FM error (the error between the hidden-layer feature amounts when S2(x,y,z) and S3(x,y,z), and R2(x,y,z) and R3(x,y,z), are each input to the neural network that performs the identification processing).
5. Update each neural network by the error backpropagation method.
Here, the identification processing is processing for identifying whether a video based on the input video data is a true video.
The weighted restoration error of 2) is a term calculated so that a frame group is made continuous with the temporally following adjacent frame group. The GAN error of 3) and the FM error of 4) are terms calculated so that the video based on the decoded video data becomes a more natural output.
As described above, here, from three temporally consecutive input frame groups S1(x,y,z), S2(x,y,z), and S3(x,y,z), the intermediate data M1(x,y,z), M2(x,y,z), and M3(x,y,z) and the corrected frame groups R2(x,y,z) and R3(x,y,z) are generated, and learning is performed so that R2(x,y,z) + R3(x,y,z) becomes natural (that is, has continuity).
However, the configuration is not limited to one in which a data set consisting of three temporally consecutive input frame groups is input; a data set consisting of four or more temporally consecutive input frame groups may be input.
For example, from four temporally consecutive input frame groups S1(x,y,z), S2(x,y,z), S3(x,y,z), and S4(x,y,z), the intermediate data M1(x,y,z), M2(x,y,z), M3(x,y,z), and M4(x,y,z) and the corrected frame groups R2(x,y,z), R3(x,y,z), and R4(x,y,z) may be generated, and learning may be performed so that R2(x,y,z) + R3(x,y,z) + R4(x,y,z) becomes natural (that is, has continuity).
As described above, the video encoding/decoding system 1 according to the first embodiment does not decode the encoded data directly into decoded video data, but stores it in the intermediate data memory 213 as intermediate data. The video encoding/decoding system 1 then performs correction processing on the encoded data to be processed using temporally consecutive surrounding data (intermediate data) and decodes it. Continuity is thereby maintained between the temporally consecutive surrounding data and the encoded data to be processed.
Moreover, in the video encoding/decoding system 1 according to the first embodiment, the data required to decode the encoded data to be processed is only a small amount of surrounding data (in the first embodiment, only the temporally preceding intermediate data). Accordingly, the video encoding/decoding system 1 can perform encoding and decoding of image data with random accessibility and parallelism.
The video encoding/decoding system 1 according to the first embodiment also performs learning using restoration error 2, as described above. Therefore, when M2(x,y,z) shown in FIG. 10 is corrected to R2(x,y,z), for example, the constraint is such that no change occurs in the frames close to R3(x,y,z), in order to maintain the continuity between R2(x,y,z) and R3(x,y,z). That is, the constraint is such that the subjective image quality based on the relationship between S2(x,y,z) and S3(x,y,z), the input frame group temporally later than S2(x,y,z), is increased. According to the video encoding/decoding system 1, the correction is thus performed so that R2(x,y,z) is continuous with R3(x,y,z), which improves the image quality.
Furthermore, in the video encoding/decoding system 1 according to the first embodiment, the neural network serving as the autoencoder (the dimensional compression unit 121 and the dimension expansion unit 212) (first learning model) and the neural network for ensuring continuity (the correction unit 214) (second learning model) are separate neural networks whose learning processing is performed separately, so the learning processing is stable.
<Second embodiment>
Hereinafter, a second embodiment of the present invention will be described with reference to the drawings.
Hereinafter, a video encoding/decoding system according to the second embodiment will be described. The overall configuration of the video encoding/decoding system according to the second embodiment and the configuration of its encoding unit are the same as those of the video encoding/decoding system 1 according to the first embodiment described with reference to FIGS. 1 and 2, and their description is therefore omitted. The video encoding/decoding system 1 according to the first embodiment and the video encoding/decoding system according to the second embodiment described below differ in the configuration of the decoding unit of the video decoding device.
FIG. 11 shows the configuration of the decoding unit 210a of the video decoding device of the video encoding/decoding system according to the second embodiment. Functional blocks having the same functional configuration as in the first embodiment are given the same reference signs, and their description is omitted. As shown in FIG. 11, the decoding unit 210a includes an entropy decoding unit 211, a dimension expansion unit 212, an intermediate data memory 213, a correction unit 214, and a correction processing changeover switch 215.
The difference between the decoding unit 210a according to the second embodiment and the decoding unit 210 according to the first embodiment is that, in addition to the functional configuration of the decoding unit 210, the decoding unit 210a further includes the correction processing changeover switch 215.
The dimension expansion unit 212 outputs the generated expanded decoded data to the intermediate data memory 213 and the correction processing changeover switch 215.
The correction processing changeover switch 215 acquires the expanded decoded data output from the dimension expansion unit 212, and switches between outputting the acquired expanded decoded data as-is to the video combining unit as a decoded frame group and outputting it to the correction unit 214.
The correction unit 214 acquires the expanded decoded data output from the correction processing changeover switch 215 and the intermediate data stored in the intermediate data memory 213. The correction unit 214 generates a decoded frame group by correcting the expanded decoded data using the intermediate data, and outputs the generated decoded frame group to the video combining unit 220.
The operation of the video encoding device according to the second embodiment is the same as that of the video encoding device 10 according to the first embodiment described with reference to FIG. 5, and its description is therefore omitted.
[Operation of video decoding device]
Hereinafter, an example of the operation of the video decoding device according to the second embodiment will be described.
FIG. 12 is a flowchart illustrating the operation of the video decoding device according to the second embodiment.
The entropy decoding unit 211 of the decoding unit 210a acquires the encoded data and generates entropy-decoded data by performing entropy decoding on the acquired encoded data (step S211).
The dimension expansion unit 212 of the decoding unit 210a generates expanded decoded data by restoring the generated entropy-decoded data to its original number of dimensions (before the number of dimensions was reduced by the dimensional compression unit) (step S212).
The intermediate data memory 213 of the decoding unit 210a stores the intermediate data Mi, which is the expanded decoded data generated in step S212 (step S213).
The correction process changeover switch 215 of the decoding unit 210a refers to the expanded decoded data generated by the dimension expansion unit 212 and checks the value of the index i indicating the number of the input frame group. If the value of i is odd (step S214: YES), the correction process changeover switch 215 outputs the acquired expanded decoded data as-is to the video combining unit as a decoded frame group.
The video combining unit generates decoded video data by combining the generated decoded frame groups (step S216).
This concludes the operation of the video decoding device 20 shown in the flowchart of FIG. 12.
On the other hand, if the value of i is even (step S214: NO), the correction process changeover switch 215 outputs the acquired expanded decoded data to the correction unit 214 of the decoding unit 210a. The correction unit 214 corrects the expanded decoded data acquired via the correction process changeover switch 215 using the intermediate data stored in the intermediate data memory 213.
Alternatively, the correction process changeover switch 215 may be configured to output the expanded decoded data as-is to the video combining unit as a decoded frame group when the value of i is even, and to output the expanded decoded data to the correction unit 214 when the value of i is odd.
As described above, the correction process changeover switch 215 applies the correction process to every other piece of acquired expanded decoded data. The purpose of this is as follows.
In the first embodiment, subjective image quality is improved by correcting the frame group to be corrected (Mi) so as to be temporally continuous with the temporally preceding frame group (Mi-1). However, the temporally preceding frame group (Mi-1) is itself corrected based on the frame group preceding it (Mi-2). As a result, the temporally preceding frame group (Mi-1) ends up differing from the frame group that was referred to when the frame group to be corrected (Mi) was processed, so temporal continuity of the final output is not guaranteed.
In contrast, in the second embodiment, corrected frame groups and uncorrected frame groups alternate. Consequently, after a frame group to be corrected has been corrected, the frame groups before and after it do not change from the point in time at which they were referred to, so temporal continuity is ensured.
Here, the correction unit 214 corrects the expanded decoded data to be corrected (the second frame group) using the intermediate data Mi-1 (the first frame group) and the intermediate data Mi+1 (the third frame group). The intermediate data Mi-1 is intermediate data stored in the intermediate data memory 213 before the intermediate data Mi corresponding to the expanded decoded data, and the intermediate data Mi+1 is intermediate data stored in the intermediate data memory 213 after the intermediate data Mi. For example, the correction unit 214 corrects the expanded decoded data corresponding to the intermediate data Mi using the intermediate data Mi-1, which immediately precedes the intermediate data Mi in the temporal direction, and the intermediate data Mi+1, which immediately follows it. Three or more pieces of intermediate data may be used for the correction.
The correction unit 214 corrects the expanded decoded data corresponding to the intermediate data Mi by combining the intermediate data Mi-1 and the intermediate data Mi+1 with it along the z-direction dimension. The correction unit 214 generates decoded frame groups by performing the above processing on all the expanded decoded data (step S215).
The video combining unit generates decoded video data by combining the generated decoded frame groups (step S216).
This concludes the operation of the video decoding device 20 shown in the flowchart of FIG. 12.
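As a concrete illustration of the z-direction combination performed by the correction unit 214, the following sketch assumes that frame groups are held as numpy arrays of shape (x, y, z) and that correction_net is a hypothetical learned mapping that returns an array of the same shape as the frame group being corrected.

import numpy as np

def correct_frame_group(m_prev, m_cur, m_next, correction_net):
    # Combine Mi-1, Mi, and Mi+1 along the z-direction dimension (axis 2),
    # then let the learned network map the combined data to a decoded frame group.
    combined = np.concatenate([m_prev, m_cur, m_next], axis=2)
    return correction_net(combined)  # same (x, y, z) shape as m_cur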
[Learning Process]
The learning process for the neural networks of the dimensional compression unit, the dimension expansion unit, and the correction unit 214 according to the second embodiment is described below. The learning processes for these three neural networks are performed simultaneously.
FIG. 13 is a schematic diagram illustrating the learning process performed by the video encoding/decoding system according to the second embodiment.
As shown in FIG. 13, a data set in which three temporally consecutive input frame groups constitute one sample of data is first input as input data. These three input frame groups are hereinafter denoted, in temporal order, S1(x,y,z), S2(x,y,z), and S3(x,y,z).
Next, processing A is performed on each of the input frame groups S1(x,y,z), S2(x,y,z), and S3(x,y,z). As described above, processing A consists of dimensional compression, quantization/entropy encoding, entropy decoding, and dimensional expansion. Intermediate data is thereby generated for each input frame group. The intermediate data generated from S1(x,y,z), S2(x,y,z), and S3(x,y,z) is hereinafter denoted M1(x,y,z), M2(x,y,z), and M3(x,y,z), respectively.
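A minimal sketch of processing A as a composition of the four stages named above is shown below; each stage is a hypothetical placeholder callable, and only the order of operations is taken from the description.

def process_a(frame_group, dim_compress, quantize_entropy_encode,
              entropy_decode, dim_expand):
    latent = dim_compress(frame_group)      # dimensional compression
    bits = quantize_entropy_encode(latent)  # quantization / entropy encoding
    latent_hat = entropy_decode(bits)       # entropy decoding
    return dim_expand(latent_hat)           # dimensional expansion -> intermediate data M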
Next, as shown in FIG. 13, correction is performed using M1(x,y,z), M2(x,y,z), and M3(x,y,z) as a set. Specifically, the expanded decoded data corresponding to the intermediate data M2(x,y,z) is corrected using the intermediate data M1(x,y,z) and the intermediate data M3(x,y,z) as a set, thereby generating a decoded frame group. The generated decoded frame group is hereinafter denoted R2(x,y,z).
Next, a loss value loss is calculated using the loss function defined by equations (4) and (5) below.
  loss = restoration error 1
       + GAN(concat(M1, R2, M3))
       + FM(concat(S1, S2, S3), concat(M1, R2, M3))    ... (4)
  restoration error 1 = ΣxΣyΣz diff(S1(x,y,z), M1(x,y,z))
                      + ΣxΣyΣz diff(S3(x,y,z), M3(x,y,z))    ... (5)
Here, diff(a,b) is a function that measures the distance between a and b (for example, a squared error). concat() is an operation that connects its inputs in the time direction. GAN(x) is a discriminator that determines whether the input video x is a true video and outputs the probability; the discriminator is constructed as a neural network. FM(a,b) is the sum of errors (for example, squared errors) between the values of the intermediate layers of that discriminator's neural network when a and b are input to it, respectively.
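Read literally, equations (4) and (5) could be computed as in the following sketch, which assumes PyTorch tensors of shape (x, y, z) for each frame group. Here gan and features are hypothetical stand-ins for the discriminator and its intermediate-layer outputs, and the squared error is only one admissible choice of diff.

import torch

def diff(a, b):
    return ((a - b) ** 2).sum()         # one choice of distance: squared error

def concat(*groups):
    return torch.cat(groups, dim=2)     # connect the inputs in the time direction

def loss_fn(S1, S2, S3, M1, R2, M3, gan, features):
    restoration_error_1 = diff(S1, M1) + diff(S3, M3)       # equation (5)
    adversarial = gan(concat(M1, R2, M3))                   # GAN term of equation (4)
    fm = diff(features(concat(S1, S2, S3)),
              features(concat(M1, R2, M3)))                 # feature-matching term FM
    return restoration_error_1 + adversarial + fm           # equation (4)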
Next, using the calculated loss value, the parameter values of each unit are updated by backpropagation or the like. Taking the above sequence as one iteration, learning is performed by repeating it a fixed number of times using multiple samples of data, or by repeating it until the loss value converges. The composition of the loss function shown in equations (4) and (5) is one example; a loss function in which only some of the above errors are calculated, or to which different error terms are added, may also be used.
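One iteration of the learning described above might then look like the following sketch, reusing loss_fn from the previous sketch. Here process_a is assumed to have its four stages already bound, the callables correct, gan, and features are hypothetical modules, and the optimizer is assumed to hold the parameters of all units being trained simultaneously.

def train_step(S1, S2, S3, process_a, correct, gan, features,
               loss_fn, optimizer):
    M1, M2, M3 = process_a(S1), process_a(S2), process_a(S3)  # processing A
    R2 = correct(M1, M2, M3)               # correction of the middle frame group
    loss = loss_fn(S1, S2, S3, M1, R2, M3, gan, features)
    optimizer.zero_grad()
    loss.backward()                        # backpropagation of the loss value
    optimizer.step()                       # update the parameter values of each unit
    return loss.item()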
With the above configuration, the video encoding/decoding system according to the second embodiment can perform encoding and decoding with random accessibility to image data and with parallelism.
As described above, the video encoding/decoding system 1 according to the first embodiment corrects each input frame group independently. Each input is therefore corrected so as to be temporally continuous with the previous output, but how that previous output is itself corrected is unknown. Consequently, in the video encoding/decoding system 1 according to the first embodiment, it may not be possible to reliably ensure that the corrected decoded frame groups are continuous with one another.
In contrast, as explained above, the video encoding/decoding system according to the second embodiment learns so that, for frame groups whose index value is odd (or even), the expanded decoded data itself becomes the decoded frame group, and performs correction so that the remaining frame groups are continuous with them. Since the outputs before and after a frame group subject to correction do not change, the video encoding/decoding system according to the second embodiment can ensure that a corrected decoded frame group is continuous with the decoded frame groups temporally adjacent to it before and after.
Part or all of the video encoding/decoding system in the embodiments described above may be realized by a computer. In that case, it may be realized by recording a program for realizing these functions on a computer-readable recording medium, and having a computer system read and execute the program recorded on the recording medium. The term "computer system" here includes an OS and hardware such as peripheral devices. The term "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The term "computer-readable recording medium" may further include a medium that dynamically holds the program for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as the volatile memory inside a computer system serving as the server or client in that case. The program may realize only some of the functions described above, may realize the functions described above in combination with a program already recorded in the computer system, or may be realized using hardware such as a PLD (Programmable Logic Device) or an FPGA (Field Programmable Gate Array).
Embodiments of the present invention have been described above with reference to the drawings, but these embodiments are merely examples of the present invention, and the present invention is not limited to them. Accordingly, additions, omissions, substitutions, and other modifications of the components may be made without departing from the technical idea and scope of the present invention.
Reference Signs List
1    Video encoding/decoding system
10   Video encoding device
20   Video decoding device
110  Video dividing unit
120  Encoding unit
121  Dimensional compression unit
122  Entropy encoding unit
210  Decoding unit
211  Entropy decoding unit
212  Dimension expansion unit
213  Intermediate data memory
214  Correction unit
220  Video combining unit

Claims (8)

  1.  An image processing apparatus that performs correction for each frame group consisting of a predetermined number of frames into which video data is divided, the apparatus comprising:
     a decoding unit that obtains a corrected frame group by performing correction, using a feature amount of a first frame group, on a second frame group that is a frame group temporally continuous with the first frame group,
     wherein the decoding unit performs the correction so that subjective image quality based on a relationship between the second frame group and a frame group temporally later than the second frame group becomes high, and so that a frame group in which the second frame group and the frame group temporally later than the second frame group are combined, and a frame group in which the corrected frame group and a corrected frame group obtained by correcting the frame group temporally later than the second frame group are combined, are identified as being identical by a predetermined discriminator.
  2.  The image processing apparatus according to claim 1, wherein, in the correction, the decoding unit assigns a heavier weight to a feature amount the temporally later the corresponding frame is relative to the second frame group.
  3.  An image processing apparatus that performs correction for each frame group consisting of a predetermined number of frames into which video data is divided, the apparatus comprising:
     a decoding unit that obtains a corrected frame group by performing correction on a second frame group using a feature amount of a first frame group, which is a frame group temporally preceding the second frame group and temporally continuous with the second frame group, and a feature amount of a third frame group, which is a frame group temporally later than the second frame group and temporally continuous with the second frame group,
     wherein the decoding unit performs the correction so that subjective image quality based on a relationship between the corrected frame group and the first frame group and on a relationship between the corrected frame group and the third frame group becomes high.
  4.  The image processing apparatus according to claim 1 or claim 2, wherein the decoding unit performs the correction based on parameter values updated by a learning process based on frame groups into which video data different from the video data is divided.
  5.  The image processing apparatus according to claim 4, wherein the learning process comprises:
     a step of acquiring sample data consisting of at least three temporally consecutive frame groups;
     a step of inputting each of the sample data into a first learning model to obtain a feature amount of each of the frame groups;
     a step of inputting the feature amounts of the frame groups into a second learning model to obtain the corrected frame group corresponding to the frame groups;
     a step of calculating a loss value based on the sample data, the feature amounts of the frame groups, the corrected frame group, and a predetermined loss function; and
     a step of updating the parameter values using the loss value.
  6.  An image processing apparatus that performs correction for each partial data group consisting of a predetermined number of pieces of partial data into which data is divided, the apparatus comprising:
     a decoding unit that obtains a corrected partial data group by performing correction, using a feature amount of a first partial data group, on a second partial data group that is a partial data group temporally continuous with the first partial data group,
     wherein the decoding unit performs the correction so that subjective image quality based on a relationship between the second partial data group and a partial data group temporally later than the second partial data group becomes high, and so that a partial data group in which the second partial data group and the partial data group temporally later than the second partial data group are combined, and a partial data group in which the corrected partial data group and a corrected partial data group obtained by correcting the partial data group temporally later than the second partial data group are combined, are identified as being identical by a predetermined discriminator.
  7.  An image processing method for performing correction for each frame group consisting of a predetermined number of frames into which video data is divided, the method comprising:
     a step of obtaining a corrected frame group by performing correction, using a feature amount of a first frame group, on a second frame group that is a frame group temporally continuous with the first frame group; and
     a step of performing the correction so that subjective image quality based on a relationship between the second frame group and a frame group temporally later than the second frame group becomes high, and so that a frame group in which the second frame group and the frame group temporally later than the second frame group are combined, and a frame group in which the corrected frame group and a corrected frame group obtained by correcting the frame group temporally later than the second frame group are combined, are identified as being identical by a predetermined discriminator.
  8.  An image processing program for causing a computer to function as the image processing apparatus according to any one of claims 1 to 5.
PCT/JP2019/035631 2018-09-19 2019-09-11 Image processing device, image processing method, and image processing program WO2020059581A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/273,366 US11516515B2 (en) 2018-09-19 2019-09-11 Image processing apparatus, image processing method and image processing program
JP2020548382A JP7104352B2 (en) 2018-09-19 2019-09-11 Image processing device, image processing method and image processing program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-174982 2018-09-19
JP2018174982 2018-09-19

Publications (1)

Publication Number Publication Date
WO2020059581A1 true WO2020059581A1 (en) 2020-03-26

Family

ID=69887000

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/035631 WO2020059581A1 (en) 2018-09-19 2019-09-11 Image processing device, image processing method, and image processing program

Country Status (3)

Country Link
US (1) US11516515B2 (en)
JP (1) JP7104352B2 (en)
WO (1) WO2020059581A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008177648A (en) * 2007-01-16 2008-07-31 Nippon Hoso Kyokai <Nhk> Motion picture data decoding device and program

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2336269B (en) * 1997-12-08 2002-01-16 Sony Corp Encoder and encoding method
US7136508B2 (en) * 2000-11-09 2006-11-14 Minolta Co., Ltd. Image processing apparatus, method, and program for processing a moving image
US7680326B2 (en) * 2004-03-18 2010-03-16 Fujifilm Corporation Method, system, and program for correcting the image quality of a moving image
JP4440051B2 (en) * 2004-09-08 2010-03-24 キヤノン株式会社 Image encoding apparatus and method, computer program, and computer-readable storage medium
JP4867235B2 (en) * 2004-10-26 2012-02-01 ソニー株式会社 Information processing apparatus, information processing method, recording medium, and program
JP4618098B2 (en) * 2005-11-02 2011-01-26 ソニー株式会社 Image processing system
US20080002771A1 (en) * 2006-06-30 2008-01-03 Nokia Corporation Video segment motion categorization
JP4747975B2 (en) * 2006-07-14 2011-08-17 ソニー株式会社 Image processing apparatus and method, program, and recording medium
JP5643574B2 (en) * 2010-08-26 2014-12-17 キヤノン株式会社 Image processing apparatus and image processing method
US10545651B2 (en) * 2013-07-15 2020-01-28 Fox Broadcasting Company, Llc Providing bitmap image format files from media
US10679145B2 (en) * 2015-08-07 2020-06-09 Nec Corporation System and method for balancing computation with communication in parallel learning
US11586960B2 (en) * 2017-05-09 2023-02-21 Visa International Service Association Autonomous learning platform for novel feature discovery
US11528720B2 (en) * 2018-03-27 2022-12-13 Nokia Solutions And Networks Oy Method and apparatus for facilitating resource pairing using a deep Q-network
US11019355B2 (en) * 2018-04-03 2021-05-25 Electronics And Telecommunications Research Institute Inter-prediction method and apparatus using reference frame generated based on deep learning
US10798394B2 (en) * 2018-06-27 2020-10-06 Avago Technologies International Sales Pte. Limited Low complexity affine merge mode for versatile video coding
US11526953B2 (en) * 2019-06-25 2022-12-13 Iqvia Inc. Machine learning techniques for automatic evaluation of clinical trial data
CN110464326B (en) * 2019-08-19 2022-05-10 上海联影医疗科技股份有限公司 Scanning parameter recommendation method, system, device and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008177648A (en) * 2007-01-16 2008-07-31 Nippon Hoso Kyokai <Nhk> Motion picture data decoding device and program

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AKBARI, MOHAMMAD ET AL.: "Semi-Recurrent CNN-Based VAE-GAN for Sequential Data Generation", 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 13 September 2018 (2018-09-13), pages 2321-2325, XP033401002 *
MAHASSENI, BEHROOZ ET AL.: "Unsupervised Video Summarization with Adversarial LSTM Networks", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 9 November 2017 (2017-11-09), pages 2982-2991, XP033249644 *
XIE, JIANWEN ET AL.: "Synthesizing Dynamic Patterns by Spatial-Temporal Generative ConvNet", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 9 November 2017 (2017-11-09), pages 1061-1069 *

Also Published As

Publication number Publication date
JP7104352B2 (en) 2022-07-21
US11516515B2 (en) 2022-11-29
US20210344967A1 (en) 2021-11-04
JPWO2020059581A1 (en) 2021-05-20

Similar Documents

Publication Publication Date Title
US11568571B2 (en) Techniques and apparatus for lossless lifting for attribute coding
CN109451308B (en) Video compression processing method and device, electronic equipment and storage medium
WO2018068532A1 (en) Image encoding and decoding devices, image processing system, image encoding and decoding methods, and training method
US11917205B2 (en) Techniques and apparatus for scalable lifting for point-cloud attribute coding
US11671576B2 (en) Method and apparatus for inter-channel prediction and transform for point-cloud attribute coding
US11551334B2 (en) Techniques and apparatus for coarse granularity scalable lifting for point-cloud attribute coding
JP7434604B2 (en) Content-adaptive online training using image replacement in neural image compression
Ayzik et al. Deep image compression using decoder side information
WO2020261314A1 (en) Image encoding method and image decoding method
JP2019028746A (en) Network coefficient compressing device, network coefficient compressing method and program
CN111641826A (en) Method, device and system for encoding and decoding data
JP7041380B2 (en) Coding systems, learning methods, and programs
CN113747163A (en) Image coding and decoding method and compression method based on context reorganization modeling
JP2023532397A (en) Content-adaptive online training method, apparatus and computer program for post-filtering
Mahmud An improved data compression method for general data
WO2020059581A1 (en) Image processing device, image processing method, and image processing program
JP7274427B2 (en) Method and device for encoding and decoding data streams representing at least one image
US20230026190A1 (en) Signaling of coding tree unit block partitioning in neural network model compression
John Discrete cosine transform in JPEG compression
US11350134B2 (en) Encoding apparatus, image interpolating apparatus and encoding program
Siddeq et al. DCT and DST based Image Compression for 3D Reconstruction
WO2020230188A1 (en) Encoding device, encoding method and program
WO2019225337A1 (en) Encoding device, decoding device, encoding method, decoding method, encoding program and decoding program
JP2024518766A (en) Online training-based encoder tuning in neural image compression
JP6317272B2 (en) Video encoded stream generation method, video encoded stream generation apparatus, and video encoded stream generation program

Legal Events

Code  Title / Description
121   Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19861681; Country of ref document: EP; Kind code of ref document: A1)
ENP   Entry into the national phase (Ref document number: 2020548382; Country of ref document: JP; Kind code of ref document: A)
NENP  Non-entry into the national phase (Ref country code: DE)
122   Ep: pct application non-entry in european phase (Ref document number: 19861681; Country of ref document: EP; Kind code of ref document: A1)