WO2019093234A1

WO2019093234A1 - Encoding device, decoding device, encoding method, and decoding method

Info

Publication number: WO2019093234A1
Application number: PCT/JP2018/040801
Authority: WO
Inventors: アレックホジキンソン; ルカリザジオ; 遠間　正真; 西　孝啓; 安倍　清史; 龍一加納
Original assignee: パナソニックインテレクチュアルプロパティコーポレーションオブアメリカ
Priority date: 2017-11-08
Filing date: 2018-11-02
Publication date: 2019-05-16

Abstract

An encoding device (100) includes a memory (162) and a circuit (160). Using a first convolutional neural network model, the circuit (160) performs compression processing on an input image by converting the input image from an image space region to an encoded space region. Using a second convolutional neural network model, the circuit (160) extracts feature amounts used in post-processing that brings a decompressed image, obtained by compressing and decompressing the input image, closer to the input image.

Description

Encoding device, decoding device, encoding method and decoding method

The present disclosure relates to an encoding device, a decoding device, an encoding method, and a decoding method.

Conventionally, H.265, also called High Efficiency Video Coding (HEVC), exists as a standard for coding moving pictures (Non-Patent Literature 1).

In H.265, to compress an input image, the image space domain is transformed into a coding space domain using a Fourier transform such as the discrete cosine transform.

However, the encoding space region transformed by Fourier transform may not be the optimal encoding space region for performing compression on the input image.

Thus, the present disclosure provides an encoding device and the like that can perform compression of an image in which deterioration of image quality is further suppressed.

An encoding apparatus according to an aspect of the present disclosure includes a memory and a circuit accessible to the memory. Using a first convolutional neural network model, the circuit performs compression processing on an input image by converting the input image from an image space region to an encoding space region. Using a second convolutional neural network model, the circuit performs a process of extracting a feature amount used in post-processing, which is a process of bringing a decompressed image, obtained as a result of compression and decompression of the input image, closer to the input image.

Note that these general or specific aspects may be realized by a system, an apparatus, a method, an integrated circuit, a computer program, or a non-transitory recording medium such as a computer readable CD-ROM. The present invention may be realized as any combination of a system, an apparatus, a method, an integrated circuit, a computer program, and a recording medium.

The encoding device and the like in one aspect of the present disclosure can perform compression of an image with further suppressed deterioration in image quality.

FIG. 1 is a diagram showing an MS-SSIM curve of the codec architecture in the comparative example.
FIG. 2 is a block diagram showing the configuration of the image processing apparatus according to the first embodiment.
FIG. 3 is a block diagram showing an example of a configuration of a coding apparatus according to Embodiment 1.
FIG. 4 is a block diagram showing an example of the configuration of the decoding apparatus in the first embodiment.
FIG. 5 is a block diagram showing a connection configuration of the convolutional neural network in the first embodiment.
FIG. 6 is a block diagram showing an example of a specific connection configuration of the convolutional neural network according to the first embodiment.
FIG. 7 is a block diagram showing the configuration of a convolution block in the first embodiment.
FIG. 8 is a block diagram showing a configuration of a residual block in the first embodiment.
FIG. 9 is a diagram showing an experimental result of verifying the effectiveness of the image processing apparatus according to the first embodiment.
FIG. 10 is a block diagram showing an implementation example of the coding apparatus according to Embodiment 1.
FIG. 11 is a flowchart of an exemplary operation of the coding apparatus according to Embodiment 1.
FIG. 12 is a block diagram showing an implementation example of the decoding apparatus according to the first embodiment.
FIG. 13 is a flowchart showing an operation example of the decoding apparatus according to the first embodiment.
FIG. 14 is an overall configuration diagram of a content supply system for realizing content distribution service.
FIG. 15 is a diagram illustrating an example of a coding structure at the time of scalable coding.
FIG. 16 is a diagram illustrating an example of a coding structure at the time of scalable coding.
FIG. 17 is a diagram showing an example of a display screen of a web page.
FIG. 18 is a diagram showing an example of a display screen of a web page.
FIG. 19 is a diagram illustrating an example of a smartphone.
FIG. 20 is a block diagram showing a configuration example of a smartphone.

An encoding apparatus according to an aspect of the present disclosure includes a memory and a circuit accessible to the memory. Using a first convolutional neural network model, the circuit performs compression processing on an input image by converting the input image from an image space region to an encoding space region. Using a second convolutional neural network model, the circuit performs a process of extracting a feature amount used in post-processing, which is a process of bringing a decompressed image, obtained as a result of compression and decompression of the input image, closer to the input image.

Thereby, by using the first convolutional neural network model for conversion to the encoding space and the second convolutional neural network model for extracting the feature amount used in the post-processing, the encoding apparatus can perform image compression in which deterioration of image quality is further suppressed.

Also, for example, the feature amount is high frequency information included in the input image.

As described above, the encoding apparatus extracts the high-frequency information of the input image, which predominantly contains the information lost in the quantization process, as the feature amount for bringing the decompressed image closer to the input image. As a result, the post-processing can bring the decompressed image closer to the input image, so image compression in which deterioration of image quality is further suppressed can be performed.

Also, for example, each of the first convolutional neural network model and the second convolutional neural network model includes two or more convolutional blocks and one or more residual blocks. Each of the two or more convolutional blocks is a processing block including one or more convolutional layers. Each of the one or more residual blocks is a processing block that includes a convolution group made up of at least one of the two or more convolutional blocks, inputs the data input to the residual block into the convolution group included in the residual block, and adds the data input to the residual block to the data output from the convolution group.

As a result, the encoding apparatus can perform compression of an image in which the deterioration of image quality is further suppressed by using a convolutional neural network model capable of learning and inference with higher accuracy.

Also, for example, the one or more residual blocks are two or more residual blocks.

Also, for example, the two or more convolutional blocks are four or more convolutional blocks. The one or more residual blocks constitute a residual group and include at least two of the four or more convolutional blocks. Among the four or more convolutional blocks, at least one convolutional block not included in the residual group constitutes a first convolution group, and at least one convolutional block included in neither the residual group nor the first convolution group constitutes a second convolution group. Data output from the first convolution group is input to the residual group, and data output from the residual group is input to the second convolution group.

Thus, the encoding apparatus can apply more sophisticated operations to the abstracted feature of the image. Therefore, efficient processing is possible.

Also, for example, a decoding apparatus according to an aspect of the present disclosure includes a memory and a circuit accessible to the memory. Using a first convolutional neural network model, the circuit performs decompression processing on an input image by converting the input image from an encoding space region to an image space region. Using a second convolutional neural network model, the circuit performs a process of acquiring a feature amount used in post-processing, which is a process of bringing a decompressed image, obtained as a result of decompression of the input image, closer to the original image of the input image.

Thereby, by using the first convolutional neural network model for conversion to the image space and the second convolutional neural network model for acquiring the feature amount used in the post-processing, the decoding apparatus can obtain a decompressed image in which deterioration of image quality is further suppressed.

Also, for example, the circuit accessible to the memory may further use a third convolutional neural network model to perform, as the post-processing, processing that uses the feature amount acquired using the second convolutional neural network model to bring the decompressed image obtained using the first convolutional neural network model closer to the original image.

As a result, processing can be performed to bring the decompressed image closer to the input image in post-processing, so it is possible to obtain a decompressed image in which the deterioration of the image quality is further suppressed.

In addition, for example, an encoding method according to an aspect of the present disclosure performs compression processing on an input image by converting the input image from an image space region to an encoding space region using a first convolutional neural network model, and uses a second convolutional neural network model to extract a feature amount used in post-processing, which is a process of bringing a decompressed image, obtained as a result of compression and decompression of the input image, closer to the input image.

With this encoding method, using the first convolutional neural network model for conversion to the encoding space and the second convolutional neural network model for extracting the feature amount used in post-processing makes it possible to perform image compression in which deterioration of image quality is further suppressed.

Also, for example, a decoding method according to an aspect of the present disclosure performs decompression processing on an input image by converting the input image from an encoding space region to an image space region using a first convolutional neural network model, and uses a second convolutional neural network model to acquire a feature amount used in post-processing, which is a process of bringing a decompressed image, obtained as a result of decompression of the input image, closer to the original image of the input image.

With this decoding method, using the first convolutional neural network model for conversion to the image space and the second convolutional neural network model for acquiring the feature amount used in post-processing makes it possible to obtain a decompressed image in which deterioration of image quality is further suppressed.

Furthermore, these general or specific aspects may be realized by a system, an apparatus, a method, an integrated circuit, a computer program, or a non-transitory recording medium such as a computer readable CD-ROM. The present invention may be realized as any combination of a system, an apparatus, a method, an integrated circuit, a computer program, and a recording medium.

Embodiments will be specifically described below with reference to the drawings.

The embodiments described below all show general or specific examples. Numerical values, shapes, materials, components, arrangement positions and connection forms of components, steps, order of steps, and the like shown in the following embodiments are merely examples, and are not intended to limit the scope of the claims. Further, among the components in the following embodiments, components not recited in the independent claims, which represent the broadest concept, are described as optional components.

Embodiment 1
First, an outline of Embodiment 1 will be described as an example of an image processing apparatus to which the processing and/or configuration described later in each aspect of the present disclosure can be applied. However, Embodiment 1 is merely one example of an image processing apparatus, encoding apparatus, or decoding apparatus to which the processing and/or configuration described in each aspect of the present disclosure can be applied; the processing and/or configuration described in each aspect of the present disclosure can also be implemented in an image processing apparatus, encoding apparatus, or decoding apparatus different from that of Embodiment 1.

When the processing and/or configuration described in each aspect of the present disclosure is applied to Embodiment 1, for example, any of the following may be performed.

(1) With respect to the image processing apparatus, the encoding apparatus, or the decoding apparatus according to Embodiment 1, replacing, among the plurality of components constituting the apparatus, a component corresponding to a component described in each aspect of the present disclosure with the component described in each aspect of the present disclosure.

(2) With respect to the image processing apparatus, the encoding apparatus, or the decoding apparatus according to Embodiment 1, making an arbitrary change, such as addition, replacement, or deletion of a function or a process, to some of the plurality of components constituting the apparatus, and then replacing a component corresponding to a component described in each aspect of the present disclosure with the component described in each aspect of the present disclosure.

(3) With respect to a method implemented by the image processing apparatus, the encoding apparatus, or the decoding apparatus according to Embodiment 1, making an arbitrary change, such as addition of a process and/or replacement or deletion of some of the plurality of processes included in the method, and then replacing a process corresponding to a process described in each aspect of the present disclosure with the process described in each aspect of the present disclosure.

(4) Implementing some of the plurality of components constituting the image processing apparatus, the encoding apparatus, or the decoding apparatus according to Embodiment 1 in combination with a component described in each aspect of the present disclosure, a component having part of the functions of a component described in each aspect of the present disclosure, or a component that performs part of the processing performed by a component described in each aspect of the present disclosure.

(5) Implementing a component having part of the functions of some of the plurality of components constituting the image processing apparatus, the encoding apparatus, or the decoding apparatus according to Embodiment 1, or a component that performs part of the processing performed by some of those components, in combination with a component described in each aspect of the present disclosure, a component having part of the functions of a component described in each aspect of the present disclosure, or a component that performs part of the processing performed by a component described in each aspect of the present disclosure.

(6) With respect to a method implemented by the image processing apparatus, the encoding apparatus, or the decoding apparatus according to Embodiment 1, replacing, among the plurality of processes included in the method, a process corresponding to a process described in each aspect of the present disclosure with the process described in each aspect of the present disclosure.

(7) Implementing some of the plurality of processes included in a method implemented by the image processing apparatus, the encoding apparatus, or the decoding apparatus according to Embodiment 1 in combination with a process described in each aspect of the present disclosure.

Note that the manner of implementing the processing and/or configuration described in each aspect of the present disclosure is not limited to the above examples. For example, the processing and/or configuration may be implemented in an apparatus used for a purpose different from the moving picture/image coding apparatus or the moving picture/image decoding apparatus disclosed in Embodiment 1, or the processing and/or configuration described in each aspect may be implemented alone. Also, the processing and/or configurations described in different aspects may be implemented in combination.

[Overview of image processing apparatus]
Currently, images and video account for over 70% of the media consumed online, and image and video compression is becoming increasingly important. Conventional codecs do not tailor compression and/or decompression to individual images but apply them on a "one size fits all" basis. It is also impractical to adapt the techniques used in conventional codecs to individual images. For this reason, it has been proposed to use deep learning techniques in codecs so that compression and/or decompression is tailored to individual images.

Also, conventional codecs have a number of areas that can readily be improved. For example, the codecs used in H.265 or BPG (Better Portable Graphics) use sophisticated encoding, decoding, and pipeline processing. However, even sophisticated pipeline processing relies on linear transformations for filtering, feature extraction, prediction, and the like, and is therefore subject to the limitations of linear transformations. Deep neural networks (DNNs), on the other hand, are inherently non-linear functions. Moreover, since a neural network can serve as a universal function approximator, replacing part or all of the pipeline used in a codec with a neural network can remove the restrictions imposed by linear transformations.

Therefore, in the present embodiment, a convolutional neural network (CNN: Convolutional Neural Network) is applied to image compression.

To apply a convolutional neural network (CNN) to image compression, the processing must not only run relatively fast but also achieve a compression rate at least equal to that of conventional codecs.

FIG. 1 is a diagram showing an MS-SSIM curve of the codec architecture in the comparative example. In FIG. 1, the vertical axis indicates MS-SSIM (multi-scale structural similarity) for RGB, and the horizontal axis indicates the compression ratio (bits per pixel). Further, in FIG. 1, the H.265 architecture corresponds to a conventional codec, and WaveOne denotes an architecture that uses a convolutional neural network (CNN). As shown in FIG. 1, the architecture using a convolutional neural network (CNN) achieves performance exceeding that of conventional codecs.

As a method of applying a convolutional neural network (CNN) to image compression, a method of using an auto-encoder for learning a mapping from an image space area to an encoding space area for an input image has been proposed first. Subsequently, it has been proposed to quantize the input image mapped to the coding space region and learn the mapping to the image space region.

In recent years, convolutional neural networks (CNN) have been used to achieve state-of-the-art performance in many vision tasks, ranging from semantic segmentation to image classification to compression. This performance is achieved by having a convolutional neural network (CNN) learn the functions suited to the task.

From the above, it can be seen that the application of convolutional neural networks (CNN) to image compression has the potential to solve the drawbacks present in conventional codecs.

Conventional codecs transform an input image from an image space domain to a coding space domain using a Fourier transform such as the discrete cosine transform. However, although the Fourier transform provides many good properties for a codec, the encoding space region obtained by the Fourier transform may not be the optimal encoding space region for compressing the input image.
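As a rough illustration of the conventional, fixed transform mentioned above, the following minimal sketch applies an orthonormal one-dimensional DCT-II to a single row of pixel values and then inverts it. The function names `dct_ii` and `idct_ii` and the sample row are assumptions made for this example only; real codecs use two-dimensional block transforms with integer approximations.

```python
import math

def dct_ii(x):
    """Orthonormal DCT-II of a 1-D sequence (image space -> coding space)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        out.append(scale * acc)
    return out

def idct_ii(coeffs):
    """Inverse of dct_ii (coding space -> image space)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            acc += scale * coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
        out.append(acc)
    return out

# Hypothetical row of eight pixel values, used only for illustration.
row = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]
coeffs = dct_ii(row)        # fixed, hand-designed coding space
restored = idct_ii(coeffs)  # lossless until the coefficients are quantized
```

The basis functions here are fixed in advance and do not depend on the image content, which is the property the learned transform of the present embodiment is meant to relax.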

On the other hand, by using convolutional neural networks, the image processing apparatus according to the present embodiment can perform image compression in which deterioration of image quality is further suppressed, and can obtain a decompressed image in which deterioration of image quality is further suppressed.

The image processing apparatus according to the present embodiment performs compression processing or decompression processing of an image using two convolutional neural networks. More specifically, the image processing apparatus uses a convolutional neural network model for performing a compression process and a convolutional neural network model for performing a process of extracting feature quantities used in post-processing. In addition, the image processing apparatus uses a convolutional neural network model for performing decompression processing and a convolutional neural network model for performing processing for acquiring feature amounts used in post-processing.

The image processing apparatus may include an encoding apparatus and a decoding apparatus. The encoding device encodes an image. That is, the encoding apparatus compresses the original image (input image) to output a compressed image which is a result of the compression on the original image. The decoding device decodes the encoded image. That is, the decoding apparatus performs decompression on the compressed image that is the result of compression on the original image, thereby outputting a decompressed image that is the result of decompression on the compressed image.

[Specific example of image processing apparatus]
FIG. 2 is a block diagram showing an example of the configuration of the image processing apparatus 10 according to the present embodiment. FIG. 3 is a block diagram showing an example of a configuration of coding apparatus 100 in the first embodiment. FIG. 4 is a block diagram showing an example of a configuration of decoding apparatus 200 in the first embodiment. In FIGS. 3 and 4, the same elements as in FIG. 2 are denoted by the same reference numerals.

The image processing apparatus 10 illustrated in FIG. 2 includes an image coding unit 101, a post-processing feature extraction unit 102, a quantum unit 103, an entropy coding unit 104, a storage unit 105, an image decoding unit 106, a unit that acquires the feature amounts used in post-processing, and a post-processing unit 108. The image processing apparatus 10 may include the encoding apparatus 100 shown in FIG. 3 and the decoding apparatus 200 shown in FIG. 4.

The image encoding unit 101 transforms an input image from an image space region to an encoding space region using a first convolutional neural network model.

Here, the first convolutional neural network model is trained to perform conversion into a coding space region that is optimal for image compression. The first convolutional neural network model includes two or more convolutional blocks. The first convolutional neural network model also includes one or more residual blocks.

Hereinafter, the first convolutional neural network model in the present embodiment will be described.

FIG. 5 is a block diagram showing a connection configuration of convolutional neural network 300 in the first embodiment. FIG. 6 is a block diagram showing an example of a specific connection configuration of convolutional neural network 300 in the first embodiment. FIG. 7 is a block diagram showing a configuration of convolution block 310 in the first embodiment. FIG. 8 is a block diagram showing a configuration of residual block 320 in the first embodiment.

For example, as shown in FIG. 5, the first convolutional neural network model includes one or more convolution blocks 310, one or more residual blocks 320 that follow the one or more convolution blocks 310, and one or more convolution blocks 330 that follow the one or more residual blocks 320. The configuration of the first convolutional neural network model is not limited to the configuration of the convolutional neural network 300 shown in FIG. 5. The one or more convolution blocks and the one or more residual blocks may be configured in any way. The convolution blocks may be four or more convolution blocks, and the one or more residual blocks may constitute a residual group and include at least two of the four or more convolution blocks. In this case, among the four or more convolution blocks, at least one convolution block not included in the residual group constitutes a first convolution group, and at least one convolution block included in neither the residual group nor the first convolution group constitutes a second convolution group. Data output from the first convolution group is input to the residual group, and data output from the residual group is input to the second convolution group.

For example, the first convolutional neural network model may be the convolutional neural network 300 shown in FIG. 6. That is, the first convolutional neural network model includes, for example, two convolution blocks 310 constituting the first convolution group, two residual blocks 320 constituting the residual group, and two convolution blocks 330 constituting the second convolution group. Here, the two convolution blocks 310 constituting the first convolution group are connected in series, and the two convolution blocks 330 constituting the second convolution group are connected in series. The two residual blocks 320 constituting the residual group are also connected in series.

Each of the convolution blocks 310 and 330 downsamples its input data by a factor of two. The residual blocks 320 add capacity to the convolutional neural network model while keeping the receptive field the same size as that of the convolution blocks 310 and 330. Thereby, the first convolutional neural network model can provide 16-times downsampling of the input image. Thus, the first convolutional neural network model must learn a strong representation of the input image and have a latent space with high information density, while reducing the receptive field to 1/16. Having a latent space with high information density removes the need to worry about collisions anywhere in the latent space. Also, a convolutional neural network model with a high-information-density latent space can reconstruct arbitrary images from the latent space. This means that even with quantization, graceful degradation can be maintained while the introduction of distortion is suppressed.
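The 16-times downsampling can be checked with simple arithmetic. The sketch below traces one spatial dimension through the six-block layout of FIG. 6, assuming a kernel size of 3, a stride of 2, and a zero padding of 1 in every convolution block 310 and 330, and assuming that the residual blocks 320 leave the resolution unchanged. The padding value, the function names, and the 256-pixel input are assumptions for this example and are not stated in the text.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of one strided convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_sizes(size, layout):
    """Follow one spatial dimension through a list of block kinds."""
    sizes = [size]
    for kind in layout:
        if kind == "conv":        # convolution block 310 / 330: stride-2 downsampling
            size = conv_out_size(size)
        elif kind == "residual":  # residual block 320: resolution assumed unchanged
            pass
        else:
            raise ValueError(f"unknown block kind: {kind}")
        sizes.append(size)
    return sizes

# First convolution group -> residual group -> second convolution group.
LAYOUT = ["conv", "conv", "residual", "residual", "conv", "conv"]
sizes = trace_sizes(256, LAYOUT)
factor = sizes[0] // sizes[-1]
```

With four stride-2 convolution blocks, each dimension is halved four times, so a 256-pixel side becomes 16 pixels and `factor` evaluates to 16.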

Hereinafter, details of the convolution block 310 and the residual block 320 will be described.

The convolution block 310 is a processing block including one or more convolutional layers, and includes two or more convolutional layers 311, a non-linear activation function 312, and a normalization layer 313, as shown in FIG. 7, for example. Note that, since the convolution block 330 and the convolution block 322 are also the same, the convolution block 310 will be described as an example here. In the example illustrated in FIG. 7, data input to the convolution block 310 is output from the convolution block 310 via the convolution layer 311, the non-linear activation function 312, and the normalization layer 313. The convolution layer 311 is a processing layer that performs a convolution operation on the data input to the convolution block 310 and outputs the result of the convolution operation. The convolution layer 311 is configured by, for example, 32 filters with a kernel size of 3 and stride 2. The nonlinear activation function 312 is a function that outputs an operation result using data output from the convolution layer 311 as an argument. For example, the non-linear activation function 312 controls the output of the non-linear activation function 312 according to the bias. The normalization layer 313 normalizes the data output from the non-linear activation function 312 and outputs normalized data in order to suppress data bias. In the present embodiment, the normalization layer 313 normalizes data output from the nonlinear activation function 312 using Batch Normalization that smoothes data values.
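The order of operations in the convolution block (convolution, then non-linear activation, then normalization) can be sketched as follows. This is a simplified illustration rather than the embodiment itself: it uses a single filter on a single channel instead of 32 filters, assumes ReLU as the activation and a zero padding of 1, and normalizes over one feature map instead of using batch statistics with learned scale and shift. All function names and the example weights are hypothetical.

```python
import math

def conv2d(image, kernel, stride=2, padding=1):
    """Single-channel 2-D convolution (cross-correlation) with zero padding."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += kernel[i][j] * padded[r * stride + i][c * stride + j]
            row.append(acc)
        out.append(row)
    return out

def relu(fmap):
    """Assumed non-linear activation: clamp negative values to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def normalize(fmap, eps=1e-5):
    """Zero-mean normalization over one feature map (batch-norm-like)."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in fmap]

def conv_block(image, kernel):
    """Convolution layer -> activation -> normalization, in that order."""
    return normalize(relu(conv2d(image, kernel)))

image = [[float((r * 8 + c) % 7) for c in range(8)] for r in range(8)]
kernel = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]  # arbitrary weights
features = conv_block(image, kernel)
```

An 8x8 input yields a 4x4 feature map here, which matches the factor-of-two reduction per block that a stride of 2 implies.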

The residual block 320 is a processing block that includes a convolution group including two or more convolution layers 311 of at least one of the two or more convolution blocks 310 and 330 described above. The residual block 320 inputs the data input to it into the convolution group and adds that input data to the data output from the convolution group. In the present embodiment, the residual block 320 includes two convolution blocks 322 connected in series, as shown for example in FIG. 8. For example, data input to the residual block 320 is input to one convolution block 322 (i.e., the left convolution block 322 in FIG. 8). Then, data output from that convolution block 322 is input to the other convolution block 322 (i.e., the right convolution block 322 in FIG. 8). The convolution block 322 has the same configuration as the convolution block 310 or 330.

Also, data input to the residual block 320 is added to data output from the right convolution block 322 and output from the residual block 320. That is, the data input to the residual block 320 and the data output from the right convolution block 322 are summed and output from the residual block 320.

Here, two convolutional blocks 322 are connected in series as the residual group 321, but three or more convolutional blocks 322 may be connected in series.
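The data flow of the residual block, in which the input passes through the serially connected convolution group and is then added to that group's output, can be sketched as below. The helper `make_block` is a hypothetical stand-in for a convolution block 322 (a scale followed by ReLU) chosen so that the arithmetic is easy to follow; it is not the convolution block of the embodiment. The sketch also assumes that the convolution group preserves the data shape, since otherwise the addition would not be defined.

```python
def residual_block(x, conv_group):
    """Feed x through the convolution group, then add x back (skip connection)."""
    y = x
    for block in conv_group:  # blocks connected in series
        y = block(y)
    if len(y) != len(x):
        raise ValueError("convolution group must preserve the data shape")
    return [xi + yi for xi, yi in zip(x, y)]

def make_block(weight):
    """Hypothetical stand-in for convolution block 322: scale, then ReLU."""
    return lambda data: [max(0.0, weight * v) for v in data]

group = [make_block(0.5), make_block(2.0)]  # two blocks in series, as in FIG. 8
out = residual_block([1.0, -2.0, 3.0], group)
```

Because the input is added back unchanged, a convolution group that outputs zeros makes the whole block an identity mapping, so the group only has to learn a correction to its input.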

Note that, instead of using the first convolutional neural network model, the image encoding unit 101 may use a conventional method, that is, a Fourier transform such as the discrete cosine transform, to transform an input image from an image space region to an encoding space region.

Hereinafter, the description returns to FIG. 2.

The post-processing feature extraction unit 102 performs processing for extracting feature quantities used in post-processing using the second convolutional neural network model. Here, the feature amount is high frequency information included in the input image. Post-processing is processing for bringing a decompressed image, which is the result of compression and decompression on an input image, closer to the input image.
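To give an intuition for what "high frequency information" means here, the sketch below splits a one-dimensional signal into a blurred (low-frequency) part and the remaining detail (high-frequency) part. This hand-made filter is only a stand-in: in the embodiment the extraction is learned by the second convolutional neural network model, and the function names, the moving-average filter, and the sample signal are all assumptions for illustration.

```python
def low_pass(signal, radius=1):
    """Moving-average blur with edge clamping."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out

def high_frequency(signal, radius=1):
    """Detail that the blur removes: signal minus its low-pass version."""
    blurred = low_pass(signal, radius)
    return [s - b for s, b in zip(signal, blurred)]

signal = [10.0, 10.0, 10.0, 40.0, 40.0, 40.0]  # a sharp edge
coarse = low_pass(signal)                      # stands in for a lossy reconstruction
detail = high_frequency(signal)                # non-zero only around the edge
restored = [c + d for c, d in zip(coarse, detail)]
```

Adding the detail back to the coarse version recovers the original signal, which mirrors the role the extracted feature amounts play in bringing the decompressed image closer to the input image.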

The second convolutional neural network model is trained to perform processing for extracting feature quantities used in post-processing. The second convolutional neural network model includes two or more convolutional blocks. The second convolutional neural network model also includes one or more residual blocks. That is, the configuration of the second convolutional neural network model may be the same as that of the first convolutional neural network model, or may be different. The configuration of the first convolutional neural network model is as described above, and its description is therefore omitted.

The quantum unit 103 quantizes the data output from the image encoding unit 101 and inversely quantizes the quantized data. The quantum unit 103 includes, for example, the quantization unit 103A or the inverse quantization unit 103B illustrated in FIG.

The quantization unit 103A quantizes the data output from the image coding unit 101.

Incidentally, since many of the architectures proposed in the comparative example shown in FIG. 1 use coarse quantization steps, much information is lost. In contrast, the quantization unit 103A of the present embodiment is composed of, for example, a quantizer that controls the granularity according to (Expression 1). As a result, not only can smooth quantization be performed, but errors in reconstruction (inverse quantization) can also be suppressed.

Furthermore, in the quantization unit 103A of the present embodiment, the quantizer that controls the granularity according to (Expression 1) rounds to the nearest value instead of rounding up. As a result, the quantized representation will not be full bits, but will lose half of the bits. Although the quantization unit 103A of the present embodiment can perform quantization more smoothly, it does not use vector quantization or perceptual metrics. If perceptual metrics are not used, a large distortion may be introduced at the time of reconstruction (during inverse quantization). That is, the quantization unit 103A of the present embodiment leaves room for further improvement because it has no vector quantization or perceptual metric function.

The inverse quantization unit 103B performs inverse quantization on the compressed image (input image) decoded by the entropy decoding unit 104B. Specifically, the inverse quantization unit 103B inversely quantizes the data quantized by the quantization unit 103A, that is, the compressed image (input image) decoded by the entropy decoding unit 104B. Like the quantization unit 103A, the inverse quantization unit 103B may be a dequantizer that rounds to the nearest value instead of rounding up and controls the granularity according to (Expression 1).
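The effect of rounding to the nearest value instead of rounding up can be illustrated with a minimal Python sketch. (Expression 1) is not reproduced in this section, so the sketch assumes a plain uniform scalar quantizer with a single step size. The function names, the step value, and the sample values are all hypothetical.

```python
import math
from typing import Callable, List


def quantize(values: List[float], step: float,
             to_int: Callable[[float], int]) -> List[int]:
    """Map each value to an integer index on a grid of width `step`."""
    return [to_int(v / step) for v in values]


def dequantize(indices: List[int], step: float) -> List[float]:
    """Reconstruct approximate values from the integer indices."""
    return [i * step for i in indices]


def round_nearest(x: float) -> int:
    # Round half up; Python's built-in round() rounds halves to even.
    return math.floor(x + 0.5)


def max_error(values: List[float], step: float,
              to_int: Callable[[float], int]) -> float:
    recon = dequantize(quantize(values, step, to_int), step)
    return max(abs(v - r) for v, r in zip(values, recon))


samples = [0.1, 0.6, 1.3, 2.45, -0.7]  # assumed sample data
step = 0.5                             # assumed step size
err_round = max_error(samples, step, round_nearest)
err_ceil = max_error(samples, step, math.ceil)
print(err_round, err_ceil)
```

With rounding to the nearest grid point, the reconstruction error never exceeds half a step, whereas rounding up can leave an error of almost a full step.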

The entropy coding unit 104 performs compression and decompression processing on the input image. The entropy coding unit 104 includes, for example, the entropy coding unit 104A or the entropy decoding unit 104B shown in FIG.

The entropy coding unit 104A entropy codes the data output from the quantization unit 103A. The entropy coding unit 104A according to the present embodiment performs adaptive binary arithmetic coding suitable for learning expression in order to remove all redundancy from the quantized expression.

More specifically, the entropy coding unit 104A acquires, as the context of the pixel to be encoded, all pixels of the quantized representation that precede the pixel to be encoded. The entropy coding unit 104A creates a histogram of all the preceding pixels acquired as the context. This histogram is used as a probability table by the entropy coding unit 104A. Although coding by this method is simple, it has the same function as classical arithmetic coding or an entropy coder using deep learning. That is, although the coding according to this method is simpler than the CABAC used in H.264/H.265, it yields fully usable results.
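The running-histogram probability table can be sketched in Python as follows. The sketch does not implement an arithmetic coder. It only accumulates the ideal code length, the sum of -log2 p over the symbols, that an arithmetic coder driven by such a table would approach. The function name `adaptive_code_length` and the add-one smoothing are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def adaptive_code_length(symbols: Sequence[int], alphabet_size: int) -> float:
    """Ideal coding cost in bits under a running-histogram model."""
    counts: Counter = Counter()
    seen = 0
    bits = 0.0
    for s in symbols:
        # Probability table built only from previously coded symbols.
        # Add-one smoothing (an assumption) keeps unseen symbols codable.
        p = (counts[s] + 1) / (seen + alphabet_size)
        bits += -math.log2(p)
        counts[s] += 1  # update the histogram after coding the symbol
        seen += 1
    return bits


skewed = [0] * 30 + [1] * 2  # highly redundant sequence
balanced = [0, 1] * 16       # same length, little redundancy
skewed_bits = adaptive_code_length(skewed, 2)
balanced_bits = adaptive_code_length(balanced, 2)
print(skewed_bits, balanced_bits)
```

A decoder can rebuild the same histogram from the symbols it has already decoded, so no probability table needs to be transmitted.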

The entropy coding unit 104A may entropy code the data output from the quantization unit 103A and the feature quantity extracted by the post-processing feature extraction unit 102.

The entropy decoding unit 104B entropy decodes the compressed image (input image). Specifically, the entropy decoding unit 104B entropy decodes the compressed image (input image) using adaptive binary arithmetic coding. The detailed method is the same as the entropy coding, so the description is omitted. When the compressed image and the feature amount are input from the storage unit 105 as the data entropy-coded by the entropy coding unit 104, the entropy decoding unit 104B entropy decodes the compressed image (input image).

The storage unit 105 stores the data entropy-coded by the entropy coding unit 104. The storage unit 105 also outputs the stored entropy-coded data to the entropy decoding unit 104B.

The image decoding unit 106 transforms the input image from the encoding space region to the image space region using the first convolutional neural network model.

The first convolutional neural network model here is subjected to learning for conversion into an image space area optimal for image decompression. Note that the first convolutional neural network model here also includes two or more convolutional blocks. Also, the first convolutional neural network model includes one or more residual blocks. That is, the configuration of the first convolutional neural network model used by the image decoding unit 106 is the same as the configuration of the first convolutional neural network model used by the image encoding unit 101. Therefore, since the configuration of the first convolutional neural network model is as described above, the description will be omitted.

It should be noted that the image decoding unit 106 may convert (inversely convert) the input image from the encoding space region to the image space region using a conventional method, that is, a Fourier transform such as the discrete cosine transform, instead of using the first convolutional neural network model.

The post-processing feature acquisition unit 107 uses the second convolutional neural network model to perform processing for acquiring feature quantities used in post-processing. Here, the feature amount is high frequency information included in the input image. The post-processing is processing for bringing a decompressed image, which is a result of decompression on an input image, closer to an original image of the input image. The configuration of the second convolutional neural network model is the same as the configuration of the second convolutional neural network model used by the post-processing feature extraction unit 102. Therefore, the configuration of the second convolutional neural network model is as described above, and thus the description thereof is omitted.

When the compressed image and the feature amount are input to the entropy decoding unit 104B as the entropy-coded data, the post-processing feature acquisition unit 107 acquires the feature amount by extracting it from the expanded (decompressed) entropy-coded data. When only the compressed image is input to the entropy decoding unit 104B as the entropy-coded data, the post-processing feature acquisition unit 107 may acquire the feature amount extracted by the post-processing feature extraction unit 102.

The post-processing unit 108 performs a process for bringing the decompressed image closer to the input image, and outputs the decompressed image subjected to the process as an output image. The post-processing unit 108 performs post-processing using the third convolutional neural network model. Here, post-processing is processing for bringing a decompressed image obtained using the first convolutional neural network model closer to the original image, using feature amounts acquired using the second convolutional neural network model. In other words, using the third convolutional neural network model, the post-processing unit 108 performs processing that brings the decompressed image obtained by the image decoding unit 106 closer to the original image, using the feature amount acquired by the post-processing feature acquisition unit 107.

Here, the third convolutional neural network model is trained to bring the decompressed image closer to the original image. The third convolutional neural network model consists of a series of convolutional blocks that maintain a constant receptive field. More specifically, the third convolutional neural network model includes two or more convolutional blocks. The third convolutional neural network model also includes one or more residual blocks. That is, the configuration of the third convolutional neural network model may be the same as the configuration of the first and second convolutional neural network models. The configuration of the first convolutional neural network model and the like is as described above, and thus the description thereof is omitted.

Thus, the post-processing unit 108 can improve the quality of the image using the third convolutional neural network model. As a result, it is possible to cause the image encoding unit 101 or the like to perform aggressive image compression while maintaining the value of MS-SSIM.

The post-processing unit 108 causes the post-processing feature acquisition unit 107 to acquire high-frequency information extracted from the original image in order to improve the image quality. As described above, the high frequency information extracted from the original image is entropy coded together with the quantized image, and is entropy decoded in the entropy decoding unit 104B. The entropy decoded high frequency information may be further decoded into the image space by the post-processing feature acquisition unit 107 or the entropy decoding unit 104B.

The post-processing unit 108 incorporates the high frequency information, decoded into the image space and acquired by the post-processing feature acquisition unit 107, into the decompressed image converted from the encoding space region to the image space region by the image decoding unit 106. This makes it possible to reintroduce the details lost due to quantization into the decompressed image, so that a higher MS-SSIM can be obtained.

In the present embodiment, the first to third convolutional neural networks are described as having residual blocks connected in a residual manner, but the present invention is not limited to this. Other architectures may be applied.

For example, a feedback structure may be applied, such as a Recurrent Neural Network or a Recursive Neural Network. In particular, the output of one or more convolutional blocks may be used as the input of the one or more convolutional blocks. The residual connection may then be used in the reverse direction.

[Effects of image processing apparatus etc.]
According to the image processing apparatus 10 of the present embodiment, it is possible, using a convolutional neural network, to perform compression of an image in which deterioration of image quality is further suppressed, and to obtain a decompressed image in which deterioration of quality is further suppressed.

Incidentally, the codec architectures in the comparative example shown in FIG. 1 tend to rely on intra-frame prediction to improve compression. Of the codec architectures shown in FIG. 1, the JPEG standard uses no prediction, whereas BPG uses 35 different "direction modes" to predict the current pixel block to be coded. In contrast, a convolutional neural network (CNN) can not only push the different directional modes used for intra-frame prediction to their limit, but can also learn complex dependencies on various scales. Thus, using a convolutional neural network (CNN) may overcome the shortcomings present in conventional codecs and allow conversion into an encoding space region that is optimal for compression of the input image. That is, a convolutional neural network (CNN) operates at the pixel level in its initial layers and at the global scale in its deeper layers. Thanks to this multi-scale aspect, it can learn not only dependencies between "predicted blocks" but also pixel-level and global dependencies.

Therefore, according to the present embodiment, by using the convolutional neural network, it is possible to realize the image processing apparatus 10 that can compress an image with deterioration of image quality further suppressed and can obtain a decompressed image with deterioration of quality further suppressed.

Although convolutional neural networks (CNN) are very powerful, they have some drawbacks. One of the disadvantages of convolutional neural networks is that it takes a long time to set various hyperparameters. This is because the models or network architectures that make up the convolutional neural network are very sensitive to certain hyperparameters, so it takes time to converge.

Recently, several approaches have been proposed for using convolutional neural networks (CNN) for image compression. Although each proposed network architecture has advantages and disadvantages, most proposed network architectures consist, as a basic form, of an encoding network module, a decoding network module, a quantization module, and an entropy coding module. Also, most of the proposed network architectures focus on improving feature extraction and entropy coding.

The use of adversarial learning, in particular Generative Adversarial Networks (GAN), is promising in generative models for complex tasks such as large image generation, super-resolution, and even image compression. In other words, if a GAN is used for image compression, its ability to model the underlying posterior distribution of the data has the potential to realize image compression that retains visually attractive images at very low bit rates. In the present embodiment, image compression is performed using a convolutional neural network that does not employ a GAN. This is because the present embodiment does not focus on the basic network architecture needed for good image modeling, but is applied to a network architecture for the very difficult hyperparameter search needed to achieve good convergence. However, a GAN may be employed to obtain better results.

(Experimental example)
The effectiveness of the image processing apparatus 10 according to the present embodiment was verified using a data set of learning images and a data set of test images, and the verification is described below.

FIG. 9 is a diagram showing an experimental result on effectiveness verification of the image processing apparatus 10 in the first embodiment. FIG. 9 shows experimental results when learned with the RAISE 6K data set and verified with the KODAK test data set. In FIG. 9, “Encoder” corresponds to the image encoding unit 101, and “PostProcessor” corresponds to the post-processing unit 108. “All Modules” corresponds to the image processing apparatus 10.

The RAISE 6K dataset is a dataset of raw natural images, consisting of 6,000 4K photographs evenly divided into seven categories: indoor, outdoor, nature, people, objects, buildings. For learning, 10 patches of 128 × 128 pixels are randomly extracted from each image of the RAISE 6K data set to form the data set of learning images.
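The patch extraction described above can be sketched in Python as follows, assuming images are held as nested lists of pixel rows. The function names, the uniform sampling of patch origins, and the use of a seeded `random.Random` are assumptions. The source only states that 10 patches of 128 × 128 pixels are taken at random from each image.

```python
import random
from typing import List, Optional, Tuple

Image = List[List[int]]


def sample_patch_origins(height: int, width: int, patch: int = 128,
                         count: int = 10,
                         rng: Optional[random.Random] = None
                         ) -> List[Tuple[int, int]]:
    """Pick `count` random top-left corners so each patch fits in the image."""
    rng = rng or random.Random()
    if height < patch or width < patch:
        raise ValueError("image is smaller than the patch size")
    return [(rng.randint(0, height - patch), rng.randint(0, width - patch))
            for _ in range(count)]


def crop(image: Image, top: int, left: int, patch: int = 128) -> Image:
    """Cut a patch x patch window out of a row-major image."""
    return [row[left:left + patch] for row in image[top:top + patch]]


# Hypothetical 200 x 300 single-channel image filled with zeros.
image = [[0] * 300 for _ in range(200)]
origins = sample_patch_origins(200, 300, rng=random.Random(0))
patches = [crop(image, top, left) for top, left in origins]
print(len(patches), len(patches[0]), len(patches[0][0]))
```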

KODAK's data set is a test data set composed of natural images, and consists of 24 images of 768 × 512 pixels. Note that the natural images that make up the KODAK data set include various colors and textures, so it is a difficult data set for image compression.

Also, in this experiment, learning was performed using the following hyperparameters. The learning rate started at 0.0001 and was reduced by a factor of five every 100,000 iterations. Training was performed using 128 images per batch and repeated for 400,000 iterations. Adam (Adaptive moment estimation) was adopted as the optimization method.
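The learning-rate schedule can be written as a short Python function. It assumes that "reduced by a factor of five" means dividing the rate by five at each 100,000-iteration boundary. The function and parameter names are hypothetical.

```python
def learning_rate(iteration: int, base_lr: float = 1e-4,
                  decay_factor: float = 5.0,
                  decay_every: int = 100_000) -> float:
    """Step schedule: divide the rate by `decay_factor` every `decay_every` steps."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    return base_lr / (decay_factor ** (iteration // decay_every))


# Print the rate in force at the start of each 100,000-iteration stage.
for it in (0, 100_000, 200_000, 300_000):
    print(it, learning_rate(it))
```

Over 400,000 iterations this gives four stages, with the final stage running at 1/125 of the initial rate.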

As shown in FIG. 9, the verification showed that, for "Encoder" and "PostProcessor", the compression ratio was 2.05 and MS-SSIM was 0.96 and 0.975, respectively, values close to 1. For "All Modules", the compression ratio was 2.05, a better value than for the individual modules "Encoder" and "PostProcessor". For "All Modules", MS-SSIM was 0.975, the same value as for "PostProcessor".

These experimental results show that, by using the convolutional neural network, the image processing apparatus 10 according to this embodiment can perform compression of an image in which deterioration of image quality is further suppressed, and can obtain a decompressed image in which deterioration of quality is further suppressed.

[Implementation example of encoding device]
FIG. 10 is a block diagram showing an implementation example of the coding apparatus 100 according to the first embodiment. The encoding device 100 includes a circuit 160 and a memory 162. For example, a part of the image processing apparatus 10 shown in FIG. 2 and a plurality of components of the encoding apparatus 100 shown in FIG. 3 are implemented by the circuit 160 and the memory 162 shown in FIG.

The circuit 160 is a circuit that performs information processing and can access the memory 162. For example, the circuit 160 is a dedicated or general-purpose electronic circuit that encodes an image. The circuit 160 may be a processor such as a CPU. The circuit 160 may also be an assembly of a plurality of electronic circuits. Also, for example, the circuit 160 may play a role of a plurality of components excluding the component for storing information among the plurality of components of the encoding device 100 illustrated in FIG. 3 and the like.

The memory 162 is a dedicated or general-purpose memory in which information for the circuit 160 to encode an image is stored. The memory 162 may be an electronic circuit or may be connected to the circuit 160. The memory 162 may also be included in the circuit 160. Also, the memory 162 may be a collection of a plurality of electronic circuits. In addition, the memory 162 may be a magnetic disk or an optical disk, or may be expressed as a storage or a recording medium. The memory 162 may be a non-volatile memory or a volatile memory.

For example, the memory 162 may store a moving image composed of a plurality of images to be encoded, or may store a bit string corresponding to the encoded image. The memory 162 may also store a program for the circuit 160 to encode a moving image. Further, in the memory 162, a plurality of convolutional neural network models may be stored. For example, the memory 162 may store a plurality of parameters of a plurality of convolutional neural network models.

Note that, in the encoding apparatus 100, all of the plurality of components shown in FIG. 3 and the like may not be mounted, or all of the plurality of processes described above may not be performed. Some of the plurality of components shown in FIG. 3 and the like may be included in another device, and some of the plurality of processes described above may be performed by another device.

An operation example of the coding apparatus 100 shown in FIG. 3 will be shown below.

FIG. 11 is a flowchart showing an operation example of the coding apparatus 100 shown in FIG. For example, the coding apparatus 100 shown in FIG. 10 performs the operation shown in FIG.

Specifically, using the memory 162, the circuit 160 of the encoding device 100 performs compression processing on the input image by converting the input image from the image space region to the encoding space region using the first convolutional neural network model (S101).

Next, using the memory 162, the circuit 160 of the encoding apparatus 100 uses the second convolutional neural network model to perform a process of extracting a feature amount used in post-processing, which is processing for bringing the decompressed image closer to the input image (S102).

Thereby, the encoding apparatus 100 can perform compression of an image in which deterioration of image quality is further suppressed, using the first convolutional neural network model for converting to the encoding space and the second convolutional neural network model for extracting the feature amount used in the post-processing.

[Implementation example of decoding device]
FIG. 12 is a block diagram showing an implementation example of the decoding apparatus 200 according to the first embodiment. The decoding device 200 includes a circuit 260 and a memory 262. For example, a part of the image processing apparatus 10 shown in FIG. 2 and a plurality of components of the decoding apparatus 200 shown in FIG. 4 are implemented by the circuit 260 and the memory 262 shown in FIG.

The circuit 260 is a circuit that performs information processing and can access the memory 262. For example, circuit 260 is a dedicated or general purpose electronic circuit that uses memory 262 to decode the compressed image. The circuit 260 may be a processor such as a CPU. Also, the circuit 260 may be a collection of a plurality of electronic circuits. Also, for example, the circuit 260 may play a role of a plurality of components excluding the component for storing information among the plurality of components of the decoding apparatus 200 illustrated in FIG. 4 and the like.

The memory 262 is a dedicated or general-purpose memory in which information for the circuit 260 to decode a compressed image or a decompressed image after decoding is stored. The memory 262 may be an electronic circuit or may be connected to the circuit 260. Also, the memory 262 may be included in the circuit 260. Further, the memory 262 may be a collection of a plurality of electronic circuits. Also, the memory 262 may be a magnetic disk or an optical disk, or may be expressed as a storage or a recording medium. The memory 262 may be either a non-volatile memory or a volatile memory.

For example, in the memory 262, a bit string corresponding to the encoded image (compressed image) may be stored, or a decompressed image corresponding to the decoded bit string may be stored. The memory 262 may also store a program for the circuit 260 to decode an image. Further, the memory 262 may store a plurality of convolutional neural network models. For example, the memory 262 may store a plurality of parameters of a plurality of convolutional neural network models.

In the decoding apparatus 200, all of the plurality of components shown in FIG. 4 and the like may not be mounted, or all of the plurality of processes described above may not be performed. Some of the plurality of components shown in FIG. 4 and the like may be included in another device, and some of the plurality of processes described above may be performed by another device.

Hereinafter, an operation example of the decoding apparatus 200 shown in FIG. 12 will be shown.

FIG. 13 is a flow chart showing an operation example of the decoding apparatus 200 shown in FIG. For example, the decoding apparatus 200 shown in FIG. 12 performs the operation shown in FIG.

Specifically, using the memory 262, the circuit 260 of the decoding device 200 performs decompression processing on the input image by converting the input image from the encoding space region to the image space region using the first convolutional neural network model (S411).

Next, using the memory 262, the circuit 260 of the decoding device 200 uses the second convolutional neural network model to perform a process of acquiring a feature amount used in post-processing, which is processing for bringing the decompressed image, which is the result of decompression of the input image, closer to the original image of the input image (S412).

As a result, the decoding apparatus 200 can obtain a decompressed image in which deterioration of image quality is further suppressed, using the first convolutional neural network model for converting to the image space and the second convolutional neural network model for acquiring the feature amount used in the post-processing.
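The two-step encoder flow (S101, S102) and the two-step decoder flow (S411, S412) can be outlined with a minimal Python sketch. The models are passed in as plain callables. The toy functions `coarse`, `detail`, and `identity` are hypothetical stand-ins for the trained convolutional neural network models. The final addition is only a stand-in for the post-processing, which the embodiment performs with a third convolutional neural network model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Signal = List[float]
Model = Callable[[Signal], Signal]


@dataclass
class Encoded:
    code: Signal      # data in the encoding space region
    features: Signal  # feature amounts kept for post-processing


def encode(image: Signal, first_model: Model, second_model: Model) -> Encoded:
    code = first_model(image)       # S101: image space -> encoding space
    features = second_model(image)  # S102: extract post-processing features
    return Encoded(code, features)


def decode(enc: Encoded, first_model_inv: Model,
           second_model_dec: Model) -> Tuple[Signal, Signal]:
    decompressed = first_model_inv(enc.code)   # S411: back to image space
    features = second_model_dec(enc.features)  # S412: acquire features
    return decompressed, features


# Toy stand-ins (assumptions, not trained models).
def coarse(image: Signal) -> Signal:
    return [float(round(v)) for v in image]


def detail(image: Signal) -> Signal:
    return [v - float(round(v)) for v in image]


def identity(x: Signal) -> Signal:
    return list(x)


image = [1.2, 2.7, 3.1]
enc = encode(image, coarse, detail)
decompressed, features = decode(enc, identity, identity)
# Stand-in for post-processing: re-add the detail lost in the coarse code.
restored = [d + f for d, f in zip(decompressed, features)]
print(decompressed, restored)
```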

[Supplement]
In addition, coding apparatus 100 and decoding apparatus 200 in the present embodiment may be used, respectively, as an image coding apparatus that codes an image such as an intra picture and an image decoding apparatus that decodes a compressed image. Furthermore, encoding apparatus 100 and decoding apparatus 200 in the present embodiment may be used, respectively, as a moving image encoding apparatus that encodes each of a plurality of images and a moving image decoding apparatus that decodes each of a plurality of compressed images.

In addition, at least a part of the present embodiment may be used as a coding method, may be used as a decoding method, or may be used as another method.

Further, in the present embodiment, each component may be configured by dedicated hardware or implemented by executing a software program suitable for each component. Each component may be realized by a program execution unit such as a CPU or a processor reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory.

Specifically, the image processing apparatus 10 may include a processing circuit (Processing Circuitry) and a storage device (Storage) electrically connected to the processing circuit and accessible to the processing circuit. For example, the processing circuit corresponds to the circuit 110 and the storage device corresponds to the memory 262.

The processing circuit includes at least one of dedicated hardware and a program execution unit, and executes processing using a storage device. In addition, when the processing circuit includes a program execution unit, the storage device stores a software program executed by the program execution unit.

Here, the software for realizing the image processing apparatus 10 and the like of the present embodiment is a program as follows.

That is, this program may cause a computer to execute an encoding method that performs compression processing on an input image by converting the input image from the image space region to the encoding space region using the first convolutional neural network model, and that uses the second convolutional neural network model to extract feature quantities used in post-processing, which is processing for bringing a decompressed image, which is the result of compression and decompression of the input image, closer to the input image. The program may also cause the computer to execute a decoding method that performs decompression processing on an input image by converting the input image from the encoding space region to the image space region using the first convolutional neural network model, and that uses the second convolutional neural network model to acquire a feature amount used in post-processing, which is processing for bringing a decompressed image, which is the result of decompression of the input image, closer to the original image of the input image.

Also, each component may be a circuit as described above. These circuits may constitute one circuit as a whole or may be separate circuits. Each component may be realized by a general purpose processor or a dedicated processor.

Also, another component may execute the processing that a particular component performs. Further, the order of executing the processing may be changed, or a plurality of processing may be executed in parallel. In addition, first and second ordinal numbers may be given as appropriate to components and the like.

Although the aspect of the image processing apparatus 10 has been described above based on the embodiment, the aspect of the image processing apparatus 10 is not limited to this embodiment. Without departing from the spirit of the present disclosure, forms obtained by applying to the present embodiment various modifications that may occur to those skilled in the art, and forms configured by combining components in different embodiments, may also be included within the scope of the aspect of the image processing apparatus 10.

This aspect may be practiced in combination with at least some of the other aspects in this disclosure. Also, part of the processing or part of the configuration of this aspect may be implemented in combination with other aspects.

This aspect may be practiced in combination with at least some of the other aspects in the present disclosure. In addition, part of the processing described in the flowchart of this aspect, part of the configuration of the apparatus, part of the syntax, and the like may be implemented in combination with other aspects.

Second Embodiment
In each of the above embodiments, each of the functional blocks can usually be realized by an MPU, a memory, and the like. Further, the processing by each of the functional blocks is usually realized by a program execution unit such as a processor reading and executing software (program) recorded in a recording medium such as a ROM. The software may be distributed by downloading or the like, or may be distributed by being recorded in a recording medium such as a semiconductor memory. Of course, it is also possible to realize each functional block by hardware (dedicated circuit).

Also, the processing described in each embodiment may be realized by centralized processing using a single device (system), or may be realized by distributed processing using a plurality of devices. Moreover, the processor that executes the program may be singular or plural. That is, centralized processing may be performed, or distributed processing may be performed.

The aspects of the present disclosure are not limited to the above examples, and various modifications are possible, which are also included in the scope of the aspects of the present disclosure.

Furthermore, an application example of the moving picture coding method (image coding method) or the moving picture decoding method (image decoding method) shown in each of the above-described embodiments and a system using the same will be described. The system is characterized by having an image coding apparatus using an image coding method, an image decoding apparatus using an image decoding method, and an image coding / decoding apparatus provided with both. Other configurations in the system can be suitably modified as the case may be.

[Example of use]
FIG. 14 is a diagram showing an overall configuration of a content supply system ex100 for realizing content distribution service. The area for providing communication service is divided into desired sizes, and base stations ex106, ex107, ex108, ex109 and ex110, which are fixed wireless stations, are installed in each cell.

In this content supply system ex100, devices such as a computer ex111, a game machine ex112, a camera ex113, a home appliance ex114, and a smartphone ex115 are connected to the Internet ex101 via the Internet service provider ex102 or the communication network ex104 and the base stations ex106 to ex110. The content supply system ex100 may connect any of the above-described elements in combination. The respective devices may be connected to each other directly or indirectly via a telephone network, near-field radio, or the like, without going through the base stations ex106 to ex110, which are fixed wireless stations. Also, the streaming server ex103 is connected to each device such as the computer ex111, the game machine ex112, the camera ex113, the home appliance ex114, and the smartphone ex115 via the Internet ex101 or the like. Also, the streaming server ex103 is connected to a terminal or the like in a hotspot in the airplane ex117 via the satellite ex116.

A radio access point or a hotspot may be used instead of base stations ex106 to ex110. Also, the streaming server ex103 may be directly connected to the communication network ex104 without the internet ex101 or the internet service provider ex102, or may be directly connected with the airplane ex117 without the satellite ex116.

The camera ex113 is a device capable of shooting a still image such as a digital camera and shooting a moving image. The smartphone ex115 is a smartphone, a mobile phone, a PHS (Personal Handyphone System), or the like compatible with a mobile communication system generally called 2G, 3G, 3.9G, 4G, and 5G in the future.

The home appliance ex118 is a refrigerator or a device included in a home fuel cell cogeneration system.

In the content supply system ex100, when a terminal having a photographing function is connected to the streaming server ex103 through the base station ex106 or the like, live distribution and the like become possible. In live distribution, a terminal (a computer ex111, a game machine ex112, a camera ex113, a home appliance ex114, a smartphone ex115, a terminal in an airplane ex117, etc.) performs the encoding process described in each embodiment on the still image or moving image content captured by the user using the terminal, multiplexes the video data obtained by the encoding with sound data obtained by encoding the sound corresponding to the video, and transmits the obtained data to the streaming server ex103. That is, each terminal functions as an image coding apparatus according to an aspect of the present disclosure.

Meanwhile, the streaming server ex103 streams the transmitted content data to clients that have requested it. A client is the computer ex111, the game machine ex112, the camera ex113, the home appliance ex114, the smartphone ex115, a terminal in the airplane ex117, or the like capable of decoding the encoded data. Each device that receives the distributed data decodes and reproduces the received data. That is, each device functions as an image decoding device according to an aspect of the present disclosure.

[Distributed processing]
Also, the streaming server ex103 may be a plurality of servers or a plurality of computers, and may process, record, or distribute data in a distributed manner. For example, the streaming server ex103 may be realized by a CDN (Contents Delivery Network), in which content delivery is realized by a large number of edge servers distributed around the world and a network connecting those edge servers. In a CDN, a physically close edge server is dynamically assigned to each client. Delay can be reduced by caching the content on that edge server and distributing it from there. In addition, when an error occurs or when the communication status changes due to an increase in traffic or the like, processing can be distributed among multiple edge servers, the distribution subject can be switched to another edge server, or delivery can continue by bypassing the portion of the network where the failure has occurred. High-speed and stable delivery can therefore be realized.
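The edge-server assignment and failover described above can be illustrated with a minimal sketch. The `EdgeServer` record, the use of a plain distance value as the proximity measure, and the `healthy` flag are assumptions made for illustration; the disclosure does not specify how proximity or failure is determined.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EdgeServer:
    name: str
    distance_km: float    # hypothetical proximity measure to the client
    healthy: bool = True  # False when a failure or overload is detected


def pick_edge_server(servers: Sequence[EdgeServer]) -> Optional[EdgeServer]:
    """Return the closest healthy edge server, or None if none is available."""
    candidates = [s for s in servers if s.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.distance_km)


servers = [
    EdgeServer("edge-a", 12.0),
    EdgeServer("edge-b", 40.0),
    EdgeServer("edge-c", 95.0),
]
primary = pick_edge_server(servers)

# Simulate a failure of the closest server: delivery switches to the next one.
degraded = [
    EdgeServer(s.name, s.distance_km, healthy=(s.name != "edge-a"))
    for s in servers
]
fallback = pick_edge_server(degraded)
```

In this sketch the client is served by `edge-a` while it is healthy and by `edge-b` once `edge-a` is marked as failed.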

Beyond distributing the delivery itself, the encoding of captured data may be performed by each terminal, by the server side, or shared between them. As an example, an encoding process generally runs its processing loop twice. In the first loop, the complexity or code amount of the image is detected in units of frames or scenes. In the second loop, processing is performed to maintain the image quality while improving the coding efficiency. For example, the terminal performs the first encoding pass and the server that receives the content performs the second encoding pass, which improves the quality and efficiency of the content while reducing the processing load on each terminal. In this case, if there is a request to receive and decode in substantially real time, the data produced by the terminal's first encoding pass can also be received and reproduced by another terminal, which enables more flexible real-time delivery.
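The two-loop structure can be sketched as follows. This is only an illustration under stated assumptions: the complexity measure (mean absolute difference between neighbouring samples) and the proportional bit allocation are stand-ins chosen for brevity, and the function names are hypothetical. The disclosure does not prescribe either.

```python
from typing import List, Sequence


def first_pass_complexity(frames: Sequence[Sequence[int]]) -> List[float]:
    """First loop: estimate per-frame complexity as the mean absolute
    difference between neighbouring samples (an assumed, simplified measure)."""
    result = []
    for frame in frames:
        diffs = [abs(a - b) for a, b in zip(frame, frame[1:])]
        result.append(sum(diffs) / len(diffs) if diffs else 0.0)
    return result


def second_pass_bit_budget(complexity: Sequence[float], total_bits: int) -> List[int]:
    """Second loop: split a total bit budget in proportion to complexity."""
    if not complexity:
        return []
    total = sum(complexity)
    if total == 0:
        return [total_bits // len(complexity)] * len(complexity)
    return [int(total_bits * c / total) for c in complexity]


# Three tiny one-dimensional "frames": flat, highly varying, and a smooth ramp.
frames = [[10, 10, 10, 10], [0, 50, 0, 50], [0, 25, 50, 75]]
complexity = first_pass_complexity(frames)   # could run on the terminal
budget = second_pass_bit_budget(complexity, total_bits=9000)  # could run on the server
```

Splitting the two functions mirrors the division of labour in the text: the terminal could run the first pass and send its result with the content, and the server could run the second.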

As another example, the camera ex113 or the like extracts a feature amount from an image, compresses data relating to the feature amount as metadata, and transmits the metadata to the server. The server performs compression suited to the meaning of the image, for example by determining the importance of an object from the feature amount and switching the quantization accuracy accordingly. Feature amount data is particularly effective in improving the accuracy and efficiency of motion vector prediction when the server compresses the data again. Also, the terminal may perform simple coding such as VLC (variable length coding), and the server may perform coding with a large processing load such as CABAC (context-adaptive binary arithmetic coding).
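Switching the quantization accuracy according to importance might look like the following sketch. The importance score in the range 0 to 1, the linear mapping, and the default values `base_qp` and `max_delta` are all assumptions for illustration. The clamp to 0 through 51 follows the quantization parameter range of 8-bit H.264 and HEVC.

```python
def quantization_parameter(importance: float, base_qp: int = 32, max_delta: int = 8) -> int:
    """Map an importance score in [0, 1] to a quantization parameter (QP).

    Important regions get a lower QP (finer quantization); unimportant
    regions get a higher QP (coarser quantization).
    """
    importance = min(max(importance, 0.0), 1.0)
    # importance 0.5 keeps base_qp; 1.0 lowers it by max_delta; 0.0 raises it.
    qp = base_qp - round((importance - 0.5) * 2 * max_delta)
    return min(max(qp, 0), 51)


# Hypothetical importance scores derived from feature-amount metadata.
qp_person = quantization_parameter(1.0)
qp_neutral = quantization_parameter(0.5)
qp_background = quantization_parameter(0.0)
```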

As still another example, in a stadium, a shopping mall, a factory, or the like, there may be a plurality of video data in which substantially the same scene has been shot by a plurality of terminals. In this case, distributed processing is performed by allocating the encoding process, for example in units of GOPs (Group of Pictures), in units of pictures, or in units of tiles into which a picture is divided, among the terminals that performed the shooting and, as necessary, other terminals that did not shoot and servers. This reduces delay and achieves better real-time performance.
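One simple way to allocate coding units among the participating devices is round-robin assignment, sketched below. The worker names and the round-robin policy are assumptions. The text only states that units such as GOPs, pictures, or tiles are allocated, not how.

```python
from typing import Dict, List, Sequence


def allocate_units(unit_ids: Sequence[int], workers: Sequence[str]) -> Dict[str, List[int]]:
    """Assign coding units (e.g. GOP indices) to workers in round-robin order."""
    if not workers:
        raise ValueError("at least one worker is required")
    plan: Dict[str, List[int]] = {w: [] for w in workers}
    for i, unit in enumerate(unit_ids):
        plan[workers[i % len(workers)]].append(unit)
    return plan


# Seven GOPs shared between two shooting terminals and one server.
plan = allocate_units(range(7), ["camera-1", "camera-2", "server"])
```

A real allocator would likely weight the assignment by each device's processing capacity rather than distributing units evenly.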

Further, since the plurality of video data show substantially the same scene, the server may manage the video data and/or issue instructions so that the video data captured by the terminals can refer to one another. Alternatively, the server may receive the encoded data from each terminal and change the reference relationships among the plurality of data, or may correct or replace pictures themselves and re-encode them. This makes it possible to generate streams in which the quality and efficiency of each piece of data are enhanced.

Also, the server may deliver the video data after transcoding it to change its coding scheme. For example, the server may convert an MPEG-family coding scheme into a VP-family coding scheme, or may convert H.264 into H.265.

Thus, the encoding process can be performed by a terminal or by one or more servers. Accordingly, although the following description uses terms such as "server" or "terminal" as the subject performing a process, part or all of the processing performed by the server may be performed by the terminal, and part or all of the processing performed by the terminal may be performed by the server. The same applies to the decoding process.

[3D, multi-angle]
In recent years, there has been increasing use of integrating different scenes captured by terminals such as a plurality of cameras ex113 and/or smartphones ex115 that are substantially synchronized with each other, or images or videos of the same scene captured from different angles. The videos captured by the terminals are integrated based on, for example, the separately acquired relative positional relationship between the terminals, or regions in which feature points included in the videos match.

The server may not only encode a two-dimensional moving image but may also encode a still image, either automatically based on scene analysis of the moving image or at a time designated by the user, and transmit it to the receiving terminal. Furthermore, if the server can acquire the relative positional relationship between the photographing terminals, it can generate the three-dimensional shape of the scene based not only on the two-dimensional moving image but also on videos of the same scene captured from different angles. Note that the server may separately encode three-dimensional data generated from a point cloud or the like, or may generate the video to be transmitted to the receiving terminal by selecting or reconstructing it from videos captured by a plurality of terminals, based on the result of recognizing or tracking a person or an object using the three-dimensional data.

In this way, the user can enjoy a scene by arbitrarily selecting the video corresponding to each photographing terminal, and can also enjoy content in which video from an arbitrary viewpoint is extracted from three-dimensional data reconstructed using a plurality of images or videos. Furthermore, as with the video, sound may be picked up from a plurality of different angles, and the server may multiplex the sound from a specific angle or space with the video, in accordance with the video, and transmit it.

Also, in recent years, content that associates the real world with a virtual world, such as Virtual Reality (VR) and Augmented Reality (AR), has become widespread. In the case of VR images, the server may create viewpoint images for the right eye and for the left eye and encode them so as to allow reference between the viewpoint videos using Multi-View Coding (MVC) or the like, or may encode them as separate streams without mutual reference. When the separate streams are decoded, they may be reproduced in synchronization with each other so that a virtual three-dimensional space is reproduced according to the viewpoint of the user.

In the case of an AR image, the server superimposes virtual object information in the virtual space on camera information in the real space based on the three-dimensional position or the movement of the user's viewpoint. The decoding device may acquire or hold the virtual object information and three-dimensional data, generate two-dimensional images according to the movement of the user's viewpoint, and create superimposed data by connecting them smoothly. Alternatively, the decoding device may transmit the movement of the user's viewpoint to the server in addition to a request for virtual object information, and the server may create superimposed data from the three-dimensional data it holds in accordance with the received viewpoint movement, encode the superimposed data, and distribute it to the decoding device. Note that the superimposed data has, in addition to RGB values, an α value indicating transparency, and the server may set the α value of the portion other than the object created from the three-dimensional data to 0 or the like and encode the data with that portion transparent. Alternatively, the server may set a predetermined RGB value as the background, as in chroma keying, and generate data in which the portion other than the object has the background color.
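The role of the α value and of a chroma-key background can be sketched per pixel as below. This assumes α is a value between 0 (transparent) and 1 (opaque) and uses the standard "over" blend; the function names and the exact-match key test are illustrative assumptions, not details taken from the disclosure.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def composite(fg: RGB, alpha: float, bg: RGB) -> RGB:
    """Blend a superimposed pixel over a camera pixel: out = a*fg + (1 - a)*bg."""
    return tuple(round(alpha * f + (1.0 - alpha) * b) for f, b in zip(fg, bg))


def alpha_from_chroma_key(pixel: RGB, key: RGB) -> float:
    """Treat pixels equal to the key colour as fully transparent background."""
    return 0.0 if pixel == key else 1.0


camera_pixel = (10, 20, 30)
# Outside the object the alpha value is 0, so the camera pixel shows through.
outside = composite((255, 0, 0), 0.0, camera_pixel)
# Inside the object the alpha value is 1, so the virtual object is shown.
inside = composite((255, 0, 0), 1.0, camera_pixel)
# Chroma-key variant: a green key colour marks the non-object portion.
keyed = composite((0, 255, 0), alpha_from_chroma_key((0, 255, 0), (0, 255, 0)), camera_pixel)
```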

Similarly, the decoding of the distributed data may be performed by each terminal that is a client, by the server side, or shared between them. As one example, a terminal may first send a reception request to the server, the content corresponding to the request may be received and decoded by another terminal, and the decoded signal may be transmitted to a device having a display. By distributing processing and selecting appropriate content regardless of the performance of the communicating terminal itself, data of high image quality can be reproduced. As another example, while large-size image data is received by a TV or the like, a partial area such as a tile into which the picture is divided may be decoded and displayed on a viewer's personal terminal. This allows the viewer to check, on the terminal at hand, the area the viewer is responsible for or an area to be examined in more detail, while the whole image is shared.

Also, in the future, in situations where multiple short-distance, middle-distance, or long-distance wireless communication links are available both indoors and outdoors, content is expected to be received seamlessly by using a delivery system standard such as MPEG-DASH and switching to the data appropriate for the link currently in use. As a result, the user can switch in real time while freely selecting not only the user's own terminal but also a decoding device or display device such as a display installed indoors or outdoors. In addition, decoding can be performed while switching the terminal that decodes and the terminal that displays, based on the user's own position information and the like. This makes it possible, while moving to a destination, to display map information on part of the wall or ground of a neighboring building in which a display-capable device is embedded. It is also possible to switch the bit rate of the received data based on how easily the encoded data can be accessed over the network, for example whether the encoded data is cached on a server that the receiving terminal can reach in a short time or has been copied to an edge server in a content delivery service.

[Scalable coding]
The switching of content will be described using the scalable stream shown in the figure, which is compression-coded by applying the moving picture coding method described in each of the above embodiments. The server may hold, as individual streams, a plurality of streams with the same content but different quality. Alternatively, the content may be switched by exploiting the characteristics of a temporally/spatially scalable stream, which is realized by coding in layers as shown in the figure. That is, by deciding up to which layer to decode according to internal factors such as performance and external factors such as the state of the communication band, the decoding side can freely switch between low-resolution content and high-resolution content when decoding. For example, when the user wants to continue watching, on a device such as an Internet TV after returning home, a video that was being watched on the smartphone ex115 while on the move, the device only has to decode the same stream up to a different layer, so the burden on the server side can be reduced.
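The decision of how many layers to decode can be sketched as follows. The `Layer` record, the cumulative bitrate field, and the use of picture height as the device-capability limit are assumptions for illustration. The text names only "performance" and "the state of the communication band" as the deciding factors.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Layer:
    name: str
    height: int           # vertical resolution when decoded up to this layer
    cumulative_kbps: int  # bitrate needed for this layer and all layers below it


def layers_to_decode(layers: Sequence[Layer], bandwidth_kbps: int, max_height: int) -> List[Layer]:
    """Return the base layer plus every enhancement layer that fits both limits.

    `layers` must be ordered from the base layer upward.
    """
    selected = [layers[0]]  # the base layer is always decoded
    for layer in layers[1:]:
        if layer.cumulative_kbps > bandwidth_kbps or layer.height > max_height:
            break
        selected.append(layer)
    return selected


layers = [
    Layer("base", 360, 800),
    Layer("enh1", 720, 2500),
    Layer("enh2", 1080, 6000),
]
# Smartphone on the move: limited bandwidth and a 720-line display.
on_the_move = layers_to_decode(layers, bandwidth_kbps=3000, max_height=720)
# Internet TV at home: ample bandwidth and a large display.
at_home = layers_to_decode(layers, bandwidth_kbps=10000, max_height=2160)
```

Both devices read the same stream; only the number of layers decoded differs, which is why the server does not need to prepare a separate stream per device.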

Furthermore, besides the configuration described above, in which pictures are encoded layer by layer and scalability is realized by an enhancement layer existing above the base layer, the enhancement layer may include meta information based on statistical information of the image or the like, and the decoding side may generate high-quality content by super-resolving a picture of the base layer based on the meta information. Super-resolution here may be either an improvement in the SN ratio at the same resolution or an increase in resolution. The meta information includes information for identifying linear or non-linear filter coefficients used in the super-resolution processing, or information for identifying parameter values in the filter processing, machine learning, or least squares operation used in the super-resolution processing.

Alternatively, the picture may be divided into tiles or the like according to the meaning of objects or the like in the image, and the decoding side may decode only a partial area by selecting the tiles to be decoded. Also, by storing the attribute of an object (person, car, ball, etc.) and its position in the video (coordinate position in the same image, etc.) as meta information, the decoding side can identify the position of a desired object based on the meta information and determine the tile that contains the object. For example, as shown in FIG. 16, the meta information is stored using a data storage structure different from that of the pixel data, such as an SEI message in HEVC. This meta information indicates, for example, the position, size, or color of the main object.
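Determining which tiles contain an object from its stored position can be sketched as below. The sketch assumes a uniform tile grid numbered in raster order and a bounding box given as (x, y, width, height) that lies within the picture; the meta information format itself is not specified here and the function name is hypothetical.

```python
from typing import List, Tuple


def tiles_for_object(box: Tuple[int, int, int, int], tile_w: int, tile_h: int, cols: int) -> List[int]:
    """Return raster-order indices of the tiles overlapped by box = (x, y, w, h)."""
    x, y, w, h = box
    first_col, last_col = x // tile_w, (x + w - 1) // tile_w
    first_row, last_row = y // tile_h, (y + h - 1) // tile_h
    return [
        r * cols + c
        for r in range(first_row, last_row + 1)
        for c in range(first_col, last_col + 1)
    ]


# A 1920x1080 picture divided into a 4x4 grid of 480x270 tiles.
spanning = tiles_for_object((400, 200, 200, 100), tile_w=480, tile_h=270, cols=4)
single = tiles_for_object((500, 300, 100, 100), tile_w=480, tile_h=270, cols=4)
```

The decoder would then request or decode only the tiles in the returned list.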

Also, meta information may be stored in units composed of a plurality of pictures, such as streams, sequences, or random access units. This allows the decoding side to obtain, for example, the time at which a specific person appears in the video, and, by combining this with picture-level information, to identify the picture in which the object exists and the position of the object within that picture.

[Web page optimization]
FIG. 17 is a diagram showing an example of a display screen of a web page on the computer ex111 or the like. FIG. 18 is a diagram showing an example of a display screen of a web page on the smartphone ex115 or the like. As shown in FIGS. 17 and 18, a web page may include a plurality of link images that are links to image content, and their appearance differs depending on the browsing device. When a plurality of link images are visible on the screen, until the user explicitly selects a link image, or until a link image approaches the center of the screen or the entire link image comes within the screen, the display device (decoding device) displays a still image or I picture of each content item as the link image, displays a video like an animated GIF using a plurality of still images or I pictures, or receives only the base layer and decodes and displays the video.

When a link image is selected by the user, the display device decodes the base layer with the highest priority. Note that the display device may decode up to the enhancement layer if the HTML constituting the web page contains information indicating that the content is scalable. Also, to ensure real-time performance, before a selection is made or when the communication band is very constrained, the display device can decode and display only forward-referencing pictures (I pictures, P pictures, and B pictures with forward reference only), thereby reducing the delay between the decoding time and the display time of the leading picture (the delay from the start of decoding of the content to the start of display). In addition, the display device may deliberately ignore the reference relationships of pictures and coarsely decode all B pictures and P pictures as forward-referencing, and then perform normal decoding as time passes and the number of received pictures increases.
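Selecting only forward-referencing pictures for low-delay display can be sketched as a simple filter. The `Picture` record and its `uses_backward_ref` flag are assumptions for illustration; a real decoder would derive this from the reference picture lists in the bitstream.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Picture:
    order: int                       # display order
    kind: str                        # "I", "P", or "B"
    uses_backward_ref: bool = False  # True if it references a later picture


def low_delay_subset(pictures: Sequence[Picture]) -> List[Picture]:
    """Keep I pictures, P pictures, and B pictures that reference only earlier pictures."""
    return [p for p in pictures if p.kind in ("I", "P") or not p.uses_backward_ref]


pictures = [
    Picture(0, "I"),
    Picture(1, "B", uses_backward_ref=True),  # needs picture 3, so it is skipped
    Picture(2, "B"),                          # forward reference only
    Picture(3, "P"),
]
shown = [p.order for p in low_delay_subset(pictures)]
```

Because none of the kept pictures waits for a later picture, each can be displayed as soon as it is decoded.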

[Autonomous driving]
In addition, when still image or video data such as two-dimensional or three-dimensional map information is transmitted or received for autonomous driving or driving assistance of a car, the receiving terminal may receive, in addition to image data belonging to one or more layers, information on weather, construction, or the like as meta information, and may decode these in association with each other. The meta information may belong to a layer or may simply be multiplexed with the image data.

In this case, since the car, drone, airplane, or the like that includes the receiving terminal moves, the receiving terminal can realize seamless reception and decoding while switching among the base stations ex106 to ex110 by transmitting its position information when it makes a reception request. In addition, the receiving terminal can dynamically switch how much meta information it receives or how much of the map information it updates, according to the user's selection, the user's situation, or the state of the communication band.

As described above, in the content providing system ex100, the client can receive, decode, and reproduce the encoded information transmitted by the user in real time.

[Distribution of personal content]
Further, the content supply system ex100 allows unicast or multicast distribution not only of high-quality, long content by video distributors but also of low-quality, short content by individuals. Such personal content is expected to increase in the future. To make personal content better, the server may perform an editing process before the encoding process. This can be realized, for example, with the following configuration.

In real time during shooting, or after shooting on accumulated data, the server performs recognition processing such as shooting-error detection, scene search, semantic analysis, and object detection on the original image or the encoded data. Then, based on the recognition result, the server manually or automatically performs edits such as correcting out-of-focus or camera shake, deleting scenes of low importance such as scenes that are darker than other pictures or out of focus, emphasizing the edges of objects, or changing the color tone. The server encodes the edited data based on the editing result. It is also known that viewership drops when the footage is too long, so the server may automatically clip, based on the image processing result, not only the scenes of low importance described above but also scenes with little motion, so that the content falls within a specific time range that depends on the shooting time. Alternatively, the server may generate and encode a digest based on the result of semantic analysis of the scenes.

Note that personal content may, as captured, include material that infringes a copyright, an author's moral right, a portrait right, or the like, and may be inconvenient for the individual, for example when the range of sharing exceeds the intended range. Thus, for example, the server may deliberately change the faces of people at the periphery of the screen, the interior of a house, or the like into an out-of-focus image before encoding. In addition, the server may recognize whether the face of a person other than a person registered in advance appears in the image to be encoded and, if so, perform processing such as applying a mosaic to the face portion. Alternatively, as preprocessing or post-processing of encoding, the user may designate a person or background area whose image the user wants processed from the viewpoint of copyright or the like, and the server may replace the designated area with another video or blur its focus. In the case of a person, the image of the face portion can be replaced while the person is tracked in the moving image.
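The mosaic step can be sketched as block averaging over a given rectangle. This sketch assumes a grayscale image stored as a list of rows and takes the face rectangle as an input. Face recognition and tracking are outside its scope, and the function name and parameters are hypothetical.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values


def mosaic(image: Image, x: int, y: int, w: int, h: int, block: int) -> Image:
    """Return a copy of `image` with the region (x, y, w, h) pixelated.

    Each block x block cell inside the region is replaced by its mean value.
    """
    out = [row[:] for row in image]
    for by in range(y, y + h, block):
        for bx in range(x, x + w, block):
            ys = range(by, min(by + block, y + h))
            xs = range(bx, min(bx + block, x + w))
            mean = sum(image[j][i] for j in ys for i in xs) // (len(ys) * len(xs))
            for j in ys:
                for i in xs:
                    out[j][i] = mean
    return out


image = [
    [0, 10, 20, 30],
    [40, 50, 60, 70],
    [80, 90, 100, 110],
    [120, 130, 140, 150],
]
# Pixelate a 2x2 "face" region in the top-left corner with one 2x2 block.
masked = mosaic(image, x=0, y=0, w=2, h=2, block=2)
```

Pixels outside the designated rectangle are left unchanged, and the input image is not modified.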

Also, since there is a strong demand for real-time performance when viewing personal content with a small amount of data, the decoding device first receives the base layer with the highest priority and decodes and reproduces it, although this depends on the bandwidth. The decoding device may receive the enhancement layer during this period and, when the content is played back two or more times, such as when playback is looped, may play back high-quality video including the enhancement layer. With a stream that is scalably coded in this way, it is possible to provide an experience in which the video is coarse when unselected or at the start of viewing, but the stream gradually becomes more refined and the image improves. Besides scalable coding, the same experience can be provided when a coarse stream played back the first time and a second stream coded with reference to the first are configured as a single stream.

[Other use cases]
These encoding or decoding processes are generally performed in the LSI ex500 that each terminal has. The LSI ex500 may be a single chip or may consist of a plurality of chips. Note that software for moving image encoding or decoding may be incorporated in any recording medium (a CD-ROM, a flexible disk, a hard disk, or the like) readable by the computer ex111 or the like, and the encoding or decoding may be performed using that software. Furthermore, when the smartphone ex115 is equipped with a camera, moving image data acquired by the camera may be transmitted. The moving image data in this case is data encoded by the LSI ex500 included in the smartphone ex115.

The LSI ex500 may be configured to download and activate application software. In this case, the terminal first determines whether it supports the coding scheme of the content or has the ability to execute a specific service. If the terminal does not support the coding scheme of the content or does not have the ability to execute the specific service, the terminal downloads a codec or application software and then acquires and reproduces the content.

Further, at least one of the moving picture coding apparatus (image coding apparatus) and the moving picture decoding apparatus (image decoding apparatus) of the above embodiments can be incorporated not only into the content supply system ex100 via the Internet ex101 but also into a digital broadcasting system. Because multiplexed data in which video and sound are multiplexed is carried on broadcast radio waves and transmitted and received using a satellite or the like, a digital broadcasting system is suited to multicast, in contrast to the content supply system ex100, whose configuration lends itself to unicast. However, similar applications are possible with regard to the encoding process and the decoding process.

[Hardware configuration]
FIG. 19 is a diagram showing the smartphone ex115. FIG. 20 is a diagram showing a configuration example of the smartphone ex115. The smartphone ex115 includes an antenna ex450 for transmitting and receiving radio waves to and from the base station ex110, a camera unit ex465 capable of capturing video and still images, and a display unit ex458 for displaying decoded data such as video captured by the camera unit ex465 and video received by the antenna ex450. The smartphone ex115 further includes an operation unit ex466 that is a touch panel or the like, an audio output unit ex457 that is a speaker or the like for outputting voice or sound, an audio input unit ex456 that is a microphone or the like for inputting voice, a memory unit ex467 capable of storing encoded or decoded data such as captured video or still images, recorded audio, received video or still images, and mail, and a slot unit ex464 that is an interface with a SIM ex468 for identifying the user and authenticating access to the network and to various data. Note that an external memory may be used instead of the memory unit ex467.

Further, a main control unit ex460 that integrally controls the display unit ex458, the operation unit ex466, and the like is connected via a bus ex470 to a power supply circuit unit ex461, an operation input control unit ex462, a video signal processing unit ex455, a camera interface unit ex463, a display control unit ex459, a modulation/demodulation unit ex452, a multiplexing/demultiplexing unit ex453, an audio signal processing unit ex454, the slot unit ex464, and the memory unit ex467.

When the power supply key is turned on by the user's operation, the power supply circuit unit ex461 activates the smartphone ex115 to an operable state by supplying power from the battery pack to each unit.

The smartphone ex115 performs processing such as calls and data communication under the control of the main control unit ex460, which has a CPU, a ROM, a RAM, and the like. During a call, the audio signal collected by the audio input unit ex456 is converted to a digital audio signal by the audio signal processing unit ex454, subjected to spread spectrum processing by the modulation/demodulation unit ex452, subjected to digital-to-analog conversion processing and frequency conversion processing by the transmission/reception unit ex451, and then transmitted via the antenna ex450. The received data is amplified, subjected to frequency conversion processing and analog-to-digital conversion processing, subjected to spectrum despreading processing by the modulation/demodulation unit ex452, converted to an analog audio signal by the audio signal processing unit ex454, and then output from the audio output unit ex457. In the data communication mode, text, still image, or video data is sent to the main control unit ex460 via the operation input control unit ex462 by operation of the operation unit ex466 or the like of the main body, and transmission and reception processing is performed in the same way. When video, still images, or video and audio are transmitted in the data communication mode, the video signal processing unit ex455 compresses and encodes the video signal stored in the memory unit ex467 or the video signal input from the camera unit ex465 by the moving picture coding method described in each of the above embodiments, and sends the encoded video data to the multiplexing/demultiplexing unit ex453. The audio signal processing unit ex454 encodes the audio signal collected by the audio input unit ex456 while the camera unit ex465 is capturing video, still images, or the like, and sends the encoded audio data to the multiplexing/demultiplexing unit ex453. The multiplexing/demultiplexing unit ex453 multiplexes the encoded video data and the encoded audio data according to a predetermined method, and the result is subjected to modulation processing and conversion processing by the modulation/demodulation unit (modulation/demodulation circuit unit) ex452 and the transmission/reception unit ex451 and transmitted via the antenna ex450.

When a video attached to an e-mail or a chat, or a video linked from a web page or the like, is received, the multiplexing/demultiplexing unit ex453 demultiplexes the multiplexed data received via the antenna ex450 in order to decode it, thereby dividing the multiplexed data into a bit stream of video data and a bit stream of audio data. It supplies the encoded video data to the video signal processing unit ex455 via the synchronization bus ex470 and supplies the encoded audio data to the audio signal processing unit ex454. The video signal processing unit ex455 decodes the video signal by the moving picture decoding method corresponding to the moving picture coding method described in each of the above embodiments, and the video or still image included in the linked moving image file is displayed on the display unit ex458 via the display control unit ex459. The audio signal processing unit ex454 decodes the audio signal, and the audio is output from the audio output unit ex457. Because real-time streaming has become widespread, there may be situations in which, depending on the user's circumstances, audio reproduction is socially inappropriate. Therefore, as the initial setting, a configuration that reproduces only the video data without reproducing the audio signal is preferable. The audio may be reproduced in synchronization only when the user performs an operation such as clicking on the video data.
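The multiplexing and demultiplexing steps can be sketched with a toy packet format. The tagged-tuple container and the simple interleaving order are assumptions for illustration only; the text says just that multiplexing follows "a predetermined method".

```python
from typing import Iterable, List, Tuple

Packet = Tuple[str, bytes]  # (stream kind, payload); a hypothetical container format


def multiplex(video: List[bytes], audio: List[bytes]) -> List[Packet]:
    """Interleave video and audio payloads into one tagged packet sequence."""
    muxed: List[Packet] = []
    for i in range(max(len(video), len(audio))):
        if i < len(video):
            muxed.append(("video", video[i]))
        if i < len(audio):
            muxed.append(("audio", audio[i]))
    return muxed


def demultiplex(packets: Iterable[Packet]) -> Tuple[List[bytes], List[bytes]]:
    """Split a tagged packet sequence back into video and audio payload lists."""
    video: List[bytes] = []
    audio: List[bytes] = []
    for kind, payload in packets:
        (video if kind == "video" else audio).append(payload)
    return video, audio


video_in = [b"v0", b"v1", b"v2"]
audio_in = [b"a0", b"a1"]
muxed = multiplex(video_in, audio_in)
video_out, audio_out = demultiplex(muxed)
```

After demultiplexing, the video payloads would go to the video decoder and the audio payloads to the audio decoder, mirroring the roles of the video signal processing unit ex455 and the audio signal processing unit ex454.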

Although the smartphone ex115 has been described here as an example, three implementation forms are conceivable for a terminal: a transmitting and receiving terminal having both an encoder and a decoder, a transmitting terminal having only an encoder, and a receiving terminal having only a decoder. Furthermore, although the digital broadcasting system has been described as receiving or transmitting multiplexed data in which audio data and the like are multiplexed with video data, character data related to the video or the like may be multiplexed in the multiplexed data in addition to audio data, or the video data itself, rather than multiplexed data, may be received or transmitted.

Although the main control unit ex460 including the CPU has been described as controlling the encoding or decoding process, terminals often include a GPU as well. Therefore, a configuration may be adopted in which a large area is processed collectively by exploiting the performance of the GPU, using a memory shared by the CPU and the GPU or a memory whose addresses are managed so that it can be used by both. This shortens the coding time, ensures real-time performance, and achieves low delay. In particular, it is efficient to perform motion search, deblocking filtering, sample adaptive offset (SAO), and transform/quantization collectively in units of pictures or the like on the GPU instead of the CPU.

The present disclosure is applicable to, for example, a television receiver, a digital video recorder, a car navigation system, a mobile phone, a digital camera, a digital video camera, a video conference system, an electronic mirror, and the like.

DESCRIPTION OF SYMBOLS
10 Image processing apparatus
100 Encoding device
101 Image coding unit
102 Post-processing feature extraction unit
103 Quantization unit
103A Quantization unit
103B Inverse quantization unit
104 Entropy coding unit
104A Entropy coding unit
104B Entropy decoding unit
105 Storage unit
106 Image decoding unit
107 Post-processing feature acquisition unit
108 Post-processing unit
160, 260 Circuit
162, 262 Memory
200 Decoding device
300 Convolutional neural network
310, 322, 330 Convolutional block
311 Convolutional layer
312 Nonlinear activation function
313 Normalization layer
320 Residual block
321 Residual group

Claims

An encoding device comprising:
a memory; and
a circuit capable of accessing the memory,
wherein the circuit capable of accessing the memory:
performs compression processing on an input image by converting the input image from an image space region to an encoding space region using a first convolutional neural network model; and
performs, using a second convolutional neural network model, a process of extracting a feature quantity used in post-processing, the post-processing being processing for bringing a decompressed image, which is the result of compression and decompression of the input image, closer to the input image.
The encoding device according to claim 1,
wherein the feature quantity is high-frequency information included in the input image.
3. The encoding device according to claim 1 or 2, wherein
the first convolutional neural network model and the second convolutional neural network model include two or more convolutional blocks and one or more residual blocks,
each of the two or more convolutional blocks is a processing block comprising one or more convolutional layers, and
each of the one or more residual blocks is a processing block that includes a convolutional group composed of at least one of the two or more convolutional blocks, in which data input to the residual block is input to the convolutional group included in the residual block, and the data input to the residual block is added to data output from the convolutional group.
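
As an illustration only, the residual block recited above (input fed to a convolutional group, then added to that group's output) might be sketched as follows. This is a minimal one-dimensional sketch in plain Python, not the implementation of the disclosure: the names conv1d_same, convolutional_block and residual_block, the fixed kernels, the odd kernel length, and the ReLU activation are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

Signal = List[float]
Block = Callable[[Signal], Signal]


def conv1d_same(x: Sequence[float], kernel: Sequence[float]) -> Signal:
    """Zero-padded 1-D convolution layer that preserves the input length.

    Assumes an odd-length kernel (hypothetical choice for this sketch).
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):  # samples outside the signal count as zero
                acc += w * x[j]
        out.append(acc)
    return out


def relu(x: Sequence[float]) -> Signal:
    """Nonlinear activation function (assumed to be ReLU here)."""
    return [max(0.0, v) for v in x]


def convolutional_block(kernel: Sequence[float]) -> Block:
    """A processing block with one convolutional layer plus activation."""
    def block(x: Signal) -> Signal:
        return relu(conv1d_same(x, kernel))
    return block


def residual_block(conv_group: Sequence[Block]) -> Block:
    """Feed the input through the convolutional group, then add the input
    back to the group's output (the skip connection)."""
    def block(x: Signal) -> Signal:
        y = list(x)
        for conv in conv_group:
            y = conv(y)
        return [a + b for a, b in zip(x, y)]
    return block
```

With an identity kernel such as [0.0, 1.0, 0.0] the convolutional group passes positive inputs through unchanged, so the residual block doubles them; with an all-zero kernel the block returns its input unchanged.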
4. The encoding device according to claim 3, wherein
the one or more residual blocks are two or more residual blocks.
5. The encoding device according to claim 3 or 4, wherein
the two or more convolutional blocks are four or more convolutional blocks,
the one or more residual blocks constitute a residual group and include at least two convolutional blocks among the four or more convolutional blocks,
at least one convolutional block, among the four or more convolutional blocks, that is not included in the residual group constitutes a first convolutional group,
at least one convolutional block, among the four or more convolutional blocks, that is included in neither the residual group nor the first convolutional group constitutes a second convolutional group,
data output from the first convolutional group is input to the residual group, and
data output from the residual group is input to the second convolutional group.
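
The data flow recited above (first convolutional group, then residual group, then second convolutional group) can be sketched as a simple composition of blocks. The following is a hedged topology sketch only: sequential, residual, scale and build_network are hypothetical names, and scale is a stand-in for a trained convolutional block so that the wiring can be checked by hand.

```python
from typing import Callable, List

Tensor = List[float]
Block = Callable[[Tensor], Tensor]


def sequential(*blocks: Block) -> Block:
    """Chain blocks so that each block's output feeds the next block."""
    def run(x: Tensor) -> Tensor:
        for block in blocks:
            x = block(x)
        return x
    return run


def residual(group: Block) -> Block:
    """Residual block: add the block input to the group's output."""
    def run(x: Tensor) -> Tensor:
        return [a + b for a, b in zip(x, group(x))]
    return run


def scale(factor: float) -> Block:
    """Hypothetical stand-in for a trained convolutional block."""
    return lambda x: [factor * v for v in x]


def build_network() -> Block:
    # Four convolutional blocks in total: one in the first group,
    # two inside the residual group, one in the second group.
    first_group = sequential(scale(2.0))
    residual_group = sequential(
        residual(sequential(scale(0.5))),
        residual(sequential(scale(0.5))),
    )
    second_group = sequential(scale(3.0))
    return sequential(first_group, residual_group, second_group)
```

For an input value of 1.0 the stages give 2.0, then 3.0 and 4.5 inside the residual group, and finally 13.5, which makes the order of the three groups easy to verify.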
6. A decoding device comprising:
a memory; and
a circuit capable of accessing the memory,
wherein the circuit:
performs a decompression process on an input image by converting the input image from an encoding space region to an image space region using a first convolutional neural network model; and
uses a second convolutional neural network model to acquire a feature amount used in post-processing, the post-processing being processing that brings a decompressed image, obtained as a result of the decompression of the input image, closer to an original image of the input image.
7. The decoding device according to claim 6, wherein
the circuit further uses a third convolutional neural network model to perform the post-processing, in which the feature amount acquired using the second convolutional neural network model is used to bring the decompressed image, obtained using the first convolutional neural network model, closer to the original image.
8. An encoding method comprising:
performing a compression process on an input image by converting the input image from an image space region to an encoding space region using a first convolutional neural network model; and
extracting, using a second convolutional neural network model, a feature amount used in post-processing, the post-processing being processing that brings a decompressed image, obtained as a result of compression and decompression of the input image, closer to the input image.
9. A decoding method comprising:
performing a decompression process on an input image by converting the input image from an encoding space region to an image space region using a first convolutional neural network model; and
acquiring, using a second convolutional neural network model, a feature amount used in post-processing, the post-processing being processing that brings a decompressed image, obtained as a result of the decompression of the input image, closer to an original image of the input image.
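
For orientation, the flow of data between the encoding and decoding methods can be sketched with fixed, hand-written functions standing in for the trained models. Everything in this sketch is an assumption made for illustration: pair averaging stands in for the compression model, sample repetition for the decompression model, the lost detail for the extracted feature amount, and simple addition for the post-processing model. It does not model quantization or entropy coding, and the feature here is not compact as a practical feature amount would need to be. The input length is assumed to be even.

```python
from typing import List, Tuple

Image = List[float]


def compress(image: Image) -> Image:
    """Stand-in for the first model on the encoder side:
    average each adjacent pair of samples (assumes even length)."""
    return [(image[i] + image[i + 1]) / 2 for i in range(0, len(image) - 1, 2)]


def decompress(code: Image) -> Image:
    """Stand-in for the first model on the decoder side:
    repeat each coded value twice."""
    return [v for v in code for _ in range(2)]


def extract_feature(image: Image) -> Image:
    """Stand-in for the second model: the detail that compression
    followed by decompression fails to preserve."""
    restored = decompress(compress(image))
    return [a - b for a, b in zip(image, restored)]


def post_process(decompressed: Image, feature: Image) -> Image:
    """Stand-in for the post-processing model: use the feature to bring
    the decompressed image closer to the original."""
    return [a + b for a, b in zip(decompressed, feature)]


def encode(image: Image) -> Tuple[Image, Image]:
    """Encoding side: compressed code plus post-processing feature."""
    return compress(image), extract_feature(image)


def decode(code: Image, feature: Image) -> Image:
    """Decoding side: decompress, then post-process with the feature."""
    return post_process(decompress(code), feature)
```

In this toy setting the feature carries exactly the high-frequency detail removed by averaging, so decode(*encode(image)) reproduces the input; trained models would only approximate it.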