CN113411615A - Virtual reality-oriented latitude self-adaptive panoramic image coding method

Virtual reality-oriented latitude self-adaptive panoramic image coding method

Info

Publication number
CN113411615A
CN113411615A (application CN202110694372.9A); granted publication CN113411615B
Authority
CN
China
Prior art keywords
data
coding
image
latitude
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110694372.9A
Other languages
Chinese (zh)
Other versions
CN113411615B (en)
Inventor
Li Mu (李穆)
Li Jinxing (李锦兴)
Zhang Dapeng (张大鹏)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Institute of Big Data SRIBD
Original Assignee
Shenzhen Research Institute of Big Data SRIBD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Research Institute of Big Data SRIBD filed Critical Shenzhen Research Institute of Big Data SRIBD
Priority to CN202110694372.9A priority Critical patent/CN113411615B/en
Publication of CN113411615A publication Critical patent/CN113411615A/en
Application granted granted Critical
Publication of CN113411615B publication Critical patent/CN113411615B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/21805Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234327Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by decomposing into layers, e.g. base layer and one or more enhancement layers

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a virtual reality-oriented latitude-adaptive panoramic image coding method. When a panoramic image is stored, the latitude of each region in the panoramic image is consulted, and the height of the coding column corresponding to each region is determined from that latitude, so that shorter coding columns are allocated to images in high-latitude regions and taller coding columns to images in low-latitude regions. This solves the problem in the prior art that compressing and storing panoramic images with equidistant rectangular projection causes severe oversampling in high-latitude regions such as the two poles. At the same time, by introducing an image distortion measure defined in the observation domain of the panoramic image, the method overcomes the stretching and deformation of the panoramic image in the projection domain.

Description

Virtual reality-oriented latitude self-adaptive panoramic image coding method
Technical Field
The invention relates to the technical field of image coding, in particular to a virtual reality-oriented latitude self-adaptive panoramic image coding method.
Background
With the rapid development of multimedia information technology, panoramic video has become one of the hot spots in the information technology field and is applied in more and more scenarios. Compared with conventional flat video, panoramic video carries more information, but correspondingly requires a larger amount of data. In panoramic video applications, compression is therefore an extremely important technology. A panoramic video is essentially a spherical video containing a full panorama, and existing video coding standards cannot process spherical video directly. Since spherical video is difficult to store and display directly, it must be sampled into a flat rectangular video by a specific method to facilitate subsequent compression coding. At present, panoramic images are mainly compressed by converting the spherical image into a planar image with equirectangular projection (ERP, also referred to herein as equidistant rectangular projection) and then encoding the planar image with a conventional image compression method. However, equirectangular projection suffers from severe oversampling in high-latitude regions such as the two poles and stretches and deforms the panoramic image there.
Thus, there is still a need for improvement and development of the prior art.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a virtual reality-oriented latitude-adaptive panoramic image coding method, aiming at the problems in the prior art that compressing and storing panoramic images with equidistant rectangular projection causes severe oversampling in high-latitude regions such as the two poles, and that the resulting stretching and deformation of the panoramic image degrades image quality evaluation.
The technical scheme adopted by the invention for solving the problems is as follows:
in a first aspect, an embodiment of the present invention provides a virtual reality-oriented latitude adaptive panoramic image encoding method, where the method is applied to a storage process of an equidistant rectangular projection image, and the method includes:
acquiring latitude information of image data, generating target coding structure prediction graph data according to the latitude information, and generating structure coding stream data corresponding to the target coding structure prediction graph data according to the target coding structure prediction graph data; the latitude information is used for reflecting the position of each region in the image data on a projection spherical surface; the target coding structure prediction image data is used for reflecting height information of coding columns corresponding to all areas in the image data;
acquiring image characteristic information corresponding to each area on the image data, and generating image coding stream data corresponding to the image characteristic information according to the image characteristic information;
and generating target coded stream data according to the structural coded stream data and the image coded stream data, wherein the target coded stream data is used for storing the image data.
In a second aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform a virtual reality oriented latitude adaptive panoramic image encoding method as described in any one of the above.
The invention has the beneficial effects that: according to the embodiment of the invention, when the panoramic image is stored, the latitude of each region in the panoramic image is consulted, and the height of the coding column corresponding to each region is determined from that latitude, so that shorter coding columns are allocated to images in high-latitude regions and taller coding columns to images in low-latitude regions. This solves the problem in the prior art that compressing and storing panoramic images with equidistant rectangular projection causes severe oversampling in high-latitude regions such as the two poles and hence low coding efficiency. In addition, the embodiment of the invention guides the learning of the coding scheme by defining the image quality evaluation function in the observation domain of the panoramic image, which overcomes the stretching and deformation of the panoramic image in the equidistant rectangular projection domain.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a virtual reality-oriented latitude adaptive panoramic image encoding method according to an embodiment of the present invention.
Fig. 2 is a specific flowchart for encoding image data according to an embodiment of the present invention.
Fig. 3 is a specific flowchart for decoding target encoded stream data according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an encoder according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a latitude adaptive scaler according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a coding structure predictor provided in an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of a decoder according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of a second entropy prediction network provided by an embodiment of the present invention.
Fig. 9 is a schematic structural diagram of a down-sampling module according to an embodiment of the present invention.
Fig. 10 is a schematic structural diagram of an up-sampling module according to an embodiment of the present invention.
Fig. 11 is a schematic structural diagram of a residual error module according to an embodiment of the present invention.
Fig. 12 is a schematic structural diagram of an attention module according to an embodiment of the present invention.
Fig. 13 is a functional block diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and back) are involved in the embodiments of the present invention, the directional indications are only used to explain the relative positional relationship, movement, and the like between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
With the rapid development of multimedia information technology, panoramic video has become one of the hot spots in the information technology field and is applied in more and more scenarios. Compared with conventional flat video, panoramic video carries more information, but correspondingly requires a larger amount of data. In panoramic video applications, compression is therefore an extremely important technology. A panoramic video is essentially a spherical video containing a full panorama, and existing video coding standards cannot process spherical video directly. Since spherical video is difficult to store and display directly, it must be sampled into a flat rectangular video by a specific method to facilitate subsequent compression coding.
At present, panoramic images are mainly compressed by converting the spherical image into a planar image with equirectangular projection (ERP, also referred to herein as equidistant rectangular projection) and then encoding the planar image with a conventional image compression method. However, equirectangular projection oversamples severely in high-latitude regions such as the two poles, which stretches and deforms the panoramic image.
Aiming at the above defects in the prior art, the invention provides a virtual reality-oriented latitude self-adaptive panoramic image coding method, which is applied to the storage process of equidistant rectangular projection image data. The method acquires image characteristic information corresponding to each region of the image data and generates image coding stream data from that information; it acquires latitude information of the image data, generates target coding structure prediction map data from the latitude information, and generates the corresponding structure coding stream data from the target coding structure prediction map data; finally it generates target coded stream data from the structure coding stream data and the image coding stream data. When the panoramic image is stored, the latitude of each region in the panoramic image is consulted and the height of the coding column corresponding to each region is determined from that latitude, so that shorter coding columns are allocated to images in high-latitude regions and taller coding columns to images in low-latitude regions. This solves the problem in the prior art that compressing and storing panoramic images with equidistant rectangular projection causes severe oversampling in high-latitude regions such as the two poles.
As shown in fig. 1, the method is applied to a storage process of image data of an equidistant rectangular projection drawing, and comprises the following steps:
step S100, acquiring latitude information of image data, generating target coding structure prediction graph data according to the latitude information, and generating structure coding stream data corresponding to the target coding structure prediction graph data according to the target coding structure prediction graph data; the latitude information is used for reflecting the position of each region in the image data on a projection spherical surface; the target coding structure prediction image data is used for reflecting height information of coding columns corresponding to all the areas in the image data.
In order to implement latitude-adaptive encoding of the regions of the equidistant rectangular projection image, this embodiment first needs to acquire the image data of the equidistant-rectangular-projection type to be encoded and to acquire the latitude information of that image data. The latitude information reflects the position of each region of the image data on the projection sphere, for example whether a given region lies near the two poles of the projection sphere or near its equator. Once the positions of the regions on the projection sphere are determined, different coding column heights can be allocated to regions at high latitude (such as regions near the poles) and regions at low latitude (such as regions near the equator); that is, target coding structure prediction map data is generated, and finally structure coding stream data corresponding to it is generated from the target coding structure prediction map data. For example, in a deep image coding framework, the code is a 3D cuboid of size n × h × w. This coding block can be decomposed into h × w coding columns of height n, each coding column corresponding to one region of the image. In this embodiment, the height of the coding column for each region is predicted from the latitude information unique to the equidistant rectangular projection image: shorter coding columns are allocated to high-latitude regions and taller coding columns to low-latitude regions, realizing a latitude-adaptive image coding framework.
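As a rough illustration of this coding-column structure, the following minimal NumPy sketch decomposes an n × h × w code into columns and clips each column to a latitude-dependent height (the tensor sizes and the cosine-based height rule here are illustrative assumptions, not the trained predictor described below):

    import numpy as np

    n, h, w = 192, 32, 64          # code depth and spatial size (illustrative values)
    code = np.random.randn(n, h, w)

    # Latitude of each row of the ERP code map, from -pi/2 to pi/2.
    lat = (np.arange(h) + 0.5) / h * np.pi - np.pi / 2

    # Assumed latitude-based rule: taller columns near the equator (cos ~ 1),
    # shorter columns near the poles (cos ~ 0).
    height = np.round(n * np.cos(lat)).astype(int)               # per-row column height
    mask = np.arange(n)[:, None, None] < height[None, :, None]   # (n, h, 1), broadcast over w

    clipped = code * mask           # entries above a column's height are discarded
    print(mask.sum(axis=0)[:, 0])   # kept column height per latitude row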
In one implementation, the step S100 specifically includes the following steps:
step S101, latitude information of the image data is obtained and input into a latitude self-adaptive scaler;
step S102, latitude scaling weight map data generated by the latitude adaptive scaler based on the latitude information is obtained, and first coding structure prediction map data is generated from the latitude scaling weight map data; the latitude scaling weight map data reflects the weight value of each region of the image data as determined by the latitude information; the first coding structure prediction map data reflects the coded data of each region of the image data as predicted from the latitude information of that region;
step S103, generating the target coding structure prediction graph data according to the first coding structure prediction graph data;
and step S104, generating the structure coding stream data according to the target coding structure prediction graph data.
Specifically, after the latitude information of the image data is acquired, it is input into a preset latitude adaptive scaler. The latitude adaptive scaler predicts the weight value of each region of the image data from the latitude information, i.e. the weight value of each region of the image is determined by its latitude, yielding the latitude scaling weight map data.
In one implementation, as shown in fig. 5, the latitude adaptive scaler is composed of a latitude vector generation module, an input module, three residual modules, an output module and an augmentation module. The latitude vector generation module generates a latitude vector θ of h elements, determined by the circumference of the projection sphere at each latitude, using the following formula:

θ(i) = (i + 0.5) · π / h − π / 2,  i = 0, …, h − 1,

where i is the image row index of the element and h is the height of the code. The input module is a 16-channel 3 × 1 convolution, the residual modules are built from 1 × 1 convolutions, and the output module is a 1-channel 1 × 1 convolution using the Sigmoid activation function.
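A minimal PyTorch sketch of such a scaler is given below; the kernel sizes and channel counts follow the description above, while the exact latitude-vector formula, the residual wiring and the final broadcast across columns (the augmentation module) are assumptions:

    import math
    import torch
    import torch.nn as nn

    class LatitudeAdaptiveScaler(nn.Module):
        """Predicts a per-row scaling weight map from latitude alone."""
        def __init__(self, channels=16):
            super().__init__()
            self.inp = nn.Conv2d(1, channels, kernel_size=(3, 1), padding=(1, 0))
            self.res = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in range(3)])
            self.out = nn.Conv2d(channels, 1, 1)

        def forward(self, h, w):
            i = torch.arange(h, dtype=torch.float32)
            theta = (i + 0.5) / h * math.pi - math.pi / 2   # assumed latitude vector
            x = theta.view(1, 1, h, 1)                      # one element per image row
            x = self.inp(x)
            for conv in self.res:
                x = x + torch.relu(conv(x))                 # 1x1 residual blocks (activation assumed)
            x = torch.sigmoid(self.out(x))                  # weights in (0, 1)
            return x.expand(1, 1, h, w)                     # broadcast across columns

    weights = LatitudeAdaptiveScaler()(32, 64)              # 1 x 1 x 32 x 64 weight map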
After the latitude scaling weight map data is obtained, coding structure prediction map data of the same size as the latitude scaling weight map data can be generated by a pre-trained prediction model. In one implementation, this embodiment may generate the structure coding stream data directly from the first coding structure prediction map data; that is, it may choose to generate the structure coding stream data from the latitude information of the image data alone, without considering the image content, in which case the coding column height of each region is related only to its own latitude, whatever the image content.
In another implementation, this embodiment may further combine the latitude information with the content of the image data to obtain a coding allocation method adaptive to both content and latitude. Specifically, this embodiment determines the input data of a preset coding structure predictor from the image data; for example, the input data of the coding structure predictor may be the feature data generated by the fourth down-sampling convolution in the encoder. As shown in fig. 6, the coding structure predictor may be composed of two residual modules and one output module, where the output module is a 1-channel 1 × 1 convolution using the Sigmoid activation function.
Second coding structure prediction map data generated by the coding structure predictor based on the input data is then obtained; the second coding structure prediction map data reflects the content information of each region of the image data, predicting the coded data of each region according to the importance of its content. Since the first coding structure prediction map data is generated from the latitude information of the image and the second from its content, combining the two yields coding structure prediction map data adaptive to both image content and latitude. The combined coding structure prediction map data is then quantized, and the quantized result is used as the target coding structure prediction map data. In short, this embodiment quantizes the combined coding structure prediction map data into target coding structure prediction map data containing a number of discrete coding heights.
For example, assume that a combined coding structure prediction map ρ̃ has been obtained. ρ̃ is quantized into target coding structure prediction map data ρ taking L different heights, for instance with the uniform quantizer

ρ(i, j) = min(⌊L · ρ̃(i, j)⌋, L − 1),

where i is the image row index and j the image column index of the element.
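A small sketch of combining and quantizing the two prediction maps; the uniform L-level quantizer matches the formula above, and the combination of the two maps by element-wise multiplication is an assumption:

    import torch

    L = 8                                  # number of coding heights
    rho1 = torch.rand(1, 1, 32, 64)        # first map (latitude-based), in (0, 1)
    rho2 = torch.rand(1, 1, 32, 64)        # second map (content-based), in (0, 1)

    rho_tilde = rho1 * rho2                # assumed combination of the two maps
    rho = torch.clamp((L * rho_tilde).floor(), max=L - 1).long()  # L discrete heights
    print(rho.unique())                    # values in {0, ..., L-1}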
In one implementation, as shown in fig. 2, structure coding stream data is generated from the resulting target coding structure prediction map data. In this embodiment, the target coding structure prediction map data is input into a first entropy prediction network, and a first probability distribution table generated by the first entropy prediction network based on the target coding structure prediction map data is obtained, where the first probability distribution table reflects the probabilities of the different coding symbols of each region of the quantized coding structure prediction map data. Then, the first probability distribution table and the target coding structure prediction map data are input into a first entropy encoder, and the structure coding stream data generated by the first entropy encoder encoding the target coding structure prediction map data based on the first probability distribution table is acquired.
For example, the first entropy prediction network takes the target coding structure prediction map data ρ as input and directly generates the corresponding discrete probability table, i.e. the first probability distribution table, P(ρ(i, j) = k) for k = 0, …, L − 1. Specifically, the first entropy prediction network contains only one branch, consisting of two 5 × 5 mask convolution layers, three residual modules and one output module, where the output layer uses the Softmax activation function. The output of the first entropy prediction network is denoted u, where u_{k,i,j} represents the entry P(ρ(i, j) = k) of the first probability distribution table.
After the structure coded stream data is acquired, in order to accurately reconstruct an equidistant rectangular projection graph in the subsequent decoding, as shown in fig. 1, the method further includes the following steps:
step S200, acquiring image characteristic information corresponding to each area on the image data, and generating image coding stream data corresponding to the image characteristic information according to the image characteristic information.
In brief, when encoding image data to be encoded, it is not necessary to encode global information of the image data, but only image feature information of each region in the image data needs to be acquired, and the local feature information is encoded. And then global information of the image data can be obtained after the local characteristic information is integrated. Therefore, in this embodiment, image feature information corresponding to each region on the image data needs to be obtained first, and image encoding stream data corresponding to the image data is generated according to the image feature information.
In one implementation, the step S200 specifically includes the following steps:
step S201, inputting the image data into an encoder, and acquiring feature encoding block data generated by the encoder based on the image data; the feature coding block data is used for reflecting image feature information of each area on the image data;
step S202, generating the image coding stream data according to the characteristic coding block data.
Specifically, in order to acquire the features of each region of the image data, this embodiment provides in advance an encoder for extracting the image features of each region. In one implementation, the encoder may include down-sampling modules, residual modules, attention modules, and an output module: after the image data to be encoded is input into the encoder, the down-sampling modules reduce the size of the image data, the residual modules maintain the gradient flow of the network, the attention modules focus on the most informative portions of the image data, and the output module produces the feature coding block data of the encoder.
In one implementation, as shown in fig. 4, the encoder may consist of four down-sampling modules, each of which halves the size of the input, and one output module. As shown in fig. 9, the down-sampling module consists of two branches: the first branch contains two 3 × 3 convolutions, the first with stride 2 and the second with stride 1; the second branch contains a 1 × 1 convolution with stride 2. The results of the two branches are combined by element-wise addition to produce the output of the down-sampling module.
In one implementation, there is one residual module after each down-sampling module. As shown in fig. 11, the residual module also consists of two branches: the first branch contains two 3 × 3 convolutions; the second branch directly copies the input of the residual module; the results of the two branches are then combined by element-wise addition to produce the output of the residual module. After the second and fourth down-sampling modules there is additionally one attention module each.
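A PyTorch sketch of the two-branch down-sampling module and the residual module as described above; kernel sizes and strides follow the text, while channel counts and the placement of activations are assumptions:

    import torch
    import torch.nn as nn

    class DownSample(nn.Module):
        """Two branches combined by element-wise addition; halves spatial size."""
        def __init__(self, cin, cout):
            super().__init__()
            self.branch1 = nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=2, padding=1),   # 3x3, stride 2
                nn.PReLU(),
                nn.Conv2d(cout, cout, 3, stride=1, padding=1),  # 3x3, stride 1
            )
            self.branch2 = nn.Conv2d(cin, cout, 1, stride=2)    # 1x1, stride 2

        def forward(self, x):
            return self.branch1(x) + self.branch2(x)

    class Residual(nn.Module):
        """Two 3x3 convolutions plus an identity shortcut."""
        def __init__(self, c):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1), nn.PReLU(),
                nn.Conv2d(c, c, 3, padding=1),
            )

        def forward(self, x):
            return self.body(x) + x

    y = DownSample(3, 192)(torch.randn(1, 3, 64, 128))          # -> 1 x 192 x 32 x 64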
In one implementation, as shown in fig. 12, the attention module consists of three branches: the first branch consists of three attention module substructures and a 3 × 3 convolution using the Sigmoid activation function; the second branch consists of three attention module substructures; the third branch directly copies the input of the attention module. The first and second branches are combined by element-wise multiplication, and the result is combined with the output of the third branch by element-wise addition to obtain the output of the attention module. In one implementation, the attention module substructure includes one 96-channel 1 × 1 convolution, one 96-channel 3 × 3 convolution, and one 192-channel 1 × 1 convolution.
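A sketch of the three-branch attention module; the substructure channel counts follow the text, while the placement of activations inside the substructure is an assumption:

    import torch
    import torch.nn as nn

    def substructure():
        # 1x1 (96 ch) -> 3x3 (96 ch) -> 1x1 (192 ch), as described above
        return nn.Sequential(
            nn.Conv2d(192, 96, 1), nn.PReLU(),
            nn.Conv2d(96, 96, 3, padding=1), nn.PReLU(),
            nn.Conv2d(96, 192, 1),
        )

    class Attention(nn.Module):
        def __init__(self):
            super().__init__()
            self.branch1 = nn.Sequential(*(substructure() for _ in range(3)),
                                         nn.Conv2d(192, 192, 3, padding=1),
                                         nn.Sigmoid())          # gating weights
            self.branch2 = nn.Sequential(*(substructure() for _ in range(3)))

        def forward(self, x):
            # element-wise product of branches 1 and 2, plus the identity branch 3
            return self.branch1(x) * self.branch2(x) + x

    y = Attention()(torch.randn(1, 192, 32, 64))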
In one implementation, the output module consists of a 192-channel 1 × 1 convolutional layer using the Sigmoid activation function, which produces the feature coding block data.
To summarize, as shown in fig. 4, after the image data to be encoded is input into the encoder, it passes in sequence through down-sampling module 1, residual module 1, down-sampling module 2, residual module 2, attention module 2, down-sampling module 3, residual module 3, down-sampling module 4, residual module 4, attention module 4, and the output module, which finally outputs the feature coding block data corresponding to the image data. It should be noted that all convolutions in the encoder that are not explicitly stated to use the Sigmoid activation function use the PReLU activation function.
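Putting the pieces together, the encoder sequence described above can be sketched as follows, reusing the DownSample, Residual and Attention classes from the previous sketches (per-stage channel widths are assumptions):

    import torch
    import torch.nn as nn

    # DownSample, Residual and Attention are the classes sketched above
    encoder = nn.Sequential(
        DownSample(3, 192),   Residual(192),                 # stage 1
        DownSample(192, 192), Residual(192), Attention(),    # stage 2 (with attention)
        DownSample(192, 192), Residual(192),                 # stage 3
        DownSample(192, 192), Residual(192), Attention(),    # stage 4 (with attention)
        nn.Conv2d(192, 192, 1), nn.Sigmoid(),                # output module
    )
    e = encoder(torch.randn(1, 3, 512, 1024))                # feature coding block, 1/16 size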
After the feature coding block data output by the encoder is acquired, in order to perform compression coding on the equidistant rectangular projection drawing, image coding stream data also needs to be generated according to the feature coding block data. In one implementation, the present embodiment may input the feature coded block data into a quantizer, and obtain quantized coded block data generated by the quantizer based on the feature coded block data. Then, mask block data for identifying a masked area and a non-masked area on the target coding structure prediction map data is generated from the target coding structure prediction map data. Then, the image coded stream data is generated based on the mask block data. In short, after the feature encoded block data is obtained, the present embodiment needs to quantize the feature encoded block data to obtain discrete encoded block data, that is, quantized encoded block data.
For example, assume the output of the encoder is a feature coding block e. The feature coding block e is input into a quantizer, which learns an adaptive quantization function for each channel through optimization. Specifically, the quantization parameter of the k-th channel may be defined as

ω_k = (ω_{k,0}, …, ω_{k,L_q−1}),

where ω is the learnable parameter of the quantization function, k is the index of the coding plane, and L_q is the number of quantization centers. The quantization centers of the k-th channel, derived from ω_k, can be written as

Ω_k = (Ω_{k,0}, …, Ω_{k,L_q−1}).

The quantizer performs the quantization operation

y_{k,i,j} = g_q(e_{k,i,j}) = Ω_{k,l},  l = argmin_{l'} |e_{k,i,j} − Ω_{k,l'}|,

where y is the quantized code, g_q is the quantization function, e is the code before quantization, i and j are the image row and column indices, and l is the index of the quantization center, i.e. the l-th quantization center of coding plane k. The parameters of each channel of the quantizer are then learned by minimizing the quantization error of the quantization function:

min_ω Σ_{k,i,j} (e_{k,i,j} − g_q(e_{k,i,j}))².
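A sketch of the per-channel nearest-center quantization; the mapping from the learnable parameter ω to monotone centers Ω via cumulative softplus increments is an assumption:

    import torch

    def quantize(e, centers):
        """e: (n, h, w) feature code; centers: (n, Lq) per-channel centers."""
        d = (e.unsqueeze(-1) - centers[:, None, None, :]).abs()  # distance to centers
        idx = d.argmin(dim=-1)                                   # nearest center index
        y = torch.gather(centers[:, None, None, :].expand(*e.shape, -1),
                         -1, idx.unsqueeze(-1)).squeeze(-1)
        return y, idx

    n, Lq = 192, 48
    omega = torch.randn(n, Lq, requires_grad=True)
    centers = torch.cumsum(torch.nn.functional.softplus(omega), dim=1)  # assumed form
    e = torch.rand(n, 32, 64)
    y, idx = quantize(e, centers)
    loss = ((e - y) ** 2).sum()       # quantization error used to learn omega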
After the discrete quantized coding block data is obtained by the quantizer, in order to adjust the sampling amount of each region according to its latitude and thereby avoid oversampling high-latitude regions, this embodiment further generates mask block data from the previously generated target coding structure prediction map data, where the mask block data identifies the shielded and non-shielded regions of the target coding structure prediction map data. After the mask block data is obtained, in one implementation, as shown in fig. 2, clipped quantized coding block data is determined from the mask block data and the quantized coding block data, where the clipped quantized coding block data is the quantized coding block data corresponding to the non-masked regions. In brief, the quantized coding block and the mask block have the same size; after the mask block is obtained, it is applied to the quantized coding block, and the entries with mask value 0 are discarded. This precisely controls the coding column height of each region, yields the clipped quantized coding block, and thereby achieves adaptive rate control based on image content and latitude information.
For example, the previously obtained target coding structure prediction map data ρ is mapped into a 0–1 mask block m of the same size as the quantized coding block y by the following formula:

m(k, i, j) = 1 if k < (n/L) · ρ(i, j), and m(k, i, j) = 0 otherwise,

where k is the height index of the mask, i is the image row index, and j is the image column index of the element. Then the mask block m is applied to the quantized coding block y, giving the clipped quantized coding block z = y · m.
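A sketch of the mask construction and clipping; the threshold rule k < (n/L)·ρ follows the formula above, which is itself a reconstruction:

    import torch

    n, L = 192, 8
    rho = torch.randint(0, L, (32, 64))                  # target coding structure map
    k = torch.arange(n).view(n, 1, 1)                    # height index of the mask
    m = (k < (n // L) * rho.view(1, 32, 64)).float()     # n x 32 x 64 binary mask

    y = torch.rand(n, 32, 64)                            # quantized coding block
    z = y * m                                            # clipped quantized coding block
    print(m.sum(dim=0)[0, 0], rho[0, 0] * n // L)        # column height matches rho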
In one implementation, the equidistant rectangular projection image has size 3 × H × W, the quantized coding block has size n × h × w, the coding structure prediction map has size 1 × h × w, and the mask block has size n × h × w, where

h = H / 16,  w = W / 16,

and n is 192. In one implementation, the number of quantization centers in the quantizer is L_q = 48, and the coding structure prediction map contains 8 different coding heights, i.e. L = 8.
Then, the clipped quantized coding block data and the mask block data are input into a second entropy prediction network, a second probability distribution table generated by the second entropy prediction network based on them is acquired, and the image coding stream data is generated from the clipped quantized coding block data and the second probability distribution table. In one implementation, the second entropy prediction network is mainly composed of mask convolution layers, residual modules, and output modules. For example, as shown in fig. 8, the second entropy prediction network may include three branches, each consisting of two 5 × 5 mask convolution layers, three residual modules, and one output module. The operation of the mask convolution layer follows the prior art. The residual modules of the second entropy prediction network are obtained by converting the convolutions of the encoder's residual module into 5 × 5 mask convolutions. The output module of the first branch is a 5 × 5 mask convolution layer with the Softmax activation function; the output module of the second branch is a 5 × 5 mask convolution layer with the ReLU activation function; the output module of the third branch is a 5 × 5 mask convolution layer without activation function. The first branch outputs the weights of a Gaussian mixture distribution, denoted π; the second branch outputs the standard deviations of the mixture, denoted δ; the third branch outputs the means of the mixture, denoted μ. When calculating the probability of a coding block, the second entropy prediction network evaluates the Gaussian mixture distribution:

P(z_{k,i,j}) = Σ_t π_t · (Φ_{μ_t, δ_t}(b⁺_{k,l}) − Φ_{μ_t, δ_t}(b⁻_{k,l})),

where P(z_{k,i,j}) is the probability of the code z_{k,i,j}, Φ_{μ,δ} is the Gaussian cumulative distribution function, and b⁻_{k,l}, b⁺_{k,l} are the boundaries of the quantization bin of the center Ω_{k,l} = z_{k,i,j} used by the code z_{k,i,j}. The probability model corresponding to the second entropy prediction network is continuously differentiable and can be used to define the compression rate in the subsequent model training.
In order to generate image coded stream data, the present embodiment further needs to determine target clipped quantized coded block data among the clipped quantized coded block data based on the mask block data, where the target clipped quantized coded block data is clipped quantized coded block data corresponding to mask block data having a mask value of 1. Then, the target clipping quantized coded block data and the second probability distribution table are input into a second entropy coder, and the image coded stream data generated by the second entropy coder after coding the target clipping quantized coded block data based on the second probability distribution table is obtained.
In one implementation, the second entropy encoder may output the image coding stream data by arithmetic coding. In particular, to encode z_{k,i,j}, a discrete probability table p(z_{k,i,j} = Ω_{k,l}), l = 0, …, L_q − 1, i.e. the second probability distribution table, needs to be calculated. The second probability distribution table may be calculated by the following formula:

p(z_{k,i,j} = Ω_{k,l}) = Σ_t π_t · (Φ_{μ_t, δ_t}((Ω_{k,l} + Ω_{k,l+1}) / 2) − Φ_{μ_t, δ_t}((Ω_{k,l−1} + Ω_{k,l}) / 2)),

where Φ_{μ,δ} is the Gaussian cumulative distribution function; for the case l = 0, Ω_{k,l−1} = −∞, and for the case l = L_q − 1, Ω_{k,l+1} = +∞.
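A sketch of evaluating this discrete probability table from the three branch outputs π, μ and δ, assuming the midpoint bin boundaries of the formula above:

    import torch

    def discrete_gmm_table(pi, mu, delta, centers):
        """pi, mu, delta: (T,) mixture params for one code; centers: (Lq,)."""
        normal = torch.distributions.Normal(mu, delta)           # one Gaussian per component
        mid = (centers[1:] + centers[:-1]) / 2                   # midpoint bin boundaries
        upper = torch.cat([mid, centers.new_tensor([float('inf')])])
        lower = torch.cat([centers.new_tensor([float('-inf')]), mid])
        # P(z = centers[l]) = sum_t pi_t * (CDF_t(upper_l) - CDF_t(lower_l))
        cdf_up = normal.cdf(upper[:, None])                      # (Lq, T)
        cdf_lo = normal.cdf(lower[:, None])
        return ((cdf_up - cdf_lo) * pi).sum(dim=1)               # (Lq,)

    centers = torch.linspace(-2.0, 2.0, 48)
    p = discrete_gmm_table(torch.tensor([0.5, 0.5]),
                           torch.tensor([-1.0, 1.0]),
                           torch.tensor([0.5, 0.5]), centers)
    print(p.sum())   # ~1.0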
In one implementation, since multiple prediction networks are involved in this embodiment, these networks naturally need to be trained before use. Specifically, the first entropy prediction network is independent of the other prediction networks; once the target coding structure prediction map data ρ is fixed after the other networks have been trained, it is obtained by minimizing the cross-entropy

min Σ_{i,j} Σ_{k=0}^{L−1} −s(ρ(i, j) = k) · log u_{k,i,j},

where s(condition) = 1 if the condition is true and s(condition) = 0 if the condition is false.
In one implementation, because the step quantization function of the quantizer is non-differentiable, the network preceding the quantizer cannot be optimized by gradient descent directly. Therefore, for the quantizer, this embodiment uses in forward propagation the quantization function

g_q(e_{k,i,j}) = Ω_{k,l},  l = argmin_{l'} |e_{k,i,j} − Ω_{k,l'}|,

and in back propagation replaces it with a smooth substitute, e.g. the soft quantization

g̃_q(e_{k,i,j}) = Σ_l Ω_{k,l} · exp(−|e_{k,i,j} − Ω_{k,l}|) / Σ_m exp(−|e_{k,i,j} − Ω_{k,m}|),

through which the derivative of the quantizer is computed.
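One common way to realize such a forward/backward split is a straight-through-style function whose value is the hard quantization but whose gradient is that of the soft substitute; a sketch (a single shared set of centers is used here for brevity, and the soft substitute is the assumption noted above):

    import torch

    def soft_quantize(e, centers, sigma=1.0):
        # differentiable substitute: softmax-weighted average of the centers
        w = torch.softmax(-sigma * (e.unsqueeze(-1) - centers).abs(), dim=-1)
        return (w * centers).sum(dim=-1)

    def quantize_ste(e, centers):
        d = (e.unsqueeze(-1) - centers).abs()
        hard = centers[d.argmin(dim=-1)]          # forward: nearest center
        soft = soft_quantize(e, centers)          # backward: soft substitute
        return soft + (hard - soft).detach()      # value = hard, gradient = d soft / d e

    centers = torch.linspace(-2.0, 2.0, 48)
    e = torch.randn(192, 32, 64, requires_grad=True)
    y = quantize_ste(e, centers)
    y.sum().backward()                            # gradients flow through the substitute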
In one implementation, the operations of generating the target coding structure prediction map and generating the mask block data are likewise non-differentiable in this embodiment. Therefore, in a first stage, this embodiment constructs a substitute function for the objective together with its trust region and solves for the optimal target coding structure prediction map ρ* within the trust region under the current situation; in a second stage, the current target coding structure prediction map ρ̃ is updated iteratively to approach ρ*.
Specifically, in the first stage, this embodiment decomposes the objective function L into two parts, a part L_z containing the clipped coding block z and a part L_m containing only the mask block m:

L = L_m + L_z.

The substitute function of the objective adopted in this embodiment is a first-order Taylor expansion of L_z around the current feature code ê, valid within a trust region of radius ξ, where ξ is a small positive integer. The optimal coding structure prediction map ρ* under the current condition can then be solved position by position as

ρ*(i, j) = argmin L̃(ρ(i, j); ê, m̂)  s.t.  ρ(i, j) ∈ {0, …, L − 1},

where m̂ denotes the current mask and L̃ the substitute objective; in one implementation, the per-position subproblems L_0(i, j)* and L_1(i, j)* are likewise minimized subject to ρ(i, j) ∈ {0, …, L − 1}.
in the second stage, the present embodiment enables the target coding structure prediction graph by minimizing the following formula
Figure BDA0003127471790000158
Approaching to the optimal target coding structure prediction graph rho under the current condition*
Figure BDA0003127471790000161
Wherein, beta is an updated structure code pattern
Figure BDA0003127471790000162
Step size of (2).
Figure BDA0003127471790000163
The gradient of (d) is shown by the following equation:
Figure BDA0003127471790000164
in addition, since image compression is a comprehensive task, two indexes are mainly considered in the task, one is compression rate, and the other is distortion of a decoded image, so that the learned image compression model mainly sets a joint optimization objective function according to the two indexes, and optimizes the whole model by using a gradient descent mode. In one implementation, the objective function L corresponding to the compression ratio in this embodimentRThe entropy value of the quantized coding block and the structure of the quantized coding block are comprehensively set, and the entropy value is shown as the following formula:
Figure BDA0003127471790000165
wherein the first item is the total number of used codes and is used for restricting the structure of the coding block; the second term is the entropy value of the encoding used. Gamma adjusts the proportion of the coding block structure. Aiming at the problems that sampling at different latitudes in an equidistant rectangular projection graph is unbalanced and oversampling exists in a high-latitude area, the embodiment introduces a latitude self-adaptive constraint term which acts on a coding structure part and requires that the coding sum used at the same latitude should be adapted to the latitude of the same latitude. The latitude self-adaptive constraint term is shown in the following formula:
Figure BDA0003127471790000166
wherein r represents the ratio of the code actually used by the equatorial part to the total available codes, and can be flexibly adjusted. f. ofs(i) Is a latitude adaptive function, as shown in the following formula:
fs(i)=ηcos(θi)+(1-η)
Here 1 − η represents the minimum available coding ratio. In short, during the optimization of the model, according to the sampling rates of the equidistant rectangular projection image at different latitudes, the total height of all coding columns at each latitude is constrained not to exceed a given value, so that fewer codes are allocated to higher-latitude regions and more codes to lower-latitude regions; this resolves the information redundancy and high code rate caused by oversampling in high-latitude regions. Within the same latitude, this embodiment can also allocate more codes to regions with a large amount of information according to the image content, achieving content adaptivity and further improving the coding performance for the equidistant rectangular projection image.
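A sketch of this latitude-adaptive constraint; the hinge form max(0, ·) and the per-row budget follow the reconstructed formula above and are therefore assumptions:

    import math
    import torch

    def latitude_constraint(rho, L, w, r=0.8, eta=0.9):
        """rho: (h, w) coding-height map with values in {0, ..., L-1}."""
        h = rho.shape[0]
        i = torch.arange(h, dtype=torch.float32)
        theta = (i + 0.5) / h * math.pi - math.pi / 2      # latitude of each row
        fs = eta * torch.cos(theta) + (1 - eta)            # latitude-adaptive function f_s
        used = rho.float().sum(dim=1)                      # coding heights used per row
        budget = r * fs * (L - 1) * w                      # allowed total per row
        return torch.clamp(used - budget, min=0).sum()     # penalize rows over budget

    rho = torch.randint(0, 8, (32, 64))
    print(latitude_constraint(rho, L=8, w=64))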
As shown in fig. 1, the method further comprises the steps of:
and step S300, generating target coded stream data according to the structure coded stream data and the image coded stream data, wherein the target coded stream data is used for storing the image data.
After the structure coding stream data and the image coding stream data are obtained, the structure coding stream data stores the structure information of the equidistant rectangular projection image while the image coding stream data stores its image information; an equidistant rectangular projection image stored jointly by these two coded streams is therefore less prone to deformation, and the important content of the image is well preserved.
In one implementation, as shown in fig. 3, after the coded data corresponding to the equidistant rectangular projection image is obtained, the method further includes: inputting the target coded stream data into a decoder, and acquiring the image data reconstructed by the decoder based on the target coded stream data.
In general terms, the decoder mirrors the operation of the encoder. First, the structure coding stream data is input into a first entropy decoder, which uses the first entropy prediction network to estimate the probability distribution function of each code currently being decoded and restores the structure coding stream data into the target coding structure prediction map. The image coding stream data is then input into a second entropy decoder, which uses the second entropy prediction network to estimate the probability distribution function of each code currently being decoded and, combined with the decoded target coding structure prediction map, restores the image coding stream data into the clipped coding block data. The decoder then takes the recovered clipped coding block data as input and produces the reconstructed equidistant rectangular projection image.
In one implementation, as shown in fig. 7, the decoder consists essentially of an input module, up-sampling modules, residual modules and attention modules. In particular, the decoder may comprise one input module, four up-sampling modules, four residual modules and two attention modules, where the input module consists of a 192-channel convolution. In one implementation, as shown in fig. 10, the up-sampling module includes two branches: the first branch is two 3 × 3 convolutions, with the feature map up-sampled by a factor of 2 by depth-to-width warping after the first convolution; the second branch is a 1 × 1 convolution, after which the input feature map is likewise up-sampled by a factor of 2 by depth-to-width warping. In one implementation, the depth-to-width warping merges four feature maps a1, a2, a3, a4 of size h × w into one feature map b of size 2h × 2w. The merging rule is:

b(2i, 2j) = a1(i, j);
b(2i, 2j + 1) = a2(i, j);
b(2i + 1, 2j) = a3(i, j);
b(2i + 1, 2j + 1) = a4(i, j);

where i is the image row index and j the image column index of the element, with 0 ≤ i < h and 0 ≤ j < w.
The output results of the two branches are then combined by element-wise addition to obtain the output of the up-sampling module. In addition, the output module is a 3-channel 1 × 1 convolution without activation function. As shown in fig. 7, after the clipped coding block is input into the decoder, it passes in sequence through the input module, attention module 1, residual module 1, up-sampling module 1, residual module 2, up-sampling module 2, attention module 3, residual module 3, up-sampling module 3, residual module 4, up-sampling module 4 and the output module, which finally outputs the reconstructed equidistant rectangular projection image.
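A sketch of the depth-to-width warping used in the up-sampling branches. This interleaving is exactly PyTorch's pixel shuffle with an upscale factor of 2, provided the four maps a1…a4 are stacked in that channel order (the only assumption here):

    import torch

    a = torch.randn(1, 4, 16, 32)                 # four h x w maps a1..a4 stacked in depth
    b = torch.nn.functional.pixel_shuffle(a, 2)   # one 2h x 2w map, interleaved as above

    # manual check of one position against the formulas in the text
    i, j = 3, 5
    assert b[0, 0, 2*i, 2*j]     == a[0, 0, i, j]   # b(2i, 2j)     = a1(i, j)
    assert b[0, 0, 2*i, 2*j+1]   == a[0, 1, i, j]   # b(2i, 2j+1)   = a2(i, j)
    assert b[0, 0, 2*i+1, 2*j]   == a[0, 2, i, j]   # b(2i+1, 2j)   = a3(i, j)
    assert b[0, 0, 2*i+1, 2*j+1] == a[0, 3, i, j]   # b(2i+1, 2j+1) = a4(i, j)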
In one implementation, the distortion of the decoded image in this embodiment is defined in the viewing domain of the panoramic image player. For example, this embodiment selects 14 viewing angles, each with a fixed horizontal field of view, vertical field of view, viewport width and viewport height, and with a fixed latitude and longitude on the sphere.
the transformation function for these 14 views is denoted fv0...,fv13. In one implementation, the present embodiment employs two loss functions, one is mean square error loss
Figure BDA0003127471790000198
As shown in the following equation:
Figure BDA0003127471790000199
the other is structural similarity error loss
Figure BDA00031274717900001910
As shown in the following equation:
Figure BDA00031274717900001911
wherein the content of the first and second substances,
Figure BDA00031274717900001912
representing the decoded image.
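A sketch of the two view-domain losses, treating the viewport rendering functions f_v as given; a real implementation would rasterize each viewport from the ERP image, so the crop-based stand-in views and the simplified global SSIM below are assumptions for illustration:

    import torch

    def ssim_global(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
        # simplified global-statistics SSIM, for illustration only
        mu_a, mu_b = a.mean(), b.mean()
        var_a, var_b = a.var(), b.var()
        cov = ((a - mu_a) * (b - mu_b)).mean()
        num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
        den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
        return num / den

    def view_losses(x, x_hat, views):
        """x, x_hat: ERP images (H, W); views: list of callables f_v (ERP -> viewport)."""
        mse = sum(((fv(x) - fv(x_hat)) ** 2).mean() for fv in views) / len(views)
        ssim_err = sum(1 - ssim_global(fv(x), fv(x_hat)) for fv in views) / len(views)
        return mse, ssim_err

    # stand-in viewports: 14 horizontal crops of the ERP image
    views = [lambda t, k=k: t[:, k * 16:(k + 1) * 16] for k in range(14)]
    x, x_hat = torch.rand(128, 256), torch.rand(128, 256)
    print(view_losses(x, x_hat, views))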
In one implementation, the final objective function adopted in this embodiment contains three constraint terms (the compression rate loss, the distortion loss of the decoded image, and the latitude-adaptive term) and is given by

L = λ · L_R + L_D + α · L_S,

where λ and α control the contributions of the compression rate loss and the latitude-adaptive constraint term in the overall objective. L_D may be either the mean square error loss L_MSE or the structural similarity error loss L_SSIM.
In one implementation, in order to keep the sizes of the output and input feature maps consistent in the convolutional neural network, this embodiment applies spherical padding, which follows the spherical continuity of the panoramic image rows. Specifically, for an h × w image x, padding a pixels on each side by spherical padding yields an (h + 2a) × (w + 2a) image y. The formula for spherical padding is:

y(i, j) = x(i − a, (j − a) mod w),  for a ≤ i < h + a;
y(i, j) = x(a − 1 − i, (j − a + w/2) mod w),  for 0 ≤ i < a;
y(i, j) = x(2h + a − 1 − i, (j − a + w/2) mod w),  for h + a ≤ i < h + 2a.
in one implementation, the present embodiment performs spherical filling only in the encoder, the decoder, and the entropy prediction network of the first encoding structure diagram.
Based on the above embodiments, the present invention further provides a terminal, and a schematic block diagram thereof may be as shown in fig. 13. The terminal comprises a processor, a memory, a network interface and a display screen which are connected through a system bus. Wherein the processor of the terminal is configured to provide computing and control capabilities. The memory of the terminal comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the terminal is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a virtual reality-oriented latitude-adaptive panorama image encoding method. The display screen of the terminal can be a liquid crystal display screen or an electronic ink display screen.
It will be understood by those skilled in the art that the block diagram of fig. 13 is only a block diagram of a portion of the structure associated with the solution of the present invention, and does not constitute a limitation of the terminal to which the solution of the present invention is applied, and a specific terminal may include more or less components than those shown in the drawings, or may combine some components, or have a different arrangement of components.
In one implementation, one or more programs are stored in a memory of the terminal and configured to be executed by one or more processors, and the one or more programs include instructions for performing the virtual reality-oriented latitude-adaptive panoramic image encoding method.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
In summary, the present invention provides a virtual reality-oriented latitude adaptive panoramic image coding method applied to equidistant rectangular projection image data. The method acquires the image characteristic information corresponding to each region of the image data and generates the corresponding image coding stream data from that information; it acquires the latitude information of the image data, generates target coding structure prediction map data from the latitude information, and generates the corresponding structure coding stream data from the target coding structure prediction map data; and it generates target coded stream data from the structure coding stream data and the image coding stream data. When the panoramic image is stored, the latitude of each region in the panoramic image is consulted and the height of the coding column corresponding to each region is determined from that latitude, so that shorter coding columns are allocated to images in high-latitude regions and taller coding columns to images in low-latitude regions. This solves the problems in the prior art that compressing and storing panoramic images with equidistant rectangular projection causes severe oversampling in high-latitude regions such as the two poles and stretches and deforms the panoramic image.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (10)

1. A virtual reality-oriented latitude-adaptive panoramic image coding method is applied to a storage process of image data of an equidistant rectangular projection drawing, and comprises the following steps:
acquiring latitude information of image data, generating target coding structure prediction graph data according to the latitude information, and generating structure coding stream data corresponding to the target coding structure prediction graph data according to the target coding structure prediction graph data; the latitude information is used for reflecting the position of each region in the image data on a projection spherical surface; the target coding structure prediction image data is used for reflecting height information of coding columns corresponding to all areas in the image data;
acquiring image characteristic information corresponding to each area on the image data, and generating image coding stream data corresponding to the image characteristic information according to the image characteristic information;
and generating target coded stream data according to the structural coded stream data and the image coded stream data, wherein the target coded stream data is used for storing the image data.
2. The virtual reality-oriented latitude-adaptive panoramic image coding method according to claim 1, wherein acquiring the latitude information of the image data, generating the target coding structure prediction map data according to the latitude information, and generating the structure coding stream data according to the target coding structure prediction map data comprises:
acquiring the latitude information of the image data and inputting the latitude information into a latitude-adaptive scaler;
acquiring latitude scaling weight map data generated by the latitude-adaptive scaler based on the latitude information, and generating first coding structure prediction map data according to the latitude scaling weight map data; wherein the latitude scaling weight map data reflects the weight value assigned to each region of the image data by the latitude information, and the first coding structure prediction map data reflects the coded data predicted for each region of the image data from the latitude information of that region;
generating the target coding structure prediction map data according to the first coding structure prediction map data; and
generating the structure coding stream data according to the target coding structure prediction map data.
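Editor's illustration, not part of the claims: a minimal numpy sketch of the data flow in claim 2, assuming the latitude-adaptive scaler reduces to a cosine weight map that is scaled to real-valued column heights; the linear scaling to `max_height` is an assumption, and all names are hypothetical.

```python
import numpy as np

def latitude_weight_map(h: int, w: int) -> np.ndarray:
    lat = (0.5 - (np.arange(h) + 0.5) / h) * np.pi      # per-row latitude
    return np.repeat(np.cos(lat)[:, None], w, axis=1)   # (h, w) weights in (0, 1]

def first_prediction_map(h: int, w: int, max_height: int) -> np.ndarray:
    # Scale weights to real-valued column heights; quantization happens later.
    return latitude_weight_map(h, w) * max_height
```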
3. The virtual reality-oriented latitude-adaptive panoramic image coding method according to claim 2, wherein generating the target coding structure prediction map data according to the first coding structure prediction map data comprises:
determining input data of a coding structure predictor according to the image data, and acquiring second coding structure prediction map data generated by the coding structure predictor based on the input data; wherein the second coding structure prediction map data reflects the coded data predicted for each region of the image data according to the content of that region and the importance of that content;
combining the first coding structure prediction map data with the second coding structure prediction map data to obtain combined coding structure prediction map data; and
quantizing the combined coding structure prediction map data, and taking the quantized result as the target coding structure prediction map data.
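Editor's illustration, not part of the claims: a sketch of claim 3's fusion and quantization, assuming elementwise averaging as the combination rule (the claim only says the two maps are "combined") and rounding with clipping as the quantizer.

```python
import numpy as np

def combine_and_quantize(first_map: np.ndarray,
                         second_map: np.ndarray,
                         max_height: int) -> np.ndarray:
    # Assumed fusion rule: average the latitude-driven and content-driven maps.
    combined = 0.5 * (first_map + second_map)
    # Quantize to integer column heights in [0, max_height].
    return np.clip(np.round(combined), 0, max_height).astype(np.int32)
```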
4. The virtual reality-oriented latitude-adaptive panoramic image coding method according to claim 2, wherein generating the structure coding stream data according to the target coding structure prediction map data comprises:
inputting the target coding structure prediction map data into a first entropy prediction network, and acquiring a first probability distribution table generated by the first entropy prediction network based on the target coding structure prediction map data; wherein the first probability distribution table reflects the probability of each possible coding symbol at each region of the target coding structure prediction map data; and
inputting the first probability distribution table and the target coding structure prediction map data into a first entropy coder, and acquiring the structure coding stream data generated by the first entropy coder by coding the target coding structure prediction map data based on the first probability distribution table.
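Editor's illustration, not part of the claims: claim 4's first entropy prediction network is not disclosed, so the sketch below shows only the shape of its output, a per-position probability table, and the ideal code length that table implies; a real implementation would drive an arithmetic coder with it. PyTorch is used here purely as an assumed framework.

```python
import torch
import torch.nn.functional as F

def probability_table(logits: torch.Tensor) -> torch.Tensor:
    # logits: (H, W, S) scores over the S possible symbols at each position.
    return F.softmax(logits, dim=-1)

def ideal_code_length_bits(probs: torch.Tensor, symbols: torch.Tensor) -> torch.Tensor:
    # Total bits an ideal entropy coder would spend on `symbols`
    # (shape (H, W), integer dtype) under the predicted table.
    p = probs.gather(-1, symbols.long().unsqueeze(-1)).squeeze(-1)
    return (-torch.log2(p.clamp_min(1e-9))).sum()
```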
5. The virtual reality-oriented latitude-adaptive panoramic image coding method according to claim 1, wherein acquiring the image characteristic information corresponding to each region of the image data and generating the image coding stream data according to the image characteristic information comprises:
inputting the image data into an encoder, and acquiring feature coding block data generated by the encoder based on the image data; wherein the feature coding block data reflects the image characteristic information of each region of the image data; and
generating the image coding stream data according to the feature coding block data.
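Editor's illustration, not part of the claims: the encoder of claim 5 is not specified, so this three-layer convolutional downsampling stack is purely an assumed stand-in that turns an image into a grid of feature coding blocks.

```python
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Hypothetical encoder: maps (N, 3, H, W) to (N, C, H/8, W/8) features."""
    def __init__(self, channels: int = 192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 5, stride=2, padding=2),
        )

    def forward(self, x):
        return self.net(x)  # each spatial position is one feature coding block
```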
6. The virtual reality-oriented latitude-adaptive panoramic image coding method according to claim 5, wherein generating the image coding stream data according to the feature coding block data comprises:
inputting the feature coding block data into a quantizer, and acquiring quantized coding block data generated by the quantizer based on the feature coding block data;
generating mask block data according to the target coding structure prediction map data; wherein the mask block data identifies the shielded regions and the unshielded regions of the target coding structure prediction map data; and
generating the image coding stream data according to the mask block data.
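Editor's illustration, not part of the claims: one reading of claim 6's mask block data, assuming each spatial location keeps the first `height` channels of its feature column (its "coding column") and shields the rest.

```python
import numpy as np

def mask_blocks(pred_map: np.ndarray, num_channels: int) -> np.ndarray:
    # pred_map: (H, W) integer column heights in [0, num_channels].
    channel_idx = np.arange(num_channels)[:, None, None]            # (C, 1, 1)
    return (channel_idx < pred_map[None, :, :]).astype(np.uint8)    # (C, H, W) 0/1 mask
```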
7. The virtual reality-oriented latitude-adaptive panoramic image coding method according to claim 6, wherein generating the image coding stream data according to the mask block data comprises:
determining cropped quantized coding block data according to the mask block data and the quantized coding block data; wherein the cropped quantized coding block data is the quantized coding block data corresponding to the unshielded regions;
inputting the cropped quantized coding block data and the mask block data into a second entropy prediction network, and acquiring a second probability distribution table generated by the second entropy prediction network based on the cropped quantized coding block data and the mask block data; and
generating the image coding stream data according to the cropped quantized coding block data and the second probability distribution table.
8. The virtual reality-oriented latitude-adaptive panoramic image coding method according to claim 7, wherein generating the image coding stream data according to the cropped quantized coding block data and the second probability distribution table comprises:
determining target cropped quantized coding block data within the cropped quantized coding block data according to the mask block data; wherein the target cropped quantized coding block data is the cropped quantized coding block data corresponding to mask block data with a mask value of 1; and
inputting the target cropped quantized coding block data and the second probability distribution table into a second entropy coder, and acquiring the image coding stream data generated by the second entropy coder by coding the target cropped quantized coding block data based on the second probability distribution table.
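Editor's illustration, not part of the claims: a sketch covering the cropping of claims 7 and 8 together, assuming the channel mask from the previous sketch; entries with mask value 1 are kept in a fixed scan order so the decoder can scatter them back.

```python
import numpy as np

def crop_by_mask(quantized: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # quantized, mask: (C, H, W); keeps only entries whose mask value is 1,
    # flattened in C-order, which both sides agree on.
    return quantized[mask.astype(bool)]

def restore_from_mask(symbols: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Decoder-side inverse: scatter the coded symbols back, zeros elsewhere.
    out = np.zeros(mask.shape, dtype=symbols.dtype)
    out[mask.astype(bool)] = symbols
    return out
```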
9. The virtual reality-oriented latitude-adaptive panoramic image coding method according to claim 1, further comprising:
inputting the target coded stream data into a decoder, and acquiring the image data reconstructed by the decoder based on the target coded stream data.
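Editor's illustration, not part of the claims: claim 9's decoder is not disclosed; this transposed-convolution stack merely mirrors the assumed encoder sketched after claim 5 and is not the patent's architecture.

```python
import torch.nn as nn

class FeatureDecoder(nn.Module):
    """Hypothetical decoder: maps (N, C, h, w) features to (N, 3, 8h, 8w) pixels."""
    def __init__(self, channels: int = 192):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 5, stride=2,
                               padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, channels, 5, stride=2,
                               padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, 3, 5, stride=2,
                               padding=2, output_padding=1),
        )

    def forward(self, y):
        return self.net(y)  # each stage exactly doubles the spatial size
```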
10. A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the virtual reality-oriented latitude-adaptive panoramic image coding method according to any one of claims 1-9.
CN202110694372.9A 2021-06-22 2021-06-22 Virtual reality-oriented latitude self-adaptive panoramic image coding method Active CN113411615B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110694372.9A CN113411615B (en) 2021-06-22 2021-06-22 Virtual reality-oriented latitude self-adaptive panoramic image coding method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110694372.9A CN113411615B (en) 2021-06-22 2021-06-22 Virtual reality-oriented latitude self-adaptive panoramic image coding method

Publications (2)

Publication Number Publication Date
CN113411615A true CN113411615A (en) 2021-09-17
CN113411615B CN113411615B (en) 2023-01-10

Family

ID=77682450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110694372.9A Active CN113411615B (en) 2021-06-22 2021-06-22 Virtual reality-oriented latitude self-adaptive panoramic image coding method

Country Status (1)

Country Link
CN (1) CN113411615B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140211842A1 (en) * 2013-01-28 2014-07-31 Microsoft Corporation Adapting Robustness in Video Coding
US20150043637A1 (en) * 2012-04-13 2015-02-12 Sony Corporation Image processing device and method
US20150264259A1 (en) * 2014-03-17 2015-09-17 Sony Computer Entertainment Europe Limited Image processing
US20200175648A1 (en) * 2016-07-12 2020-06-04 Peking University Shenzhen Graduate School Panoramic image mapping method
AU2020101803A4 (en) * 2018-03-02 2020-09-17 University Of Electronic Science And Technology Of China Method for Optimizing the Coding Block-Level Lagrange Multiplier of Longitude-Latitude Image
US10965948B1 (en) * 2019-12-13 2021-03-30 Amazon Technologies, Inc. Hierarchical auto-regressive image compression system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118042133A (en) * 2024-04-12 2024-05-14 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Panoramic image coding method, decoding method and related device based on slice expression
CN118042133B (en) * 2024-04-12 2024-06-28 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Panoramic image coding method, decoding method and related device based on slice expression

Also Published As

Publication number Publication date
CN113411615B (en) 2023-01-10

Similar Documents

Publication Publication Date Title
Cheng et al. Energy compaction-based image compression using convolutional autoencoder
US10623775B1 (en) End-to-end video and image compression
US10599935B2 (en) Processing artificial neural network weights
CN111263161B (en) Video compression processing method and device, storage medium and electronic equipment
Koyuncu et al. Contextformer: A transformer with spatio-channel attention for context modeling in learned image compression
CN111986278B (en) Image encoding device, probability model generating device, and image compression system
CN113574888A (en) Predictive coding using neural networks
CN113934890B (en) Method and system for automatically generating scene video by characters
CN111641832A (en) Encoding method, decoding method, device, electronic device and storage medium
CN111641826B (en) Method, device and system for encoding and decoding data
El-Nouby et al. Image compression with product quantized masked image modeling
CN115082358A (en) Image enhancement method and device, computer equipment and storage medium
CN113411615B (en) Virtual reality-oriented latitude self-adaptive panoramic image coding method
Han et al. Toward variable-rate generative compression by reducing the channel redundancy
CN115115512A (en) Training method and device for image hyper-resolution network
Shin et al. Expanded adaptive scaling normalization for end to end image compression
Di et al. Learned compression framework with pyramidal features and quality enhancement for SAR images
He et al. RECOMBINER: Robust and enhanced compression with Bayesian implicit neural representations
Xiang et al. Task-oriented compression framework for remote sensing satellite data transmission
EP4387233A1 (en) Video encoding and decoding method, encoder, decoder and storage medium
CN116939218A (en) Coding and decoding method and device of regional enhancement layer
CN114189692B (en) Point cloud attribute coding method and decoding method based on continuous subspace diagram transformation
CN117425013B (en) Video transmission method and system based on reversible architecture
Zafari et al. Neural-based Compression Scheme for Solar Image Data
CN114283214A (en) Light field image coding and decoding method, system, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant