EP4268464A1

EP4268464A1 - Method for encoding images of a video sequence to be encoded, decoding method, corresponding devices and system with spatial image sub-sampling

Info

Publication number: EP4268464A1
Application number: EP21840565.2A
Authority: EP
Inventors: Benoît LE LUDEC; Patrick DUMÉNIL; Franck CHI
Original assignee: Fondation B Com
Current assignee: Fondation B Com
Priority date: 2020-12-22
Filing date: 2021-12-17
Publication date: 2023-11-01
Also published as: CN116746158A; FR3118380A1; KR20230124619A; US20240056603A1; WO2022136143A1; JP2024500168A

Abstract

The invention relates to a method for encoding images of a video sequence that comprises implementing the following steps: c) obtaining an initial configuration (E100) representative of structural characteristics of a display device and, for an image of the video sequence referred to as the input sub-sequence, d1) carrying out a first spatial sub-sampling operation (E102) on the elements of the image by using a filter oriented in a first direction and comprising a first set (ENS1) of values of sub-sampling factors, the latter being determined as a function of the initial configuration, then inserting the resulting image into an output sub-sequence, and e) encoding (E2) the images of the output sub-sequence.

Description

METHOD FOR CODING IMAGES OF A VIDEO SEQUENCE TO BE CODED, METHOD FOR DECODING, CORRESPONDING DEVICES AND SYSTEM WITH SPATIAL SUB-SAMPLING

Technical field of the invention

The present invention relates to the technical field of the coding and decoding of video sequences. More particularly, it relates to a coding method, a decoding method and the corresponding devices.

State of the art

To be able to transmit high spatial and temporal resolution video content, it is known to implement a so-called scalable method, as described in patent application WO2020/025510. By processing a video sequence using this process, the amount of video data to be transmitted is reduced, without visible deterioration of the quality of the video sequences in the spatial and temporal domains, when displaying the images.

Presentation of the invention

In this context, according to a first aspect of the invention, there is proposed a method for coding images of a video sequence, each image being formed of elements organized in rows and columns. The method comprises the implementation of the following steps: c) obtaining an initial configuration representative of structural characteristics of a display device, and, for at least one image of a sub-sequence of one or more images of the video sequence called the input sub-sequence, d1) carrying out a first spatial sub-sampling of at least part of the elements of the image by using a filter oriented in a first direction and comprising a first set of at least two different values of sub-sampling factors, the first set of values being determined as a function of said initial configuration, then inserting the resulting image into an output sub-sequence, and e) coding (E2) the images of the output sub-sequence. Advantageously, the method according to the first aspect of the invention makes it possible to reduce the quantity of coded data to be transmitted, without loss of visual quality for the user viewing the video sequence. Indeed, the sub-sampling is a function of structural characteristics (for example optical or relating to the shape of the display device) of the display device via which the user views the video sequence. It is therefore possible to adjust the sub-sampling to the areas of interest of the images for the user, with regard to the display device used.

Preferably, step d1) further comprises a second spatial sub-sampling of at least some of the elements of the sub-sampled image using a filter oriented in a second direction and comprising a second set of at least two different values of downsampling factors, the second set of values being determined according to said initial configuration.

The implementation of two successive sub-samplings using two filters oriented in two different directions makes it possible to produce relatively complex patterns, the pattern defining zones having different spatial resolution values. It is then possible to finely adjust the resolution of each zone of the processed images according to the structural characteristics of the display device.

Preferably, spatial sub-sampling is carried out using filters oriented in one of the following directions:

- horizontal,

- vertical.

The choice of these directions makes it possible to implement the solution in a particularly simple way within a coding device, while allowing the realization of patterns of complex spatial resolutions.

Each subsampling factor value of a set of values is preferably respectively applied to at least one group of p successive elements according to the direction in which the filter is oriented, p being a positive integer.

According to one mode of implementation, the method may further comprise an analysis step comprising an analysis of the content of at least one image of said input sub-sequence, and a step of updating the values of the sub-sampling factors prior to the implementation of step d1), depending on the result of the analysis of the content.

According to one embodiment, the method may further comprise an analysis step comprising an analysis of measurements representative of movements performed by a user (for example by his head and/or his eyes), the display device being a head-mounted display worn by said user, and a step of updating the values of the sub-sampling factors prior to the implementation of step d1), depending on the result of the analysis of the measurements.

According to another mode of implementation, the method may further comprise another analysis step comprising an analysis of the visual quality of the images of said output sub-sequence, and a step of updating the values of the sub-sampling factors prior to the implementation of step d1), if the visual quality is lower than a predetermined threshold.

According to another mode of implementation, the method may further comprise a preliminary step comprising the following sub-steps: a) obtaining, from said video sequence, so-called initial sub-sequences, and, for at least one initial sub-sequence: b1) determining information representative of the content of at least one image of the initial sub-sequence, b2) determining, for the initial sub-sequence, a processing frequency lower than or equal to the initial image display frequency, as a function of the determined information, and b3) inserting, as a function of the determined processing frequency, all or part of the images of the group of images into a sub-sequence of images forming an input sub-sequence.

Preferably, the method can then further comprise an a posteriori step for the M images of an output sub-sequence, M being an integer, said a posteriori step comprising the following sub-steps: d2) comparison between the processing frequency associated with the images of the output sub-sequence and the initial image display frequency, and, if the processing frequency is lower than the initial frequency, spatial division of each of the M images of the output sub-sequence into N sub-images, N being an integer whose value depends on the ratio between the processing frequency and the initial frequency, the coding step e) then corresponding to the coding of the M*N sub-images of the output sub-sequence; otherwise, the coding step e) corresponds to the coding of said M images of the output sub-sequence.

According to another mode of implementation, the method can also comprise the following steps for each output sub-sequence

- obtaining information representative of at least one of the elements from the list below:

  o values of sub-sampling factors,

  o measurements representative of movements performed by a user, the display device being a head-mounted display worn by said user,

  o structural characteristics of the display device,

  o processing frequency, and

- coding of said information.

According to a second aspect of the invention, there is proposed a method for decoding data corresponding to images of a video sequence, each image being formed of elements organized in rows and columns, the images of the video sequence being grouped in sub-sequences of one or more images called output sub-sequences. The method comprises the implementation of the following steps: c1) obtaining an initial configuration representative of structural characteristics of a display device, and, for at least one image of an output sub-sequence, d11) performing a first spatial oversampling of at least part of the elements of the image using a filter oriented along a first direction and comprising a first set of at least two different values of oversampling factors, the first set of values being determined as a function of said initial configuration, then insertion of the resulting image into a sub-sequence to be decoded, and e1) decoding of the images of the sub-sequence to be decoded.

Preferably, step d11) can further comprise a second spatial oversampling of at least some of the elements of the oversampled image using a filter oriented in a second direction and comprising a second set of at least two different values of oversampling factors, the second set of values being determined as a function of said initial configuration.

According to a third aspect of the invention, there is proposed a device for coding images of a video sequence, each image being formed of elements organized in rows and columns. The device is configured to implement the following steps: c) obtaining an initial configuration representative of structural characteristics of a display device, and, for at least one image of a sub-sequence of one or more images of the video sequence called the input sub-sequence, d1) carrying out a first spatial sub-sampling of at least part of the elements of the image by using a filter oriented in a first direction and comprising a first set of at least two different values of sub-sampling factors, the first set of values being determined as a function of said initial configuration, then inserting the resulting image into an output sub-sequence, and e) coding the images of the output sub-sequence.

According to a fourth aspect of the invention, there is proposed a device for decoding data corresponding to images of a video sequence, each image being formed of elements organized in rows and columns, the images of the video sequence being grouped in sub-sequences of one or more images called output sub-sequences. The device is configured to implement the following steps: c1) obtaining an initial configuration representative of structural characteristics of a display device, and, for at least one image of an output sub-sequence, d11) performing a first spatial oversampling of at least part of the elements of the image using a filter oriented along a first direction and comprising a first set of at least two different values of oversampling factors, the first set of values being determined as a function of said initial configuration, then insertion of the resulting image into a sub-sequence to be decoded, and e1) decoding of the images of the sub-sequence to be decoded.

Detailed description of the invention

In addition, various other characteristics of the invention emerge from the appended description made with reference to the drawings which illustrate non-limiting forms of embodiment of the invention and where:

- Figure 1 shows a mode of implementation of a method according to the first aspect of the invention;

- Figure 2 illustrates the optical characteristics of a head-mounted display unit used to display a video sequence;

- Figure 3 illustrates more precisely certain steps of the mode of implementation shown in Figure 1;

- Figure 4 schematically represents a pattern obtained according to an embodiment of the method according to the invention;

- Figure 5 schematically represents another pattern obtained according to another mode of implementation of the method according to the invention;

- Figure 6 details a mode of implementation of a spatial sub-sampling step according to the invention;

- Figure 7 details another mode of implementation of a spatial sub-sampling step according to the invention;

- Figure 8 schematically shows the patterns obtained after successive applications of the embodiments illustrated in Figures 6 and 7;

- Figure 9 illustrates more precisely certain steps of the mode of implementation shown in Figure 1;

- Figure 10 shows a mode of implementation of a method according to the second aspect of the invention;

- Figure 11 shows an embodiment of a device according to the third aspect of the invention;

- Figure 12 shows an embodiment of a device according to the fourth aspect of the invention; and

- Figure 13 shows a possible implementation of the devices according to the third or fourth aspect of the invention.

FIG. 1 represents a mode of implementation of a method according to the invention. SVD source video data are supplied as input to a preprocessing step E1, for example in a UHD (“Ultra High Definition”) video format. Each source video is processed group of images by group of images (GOP, for “Group Of Pictures”). A group of images forms an input sub-sequence. This step E1, described in more detail below, makes it possible to apply spatial processing and optionally temporal processing to the video data. The spatial and possibly temporal frequency of the pixels constituting the images of the SVD video sequence is reduced at the end of the preprocessing. The format of the video is then modified. Optionally, metadata relating to the preprocessing are generated as explained in the following description.

The processed video data is coded during a step E2, then transmitted (step E3) to display means. Prior to display, the transmitted coded video data is decoded (step E4). It is then subjected to a post-processing step E5, which is a function of the pre-processing E1 carried out before coding. Optionally, the post-processing E5 depends on the metadata generated during the pre-processing step E1. The post-processed video data is finally displayed using the display means in step E6.

Steps E1, E2 and E3 are implemented by a transmitter device while steps E4, E5 and E6 are implemented by a receiver device comprising the display means. The display means can comprise a screen with variable dimensions, a head-mounted display or even a simple display surface, this list not being exhaustive. These display means have display configurations that are specific to them. A display configuration can be defined by the structural characteristics of the display device such as its dimensions or else the parameters of its optical components. By way of example, FIG. 2 shows schematically the optical characteristics of the left lens LG and the right lens LD of a head-mounted display. Conventionally, each lens LG, LD allowing binocular vision has an optimal spatial resolution inside a central circle CCG, CCD. When the head-mounted display is worn by a user, this central circle faces the fovea of the left or right eye respectively. The fovea is the area of the retina where the vision of details is most precise. Then, as one moves away from the central circle, the spatial resolution of the displayed image decreases progressively, in stages, according to concentric circles of increasing radius. Finally, a black zone ZN surrounds the circular display zones of the video.

FIG. 3 illustrates steps E1 and E5 of FIG. 1 in more detail. The preprocessing step E1 comprises a first sub-step E100 of initial configuration. This sub-step makes it possible to obtain a set of spatial and optionally temporal filters, to be applied to a group of images of the source video according to a filtering pattern resulting from the configuration. The initial configuration can be defined by default (for example stored in memory) or updated for each input sub-sequence in order to take into account context variations during use. An initial configuration is a function of one or more criteria combined together, these criteria including the display configuration defined above. It can also be a function of additional criteria, for example relating to the video transmission chain between the transmitter and the receiver or even to instructions given by a user, such as instructions relating to a quantity of data that it is possible to transmit, an authorized latency threshold or a tolerable level of complexity.

The sub-step E100 makes it possible to deliver a value of a spatial and possibly temporal frequency, acceptable a priori so that the content is rendered on the display device without notable loss of quality. A spatial frequency value is associated with a partition of an image (the image comprising at least two partitions). A temporal frequency value (corresponding to an image transmission frequency or “transmitted image frequency”) corresponds to the frequency of images transmitted within a group of images of the source video.

Depending on the initial configuration obtained, the following two sub-steps E101 and E102 are implemented, step E101 being optional. If the initial configuration involves temporal filtering, step E101 is implemented. It includes, for a group of images, a modification of the sub-sequence of input images, keeping only part of the images. For the purposes of simplification, unless otherwise indicated, it is considered in the remainder of the description that both spatial filtering (E102) and temporal filtering (E101) are implemented. The input sub-sequences of the source video are therefore subjected to a combination of processing including spatial and temporal sub-sampling for at least part of the images.

For each group of images processed, sub-step E101 delivers a sub-sequence of images whose temporal frequency depends on the initial configuration. The temporal frequency can be the same as the original temporal frequency of the input sub-sequence GOP. The sub-sequence of images delivered at the output of sub-step E101 is then identical to the sub-sequence of input images. Conversely, the temporal frequency resulting from the initial configuration can correspond to said original frequency divided by N (N being an integer greater than or equal to 2). One image out of N from the input stream is then deleted. The sub-sequence of images delivered at the output of sub-step E101 therefore then has a temporal frequency divided by N.

In one mode of implementation, the sub-step E101 can receive information resulting from an analysis (E105) of the measurements of movements performed by the display device and/or by a user (or his eyes) in the case where the display device is a head-mounted display worn by that user. This information, representative of the measurements making it possible to estimate the movement, is then used to adapt the temporal frequency in order to prevent the symptoms of kinetosis (“motion sickness” in English) felt by the wearer of the head-mounted display, which could be generated by state-of-the-art approaches, that is to say approaches that are non-dynamic with regard to the temporal frequency. Preferably, if the input sub-sequence presents significant movements, the temporal frequency will be kept at its maximum, and the reduction of the spatial resolution, implemented in the sub-step E102, will be preferred. Conversely, if the input sub-sequence has few movements, the reduction of the temporal frequency will be preferred, and the spatial resolution will be reduced little or not at all in the sub-step E102.

A spatial filtering (E102) is then applied to the images of at least one group of images of the input sub-sequence, according to the initial configuration. Spatial filtering is performed using at least one spatial subsampling of elements of at least one row or at least one column of the image. This spatial sub-sampling is a function of a set of factors also called sub-sampling step defined by the initial configuration. An element represents a pixel of the image or the component of this pixel for one of the color components of the image.

As a variant and as considered in the following description, the spatial filtering is carried out according to two successive sub-samplings, using filters respectively oriented in two different directions, horizontal (horizontal filters) and vertical (vertical filters), independently of the order. Thus, the columns and then the rows of the image are successively sub-sampled. As a variant, it is possible to alternate the sub-sampling of a row then the sub-sampling of a column or vice versa.

Breaking down the spatial filtering into two sub-samplings, each using filters oriented in a different direction, makes it possible to obtain, within an image, zones or partitions having different resolutions, according to the sub-sampling factors implemented by the filters. Implementing, in a programmable circuit, an electronic processing capable of carrying out sub-samplings using vertical or horizontal filters is simple, requires little memory and limits processing latency. By finely adapting the values taken by the sub-sampling factors, it is possible to obtain very precise patterns, each zone having its own spatial resolution, depending on the areas of interest of the image. For example, the closer the area of the image is displayed to the fovea of the eye, the greater the spatial resolution. In other words, a pattern makes it possible to apply different sub-sampling factors depending on the different areas of the image, these areas being able to be defined in the initial configuration using their spatial coordinates.

FIGS. 4 and 5 respectively present two images sub-sampled according to two different configurations of the sub-sampling steps or factors and of the subsets of pixels concerned by each sub-sampling step value.

Each square corresponds to a group of elements of an image. The pattern (horizontal bands) in Figure 4 results from a single sub-sampling using a set of different sampling step values applied using vertical filters. The pattern in Figure 5 results from applying a first sub-sampling using a first set of different sampling step values applied using vertical filters, followed by a second sub-sampling using a second set of different sampling step values applied using horizontal filters. The order of application of the first and second sub-samplings can be reversed. Rectangular patterns are obtained according to the values of the sampling steps applied, and the number of pixels affected by each sampling step. The lighter the shade of a rectangle in the pattern, the higher the spatial resolution of the corresponding area of the image. Conversely, the darker the shade, the more the spatial resolution of the corresponding area of the image has been reduced.

FIGS. 6 and 7 respectively explain the first and second spatial sub-samplings.

Figure 6 shows schematically an image, or part of an image, IMA1 to be processed. The lines of the image are organized in L successive horizontal bands, here BD1, BD2, BD3, BD4 and BD5. More generally, L is a positive integer. For example, each horizontal band comprises a number of lines depending on the configuration of the filter(s) used to perform the spatial sub-sampling (for example 8 lines).

A first set of sub-sampling steps ENS1 is then applied to the image IMA1 using a vertical filter FLV. This first set ENS1 comprises in this example the following sub-sampling factor values: {1/3, 1/2, 1, 1/2, 1/3}. Thus, for the lines belonging to the first (BD1) and fifth (BD5) horizontal bands, only one pixel out of three successive pixels in the vertical direction is retained. For the lines belonging to the second (BD2) and fourth (BD4) horizontal bands, only one pixel out of two successive pixels is retained in the vertical direction. Finally, for the third horizontal band BD3, all the pixels are kept.
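The banded vertical sub-sampling described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `subsample_rows` and the parameter `band_height` are hypothetical, the image is a plain list of rows, retained pixels keep their original value (no interpolation), and the band height is set to 6 lines rather than 8 so that every band divides evenly by the steps 2 and 3.

```python
from fractions import Fraction

# ENS1 from the example: one sub-sampling factor per horizontal band BD1..BD5.
ENS1 = [Fraction(1, 3), Fraction(1, 2), Fraction(1), Fraction(1, 2), Fraction(1, 3)]


def subsample_rows(image, factors, band_height):
    """Keep one row out of (1/factor) rows inside each horizontal band.

    image: list of rows, each row being a list of pixel values.
    band_height: rows per band (hypothetical parameter; assumed to be a
    multiple of every step 1/factor).
    """
    assert len(image) == band_height * len(factors)
    out = []
    for band_index, factor in enumerate(factors):
        step = int(1 / factor)  # factor 1/3 -> keep one row out of 3
        start = band_index * band_height
        band = image[start:start + band_height]
        out.extend(band[::step])  # plain decimation, original values kept
    return out


# 5 bands of 6 rows each; every pixel of a row stores its row index.
image = [[r] * 4 for r in range(30)]
result = subsample_rows(image, ENS1, band_height=6)
kept_rows = [row[0] for row in result]
print(kept_rows)
```

In this sketch the central band keeps all 6 of its rows while the outer bands keep only 2 each, so 16 of the 30 input rows remain.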

The value of each pixel retained at the end of the sub-sampling can be interpolated using known methods such as bilinear or bi-cubic algorithms or even using the Lanczos method well known to those skilled in the art. As a variant, the value of the retained pixel can be equal to its original value.

Once all the horizontal bands have been sub-sampled, the resulting sub-sampled image IMAF1 is obtained, such that the darker the band represented (that is, the denser the hatching), the greater the number of remaining pixels.

Figure 7 shows schematically an image, or part of an image, IMA2 to be processed. The columns of the image are organized in M successive vertical bands, here BD6, BD7, BD8, BD9, BD10, BD11, BD12, BD13 and BD14. More generally, M is a positive integer. For example, each vertical band comprises a number of columns depending on the configuration of the filter used to perform the spatial sub-sampling (for example 8 columns).

A second set of sub-sampling steps ENS2 is then applied to the image IMA2 using a horizontal filter FLH. This second set ENS2 includes in this example the following sub-sampling factor values: {1/8, 1/2, 1, 1/2, 1/8, 1/2, 1, 1/2, 1/8}. Thus, for the columns belonging to the first (BD6), fifth (BD10) and last (BD14) vertical bands, only one pixel out of eight successive pixels is retained in the horizontal direction. For the columns belonging to the second (BD7), fourth (BD9), sixth (BD11) and eighth (BD13) vertical bands, only one pixel out of two successive pixels is retained in the horizontal direction. Finally, for the third (BD8) and seventh (BD12) vertical bands, all the pixels are kept in the horizontal direction.

As for the sub-sampling described in the previous figure, the value of each pixel preserved at the end of the sub-sampling can be interpolated using known methods such as bilinear or bi-cubic algorithms or else using the Lanczos method well known to those skilled in the art. As a variant, the value of the retained pixel can be equal to its original value.

Once the sub-sampling has been carried out, the resulting sub-sampled image IMAF2 is obtained, such that the darker the band represented (that is, the denser the hatching), the higher the number of remaining pixels.

The first and the second sub-samplings can be successively applied, regardless of the order. If the sub-sampling of the horizontal bands is applied first, the output image IMAF1 then corresponds to the image to be sub-sampled IMA2 of the second sub-sampling, that of the vertical bands.
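As an illustration of the two successive passes, the sketch below applies ENS1 to horizontal bands and ENS2 to vertical bands of a small synthetic image, in both orders. The helper names (`decimate`, `transpose`, `vertical_pass`, `horizontal_pass`) and the band sizes (6 rows, 8 columns) are assumptions made for the example, and retained pixels simply keep their original value. Under that plain-decimation assumption the two orders give the same result; with interpolating filters this would not necessarily hold exactly.

```python
from fractions import Fraction

ENS1 = [Fraction(1, 3), Fraction(1, 2), Fraction(1), Fraction(1, 2), Fraction(1, 3)]
ENS2 = [Fraction(1, 8), Fraction(1, 2), Fraction(1), Fraction(1, 2), Fraction(1, 8),
        Fraction(1, 2), Fraction(1), Fraction(1, 2), Fraction(1, 8)]


def decimate(lines, factors, band_size):
    """Keep one line out of (1/factor) lines in each band of band_size lines."""
    out = []
    for i, factor in enumerate(factors):
        step = int(1 / factor)
        out.extend(lines[i * band_size:(i + 1) * band_size][::step])
    return out


def transpose(image):
    """Swap rows and columns of a list-of-lists image."""
    return [list(column) for column in zip(*image)]


def vertical_pass(image, factors, band_height):
    """Sub-sample rows band by band (filter oriented vertically)."""
    return decimate(image, factors, band_height)


def horizontal_pass(image, factors, band_width):
    """Sub-sample columns band by band (filter oriented horizontally)."""
    return transpose(decimate(transpose(image), factors, band_width))


# Synthetic image: each pixel stores its own (row, column) coordinates.
HEIGHT, WIDTH = 5 * 6, 9 * 8
image = [[(r, c) for c in range(WIDTH)] for r in range(HEIGHT)]

v_then_h = horizontal_pass(vertical_pass(image, ENS1, 6), ENS2, 8)
h_then_v = vertical_pass(horizontal_pass(image, ENS2, 8), ENS1, 6)
print(len(v_then_h), len(v_then_h[0]), v_then_h == h_then_v)
```

With these band sizes the 30 x 72 input is reduced to 16 rows by 35 columns.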

Figure 8 schematizes a pattern reflecting a non-uniform resolution of the entire doubly sampled image, this pattern emerging after the successive application of the two spatial sub-samplings illustrated in Figures 6 and 7.

The spatial resolution of each part or tile of the doubly sub-sampled image IMAF depends on the values of the sub-sampling factors applied to the bands including the considered tile. In the end, 8 different values of uniform spatial resolution Ri coexist within the image IMAF, such that R0<R1<R2<R3<R4<R5<R6<R7. The double sub-sampling along two different directions makes it possible to obtain a complex spatial resolution pattern that preserves a maximum resolution in certain places of the image, where the spatial resolution is equal to R7 (brightest areas). The controlled reduction of the spatial resolution at certain locations in the image also makes it possible to reduce the amount of data that will be transmitted.
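The count of 8 resolution levels can be checked with a short calculation. The sketch below assumes that the relative resolution of a tile is the product of the vertical factor of its horizontal band (from ENS1) and the horizontal factor of its vertical band (from ENS2), and that all bands have the same size; the names `tile_resolution`, `levels` and `overall` are illustrative only.

```python
from fractions import Fraction

ENS1 = [Fraction(1, 3), Fraction(1, 2), Fraction(1), Fraction(1, 2), Fraction(1, 3)]
ENS2 = [Fraction(1, 8), Fraction(1, 2), Fraction(1), Fraction(1, 2), Fraction(1, 8),
        Fraction(1, 2), Fraction(1), Fraction(1, 2), Fraction(1, 8)]

# Assumption: a tile's relative resolution is the product of the factor of
# its horizontal band and the factor of its vertical band.
tile_resolution = [[v * h for h in ENS2] for v in ENS1]

# Distinct values, sorted in increasing order: candidates for R0 .. R7.
levels = sorted({value for row in tile_resolution for value in row})

# Fraction of pixels retained overall, assuming equally sized bands.
overall = sum(sum(row) for row in tile_resolution) / (len(ENS1) * len(ENS2))

print([str(level) for level in levels])
print(overall)
```

Under these assumptions the distinct values are 1/24, 1/16, 1/8, 1/6, 1/4, 1/3, 1/2 and 1, which matches the eight levels R0 to R7, and roughly a quarter of the pixels (7/27) are retained overall.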

For example, the pattern in Figure 8 can be implemented when the display configuration is associated with a head-mounted display, like that shown in Figure 2. The maximum spatial resolution R7 then corresponds to the areas located opposite the central circles.

According to one embodiment, the higher the value of the temporal frequency of a group of images, the lower the values of the spatial resolutions. For example, the preprocessing means implementing the preprocessing step can store a correspondence table between temporal frequency values implemented in step E101 and sets of sub-sampling steps to be applied during step E102. The correspondence table can store an intermediate value of overall resolution of the image once sub-sampled (for example divided by a positive integer P with respect to the original image). One or more sets of sub-sampling steps correspond to each intermediate value of overall resolution, such that the complete image is on average sub-sampled by that intermediate value of overall resolution.

For example, the initial configuration may include as an instruction a quantity of data that it is possible to transmit, an instruction expressed as follows:

- an overall reduction rate RED of the amount of initial data - RED can be expressed as integer or decimal positive values;

- an authorized temporal sub-sampling rate TEMP (this rate being able to take on positive integer values for less complex processing, this constraint being able to be lifted if the technical context allows more complex processing).

The spatial sub-sampling rate SPAT is then obtained from the following formula: SPAT=RED/TEMP. SPAT can take positive values that are integers or not.

For example, if the global reduction rate is equal to RED=4, it follows that:

- if TEMP=4, then SPAT=1;

- if TEMP=3, then SPAT=4/3;

- if TEMP=2, then SPAT=2;

- if TEMP=1, then SPAT=4.
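The relation SPAT=RED/TEMP and the four cases listed above can be reproduced with exact rational arithmetic. This is a minimal sketch; the function name `spatial_rate` is hypothetical, and the inputs are assumed to be integers, `Fraction` objects or decimal strings.

```python
from fractions import Fraction


def spatial_rate(red, temp):
    """Return SPAT = RED / TEMP as an exact fraction.

    red:  overall reduction rate RED (positive).
    temp: authorized temporal sub-sampling rate TEMP (positive).
    """
    red, temp = Fraction(red), Fraction(temp)
    if red <= 0 or temp <= 0:
        raise ValueError("RED and TEMP must be positive")
    return red / temp


RED = 4
table = {temp: spatial_rate(RED, temp) for temp in (4, 3, 2, 1)}
for temp, spat in table.items():
    print(f"TEMP={temp} -> SPAT={spat}")
```

Using `Fraction` keeps non-integer rates such as 4/3 exact, which may be convenient if SPAT is later used as a key into a correspondence table.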

The set or sets of sub-sampling steps are obtained using a correspondence table for example defined by the initial configuration according to the value taken by SPAT.

Reference is made again to FIG. 3. Optionally, a sub-step of cutting up the images E103 is implemented for the images of a group of images. This step precedes the coding step E2. It aims to break down each image of the group of images into k sub-images (k being a positive integer). For example, if k=2, each image is split into two halves. More generally, if the temporal frequency of the group of images output from sub-step E101 is equal to the original frequency divided by N, each image is then divided into N sub-images during sub-step E103. When all the images of the input sub-sequence are processed, these are delivered (E104) to be coded.
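The interplay between sub-steps E101 and E103 can be sketched as follows. The sketch assumes N=2, that the temporal sub-sampling retains the first image of each pair, and that an image is cut into N horizontal strips of equal height (the text only states that the image is split into halves); `split_image` and `build_output_subsequence` are hypothetical names.

```python
def split_image(image, n):
    """Cut an image (list of rows) into n strips of equal height.

    Assumption: the cut is made along rows (top strip first).
    """
    height = len(image)
    if n < 1 or height % n != 0:
        raise ValueError("image height must be a multiple of n")
    strip = height // n
    return [image[i * strip:(i + 1) * strip] for i in range(n)]


def build_output_subsequence(images, n):
    """E103 sketch: replace every image by its n sub-images, in order."""
    output = []
    for image in images:
        output.extend(split_image(image, n))
    return output


N = 2
# Input sub-sequence: 8 images of 6 rows x 4 columns, pixel = frame index.
gop = [[[frame] * 4 for _ in range(6)] for frame in range(8)]

kept = gop[::N]                               # E101 sketch: frequency divided by N
output = build_output_subsequence(kept, N)    # E103: N sub-images per image

print(len(gop), len(kept), len(output), len(output[0]))
```

The output sub-sequence contains as many pictures as the input sub-sequence, each with 1/N of the rows, which is what lets an encoder working at a fixed input frequency be used.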

Thus at the output of sub-step E104, the group of processed images forms an output sub-sequence to be coded, this output sub-sequence having rather a low spatial resolution (the value being equal on average to the intermediate value of global resolution) and a temporal frequency equal to the original temporal frequency due to the decomposition of the images into sub-images during the sub-step E103. The conservation of the original temporal frequency makes the preprocessing compatible with an encoding implemented using an encoder operating at a fixed input frequency.

The present invention could be combined with the proposal of patent application WO2020/025510 in the names of the applicants. In that case, it is also possible to implement only the sub-steps E101 and E103. The process resulting from the combination would then make it possible to divide the quantity of data by 2 (if N=2) without changing the resolution and also without subjective loss of visual quality. The method resulting from said combination (and the corresponding device) therefore offers three variants making it possible to reduce the quantity of data to be transmitted with, depending on the variant, a reduction factor varying from 2 to 4 in the case where N=2: either only the temporal frequency is reduced, or only the spatial resolution is reduced, or both the spatial resolution and the temporal frequency are reduced.

The coding of step E2 can therefore be carried out using a standard low-latency codec operating at fixed resolution (the lowest spatial resolution, for example R0 in FIG. 8) and at high temporal frequency (the original temporal frequency). An electronic circuit implementing sub-samplings per row and/or per column according to the invention can be miniaturized. Since it is moreover compatible with a standard codec, it can then be embedded within a head-mounted display without noticeable excess weight, for example a VIVE™ headset from the company HTC. Each coded output sub-sequence is then transmitted (step E3) via, for example, a wireless transmission channel (non-limiting example). For example, the output sub-sequences can be intended for several users within the framework of a virtual reality application involving several wearers of head-mounted displays. The wireless transmission channel is then multi-user. For example, the 60 GHz Wi-Fi WiGig wireless network protocol can be used for transmission (the bandwidth is around 7 Gbps). Alternatively, the Wi-Fi 5 protocol offering 600 Mbps bandwidth can be used.

Each output sub-sequence is received and decoded (step E4). The decoding implemented depends on the coding implemented during step E2. The post-processing step E5 then takes place. This step includes a sub-step E500 for obtaining a post-processing configuration. This sub-step is described in more detail below with reference to Figure 10.

Step E5 then includes an image recomposition sub-step E501, in the case where the image splitting sub-step E103 has been implemented during preprocessing E1. If each image has been divided into 2 halves during sub-step E103, each new recomposed image is obtained by appropriately juxtaposing two successive images of the output sub-sequence received and decoded. Once the images have been recomposed, an over-sampling sub-step E502 increases the spatial resolution of the recomposed images. Oversampling is performed in the same directions as undersampling, using sets of oversampling step values that are the inverse of the undersampling step values. The value of each new pixel created by the oversampling can be computed, for example, using known methods such as bilinear or bicubic algorithms, or the Lanczos method well known to those skilled in the art. At the end of the oversampling sub-step E502, the spatial resolution of the recomposed images is equal to the spatial resolution of the images of the input sub-sequence before the sub-sampling step E102. Finally, if a sub-step E101 for reducing the temporal frequency took place in pre-processing, the post-processing includes a sub-step E503 for restoring the original frequency of the input sub-sequence. To do this, if the temporal frequency of the output sub-sequence is equal to the temporal frequency of the input sub-sequence divided by N, each image resulting from sub-step E502 is repeated N times, so as to restore the temporal frequency of the input sub-sequence. A sub-sequence of decoded and post-processed images is thus delivered at the input of the display step E6, at the maximum spatial resolution and temporal frequency, equal to those of the input sub-sequence.
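The chain of sub-steps E501, E502 and E503 can be sketched as follows in Python. This is an illustrative sketch under stated assumptions, not the claimed implementation: the sub-images are assumed to be stacked vertically, the oversampling is shown along the horizontal direction only, and simple nearest-neighbour repetition stands in for the bilinear, bicubic or Lanczos methods mentioned above. All function names and the factor values are hypothetical.

```python
def recompose(sub_images, n):
    """Sketch of E501: stack each run of n successive sub-images vertically."""
    return [
        [row for part in sub_images[i:i + n] for row in part]
        for i in range(0, len(sub_images), n)
    ]


def oversample_row(row, factors):
    """Sketch of E502 along the horizontal direction.

    Element k is repeated factors[k] times (nearest-neighbour repetition).
    """
    out = []
    for value, factor in zip(row, factors):
        out.extend([value] * factor)
    return out


def restore_frequency(images, n):
    """Sketch of E503: repeat each image n times."""
    return [image for image in images for _ in range(n)]


decoded = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]  # four 1x2 sub-images
factors = [2, 1]                                     # hypothetical oversampling factors
recomposed = recompose(decoded, 2)
oversampled = [[oversample_row(row, factors) for row in image] for image in recomposed]
restored = restore_frequency(oversampled, 2)
```

Here four decoded sub-images yield two recomposed 2x2 images, which become 2x3 after oversampling and are each repeated twice, so four images are delivered to the display step.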

According to a first mode of implementation, the temporal and spatial filters are predefined and stored for both pre-processing and post-processing. A correspondence table then associates a configuration with a selection of temporal and/or spatial filters. According to a second mode of implementation, the identification of the spatial and/or temporal filters at the time of the preprocessing is coupled with the generation and sending of dedicated metadata, transmitted to the device implementing the postprocessing. Figure 9 illustrates the second mode of implementation. Sub-step E100 is itself broken down into several sub-steps. A first of these sub-steps, E1000, comprises obtaining the initial configuration and the parameters associated with this initial configuration, for example a configuration relating to the optics of a head-mounted display. A test T1001 checks whether filters that can be associated with the initial configuration are predefined (for example stored beforehand). If they are not (arrow “N”), the group of images to be processed is read (E1002) then analyzed (E1003). The analysis may comprise an analysis of the content of the images (or of a reference image from among the group of images), with for example a detection of contours, an estimation of movements (for example using measurements carried out by motion sensors), or a determination of a histogram of pixel values. This analysis can be implemented using an algorithm based on prior learning (“machine learning”). The analysis step E1003 can also include an analysis of external information, such as the movement of the head-mounted display worn by the user, or of information complementary to the images, such as depth information. At the end of the analysis, the optimal filters for carrying out the filtering steps are identified and selected (E1004), for example using a correspondence table between a content analysis result and temporal and/or spatial filters. An optional verification (E1005) of the settings of the selected filters against a predetermined minimum visually acceptable quality can be implemented. If this minimum quality criterion is not satisfied, the temporal and/or spatial filters can be updated.

If filters that can be associated with this configuration are predefined (test T1001, arrow “Y”), they are generated (E1006). The images of the group of images to be processed are then read (E1007) and their content is analyzed (E1008). Depending on the result of the analysis, a test T1009 is implemented to check whether an update of the filter parameters is authorized. If this is not the case (“N” arrow), the filtering operations E101, E102 and E103 are implemented with the generated filters. If an update is authorized (“Y” arrow), a test T1010 is implemented to verify whether the quality of the images that would result from filtering with the selected filters is sufficient (for example compared with a predetermined minimum visually acceptable quality). If the quality is insufficient ("Y" arrow), optimal filters with respect to the minimum acceptable visual quality are identified and selected (E1004) using the correspondence table between a content analysis result and temporal and/or spatial filters. The optional check E1005 can again be implemented. If the quality is sufficient (T1010, arrow “N”), the filtering operations E101, E102 and E103 are implemented with the generated filters.
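The branching of sub-step E100 described above can be summarized by the following Python sketch. It is a simplified illustration only: the reading and analysis sub-steps are reduced to a precomputed content class, the optional check E1005 is omitted, and the function name, dictionary keys and factor values are all hypothetical.

```python
def choose_filters(config, content_class, predefined, table,
                   update_allowed, quality_sufficient):
    """Return the filters to use and the sub-step that produced them."""
    generated = predefined.get(config)            # test T1001
    if generated is None:                         # arrow "N": E1002 to E1004
        return table[content_class], "E1004"
    # Arrow "Y": filters generated (E1006), images read and analyzed (E1007, E1008).
    if not update_allowed:                        # test T1009
        return generated, "E1006"
    if quality_sufficient(generated):             # test T1010
        return generated, "E1006"
    return table[content_class], "E1004"          # quality insufficient


# Hypothetical data for illustration only.
predefined = {"hmd_a": {"row_factors": (1, 2, 4)}}
table = {
    "static": {"row_factors": (1, 2, 2)},
    "fast_motion": {"row_factors": (2, 4, 4)},
}

filters, source = choose_filters(
    "hmd_a", "static", predefined, table,
    update_allowed=True,
    quality_sufficient=lambda f: max(f["row_factors"]) <= 2,
)
```

In this example the predefined filters fail the (hypothetical) quality criterion, so the filters are taken from the correspondence table, as in sub-step E1004.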

According to another variant not illustrated, the sub-steps E1004, E1005, E1007 and E1008 as well as the tests T1009 and T1010 are not implemented. The filters generated (E1006) are directly used for filtering E101, E102 and E103.

In one mode of implementation, sub-step E104 can comprise a test T1041 to check whether the sending of metadata is authorized. If this is not the case (“N” arrow), the output sub-sequence is transmitted directly to be encoded (step E1043). If the sending of metadata is authorized (arrow "Y"), metadata obtained during sub-step E100 can be transmitted (E1042), directly by Ethernet or by any other means such as auxiliary data attached to the images, on the one hand for carrying out one or more of the filtering sub-steps E101, E102 and E103, and on the other hand to the device implementing the post-processing. The metadata may or may not be synchronized with the images to which they relate. In the latter case, the metadata is transmitted via channels auxiliary to the transmission protocol used for the video (for example MJPEG, “Motion Joint Photographic Experts Group”). Metadata can represent selected filters and their parameters (for example using an identifier designating a filter from a predetermined list), parameters making it possible to modify or configure predefined filters, or parameters fully describing the filters using a list of properties from which these filters can be generated.
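The three forms of metadata listed above (an identifier, parameters modifying a predefined filter, or a full description) could be represented as in the following Python sketch. The record layout, field names and filter properties are assumptions made for illustration and are not specified by the present description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FilterMetadata:
    filter_id: Optional[str] = None                 # identifier in a predetermined list
    overrides: dict = field(default_factory=dict)   # parameters modifying a predefined filter
    description: Optional[dict] = None              # properties fully describing a filter


def resolve_filter(metadata, predefined):
    """Build the filter parameters designated by one metadata record."""
    if metadata.description is not None:
        return dict(metadata.description)
    params = dict(predefined[metadata.filter_id])
    params.update(metadata.overrides)
    return params


# Hypothetical predetermined list shared by the sender and the receiver.
PREDEFINED = {"h2": {"direction": "horizontal", "factors": (1, 2)}}

by_id = resolve_filter(FilterMetadata(filter_id="h2"), PREDEFINED)
tuned = resolve_filter(
    FilterMetadata(filter_id="h2", overrides={"factors": (1, 4)}), PREDEFINED
)
full = resolve_filter(
    FilterMetadata(description={"direction": "vertical", "factors": (2, 2)}), PREDEFINED
)
```

An identifier alone keeps the metadata small but requires both devices to hold the same list, whereas a full description is self-sufficient at the cost of more data.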

Metadata exchange between sender and receiver is optional. It can be omitted in particular in the case where during post-processing E5, the configuration can be obtained directly from the video format of the output sub-sequences for example.

Finally, a test T1044 checks whether a new input sub-sequence is available. If so (“Y” arrow), a new input sub-sequence is read (E1007). Otherwise (arrow “N”), the coding step E2 is implemented.

FIG. 10 illustrates a mode of implementation of the post-processing E5. The sub-step E500 first includes reading an initial configuration (E5001), for example stored in a memory. This initial configuration can for example correspond to a head-mounted display. A test T5002 checks whether this initial configuration allows suitable filters to be obtained for each output sub-sequence or whether the filters corresponding to the configuration obtained are valid for a set of output sub-sequences. If the filters can be updated for each output sub-sequence (“Y” arrow), a configuration of the spatial and/or temporal filters is obtained (E5003), for example two successive spatial sub-samplings, in a vertical direction then a horizontal direction. The corresponding filters are then generated (E5004). The output sub-sequence to be post-processed is then read (E5005). If the filters cannot be updated for each output sub-sequence (“N” arrow), the post-processing goes directly to the reading of the output sub-sequence to be post-processed (E5005).

The post-processing then includes a check of whether metadata corresponding to the output sub-sequence considered has been received (T5006). If metadata has been received (“Y” arrow), the filters obtained are configured (sampling step, temporal filtering frequency, etc.) during a step E5007. The various filtering operations E501, E502 and E503 are then applied to the output sub-sequence. If a new output sub-sequence is available for post-processing (“Y” arrow of a T504 test), the process is repeated. Otherwise, the post-processing is finished (“N” arrow).
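The loop described above can be sketched in Python as follows. The sketch assumes that each output sub-sequence arrives paired with its metadata (or with `None` when none was received) and uses a hypothetical stand-in for the filtering sub-steps; none of these names come from the present description.

```python
def post_process(sub_sequences, default_filters, apply_filters):
    """Process output sub-sequences until none is left (test T504, arrow "N")."""
    results = []
    for images, metadata in sub_sequences:
        filters = dict(default_filters)       # filters obtained in sub-step E500
        if metadata is not None:              # test T5006, arrow "Y"
            filters.update(metadata)          # configuration step E5007
        results.append(apply_filters(images, filters))  # E501, E502, E503
    return results


# Hypothetical stand-in for the filtering sub-steps: it only reports the settings used.
def describe(images, filters):
    return len(images), filters["sampling_step"], filters["repeat"]


defaults = {"sampling_step": 2, "repeat": 1}
stream = [(["img0", "img1"], None), (["img2"], {"repeat": 2})]
report = post_process(stream, defaults, describe)
```

The first sub-sequence is processed with the default settings, while the second one has its repetition count overridden by the received metadata.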

FIG. 11 schematically illustrates an embodiment of pre-processing means integrated into a coding device DC according to the invention. The device comprises reading means MLC1 able to read images from a source video SVD, by group of images. The images read are passed to means MID1 for identifying an optimal preprocessing. Signaling means MSGA are able to generate metadata MTDA describing the optimal pre-processing or comprising an identifier of this optimal pre-processing if the metadata describing it are not transmitted to the post-processing device. Generation means MGNF1 are capable of generating parameterized filters according to the pre-processing identified by the means MID1 and according to an initial configuration stored in a memory MEM. In this embodiment, the device DPRT comprises means able to generate metadata MTDB describing the filtering parameters of the generated filters.

The pre-processing means also include temporal filtering means MFT1, spatial sub-sampling means MFS1 and image decomposition means MD1 capable of filtering the images of the source video SVD as a function of the filters generated by the means MGNF1. The pre-processed images IMPR form output sub-sequences transmitted with the metadata MTDA and MTDB to a display device coupled to a decoder and a post-processing device.

FIG. 12 schematically illustrates an embodiment of post-processing means forming part of a decoding device DDEC according to the invention. Reading means MLC2 are configured to read the preprocessed images IMPR of the successive output sub-sequences. These means MLC2 can read the images, for example, using the preprocessing identifier MTDA transmitted together with the preprocessed images, in order to match each image read with the metadata describing the preprocessing to be applied, for example stored in an auxiliary memory in the form of a list (the auxiliary memory is not shown, for simplicity). Each pre-processing is identifiable thanks to this identifier. For example, the list can vary according to a result provided by analysis means (not shown) of the scenes present in the images. Identification means MID2 are then able to determine the post-processing filtering to be applied to the images of the output sub-sequences, using the aforementioned identifier MTDA. The identification means MID2 are able to select and configure the filters for the implementation of the identified post-processing. Generation means MGNF2 are configured to generate filters suitable for post-processing using the metadata MTDB transmitted together with the pre-processed images. The generation means MGNF2 are coupled to a memory MEM2 capable of storing a configuration as described above.

The post-processing means further comprise temporal filtering means MFT2, spatial over-sampling MFS2 and image recomposition MD2 capable of filtering the images read by the reading means MLC2 as a function of the post-processing identified by the means MID2 and parameters generated by means MGNF2. The images reconstructed in the format of the MTDC source video are output.

FIG. 13 schematically represents an electronic circuit CIR capable of implementing a pre-processing or post-processing method as described with reference to FIGS. 3, 9 and 10. implemented by the first spatial filtering means MFIL1, the spatial undersampling or oversampling and the decompositions or recompositions of images implemented by the second temporal filtering means MFIL2. Furthermore, the microprocessor µP is capable of generating or processing (in post-processing) the processing metadata mentioned above. The microprocessor µP is also coupled to a memory MEM suitable for saving initial configurations as well as, where appropriate, the correspondence tables mentioned above. The microprocessor µP and the spatial MFIL1 and temporal MFIL2 filtering means are respectively coupled to input communication means MCOME and output communication means MCOMS able to exchange processed data or data to be processed with another device such as an encoder or a decoder for example. For example, the data passing through the input communication means MCOME may include the images of the video data sources delivered to the spatial filtering means MFIL1 and configuration parameters of the filtering means supplied to the microprocessor µP. The data transmitted via the output communication means MCOMS can comprise for example the processing metadata generated by the microprocessor µP as well as the spatially and temporally subsampled images.

Claims

23 Claims

1. Method for coding images of a video sequence, each image being formed of elements organized in rows and columns, the method comprising the implementation of the following steps c) obtaining an initial configuration (E100) representative of structural characteristics of a display device, and for at least one image of a sub-sequence of one or more images of the video sequence called the input sub-sequence, d1) carrying out a first spatial sub-sampling (E102) of at least some of the elements of the image by using a filter oriented along a first direction and comprising a first set (ENS1) of at least two different values of subsampling factors, the first set of values being determined as a function of said initial configuration, then insertion of the resulting image into an output sub-sequence, and e) coding (E2) of the images of the output sub-sequence.

2. Method according to the preceding claim, in which step d1) further comprises a second spatial sub-sampling of at least some of the elements of the sub-sampled image using a filter oriented in a second direction and comprising a second set (ENS2) of at least two different values of sub-sampling factors, the second set of values being determined according to said initial configuration.

3. Method according to any one of claims 1 or 2, in which spatial sub-sampling is carried out using filters oriented in one of the following directions:

- horizontal,

- vertical.

4. Method according to one of the preceding claims, in which each subsampling factor value of a set of values is respectively applied to at least one group of p successive elements according to the direction in which the filter is oriented, p being a positive integer.

5. Method according to one of the preceding claims, further comprising an analysis step (E1003, E1008) comprising an analysis of the content of at least one image of said input sub-sequence, and a step of updating the values of the sub-sampling factors prior to the implementation of step d1), according to the result of the analysis of the content.

6. Method according to one of the preceding claims, further comprising an analysis step comprising an analysis of measurements representative of movements performed by a user, the display device being a head-mounted display worn by said user, and a step of updating the values of the sub-sampling factors prior to the implementation of step d1), according to the result of the analysis of the measurements.

7. Method according to one of the preceding claims, further comprising another analysis step comprising an analysis of the visual quality (E1005) of the images of said output sub-sequence, and a step of updating the values of sub-sampling factors prior to the implementation of step d1), if the visual quality is lower than a predetermined threshold.

8. Method according to one of the preceding claims, comprising a preliminary step (E101) comprising the following sub-steps a) obtaining from said video sequence, so-called initial sub-sequences, and for at least one initial sub-sequence: b1) determination of information representative of the content of at least one image of the initial sub-sequence, and as a function of said information, b2) determination for the initial sub-sequence, of a processing frequency, lower than or equal to the initial image display frequency, as a function of the determined information, and b3) insertion, as a function of the determined processing frequency, of all or part of the images of the group of images in a sub-sequence of images forming an input sub-sequence.

9. Method according to the preceding claim, further comprising an a posteriori step (E103) for the M images of an output sub-sequence, M being an integer, said a posteriori step comprising the following sub-steps d2) comparison between the processing frequency associated with the images of the output sub-sequence and the initial frequency of displaying the images, and if the processing frequency is lower than the initial frequency, spatial division of each of the M images of the output sub-sequence into N sub-images, N being an integer whose value depends on the ratio between the processing frequency and the initial frequency, the coding step e) corresponding to the coding of the M*N sub-images of the output sub-sequence, otherwise the coding step e) corresponds to the coding of said M images of the output sub-sequence.

10. Method according to the preceding claim, further comprising the following steps for each output sub-sequence

- coding of said information.

11. Method for decoding data corresponding to images of a video sequence, each image being formed of elements organized in rows and columns, the images of the video sequence being grouped together in sub-sequences of one or more images called output sub-sequences, the method comprising the implementation of the following steps c1) obtaining an initial configuration (E5001) representative of structural characteristics of a display device, and for at least one image of an output sub-sequence; d11) carrying out a first spatial oversampling (E502) of at least part of the elements of the image by using a filter oriented along a first direction and comprising a first set of at least two different values of oversampling factors, the first set of values being determined according to said initial configuration, then insertion of the resulting image in a sub-sequence to be decoded, and e1) decoding (E4) of the images of the sub-sequence to be decoded.

12. Method according to the preceding claim, in which step d11) further comprises a second spatial oversampling of at least some of the elements of the oversampled image by using a filter oriented in a second direction and comprising a second set of at least two different values of oversampling factors, the second set of values being determined according to said initial configuration.

13. Device for coding (DC) images of a video sequence, each image being formed of elements organized in rows and columns, the device being configured to implement the following steps c) obtaining an initial configuration representative of structural characteristics of a display device, and for at least one image of a sub-sequence of one or more images of the video sequence referred to as the input sub-sequence, d1) carrying out a first spatial sub-sampling of at least part of the elements of the image using a filter oriented along a first direction and comprising a first set of at least two different values of sub-sampling factors, the first set of values being determined as a function of said initial configuration, then inserting the resulting image into an output sub-sequence, and e) coding the images of the output sub-sequence.

14. Device for decoding (DDEC) data corresponding to images of a video sequence, each image being formed of elements organized in rows and columns, the images of the video sequence being grouped into sub-sequences of one or more images called output sub-sequences, the device being configured to implement the following steps c1) obtaining an initial configuration representative of structural characteristics of a display device, and for at least one image of an output sub-sequence; d11) carrying out a first spatial oversampling of at least part of the elements of the image by using a filter oriented in a first direction and comprising a first set of at least two different values of oversampling factors, the first set of values being determined as a function of said initial configuration, then insertion of the resulting image into a sub-sequence to be decoded, and e1) decoding of the images of the sub-sequence to be decoded.