WO2018150083A1 - Method and technical equipment for video processing - Google Patents

Method and technical equipment for video processing

Info

Publication number
WO2018150083A1
Authority
WO
WIPO (PCT)
Prior art keywords
media data
neural network
data
image
indication
Prior art date
Application number
PCT/FI2018/050049
Other languages
English (en)
Inventor
Francesco Cricri
Emre Baris Aksu
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to EP18754044.8A (EP3583777A4)
Publication of WO2018150083A1

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60 General implementation details not specific to a particular type of compression
    • H03M7/6064 Selection of Compressor
    • H03M7/6082 Selection strategies
    • H03M7/6094 Selection strategies according to reasons other than compression rate or data type
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132 Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46 Embedding additional information in the video signal during the compression process
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/593 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/18 Vocoders using multiple modes
    • G10L19/22 Mode decision, i.e. based on audio signal content versus external parameters
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • Semantic information may be represented by metadata which may express the type of scene, the occurrence of a specific action/activity, the presence of a specific object, etc. Such semantic information can be obtained by analyzing the media.
  • a method comprising receiving media data for compression; determining, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data; and providing the media data and the indication to a data compressor.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to receive media data for compression; to determine, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data; and to provide the media data and the indication to a data compressor.
  • a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive media data for compression; to determine, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data; and to provide the media data and the indication to a data compressor.
  • the media data comprises visual media data and said at least one part of the media data comprises a region of an image or a video frame.
  • the indication of said at least one part of the media comprises a binary mask indicating at least one region that is determinable based on the at least one other part of the media data.
  • parameters for the first neural network and the parameters for the second neural network are updated based on a context of the media data.
  • a method comprising receiving media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network; decompressing the media data; and regenerating a final media data in the neural network by using the indication and the parameters.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to receive media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network; to decompress the media data; and to regenerate a final media data in the neural network by using the indication and the parameters.
  • a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network; to decompress the media data; and to regenerate a final media data in the neural network by using the indication and the parameters.
  • the media data comprises visual media data and said at least one part of the media data comprises a region of an image or a video frame.
  • the indication of said at least one part of the media comprises a binary mask.
  • Fig. 1 shows a computer graphics system suitable to be used in a computer vision process according to an embodiment
  • Fig. 2 shows an example of a Convolutional Neural Network
  • Fig. 3 shows a general overview of a method according to an embodiment
  • Fig. 4 shows a Collaborative-Adversarial Training framework according to an embodiment
  • Fig. 5 shows an example of a method of an encoder
  • Fig. 6 shows an example of a method of a decoder
  • Fig. 7a is a flowchart illustrating a method according to an embodiment
  • Fig. 7b is a flowchart illustrating a method according to another embodiment
  • Fig. 8 shows an apparatus according to an embodiment in a simplified block chart.
  • the main processing unit 100 is a conventional processing unit arranged to process data within the data processing system.
  • the main processing unit 100 may comprise or be implemented as one or more processors or processor circuitry.
  • the memory 102, the storage device 104, the input device 106, and the output device 108 may include conventional components as recognized by those skilled in the art.
  • the memory 102 and storage device 104 store data in the data processing system 100.
  • Computer program code resides in the memory 102 for implementing, for example, a computer vision process or a media compression process.
  • the input device 106 inputs data into the system while the output device 108 receives data from the data processing system and forwards the data, for example to a display or for transmission to a receiver.
  • the data bus 112 is a conventional data bus and while shown as a single line it may be any combination of the following: a processor bus, a PCI bus, a graphical bus, an ISA bus. Accordingly, a skilled person readily recognizes that the apparatus may be any data processing device, such as a computer device, a personal computer, a server computer, a mobile phone, a smart phone or an Internet access device, for example an Internet tablet computer.
  • various processes of the media compression or decompression system may be carried out in one or more processing devices; for example, entirely in one computer device, or in one server device or across multiple user devices.
  • the elements of media compression or decompression process may be implemented as a software component residing on one device or distributed across several devices, as mentioned above, for example so that the devices form a so-called cloud.
  • the present embodiments relate to data compression, communication, and decompression, and to the field of machine learning and artificial intelligence.
  • Data compression such as image and video compression, comprises reducing the amount of data used to represent certain information.
  • the output of such an operation is a reduced set of data, which occupies less memory space or can be transmitted using less bitrate or bandwidth.
  • image compression consists of removing data that can be easily predicted from the rest of the data, for example by exploiting redundancies (smooth regions) in the original image.
  • JPEG Joint Photographic Experts Group
  • a video compressor also exploits temporal redundancy, as objects and regions usually move at a low pace compared to the frame sampling rate.
  • An example of a video compressor is the H.264 standard.
  • compression can be either lossless or lossy, meaning that the reconstruction of the original data from the compressed data may be perfect or non-perfect, respectively.
  • Reconstruction of the original data, or an estimate of the original data, from the compressed data may be referred to as decompression.
  • CNN Convolutional Neural Network
  • the input to a CNN is an image, but any other media content object, such as a video file, could be used as well.
  • Each layer of a CNN represents a certain abstraction (or semantic) level, and the CNN extracts multiple feature maps.
  • a feature map may for example comprise a dense matrix of real numbers representing values of the extracted features.
  • the CNN in Fig. 2 has only three feature (or abstraction, or semantic) layers C1, C2, C3 for the sake of simplicity, but CNNs may have many more convolution layers.
  • the first convolution layer C1 of the CNN may be configured to extract 4 feature maps from the first layer (i.e. from the input image). These maps may represent low-level features found in the input image, such as edges and corners.
  • the second convolution layer C2 of the CNN, which may be configured to extract 6 feature maps from the previous layer, increases the semantic level of the extracted features.
  • the third convolution layer C3 may represent more abstract concepts found in images, such as combinations of edges and corners, shapes, etc.
  • the last layer of the CNN, referred to as a fully connected Multi-Layer Perceptron (MLP), may include one or more fully-connected (i.e., dense) layers and a final classification layer.
  • MLP Multi-Layer Perceptron
  • the MLP uses the feature-maps from the last convolution layer in order to predict (recognize) for example the object class. For example, it may predict that the object in the image is a house.
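
As a concrete illustration of the network described above, the following is a minimal sketch in PyTorch (a framework assumption; the patent does not prescribe one) of a three-layer CNN whose C1 and C2 layers extract 4 and 6 feature maps, and whose fully connected MLP head predicts an object class:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Sketch of a Fig. 2 style network: C1 -> C2 -> C3 -> MLP head."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU(),  # C1: 4 low-level maps (edges, corners)
            nn.MaxPool2d(2),
            nn.Conv2d(4, 6, kernel_size=3, padding=1), nn.ReLU(),  # C2: 6 maps at a higher semantic level
            nn.MaxPool2d(2),
            nn.Conv2d(6, 8, kernel_size=3, padding=1), nn.ReLU(),  # C3: more abstract concepts (shapes)
            nn.AdaptiveAvgPool2d(4),
        )
        self.mlp = nn.Sequential(  # fully connected MLP using the last feature maps
            nn.Flatten(),
            nn.Linear(8 * 4 * 4, 32), nn.ReLU(),
            nn.Linear(32, num_classes),  # final classification layer (e.g. "house")
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.features(x))

logits = TinyCNN()(torch.randn(1, 3, 64, 64))  # one 64x64 RGB image -> class scores
```
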
  • An artificial neural network is a computation graph consisting of successive layers of computation, usually performing a highly non-linear mapping in a high-dimensional manifold. Neural networks work in two phases: the development or training phase, and the test or utilization phase. During training, the network exploits training data for learning the mapping. Training can be done unsupervised (where there are no manually-provided labels or targets) or supervised (the network receives manually-provided labels or targets).
  • GAN Generative Adversarial Networks
  • the teacher is another neural network, called the Discriminator, which indirectly teaches the first neural network (i.e. the Generator) to generate data which looks realistic.
  • One common use of GANs is image generation, although GANs may also be used for other purposes, such as style transfer, super-resolution, inpainting, etc.
  • the Generator tries to generate images which look similar (but not identical) to those in the training dataset, with the goal of fooling the Discriminator (i.e., convincing the Discriminator that the image is from the training set and not generated by the Generator).
  • the Generator tries to model the probability distribution of the data, so that generated images look like they were drawn (or sampled) from the true probability distribution of the data.
  • the Discriminator sometimes receives images from the training set, and sometimes from the Generator, and has the goal of learning to correctly discriminate them.
  • the loss is computed on the Discriminator's side, by checking its classification (or discrimination) accuracy. This loss is then used for training both the Discriminator and the Generator.
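
A minimal sketch of that training signal, assuming PyTorch with caller-supplied Generator `G`, Discriminator `D`, optimizers and data (none of these names come from the patent):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real, z, opt_g, opt_d):
    # Discriminator update: classify training images as 1, generated images as 0.
    fake = G(z).detach()                 # detach: this loss must not update G
    d_real, d_fake = D(real), D(fake)
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: produce images the Discriminator labels as "real".
    d_on_fake = D(G(z))
    g_loss = F.binary_cross_entropy_with_logits(d_on_fake, torch.ones_like(d_on_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```
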
  • the known solutions mostly focus on low-level characteristics by using traditional signal processing methodologies.
  • the known algorithms need to compress and then store/transmit every part of the face, although, to an intelligent agent (e.g. a human) it would be easy to imagine how one eye would look when the other eye is already visible, or even how one eye would look when only half of it is visible. If a compressor (and a de-compressor) were able to perform such "imagining" operations, the whole pipeline would greatly benefit from it by obtaining big savings in bitrate. In fact, the "imaginable" or "determinable" parts of the image may be fully discarded from storage/transmission or kept with lower representation precision (e.g., lower bit-rate).
  • the present embodiments are targeted to a neural network based framework for compression, streaming and de-compression of data such as images and videos.
  • an image is compressed.
  • the image may be an image of a face.
  • the basic idea of the present embodiments is to have a neural network that is able to decide which regions of the image should be encoded with higher quality and which other regions can be encoded with lower quality. The decision is based on how easy or difficult it is for a second neural network to imagine those regions.
  • the regions which are encoded with low quality are those regions which are easily imaginable, such as specular regions (e.g. right eye after having observed left eye and general pose of the face) and regions which do not change much among different examples of the same region type (e.g., a certain region of the face which does not change much among different persons).
  • a method for achieving a collaborative-adversarial training is disclosed.
  • In this approach there are three neural networks which are trained simultaneously, each of them for a different task, but in effect they implicitly contribute to the training of each other.
  • One of the neural networks is called the Collaborator (or a Friend) "C" network. This network receives an input image and generates a masked image, where the masked or missing parts are supposed to be "easily imaginable" by a second neural network.
  • the Collaborator C represents the encoder or compressor
  • the Generator G represents the decoder or decompressor
  • the Adversary A may be discarded or kept for future further training of the Collaborator C and Generator G networks.
  • the decoding side needs to have appropriate parameters and topology information about the Generator neural network.
  • the appropriate parameters and topology information refer to Generator parameters which were trained jointly with the Collaborator parameters used by the encoder. Therefore, parameters of Collaborator and Generator need to be compatible with each other and thus the version of the Collaborator and the Generator needs to be trackable, as multiple versions may be available at different points in time due to retraining and other updates.
  • one simple method is to signal the Generator parameters version number inside the encoded image format, for example as a signalling field in the header portion of the image, as sketched below.
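
For illustration, a hedged sketch of such signalling; the magic value, field widths and layout are invented for this example and are not specified by the patent:

```python
import struct

MAGIC = b"NNIM"  # hypothetical 4-byte format identifier

def pack_encoded_image(payload: bytes, g_version: int) -> bytes:
    # header: magic, uint16 Generator-parameters version, uint32 payload length
    return MAGIC + struct.pack(">HI", g_version, len(payload)) + payload

def unpack_encoded_image(blob: bytes):
    assert blob[:4] == MAGIC, "not an NNIM blob"
    g_version, length = struct.unpack(">HI", blob[4:10])
    return g_version, blob[10:10 + length]
```
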
  • there may be different neural network versions for different contexts, such as sport, concert, indoor, outdoor, artificial (man-made) scene, natural (e.g. forest) scene, etc.
  • the system may decide to use one of these networks for encoding and decoding. The decision may be manual or automated. Automated decision may be implemented by using a context classifier at the encoder's side, and then the classified context is signaled to the decoder's side.
  • the server may communicate with the client which neural network topology type is to be used for inpainting.
  • the server may stream the network topology in-band or out-of-band with respect to the video bitstream and have the new topology ready in the client before it is used for inpainting. Furthermore, instead of sending the whole topology and parameters at every update time, the system may send only the difference between the currently used topology and parameters and their updated or latest version, in order to further reduce the bitrate, as sketched below. The present embodiments are discussed in more detail next.
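
One way to realise the parameter-difference idea, sketched here under the assumption that both sides hold PyTorch state_dicts of the same topology (function names are illustrative):

```python
import torch

def make_delta(old_state: dict, new_state: dict) -> dict:
    """Per-tensor difference between the client's current parameters and the update."""
    return {name: new_state[name] - old_state[name] for name in new_state}

def apply_delta(old_state: dict, delta: dict) -> dict:
    """Reconstruct the updated parameters on the client side."""
    return {name: old_state[name] + delta[name] for name in delta}
```

After fine-tuning, such deltas tend to be small in magnitude, so they typically compress much better than the full parameter set.
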
  • the embodiments can be used to reduce required data rate in any type of media transmission, for example transmission of images, audio or video through local wired or wireless connections, and streaming, multicasting or broadcasting over wired or wireless networks such as cellular networks or terrestrial, satellite or cable broadcast networks.
  • a neural network can be implemented in different ways, also depending on the type of input data.
  • one example is a Convolutional Neural Network (CNN), which consists of a set of layers of convolutional kernel matrices and non-linearity functions.
  • the encoding side may be considered as a system that receives an input image and produces an encoded image as an output.
  • the encoding side may comprise various components, e.g. a neural network and an image/video compression block.
  • the decoding side may be considered as a system that receives an encoded image and outputs a decoded image, and may comprise various components, e.g. a decoding algorithm (such as JPEG, JPEG2000, H.264, H.265 or the like) and a neural network.
  • the encoded image may be transmitted by a transmitter to a receiver, where the decoder resides, or it may be stored locally as a file onto a memory.
  • the encoded image is assumed to require fewer bits to be represented than the original image.
  • the receiver may comprise an apparatus similar to apparatus 50 or the computer graphics system of Figure 1.
  • the receiver may be also considered to be at least one physical or logical sub-function of such apparatus or a system.
  • the term receiver may refer to decompressor circuitry or a memory storing a neural network, which may reside in apparatus 50 or the computer graphics system of Figure 1 .
  • a neural network is trained and/or used to decide which regions of the input image are encoded and which ones are not encoded at all.
  • the neural network may be trained and/or used to decide which regions are encoded with higher bitrate and which ones are encoded with lower bitrate.
  • the specific bitrate used for a certain region may be adaptive to the region itself and may not be fixed to any of the two values.
  • the neural network may be also configured to decide about the regions based on the semantics of the region.
  • the deep learning field has made it possible for neural networks to generate missing data in images and videos, or "inpaint" them. Therefore, the neural network may decide based on how easy it is to imagine the considered region. The bitrate for that region will then be inversely proportional to how well it can be imagined, as sketched below. It is worth noticing that image enhancement techniques based on neural networks may be used not only for image inpainting (replacing missing content regions with plausible content) but also for quality improvement of existing data, such as increasing the resolution of images.
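
A toy sketch of that inverse relation; the linear mapping and the quality range are assumptions for illustration, not values taken from the patent:

```python
def quality_for_region(imaginability: float, q_min: int = 20, q_max: int = 95) -> int:
    """Map an imaginability score in [0, 1] to an encoder quality setting:
    well-imaginable regions get low quality (few bits), hard ones get high quality."""
    assert 0.0 <= imaginability <= 1.0
    return round(q_max - imaginability * (q_max - q_min))

quality_for_region(0.9)  # easily imaginable -> 28 (low bitrate)
quality_for_region(0.1)  # hard to imagine  -> 88 (high bitrate)
```
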
  • FIG. 3 illustrates a general overview of the solution according to an embodiment.
  • a first neural network 310 receives an input image 300 for analysis.
  • a mask 320 of easily imaginable regions is produced as an output.
  • the non-easily imaginable regions are encoded with higher bitrate 330, and streamed in a bitstream 340 to a receiver.
  • the masked regions may be either encoded at lower bitrate and then streamed, or not encoded and streamed at all.
  • Bitstream 340 may also include information about the mask and/or parameters for the second neural network 380.
  • the bitstream is de-compressed 350 to produce a de-compressed image 370.
  • the de-compression may take into account the encoding method used in the compression stage 330, as in common encoding-decoding techniques.
  • the de-compressed image 370 and the mask and the neural network parameters 360 are input to a second neural network 380. Then the easily-imaginable regions are regenerated, determined, or imagined by the second neural network 380 to produce an output image 390.
  • the present embodiments provide a training framework "Collaborative-Adversarial Training" (CAT), shown in Figure 4.
  • the CAT uses one Generator network G and two auxiliary networks; Collaborator C, Adversary A.
  • the first neural network C 410 receives an input image 400 and is configured to output a mask of regions 420 which are easily imaginable by a second network G 430.
  • the second network G 430 receives the masked image 420 (or, alternatively, an image where the masked regions are encoded at lower bitrate) and is configured to imagine or reconstruct the masked regions.
  • Training is performed in an unsupervised way or, more precisely, in a self-supervised regime. In fact, no manually-provided labels are needed, but only images such as those in the ImageNet dataset.
  • the target needed to compute the loss is represented by the true label about the origin of the input image to A, i.e., either the generator or the training set.
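
To make the interplay concrete, here is a minimal sketch of one CAT step under assumed architectures (PyTorch; C is assumed to end in a sigmoid, and the mask-coverage reward and loss weights are illustrative choices, not values given in the patent):

```python
import torch
import torch.nn.functional as F

def cat_step(C, G, A, images, opt_cg, opt_a, mask_reward: float = 0.1):
    mask = C(images)                       # Nx1xHxW, values in [0, 1]; 1 = "easily imaginable"
    masked = images * (1.0 - mask)         # blank out the imaginable regions
    recon = G(masked)                      # G "imagines" the missing content

    # Adversary update: discriminate training images (label 1) from reconstructions (0).
    a_real, a_fake = A(images), A(recon.detach())
    a_loss = F.binary_cross_entropy_with_logits(a_real, torch.ones_like(a_real)) \
           + F.binary_cross_entropy_with_logits(a_fake, torch.zeros_like(a_fake))
    opt_a.zero_grad(); a_loss.backward(); opt_a.step()

    # Collaborator + Generator update: reconstructions should fool A and match the
    # input, while C is rewarded for masking as much as G can plausibly regenerate.
    a_on_recon = A(recon)
    cg_loss = F.binary_cross_entropy_with_logits(a_on_recon, torch.ones_like(a_on_recon)) \
            + F.l1_loss(recon, images) - mask_reward * mask.mean()
    opt_cg.zero_grad(); cg_loss.backward(); opt_cg.step()
    return a_loss.item(), cg_loss.item()
```
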
  • Each network may have a different architecture.
  • the Collaborator C may consist of a "neural encoder" that is configured to apply several convolutions and non-linearities, and optionally to reduce the resolution by using pooling layers or strided convolution.
  • the Collaborator C may consist of a "neural encoder-decoder network", formed by a “neural decoder” that is configured to mirror a “neural encoder”, by using convolutions, non-linearities and eventually up- sampling.
  • the Collaborator may output a matrix representing the masked regions and the unmasked regions, with one value per pixel representing how well that pixel is imaginable by the Generator G. If a binary mask is desired, the values may be simply thresholded. Alternatively, a binary mask may be obtained by having an output layer which performs regression on the coordinates of the masked regions. Then, the areas delimited by the coordinates will represent the masked regions. Both options are sketched below.
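
Both options might look like the following sketch (PyTorch; the threshold and box format are assumptions):

```python
import torch

def binary_mask_by_threshold(imaginability: torch.Tensor, thr: float = 0.5) -> torch.Tensor:
    """Option 1: threshold the per-pixel imaginability map output by the Collaborator."""
    return (imaginability > thr).to(torch.uint8)

def binary_mask_from_boxes(boxes, height: int, width: int) -> torch.Tensor:
    """Option 2: rasterise regressed region coordinates (x0, y0, x1, y1) into a mask."""
    mask = torch.zeros(height, width, dtype=torch.uint8)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1
    return mask
```
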
  • the easily imaginable regions are not encoded and not streamed.
  • the non-easily imaginable regions of the image may be divided into a plurality of sub-images that collectively cover the non-easily imaginable regions.
  • the sub-images may be encoded separately.
  • the masked region may be assigned a predetermined pixel value before encoding. The whole image may be encoded together (including both masked and non-masked regions), but the masked region will be encoded with a very low number of bits because of the fixed pixel values in the masked region.
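
A hedged sketch of this constant-fill variant using a standard codec (Pillow's JPEG encoder stands in for the compressor; the fill value and quality setting are arbitrary choices):

```python
import io
import numpy as np
from PIL import Image

def encode_with_constant_fill(image: np.ndarray, mask: np.ndarray,
                              fill: int = 128, quality: int = 80) -> bytes:
    """image: HxWx3 uint8; mask: HxW, nonzero where the region is easily imaginable."""
    filled = image.copy()
    filled[mask.astype(bool)] = fill   # a flat masked area costs the codec very few bits
    buf = io.BytesIO()
    Image.fromarray(filled).save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```
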
  • an input image 500 is received by the Collaborator 510 that outputs a mask 520.
  • the input image 500 and the mask 520 are input to compressor 530, which compresses image 500 according to previous embodiments, so that non-easily imaginable parts of the image 500 are less compressed, resulting in a higher bitrate, and easily imaginable parts of the image 500, corresponding to mask 520, are more compressed, resulting in a lower bitrate.
  • the portions covered by mask 520 may be alternatively removed and not encoded at all.
  • the receiver receives a bit-stream 600 including image data, a mask, and/or G version.
  • the image data is de-compressed 610 to provide a decompressed image 630 having regions with higher bitrate (solid lines) and lower bitrate (hatched lines).
  • the decompressed image 630 is transmitted to the Generator 640 that may receive also the mask and the G version 620. With such data, the Generator 640 is able to regenerate the image 650.
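
Putting the decoder side together, a minimal sketch (PyTorch; the version registry and the blending of received and generated pixels are assumptions about one possible implementation):

```python
import torch

GENERATORS = {}  # hypothetical registry: G version number -> trained Generator module

@torch.no_grad()
def regenerate(decompressed: torch.Tensor, mask: torch.Tensor, g_version: int) -> torch.Tensor:
    """decompressed: 1x3xHxW image; mask: 1x1xHxW, 1 where content must be imagined."""
    G = GENERATORS[g_version]                    # parameters compatible with the encoder's C
    inpainted = G(decompressed * (1.0 - mask))   # imagine the masked regions
    # keep the received pixels where available, generated pixels elsewhere
    return decompressed * (1.0 - mask) + inpainted * mask
```
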
  • the server is configured to provide the context-related neural network topologies to the client when the client needs them for inpainting.
  • the server may send the context-related neural network simply by streaming the topology to the client either inside the video bitstream (e.g. utilizing metadata carriage mechanisms of the video bitstream or media segment file) or totally out of band by embedding the neural network topology representation inside an HTTP(S) response which is sent to the client.
  • Figure 7a is a flowchart illustrating a method according to an embodiment.
  • a method implemented on a transmitter comprises receiving media data for compression 710; determining, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data 720; and providing the media data and the indication to a data compressor 730.
  • Figure 7b is a flowchart illustrating a method according to another embodiment.
  • a part of media data that is determinable based on at least one other part of the media data may refer to a portion of the media data that has been removed from the original media data, or that has been modified in some way (for example compressed more heavily), such that other parts of the media data include information usable in at least partially recovering, reconstructing, or deducing the missing part or the original form of the modified part.
  • the determinable part may be also referred to as an imaginable part and these terms are used interchangeably throughout the specification.
  • An apparatus comprises means for receiving media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network; decompressing the media data; and regenerating a final media data in the neural network by using the indication and the parameters.
  • the means comprises a processor, a memory, and a computer program code residing in the memory.
  • the apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator).
  • the apparatus may further comprise a camera system 42 capable of recording or capturing images and/or video.
  • the camera system 42 may contain one or more cameras.
  • the camera system is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing.
  • the apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
  • the apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices.
  • the apparatus may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB (Universal Serial Bus)/firewire wired connection.
  • the apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50.
  • the controller 56 may be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56.
  • the controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
  • the apparatus may be formed as a part of a server or cloud computing system.
  • the apparatus may be configured to receive video and audio data from a capture device, such as for example a mobile phone, through one or more wireless or wired connections.
  • the apparatus may be configured to analyze the received audio and video data and to generate a widened video field of view as described in the previous embodiments.
  • the apparatus may be configured to transmit the generated video and/or audio data to an immersive video display apparatus, such as for example a head-mounted display or a virtual reality application of a mobile phone.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and technical equipment for media compression/decompression. The method comprises: receiving media data (300) for compression; determining, by a first neural network (310), an indication (320) of at least one part of the media data (300) that is determinable based on at least one other part of the media data (300); and providing the media data (300) and the indication (320) to a data compressor (330). Another aspect of the method comprises: receiving media data (340) with an indication (360) of at least one part of the media data (340) that is determinable based on at least one other part of the media data (340), and parameters (360) of a neural network (380); decompressing (350) the media data (340); and regenerating a final media data (390) in the neural network (380) by using the indication (360) and the parameters (360).
PCT/FI2018/050049 2017-02-16 2018-01-23 Method and technical equipment for video processing WO2018150083A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP18754044.8A EP3583777A4 (fr) 2017-02-16 2018-01-23 Method and technical equipment for video processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20175136 2017-02-16
FI20175136 2017-02-16

Publications (1)

Publication Number Publication Date
WO2018150083A1 (fr) 2018-08-23

Family

Family ID: 63169186

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2018/050049 WO2018150083A1 (fr) 2017-02-16 2018-01-23 Method and technical equipment for video processing

Country Status (2)

Country Link
EP (1) EP3583777A4 (fr)
WO (1) WO2018150083A1 (fr)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070248272A1 (en) * 2006-04-19 2007-10-25 Microsoft Corporation Vision-Based Compression
US20090067491A1 (en) * 2007-09-07 2009-03-12 Microsoft Corporation Learning-Based Image Compression
WO2012033966A1 (fr) * 2010-09-10 2012-03-15 Thomson Licensing Codage vidéo à l'aide d'un élagage de données de résolution mélangée par bloc

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
BOESEN LINDBO LARSEN, ANDERS ET AL.: "Autoencoding beyond pixels using a learned similarity metric", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 10 February 2016 (2016-02-10), XP055379931, Retrieved from the Internet <URL:https://arxiv.org/pdf/1512.09300> [retrieved on 20180606] *
DONG LIU ET AL., INPAINTING WITH IMAGE PATCHES FOR COMPRESSION, 31 August 2011 (2011-08-31)
GREGOR, KAROL ET AL.: "Towards Conceptual Compression", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 29 April 2016 (2016-04-29), XP080805978, Retrieved from the Internet <URL:https://arxiv.org/pdf/1604.08772> [retrieved on 20180608] *
JIANG WEI, RATE-DISTORTION OPTIMIZED IMAGE COMPRESSION BASED ON IMAGE INPAINTING, 2 November 2014 (2014-11-02)
LIU, DONG ET AL.: "Image Compression With Edge-Based Inpainting", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, vol. 17, no. 10, 1 October 2007 (2007-10-01), pages 1273 - 1287, XP011193147, ISSN: 1051-8215, [retrieved on 20180613] *
MAKHZANI, ALIREZA ET AL.: "Adversarial Autoencoders", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 25 May 2016 (2016-05-25), XP055532752, Retrieved from the Internet <URL:https://arxiv.org/pdf/1511.05644> [retrieved on 20180608] *
PATHAK, DEEPAK ET AL.: "Context Encoders: Feature Learning by Inpainting", PROCEEDINGS OF THE 29TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2016, 12 December 2016 (2016-12-12), Las Vegas, NV, USA, pages 2536 - 2544, XP033021434, ISSN: 1063-6919, [retrieved on 20180612] *
See also references of EP3583777A4
YEH, RAYMOND ET AL.: "Semantic Image Inpainting with Perceptual and Contextual Losses", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 14 November 2016 (2016-11-14), XP055532717, Retrieved from the Internet <URL:https://arxiv.org/pdf/1607.07539v2> [retrieved on 20180613] *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11657264B2 (en) 2018-04-09 2023-05-23 Nokia Technologies Oy Content-specific neural network distribution
US11516521B2 (en) 2018-07-30 2022-11-29 Koninklijke Kpn N.V. Generating composite video stream for display in VR
CN111179212A (zh) * 2018-11-10 2020-05-19 杭州凝眸智能科技有限公司 On-chip implementation method for tiny-target detection integrating a distillation strategy and deconvolution
CN111179212B (zh) * 2018-11-10 2023-05-23 杭州凝眸智能科技有限公司 On-chip implementation method for tiny-target detection integrating a distillation strategy and deconvolution
EP3657803A1 (fr) * 2018-11-20 2020-05-27 Koninklijke KPN N.V. Generation and display of a video stream
US11924442B2 (en) 2018-11-20 2024-03-05 Koninklijke Kpn N.V. Generating and displaying a video stream by omitting or replacing an occluded part
TWI749426B (zh) * 2018-12-17 2021-12-11 美商高通公司 Embedded rendering engine for media data
WO2020131645A1 (fr) * 2018-12-17 2020-06-25 Qualcomm Incorporated Method and apparatus for providing a rendering engine model comprising a description of a neural network embedded in a media item
US10904637B2 (en) 2018-12-17 2021-01-26 Qualcomm Incorporated Embedded rendering engine for media data
US11831867B2 (en) 2019-02-15 2023-11-28 Nokia Technologies Oy Apparatus, a method and a computer program for video coding and decoding
EP3925214A4 (fr) * 2019-02-15 2022-11-23 Nokia Technologies Oy An apparatus, a method and a computer program for video coding and decoding
JP2022522685A (ja) * 2019-03-15 2022-04-20 ドルビー・インターナショナル・アーベー Method and apparatus for updating a neural network
CN113508399A (zh) * 2019-03-15 2021-10-15 杜比国际公司 Method and apparatus for updating a neural network
JP7196331B2 (ja) 2022-12-26 ドルビー・インターナショナル・アーベー Method and apparatus for updating a neural network
JP7303456B2 (ja) 2023-07-05 日本電信電話株式会社 Encoding device, encoding method and program
JPWO2020255367A1 (fr) * 2019-06-21 2020-12-24
WO2020255367A1 (fr) * 2019-06-21 2020-12-24 日本電信電話株式会社 Encoding device, encoding method and program
US12001943B2 (en) 2019-08-14 2024-06-04 Google Llc Communicating a neural network formation configuration
US11928587B2 (en) 2019-08-14 2024-03-12 Google Llc Base station-user equipment messaging regarding deep neural networks
US11397893B2 (en) 2019-09-04 2022-07-26 Google Llc Neural network formation configuration feedback for wireless communications
US11886991B2 (en) 2019-11-27 2024-01-30 Google Llc Machine-learning architectures for broadcast and multicast communications
US11689940B2 (en) 2019-12-13 2023-06-27 Google Llc Machine-learning architectures for simultaneous connection to multiple carriers
EP4064283A4 (fr) * 2019-12-27 2022-12-28 Samsung Electronics Co., Ltd. Method and apparatus for transmitting/receiving a voice signal based on an artificial neural network
WO2021175413A1 (fr) * 2020-03-03 2021-09-10 Telefonaktiebolaget Lm Ericsson (Publ) System, arrangement, computer software module arrangement, circuitry arrangement and method for improved image processing utilizing two entities
US11663472B2 (en) 2020-06-29 2023-05-30 Google Llc Deep neural network processing for a user equipment-coordination set
WO2022020297A1 (fr) * 2020-07-21 2022-01-27 Tencent America LLC Method and apparatus for rate-adaptive neural image compression with adversarial generators
US11622117B2 (en) 2020-07-21 2023-04-04 Tencent America LLC Method and apparatus for rate-adaptive neural image compression with adversarial generators
EP3975452A1 (fr) * 2020-09-24 2022-03-30 ATLAS ELEKTRONIK GmbH Waterborne sound receiver and system for transmitting image data using a waterborne sound signal
WO2022063657A1 (fr) * 2020-09-24 2022-03-31 Atlas Elektronik Gmbh Waterborne sound receiver and system for transmitting image data using a waterborne sound signal
CN114616825A (zh) * 2020-09-29 2022-06-10 腾讯美国有限责任公司 Multi-quality video super-resolution with micro-structured masks
CN114616825B (zh) 2024-05-24 腾讯美国有限责任公司 Video data decoding method, computer system and storage medium
CN112561799A (zh) * 2020-12-21 2021-03-26 江西师范大学 Infrared image super-resolution reconstruction method
WO2023047485A1 (fr) * 2021-09-22 2023-03-30 株式会社日立国際電気 Communication apparatus and data communication method
WO2023056364A1 (fr) * 2021-09-29 2023-04-06 Bytedance Inc. Method, device and medium for video processing
CN114095033A (zh) * 2021-11-16 2022-02-25 上海交通大学 Context-based graph-convolution system and method for lossless semantic compression of target interaction relations
CN114095033B (zh) 2024-05-14 上海交通大学 Context-based graph-convolution system and method for lossless semantic compression of target interaction relations

Also Published As

Publication number Publication date
EP3583777A1 (fr) 2019-12-25
EP3583777A4 (fr) 2020-12-23

Similar Documents

Publication Publication Date Title
WO2018150083A1 (fr) Method and technical equipment for video processing
EP3777207B1 (fr) Content-specific neural network distribution
CN110225341B (zh) Task-driven bitstream-structured image encoding method
EP4218238A1 Instance-adaptive image and video compression using machine learning systems
CN118233636A (zh) Video compression using deep generative models
WO2019001108A1 (fr) Video processing method and apparatus
WO2023016155A1 (fr) Image processing apparatus and method, medium, and electronic device
CN112565777B (zh) Deep learning model-based video data transmission method, system, medium and device
US20210150769A1 (en) High efficiency image and video compression and decompression
CN114723760B (zh) Training method and apparatus for a portrait segmentation model, and portrait segmentation method and apparatus
CN116233445B (zh) Video encoding/decoding processing method and apparatus, computer device and storage medium
CN111641826A (zh) Method, apparatus and system for encoding and decoding data
US20220398692A1 (en) Video conferencing based on adaptive face re-enactment and face restoration
Löhdefink et al. Focussing learned image compression to semantic classes for V2X applications
WO2023050720A1 (fr) Image processing method, image processing apparatus, and model training method
US20220335560A1 (en) Watermark-Based Image Reconstruction
US11095901B2 (en) Object manipulation video conference compression
CN116847087A (zh) Video processing method and apparatus, storage medium and electronic device
CN114501031A (zh) Compression encoding and decompression method and apparatus
WO2020107376A1 (fr) Image processing method, device and storage medium
CN111491166A (zh) Dynamic compression system and method based on content analysis
WO2024093627A1 (fr) Video compression method, video decoding method and related apparatus
US20230162492A1 (en) Method, server device, and system for processing offloaded data
CN116634178B (zh) Very-low-bitrate security-scene surveillance video encoding/decoding method and system
CN110868615B (zh) Video processing method and apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18754044

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018754044

Country of ref document: EP

Effective date: 20190916