EP3583777A1 - Method and technical equipment for video processing - Google Patents

Method and technical equipment for video processing

Info

Publication number
EP3583777A1
Authority
EP
European Patent Office
Prior art keywords
media data
neural network
data
image
indication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP18754044.8A
Other languages
English (en)
French (fr)
Other versions
EP3583777A4 (de)
Inventor
Francesco Cricri
Emre Baris Aksu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of EP3583777A1 publication Critical patent/EP3583777A1/de
Publication of EP3583777A4 publication Critical patent/EP3583777A4/de
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60General implementation details not specific to a particular type of compression
    • H03M7/6064Selection of Compressor
    • H03M7/6082Selection strategies
    • H03M7/6094Selection strategies according to reasons other than compression rate or data type
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46Embedding additional information in the video signal during the compression process
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/593Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/18Vocoders using multiple modes
    • G10L19/22Mode decision, i.e. based on audio signal content versus external parameters
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • the present solution generally relates to virtual reality and machine learning.
  • the solution relates to streaming and processing of media content.
  • Semantic information may be represented by metadata which may express the type of scene, the occurrence of a specific action/activity, the presence of a specific object, etc. Such semantic information can be obtained by analyzing the media.
  • a method comprising receiving media data for compression; determining, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data; and providing the media data and the indication to a data compressor.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to receive media data for compression; to determine, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data; and to provide the media data and the indication to a data compressor.
  • a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive media data for compression; determine, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data; and to provide the media data and the indication to a data compressor.
  • the media data is compressed with the data compressor according to the indication and the compressed media data and the indication are transmitted to a receiver.
  • the media data is regenerated by a second neural network to obtain regenerated media data, and the first and the second neural network are trained based on a quality indicator obtained by comparing the regenerated media data to training data by a third neural network.
  • parameters of the second neural network are transmitted to the receiver.
  • the media data comprises visual media data and said at least one part of the media data comprises a region of an image or a video frame.
  • the indication of said at least one part of the media comprises a binary mask indicating at least one region that is determinable based on the at least one other part of the media data.
  • parameters for the first neural network and the parameters for the second neural network are updated based on a context of the media data.
  • the updated parameters of the second neural network are transmitted to the receiver.
  • a method comprising receiving media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network; decompressing the media data; and regenerating a final media data in the neural network by using the indication and the parameters.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to receive media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network; to decompress the media data; and to regenerate a final media data in the neural network by using the indication and the parameters.
  • a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network; to decompress the media data; and to regenerate a final media data in the neural network by using the indication and the parameters.
  • the media data comprises visual media data and said at least one part of the media data comprises a region of an image or a video frame.
  • the indication of said at least one part of the media comprises a binary mask.
  • Fig. 1 shows a computer graphics system suitable to be used in a computer vision process according to an embodiment
  • Fig. 2 shows an example of a Convolutional Neural Network
  • Fig. 3 shows a general overview of a method according to an embodiment
  • Fig. 4 shows an embodiment for training neural networks for encoding and decoding
  • Fig. 5 shows an example of a method of an encoder
  • Fig. 6 shows an example of a method of a decoder
  • Fig. 7a is a flowchart illustrating a method according to an embodiment
  • Fig. 7b is a flowchart illustrating a method according to another embodiment
  • Fig. 8 shows an apparatus according to an embodiment in a simplified block chart.
  • Figure 1 shows a computer graphics system suitable to be used in image processing, for example in a media compression or decompression process according to an embodiment.
  • the generalized structure of the computer graphics system will be explained in accordance with the functional blocks of the system. Several functionalities can be carried out with a single physical device, e.g. all calculation procedures can be performed in a single processor if desired.
  • a data processing system of an apparatus according to an example of Fig. 1 comprises a main processing unit 100, a memory 102, a storage device 104, an input device 106, an output device 108, and a graphics subsystem 110, which are all connected to each other via a data bus 112.
  • the main processing unit 100 is a conventional processing unit arranged to process data within the data processing system.
  • the main processing unit 100 may comprise or be implemented as one or more processors or processor circuitry.
  • the memory 102, the storage device 104, the input device 106, and the output device 108 may include conventional components as recognized by those skilled in the art.
  • the memory 102 and storage device 104 store data in the data processing system 100.
  • Computer program code resides in the memory 102 for implementing, for example, computer vision process or a media compression process.
  • the input device 106 inputs data into the system while the output device 108 receives data from the data processing system and forwards the data, for example to a display or for transmission to a receiver.
  • the data bus 112 is a conventional data bus and while shown as a single line it may be any combination of the following: a processor bus, a PCI bus, a graphical bus, an ISA bus. Accordingly, a skilled person readily recognizes that the apparatus may be any data processing device, such as a computer device, a personal computer, a server computer, a mobile phone, a smart phone or an Internet access device, for example an Internet tablet computer.
  • various processes of the media compression or decompression system may be carried out in one or more processing devices; for example, entirely in one computer device, or in one server device or across multiple user devices.
  • the elements of media compression or decompression process may be implemented as a software component residing on one device or distributed across several devices, as mentioned above, for example so that the devices form a so-called cloud.
  • the present embodiments relate to data compression, communication, and decompression, and to the field of machine learning and artificial intelligence.
  • Data compression such as image and video compression, comprises reducing the amount of data used to represent certain information.
  • the output of such an operation is a reduced set of data, which occupies less memory space or can be transmitted using less bitrate or bandwidth.
  • Image compression consists of removing data from the original image that can be easily predicted from the rest of the data, by exploiting for example redundancies (smooth regions). An example of an image compression standard is JPEG (Joint Photographic Experts Group).
  • A video compressor also exploits temporal redundancy, as objects and regions usually move at a low pace compared to the frame sampling rate. An example of a video compression standard is H.264.
  • compression can be either loss-less or lossy, meaning that the reconstruction of the original data from the compressed data may be perfect or non-perfect, respectively.
  • Reconstruction of the original data, or an estimate of the original data, from the compressed data may be referred to as decompression.
  • Machine learning is a field which studies how to learn mappings from a certain input to a certain output, where the learning is performed based on data.
  • a sub-field of machine learning which has been particularly successful recently is deep learning.
  • Deep learning studies how to use artificial neural networks for learning from raw data, without preliminary feature extraction.
  • Deep learning techniques may be used for recognizing and detecting objects in images or videos with great accuracy, outperforming previous methods.
  • the fundamental difference of deep learning image recognition technique compared to previous methods is learning to recognize image objects directly from the raw data, whereas previous techniques are based on recognizing the image objects from hand-engineered features (e.g. SIFT features).
  • deep learning techniques build hierarchical computation layers which extract features of increasingly abstract level.
  • One example of a neural network is a Convolutional Neural Network (CNN). Typically, the input to a CNN is an image, but any other media content object, such as a video file, could be used as well.
  • Each layer of a CNN represents a certain abstraction (or semantic) level, and the CNN extracts multiple feature maps.
  • a feature map may for example comprise a dense matrix of real numbers representing values of the extracted features.
  • the CNN in Fig. 2 has only three feature (or abstraction, or semantic) layers C1, C2, C3 for the sake of simplicity, but CNNs may have more than three, and even many more, convolution layers.
  • the first convolution layer C1 of the CNN may be configured to extract 4 feature-maps from the first layer (i.e. from the input image). These maps may represent low-level features found in the input image, such as edges and corners.
  • the second convolution layer C2 of the CNN which may be configured to extract 6 feature-maps from the previous layer, increases the semantic level of extracted features.
  • the third convolution layer C3 may represent more abstract concepts found in images, such as combinations of edges and corners, shapes, etc.
  • the last layers of the CNN, referred to as a fully connected Multi-Layer Perceptron (MLP), may include one or more fully-connected (i.e., dense) layers and a final classification layer.
  • the MLP uses the feature-maps from the last convolution layer in order to predict (recognize) for example the object class. For example, it may predict that the object in the image is a house.
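  • As an illustration of the kind of CNN sketched in Fig. 2, a minimal PyTorch example follows; the channel counts (4 and 6 feature maps, plus a third layer), kernel sizes and number of output classes are assumptions for illustration only.

```python
# Minimal sketch (not from the patent) of a CNN such as the one in Fig. 2, written
# with PyTorch. The channel counts (4, 6, 8 feature maps), kernel sizes and the
# number of output classes are illustrative assumptions.
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # C1: 4 feature maps of low-level features (edges, corners)
            nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            # C2: 6 feature maps of higher-level features
            nn.Conv2d(4, 6, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            # C3: more abstract concepts (combinations of edges and corners, shapes, ...)
            nn.Conv2d(6, 8, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        # Fully connected MLP head with a final classification layer
        self.mlp = nn.Sequential(
            nn.Flatten(), nn.Linear(8 * 4 * 4, 64), nn.ReLU(), nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.mlp(self.features(x))

logits = SimpleCNN()(torch.randn(1, 3, 64, 64))  # e.g. predict the object class of one RGB image
```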
  • An artificial neural network is a computation graph consisting of successive layers of computation, usually performing a highly non-linear mapping in a high-dimensional manifold. Neural networks work in two phases: the development or training phase, and the test or utilization phase. During training, the network exploits training data for learning the mapping. Training can be done unsupervised (where there are no manually-provided labels or targets) or supervised (the network receives manually-provided labels or targets).
  • One approach to training without manually-provided labels is Generative Adversarial Networks (GANs). In a GAN, the teacher is another neural network, called the Discriminator, which indirectly teaches the first neural network (i.e. the Generator) to generate data which looks realistic.
  • One common use of GANs is image generation, although GANs may also be used for other purposes, such as style transfer, super-resolution, inpainting, etc.
  • the Generator tries to generate images which look similar (but not the same) as those in the training dataset, with the goal of fooling the Discriminator (i.e., convincing the Discriminator that the image is from the training set and not generated by the Generator).
  • the Generator tries to model the probability distribution of the data, so that generated images look like they were drawn (or sampled) from the true probability distribution of the data.
  • the Discriminator sometimes receives images from the training set, and sometimes from the Generator, and has the goal of learning to correctly discriminate them.
  • the loss is computed on the Discriminator's side, by checking its classification (or discrimination) accuracy. This loss is then used for training both the Discriminator and the Generator.
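  • The adversarial training loop described above can be sketched as follows; this is an assumed, generic GAN update step (binary cross-entropy on the Discriminator's output, used to update both networks), not code from the patent.

```python
# Assumed, generic GAN training step. `generator` maps a noise vector to an image;
# `discriminator` outputs the probability (in [0, 1]) that its input comes from the
# training set.
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real_images, z_dim=100):
    b = real_images.size(0)
    # Discriminator update: learn to classify real images as 1 and generated ones as 0
    fake_images = generator(torch.randn(b, z_dim)).detach()
    d_loss = (F.binary_cross_entropy(discriminator(real_images), torch.ones(b, 1)) +
              F.binary_cross_entropy(discriminator(fake_images), torch.zeros(b, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Generator update: try to fool the Discriminator into outputting 1 for generated images
    g_loss = F.binary_cross_entropy(discriminator(generator(torch.randn(b, z_dim))),
                                    torch.ones(b, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```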
  • the known solutions mostly focus on the low-level characteristics by using traditional signal processing methodologies.
  • the known algorithms need to compress and then store/transmit every part of the face, although, for an intelligent agent (e.g. a human), it would be easy to imagine what one eye looks like when the other eye is already visible, or even what one eye looks like when only half of it is visible. If a compressor (and a de-compressor) were able to perform such "imagining" operations, the whole pipeline would greatly benefit by obtaining large savings in bitrate. In fact, the "imaginable" or "determinable" parts of the image may be fully discarded from storage/transmission or kept at lower representation precision (e.g., lower bitrate).
  • a deep learning system is presented to cope with the problem of leveraging semantic aspects of data such as images and videos, in order to obtain a bit-rate reduction.
  • a novel pipeline is proposed for both training and utilizing neural networks for this goal.
  • network topology parameters are disclosed that can be streamed and sent to the client in parallel to the encoded bitstream so that the neural network can be adapted and/or changed on-the-fly during a streaming session.
  • the present embodiments are targeted to a neural network based framework for compression, streaming and de-compression of data such as images and videos.
  • an image is compressed.
  • the image may be an image of a face.
  • the basic idea of the present embodiments is to have a neural network that is able to decide which regions of the image should be encoded with higher-quality and which other regions can be encoded with lower quality. The decision is based on how easy or difficult it is for a second neural network to imagine those regions.
  • the regions which are encoded with low quality are those regions which are easily imaginable, such as specular regions (e.g. right eye after having observed left eye and general pose of the face) and regions which do not change much among different examples of the same region type (e.g., a certain region of the face which does not change much among different persons).
  • a method for achieving a collaborative- adversarial training is disclosed.
  • this approach there are three neural networks which are trained simultaneously, each of them for a different task, but in effect they implicitly contribute to the training of each other.
  • One of the neural networks is called Collaborator (or a Friend) "C” network. This network receives an input image and generates a masked image, where the masked or missing parts are supposed to be "easily imaginable" by a second neural network.
  • the masked image is encoded by encoding only the non-missing (non-easily imaginable) parts, or it is encoded by using two or more different qualities (or bitrates) for the different regions of the image.
  • any suitable encoder may be used, such as JPEG for images.
  • the second neural network, called Generator "G", receives the masked image, or the mask and the image separately, and tries to imagine the missing parts or to improve the quality of the lower-quality parts.
  • a third neural network, called the Adversary "A" network, tries to determine whether the imagined image is a real image (from the training set) or an imagined image (from the Generator).
  • the output of the third neural network A is used to produce a loss metric or a loss value, which may be a real number. This loss may then be used for updating the learnable parameters of all three networks (for example the "weights" of the neural networks).
  • After training, the Collaborator C represents the encoder or compressor, the Generator G represents the decoder or decompressor, and the Adversary A may be discarded or kept for future further training of the Collaborator C and Generator G networks.
  • the decoding side needs to have appropriate parameters and topology information about the Generator neural network.
  • the appropriate parameters and topology information refer to Generator parameters which were trained jointly with the Collaborator parameters used by the encoder. Therefore, parameters of Collaborator and Generator need to be compatible with each other and thus the version of the Collaborator and the Generator needs to be trackable, as multiple versions may be available at different points in time due to retraining and other updates.
  • one simple method is to signal the Generator parameters version number inside the encoded image format, for example as a signalling field in the header portion of the image.
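  • A minimal sketch of such signalling is given below; the container layout (magic bytes, field sizes) is a hypothetical assumption, not a standardized format, and only shows the idea of carrying the Generator parameter version in a header field in front of the encoded image payload.

```python
# Hypothetical container sketch: a small header carrying the Generator parameter
# version number in front of the encoded image payload.
import struct

MAGIC = b"NNV1"  # assumed 4-byte marker for this hypothetical container

def pack_with_version(encoded_image: bytes, generator_version: int) -> bytes:
    return MAGIC + struct.pack(">H", generator_version) + encoded_image

def unpack_version(payload: bytes):
    assert payload[:4] == MAGIC, "unknown container"
    (version,) = struct.unpack(">H", payload[4:6])
    return version, payload[6:]  # (Generator parameter version, encoded image bytes)
```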
  • There may be different neural network versions for different contexts, such as sport, concert, indoor, outdoor, artificial (man-made) scene, natural (e.g. forest) scene, etc.
  • the system may decide to use one of these networks for encoding and decoding. The decision may be manual or automated. Automated decision may be implemented by using a context classifier at the encoder's side, and then the classified context is signaled to the decoder's side.
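  • A possible (assumed) realization of the automated decision is sketched below: a context classifier picks one of a set of per-context Generator versions on the encoder's side, and the resulting context identifier is what would be signalled to the decoder's side. The context labels and weight-file names are hypothetical.

```python
# Assumed sketch of automated context-based model selection.
CONTEXTS = ["sport", "concert", "indoor", "outdoor", "man_made", "natural"]
GENERATOR_WEIGHTS = {name: f"generator_{name}.pt" for name in CONTEXTS}  # hypothetical per-context weights

def select_generator(context_classifier, image):
    context_id = int(context_classifier(image).argmax())        # encoder-side context classification
    return context_id, GENERATOR_WEIGHTS[CONTEXTS[context_id]]  # signal context_id; decoder loads matching weights
```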
  • the server may communicate with the client which neural network topology type is to be used for inpainting.
  • the server may stream the network topology in-band or out-of-band with respect to the video bitstream and have the new topology ready in the client before it is used for inpainting. Furthermore, instead of sending the whole topology and parameters at every update time, the system may send only the difference between the currently used topology and parameters and their updated or latest version, in order to further reduce the bitrate. The present embodiments are discussed in a more detailed manner next.
  • the embodiments can be used to reduce required data rate in any type of media transmission, for example transmission of images, audio or video through local wired or wireless connections, and streaming, multicasting or broadcasting over wired or wireless networks such as cellular networks or terrestrial, satellite or cable broadcast networks.
  • a neural network can be implemented in different ways, also depending on the type of input data. One example is a Convolutional Neural Network (CNN), which consists of a set of layers of convolutional kernel matrices and non-linearity functions.
  • the encoding side may be considered as a system that receives an input image and produces an encoded image as an output.
  • the encoding side may comprise various components, e.g. a neural network and an image/video compression block.
  • the decoding side may be considered as a system that receives an encoded image and outputs a decoded image, and may comprise various components, e.g. a decoding algorithm (such as JPEG, JPEG2000, H.264, H.265 or alike) and a neural network.
  • the encoded image may be transmitted by a transmitter to a receiver, where the decoder resides, or it may be stored locally as a file onto a memory.
  • the encoded image is assumed to require fewer bits to be represented than the original image.
  • the receiver may comprise an apparatus similar to apparatus 50 or the computer graphics system of Figure 1.
  • the receiver may be also considered to be at least one physical or logical sub-function of such apparatus or a system.
  • the term receiver may refer to decompressor circuitry or a memory storing a neural network, which may reside in apparatus 50 or the computer graphics system of Figure 1 .
  • a neural network is trained and/or used to decide which regions of the input image are encoded and which ones are not encoded at all.
  • the neural network may be trained and/or used to decide which regions are encoded with higher bitrate and which ones are encoded with lower bitrate.
  • the specific bitrate used for a certain region may be adaptive to the region itself and may not be fixed to any of the two values.
  • the neural network may be also configured to decide about the regions based on the semantics of the region.
  • the deep learning field has made it possible for neural networks to generate missing data in images and videos, or "inpaint" them. Therefore, the neural network may decide based on how easy it is to imagine the considered region. The bitrate for that region will then be inversely proportional to how well it can be imagined. It is worth noticing that image enhancement techniques based on neural networks may be used not only for image inpainting (replacing missing content regions with plausible content) but also for quality improvement of existing data, such as increasing the resolution of images.
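  • The inverse relation between "imaginability" and bitrate can be illustrated with a simple mapping to a codec quality parameter; the JPEG-style quality value and the 10–90 range used here are assumptions, not part of the described embodiments.

```python
# Sketch of "bitrate inversely proportional to imaginability". The mapping to a
# JPEG-style quality value and the 10..90 range are assumptions.
def region_quality(imaginability: float, q_min: int = 10, q_max: int = 90) -> int:
    """imaginability in [0, 1]; 1.0 = easily imagined -> lowest quality/bitrate."""
    imaginability = min(max(imaginability, 0.0), 1.0)
    return round(q_max - imaginability * (q_max - q_min))

# region_quality(0.95) -> 14 (heavily compressed), region_quality(0.1) -> 82
```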
  • FIG. 3 illustrates a general overview of the solution according to an embodiment.
  • a first neural network 310 receives an input image 300 for analysis.
  • a mask 320 of easily imaginable regions is produced as an output.
  • the non-easily imaginable regions are encoded with higher bitrate 330, and streamed in a bitstream 340 to a receiver.
  • the masked regions may be either encoded at lower bitrate and then streamed, or not encoded and streamed at all.
  • Bitstream 340 may also include information about the mask and/or parameters for the second neural network 380.
  • the bitstream is de-compressed 350 to produce a de-compressed image 370.
  • the de-compression may take into account the encoding method used in the compression stage 330, as in common encoding-decoding techniques.
  • the de-compressed image 370 and the mask and the neural network parameters 360 are input to a second neural network 380. Then the easily-imaginable regions are regenerated, determined, or imagined by the second neural network 380 to produce an output image 390.
  • the present embodiments provide a training framework "Collaborative-Adversarial Training” (CAT), shown in Figure 4.
  • the CAT uses one Generator network G and two auxiliary networks: the Collaborator C and the Adversary A.
  • the first neural network C 410 receives an input image 400 and is configured to output a mask of regions 420 which are easily imaginable by a second network G 430.
  • the second network G 430 receives the masked image 420 (or, alternatively, an image where the masked regions are encoded at lower bitrate) and is configured to imagine or reconstruct the masked regions.
  • the imagined image 440 being output by the second network G 430 is then analyzed by a third neural network A 450.
  • the purpose of the third neural network A 450 is to try to discriminate this image from an image in the training set.
  • Images in the training set may be natural images, i.e., images not modified in their content and semantics with respect to the original images.
  • the third neural network A 450 receives images that are either imagined images from the generator or training set images, and it is configured to determine whether a received image originated from the generator or from the training set.
  • the output of the third neural network A 450 is a classification probability, which is used to compute a loss 460 for training all three networks, by comparing it to the true origin of the image that was input to the third neural network A 450.
  • the first neural network C is trained to help or collaborate with the second neural network G
  • the second neural network G is trained to fool the third neural network A
  • the third neural network A is trained to discriminate the output of the second neural network G from images of the training set.
  • the loss 470 tells the first neural network C how well the third neural network A has managed to discriminate the second neural network G in the last batch of training images, and thus the first neural network C needs to update its parameters accordingly, in order to help the second neural network G to improve in fooling the third neural network A.
  • Training is performed in an unsupervised way or, more precisely, in a self-supervised regime. In fact, no manually-provided labels are needed, but only images such as those in the ImageNet dataset.
  • the target needed to compute the loss is represented by the true label about the origin of the input image to A, i.e., either the generator or the training set.
  • Each network may have a different architecture.
  • the Collaborator C may consist of a "neural encoder" that is configured to apply several convolutions and non-linearities, and optionally to reduce the resolution by using pooling layers or strided convolution.
  • Alternatively, the Collaborator C may consist of a "neural encoder-decoder network", formed by a "neural decoder" that is configured to mirror the "neural encoder" by using convolutions, non-linearities and possibly up-sampling.
  • the Collaborator may output a layer matrix representing the masked regions and the un-masked regions with one value per pixel representing how well that pixel is imaginable by the Generator G. If a binary mask is desired, the values may be simply thresholded. Alternatively, a binary mask may be obtained by having an output layer which performs regression on the coordinates of the masked regions. Then, the areas delimited by the coordinates will represent the masked regions.
  • the binary mask isolates the masked media part from the rest of the image or video frame.
  • the Generator G may be a "neural encoder-decoder", too, where the first part of the network extracts features and the second part reconstructs or imagines the image.
  • the Adversary A may be a traditional classification CNN, formed by a set of convolutional layers and non-linearities, followed by fully-connected layers and softmax layer for outputting a probability vector.
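  • The following PyTorch skeletons illustrate one possible (assumed) realization of the three networks: the Collaborator outputs a per-pixel imaginability map that can be thresholded into a binary mask, the Generator is an encoder-decoder that fills in masked regions, and the Adversary is a plain classifier producing a real-versus-generated probability. Layer choices are illustrative only.

```python
# Assumed PyTorch skeletons of the three CAT networks; layer choices are illustrative.
import torch
import torch.nn as nn

class Collaborator(nn.Module):
    """Neural encoder producing one 'imaginability' value per pixel."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),  # per-pixel value in [0, 1]
        )
    def forward(self, x, threshold=0.5):
        soft_mask = self.net(x)
        # Thresholding (non-differentiable) shown only to illustrate binary-mask extraction
        return soft_mask, (soft_mask > threshold).float()

class Generator(nn.Module):
    """Neural encoder-decoder that 'imagines' the masked regions."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid())
    def forward(self, masked_image, mask):
        return self.dec(self.enc(torch.cat([masked_image, mask], dim=1)))

class Adversary(nn.Module):
    """Classification CNN outputting the probability that its input is a real image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.net(x)
```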
  • Once the Collaborator C has generated a mask of easily imaginable regions, the masked image is processed by a compression algorithm; this may be done in different ways. According to an embodiment, the compression algorithm encodes only the non-easily imaginable regions.
  • the easily imaginable regions are not encoded and not streamed.
  • the non-easily imaginable regions of the image may be divided in a plurality of sub-images that collectively cover the non-easily imaginable regions.
  • the sub-images may be encoded separately.
  • the masked region may be assigned a predetermined pixel value before encoding. The whole image may be encoded together (including both masked and non-masked regions), but the masked region will be encoded with a very low number of bits because of the fixed pixel values in the masked region.
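  • A sketch of this embodiment follows, using Pillow's JPEG encoder as an example codec; the fill value of 128 for the masked (easily imaginable) regions is an arbitrary assumption.

```python
# Assumed sketch: fill the easily imaginable regions with a constant value and run a
# standard codec over the whole image; the flat regions then cost very few bits.
import io
import numpy as np
from PIL import Image

def encode_with_mask(image: np.ndarray, binary_mask: np.ndarray, quality: int = 75) -> bytes:
    """image: HxWx3 uint8; binary_mask: HxW bool, True where the region is easily imaginable."""
    filled = image.copy()
    filled[binary_mask] = 128                      # predetermined pixel value in masked regions
    buf = io.BytesIO()
    Image.fromarray(filled).save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```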
  • both non-easily imaginable and easily imaginable regions are encoded but at different bitrates, where the bitrate may be inversely proportional to the values output by the Collaborator C in that region.
  • the information about the mask may be included into the encoded bitstream or sent separately.
  • the Adversary A can be discarded and only the Collaborator C and the Generator G may be kept.
  • the first neural network C will be included as part of the encoder's (i.e. transmitter's) side, whereas the Generator G will be part of the decoder's (i.e. receiver's) side.
  • the Adversary A may still be kept in case the whole system will be updated further by continuing training after deployment.
  • Figure 5 illustrates an overview of the encoder's side
  • Figure 6 illustrates an overview of the decoder's side.
  • an input image 500 is received by the Collaborator 510 that outputs a mask 520.
  • the input image 500 and the mask 520 are input to compressor 530, which compresses image 500 according to previous embodiments, so that non-easily imaginable parts of the image 500 are less compressed, resulting in a higher bitrate, and easily imaginable parts of the image 500, corresponding to mask 520, are more compressed, resulting in a lower bitrate.
  • the portions covered by mask 520 may be alternatively removed and not encoded at all.
  • the receiver receives a bit-stream 600 including image data, a mask, and/or G version.
  • the image data is de-compressed 610 to provide a decompressed image 630 having regions with higher bitrate (solid lines) and lower bitrate (hatched lines).
  • the decompressed image 630 is transmitted to the Generator 640 that may receive also the mask and the G version 620. With such data, the Generator 640 is able to regenerate the image 650.
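  • The decoder-side flow of Figure 6 could look roughly as follows; the function name and tensor layout are assumptions, and `generator` stands for a network such as the Generator sketched earlier.

```python
# Assumed decoder-side sketch: de-compress the received image data, then let a
# version-matched Generator re-imagine the masked regions and composite the result
# with the decoded pixels.
import io
import numpy as np
import torch
from PIL import Image

def decode_and_regenerate(payload: bytes, binary_mask: np.ndarray, generator) -> np.ndarray:
    decoded = np.array(Image.open(io.BytesIO(payload)).convert("RGB"))              # de-compression step
    img = torch.from_numpy(decoded).permute(2, 0, 1).float().unsqueeze(0) / 255.0   # 1x3xHxW in [0, 1]
    mask = torch.from_numpy(binary_mask.astype(np.float32))[None, None]             # 1x1xHxW
    with torch.no_grad():
        imagined = generator(img * (1 - mask), mask)        # Generator fills the masked parts
    out = img * (1 - mask) + imagined * mask                # keep decoded pixels elsewhere
    return (out.squeeze(0).permute(1, 2, 0).numpy() * 255).astype(np.uint8)
```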
  • the trained Generator G is optimized to work on images that have been encoded using the specific Collaborator C that was trained in the same training session. Therefore, the Collaborator C and the Generator G need to be compatible or synchronized.
  • the version of the Generator G which is needed for decoding an image encoded via a certain Collaborator C network needs to be signaled to the decoder's side, for example as part of the bitstream.
  • the version data is signaled between the transmitter and the receiver, and delivered to Generator 640.
  • the parameters that define Generator 640 may be sent to the receiver in advance or be pre-configured at the receiver. Therefore, it is not always necessary to signal any Generator-related information during media streaming.
  • the server is configured to provide the context-related neural network topologies to the client when the client needs them for inpainting.
  • the server may send the context related neural network simply by streaming the topology to the client either inside the video bitstream (e.g. utilizing metadata carriage mechanisms of the video bitstream or media segment file) or totally out of band by embedding the neural network topology representation inside an HTTP(s) response which is sent to the client.
  • the information sent by the server may include an "effective-start-time" or a time interval parameter which indicates where in the presentation time the new network topology context can be utilized.
  • the topology and parameters to be sent at every update may include only the difference between those in the currently used version and those in the updated version, in order to further reduce the bitrate.
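  • Sending only the difference between the current and updated parameters can be sketched as a per-tensor delta of the two state dictionaries; this assumes the topology itself is unchanged and leaves the actual wire format open.

```python
# Assumed sketch of a parameter-difference update between the deployed and the updated
# Generator (per-tensor delta of state_dicts with identical keys and shapes).
def make_param_delta(old_state: dict, new_state: dict) -> dict:
    return {name: new_state[name] - old_state[name] for name in new_state}

def apply_param_delta(old_state: dict, delta: dict) -> dict:
    return {name: old_state[name] + delta[name] for name in delta}

# Usage: delta = make_param_delta(current_generator.state_dict(), updated_generator.state_dict())
#        current_generator.load_state_dict(apply_param_delta(current_generator.state_dict(), delta))
```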
  • Although the proposed Collaborative-Adversarial Training is contextualized here within the data compression domain, it is to be understood that it can be generalized to other domains.
  • the masked image produced by the Collaborator network may be used for different final tasks than reducing the bitrate.
  • even the entire Collaborator network may be trained for a completely different task, and it may be also trained to support/help multiple Generator networks instead of only one.
  • Not only visual data may be considered, but also other types of data which have statistical structure and semantics (i.e. data which carry useful information, not random data chunks), such as audio, where frequency values may be predicted from other frequency values not based on redundancy (as in conventional audio encoders) but based on semantics (e.g. the semantics of speech).
  • Figure 7a is a flowchart illustrating a method according to an embodiment.
  • a method implemented on a transmitter comprises receiving media data for compression 710; determining, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data 720; and providing the media data and the indication to a data compressor 730.
  • Figure 7b is a flowchart illustrating a method according to another embodiment.
  • a method implemented on a receiver comprises receiving media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network 740; decompressing the media data 750; and regenerating a final media data in the neural network by using the indication and the parameters 760.
  • a part of media data that is determinable based on at least one other part of the media data may refer to a portion of the media data that has been removed from the original media data or that has been modified in some way, for example compressed at a higher level, such that other parts of the media data include information usable in at least partially recovering, reconstructing, or deducing the missing part or the original form of the modified part.
  • the determinable part may be also referred to as an imaginable part and these terms are used interchangeably throughout the specification.
  • An apparatus comprises means for receiving media data for compression; means for determining, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data; and means for providing the media data and the indication to a data compressor.
  • the means comprises a processor, a memory, and a computer program code residing in the memory.
  • An apparatus comprises means for receiving media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network; means for decompressing the media data; and means for regenerating a final media data in the neural network by using the indication and the parameters.
  • the means comprises a processor, a memory, and a computer program code residing in the memory.
  • the apparatus 50 may comprise a housing for incorporating and protecting the device.
  • the apparatus 50 may further comprise a display 32 in the form of a liquid crystal display.
  • the display may be any suitable display technology suitable to display an image or video.
  • the apparatus 50 may further comprise a keypad 34.
  • any suitable data or user interface mechanism may be employed.
  • the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
  • the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input.
  • the apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator).
  • the apparatus may further comprise a camera system 42 capable of recording or capturing images and/or video.
  • the camera system 42 may contain one or more cameras.
  • the camera system is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing.
  • the apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
  • the apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices.
  • the apparatus may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB (Universal Serial Bus)/firewire wired connection.
  • the apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50.
  • the controller 56 may be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56.
  • the controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
  • the apparatus may be formed as a part of a server or cloud computing system.
  • the apparatus may be configured to receive video and audio data from a capture device, such as for example a mobile phone, through one or more wireless or wired connections.
  • the apparatus may be configured to analyze the received audio and video data and to generate a widened video field of view as described in the previous embodiments.
  • the apparatus may be configured to transmit the generated video and/or audio data to an immersive video display apparatus, such as for example a head-mounted display or a virtual reality application of a mobile phone.
  • the apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
  • the apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network.
  • the apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
  • a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a computer program may be configured to carry out the features of one or more embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
EP18754044.8A 2017-02-16 2018-01-23 Method and technical equipment for video processing Pending EP3583777A4 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20175136 2017-02-16
PCT/FI2018/050049 WO2018150083A1 (en) 2017-02-16 2018-01-23 A method and technical equipment for video processing

Publications (2)

Publication Number Publication Date
EP3583777A1 true EP3583777A1 (de) 2019-12-25
EP3583777A4 EP3583777A4 (de) 2020-12-23

Family

ID=63169186

Family Applications (1)

Application Number Title Priority Date Filing Date
EP18754044.8A Pending EP3583777A4 (de) 2017-02-16 2018-01-23 Verfahren und technische ausrüstung zur videoverarbeitung

Country Status (2)

Country Link
EP (1) EP3583777A4 (de)
WO (1) WO2018150083A1 (de)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11657264B2 (en) 2018-04-09 2023-05-23 Nokia Technologies Oy Content-specific neural network distribution
EP3831075A1 (de) 2018-07-30 2021-06-09 Koninklijke KPN N.V. Erzeugung eines zusammengesetzten videostroms zur anzeige in vr
CN111179212B (zh) * 2018-11-10 2023-05-23 杭州凝眸智能科技有限公司 集成蒸馏策略和反卷积的微小目标检测片上实现方法
US11924442B2 (en) 2018-11-20 2024-03-05 Koninklijke Kpn N.V. Generating and displaying a video stream by omitting or replacing an occluded part
US10904637B2 (en) * 2018-12-17 2021-01-26 Qualcomm Incorporated Embedded rendering engine for media data
US11831867B2 (en) 2019-02-15 2023-11-28 Nokia Technologies Oy Apparatus, a method and a computer program for video coding and decoding
EP3938962A1 (de) * 2019-03-15 2022-01-19 Dolby International AB Verfahren und vorrichtung zur aktualisierung eines neuronalen netzwerks
JP7303456B2 (ja) * 2019-06-21 2023-07-05 日本電信電話株式会社 符号化装置、符号化方法及びプログラム
EP4014166A1 (de) 2019-08-14 2022-06-22 Google LLC Nachrichtenübermittlung hinsichtlich tiefer neuronaler netze zwischen basisstation und benutzergerät
WO2021029891A1 (en) 2019-08-14 2021-02-18 Google Llc Communicating a neural network formation configuration
US11397893B2 (en) 2019-09-04 2022-07-26 Google Llc Neural network formation configuration feedback for wireless communications
JP7404525B2 (ja) 2019-10-31 2023-12-25 グーグル エルエルシー ネットワークスライシングのための機械学習アーキテクチャの決定
US11886991B2 (en) 2019-11-27 2024-01-30 Google Llc Machine-learning architectures for broadcast and multicast communications
US11689940B2 (en) 2019-12-13 2023-06-27 Google Llc Machine-learning architectures for simultaneous connection to multiple carriers
KR102423977B1 (ko) * 2019-12-27 2022-07-22 삼성전자 주식회사 인공신경망 기반의 음성 신호 송수신 방법 및 장치
US20230100728A1 (en) * 2020-03-03 2023-03-30 Telefonaktiebolaget Lm Ericsson (Publ) A system, an arrangement, a computer software module arrangement, a circuitry arrangement and a method for improved image processing utilzing two entities
US11663472B2 (en) 2020-06-29 2023-05-30 Google Llc Deep neural network processing for a user equipment-coordination set
US11622117B2 (en) * 2020-07-21 2023-04-04 Tencent America LLC Method and apparatus for rate-adaptive neural image compression with adversarial generators
EP3975452A1 (de) * 2020-09-24 2022-03-30 ATLAS ELEKTRONIK GmbH Wasserschallempfänger und system zur übertragung von bilddaten unter verwendung eines wasserschallsignals
US11445198B2 (en) * 2020-09-29 2022-09-13 Tencent America LLC Multi-quality video super resolution with micro-structured masks
CN112561799A (zh) * 2020-12-21 2021-03-26 江西师范大学 一种红外图像超分辨率重建方法
JPWO2023047485A1 (de) * 2021-09-22 2023-03-30
CN118077201A (zh) * 2021-09-29 2024-05-24 字节跳动有限公司 用于视频处理的方法、设备和介质
CN114095033B (zh) * 2021-11-16 2024-05-14 上海交通大学 基于上下文的图卷积的目标交互关系语义无损压缩系统及方法

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8019171B2 (en) * 2006-04-19 2011-09-13 Microsoft Corporation Vision-based compression
US8223837B2 (en) * 2007-09-07 2012-07-17 Microsoft Corporation Learning-based image compression
US20130170558A1 (en) * 2010-09-10 2013-07-04 Thomson Licensing Video decoding using block-based mixed-resolution data pruning

Also Published As

Publication number Publication date
EP3583777A4 (de) 2020-12-23
WO2018150083A1 (en) 2018-08-23

Similar Documents

Publication Publication Date Title
EP3583777A1 (de) Verfahren und technische ausrüstung zur videoverarbeitung
EP3777207B1 (de) Inhaltsspezifische neuronale netzverteilung
CN110225341B (zh) 一种任务驱动的码流结构化图像编码方法
EP4218238A1 (de) Instanzadaptive bild- und videokomprimierung unter verwendung von maschinenlernsystemen
CN118233636A (zh) 使用深度生成性模型的视频压缩
WO2023016155A1 (zh) 图像处理方法、装置、介质及电子设备
CN112565777B (zh) 基于深度学习模型视频数据传输方法、系统、介质及设备
JP2024535693A (ja) 機械学習システムを使用するネットワークパラメータ部分空間におけるインスタンス適応画像及びビデオ圧縮
CN114723760B (zh) 人像分割模型的训练方法、装置及人像分割方法、装置
CN116233445B (zh) 视频的编解码处理方法、装置、计算机设备和存储介质
US20220398692A1 (en) Video conferencing based on adaptive face re-enactment and face restoration
Löhdefink et al. Focussing learned image compression to semantic classes for V2X applications
US11095901B2 (en) Object manipulation video conference compression
CN113628116B (zh) 图像处理网络的训练方法、装置、计算机设备和存储介质
US20220335560A1 (en) Watermark-Based Image Reconstruction
CN114501031B (zh) 一种压缩编码、解压缩方法以及装置
WO2023133888A1 (zh) 图像处理方法、装置、遥控设备、系统及存储介质
WO2023133889A1 (zh) 图像处理方法、装置、遥控设备、系统及存储介质
CN115299048A (zh) 图像编码、解码方法及装置、编解码器
WO2020107376A1 (zh) 图像处理的方法、设备及存储介质
CN111491166A (zh) 基于内容分析的动态压缩系统及方法
WO2024093627A1 (zh) 一种视频压缩方法、视频解码方法和相关装置
US20230162492A1 (en) Method, server device, and system for processing offloaded data
CN116634178B (zh) 一种极低码率的安防场景监控视频编解码方法及系统
CN110868615B (zh) 一种视频处理方法、装置、电子设备以及存储介质

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20190916

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
REG Reference to a national code

Ref country code: DE

Ref legal event code: R079

Free format text: PREVIOUS MAIN CLASS: H04N0019170000

Ipc: H04N0019850000

A4 Supplementary search report drawn up and despatched

Effective date: 20201119

RIC1 Information provided on ipc code assigned before grant

Ipc: H04N 19/132 20140101ALI20201113BHEP

Ipc: G06N 3/08 20060101ALI20201113BHEP

Ipc: H04N 19/46 20140101ALI20201113BHEP

Ipc: H04N 19/17 20140101ALI20201113BHEP

Ipc: H03M 7/30 20060101ALI20201113BHEP

Ipc: G06T 7/10 20170101ALI20201113BHEP

Ipc: H04N 19/593 20140101ALI20201113BHEP

Ipc: G06T 5/00 20060101ALI20201113BHEP

Ipc: G10L 19/04 20130101ALI20201113BHEP

Ipc: G06T 9/00 20060101ALI20201113BHEP

Ipc: G06N 3/04 20060101ALI20201113BHEP

Ipc: H04N 19/85 20140101AFI20201113BHEP

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20230217