WO2020222027A1 - Procédé de catégorisation de contenu multimédia dynamique - Google Patents

Procédé de catégorisation de contenu multimédia dynamique

Info

Publication number
WO2020222027A1
WO2020222027A1 (PCT/IB2019/053484)
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
feature vectors
media
generating
media content
Prior art date
Application number
PCT/IB2019/053484
Other languages
English (en)
Inventor
Gabriel AUTÈS
Sami Arpa
Sabine SÜSSTRUNK
Original Assignee
Ecole Polytechnique Federale De Lausanne (Epfl)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ecole Polytechnique Federale De Lausanne (Epfl) filed Critical Ecole Polytechnique Federale De Lausanne (Epfl)
Priority to US17/607,393 priority Critical patent/US11961300B2/en
Priority to EP19729353.3A priority patent/EP3963503A1/fr
Priority to PCT/IB2019/053484 priority patent/WO2020222027A1/fr
Publication of WO2020222027A1 publication Critical patent/WO2020222027A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47Detecting features for summarising video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content

Definitions

  • the present invention relates to the field of computer vision and deep learning, and more particularly to a method for dynamically and automatically assessing the content of a media item, such as a video, and for labelling the content based on the assessed content by using a set of artificial neural networks.
  • the labelling may be used to determine an age suitability rating for the content.
  • the present invention also relates to a system for implementing the method.
  • Motion picture age ratings have been and still are a very important issue for the film, TV and video game industries.
  • films must receive an age rating before being released to the public.
  • these ratings can range from being purely advisory to being legally restrictive.
  • films receive a rating given by the Motion Picture Association of America (MPAA), which is a trade association representing the major film studios of the country. While this rating is not mandatory, most theatres do not show films which have not received a seal of approval from the MPAA.
  • these ratings are mandatory and must be obtained from a governmental agency before a film release.
  • Video streaming service providers rely on their own experts or on user-generated ratings. Other platforms hire content moderators whose task is to watch posted videos and decide if they should be age-restricted or removed. Some service providers also use automated systems for detecting content violating their terms of service.
  • Deep artificial neural networks have been successfully used to classify video content according to its genre and action content. Similarly, deep neural networks can perform text classification.
  • the detection of disturbing content has so far been limited to specific types of sensitive material, such as the presence of violence or adult content. Nevertheless, the specific task of age rating detection goes beyond the detection of violent and pornographic content. In particular, profanities, mature topics and substance abuse can all result in a higher age rating in many countries.
  • most of the existing solutions that use deep learning for sensitive content detection concentrate on the task of detecting a specific type of content, such as violence or pornography. Some other solutions use diverse low-level feature descriptors from audio, video and text inputs to filter sensitive media.
  • the present invention discloses a new computer-implemented method based on deep artificial neural networks to assign e.g. age suitability classes or ratings to media content.
  • the present invention is superior to the 'panel of experts' or 'user generated' methods in several ways:
  • the present invention can learn the age rating principles specific to different cultures and countries.
  • a film studio can directly assess the certification their movie will obtain in different parts of the world without hiring external experts.
  • the proposed computer-implemented solution allows assessing the age suitability dynamically as it varies through the sub-content (the different sections of the content). For example, the proposed solution allows detecting the age rating per scene and identifying which parts of a video contribute to a specific rating. This advantage can be used to modify the film to obtain a desired rating. For example, the proposed solution could identify which scene of a film results in an R rating and allow the creation of a cut or modified version suitable for a younger audience. It is also to be noted that the present solution, which is based on convolutional neural networks, does not require crafting of low-level features.
  • the proposed solution can provide specific age suitability ratings in a detailed classification system.
  • a computer program product comprising instructions stored on a non-transitory medium for implementing the steps of the method when loaded and run on computing means of an electronic device.
  • a system configured for carrying out the method.
  • Figure 1 is a block diagram schematically illustrating a classification system according to an example of the present invention
  • Figure 2 is a block diagram schematically illustrating a training system according to an example of the present invention.
  • Figure 3 is a flow chart summarising a method of classifying a media item according to an example of the present invention.
  • the embodiment is used to dynamically classify media or multimedia content into one or more user profile suitability classes but the teachings of the present invention could instead be applied to other data classification tasks as well.
  • the user profile suitability classes are in the embodiment explained below user or viewer age suitability classes. Identical or corresponding functional and structural elements that appear in the different drawings are assigned the same reference numerals.
  • FIG. 1 schematically illustrates a media item classification or labelling network or system 1 according to an embodiment of the present invention.
  • the proposed system in this example is used for dynamically classifying video scenes as will be explained later in more detail.
  • the word "video" may be understood in its broad sense to mean an electronic medium for the recording, copying, playback, broadcasting, and display of moving or unmoving visual or other types of media.
  • Figure 1 also shows various output data sets that result from different functional blocks shown in Figure 1. These data sets may not be considered to be part of the actual system but are shown in Figure 1 for illustration purposes.
  • the system takes as its input a digital media file, which in this example is a video or motion picture file 3, which is fed into a data pre-processing unit 5.
  • the data pre-processing unit 5 is configured to pre-process the input video file as will be explained later in more detail and output a sequence or stream of image frames 7, audio clips 9 and text portions or words 11, which may be subtitles, a video summary, a script, reviews etc. related to the image frames and/or audio clips.
  • the sequences of audio clips, image frames and words are also referred to as a first data stream, a second data stream and a third data stream, although not necessarily in this particular order.
  • the audio clips, image frames and text portions represent different content forms or types.
  • the pre-processing unit may thus output a data stream consisting of image and audio signals as well as any textual input related to the image and/or audio signals.
  • the system further comprises an audio and image processing block or unit 13 for converting or transforming a respective sequence of audio clips and image frames into a respective single audio and image feature vector 15, a text processing block or unit 17 for converting or transforming a respective sequence of words into a respective single text feature vector 19, a classifier block or unit 21 for generating a probability score or vector 25 for a concatenated audio, image and text vector 23 to obtain an estimated age suitability class, and a post-processing block or unit 27 for using the result of the estimation to take an action depending on the estimated classification.
  • the audio and image processing unit 13, the text processing unit 17 and the classifier unit 21 together form an artificial neural network system.
  • the audio and image processing unit 13 comprises a set of first artificial neural networks, which in this example are convolutional neural networks (CNNs) 29 trained for image processing and referred to as a set of image CNNs.
  • the image CNNs receive at their inputs the sequence of image frames 7 (four image frames in the example shown in Figure 1), such that one image CNN 29 is arranged to receive and process one image frame.
  • the image input is a sequence of t consecutive frames separated by a time duration s, which may be e.g. between 0.1 seconds and 10 seconds or more specifically between 0.5 seconds and 3 seconds.
  • Each one of the image CNNs 29 is arranged to process the received image frame and output an image feature vector 31.
  • the number of image feature vectors 31 output by the set of image CNNs equals the number of image CNNs in the set.
  • the output is a sequence of t image feature vectors of size D1.
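As an illustration of this per-frame feature extraction, the minimal sketch below turns a sequence of t frames into t image feature vectors with a pre-trained image CNN. PyTorch, torchvision and the ResNet-50 backbone (hence D1 = 2048) are assumptions chosen for illustration; the method itself does not prescribe a particular framework or backbone.

```python
import torch
import torchvision.models as models

frames = torch.rand(4, 3, 224, 224)                 # t = 4 pre-processed RGB frames (illustrative)

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()                   # drop the ImageNet classification head
backbone.eval()                                     # the extractor stays frozen (transfer learning)

with torch.no_grad():
    image_feature_vectors = backbone(frames)        # one feature vector per frame
print(image_feature_vectors.shape)                  # torch.Size([4, 2048]), i.e. (t, D1)
```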
  • a feature vector is understood to be a vector that contains information describing an object's important characteristics.
  • a feature vector is a vector of characters and more specifically numbers describing the object.
  • the audio and image processing unit 13 also comprises a set of second artificial neural networks, which in this example are CNNs 33 trained for audio processing and/or recognition and referred to as a set of audio CNNs.
  • the audio CNNs receive at their inputs the sequence of audio clips 9 of a given length (four audio clips in the example shown in Figure 1), such that one audio CNN 33 is arranged to receive one audio clip.
  • the audio input is a sequence of t audio clips, each with a duration which equals s in this particular example.
  • Each one of the audio CNNs 33 is arranged to process the received audio clip and output an audio feature vector 35.
  • the number of audio feature vectors 35 output by the set of audio CNNs equals the number of audio CNNs in the set.
  • the output is a sequence of t audio feature vectors of size D2, which is typically not the same as D1.
  • the audio feature vectors and the image feature vectors are then arranged to be fed into a first concatenation unit 37, which is configured to take an audio feature vector and append the corresponding (timewise) image feature vector to it, or vice versa. In this manner the audio feature vectors and the image feature vectors can be merged so that the number of the concatenated audio and image feature vectors 39 equals the number of audio feature vectors 35 or the number of image feature vectors 31 in the sequence.
  • the output of the first concatenation unit is a sequence of t concatenated audio and image feature vectors of size D1 + D2.
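A minimal sketch of this timewise fusion follows; the values of t, D1 and D2 are assumptions for illustration only.

```python
import torch

t, D1, D2 = 4, 2048, 128                    # sizes assumed for illustration
image_seq = torch.rand(t, D1)               # sequence of image feature vectors 31
audio_seq = torch.rand(t, D2)               # time-aligned sequence of audio feature vectors 35

# Timewise concatenation: the i-th audio vector is appended to the i-th image vector.
fused_seq = torch.cat([image_seq, audio_seq], dim=1)
print(fused_seq.shape)                      # torch.Size([4, 2176]), i.e. (t, D1 + D2)
```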
  • the concatenated audio and image feature vectors 39 are then fed into a third artificial neural network 41, which in this example is a first convolution through time (CTT) network, which is a one-dimensional CNN.
  • the CTT network applies a succession of one-dimensional convolution filters to extract temporal information from the input sequence.
  • the CTT network consists of a series of one-dimensional convolution layers, the convolution being applied to the temporal dimension.
  • the input sequence of concatenated audio and image feature vectors (of size (t, D1 + D2)) thus goes through the first CTT network consisting of a series of one-dimensional convolution layers.
  • the first CTT network 41 is thus configured to process the incoming concatenated audio and image feature vectors to output the single image and audio feature vector 15, which indirectly describes the sequences of audio feature vectors and image feature vectors.
  • the operation of the text processing unit 17 is similar to the audio and image processing unit 13 with the main difference that there is no need to carry out a concatenation operation within the text processing unit 17.
  • the text processing unit 17 comprises a set of text processing elements 43, which in this example are word embedding matrices 43 trained for text or word processing.
  • Word embedding is the collective name for a set of feature learning techniques and language modelling in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers.
  • it typically involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.
  • the word embedding matrices receive at their inputs the sequence of words 11, such that one word embedding matrix 43 is arranged to receive and process one word. Each one of the word embedding matrices 43 is arranged to process the received word and output a text feature vector 45 for the received word. In this example, each word embedding matrix is configured to process one word. Thus, the number of text feature vectors 45 output by the set of word embedding matrices 43 equals the number of word embedding matrices in the set.
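A minimal sketch of mapping a word sequence to text feature vectors is shown below. The vocabulary size and embedding dimension are assumptions, and in practice the embedding matrix would be loaded with pre-trained weights (e.g. word2vec or GloVe) rather than initialised randomly.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 20000, 300                  # assumed sizes, for illustration only
embedding = nn.Embedding(vocab_size, embed_dim)     # the word embedding matrix

word_ids = torch.tensor([12, 874, 3021, 55, 7])     # token indices of a 5-word sequence
text_feature_vectors = embedding(word_ids)          # one text feature vector 45 per word
print(text_feature_vectors.shape)                   # torch.Size([5, 300])
```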
  • the text feature vectors 45 are then arranged to be fed into a fourth artificial neural network 47, which in this example is a second convolution through time (CTT) network, i.e. a one-dimensional CNN similar to the first CTT network but with different operating parameters.
  • the second CTT network 47 is configured to process the incoming text feature vectors to output the single text feature vector 19.
  • the first and second CTT networks 41, 47 could be merged into one single CTT system.
  • first and second recurrent neural networks (RNNs) could be used instead, for example with long short-term memory (LSTM) units, where connections between nodes form a directed graph which contains a cycle. These networks exhibit a memory effect, which makes them particularly efficient for sequence and time series classification problems.
  • the first and second CTT networks each comprise a series of nconv one-dimensional convolutional layers with f1, f2, ..., fnconv filters of kernel sizes k1, k2, ..., knconv, respectively.
  • after the i-th layer, the length ti of the sequence is reduced according to the size of the kernel ki to ti - ki + 1.
  • a max pooling operation is applied over the remaining sequence to obtain a single feature vector of size fnconv encoding the full sequences of audio and images, and of text.
  • the max pooling is a sample-based discretisation process whose objective is to down-sample an input representation, reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions. It is also possible to use CTT networks having only one layer.
  • the kernel size(s) of the second CTT network (for the text processing) may be smaller, although not necessarily, than the kernel sizes for the first CTT network.
  • the kernel sizes may be determined and/or optimised through training of the CTT networks.
  • the kernel size of each layer in this example equals at most t, in other words k ≤ t, where t denotes the number of image frames (or audio clips) in a sequence fed into the first CTT network 41.
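The following sketch illustrates a convolution-through-time block of this kind: a stack of one-dimensional convolutions over the temporal dimension, each shortening the sequence from ti to ti - ki + 1, followed by a max pool over time that yields a single feature vector. PyTorch and the particular layer sizes (two layers, 256 and 128 filters, kernels 3 and 2) are assumptions, not prescribed values.

```python
import torch
import torch.nn as nn

class ConvThroughTime(nn.Module):
    """Sketch of a CTT block: 1-D convolutions applied to the temporal dimension,
    followed by max pooling over time. Layer sizes here are illustrative only."""
    def __init__(self, in_dim, filters=(256, 128), kernels=(3, 2)):
        super().__init__()
        layers, prev = [], in_dim
        for f, k in zip(filters, kernels):
            layers += [nn.Conv1d(prev, f, kernel_size=k), nn.ReLU()]
            prev = f
        self.convs = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, t, D)
        x = x.transpose(1, 2)                # Conv1d expects (batch, channels, t)
        x = self.convs(x)                    # each layer shortens t to t - k + 1
        return x.max(dim=2).values           # max pool over time -> (batch, f_last)

ctt = ConvThroughTime(in_dim=2176)           # D1 + D2 from the earlier sketch
single_vector = ctt(torch.rand(1, 4, 2176))  # a sequence of t = 4 fused vectors
print(single_vector.shape)                   # torch.Size([1, 128])
```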
  • the audio clips and the image frames are typically synchronised in the time domain in a given sequence. However, this does not have to be the case. More specifically, the image frames may be taken at regular (or irregular) time intervals with a given time separation T between any two consecutive image frames. The audio clips then have the same time duration T.
  • the word stream fed into the text processing unit then includes all the words present in the video during this sequence of frames or audio clips.
  • the number of image frames during a given sequence in this example equals the number of audio clips.
  • the number of words is however typically different from the number of frames or number of audio clips in a given sequence.
  • the system also comprises a second concatenation unit 49, which in the example illustrated in Figure 1 is part of the classifier unit 21.
  • the single audio and image feature vector and the single text feature vector are arranged to be fed into the second concatenation unit 49, which is configured to concatenate these two feature vectors to output the concatenated audio, image and text feature vector.
  • the first and second concatenation units 37, 49 may form one single processing unit. In other words, the same concatenation unit may be used for carrying out the first and second concatenation operations in the system.
  • the concatenated audio, image and text feature vector 23 is then arranged to be fed into a fifth artificial neural network 51, which in this example is a feedforward artificial neural network and more specifically a multilayer perceptron (MLP) network.
  • the MLP may also be considered as a series of linear regression units and non-linear activation units arranged to model the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
  • the MLP 51 is configured to output the probability vector 25 that is configured to be fed into the post-processing unit 27, which is arranged to assign the age suitability class to the video scene under consideration based on the probability vector 25 as will be described later in more detail.
  • the MLP uses ndense fully connected layers of sizes d1, d2, ..., dndense, where dx denotes the number of nodes or neurons in the layer.
  • the last layer of this fully connected network is a dense layer having a size equalling the number of target ratings or, in this example, the number of age suitability classes nclasses in the system.
  • the output vector of length nclasses can then be transformed into the class probability vector 25 by applying a softmax activation function.
  • the softmax function, also known as softargmax or normalised exponential function, is a function that takes as input a vector of M real numbers and normalises it into a probability vector consisting of M probabilities. That means that prior to applying the softmax function, some vector components could be negative, or greater than one, and might thus not sum to 1. After applying softmax, each component will be in the interval (0, 1), and the components will add up to 1, so that they can be interpreted as probabilities. It is to be noted that the activation function applied to the output vector is the softmax function if a single label is needed (age suitability classes) and a sigmoid function for a multi-label classification.
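A short worked sketch of the softmax normalisation described above; the logit values are made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()             # subtract the max for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.2, -0.3, 0.4, 3.0, 0.0])   # raw scores for five classes (made-up numbers)
probs = softmax(logits)
print(probs.round(3))                           # every component lies in (0, 1)
print(probs.sum())                              # components sum to 1 (up to rounding)
```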
  • the MLP consists of two layers of nodes or neurons.
  • the first and input layer is a fully connected layer, while the second and output layer is also a fully connected layer.
  • instead of the MLP having two layers, any other suitable number of layers is possible. For example, there could be any suitable number of hidden layers between the input and output layers.
  • each node is a neuron that uses a nonlinear activation function.
  • MLPs use a supervised learning technique called backpropagation for training. MLPs can distinguish data that are not linearly separable.
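A minimal sketch of such a classifier head is given below: two dense layers whose last layer has one output per age suitability class, followed by softmax. The input dimensions (sizes of the single audio and image vector 15 and the single text vector 19), the hidden size and the number of classes are all assumptions for illustration.

```python
import torch
import torch.nn as nn

n_classes = 5                                             # number of age suitability classes (assumed)
single_av_dim, single_text_dim = 128, 64                  # sizes of vectors 15 and 19 (assumed)

mlp = nn.Sequential(                                      # two fully connected layers
    nn.Linear(single_av_dim + single_text_dim, 256), nn.ReLU(),
    nn.Linear(256, n_classes),                            # last dense layer sized to the classes
)

fused = torch.rand(1, single_av_dim + single_text_dim)    # concatenated audio, image and text vector 23
probs = torch.softmax(mlp(fused), dim=1)                  # class probability vector 25
print(probs, probs.sum())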
  • the training of the system is first briefly explained with reference to the block diagram of Figure 2 describing a training system 60.
  • the training may for instance follow the principles explained in the publication: D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," arXiv:1412.6980 [cs], Dec. 2014.
  • the system 1 or more specifically the artificial neural network system 61 consisting of the audio and image processing unit 13, the text processing unit 17 and the classifier unit 21 is trained by using labelled data, i.e. labelled videos 62, to classify video scenes into age suitability classes. From the labelled videos, the following streams are extracted: an image stream 63, an audio stream 65 and a text stream 67.
  • An age rating label 69 for the extracted streams is also extracted from the labelled video and fed to a loss function computation unit 71.
  • the system 61 outputs a class probability vector 73 (or an age rating probability prediction) for the received streams.
  • the class probability vector is then fed into the loss function computation unit 71, which is configured to compare the class probability vector 73 with the age rating label 69 assigned to the stream and thereby aims to minimise the loss function.
  • the output of the loss function computation is fed into a parameter optimisation unit 75, which is configured to optimise the parameters of the system 61 based on the output of the loss function computation.
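The sketch below shows one optimisation step of this loop. The trainable part of the system 61 is stood in for by a placeholder module, and both the cross-entropy loss and the Adam optimiser are assumptions (Adam is the cited reference above; the loss function is not specified in the description).

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(192, 256), nn.ReLU(), nn.Linear(256, 5))  # placeholder for the trainable parts
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)                 # Adam, per the cited reference
criterion = nn.CrossEntropyLoss()                                         # assumed classification loss

features = torch.rand(8, 192)                 # batch of fused feature vectors
labels = torch.randint(0, 5, (8,))            # age rating labels 69 for the batch

optimizer.zero_grad()
loss = criterion(model(features), labels)     # loss function computation (unit 71)
loss.backward()
optimizer.step()                              # parameter optimisation (unit 75)
print(loss.item())
```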
  • the proposed method is able to achieve high accuracy when training the system with a relatively small dataset of labelled videos by leveraging transfer learning. More specifically, high-level audio, image and text features are extracted from the audio, image and text inputs, respectively, by using pre-trained neural networks which achieve high accuracy on audio, image and text classification tasks. These pre-trained networks are used to extract temporal sequences of feature vectors from the audio, image and text inputs.
  • the present invention then makes use of deep learning sequence classification methods, such as one-dimensional CNNs (i.e. the CTTs) to classify these feature vector sequences into age suitability classes.
  • the CTTs 41, 47 and the MLP 51 need to be optimised for the specific task at hand, while the parameters used for optimal audio, image and text feature extraction can remain fixed (i.e. the image CNNs 29, the audio CNNs 33 and the word embedding matrices 43).
  • the full system 61 can be trained directly.
  • In step 101, a digital media file, which in this example is a video file 3, is received and read by the pre-processing unit 5.
  • In step 103, the pre-processing unit cuts or extracts three data streams, namely sequences of audio clips, image frames and text portions, from the received file.
  • the extracted streams are then stored in the pre-processing unit or in any suitable storage.
  • the image stream may be stored in any desired colour space format, such as the RGB colour format, as an image tensor of integers.
  • the image tensors as stored may have the size (number of pixels in the height direction, i.e. image height; number of pixels in the width direction, i.e. image width; number of colour space components).
  • the integer values of the colour space components would typically be between 0 and 255.
  • In step 105, the pre-processing unit processes the sequence of image frames. More specifically, in this step the image frames are cropped and/or resized so that they can be processed by the image CNNs 29.
  • the images may be cropped to remove any possible black bars around the images, e.g. at the top and bottom.
  • the images are resized to the input size of the pre-trained network, such that in this example the images have a size of (224, 224, 3).
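A sketch of this frame pre-processing follows. Only the target size (224, 224, 3) is taken from the description; the crop box, the use of torchvision transforms and the file name "frame_0001.png" are assumptions or placeholders.

```python
from PIL import Image
import torchvision.transforms as T

preprocess = T.Compose([
    T.CenterCrop(480),            # e.g. remove letterbox bars (crop size is an assumption)
    T.Resize((224, 224)),         # resize to the input size of the pre-trained network
    T.ToTensor(),                 # float tensor of shape (3, 224, 224) with values in [0, 1]
])

frame = Image.open("frame_0001.png").convert("RGB")   # hypothetical extracted frame
tensor = preprocess(frame)
print(tensor.shape)               # torch.Size([3, 224, 224])
```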
  • In step 107, which may be carried out in parallel with step 105, the sequence of audio clips is pre-processed. More specifically, the pre-processing unit 5 obtains a spectrogram for each audio clip in the sequence of audio clips.
  • the obtained spectrogram is a log mel spectrogram tensor. More specifically, step 107 may comprise at least some of the following operations:
  • the audio clips are resampled, in this example to 16 kHz.
  • a short-time Fourier transform is performed for each audio clip, in this example with a periodic Hann window of 25 ms and a hop of e.g. 10 ms, to obtain the spectrograms.
  • the spectrograms are mapped to mel bins, in this example to 64 bins, in the 125-7500 Hz range and a logarithmic function is applied to the mel spectrogram amplitude, optionally with a small offset of 0.01 for example to avoid the zero singularity.
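The following sketch computes such a log mel spectrogram. The numerical parameters (16 kHz sampling, 25 ms window, 10 ms hop, 64 mel bins in 125-7500 Hz, 0.01 offset) are taken from the description; the use of librosa and the file name "clip_0001.wav" are assumptions/placeholders.

```python
import numpy as np
import librosa

y, sr = librosa.load("clip_0001.wav", sr=16000)       # resample the audio clip to 16 kHz

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=400, win_length=400, hop_length=160,        # 25 ms Hann window, 10 ms hop at 16 kHz
    window="hann", n_mels=64, fmin=125, fmax=7500,    # 64 mel bins in the 125-7500 Hz range
)
log_mel = np.log(mel + 0.01)                          # log amplitude with a small 0.01 offset
print(log_mel.shape)                                  # (64, number_of_time_frames)
```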
  • In step 109, the image frame stream is fed into the image CNNs 29 and a sequence or set of image feature vectors is generated from the image stream by the image CNNs 29, such that in this example one image feature vector is generated per image frame.
  • In step 111, which may be carried out in parallel with step 109, the audio clip stream is fed into the audio CNNs 33, and a sequence or set of audio feature vectors is generated from the audio stream by the audio CNNs 33, such that in this example one audio feature vector is generated per audio clip.
  • In step 113, which may be carried out in parallel with steps 109 and 111, the text stream is fed into the word embedding matrices 43, and a sequence or set of text or word feature vectors is generated from the word stream by the word embedding matrices 43, such that in this example one word feature vector is generated per text portion or word. It is to be noted that in this example there is no need to pre-process the text stream, but depending on the capabilities of the word embedding matrices, pre-processing of the text may be performed if needed prior to carrying out step 113.
  • In step 115, a sequence or set of concatenated audio and image feature vectors 39 is generated from the sequence of audio feature vectors and from the sequence of image feature vectors by the first concatenation unit 37. This is carried out in this example so that a first audio feature vector of a first time instant is concatenated with a first image feature vector of the first time instant, a second audio feature vector of a second time instant is concatenated with a second image feature vector of the second time instant, etc.
  • In step 117, a single audio and image feature vector 15 is generated from the sequence of concatenated audio and image feature vectors by the first CTT network 41.
  • In step 119, which may be carried out in parallel with step 115 or 117, a single text feature vector 19 is generated from the sequence of text feature vectors by the second CTT network 47. In other words, only one text feature vector is generated, which describes all the text feature vectors in a given sequence or stream.
  • the second concatenation unit 49 (or the first concatenation unit 37) concatenates the single audio and image feature vector and the single text feature vector to obtain a single concatenated audio, image and text feature vector 23 describing the three data streams output by the pre-processing unit 5.
  • In step 123, the MLP 51 determines age suitability class probabilities by using the single concatenated audio, image and text feature vector 23. In other words, a vector of probabilities is computed or determined so that the number of entries in the vector equals the number of possible age suitability categories. In this manner, one probability value is allocated to each age suitability category. If there are for instance five different age suitability categories, then the probability vector could be for example [0.0, 0.1, 0.0, 0.8, 0.1].
  • In step 125, an age suitability class is assigned to the data stream under consideration.
  • This step may be carried out by the post-processing unit 27.
  • a sequence or stream classification is carried out. In practice this step may be implemented so that the highest probability value is selected from the probability vector and the assigned age suitability class is the class corresponding to that probability value.
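A minimal sketch of this selection is shown below. The probability values are the example given in the description above; the class names are illustrative MPAA-style labels and are not prescribed by the method.

```python
age_classes = ["G", "PG", "PG-13", "R", "NC-17"]   # illustrative class names
probability_vector = [0.0, 0.1, 0.0, 0.8, 0.1]     # example probability vector from the description

# Pick the class corresponding to the highest probability value.
best_index = max(range(len(probability_vector)), key=probability_vector.__getitem__)
print(age_classes[best_index])                     # -> "R"
```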
  • In step 127, the viewer age suitability class or classes is/are selected, or the selection is received by the post-processing unit 27. It is to be noted that this step may be carried out at any moment prior to carrying out step 129.
  • In step 129, it is determined whether or not the assigned age suitability class is compatible with the selection received in step 127.
  • More specifically, it is determined whether or not the assigned age suitability class is the same as the selected age suitability class or is within the range of the allowed age suitability classes for this user or viewer. In the affirmative, in step 131, the stream is displayed or played to the viewer and the process then continues in step 103. If the assigned class is not compatible with the selection, then in step 133 it is decided not to show the stream in question to the viewer. After this step, the process continues in step 103, where the following streams are extracted. The process may then be repeated as many times as desired, and it can be stopped at any moment.
  • the process continues to a second or next image frame and/or audio clip and includes these items and a given number of subsequent image frames, audio clips and words in the next sequence.
  • the length of this sequence may or may not be equal to t.
  • the third or next sequence would start with a third or next audio clip or image frame.
  • the process can be run so that once the first sequence of t audio clips and image frames has been processed, then the first audio clip or image frame of the following sequence would be the (t + 1)th audio clip or image frame.
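A short sketch of this non-overlapping windowing, with the item count chosen arbitrarily for illustration:

```python
t = 4
items = list(range(1, 13))                       # e.g. 12 image frames or audio clips
# After a first sequence of t items, the next sequence starts with the (t + 1)th item.
sequences = [items[i:i + t] for i in range(0, len(items), t)]
print(sequences)                                 # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
```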
  • multiple streams may first be processed before deciding whether or not the stream(s) should be displayed to the viewer.
  • the result of every determination may be stored in a database or memory; in this manner, only once a given number of streams (for example corresponding to the length of an entire film) have been assessed are step(s) 131 and/or 133 carried out.
  • a slightly modified film may be displayed to the viewer.
  • the system may also determine which one of the assessed streams contributes most to the incompatible age suitability class. In other words, the system may rank the assessed streams according to their contribution to the estimated age suitability class. Then, for example, the stream having the greatest contribution may be modified or its playback prevented for a particular scene.
  • an embodiment was described above for classifying multimedia content.
  • an artificial neural network was developed to detect directly and dynamically the age suitability of the multimedia content.
  • the invention thus provides a dynamic media or video filter, which can understand dynamically the suitability of content for age groups according to the criteria of the motion picture age classification of different countries. And this can be implemented per scene, multiple scenes or even for the entire film.
  • the system can be plugged into a video player or be part of it, and it allows a user to watch films with or without content incompatible with the user's class.
  • the proposed solution can be integrated into a parental control application in an electronic device capable of rendering multimedia files.
  • the audio, image and text streams can be fused at different points in the system; for example, the audio feature vectors and the image feature vectors could be fused only after the first CTT 41. It is also to be noted that it is not necessary to use all three streams for the content classification. Instead, only one or any two of the streams may be used for this purpose. However, the reliability of the end result would typically increase with the increasing number of considered streams.
  • the geographical location of the playback device and/or the viewer may also advantageously be taken into account when determining whether or not to play the media content to the viewer and how or in which format.
  • the geographical location may be determined by satellite positioning or by using an identifier or network address, such as an internet protocol (IP) address, of the playback device. If the playback device is a cellular communication device, such as a smart phone, the cell identity may also be used for this purpose. More specifically, the geographical location of the playback device (or the classification system, which may be part of the playback device) or the viewer may affect the age suitability class given for the media content. This means that the values of the probability vector may also depend on this geographical location.
  • the actual computing system may comprise a central processing unit (CPU), a graphical processing unit (GPU), a memory unit and a storage device to store digital files.
  • the parameters and the operating software of the neural network system are first loaded from the storage to the memory.
  • the multimedia item or input file, in this example comprising audio, image and text components, may be analysed and loaded from the storage to the memory.
  • the age rating predictions for the multimedia item are calculated by software modules running on the CPU and GPU and using the neural network system parameters.
  • the software modules are operable for (a) decomposing the input multimedia file into sequences of audio clips, image frames and words to be classified or rated, (b) pre-processing the input file into numerical tensors used as input for the neural network, (c) computing the age rating prediction for each tensor or for a sequence of tensors by applying the successive convolution, linear regression and activation function operations of the neural networks using the parameters loaded into the memory.
  • the final age ratings predictions are stored in the storage.
  • the modified or unmodified video may then be displayed, if so decided, on a display or screen, optionally alongside the predicted dynamic age ratings.
  • the display may for instance be directly or indirectly connected to the post-processing block or unit 27.
  • the displayed video can be filtered in real time to show only the sequences with suitable age ratings.
  • the content could be classified into user profile suitability classes.
  • a set of different user profiles may be created in the system and a respective user may be allocated a user profile. Users may be able to select by themselves their user profile or the system may automatically select the user profiles for the users based on some information including e.g. viewing preferences and/or the age of the users.
  • one parameter of the user profile may be e.g. the age of the user.
  • a given age class such as an adult age class, may include several profiles or sub-categories (e.g. one profile may accept violence to a certain extent but no pornographic content).

Abstract

The present invention relates to a method for classifying a media item into a user profile suitability class in a classification system comprising a set of pre-trained artificial neural networks. The method comprises: receiving (101) a media file; extracting (103) a first data stream and a second data stream from the media file, the first data stream comprising first media content of a first media content form, the second data stream comprising second media content of a second, different media content form, the first and second data streams being comprised in the media item; generating (109, 111, 113) a first sequence of first feature vectors describing the first media content; generating (109, 111, 113) a second sequence of second feature vectors describing the second media content; generating (117) at least a first single feature vector representing the first sequence of first feature vectors and the second sequence of second feature vectors, or generating (117) at least a first single feature vector representing at least the first sequence of first feature vectors and generating (119) a second single feature vector representing at least the second sequence of second feature vectors; generating (123) a probability vector at least from the first single feature vector, or at least from the first and second single feature vectors; and assigning (125) a user profile suitability class to the media item based on the probability vector.
PCT/IB2019/053484 2019-04-29 2019-04-29 Procédé de catégorisation de contenu multimédia dynamique WO2020222027A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US17/607,393 US11961300B2 (en) 2019-04-29 2019-04-29 Dynamic media content categorization method
EP19729353.3A EP3963503A1 (fr) 2019-04-29 2019-04-29 Procédé de catégorisation de contenu multimédia dynamique
PCT/IB2019/053484 WO2020222027A1 (fr) 2019-04-29 2019-04-29 Procédé de catégorisation de contenu multimédia dynamique

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2019/053484 WO2020222027A1 (fr) 2019-04-29 2019-04-29 Procédé de catégorisation de contenu multimédia dynamique

Publications (1)

Publication Number Publication Date
WO2020222027A1 true WO2020222027A1 (fr) 2020-11-05

Family

ID=66794039

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2019/053484 WO2020222027A1 (fr) 2019-04-29 2019-04-29 Procédé de catégorisation de contenu multimédia dynamique

Country Status (3)

Country Link
US (1) US11961300B2 (fr)
EP (1) EP3963503A1 (fr)
WO (1) WO2020222027A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020222027A1 (fr) * 2019-04-29 2020-11-05 Ecole Polytechnique Federale De Lausanne (Epfl) Procédé de catégorisation de contenu multimédia dynamique

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6650779B2 (en) * 1999-03-26 2003-11-18 Georgia Tech Research Corp. Method and apparatus for analyzing an image to detect and identify patterns
US7668376B2 (en) * 2004-06-30 2010-02-23 National Instruments Corporation Shape feature extraction and classification
US9135103B2 (en) * 2012-02-16 2015-09-15 Mentor Graphics Corporation Hybrid memory failure bitmap classification
US10516893B2 (en) * 2015-02-14 2019-12-24 Remote Geosystems, Inc. Geospatial media referencing system
BR102016007265B1 (pt) 2016-04-01 2022-11-16 Samsung Eletrônica da Amazônia Ltda. Método multimodal e em tempo real para filtragem de conteúdo sensível
US10909459B2 (en) * 2016-06-09 2021-02-02 Cognizant Technology Solutions U.S. Corporation Content embedding using deep metric learning algorithms
US10832129B2 (en) * 2016-10-07 2020-11-10 International Business Machines Corporation Transfer of an acoustic knowledge to a neural network
CN106778590B (zh) 2016-12-09 2020-07-17 厦门大学 一种基于卷积神经网络模型的暴恐视频检测方法
CN108229262B (zh) 2016-12-22 2021-10-15 腾讯科技(深圳)有限公司 一种色情视频检测方法及装置
US11461554B2 (en) * 2017-07-26 2022-10-04 Siuvo Inc. Semantic classification of numerical data in natural language context based on machine learning
US10262240B2 (en) * 2017-08-14 2019-04-16 Microsoft Technology Licensing, Llc Fast deep neural network training
US20190273510A1 (en) * 2018-03-01 2019-09-05 Crowdstrike, Inc. Classification of source data by neural network processing
CN111428088B (zh) 2018-12-14 2022-12-13 腾讯科技(深圳)有限公司 视频分类方法、装置及服务器
WO2020222027A1 (fr) * 2019-04-29 2020-11-05 Ecole Polytechnique Federale De Lausanne (Epfl) Procédé de catégorisation de contenu multimédia dynamique
US11720789B2 (en) * 2019-06-07 2023-08-08 Apple Inc. Fast nearest neighbor search for output generation of convolutional neural networks
KR20210087786A (ko) * 2020-01-03 2021-07-13 엘지전자 주식회사 인공지능 기반의 이퀄라이저 제어방법

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ALESSANDRO SUGLIA ET AL: "A Deep Architecture for Content-based Recommendations Exploiting Recurrent Neural Networks", PROCEEDINGS OF THE 25TH CONFERENCE ON USER MODELING, ADAPTATION AND PERSONALIZATION, UMAP '17, 1 January 2017 (2017-01-01), New York, New York, USA, pages 202 - 211, XP055656923, ISBN: 978-1-4503-4635-1, DOI: 10.1145/3079628.3079684 *
CHU WEI-TA ET AL: "A hybrid recommendation system considering visual information for predicting favorite restaurants", WORLD WIDE WEB, BALTZER SCIENCE PUBLISHERS, BUSSUM, NL, vol. 20, no. 6, 20 January 2017 (2017-01-20), pages 1313 - 1331, XP036305531, ISSN: 1386-145X, [retrieved on 20170120], DOI: 10.1007/S11280-017-0437-1 *
D. P. KINGMA; J. BA: "Adam: A Method for Stochastic Optimization", arXiv:1412.6980 [cs], December 2014 (2014-12-01)
SUHANG WANG ET AL: "What Your Images Reveal", WORLD WIDE WEB, INTERNATIONAL WORLD WIDE WEB CONFERENCES STEERING COMMITTEE, REPUBLIC AND CANTON OF GENEVA SWITZERLAND, 3 April 2017 (2017-04-03), pages 391 - 400, XP058327282, ISBN: 978-1-4503-4913-0, DOI: 10.1145/3038912.3052638 *
YASHAR DELDJOO ET AL: "Audio-visual encoding of multimedia content for enhancing movie recommendations", RECOMMENDER SYSTEMS, ACM, 2 PENN PLAZA, SUITE 701NEW YORKNY10121-0701USA, 27 September 2018 (2018-09-27), pages 455 - 459, XP058415609, ISBN: 978-1-4503-5901-6, DOI: 10.1145/3240323.3240407 *

Also Published As

Publication number Publication date
EP3963503A1 (fr) 2022-03-09
US11961300B2 (en) 2024-04-16
US20220207864A1 (en) 2022-06-30

Similar Documents

Publication Publication Date Title
CN109740670B (zh) 视频分类的方法及装置
CN109874053B (zh) 基于视频内容理解和用户动态兴趣的短视频推荐方法
CN108763325B (zh) 一种网络对象处理方法及装置
US10528821B2 (en) Video segmentation techniques
US20170177972A1 (en) Method for analysing media content
CN111428088A (zh) 视频分类方法、装置及服务器
CN109508406B (zh) 一种信息处理方法、装置及计算机可读存储介质
CN111708915B (zh) 内容推荐方法、装置、计算机设备和存储介质
CN111209440A (zh) 一种视频播放方法、装置和存储介质
CN112749608A (zh) 视频审核方法、装置、计算机设备和存储介质
CN111708941A (zh) 内容推荐方法、装置、计算机设备和存储介质
CN111222450A (zh) 模型的训练及其直播处理的方法、装置、设备和存储介质
CN111783712A (zh) 一种视频处理方法、装置、设备及介质
CN114339362B (zh) 视频弹幕匹配方法、装置、计算机设备和存储介质
CN114297439A (zh) 一种短视频标签确定方法、系统、装置及存储介质
US11636282B2 (en) Machine learned historically accurate temporal classification of objects
US11961300B2 (en) Dynamic media content categorization method
Jayanthiladevi et al. AI in video analysis, production and streaming delivery
CN108881950B (zh) 一种视频处理的方法和装置
CN112749660A (zh) 一种视频内容描述信息的生成方法和设备
CN112989115A (zh) 待推荐视频的筛选控制方法及装置
EP3772856A1 (fr) Identification de la partie d'introduction d'un contenu vidéo
WO2023169159A1 (fr) Procédé d'établissement de graphe d'événements et appareil associé
CN117156078B (zh) 一种视频数据处理方法、装置、电子设备及存储介质
CN110674783B (zh) 一种基于多级预测架构的视频描述方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19729353

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019729353

Country of ref document: EP

Effective date: 20211129