FR3137475A1

FR3137475A1 - Method and device for estimating the authenticity of audio or video content and associated computer program

Info

Publication number: FR3137475A1
Application number: FR2206573A
Authority: FR
Inventors: Matthieu DELMAS; Amine Kacete; Stéphane Paquelet
Original assignee: Fondation B Com
Current assignee: Fondation B Com
Priority date: 2022-06-29
Filing date: 2022-06-29
Publication date: 2024-01-05
Also published as: WO2024003016A1

Abstract

The present invention relates to a method for estimating the authenticity of audio or video content, the audio or video content being represented by a set of input values, the method comprising steps of: - determining, by means of a processing system (5) and on the basis of said set of input values, an intermediate vector (w) in accordance with a learned distribution, the processing system (5) being configured beforehand to produce, as output, output vectors distributed according to the learned distribution and designed so that a generator network produces content of the same type as the audio or video content when said output vectors are applied as input to this generator neural network, and - estimating a level of authenticity (m) of the audio or video content by applying classification means (20) to said intermediate vector (w).
The invention also relates to a device for estimating the authenticity of audio or video content and an associated computer program. Figure for the abstract: Fig. 1

Description

Method and device for estimating the authenticity of audio or video content and associated computer program

Technical field of the invention

The present invention relates to the technical field of determining the authenticity of data, and in particular the authenticity of audio or video content.

It relates in particular to a method and a device for estimating the authenticity of audio or video content, and to an associated computer program.

State of the art

Deepfaking (“hypertrucage” in French) is a technique that consists of modifying existing data, for example video data, in order to manipulate it and associate with it a message different from the one initially associated with it.

This technique is known in particular for reproducing a person's voice and making that person say invented things. Detecting manipulated data is therefore a major challenge.

For example, it is known to use artificial neural network structures to detect data that may have been falsified.

Using these artificial neural network structures makes this detection very effective. However, these structures are often trained on data manipulated in a particular way (for example, using a particular compression method for video data). As a result, as soon as the structural elements of the analyzed data (for example the shape or resolution of the data) change, the performance of the artificial neural network structures used drops sharply.

The article “ID-Reveal: Identity-aware Deepfake Video Detection” by D. Cozzolino, A. Rössler, J. Thies, M. Nießner and L. Verdoliva, 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 15088-15097, doi: 10.1109/ICCV48922.2021.01483, describes an implementation of artificial neural network structures that overcomes these changes in the structural elements of the analyzed data.

In that document, the neural network structures are not trained to detect traces of data manipulation but to analyze the expressions and movements of faces present in the images considered. Thus, by analyzing the expressions and movements of the faces and comparing them with expressions known to be authentic, the neural network structure concerned detects a possible falsification of the initial image (or video).

However, implementing such a neural network structure is very cumbersome and requires the use of large data formats. Furthermore, authentic data for the analyzed faces must be known in order to identify possible falsifications of the expressions and movements of these faces.

Presentation of the invention

In this context, the present invention proposes to improve the determination of the authenticity of audio or video content.

More particularly, the invention proposes a method for estimating the authenticity of audio or video content, the audio or video content being represented by a set of input values, the method comprising steps of:

- determining, by means of a processing system and on the basis of said set of input values, an intermediate vector in accordance with a learned distribution, the processing system being configured beforehand to produce, as output, output vectors distributed according to the learned distribution and designed so that a generator network produces content of the same type as the audio or video content when said output vectors are applied as input to this generator network, and

- estimating a level of authenticity of the audio or video content by applying classification means to said intermediate vector.

Thus, the use of an intermediate vector, which may have a reduced dimension compared with the dimension of the initial audio or video content, simplifies the analysis and shortens the analysis time. Furthermore, the use of this intermediate vector does not reduce the robustness of the level of authenticity obtained because, thanks to the use of the learned distribution, this intermediate vector contains only the relevant parameters needed to determine this level of authenticity.

Other non-limiting and advantageous features of the method according to the invention, taken individually or in any technically possible combination, are as follows:

- there is also provided, prior to the estimation step, a preliminary method for learning the distribution comprising steps of:

  - supplying, as input to a redistribution network, data distributed according to a random distribution, so as to obtain, as output, vectors distributed according to a distribution (initial or current), said vectors being supplied as input to the generator network so as to provide, as output, a set of learning values, and

  - training the redistribution network and the generator network so as to update said distribution (with a view to optimizing the distribution);

- there are also provided, during the determination step, sub-steps of:

  - supplying, as input to a generator network, said intermediate vector so as to obtain, as output from said generator network, a generated set of values, and

  - updating said intermediate vector so as to optimize a cost function representing a distance between the set of input values and said generated set of values,

  said updated intermediate vector being used as the intermediate vector during the estimation step;

- the updating step is implemented by a gradient descent method;

- there are provided, prior to the estimation step, steps of:

  - supplying, as input to the generator network, said updated intermediate vector so as to obtain, as output from said generator network, an updated generated set of values, and

  - updating said intermediate vector again so as to optimize a cost function representing a distance between the set of input values and said updated generated set of values;

- the determination step comprises supplying, as input to the processing system, said set of input values so as to obtain, as output from the processing system and on the basis of the learned distribution, said intermediate vector;

- the set of input values comprises a first number of values and the intermediate vector comprises a second number of values, the second number of values being strictly less than the first number of values;

- the second number of values is between one hundredth and one thirtieth of the first number of values;

- the second number of values is less than or equal to 512;

- there is provided, upstream of the estimation step, a step of training the classification means on a plurality of intermediate vectors, each associated with authentic or falsified audio or video content, such that estimating the level of authenticity for an intermediate vector associated with authentic audio or video content gives a first result and estimating the level of authenticity for an intermediate vector associated with falsified audio or video content gives a second result distinct from the first result;

- the set of input values is obtained by extracting part of the audio or video content; and

- the set of input values is formed of values associated with pixels of an image.
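
Purely by way of illustration, the sub-steps listed above (supplying the intermediate vector to a generator network, then updating that vector by gradient descent so as to reduce a distance between the input values and the generated values) can be sketched as follows. This is a minimal sketch under assumptions that do not come from the present description: the generator network is replaced by a toy linear map G, the cost function is taken to be the squared Euclidean distance, and the names generate, cost and invert, the step size and the number of iterations are all hypothetical.

```python
import random

def generate(G, w):
    """Toy linear 'generator' (assumption): maps a latent vector w to values G @ w."""
    return [sum(g_ij * w_j for g_ij, w_j in zip(row, w)) for row in G]

def cost(G, w, x_in):
    """Squared Euclidean distance between the generated values and the input values."""
    return sum((g - x) ** 2 for g, x in zip(generate(G, w), x_in))

def invert(G, x_in, steps=5000, lr=0.01, seed=0):
    """Update w by gradient descent so that generate(G, w) approaches x_in."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(len(G[0]))]  # initial intermediate vector
    for _ in range(steps):
        residual = [g - x for g, x in zip(generate(G, w), x_in)]
        # Gradient of the squared distance with respect to w: 2 * G^T @ residual
        grad = [2.0 * sum(G[i][j] * residual[i] for i in range(len(G)))
                for j in range(len(w))]
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad)]
    return w

rng = random.Random(1)
G = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(12)]  # 12 values, latent size 3
w_true = [0.5, -1.0, 2.0]
x_in = generate(G, w_true)   # input values that this toy generator can reproduce exactly
w_hat = invert(G, x_in)      # intermediate vector recovered by optimization
```

In this toy setting the input values lie exactly in the range of the generator, so the optimized vector approaches w_true. With a real, non-linear generator network the gradient would typically be obtained by automatic differentiation, and nothing guarantees an exact reconstruction.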

The present invention also relates to a device for estimating the authenticity of audio or video content, the audio or video content being represented by a set of input values, the device comprising:

- a processing system configured to determine, on the basis of said set of input values, an intermediate vector in accordance with a learned distribution, the processing system being configured to produce, as output, output vectors distributed according to the learned distribution and designed so that a generator network produces content of the same type as the audio or video content when said output vectors are applied as input to this generator network, and

- a module for estimating a level of authenticity of the audio or video content by applying classification means to said intermediate vector.

The present invention further relates to a computer program comprising instructions executable by a processor and designed to implement an estimation method as introduced above when these instructions are executed by the processor.

Of course, the various features, variants and embodiments of the invention can be combined with one another in various combinations insofar as they are not mutually incompatible or mutually exclusive.

Detailed description of the invention

In addition, various other features of the invention emerge from the appended description, given with reference to the drawings, which illustrate non-limiting embodiments of the invention and in which:

Figure 1 represents a set of elements involved in the method for estimating a level of authenticity according to the present invention,

Figure 2 represents a set of artificial neural networks involved in a preliminary learning method according to the present invention,

Figure 3 represents, in functional form, a device for estimating a level of authenticity, configured to implement a method for estimating a level of authenticity of audio or video content according to the invention,

Figure 4 represents, in flowchart form, an example of a method for estimating a level of authenticity of audio or video content according to the present invention,

Figure 5 represents, in flowchart form, an example of a preliminary learning method usable in the context of the invention, and

Figure 6 represents, in flowchart form, an example of a preliminary training method usable in the context of the invention.

Figure 1 represents an example of a set 1 of elements involved in the present invention. This set 1 here comprises a processing system 5 and classification means 20.

This set 1 of elements is configured to process input data. This input data is, for example, in the form of images. As will be seen below, these images are, for example, taken from audio or video content.

In what follows, each image is represented, using a plurality of pixels, by a plurality of components for each pixel. Generally, each image is represented by three color components for each pixel (a red component R, a green component G and a blue component B). In other words, each image is represented by a set of values associated with the different pixels forming it.

The processing system 5 here comprises an optimization block 10, a transmission block 12 and a generator network 34.

The processing system 5 has the characteristics of an encoder, that is to say it is configured to condense the information supplied to it as input.

In practice, from input data, here an image Im supplied as input to the processing system 5 (Figure 1), that is to say from a set of input values associated with that image, the processing system 5 is configured to provide, as output, an intermediate vector w of reduced size compared with the set of input values.

In other words, if the set of input values comprises a first number of values and the intermediate vector comprises a second number of values, the second number of values is here strictly less than the first number of values. For example, the second number of values is between one hundredth and one thirtieth of the first number of values. For example, the second number of values is less than 512.
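
As a worked illustration of these example size constraints, consider a hypothetical 128 x 128 RGB image (the image size is an assumption, not a value taken from the description). It is represented by 128 x 128 x 3 = 49152 input values, so one hundredth and one thirtieth of the first number of values are 491.52 and 1638.4 respectively. The short sketch below checks a candidate size for the intermediate vector against all of the example constraints at once; combining them in a single test, the function name latent_size_ok and the inclusive bound of 512 are choices made only for this sketch.

```python
def latent_size_ok(n_input, n_latent, max_latent=512):
    """Check a candidate latent size against the example constraints of the description."""
    strictly_smaller = n_latent < n_input                      # fewer values than the input
    in_ratio_band = n_input / 100 <= n_latent <= n_input / 30  # between 1/100 and 1/30
    return strictly_smaller and in_ratio_band and n_latent <= max_latent

n_input = 128 * 128 * 3  # hypothetical RGB image: three components per pixel

fits_512 = latent_size_ok(n_input, 512)    # inside the band and within the 512 bound
fits_256 = latent_size_ok(n_input, 256)    # below one hundredth of the input size
fits_1024 = latent_size_ok(n_input, 1024)  # inside the band but above the 512 bound
```

For this hypothetical image size, a 512-value intermediate vector satisfies every example constraint, whereas 256 values fall below one hundredth of the input size and 1024 values exceed the bound of 512.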

Thus, the processing system 5 provides, as output, an intermediate vector w represented in a particular space called the “latent space”. The intermediate vector w is also called the “latent vector”. This latent vector therefore corresponds to a new representation of the image supplied as input to the processing system 5. This new representation retains the most relevant information of the input image Im, that is to say the information that best characterizes this image Im, with a much lower dimension than that of the image Im.

In general, the processing system 5 implements an inversion function associated with a generation function.

This inversion function is, for example, implemented by the processing system 5 using the optimization block 10, according to optimization steps described below. The inversion function is, for example, of the type presented in the article “Image2StyleGAN++: How to Edit the Embedded Images?” by Abdal R., Qin Y. and Wonka P., 10.48550/ARXIV.1911.11544, 2019.

The generation function is implemented by the generator network 34 described below.

As shown in Figure 1, the set 1 of elements also comprises the classification means 20. These classification means 20 are based here, for example, on the use of an algorithm commonly known as a “random forest” (“forêt d’arbres décisionnels” in French). More details on random forests can be found in the article “Random forests” by Breiman L., in Machine Learning, 45(1), 5-32, 2001.

Thus, as shown in Figure 1, from the intermediate vector w received as input, the classification means 20 estimate a level m of authenticity of the set of input values associated with the image Im. This level m of authenticity takes, for example, a first result m₁ when the set of input values is associated with authentic audio or video content. It takes, for example, a second result m₂ when the set of input values is associated with falsified audio or video content. The second result m₂ is distinct from the first result m₁.

Alternatively, the classification means may comprise an artificial neural network configured to receive, as input, the intermediate vector w and to provide, as output, an estimate of the level of authenticity of the audio or video content associated with the intermediate vector w (supplied as input).
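
By way of a rough illustration of classification means that return distinct results for authentic and falsified content, the sketch below trains a bagged ensemble of one-split decision stumps on labeled intermediate vectors. It is a deliberately simplified stand-in for a random forest, not the algorithm of the cited article. Everything in it is an assumption made for the example: the function names, the four-dimensional vectors, the label convention (1 for authentic, 0 for falsified) and, above all, the synthetic training data, which bears no relation to real latent vectors.

```python
import random

def train_stump(X, y, feature):
    """One-split tree on a single feature; threshold halfway between the class means."""
    real = [x[feature] for x, label in zip(X, y) if label == 1]
    fake = [x[feature] for x, label in zip(X, y) if label == 0]
    mean_real, mean_fake = sum(real) / len(real), sum(fake) / len(fake)
    return feature, (mean_real + mean_fake) / 2, mean_real > mean_fake

def predict_stump(stump, w):
    """Return 1 (authentic) or 0 (falsified) for the intermediate vector w."""
    feature, threshold, real_is_above = stump
    return 1 if (w[feature] > threshold) == real_is_above else 0

def train_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and on one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    while len(forest) < n_trees:
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample (with replacement)
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # resample if one of the two classes is missing
        forest.append(train_stump(Xb, yb, rng.randrange(len(X[0]))))
    return forest

def authenticity_level(forest, w):
    """Fraction of stumps voting 'authentic' for the intermediate vector w."""
    return sum(predict_stump(s, w) for s in forest) / len(forest)

# Synthetic, purely illustrative training vectors (assumption): two separated clusters.
rng = random.Random(42)
authentic = [[rng.gauss(1.0, 0.3) for _ in range(4)] for _ in range(20)]
falsified = [[rng.gauss(-1.0, 0.3) for _ in range(4)] for _ in range(20)]
X, y = authentic + falsified, [1] * 20 + [0] * 20

forest = train_forest(X, y)
m_authentic = authenticity_level(forest, [1.0] * 4)    # vector near the authentic cluster
m_falsified = authenticity_level(forest, [-1.0] * 4)   # vector near the falsified cluster
```

Here the level of authenticity is expressed as a vote fraction, so the two results play the roles of m₁ and m₂ only in the sense that they differ for the two classes. A practical implementation would more likely rely on an established random forest library and on intermediate vectors obtained from real authentic and falsified content.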

De manière avantageuse selon l’invention, le système de traitement 5 est configuré pour déterminer le vecteur intermédiaire w conformément à une distribution p_wapprise au préalable. Cette distribution p_west apprise préalablement à la mise en œuvre du procédé d’estimation conforme à l’invention, selon un procédé préliminaire d’apprentissage décrit ci-après.Advantageously according to the invention, the processing system 5 is configured to determine the intermediate vector w in accordance with a distribution p _w learned beforehand. This distribution p _w is learned before implementing the estimation method according to the invention, according to a preliminary learning method described below.

Ce procédé préliminaire d’apprentissage (qui sera décrit ultérieurement en référence à la ) est mis en œuvre par l’intermédiaire d’une structure 35 de réseaux de neurones artificiels. Cette structure 35 de réseaux de neurones artificiels est représentée sur la .This preliminary learning process (which will be described later with reference to the ) is implemented via a structure 35 of artificial neural networks. This structure 35 of artificial neural networks is represented on the .

As shown in that figure, the structure 35 of artificial neural networks comprises a so-called redistribution network 32 (or "mapping network") and the generator network 34 (included in the processing system 5 introduced previously).

The redistribution network 32 and the generator network 34 are, for example, of the type used in StyleGAN generative adversarial networks. More details on StyleGAN-type networks can be found in the article "Analyzing and Improving the Image Quality of StyleGAN" by Karras T., Laine S., Aittala M., Hellsten J., Lehtinen J. and Aila T., 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8107-8116, 2020.

In practice, the redistribution network 32 is, for example, of the multilayer perceptron type, here with eight layers.

As shown in the corresponding figure, the redistribution network 32 receives, as input, data z distributed according to a random distribution. This redistribution network 32 is then configured to provide, as output, a latent vector u represented in a predetermined latent space.
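The role of the redistribution network 32 can be illustrated with a minimal sketch. It assumes an eight-layer fully connected network with random, untrained weights, a leaky-ReLU activation and a latent size of 16; none of these values come from the description beyond the layer count, and the names `make_mapping_network` and `mapping_network` are hypothetical.

```python
import math
import random


def make_mapping_network(dim, n_layers=8, seed=0):
    """Build random weights for a small fully connected network (illustrative sizes)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    layers = []
    for _ in range(n_layers):
        weights = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
        biases = [0.0] * dim
        layers.append((weights, biases))
    return layers


def leaky_relu(value, slope=0.2):
    """Activation assumed here for illustration only."""
    return value if value >= 0.0 else slope * value


def mapping_network(z, layers):
    """Map random input data z to a latent vector u, one layer at a time."""
    u = list(z)
    for weights, biases in layers:
        u = [leaky_relu(sum(w * x for w, x in zip(row, u)) + b)
             for row, b in zip(weights, biases)]
    return u


layers = make_mapping_network(dim=16)
rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(16)]  # data z drawn from a random (here Gaussian) distribution
u = mapping_network(z, layers)                # latent vector u, same dimension as z in this sketch
```

In an actual system the weights would result from the training phase described below rather than from random initialization.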

The generator network 34 is an artificial neural network which here comprises a plurality of layers. Each of these layers is, for example, a convolution layer combined with a normalization function and a non-linearity function.

In practice, the structure 35 of artificial neural networks is built from nodes forming the different layers of the redistribution network 32 and of the generator network 34. These nodes are characterized by a plurality of weighting coefficients which can be adjusted, before the method for estimating the level of authenticity is itself implemented, through a training phase described later as part of the preliminary learning method.

As shown in the corresponding figure, the generator network 34 is configured to receive, as input, the latent vector u obtained at the output of the redistribution network 32 and to provide, as output, a set of values x (associated with an image).

The present invention relates more particularly to the estimation of the level m of authenticity of audio or video content. To this end, the invention also relates to a device 50 for estimating the authenticity of the audio or video content. The corresponding figure represents, in functional form, such an estimation device 50 configured to implement the invention.

This estimation device 50 comprises a control unit 52 provided with a processor 54 and a memory 56.

This estimation device 50 is designed to implement a set of functional modules and blocks. For example, it comprises an estimation module, the optimization block 10 and the transmission block 12. Each of these modules and blocks is, for example, implemented by means of computer program instructions stored in the memory 56 of the control unit 52 and designed to implement the module concerned when these instructions are executed by the processor 54 of the control unit 52.

The corresponding figure is a flowchart representing an example of a method for estimating the authenticity of audio or video content, implemented in the context described above.

As indicated previously, before the estimation method is implemented, a preliminary learning method is carried out in order to learn the distribution p_w (which then allows the intermediate vector w to be determined).

The corresponding figure is a flowchart representing an example of the preliminary learning method. In practice, this preliminary learning method makes it possible to determine the weighting coefficients associated with the plurality of nodes of the structure 35 of artificial neural networks.

As shown in this figure, the preliminary method comprises a first step (step E50) of initializing the weighting coefficients of the structure 35 of artificial neural networks to initial values. These initial values are, for example, random values.

The preliminary learning method then continues in step E52, during which data z distributed according to a random distribution are provided as input to the structure 35 of artificial neural networks. The vector u is obtained at the output of the redistribution network 32. A set of learning values x is then obtained at the output of the structure 35 of artificial neural networks.

The method then continues in step E54, which is a step of training the redistribution network 32 and the generator network 34. In general terms, this step aims to update the weighting coefficients of the structure 35 of artificial neural networks (more particularly the respective weighting coefficients of the redistribution network 32 and of the generator network 34) so that the content obtained at the output of the structure 35 comes as close as possible to reference content (here, reference content known to be authentic). In practice, this reference content may comprise images (verified as authentic) from a database.

This training step is based, for example, on the optimization of a so-called progressive adversarial cost function as described in the article "Progressive Growing of GANs for Improved Quality, Stability, and Variation" by Karras T., Aila T., Laine S. and Lehtinen J., doi: 10.48550/ARXIV.1710.10196, 2017.

Finally, the weighting coefficients are modified so as to minimize the cost function, for example over several iterations, until the optimal weighting coefficients that actually minimize the cost function are obtained. These optimal weighting coefficients then make it possible to obtain vectors u distributed according to a certain determined distribution p_w (called the "learned distribution p_w").

Alternatively, the optimization of the cost function can be carried out over a predetermined number of iterations. In this case, the weighting coefficients obtained are not necessarily optimal but form a reasonable approximation of the optimal ones, which also makes it possible to obtain vectors u distributed according to a suitable distribution p_w (and therefore a good approximation of the learned distribution described above). Such a variant notably makes it possible to obtain this distribution more quickly.

This learned distribution p_w is then used in the method for estimating the authenticity of audio or video content described below.

The method for estimating the authenticity of audio or video content is therefore implemented after the preliminary learning method.

As shown in the corresponding flowchart, the estimation method begins in step E2, during which the control unit 52 receives audio or video content. This is the audio or video content for which it is desired to determine whether it is authentic or whether it has been falsified (totally or in part).

To this end, in step E4, the processor 54 determines a set of input values associated with the audio or video content. This set of input values is obtained by extracting part of the audio or video content.

More precisely, in the case of video content (that is to say, content comprising at least one image, possibly a sequence of images), the content is broken down into a plurality of images according to well-known methods (for example by decomposing the video content into a sequence of still frames taken at regular time intervals, each frame then corresponding to an image).
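The regular-interval decomposition can be sketched as an index computation. The function name `frame_indices` and its parameters (frame count, frame rate, sampling interval in seconds) are assumptions made for illustration; decoding the video itself is left to whatever library is available.

```python
def frame_indices(n_frames, fps, interval_s):
    """Indices of the frames kept when one still image is taken every interval_s seconds."""
    if n_frames <= 0 or fps <= 0 or interval_s <= 0:
        raise ValueError("n_frames, fps and interval_s must be positive")
    # Never step by less than one frame, even for very short intervals.
    step = max(1, round(fps * interval_s))
    return list(range(0, n_frames, step))


# A 4-second clip at 25 frames per second, sampled once per second.
kept = frame_indices(n_frames=100, fps=25, interval_s=1.0)
```

Each retained index then designates one image to which the face detection, cropping and alignment described next are applied.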

In the case of audio content, the sound components can be transcribed in the form of components of an image, as described for example in the article "Learning and controlling the source-filter representation of speech with a variational autoencoder" by Sadok, S., Leglaive, S., Girin, L., Alameda-Pineda, X. and Séguier, R., 10.48550/ARXIV.2204.07075, 2022. According to this article, the sound components are processed in the form of spectrograms, which are graphs representing frequency as a function of time. On these spectrograms, a color scale reflects the intensity of each sound component. The intensity of each sound component, represented by a color, can therefore be used as the pixel value of an image (which then corresponds to the image used below).
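As a rough illustration of how sound intensities can become pixel values, the sketch below computes a small magnitude spectrogram with a plain discrete Fourier transform and scales it to 8-bit values. The frame length, hop size, Hann window and logarithmic scaling are all assumptions chosen for the example, not parameters taken from the cited article.

```python
import cmath
import math


def spectrogram_image(samples, frame_len=64, hop=32):
    """Return image[k][t]: 8-bit intensity of frequency bin k in time frame t."""
    if len(samples) < frame_len:
        raise ValueError("need at least one full frame of samples")
    n_bins = frame_len // 2 + 1
    columns = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Hann window to limit spectral leakage.
        windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1)))
                    for n, x in enumerate(frame)]
        column = []
        for k in range(n_bins):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(windowed))
            column.append(math.log1p(abs(acc)))  # compress the dynamic range
        columns.append(column)
    peak = max(max(col) for col in columns) or 1.0
    # Rows are frequency bins, columns are time frames, values lie in 0..255.
    return [[round(255 * columns[t][k] / peak) for t in range(len(columns))]
            for k in range(n_bins)]


# A pure tone falling exactly on frequency bin 8 of a 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
image = spectrogram_image(tone)
```

For this test tone the brightest row of `image` is the one for bin 8, which is the kind of structure a downstream image-based pipeline would then see.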

Thus, whether the initial content received in step E2 is audio or video content, the method for estimating its authenticity is based on input data formed from an image formatted, for example, in the manner described below.

On each image, the processor 54 first detects the presence of the face of at least one person. This is done, for example, by identifying certain features that characterize a face.

Then, once the face has been detected in the image, the processor 54 extracts the region of the image around the detected face. It "crops" the image so as to keep only the part of the image containing the detected face.
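The cropping step can be sketched as follows, assuming that a face detector (not shown) has already returned a bounding box. The box format `(top, left, height, width)`, the `margin` parameter and the name `crop_face` are hypothetical choices for this example.

```python
def crop_face(image, box, margin=0.2):
    """Crop a 2D image (list of pixel rows) around a face box (top, left, height, width)."""
    top, left, height, width = box
    pad_y, pad_x = int(height * margin), int(width * margin)
    # Clamp the enlarged box to the image borders.
    y0 = max(0, top - pad_y)
    y1 = min(len(image), top + height + pad_y)
    x0 = max(0, left - pad_x)
    x1 = min(len(image[0]), left + width + pad_x)
    return [row[x0:x1] for row in image[y0:y1]]


# A 10x10 test image whose pixel value encodes its position (10 * row + column).
picture = [[10 * r + c for c in range(10)] for r in range(10)]
face = crop_face(picture, box=(4, 4, 2, 2), margin=0.5)
```

Keeping a small margin around the box is a common precaution so that the later alignment step still has context around the face; the description does not specify one.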

Next, the processor 54 adapts the part of the image containing the face so as to align the face. This consists in locating characteristic points of the face in order to identify the geometric structure of the face concerned.

More precisely, the processor 54 relies on the elements of the face that carry the most semantic information (for example the eyes, the nose or the mouth) in order to determine the geometry of the components of the face. This makes it possible to represent the "non-rigid" elements of the face reliably. The processor 54 then determines a modified image in which the face is "aligned".
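One simple way to align a face from characteristic points is to compute the similarity transform (rotation, scale, translation) that sends the two eye landmarks to fixed target positions. The sketch below is only an illustration of that idea: the description does not say which alignment technique is used, and the target positions, output size and function names are assumptions.

```python
def similarity_from_eyes(left_eye, right_eye, size=256):
    """Return (a, b) such that p -> a * p + b maps the eyes onto fixed target positions.

    Points are handled as complex numbers x + 1j * y, so the complex factor a
    encodes both rotation and scale, and b encodes the translation.
    """
    src_l, src_r = complex(*left_eye), complex(*right_eye)
    if src_l == src_r:
        raise ValueError("eye landmarks must be distinct")
    # Hypothetical target positions inside a size x size aligned image.
    dst_l = complex(0.35 * size, 0.40 * size)
    dst_r = complex(0.65 * size, 0.40 * size)
    a = (dst_r - dst_l) / (src_r - src_l)
    b = dst_l - a * src_l
    return a, b


def apply_transform(point, a, b):
    """Apply the similarity transform to an (x, y) point."""
    p = a * complex(*point) + b
    return (p.real, p.imag)


a, b = similarity_from_eyes(left_eye=(100, 120), right_eye=(160, 140))
aligned_left = apply_transform((100, 120), a, b)
aligned_right = apply_transform((160, 140), a, b)
```

Resampling the image pixels under this transform would then yield the modified, "aligned" image.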

Finally, at the end of this step E4, the processor 54 breaks the modified image down into a set of input values associated with the different pixels forming it. As defined previously, the set of input values is formed of the values associated with the different pixels of the image (here, the modified image).

As shown in the corresponding flowchart, the estimation method continues in step E6. During this step, the processor 54 provides, as input to the processing system 5, and more particularly as input to the optimization block 10, the set of input values obtained in step E4. The processing system 5 then provides, as output, an initial vector w₀. This initial vector w₀ is, for example, determined randomly.

Alternatively, in order to accelerate the optimization process described below, the initial vector can be determined in the manner described in the article "In-Domain GAN Inversion for Real Image Editing" by Zhu, Jiapeng, Shen, Yujun, Zhao, Deli and Zhou, Bolei, 10.48550/ARXIV.2004.00049, 2020.

In order to obtain the best possible estimate of the level of authenticity of the audio or video content received, the estimation method comprises a process for optimizing the intermediate vector, described below by steps E8 to E20. The method for optimizing the intermediate vector w is here, for example, an iterative optimization method by gradient descent.

To this end, the estimation method comprises a step E8 of initializing an index i to the value 0. This index i designates the current iteration of the optimization method. During this step, the processor 54 also initializes the intermediate vector. The initial vector w₀ determined in step E6 is used as the initialization value.

In step E10, as also illustrated in the corresponding figure, for the current iteration, the processor 54 provides, as input to the generator network 34, the current value of the intermediate vector w_i (for the first iteration, this is therefore the initial vector w₀). The generator network 34 then provides, as output, a generated set of values corresponding to this current intermediate vector w_i. In other words, the generator network 34 provides, as output, the image Y_i corresponding to the current intermediate vector w_i.

The two sets of values, namely the set of input values and the generated set of values, are then compared by the optimization block 10 in order to evaluate whether the current intermediate vector w_i forms a relevant representation of the initial image Im derived from the audio or video content received in step E2.

More precisely, in step E12, the processor 54 determines a cost function L. This cost function L depends on the set of input values (step E4) and on the generated set of values (step E10). It makes it possible to quantify the difference between the image received as input (in step E2) and the image generated from the current intermediate vector w_i.

In practice, during this step E12, the cost function L is here represented by a distance between the set of input values and the generated set of values.

The cost function L is expressed in the following form:

where Im is the image received as input (in step E2), Y_i is the image generated from the intermediate vector w_i, λ is a parameter that varies depending on the application, N is the total number of pixels in each image Im or Y_i, the notation ||Y_i, Im||₂ represents the Euclidean distance between the two elements Im and Y_i, and L_percept is a function defined by the following expression:

where the notation vgg₁₆ corresponds to the application of an artificial neural network of the VGG16 type to the image Im or to the image Y_i, the result obtained for each image being, for example, an associated vector of 4096 values.
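The two expressions referred to above are not reproduced in this text. As a hedged reconstruction only, one form consistent with the symbols defined (Im, Y_i, λ, N, the Euclidean distance and vgg₁₆) is the pixel-plus-perceptual loss commonly used for StyleGAN inversion. The placement of λ and of the 1/N factor below is an assumption and is not taken from the source:

```latex
L(w_i, x) = \frac{\lambda}{N}\,\lVert Y_i, \mathrm{Im} \rVert_2 + L_{\mathrm{percept}}(Y_i, \mathrm{Im}),
\qquad
L_{\mathrm{percept}}(Y_i, \mathrm{Im}) = \lVert \mathrm{vgg}_{16}(Y_i), \mathrm{vgg}_{16}(\mathrm{Im}) \rVert_2
```

Under this reading, the first term compares the images pixel by pixel, while the second compares the two 4096-value feature vectors produced by the VGG16 network.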

As shown in the corresponding flowchart, the method continues in step E14, during which the processor 54 evaluates whether the generated set of values optimizes the cost function L. For example, the processor 54 here evaluates whether the distance between the set of input values and the generated set of values is sufficiently small. This distance measures, for example, the mean squared difference between the set of input values and the generated set of values. To do this, the processor 54 compares this distance with a predetermined value ε.

If, in step E16, the cost function (that is to say, here, the distance between the set of input values and the generated set of values) is greater than this predetermined value ε, this means that the generated set of values, and therefore the current intermediate vector w_i concerned, do not form a relevant representation of the image derived from the audio or video content. The method then continues in step E18. During this step, the processor 54 updates the current intermediate vector w_i on the basis of the cost function determined. More precisely, the updated intermediate vector w_act is given by the following expression:

where w_act is the updated intermediate vector, w_i is the current intermediate vector, η is a constant, L(w_i, x) is the cost function depending on the set of input values and on the generated set of values, and ∇ is the gradient operator.
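The update expression itself is not reproduced in this text. Given that the method is described as gradient descent with a constant η and a gradient operator, the standard step consistent with these symbols would be the following (a reconstruction, with the gradient assumed to be taken with respect to the intermediate vector):

```latex
w_{\mathrm{act}} = w_i - \eta \, \nabla_{w} L(w_i, x)
```

Here η plays the role of the step size: a larger value moves the vector further at each iteration, at the risk of overshooting the minimum.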

Then, in step E20, the index i is incremented. The updated intermediate vector w_act becomes the current value of the intermediate vector. A new iteration is then carried out and the method resumes at step E12.

On the other hand, if, in step E16, the cost function L is less than the predetermined value ε, the current intermediate vector w_i can be considered as corresponding to a relevant representation of the input image. The current intermediate vector w_i is therefore used for the remainder of the estimation method. This optimized intermediate vector (used for the remainder of the estimation method) is denoted "intermediate vector w".
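The iteration over steps E8 to E20 can be sketched on a toy problem. Everything specific here is an assumption made so that the example runs on its own: the generator is replaced by a fixed linear map `G` (a real generator network 34 would be a trained convolutional network), the cost is only the mean squared pixel difference (the perceptual term is omitted), and `max_iter` is a safety bound that the description does not mention.

```python
def generate(G, w):
    """Toy stand-in for the generator network: a fixed linear map from w to pixel values."""
    return [sum(g * v for g, v in zip(row, w)) for row in G]


def cost_and_gradient(G, w, target):
    """Mean squared pixel difference and its gradient with respect to w."""
    y = generate(G, w)
    n = len(target)
    residual = [a - b for a, b in zip(y, target)]
    cost = sum(r * r for r in residual) / n
    grad = [2.0 / n * sum(G[p][k] * residual[p] for p in range(n))
            for k in range(len(w))]
    return cost, grad


def invert(G, target, w0, eta=0.1, eps=1e-6, max_iter=10_000):
    """Gradient descent on the intermediate vector until the cost falls below eps."""
    w, i = list(w0), 0                                   # step E8: i = 0, w = w0
    while True:
        cost, grad = cost_and_gradient(G, w, target)     # steps E10 and E12
        if cost < eps or i >= max_iter:                  # steps E14 and E16
            return w, cost, i
        w = [wk - eta * gk for wk, gk in zip(w, grad)]   # step E18: w_act = w_i - eta * grad
        i += 1                                           # step E20


G = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]    # 4 "pixels", 2 latent dimensions
w_true = [0.5, -1.0]
target = generate(G, w_true)                             # plays the role of the image Im
w, final_cost, n_iter = invert(G, target, w0=[0.0, 0.0])
```

On this toy problem the loop recovers a vector close to `w_true`; with a real, non-linear generator the same loop structure applies, but convergence to a given ε is not guaranteed, which is why a bound on the number of iterations is a reasonable addition.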

This intermediate vector w is determined in accordance with the distribution p_w learned during the preliminary learning method described above. Thanks to that preliminary learning method, the intermediate vector obtained is such that, when it is supplied as input to the generator network 34, the latter provides, as output, content of the same type (for example an image representing a person with the same identity, pose and expression as in the reference image).

In other words, the generator network 34 makes it possible to obtain, from an intermediate vector w following the learned distribution p_w, plausible content (with respect to reference content known to be authentic).

Thus, the processing system 5 as designed can be seen as an encoder that produces intermediate vectors distributed according to a previously learned distribution p_w. Advantageously, applying the generator network 34 to the intermediate vector w then makes it possible to substantially recover the input data.
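By way of illustration only, the optimization of the intermediate vector (steps E8 to E20) can be sketched as follows. This is a minimal sketch and not the implementation of the invention: the generator network 34 is replaced by a hypothetical fixed linear map `G`, the cost function L is assumed to be a mean squared distance, and the names `generate`, `cost`, `optimize_latent`, `lr` and `max_iter` are illustrative assumptions. Only the stopping rule (L below the predetermined value ε) and the use of a gradient descent are taken from the text.

```python
def generate(w, G):
    """Toy stand-in for generator network 34 (assumption): x = G @ w."""
    return [sum(g * w_j for g, w_j in zip(row, w)) for row in G]


def cost(x, x_gen):
    """Assumed cost function L: mean squared distance between value sets."""
    return sum((a - b) ** 2 for a, b in zip(x, x_gen)) / len(x)


def optimize_latent(x, G, w0, eps=1e-6, lr=0.1, max_iter=10_000):
    """Update w by gradient descent until L < eps (or max_iter is reached)."""
    w = list(w0)
    n = len(x)
    for _ in range(max_iter):
        x_gen = generate(w, G)
        if cost(x, x_gen) < eps:  # stopping condition tested in step E16
            break
        residual = [b - a for a, b in zip(x, x_gen)]  # x_gen - x
        # Analytic gradient of the mean squared distance for a linear G.
        grad = [
            2.0 / n * sum(G[i][j] * residual[i] for i in range(n))
            for j in range(len(w))
        ]
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad)]
    return w, cost(x, generate(w, G))


# Hypothetical data: 4 input values, intermediate vector of dimension 2.
G = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
x = generate([0.5, -1.0], G)  # input values to be reconstructed
w, final_cost = optimize_latent(x, G, w0=[0.0, 0.0])
```

With a real generator network the gradient would typically be obtained by automatic differentiation rather than by the closed-form expression used here.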

Then, in step E22, the transmission block 12 transmits this intermediate vector w to the classification means 20 (here under the control of the optimization block 20 when the optimization stopping condition is met), for the implementation of step E24.

As shown in the figure, the method continues with step E24 of estimating the level of authenticity of the audio or video content received in step E2.

To this end, the processor 54 supplies the intermediate vector w, transmitted in step E22, to the classification means 20. The classification means 20 then estimate the level of authenticity m of the set of input values associated with the image taken from the audio or video content received in step E2. The classification means 20 then output either a first result m₁ or a second result m₂, the two results being distinct from each other. Here, the first result m₁ is interpreted, for example, as indicating that the analyzed audio or video content is authentic, while the second result m₂ means that the audio or video content received in step E2 has been falsified.

The classification means 20 are trained prior to the implementation of step E24 according to a preliminary training method. A flowchart in the figures represents an example of this preliminary training method. This preliminary training method is, for example, implemented before the estimation method (therefore upstream of step E2).

This preliminary training method is implemented on audio or video content whose “authentic” or “falsified” status is known. Such content is taken, for example, from the FaceForensics++ database.

As the figure shows, this method begins with a step E70 of initializing the node values of the classification means 20. These initial values are, for example, random values.

The preliminary training method then continues with step E72. During this step, the processor 54 receives reference audio or video content whose status (authentic or falsified) is known.

Then, in a manner similar to step E4 described above, the processor 54 breaks the audio or video content down into a plurality of images and determines, for each image, a set of reference input values (step E74).

In step E76, the processor 54 supplies the set of reference input values as input to the processing system 5. In a manner similar to step E6 described above, a reference intermediate vector is determined at the end of this step E76.

This reference intermediate vector is then optimized during steps E78 to E90 in the same way as in steps E8 to E20 described above.

In step E92, the processor 54 supplies the (updated) reference intermediate vector as input to the classification means 20.

Then, in step E94, the processor 54 adjusts the node values of the classification means so that the estimation of the level of authenticity for a reference intermediate vector associated with authentic audio or video content gives the first result m₁, and so that the estimation of the level of authenticity for a reference intermediate vector associated with falsified audio or video content gives the second result m₂, distinct from the first result m₁.
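By way of illustration only, steps E70 to E94 can be sketched with a very simple classifier. This is a sketch under stated assumptions and not the classification means 20 of the invention: the text does not specify their architecture, so a logistic model trained by stochastic gradient descent is assumed here. The names `train_classifier`, `estimate`, `lr` and `epochs`, the label encoding (1 for authentic, 0 for falsified) and the two-dimensional reference vectors are all hypothetical.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def train_classifier(vectors, labels, lr=0.5, epochs=500, seed=0):
    """Fit an assumed logistic model on reference intermediate vectors.

    labels: 1 for authentic content (result m1), 0 for falsified (m2).
    """
    rng = random.Random(seed)
    # Step E70: node values initialized to (small) random values.
    weights = [rng.uniform(-0.01, 0.01) for _ in vectors[0]]
    bias = 0.0
    for _ in range(epochs):
        for w, y in zip(vectors, labels):
            p = sigmoid(sum(a * b for a, b in zip(weights, w)) + bias)
            err = p - y  # gradient of the log-loss w.r.t. the score
            # Step E94: adjust node values to separate the two classes.
            weights = [a - lr * err * b for a, b in zip(weights, w)]
            bias -= lr * err
    return weights, bias


def estimate(w, weights, bias):
    """Return 'm1' (authentic) or 'm2' (falsified) for a vector w."""
    p = sigmoid(sum(a * b for a, b in zip(weights, w)) + bias)
    return "m1" if p >= 0.5 else "m2"


# Hypothetical reference intermediate vectors with known status.
authentic = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.1]]
falsified = [[-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.3]]
weights, bias = train_classifier(authentic + falsified, [1, 1, 1, 0, 0, 0])
```

Once trained, `estimate(w, weights, bias)` plays the role of step E24 for a new intermediate vector w.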

Advantageously according to the invention, using the intermediate vector, whose dimension is considerably smaller than that of the analyzed image part, makes the analysis simpler and faster. Furthermore, this reduction in dimension (and therefore, apparently, in the information available for estimating authenticity) does not reduce the robustness of the level of authenticity obtained, because the intermediate vector ultimately contains only the relevant parameters needed to determine this level. For example, the method according to the invention does not take into account information about lighting or backgrounds, which is not useful for determining the authenticity of the content from the faces present in the analyzed image.
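As a purely numerical illustration of this reduction in dimension, the following sketch checks one hypothetical configuration against the bounds recited in the claims (a second number of values at most 512 and between one hundredth and one thirtieth of the first number of values). The 128 × 128 single-channel image size is an assumption chosen for the example and does not come from the description.

```python
from fractions import Fraction

# Hypothetical input: a 128 x 128 single-channel face crop (assumption).
n_input = 128 * 128   # first number of values
n_latent = 512        # second number of values (dimension of w)

ratio = Fraction(n_latent, n_input)
in_claimed_range = Fraction(1, 100) <= ratio <= Fraction(1, 30)

print(f"ratio = {ratio}, within claimed range: {in_claimed_range}")
```

In this configuration the intermediate vector carries 32 times fewer values than the set of input values.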

In addition, the neural networks and classification means involved in the invention do not require a training phase that is costly (in time or in available memory).

The present invention has a particularly advantageous application in detecting audio or video content that has been falsified but is intentionally disseminated as authentic content. This is particularly useful for identifying content that spreads false information (or “fake news”), for combating cyberharassment, and for combating certain scams involving blackmail based on falsified content.

Claims

Method for estimating the authenticity of audio or video content, the audio or video content being represented by a set of input values, the method comprising steps of:
- determination (E6), by means of a processing system (5) and on the basis of said set of input values, of an intermediate vector (w) in accordance with a learned distribution (p_w), the processing system (5) being configured, beforehand, to produce, as output, output vectors distributed according to the learned distribution (p_w) and designed so that a generator network (34) produces content of the same type as the audio or video content when said output vectors are applied to the input of this generator network (34), and
- estimation (E22) of a level of authenticity (m) of the audio or video content by application of classification means (20) to said intermediate vector (w).

Method according to claim 1, in which there is also provided, prior to the estimation step, a preliminary method of learning the distribution (p_w) comprising steps of:
- supply (E52), at the input of a redistribution network (32), of data (z) distributed according to a random distribution, so as to obtain, at the output, vectors (u) distributed according to a distribution (p_w), said vectors (u) being supplied as input to the generator network (34) so as to provide, at output, a set of learning values (x), and
- training (E54) of the redistribution network (32) and the generator network (34) so as to update said distribution (p_w).

Method according to claim 1 or 2, in which the determination step (E6) comprises sub-steps of:
- supply (E10), at the input of a generator network (34), of said intermediate vector (w) so as to obtain, at the output of said generator network (34), a generated set of values, and
- updating (E18) of said intermediate vector (w) so as to optimize a cost function representing a distance between the set of input values and said generated set of values,
said updated intermediate vector being used as intermediate vector (w) during the estimation step.

Method according to claim 3, in which the updating step (E18) is implemented by a gradient descent method.

Method according to claim 3 or 4, in which there are also provided steps of:
- supply, at the input of the generator network (34), of said updated intermediate vector so as to obtain, at the output of said generator network (34), an updated generated set of values, and
- new updating of said intermediate vector (w) so as to optimize a cost function representing a distance between the set of input values and said updated generated set of values.

Method according to any one of claims 1 to 5, in which the determination step comprises supplying, at the input of the processing system (5), said set of input values so as to obtain, at the output of said processing system (5) and on the basis of the learned distribution (p_w), said intermediate vector (w).

Method according to any one of claims 1 to 6, in which the set of input values comprises a first number of values and the intermediate vector (w) comprises a second number of values, the second number of values being strictly less than the first number of values.

Method according to claim 7, wherein the second number of values is between one hundredth and one thirtieth of the first number of values.

Method according to claim 7 or 8, wherein the second number of values is less than or equal to 512.

Method according to any one of claims 1 to 9, in which there is provided, upstream of the estimation step (E22), a step of training the classification means (20) from a plurality of intermediate vectors each associated with authentic or falsified audio or video content, in such a way that the estimation of the level of authenticity (m) for an intermediate vector (w) associated with authentic audio or video content gives a first result and that the estimation of the level of authenticity (m) for an intermediate vector (w) associated with falsified audio or video content gives a second result distinct from the first result.

Method according to any one of claims 1 to 10, wherein the set of input values is obtained by extracting part of the audio or video content.

Method according to any one of claims 1 to 11, in which the set of input values is formed of values associated with pixels of an image.

Device (50) for estimating the authenticity of audio or video content, the audio or video content being represented by a set of input values, the device (50) comprising:
- a processing system (5) configured to determine, on the basis of said set of input values, an intermediate vector (w) in accordance with a learned distribution (p_w), the processing system (5) being configured to produce, as output, output vectors distributed according to the learned distribution (p_w) and designed so that a generator network (34) produces content of the same type as the audio or video content when said output vectors are applied as input to this generator network (34), and
- a module for estimating a level of authenticity (m) of the audio or video content by application of classification means (20) to said intermediate vector (w).

Computer program comprising instructions executable by a processor (54) and designed to implement a method according to one of claims 1 to 12 when these instructions are executed by the processor (54).