WO2023068956A1

WO2023068956A1 - Method and system for identifying synthetically altered face images in a video

Info

Publication number: WO2023068956A1
Application number: PCT/RU2021/000445
Authority: WO
Inventors: Кирилл Евгеньевич ВЫШЕГОРОДЦЕВ; Григорий Алексеевич ВЕЛЬМОЖИН; Валентин Валерьевич СЫСОЕВ; Александр Викторович БАЛАШОВ
Original assignee: Публичное Акционерное Общество "Сбербанк России"
Priority date: 2021-10-19
Filing date: 2021-10-19
Publication date: 2023-04-27
Also published as: RU2768797C1

Abstract

The claimed method of identifying synthetically altered face images in a video includes: obtaining images from a video; detecting face images on said images; calculating a vector representation of geometric characteristics of the detected face images, using an algorithm for comparing facial reference points to identify face images of a person; calculating, using frame-by-frame video analysis, the spatial and temporal weight of each face image of each person in an image, said weight being determined as a vector representation of a spatial face characteristic characterizing the size of the face region in relation to the frame, and a vector representation of a temporal face characteristic characterizing the display time of the face image under analysis on the frames of the video; calculating a vector of assessments of the probability of synthetic alterations for the face images of a person, said vector characterizing the presence of synthetic alterations to the face images of said person in each frame; calculating an overall assessment of the probability of synthetic alterations on the basis of the vector representations of spatial and temporal distribution and the vector of assessment of synthetic alterations for the face images of each person in the video; generating a summative assessment of the presence in the video of a synthetic alteration to a face image; generating an integral assessment on the basis of the summative assessment of a model and generating a warning about the presence of a synthetically altered face in the video.

Description

METHOD AND SYSTEM FOR DETERMINING SYNTHETICALLY MODIFIED IMAGES OF FACES IN VIDEO

FIELD OF TECHNOLOGY

[0001] The present technical solution relates to the field of computer technology used in the field of data processing, in particular to a method and system for determining synthetically modified images of faces in a video.

BACKGROUND OF THE INVENTION

[0002] At present, technologies that generate synthetic images superimposed on the faces of real people rely, as a rule, on machine learning algorithms, for example artificial neural networks (ANNs). Such approaches aim to apply digital masks that mimic human faces. An example of such a technology is the DeepFake technique, which is based on artificial intelligence and used for image synthesis (see https://ru.wikipedia.org/wiki/Deepfake).

[0003] There is a known method for recognizing synthetically modified images of people's faces, in particular DeepFake images (Tolosana et al. DeepFakes Evolution: Analysis of Facial Regions and Fake Detection Performance // Biometrics and Data Pattern Analytics - BiDA Lab, Universidad Autonoma de Madrid. 2020), which is based on the analysis of the segments that form the face images. The analysis is carried out using an ANN trained on real and synthetic images of people's faces, in particular celebrities, and can be used to detect fake videos. The method analyzes facial segments and, on that basis, classifies the corresponding image as containing synthetic changes or not.

[0004] The disadvantage of this approach is low efficiency, because it does not apply an integral estimate formed from both the geometric parameters of the face image and the spatio-temporal characteristics of the person's face in the video. Another drawback is that some solutions do not handle several people when more than one person appears in the video. In other well-known open solutions (https://www.kaggle.com/c/deepfake-detection-challenge, https://ai.facebook.com/datasets/dfdc/), such processing is carried out by independently evaluating each face image of each person on each analyzed frame of the video and then averaging all such ratings. All such solutions show low efficiency when processing videos with several people.

SUMMARY OF THE INVENTION

[0005] The claimed method and system are aimed at solving the technical problem of efficiently and accurately detecting synthetic changes in face images in a video.

[0006] The technical result is to increase the accuracy and efficiency of detecting a synthetic change in images of people's faces in a video.

[0007] In a first preferred embodiment of the invention, a computer-implemented method for determining synthetically modified images of faces in a video, performed by a processor, is provided, wherein: a) at least one image is obtained from the video; b) images of faces are detected in said image; c) a vector representation of the geometric characteristics of the detected face images is calculated using at least a face reference point comparison algorithm, in order to determine the face images of at least one person; d) using frame-by-frame video analysis, the spatio-temporal significance of each face image of each person in said image is calculated, which is defined as a vector representation of the spatial characteristic of the face, characterizing the size of the face area in relation to the frame, and a vector representation of the temporal characteristic of the face image, characterizing the display time of the analyzed face image on the video frames; e) a vector of synthetic change probability estimates is calculated for the face images of a person, characterizing the presence of synthetic changes in the face images of this person in each frame of the video; f) an overall estimate of the probability of synthetic changes is calculated for the face images of each person in the video on the basis of the vector representations of the spatial and temporal distribution and the vector of synthetic change estimates; g) a final assessment of the presence in the video of a synthetic change of at least one face image is formed; h) an integral estimate of the presence of a synthetically modified face image in the video is formed according to at least one final assessment of the model, and a notification about the presence of a synthetically modified face in the video is generated.

[0008] In one of the particular implementations of the method, steps c) - h) are performed by a machine learning model or ensemble of models, while the machine learning model or ensemble of models is trained on a data set containing synthesized images of people's faces.

[0009] In another particular implementation of the method, the machine learning model uses an automatic markup correction function that corrects the incorrect markup of each face on frames by comparing the images of faces on the synthesized video with their images on the original video.

[0010] In another particular implementation of the method, faces are compared based on the value of the vector proximity of the reference points that form the geometric characteristics of the original face image and the synthesized image based on it.

[0011] In another particular implementation of the method, comparison of faces is carried out by analyzing the coordinates of the regions of the original face image and the synthesized face image.

[0012] In another particular implementation of the method, the spatiotemporal significance is calculated as a general matrix based on the values of vector representations, and the assessment of the presence of synthetic changes in the images of the faces of an individual is formed by a machine learning model using the obtained general matrix.

[0013] In another particular implementation of the method, an ensemble of machine learning models consists of a group of models, each of which is trained to identify a specific synthetic imaging algorithm.

[0014] In another particular implementation of the method, it contains an integral classifier that receives as input estimates generated using models included in the ensemble.

[0015] In another particular implementation of the method, the overall score is calculated using an integral classifier.

[0016] In another particular implementation of the method, an algorithm for generating a synthetic face image in the analyzed video stream is additionally defined.

[0017] In another particular implementation of the method, the video is an online video conference.

[0018] In another particular implementation of the method, when a synthetically modified face image is determined, a notification is generated in its display area.

[0019] In another particular implementation of the method, when a synthetically modified face image is determined, the connection with this user is blocked.

[0020] In another particular implementation of the method, the analyzed image is obtained from a biometric identification or biometric authentication system.

[0021] In another particular implementation of the method, when a synthetically modified face image is determined, the user's access or requested action is blocked.

[0022] In another particular implementation of the method, when a synthetically modified face image is determined, additional user authentication data is requested, selected from the group: login, code, password, two-factor authentication, or combinations thereof.

[0023] In another particular implementation of the method, a signal is generated in the form of a quantitative estimate of the probability of the presence of a synthetically modified face image.

[0024] In another particular implementation of the method, images are obtained from a system for monitoring the video media space and analyzing social media and mass media, which performs content verification on social media and in mass media.

[0025] In another particular implementation of the method, when a synthetically modified face image is determined, a notification is generated to inform the person whose face image was synthetically modified.

[0026] In a second preferred embodiment of the invention, there is provided a system for detecting synthetically modified face images in video, comprising at least one processor and at least one memory storing machine-readable instructions that, when executed by the processor, implement the above method.

BRIEF DESCRIPTION OF THE DRAWINGS

[0027] FIG. 1 illustrates a block diagram of the implementation of the claimed method.

[0028] FIG. 2 illustrates an example of generating a vector representation of images of faces in a video.

[0029] FIGS. 3A - 3B illustrate an example of the formation of vector representations of spatio-temporal characteristics.

[0030] FIG. 4 illustrates a block diagram of generating the synthetic face image score vector, the face image spatial characteristic vector, and the face image temporal characteristic vector for the face images of each person in the video.

[0031] FIG. 5 illustrates a block diagram of the independent formation of the final spatial and temporal characteristics, and the overall evaluation of synthetic face images.

[0032] FIG. 6 illustrates a flowchart for processing the final spatial and temporal characteristics, when they are generated independently from the overall assessment of synthetic face images, to exclude human faces from the calculation of the assessment of synthetic changes in video.

[0033] FIG. 7 illustrates a block diagram of generating a notification with an integrated assessment of the presence of synthetic images of people's faces in a video, a notification about a probable algorithm for generating synthetic changes data, when using a set of ensembles of trained machine learning models, when the models of each ensemble are trained on a data set with one specific algorithm for generating synthetic face changes, and at least models of one ensemble, are trained on a dataset with several algorithms for generating synthetic face changes.

[0034] FIG. 8 illustrates a flowchart when a notification is generated by an integrated classifier based on the scores of several trained machine learning models or their ensembles, and there are several people on the video.

[0035] FIG. 9 illustrates a general view of the computing device.

IMPLEMENTATION OF THE INVENTION

[0036] In the present solution, the term "synthetically modified face image" is hereinafter understood as any type of digital image generation that imitates the face, or part of the face, of another person, including by applying digital masks, distorting or changing parts of the face, etc. A synthetically modified face image covers both fully generated images, for example masks produced using DeepFake technology and superimposed on the face of a real person in the frame while preserving the facial expressions of the image, and partial changes to individual parts of the face (eyes, nose, lips, ears and so on).

[0037] As shown in FIG. 1, the claimed method (100) for determining synthetically modified images of faces in a video is implemented by a computing device, in particular by one or more processors operating in an automated mode, which executes a software algorithm presented as a sequence of steps (101) - (107). These steps involve material actions in the form of processing the electronic signals generated when the processor of the computing device performs its functions in order to process data as part of method (100).

[0038] At the first step (101), one or more images are obtained from the video and stored in the memory of the computing device that performs the method (100). In the present application materials, the term "video" means a video image, a video stream (for example, from an IP camera, an electronic device camera, a virtual camera, or an Internet application), an ordered sequence of frames (images), or a subsample of frames, down to and including a single image.

[0039] At step (102), the received images are analyzed for the presence of face images, so that synthetic changes to them can subsequently be determined. Subsequent analysis of the resulting images can be performed using one machine learning model or several (an ensemble) that are trained to detect and classify face images.

[0040] When detecting a synthetic change in face images in a video, various machine learning models can be used, for example neural network architectures such as fully connected neural networks, CNN (convolutional networks), RNN (recurrent networks), Transformer (transformer networks), CapsNet (capsule networks), and their combinations.

[0041] During training, the networks may learn to identify one or more features of synthetically modified face images, in particular:

- anatomical proportion of the face and head;

- anatomical feature of the location of parts of the face;

- proportions of parts of the face;

- plasticity and relief of mimic diversity;

- features of plastic parts of the face: eyebrows, eyes, nose, ears, lips, skin;

- general characteristics of the muscles of the face and neck;

- structure and distribution of muscles into groups (facial, chewing, suboccipital and others), location;

- unnaturalness of shadows, light, glare, penumbra, reflexes of illumination and the environment of the details of the face and the surrounding space;

- temperature distribution by face elements;

- blurring, smoothing when rendering elements of the face, head and other elements of the image;

- sharpening (sharpness) and artificial enhancement of features when rendering elements of the face, head and other image elements;

- graphic artifacts left by generation algorithms and/or their specific implementations in software when creating synthetic images.

[0042] It is also possible to use pre-trained neural networks with or without further training. When architectures with convolutional networks are used, the following pre-trained models can be used: AlexNet, VGG, NASNet-A, DenseNet, DenseNet-B, DenseNet-BC, Inception, Xception, GoogleNet, PReLU-net, BN-inception, AmoebaNet, SENet, ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152, XResNet, Squeeze-and-Excitation ResNet (SE-ResNet), EfficientNet-B0, EfficientNet-B1, EfficientNet-B2, EfficientNet-B3, EfficientNet-B4, EfficientNet-B5, EfficientNet-B6, EfficientNet-B7, YOLO, and models derived from them.

[0043] The machine learning model was trained with at least one of the following steps:

- Obtaining classified (labeled, with affixed classes) data in one or more formats: video stream, video file, video frames (frame);

- Selection of frames in case of receiving a video stream or video file;

- Detection of face(s) on frames. Cutting them out of the frame with some neighborhood around the face and getting arrays of face data;

- For data of the "Synthetically modified image" class, if the source frame (the frame from which the modified image was formed) is available, checking that the assigned class is correct;

- For each face, its data array (pixel values, bitmap) is transformed according to the preprocessing algorithm (data standardization, image scaling, etc.);

- Data augmentation;

- Formation of a data package and its submission for training a neural network;

- Calculation of the value of the objective function and backpropagation of the error on the data package for network training. The following may be used as quality metrics: LogLoss, accuracy, precision, recall, F-measure, AUC-ROC, AUC-PR, Gini coefficient (Gini index), confusion matrix (error matrix).
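As an illustration of the quality metrics named in the last step, the following minimal sketch computes LogLoss, accuracy, precision, recall, F-measure and the confusion matrix for a binary classifier. The function name `classification_metrics`, the 0.5 decision threshold and the convention that label 1 means "synthetically modified" are assumptions made for this example and are not taken from the source.

```python
import math


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Compute several of the listed quality metrics for binary labels.

    y_true: list of 0/1 labels (1 = synthetically modified; assumed convention).
    y_prob: list of predicted probabilities of class 1.
    threshold: assumed cut-off for turning a probability into a class.
    """
    eps = 1e-12  # keeps log() finite when a probability is exactly 0 or 1
    tp = fp = tn = fn = 0
    log_loss = 0.0
    for y, p in zip(y_true, y_prob):
        p_clipped = min(max(p, eps), 1.0 - eps)
        log_loss -= y * math.log(p_clipped) + (1 - y) * math.log(1.0 - p_clipped)
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    n = len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "log_loss": log_loss / n,
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        # rows: true class 0, true class 1; columns: predicted 0, predicted 1
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


metrics = classification_metrics([1, 0, 1, 1, 0, 0], [0.9, 0.2, 0.4, 0.8, 0.6, 0.1])
```

Ranking metrics such as AUC-ROC and AUC-PR need the full score ordering rather than a single threshold and are left out of this sketch.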

[0044] One or more of the following algorithms can be used as a training algorithm for a machine learning model: Adagrad (Adaptive gradient algorithm), RMS (Root mean square), RMSProp (Root mean square propagation), Rprop (Resilient backpropagation algorithm), SGD (Stochastic Gradient Descent), BGD (Batch Gradient Descent), MBGD (Mini-batch Gradient Descent), Momentum, Nesterov Momentum, NAG (Nesterov Accelerated Gradient), FussySGD, SGDNesterov (SGD + Nesterov Momentum), AdaDelta, Adam (Adaptive Moment Estimation), AMSGrad, AdamW, ASGD (Averaged Stochastic Gradient Descent), LBFGS (L-BFGS algorithm - Broyden-Fletcher-Goldfarb-Shanno algorithm with limited memory usage), as well as second-order optimizers such as: Newton method, Quasi-Newton method, Gauss-Newton algorithm, Conjugate gradient method, Levenberg-Marquardt algorithm.

[0045] At least one of the following functions is used as an objective function when training a machine learning model: L1Loss, MSELoss, CrossEntropyLoss, CTCLoss, NLLLoss, PoissonNLLLoss, GaussianNLLLoss, KLDivLoss, BCELoss, BCEWithLogitsLoss, MarginRankingLoss, HingeEmbeddingLoss, MultiLabelMarginLoss, SmoothL1Loss, SoftMarginLoss, MultiLabelSoftMarginLoss, CosineEmbeddingLoss, MultiMarginLoss, TripletMarginLoss, TripletMarginWithDistanceLoss.

[0046] When training a machine learning model, a markup self-check (automatic markup correction) stage can be used, which verifies that each face in the image (frame) marked as containing a synthetic change really contains signs of such a change.

[0047] This check is performed if the source video (frames, images) is available. The source video (frames, images) is a real video (not modified by the introduction of a synthetic change) from which a synthetically modified video of the training (and, additionally, test) set was formed. The check may include the following steps:

- The face detection algorithm in the image on the frame of the synthetically modified video detects a face. A part of the image with a face and some neighborhood around it is cut out. The size of the neighborhood may vary.

- All faces are detected in the corresponding frame of the original video. The face with the closest characteristics to the face from the previous step is selected. As a proximity measure, depending on the face detection algorithm used, proximity at one or more points (sets of points) of the face is used:

- nose;

- nostrils;

- hair lines;

- lines of vegetation on the face (beard, mustache);

- mouth;

- lips (upper and lower);

- forehead;

- eyes;

- pupils;

- ears;

- eyebrows;

- eyelids;

- head;

- cheekbones;

- chin;

- nasolabial triangle;

- coordinates of the face rectangle.

[0048] The following approaches may be used as an algorithm for detecting people's faces: adaptive boosting and the Viola-Jones method based on it, MTCNN, flexible graph matching method (Elastic graph matching), DeepFace (Facebook), hidden Markov models (HMM), Principal component analysis and data matrix decomposition algorithms (PCA, SVD, LDA), Active Appearance Models (AAM), Active Shape Models (ASM), FERET (face recognition technology), SURF, NeoFace, SHORE, ROI, Template Matching Methods, DPM (deformable part model), Artificial neural networks (Neural network: Multilayer Perceptrons), Factor analysis (FA), Linear Discriminant Analysis, Support Vector Machines (SVM), Naive Bayesian classifier, Hidden Markov model, Distribution-based method, Mixture of PCA, Mixture of factor analyzers, Sparse network of winnows (SNOW).

[0049] Proximity here means the minimum distance for numerical data according to a metric such as Bray-Curtis, Canberra, Ruzicka, Kulczynski, Jaccard, Euclidean distance, Manhattan metric, Penrose size distance, Penrose shape distance, Lorentzian distance, Hellinger distance, Minkowski distance of order p, Mahalanobis distance, statistical distance, correlation similarities and distances (Pearson correlation, Orchini similarity, normalized dot product), or another metric. When calculating proximity, the coordinates of points on the frame of the synthetically modified video and the coordinates of the corresponding points on the frame of the original video are used; the closest face image is then selected as the face with the minimum distances between the points used.
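The selection of the closest face by reference points can be sketched as follows. This is a minimal illustration that implements three of the listed metrics (Euclidean, Manhattan, Bray-Curtis) in plain Python; the function names, the representation of a face as a list of (x, y) reference points, and the sample coordinates are assumptions made for the example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def bray_curtis(a, b):
    denom = sum(abs(x + y) for x, y in zip(a, b))
    return sum(abs(x - y) for x, y in zip(a, b)) / denom if denom else 0.0


def flatten(points):
    """Turn [(x, y), ...] reference points into one flat coordinate vector."""
    return [c for p in points for c in p]


def closest_face(query_points, candidate_faces, metric=euclidean):
    """Return (index, distances) for the candidate nearest to the query face.

    query_points: reference points of the face on the modified frame.
    candidate_faces: non-empty list of reference-point lists from the
    original frame, each in the same point order as the query.
    """
    q = flatten(query_points)
    distances = [metric(q, flatten(c)) for c in candidate_faces]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances


# Hypothetical reference points (for example nose, left eye, right eye).
modified_face = [(10, 10), (20, 10), (15, 20)]
original_faces = [
    [(100, 100), (110, 100), (105, 110)],
    [(11, 10), (21, 11), (15, 21)],
]
index, dists = closest_face(modified_face, original_faces)
```

Passing `metric=manhattan` or `metric=bray_curtis` swaps the proximity measure without changing the selection logic.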

[0050] In one of the particular examples of implementation, it is also possible to select (obtain the coordinates of) a face area on a frame of a synthetically modified video, after which an area with the same coordinates is cut out on the frame of the original video. In another particular implementation example, the reverse type of processing can be performed: a face is detected on the frame of the original video, and an area with the same coordinates is cut out on the frames of the synthetically modified video. As a result of these operations, two images are obtained: an area with a face from the frame of the original video and an area with a face from the frame of the synthetically modified video.

[0051] The resulting pair of images is compared according to a given metric to assess the level of image distortion. The following can be used as such a metric:

- Peak signal-to-noise ratio (PSNR - peak signal-to-noise ratio). https://ru.wikipedia.org/wiki/Peak_signal_to_noise_ratio ;

- Mean square error (MSE - mean square error). https://ru.wikipedia.org/wiki/Standard_deviation ;

- The square root of the mean square error function (RMSE - root-mean-square error). https://ru.wikipedia.org/wiki/Пиковое_отношение_сигнала_к_шуму ;

- Relative mean squared deviation (RMD - Root mean squared deviation);

- Standard deviation (RMS - Root Mean Squared);

- Index of structural similarity (SSIM - structure similarity). https://ru.wikipedia.org/wiki/SSIM ;

- Structural differences (DSSIM - structural dissimilarity). https://en.wikipedia.org/wiki/SSIM ;

- Signal-to-noise ratio (SNR - signal-to-noise ratio). https://ru.wikipedia.org/wiki/Отношение_сигнал/шум ;

- Absolute difference between pixels and indicators inherited from it (average, relative, etc.).

[0052] If color images (with several components per pixel) are analyzed, the same metrics are applied with subsequent weighted averaging over the components. For example, for an RGB image, PSNR or MSE is calculated over all three components (with the sum divided by three times the image size). For a good-quality synthetic image and good video quality (no interference from noise), it is preferable to use PSNR. If only part of the face is overlaid with the synthetic image, it is also preferable to use PSNR. If the video is noisy or grainy, it is preferable to use DSSIM or SSIM. In the presence of a lot of interference, it is preferable to apply SNR. If the video quality is extremely low, for example because of a high compression ratio, it is preferable to use MSE or RMD. If the face is small in relation to the frame, the absolute difference between the pixels is applied.

[0053] For the applied metric, a boundary value is selected. If the value of the metric between the two obtained images is greater than this boundary value, the face on the frame is accepted as synthetically modified. If the value is less than or equal to the boundary value, this face image is taken as real, even though it was marked as synthetically modified.

[0054] The transformation of face data arrays may use such elements as data normalization, data standardization, resizing to a given size, and image scaling algorithms.
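A minimal sketch of this label check, using MSE over the three RGB components as the distortion metric, is given below. The nested-list image format, the function names and the threshold value are assumptions made for illustration. The "greater than the boundary value" rule fits distortion-type metrics such as MSE; for similarity-type metrics such as PSNR or SSIM, where a larger value means less distortion, the direction of the comparison would presumably have to be reversed.

```python
import math


def mse_rgb(img_a, img_b):
    """Mean squared error over all pixels and all three colour components.

    Images are nested lists: rows -> pixels -> (R, G, B), 8-bit values,
    and both crops are assumed to have the same size.
    """
    total = 0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for px_a, px_b in zip(row_a, row_b):
            for ca, cb in zip(px_a, px_b):
                total += (ca - cb) ** 2
                count += 1  # ends up as 3 * width * height
    return total / count


def psnr(mse, max_value=255):
    """Peak signal-to-noise ratio in dB for a given MSE."""
    return math.inf if mse == 0 else 10 * math.log10(max_value ** 2 / mse)


def corrected_label(original_crop, modified_crop, mse_threshold):
    """Keep the 'synthetic' label only if distortion exceeds the boundary value."""
    distortion = mse_rgb(original_crop, modified_crop)
    return "synthetic" if distortion > mse_threshold else "real"


original = [[(0, 0, 0), (10, 10, 10)]]
altered = [[(10, 0, 0), (10, 10, 10)]]
label_same = corrected_label(original, original, mse_threshold=1.0)
label_altered = corrected_label(original, altered, mse_threshold=1.0)
```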

[0055] Data augmentation for training one or more machine learning models may be performed using at least one of the following approaches: image scaling (enlargement, reduction); image cropping; darkening the entire image or individual channels of the image; brightening the entire image or individual channels of the image; increase in contrast; color transformations: permuting (mixing) color channels, amplifying or reducing one or more color channels, obtaining a grayscale image, obtaining a monochrome image, deleting a color channel; shifts and decentering of the image; rotation of the image by different angles in different directions, rotation of the image or a part of it; skewing and distortion of the image; mirroring about an arbitrary axis or line; additional lines or geometric objects in the image: with or without color transparency, colored objects; gray objects (from white to black), including the removal of a part of the image (placing a black object on the image) at geometric or semantic positions of the image; adding any background to the image; glare and darkening of parts of the image; defocus (blur) of the image or its parts; increasing the graininess or sharpness of the image; compression and stretching along axes or lines; adding noise over the entire image or part of it, including white or other noise; adding one or more elements of Gaussian noise (blur), spotty noise; superposition (overlay) of two or more images (or parts of images) from the training sample with different weights; elastic transformation of the image (Elastic Transform); grid distortion of the image (GridDistortion); compression of image data by various image processing algorithms with some quality (for example, compressing the original bmp image according to the JPEG standard at some quality, and then obtaining a bmp image from it again); isotropic, affine and other transformations (https://github.com/albumentations-team/albumentations).

[0056] All of the above are applicable to various types of graphic representation or their channels: RGB, sRGB, RGBA, ProPhoto, CMYK, XYZ, LMS, HKS, HSV, HSB, HSL, AHSL, RYB, LAB, NCS, RAL, YUV, YCbCr, YPbPr, YDbDr, YIQ, PMS (Pantone), Munsell. These augmentation methods can be applied to one image in any sequence, with or without a probability of being applied.
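The idea of applying augmentations in an arbitrary sequence, each with its own probability, can be sketched in plain Python as follows. The three toy transforms, the nested-list RGB image format and the `augment` helper are assumptions made for this illustration; a practical pipeline would more likely rely on an image library such as the albumentations project cited above.

```python
import random


def mirror(img):
    """Mirror the image about its vertical axis."""
    return [list(reversed(row)) for row in img]


def brighten(img, delta=30):
    """Brighten every channel, clipping to the 8-bit range."""
    return [[tuple(min(255, c + delta) for c in px) for px in row] for row in img]


def to_grayscale(img):
    """Replace each pixel by its channel mean, kept as three equal channels."""
    return [[(sum(px) // 3,) * 3 for px in row] for row in img]


def augment(img, steps, rng):
    """Apply (transform, probability) pairs in a randomly shuffled order."""
    order = list(steps)
    rng.shuffle(order)
    for transform, prob in order:
        if rng.random() < prob:  # each step fires independently
            img = transform(img)
    return img


image = [[(0, 0, 0), (250, 100, 50)]]
rng = random.Random(0)  # fixed seed so a run can be repeated
augmented = augment(image, [(mirror, 0.5), (brighten, 0.5), (to_grayscale, 0.2)], rng)
```

A probability of 1.0 makes a transform unconditional, which corresponds to applying it "without" a probability.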

[0057] At step (102), faces are detected using the trained model or the face detection algorithm. At step (103), the face images extracted at step (102) are processed to determine which face images belong to one person. To do this, at step (103), the vector representation of the geometric characteristics of the face images is calculated. In the general case, this is done using the face reference point comparison algorithm. Using the geometric characteristics, the face images belonging to the same person are determined. The formation of this vector makes it possible to estimate the probability of having a face of a real person. The algorithm can work as follows. The j-th face is selected on the i-th frame. This j-th face is then searched for in subsequent frames.

[0058] In one of the particular examples of the implementation of the invention, the search is carried out by selecting the spatially closest face image among all faces detected on the (i+1)-th frame. As a measure of proximity, depending on the face detection algorithm used, proximity is used (numerical data by Bray-Curtis metric, Canberra, Ruzicka, Kulczynski, Jaccard, Euclidean distance, Manhattan metric, Penrose size distance, Penrose shape distance, Lorentzian distance, Hellinger distance, Minkowski distance of order p, Mahalanobis distance, statistical distance, correlation similarities and distances - Pearson correlation, Orchini similarity, normalized dot product, or otherwise) at one or more points of the face (reference points of the face): nose, nostrils, hairline, lines of facial hair (beard, mustache), mouth, lips (upper and lower), forehead, eyes, pupils, ears, eyebrows, eyelids, head, cheekbones, chin, nasolabial triangle, face rectangle coordinates.

[0059] The distance between the corresponding reference points of the j-th face in the i-th frame and the points of each face in the (i+1)-th frame is calculated. Then the face with the smallest distances by reference points is selected from the (i+1)-th frame. In another particular embodiment of the invention, the (i+1)-th frame is searched for the face with the closest characteristics between the given reference points (the relative position of the points). In this case, the geometric characteristics (dimensions) of the location of the reference points of the j-th face image are considered, and the (i+1)-th frame is searched for the face image with the most similar geometric characteristics. In yet another embodiment, for each face, a certain spatial neighborhood (location area) is allocated on the frame, and it is checked whether there is any face image in this neighborhood in the (i+1)-th frame. The approaches described for the claimed method (100) do not exclude other possible ways to search for a face image in frames.

[0060] Next, at step (104), for each detected face image in the frames, an estimate of the probability of its synthetic change is calculated using the trained machine learning model for detecting and classifying synthetic changes. This estimate is added to the vector of estimates for the face images of the j-th person. If the j-th face image is not detected on the next frame (or series of frames) of the ordered sequence of frames, the formation of the vector of estimates can be completed. An example of the formation of a vector of estimates for the image of a person's face on video frames is shown in FIG. 2. In another implementation, the vector of estimates for the face images of the j-th person is formed throughout the video and is not completed if the face image is not detected on the next frame.
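The first variant described above, in which the vector of estimates stops growing once the face is lost, can be sketched as follows. For brevity the sketch tracks a face by the centre of its bounding rectangle instead of a full set of reference points; this simplification, the `max_shift` limit, the function names and the stand-in `score_face` callable (which would be the trained model in practice) are all assumptions made for illustration.

```python
def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def nearest_box(prev_box, boxes, max_shift):
    """Return the box whose centre is closest to prev_box, or None if too far."""
    px, py = center(prev_box)
    best, best_d = None, max_shift
    for box in boxes:
        cx, cy = center(box)
        d = ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5
        if d <= best_d:
            best, best_d = box, d
    return best


def rating_vector(frames, start_frame, start_box, score_face, max_shift=50.0):
    """Follow one face from start_frame onward and collect per-frame scores.

    frames: list of lists of face boxes (x1, y1, x2, y2), one list per frame.
    score_face: callable (frame_index, box) -> probability of synthetic change.
    """
    scores = [score_face(start_frame, start_box)]
    box = start_box
    for i in range(start_frame + 1, len(frames)):
        box = nearest_box(box, frames[i], max_shift)
        if box is None:  # face lost on this frame: stop extending the vector
            break
        scores.append(score_face(i, box))
    return scores


# Hypothetical detections: two faces on frames 0-1, an unrelated face on frame 2.
frames = [
    [(0, 0, 10, 10), (100, 100, 120, 120)],
    [(2, 0, 12, 10), (101, 100, 121, 120)],
    [(200, 200, 210, 210)],
]
vector = rating_vector(frames, 0, (0, 0, 10, 10), lambda i, box: float(i))
```

The second variant, which continues over the whole video, would replace the `break` with recording a placeholder for the missing frame and keeping the last known position.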

[0061] Next, at step (104), for each specific image of a person's face, its spatial and temporal significance is determined, which is defined as a vector representation of a spatial characteristic of a person's face, which characterizes the size of the face area in relation to the frame, and a vector representation of the temporal characteristic of the face image, which characterizes the display time of the analyzed face image on video frames. On FIG. Figure 4 shows a diagram of step 104. Calculations of the vector of estimates of synthetic changes in the face images of the j-th person in the video, which consists of estimates of changes in the face image in each analyzed frame, the calculation of the vector of spatial and temporal characteristics (spatial vector and temporal vector) can be carried out sequentially, as this is shown in FIG. 4, or in parallel, independently of each other. The description of the invention does not limit the order and method of calculating these vectors, but describes their use to improve the quality of detecting synthetic changes in face images in video.

[0062] FIGS. 3A-3B present an example of calculating the vectors of spatial and temporal significance and the vector of estimates of synthetic changes. In the presented example, for each frame (K1)-(K6) of the video obtained at step (101), the vector of estimates of the synthetically modified face, the vector of the spatial distribution of faces in the frames, and the temporal characteristic of the face in the frames are calculated. The spatial characteristic can be calculated as the fraction of the frame area occupied by the face. For example, the rectangle in which the face is inscribed in the frame has the following coordinates: X1=100, Y1=50 (upper left corner); X2=300, Y2=150 (lower right corner). The area of such a rectangle is 200*100 = 20000. The video was taken at a resolution of 1280x1920 pixels, so the frame area is 2457600. The proportion of the face area in the frame is 20000/2457600 ≈ 0.8%. The temporal characteristic for each face can be calculated as a scalar value, for example, the time the face is displayed in the video. In another implementation, a vector may be formed in which 1 is assigned if the person is present in the frame and 0 if the person is not. Spatial and temporal significance can be represented, in particular, as a common matrix based on the values of the generated vector representations.
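The arithmetic of the example in paragraph [0062] can be reproduced with a short sketch. The function names `face_area_fraction` and `presence_vector` are illustrative assumptions; the numbers are those given in the description.

```python
def face_area_fraction(x1, y1, x2, y2, frame_w, frame_h):
    """Fraction of the frame area occupied by the face bounding rectangle."""
    face_area = (x2 - x1) * (y2 - y1)
    return face_area / (frame_w * frame_h)

def presence_vector(frames_with_face, n_frames):
    """Temporal vector: 1 if the person is present in the frame, else 0."""
    present = set(frames_with_face)
    return [1 if i in present else 0 for i in range(n_frames)]

# Example from the description: rectangle (100, 50)-(300, 150), 1280x1920 frame.
fraction = face_area_fraction(100, 50, 300, 150, 1280, 1920)
print(f"{fraction:.1%}")  # about 0.8% of the frame
```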

[0063] At step (105), an overall score of synthetic changes in images of human faces on video is formed based on the vectors obtained at steps (103) - (104). That is, the calculation of the synthetic image change probability estimate for each person's face in the video is performed based on the vectors of the temporal distribution, the spatial distribution, and the vector of probability estimates that the face image in the frame was subject to synthetic changes.

[0064] A separate machine learning model can be used to generate the overall estimate of synthetic changes in the face images of the j-th person. To form this overall estimate, the obtained vectors of spatial and temporal distribution and the vector of estimates of synthetic changes in the face image are combined into a common two-dimensional matrix, presented in Table 1 for the example in Fig. 3A. The resulting matrix is fed to the input of the machine learning model to form an overall estimate of the synthetic change in the face of the j-th person in the video. This model can be a recurrent neural network, a convolutional neural network, or a fully connected neural network. Such a combination is recommended when a person is present in different time intervals of the video, and not only in one consecutive series of frames.

Table 1. Two-dimensional matrix of the vector representation of the spatiotemporal distribution and the vector of estimates of synthetic changes in faces.

[0065] In another particular embodiment of the invention, the vector of spatial distribution of a person's face image and the vector of estimates of the presence of synthetic changes are combined into a common two-dimensional matrix. However, they are combined only for frames in which the face of the given person is present. An example is shown in Table 2 for FIG. 3A. Such a combination is recommended when a person is present in one consecutive series of frames.

Table 2. Two-dimensional matrix of the vector representation of the space-time distribution.
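The two ways of combining the vectors into a two-dimensional matrix (paragraphs [0064] and [0065]) can be sketched as follows. The row order (presence, area fraction, estimate) and the function name `build_matrix` are assumptions made for illustration; the actual layout of Tables 1 and 2 is not reproduced here.

```python
import numpy as np

def build_matrix(presence, area_fraction, scores, only_present=False):
    """Stack per-frame vectors of one person into a 2D matrix.

    presence:      1/0 per frame (temporal vector).
    area_fraction: face area fraction per frame (spatial vector).
    scores:        synthetic-change estimate per frame.
    only_present:  if True, keep only frames where the face is present and
                   drop the presence row (the variant of paragraph [0065]).
    """
    presence = np.asarray(presence, dtype=float)
    area = np.asarray(area_fraction, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if only_present:
        mask = presence > 0
        return np.vstack([area[mask], scores[mask]])
    return np.vstack([presence, area, scores])
```

The resulting matrix is what would be passed to the separately trained model; the model itself is outside the scope of this sketch.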

One option for generating a notification of the presence of changes in the video, in which the overall score of synthetic changes in the face images of the j-th person is calculated at step (105) using a trained model that takes as input the matrix combining the spatiotemporal representation vectors and the score vector, is presented in Figure 8.

[0066] In another particular example of the implementation of the invention, at step (105), the score vector characterizing whether the face image in the frame was subjected to synthetic changes is analyzed separately from the vectors of temporal and spatial distribution. An example of this scheme is shown in Fig. 5. In this case, the overall assessment of the synthetic change in the face of the j-th person is based only on the vector of estimates of synthetic changes. A separate machine learning model or a separate algorithm can be used to generate the overall score.

[0067] In one of the particular embodiments of the invention shown in FIG. 5, the score vector is input to the separately trained model. This model can be a recurrent neural network, a convolutional neural network, or a fully connected neural network. In such cases, a vector of a fixed length can be used. If the resulting vector of face image ratings is shorter than the specified length, it is padded with values, for example 0.5, at a certain end. If the vector is longer than the specified length, it is truncated at a certain end.
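The padding and truncation described in paragraph [0067] can be sketched as follows. The function name `fit_length` and the `at_end` flag (which end is padded or cut) are illustrative assumptions; the padding value 0.5 is the example value from the description.

```python
def fit_length(scores, length, pad_value=0.5, at_end=True):
    """Bring a rating vector to a fixed length by padding or truncating.

    at_end=True  pads or cuts at the end of the vector;
    at_end=False pads or cuts at the beginning.
    """
    scores = list(scores)
    if len(scores) >= length:
        # Too long (or exact): keep `length` values from the chosen side.
        return scores[:length] if at_end else scores[len(scores) - length:]
    padding = [pad_value] * (length - len(scores))
    return scores + padding if at_end else padding + scores
```

A neutral value such as 0.5 is a natural padding choice because it leans toward neither class, but any other value could be substituted.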

[0068] In another particular embodiment of the invention, the number of ratings falling within given intervals, or the frequency of ratings per interval, is counted. For example, intervals are taken with a step of 0.1: [0-0.1; 0.1-0.2; 0.2-0.3; 0.3-0.4; 0.4-0.5; 0.5-0.6; 0.6-0.7; 0.7-0.8; 0.8-0.9; 0.9-1] and the frequency of estimates from the vector in these intervals is calculated. The obtained values are fed to a machine learning model, for example, a support vector machine (SVM), k-nearest neighbors, linear (or non-linear) regression, or a classification tree model. The description of the invention does not limit the form of the machine learning model, but describes its application to the obtained score vector.
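The interval frequencies of paragraph [0068] can be computed, for instance, with `numpy.histogram`. This is a minimal sketch assuming ten equal intervals on [0, 1] and relative frequencies as the output; the function name `interval_frequencies` is hypothetical.

```python
import numpy as np

def interval_frequencies(scores):
    """Relative frequency of ratings in ten intervals of width 0.1 on [0, 1]."""
    edges = np.linspace(0.0, 1.0, 11)          # 0, 0.1, ..., 1.0
    counts, _ = np.histogram(scores, bins=edges)
    return counts / max(len(scores), 1)        # avoid division by zero
```

The resulting ten-element vector has a fixed length regardless of how many frames the person appears in, which is what makes it usable as input to models such as an SVM or a classification tree.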

[0069] In another particular embodiment of the invention, the overall assessment of synthetic changes in the images of a person's face is obtained by averaging the vector of estimates or by taking its maximum value, either over the entire vector or over a part of it.
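A minimal sketch of the averaging and maximum variants of paragraph [0069] follows. The function name `overall_score` and the use of a Python `slice` to select a part of the vector are assumptions for illustration.

```python
def overall_score(scores, method="mean", part=None):
    """Overall estimate for one person from the vector of per-frame estimates.

    method: "mean" or "max".
    part:   optional slice selecting the part of the vector to use.
    """
    values = list(scores[part]) if part is not None else list(scores)
    if not values:
        raise ValueError("no estimates to aggregate")
    if method == "mean":
        return sum(values) / len(values)
    if method == "max":
        return max(values)
    raise ValueError(f"unknown method: {method}")
```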

[0070] In one of the particular embodiments of the invention shown in FIG. 5, overall spatial and temporal characteristics of a person's face images are built for further analysis. The overall spatial characteristic is calculated as the average over the spatial vector of the given person. The overall temporal characteristic is obtained as the ratio of the length of the temporal characteristic vector to the length of the video, that is, the proportion of the total video time during which the person was present. In another particular variant, the maximum or the minimum value is selected as the overall spatial characteristic.
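The overall characteristics of paragraph [0070] can be sketched as follows, assuming the spatial vector holds one face area fraction per frame in which the person is present and the video length is measured in frames. The function name `overall_characteristics` and the `reduce` parameter are hypothetical.

```python
def overall_characteristics(area_fractions, n_video_frames, reduce="mean"):
    """Return (overall spatial, overall temporal) characteristics of a person.

    area_fractions: face area fraction for each frame where the person appears.
    n_video_frames: total number of frames in the video.
    reduce:         "mean", "max" or "min" for the spatial characteristic.
    """
    if not area_fractions or n_video_frames <= 0:
        raise ValueError("empty spatial vector or non-positive video length")
    if reduce == "mean":
        spatial = sum(area_fractions) / len(area_fractions)
    elif reduce == "max":
        spatial = max(area_fractions)
    elif reduce == "min":
        spatial = min(area_fractions)
    else:
        raise ValueError(f"unknown reduce: {reduce}")
    # Share of the video during which the person is present.
    temporal = len(area_fractions) / n_video_frames
    return spatial, temporal
```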

[0071] In step (106), a final score of the presence of synthetic face changes is calculated for the entire video. This score is built from the overall scores of synthetic changes in the face images of each person. In other words, in steps (104)-(105), synthetic change scores are obtained for each person in the video (individual persons are identified in step (103)), and in step (106) a score for the video is calculated from the scores for the people. This cumulative analysis of the scores of all people in the video improves the detection quality of the invention compared to existing solutions. For example, if there are many people in the video and all of them have a high synthetic change score, then the video under study is probably heavily compressed, and the models produced false positives when analyzing the face images. The cumulative analysis of the scores in step (106) would then generate a final rating for the video as "video without synthetic changes".

[0072] In one of the particular embodiments of the invention, shown in Fig. 1, the overall scores of the faces of all people can be used to form the final score of the video as follows:

- The weighted average of the scores of the faces of the people considered is determined. In one particular embodiment of the invention, the weight of each score can be the product of the average size of the face images of the given person and the share of time that person is present in the video.

- A simple average of the scores of synthetic changes in the face images of the people considered is calculated.

- The maximum score among the scores of the faces of the people considered is taken.
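The three aggregation options listed above can be sketched in one function. The name `video_score` and its parameter names are illustrative assumptions; the weights follow the variant in which each weight is the product of the average face size and the share of time on the video.

```python
def video_score(scores, mean_sizes=None, time_shares=None, method="weighted"):
    """Final score for the video from the overall scores of each person.

    scores:      overall synthetic-change score per person.
    mean_sizes:  average face size (area fraction) per person.
    time_shares: share of video time each person is present.
    method:      "weighted", "mean" or "max".
    """
    if not scores:
        raise ValueError("no person scores to aggregate")
    if method == "mean":
        return sum(scores) / len(scores)
    if method == "max":
        return max(scores)
    if method == "weighted":
        weights = [s * t for s, t in zip(mean_sizes, time_shares)]
        total = sum(weights)
        if total == 0:
            raise ValueError("all weights are zero")
        return sum(w * x for w, x in zip(weights, scores)) / total
    raise ValueError(f"unknown method: {method}")
```

With the weighted variant, a large face that is on screen for most of the video contributes more to the final score than a small face seen briefly in the background.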

[0073] In another particular embodiment of the invention, a trained model may be used for the above example. This model can be a support vector machine (SVM), k-nearest neighbors, linear (or non-linear) regression, a classification tree model, or one or more neural networks. Such a model can take as input a vector (vector representation of data) that characterizes how many faces' estimates of synthetic changes fall into each interval of estimates.

[0074] In another particular embodiment of the invention, the steps of which are shown by way of example in Figure 5, the overall spatial and temporal characteristics of the images of a person's face are formed. At step (106) these characteristics are compared with the corresponding boundary values. If, according to the final characteristic, the size of the face image or the time of its presence is less than the boundary value, then the assessment of this person's face is not taken into account when calculating the probability of a synthetic change in the video (None score). A diagram of this example is shown in Fig. 6. The remaining estimates of synthetic changes in people's faces are analyzed further by the methods described above.

[0075] In another particular embodiment of the invention, two-dimensional matrices of vector representations of the spatio-temporal distribution and estimates of synthetic changes in the faces of various people, the formation of which is described above, are fed to the input of stage (106), where the final assessment of the presence of synthetic changes in the faces of people in the video is formed. This step can also be performed using a single machine learning model or an ensemble of trained models.
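The boundary-value filtering of paragraph [0074] can be sketched as follows. The dictionary keys, the function name `filter_people` and the threshold parameters are hypothetical; `None` stands for the "None score" mentioned in the description.

```python
def filter_people(people, min_size, min_time_share):
    """Replace the score of people whose face is too small or too brief by None.

    people: list of dicts with keys "score", "size" (overall spatial
            characteristic) and "time_share" (overall temporal characteristic).
    """
    result = []
    for person in people:
        too_small = person["size"] < min_size
        too_brief = person["time_share"] < min_time_share
        # Excluded faces get a None score and are ignored downstream.
        result.append(None if (too_small or too_brief) else person["score"])
    return result

people = [
    {"score": 0.9, "size": 0.05, "time_share": 0.6},
    {"score": 0.8, "size": 0.001, "time_share": 0.6},   # face too small
    {"score": 0.7, "size": 0.05, "time_share": 0.01},   # present too briefly
]
kept = [s for s in filter_people(people, 0.005, 0.1) if s is not None]
```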

[0076] At step (107), an integral assessment of the presence of a synthetically modified face image in the video is formed based on the final assessment of the presence of synthetic changes in people's faces in the video. For this, at least one final assessment of the presence of synthetic changes in faces in the video is used, each produced by a separate model for classifying and detecting synthetic changes in faces. Upon completion of step (107), a notification about the presence of a synthetically modified face in the video is generated.

[0077] The notification may be displayed directly in the graphical user interface, for example, during an online conference (Zoom, Skype, MS Teams). The notification may also be displayed directly in the area where the synthetic face change was detected, for example, in the area with the image of a person's face. The invention may additionally be used in biometric control systems, for example, when receiving services (such as banking services) or obtaining access (access control system, turnstile with a biometric sensor). When a synthetically modified face image is detected, access or the requested action is blocked for the user. In this case, user authentication data may be additionally requested, selected from the group: login, code, password, two-factor authentication, or combinations thereof.

[0078] The claimed solution can be used in systems for monitoring the media space and analyzing social media and mass media, in order to identify publicly known people (heads of state, media personalities, famous people, etc.) who may be targets of attempts to compromise them. Such systems serve as the source of the video for subsequent analysis, and, if synthetic changes in the images of the faces of such people are detected, those people or the corresponding service may be notified of the falsely generated information. For this type of notification, information about the time of the detected event and the source of the event can also be stored.

[0079] In one particular embodiment of the invention, several models for detecting synthetic changes in face images are used, with at least one model trained for each algorithm for generating synthetic changes.

[0080] In another particular embodiment, an ensemble of models is trained for each algorithm for generating synthetic changes. Estimates from the several models in this ensemble are averaged.

[0081] For the final classification, the obtained scores are processed by an integral classifier, which makes it possible to reveal hidden relationships between model predictions for various algorithms for generating synthetic changes. This makes it possible to achieve a super-additive (synergistic) effect and to improve the quality of detecting videos that contain synthetic changes in face images. The general scheme is shown in Fig. 7. A more detailed diagram is shown in FIG. 8.

[0082] In another particular embodiment of the invention, the integral classifier outputs not only an integral assessment of the presence of synthetic changes in people's faces in the video, but also the most probable algorithm by which these synthetic changes were created. This example is shown in Fig. 7.
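The scheme of paragraphs [0079]-[0082] can be illustrated with a deliberately simple sketch. It assumes that each generation algorithm has its own ensemble whose estimates are averaged, and it substitutes a logistic function with hand-set weights for the integral classifier and an argmax over the averaged estimates for the choice of the most probable algorithm. These substitutions, the algorithm names `algo_a` and `algo_b`, and the weight values are all illustrative assumptions; the description does not specify the form of the integral classifier.

```python
import math

def ensemble_averages(scores_by_algorithm):
    """Average the estimates of the models in each per-algorithm ensemble."""
    return {name: sum(s) / len(s) for name, s in scores_by_algorithm.items()}

def integral_classifier(avg_scores, weights, bias):
    """Stand-in integral classifier: logistic function over averaged estimates.

    Returns (integral assessment, most probable generation algorithm).
    """
    z = bias + sum(weights[name] * score for name, score in avg_scores.items())
    probability = 1.0 / (1.0 + math.exp(-z))
    # Assumed rule: the algorithm whose ensemble gives the highest estimate.
    likely_algorithm = max(avg_scores, key=avg_scores.get)
    return probability, likely_algorithm

avg = ensemble_averages({"algo_a": [0.9, 0.7], "algo_b": [0.2, 0.4]})
prob, algo = integral_classifier(avg, {"algo_a": 2.0, "algo_b": 2.0}, bias=-2.0)
```

In practice the weights would be learned from labeled videos, and a richer model could replace the logistic function to capture interactions between the per-algorithm estimates.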

[0083] FIG. 9 shows a general view of a computing device (600) suitable for implementing the claimed solution. Device (600) may be, for example, a computer, server, or other type of computing device that can be used to implement the claimed technical solution, including as part of a cloud computing platform.

[0084] In the general case, the computing device (600) contains one or more processors (601), memory means such as RAM (602) and ROM (603), input/output interfaces (604), input/output devices (605), and a networking device (606), connected by a common information exchange bus.

[0085] The processor (601) (or multiple processors, or a multi-core processor) may be selected from a variety of devices currently in wide use, such as Intel™, AMD™, Apple™, Samsung Exynos™, MediaTek™, Qualcomm Snapdragon™, etc. The processor (601) can also be a graphics processor, such as those from Nvidia, AMD, Graphcore, etc.

[0086] RAM (602) is a random access memory and is designed to store machine-readable instructions executable by the processor (601) to perform the necessary data logical processing operations. RAM (602) typically contains the executable instructions of the operating system and associated software components (applications, program modules, etc.).

[0087] The ROM (603) is one or more persistent storage devices such as a hard disk drive (HDD), a solid state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

[0088] Various types of I/O interfaces (604) are used to organize the operation of the components of the device (600) and of externally connected devices. The choice of appropriate interfaces depends on the particular design of the computing device; they can be, but are not limited to: PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.

[0089] To ensure user interaction with the computing device (600), various information input/output means (605) are used, for example, a keyboard, a display (monitor), a touch screen, a touchpad, a joystick, a mouse, a light pen, a stylus, a touch panel, a trackball, speakers, a microphone, augmented reality devices, optical sensors, a tablet, indicator lights, a projector, a camera, biometric identification tools (retinal scanner, fingerprint scanner, voice recognition module), etc.

[0090] The networking means (606) enables the device (600) to communicate data via an internal or external computer network, such as an intranet, the Internet, a LAN, and the like. One or more means (606) may include, but are not limited to: an Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and/or BLE module, Wi-Fi module, and others.

[0091] Additionally, satellite navigation means, for example GPS, GLONASS, BeiDou, or Galileo, can also be used in the device (600).

[0092] The submitted application materials disclose preferred examples of the implementation of the technical solution and should not be construed as limiting other, particular examples of its implementation that do not go beyond the scope of the requested legal protection, which are obvious to specialists in the relevant field of technology.

Claims

1. A computer-implemented method for determining synthetically modified images of faces in a video, performed by a processor and comprising the steps of: a) receiving at least one image from the video; b) detecting images of faces in said image; c) calculating a vector representation of the geometric characteristics of the detected face images using at least a face reference point comparison algorithm to determine images of at least one person's face; d) calculating, using frame-by-frame video analysis, the spatio-temporal significance of each image of the face of each person in said image, which is defined as a vector representation of the spatial characteristic of the face, characterizing the size of the face area in relation to the frame, and a vector representation of the temporal characteristic of the face image, characterizing the display time of the analyzed face image in the video frames; e) calculating a vector of synthetic change probability estimates for the face images of a person, characterizing the presence of synthetic changes in the face images of this person in each frame of the video; g) forming a final assessment of the presence in the video of a synthetic image change of at least one face; h) forming an integral estimate of the presence of a synthetically modified face image in the video according to at least one final assessment of the model and generating a notification about the presence of a synthetically modified face in the video.

2. The method according to claim 1, characterized in that steps c) - h) are performed by a machine learning model or ensemble of models, while the machine learning model or ensemble of models is trained on a data set containing synthesized images of people's faces.

3. The method according to claim 2, characterized in that the machine learning model uses the automatic markup correction function, which corrects the incorrect markup of each face on the frames by comparing the images of faces on the synthesized video with their images on the original video.

4. The method according to claim 3, characterized in that the comparison of faces is carried out on the basis of the value of the vector proximity of the reference points that form the geometric characteristics of the original face image and the synthesized image based on it.

5. The method according to claim 3, characterized in that face comparison is carried out by analyzing the coordinates of the areas of the original face image and the synthesized face image.

6. The method according to claim 1, characterized in that the spatio-temporal significance is calculated as a general matrix based on the values of vector representations, and the assessment of the presence of synthetic changes in the images of the faces of an individual is formed by a machine learning model based on the obtained general matrix.

7. The method according to claim 2, characterized in that the ensemble of machine learning models consists of a group of models, each of which is trained to identify a specific synthetic imaging algorithm.

8. The method according to claim 7, characterized in that it contains an integral classifier that receives as input estimates generated using models included in the ensemble.

9. The method according to claim 8, characterized in that the final assessment is calculated using the integral classifier.

10. The method according to claim 9, characterized in that the algorithm for generating a synthetic face image in the analyzed video stream is additionally determined.

11. The method according to claim 1, characterized in that the video is an online video conference.

12. The method according to claim 11, characterized in that when a synthetically modified face image is detected, a notification is generated in its display area.

13. The method according to claim 11, characterized in that when determining a synthetically modified face image, the connection with this user is blocked.

14. The method according to claim 1, characterized in that the analyzed image is obtained from a biometric identification or biometric authentication system.

15. The method according to claim 14, characterized in that when a synthetically modified face image is determined, access or the requested action is blocked for the user.

16. The method according to claim 14, characterized in that when a synthetically modified face image is determined, user authentication data selected from the group: login, code, password, two-factor authentication, or combinations thereof, are additionally requested.

17. The method according to claim 14, characterized in that a signal is generated in the form of a quantitative estimate of the probability of the presence of a synthetically modified face image.

18. The method according to claim 1, characterized in that the images are obtained from a system for monitoring the media space and analyzing social media and mass media, which performs content verification in social media and mass media.

19. The method according to claim 18, characterized in that when the synthetically modified face image is determined, a notification is generated to inform the person who has been exposed to the creation of the modified face image.

20. A system for determining synthetically modified images of faces in a video, containing at least one processor and at least one memory storing machine-readable instructions that, when executed by the processor, implement the method according to any one of claims 1-19.