WO2023068956A1 - Method and system for determining synthetically modified face images in a video - Google Patents

Method and system for determining synthetically modified face images in a video

Info

Publication number
WO2023068956A1
WO2023068956A1 (PCT/RU2021/000445, RU2021000445W)
Authority
WO
WIPO (PCT)
Prior art keywords
face
video
images
image
vector
Prior art date
Application number
PCT/RU2021/000445
Other languages
English (en)
Russian (ru)
Inventor
Кирилл Евгеньевич ВЫШЕГОРОДЦЕВ
Григорий Алексеевич ВЕЛЬМОЖИН
Валентин Валерьевич СЫСОЕВ
Александр Викторович БАЛАШОВ
Original Assignee
Публичное Акционерное Общество "Сбербанк России"
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Публичное Акционерное Общество "Сбербанк России" filed Critical Публичное Акционерное Общество "Сбербанк России"
Publication of WO2023068956A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning

Definitions

  • the present technical solution relates to the field of computer technology used in the field of data processing, in particular to a method and system for determining synthetically modified images of faces in a video.
  • the claimed method and system are aimed at solving the technical problem of efficiently and accurately detecting synthetic changes in face images in video.
  • the technical result is to increase the accuracy and efficiency of detecting a synthetic change in images of people's faces in a video.
  • a computer-implemented method for determining synthetically modified face images in a video, performed by a processor, wherein: a) obtaining at least one image from the video; b) detecting face images in said image; c) calculating a vector representation of the geometric characteristics of the detected face images, using at least a face reference point comparison algorithm, to determine the face images of at least one person; d) using frame-by-frame video analysis, calculating the spatio-temporal significance of each face image of each person in said image, which is defined as a vector representation of the spatial characteristic of the face, characterizing the size of the face area relative to the frame, and a vector representation of the temporal characteristic of the face image, characterizing the display time of the analyzed face image on the video frames; e) calculating a vector of synthetic change probability estimates for the face images of a person, characterizing the presence of synthetic changes in the face images of this person in each frame of the video; f) calculating an overall probability estimate of synthetic changes based on the vector representations of the spatio-temporal distribution and the vector of synthetic change estimates for the face images of each person in the video; g) forming a final estimate of the presence of a synthetic change of a face image in the video; h) forming an integral estimate based on the final estimate of at least one model and generating a notification about the presence of a synthetically modified face in the video.
  • steps c) - h) are performed by a machine learning model or ensemble of models, while the machine learning model or ensemble of models is trained on a data set containing synthesized images of people's faces.
  • the machine learning model uses an automatic markup correction function that corrects the incorrect markup of each face on frames by comparing the images of faces on the synthesized video with their images on the original video.
  • faces are compared based on the value of the vector proximity of the reference points that form the geometric characteristics of the original face image and the synthesized image based on it.
  • comparison of faces is carried out by analyzing the coordinates of the regions of the original face image and the synthesized face image.
  • the spatiotemporal significance is calculated as a general matrix based on the values of vector representations, and the assessment of the presence of synthetic changes in the images of the faces of an individual is formed by a machine learning model using the obtained general matrix.
  • an ensemble of machine learning models consists of a group of models, each of which is trained to identify a specific synthetic imaging algorithm.
  • the method contains an integral classifier that receives as input estimates generated using models included in the ensemble.
  • the overall score is calculated using an integral classifier.
  • an algorithm for generating a synthetic face image in the analyzed video stream is additionally defined.
  • the video is an online video conference.
  • the analyzed image is obtained from a biometric identification or biometric authentication system.
  • additional user authentication data is requested, selected from the group: login, code, password, two-factor authentication, or combinations thereof.
  • a signal is generated in the form of a quantitative estimate of the probability of the presence of a synthetically modified face image.
  • images are obtained from a video media space monitoring and social media and media analysis system that performs content verification on social media and media.
  • a notification is generated to inform the person whose face image has been subjected to synthetic modification.
  • a system for detecting synthetically modified face images in video comprising at least one processor and at least one memory storing machine-readable instructions that, when executed by the processor, implement the above method.
  • FIG. 1 illustrates a block diagram of the implementation of the claimed method.
  • FIG. 2 illustrates an example of generating a vector representation of images of faces in a video.
  • FIG. 3A-3B illustrate an example of the formation of vector representations of spatio-temporal characteristics.
  • FIG. 4 illustrates a block diagram of generating the synthetic face image score vector, the face image spatial characteristic vector, and the face image temporal characteristic vector for the face images of each person in the video.
  • FIG. 5 illustrates a block diagram of the independent formation of the final spatial and temporal characteristics, and the overall evaluation of synthetic face images.
  • FIG. 6 illustrates a flowchart for processing the final spatial and temporal characteristics, when they are generated independently from the overall assessment of synthetic face images, to exclude human faces from the calculation of the assessment of synthetic changes in video.
  • FIG. 7 illustrates a block diagram of generating a notification with an integral assessment of the presence of synthetic face images in a video, together with a notification about the probable algorithm used to generate the synthetic changes, when a set of ensembles of trained machine learning models is used: the models of each ensemble are trained on a data set produced by one specific algorithm for generating synthetic face changes, and the models of at least one ensemble are trained on a data set produced by several such algorithms.
  • FIG. 8 illustrates a flowchart when a notification is generated by an integrated classifier based on the scores of several trained machine learning models or their ensembles, and there are several people on the video.
  • FIG. 9 illustrates a general view of the computing device.
  • a synthetically modified face image hereinafter in the text will be understood as any type of digital imaging that imitates the face, or part of the face, of another person, including by applying digital masks, distorting or changing parts of the face, etc.
  • a synthetically modified face image should be understood both as a fully generated image, for example a mask produced with DeepFake technology and superimposed on the face of a real person in the frame while preserving the mimic activity of the image, and as a partial change of individual parts of the face (eyes, nose, lips, ears, and so on).
  • the claimed method (100) for determining synthetically modified face images in a video is implemented by a computing device, in particular by one or more processors operating in an automated mode and executing a software algorithm presented as a sequence of steps (101) - (107); these steps perform material actions in the form of processing the electronic signals generated by the processor of the computing device in order to carry out the data processing required by method (100).
  • video will mean a video image, a video stream (for example, from an IP camera, an electronic device camera, a virtual camera, or an Internet application), an ordered sequence of frames (images), or a subsampling of frames, down to a single image.
  • the received images are analyzed for the presence of face images in order to determine whether those faces contain synthetic changes.
  • Subsequent analysis of the resulting images can be performed using one or more (ensemble) machine learning models that are trained to detect and classify face images.
  • such models may use neural network architectures such as fully connected neural networks, CNNs (convolutional networks), RNNs (recurrent networks), Transformers (transformer networks), CapsNets (capsule networks), and their combinations.
  • the networks may identify one or more characteristic features of synthetically modified face images.
  • pre-trained neural networks may be used, with or without further training.
  • pre-trained models can be used, such as: AlexNet, VGG, NASNet-A, DenseNet, DenseNet-B, DenseNet-BC, Inception, Xception, GoogLeNet, PReLU-net, BN-Inception, AmoebaNet, SENet, ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152, XResNet, Squeeze-and-Excitation ResNet (SE-ResNet), EfficientNet-B0, EfficientNet-B1, EfficientNet-B2, EfficientNet-B3, EfficientNet-B4, EfficientNet-B5, EfficientNet-B6, EfficientNet-B7, YOLO, and architectures derived from them.
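As a non-limiting illustration of this idea, a per-face classifier could be assembled from one of the listed pretrained backbones. The sketch below uses Python and torchvision's ResNet-50; the library choice, function name, and hyperparameters are assumptions for illustration, not the architecture claimed by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical sketch: a per-face "synthetic change" classifier built on a
# pretrained backbone from the list above (ResNet-50 chosen arbitrarily).
def build_face_forgery_classifier() -> nn.Module:
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    # Replace the ImageNet head with a single logit: P(face crop is synthetic).
    backbone.fc = nn.Linear(backbone.fc.in_features, 1)
    return backbone

model = build_face_forgery_classifier().eval()
with torch.no_grad():
    # One 224x224 RGB face crop -> one synthetic-change probability estimate.
    prob_synthetic = torch.sigmoid(model(torch.randn(1, 3, 224, 224))).item()
```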
  • the machine learning model may be trained using at least one of the following optimization algorithms:
  • Adagrad (adaptive gradient algorithm)
  • RMS (root mean square)
  • RMSProp (root mean square propagation)
  • Rprop (resilient backpropagation algorithm)
  • SGD (stochastic gradient descent)
  • BGD (batch gradient descent)
  • MBGD (mini-batch gradient descent)
  • Momentum, Nesterov Momentum
  • NAG (Nesterov accelerated gradient)
  • FussySGD, SGDNesterov (SGD + Nesterov Momentum)
  • AdaDelta, Adam (adaptive moment estimation), AMSGrad, AdamW, ASGD (averaged stochastic gradient descent), LBFGS (the L-BFGS algorithm - the Broyden-Fletcher-Goldfarb-Shanno algorithm with limited memory usage), as well as second-order optimizers such as Newton's method, quasi-Newton methods, the Gauss-Newton algorithm, and the conjugate gradient method
  • At least one of the following functions is used as an objective function when training a machine learning model: L1Loss, MSELoss, CrossEntropyLoss, CTCLoss, NLLLoss, PoissonNLLLoss, GaussianNLLLoss, KLDivLoss, BCELoss, BCEWithLogitsLoss, MarginRankingLoss, HingeEmbeddingLoss, MultiLabelMarginLoss, SmoothL1Loss, SoftMarginLoss, MultiLabelSoftMarginLoss, CosineEmbeddingLoss, MultiMarginLoss, TripletMarginLoss, TripletMarginWithDistanceLoss.
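A minimal training-step sketch, assuming PyTorch (whose loss names appear in the list above); Adam and BCEWithLogitsLoss are chosen arbitrarily from the listed optimizers and objectives, and the tiny placeholder model and batch shapes are illustrative only.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for any of the architectures listed earlier.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

def training_step(face_batch: torch.Tensor, labels: torch.Tensor) -> float:
    # face_batch: (N, 3, 224, 224) face crops; labels: (N, 1), 1 = synthetic.
    optimizer.zero_grad()
    loss = criterion(model(face_batch), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()

loss_value = training_step(torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8, 1)))
```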
  • a markup self-check (automatic markup correction) stage can be used, which verifies, for each face in the image (frame) that is marked as containing a synthetic change, that it really contains signs of such a change.
  • the source video (frames, images) is available.
  • the source video (frames, images) is a real video (not modified by the introduction of a synthetic change) from which the synthetically modified videos of the training (and, additionally, test) set were formed.
  • This feature is implemented as follows and may include the following steps:
  • the face detection algorithm in the image on the frame of the synthetically modified video detects a face.
  • a part of the image with a face and some neighborhood around it is cut out.
  • the size of the neighborhood may vary.
  • by proximity we mean the minimum distance for numerical data according to the Bray-Curtis metric, Canberra, Ruzicka, Kulczynski, Jaccard, Euclidean distance, Manhattan metric, Penrose size distance, Penrose shape distance, Lorentzian distance, Hellinger distance, Minkowski distance of order p, Mahalanobis distance, statistical distance, correlation similarities and distances (Pearson correlation, Orchini similarity, normalized dot product), or other measures.
  • to calculate proximity, the coordinates of points on the frame of the synthetically modified video and the coordinates of the corresponding points on the frame of the original video are used; the closest face image is then selected as the face with the minimum distances between the points used.
  • the reverse type of processing can be performed - a face is detected on the frame of the original video, and an area with the same coordinates is cut out on the frames of the synthetically modified video.
  • two images are obtained: the area with the face from the frame of the original video and the area with the face from the frame of the synthetically modified video.
  • the resulting pair of images is compared with each other according to a given metric to assess the level of image distortion.
  • one of the following metrics can be used:
  • PSNR (peak signal-to-noise ratio)
  • MSE (mean square error)
  • RMSE (root mean square error)
  • for color images (with several components per pixel), the same metrics are applied to each component with subsequent weighted averaging over the components. For example, for an RGB image, PSNR or MSE is calculated over all three components (and divided by three times the image size). For a good-quality synthetic image and good video quality (no noise interference), it is preferable to use PSNR. If only part of the face is overlaid with the synthetic image, it is also preferable to use PSNR. If the video is noisy or grainy, it is preferable to use DSSIM or SSIM. In the presence of heavy interference, it is preferable to apply SNR. If the video quality is extremely low, for example due to a high compression ratio, it is preferable to use MSE or RMSE. If the face is small relative to the frame, the absolute difference between pixels is applied.
  • a boundary value is selected, and if the value of the metric between the two received images is greater than this boundary value, the face in the frame is accepted as synthetically modified. If the value is less than or equal to the boundary, then, despite the image being marked as synthetically modified, the face image is treated as real.
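A minimal sketch of this markup self-check, assuming MSE as the distortion metric and an arbitrary boundary value; for a similarity metric such as PSNR or SSIM the comparison direction would be inverted.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    # Mean square error between two equally sized face crops (any channel count).
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def corrected_is_synthetic(original_crop: np.ndarray,
                           synthetic_crop: np.ndarray,
                           boundary: float = 50.0) -> bool:
    # Distortion above the boundary -> the face really was modified in this
    # frame, so the training-set label "synthetic" is kept; otherwise the face
    # image is relabelled as real despite the original markup.
    return mse(original_crop, synthetic_crop) > boundary
```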
  • Data augmentation for training one or more machine learning models may be performed using at least one of the following approaches: image scaling (enlargement, reduction); image cropping; dimming of the entire image or of individual image channels; brightening of the entire image or of individual image channels; contrast enhancement; color transformations: repositioning (mixing) of color channels, amplification or reduction of one or more color channels, conversion to grayscale, conversion to a monochrome image, deletion of a color channel; shifts and decentering of the image; rotation of the image at different angles in different directions, rotation of the image or of a part of it; tilts and distortions of the image; mirroring along an arbitrary axis or line; additional lines or geometric objects in the image: with color transparency or without it, colored objects, gray objects (from white to black), including removal of a part of the image (placing a black object on the image) at geometric or semantic positions of the image; adding an arbitrary background to the image; glare and darkening of parts of the image; defocus (blur).
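A few of the listed augmentations expressed with torchvision transforms; the library choice and the parameter values are assumptions for illustration, not part of the claims.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),      # scaling + cropping
    transforms.ColorJitter(brightness=0.3, contrast=0.3,
                           saturation=0.3, hue=0.05),          # color transformations
    transforms.RandomRotation(degrees=15),                     # rotation
    transforms.RandomHorizontalFlip(p=0.5),                    # mirror image
    transforms.RandomGrayscale(p=0.1),                         # grayscale image
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # defocus (blur)
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.2),                           # removal of a part of the image
])
# Usage: augmented_tensor = augment(pil_face_crop)
```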
  • the face images extracted at step (102) are processed to determine which face images belong to one person.
  • the vector representation of the geometric characteristics of face images is calculated. In the general case, this is done using the face reference point comparison algorithm.
  • images of faces belonging directly to the same person are determined. The formation of this vector makes it possible to estimate the probability of having a face of a real person.
  • the algorithm can work as follows: the j-th face is selected in the i-th frame, and this j-th face is then searched for in subsequent frames.
  • the search is carried out by selecting the spatially closest face image among all faces detected in the (i+1)-th frame.
  • a proximity measure is used (for numerical data: the Bray-Curtis metric, Canberra, Ruzicka, Kulczynski, Jaccard, Euclidean distance, Manhattan metric, Penrose size distance, Penrose shape distance, Lorentzian distance, Hellinger distance, Minkowski distance of order p, Mahalanobis distance, statistical distance, correlation similarities and distances - Pearson correlation, Orchini similarity, normalized dot product - or other measures) on one or more points of the face (face reference points): nose, nostrils, hairline, lines of facial hair (beard, mustache), mouth, lips (upper and lower), forehead, eyes, pupils, ears, eyebrows, eyelids, head, cheekbones, chin, nasolabial triangle, face rectangle coordinates.
  • the distance between the corresponding reference points of the j-th face in the i-th frame and the points of each face in the (i+1)-th frame is calculated; the face from the (i+1)-th frame with the smallest distances between reference points is then selected.
  • a face with the closest characteristics between the given reference points is searched for.
  • the geometric characteristics (dimensions) of the arrangement of the reference points of the j-th face image are considered, and in the (i+1)-th frame the face image with the most similar geometric characteristics is searched for.
  • a certain spatial neighborhood (location area) is allocated on the frame, and it is checked whether any face image falls within it in the (i+1)-th frame.
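A minimal sketch of this matching step, assuming landmarks are already extracted by an external face detector and using plain Euclidean distances between corresponding reference points; the threshold value is hypothetical.

```python
import numpy as np

def match_face(landmarks_j: np.ndarray,
               candidate_landmarks: list[np.ndarray],
               max_distance: float = 50.0) -> int | None:
    """Index of the closest face in the (i+1)-th frame, or None if absent."""
    best_idx, best_dist = None, float("inf")
    for idx, cand in enumerate(candidate_landmarks):
        # landmarks_j and cand: (num_points, 2) arrays of (x, y) coordinates.
        dist = float(np.linalg.norm(landmarks_j - cand, axis=1).sum())
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    # If even the closest candidate is too far away, the j-th face is treated
    # as not present in this frame.
    return best_idx if best_dist <= max_distance else None
```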
  • at step (104), for each face image detected in the frames, an estimate of the probability of its synthetic change is calculated using the trained machine learning model for detecting and classifying synthetic changes.
  • This estimate is added to the vector of estimates for the face images of the j-th person. If the j-th face image is not detected in the next frame (or series of frames) of the ordered sequence of frames, the formation of the estimate vector can be completed.
  • An example of forming an estimate vector for the image of a person's face on video frames is shown in FIG. 2.
  • In another embodiment, the formation of the estimate vector for the face images of the j-th person continues throughout the video and is not terminated if the face image is not detected in the next frame.
  • at step (104), for each specific image of a person's face, its spatio-temporal significance is determined, which is defined as a vector representation of the spatial characteristic of the face, characterizing the size of the face area relative to the frame, and a vector representation of the temporal characteristic of the face image, characterizing the display time of the analyzed face image on the video frames.
  • FIG. 4 shows a diagram of step (104). The calculation of the vector of estimates of synthetic changes in the face images of the j-th person in the video, which consists of estimates of changes in the face image in each analyzed frame, and the calculation of the vectors of the spatial and temporal characteristics (the spatial vector and the temporal vector) can be carried out sequentially, as shown in FIG. 4, or in parallel, independently of each other.
  • the description of the invention does not limit the order and method of calculating these vectors, but describes their use to improve the quality of detecting synthetic changes in face images in video.
  • FIG. 3A-3B present an example of calculating the vectors of spatial and temporal significance and the vector of estimates of synthetic changes.
  • for each frame (K1)-(K6) of the video obtained at step (101), the vector of estimates of the synthetically modified face, the vector of the spatial distribution of faces on the frames, and the temporal characteristic of the face on the frames are calculated.
  • the spatial characteristic can be calculated based on the fraction of the frame area occupied by the face.
  • for example, if the video was taken at a resolution of 1280x1920 pixels, the frame area is 2,457,600 pixels.
  • the time characteristic for each face can be calculated as a scalar value, for example, the time it is displayed on the video.
  • a vector may be formed in which 1 is assigned if the person is present in the frame, or 0 if he is not in the frame. Spatial and temporal significance can be represented, in particular, as a common matrix based on the values of the generated vector representations.
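A minimal sketch of the spatial and temporal characteristics for one person, using the 1280x1920 example above; the box format and helper names are assumptions.

```python
import numpy as np

FRAME_W, FRAME_H = 1280, 1920
FRAME_AREA = FRAME_W * FRAME_H          # 2,457,600 pixels, as in the example

def spatial_value(face_box: tuple[int, int, int, int]) -> float:
    # Fraction of the frame area occupied by the face rectangle (x1, y1, x2, y2).
    x1, y1, x2, y2 = face_box
    return ((x2 - x1) * (y2 - y1)) / FRAME_AREA

def temporal_vector(face_boxes_per_frame: list) -> np.ndarray:
    # 1 if the person's face was detected in the frame, 0 otherwise.
    return np.array([0 if box is None else 1 for box in face_boxes_per_frame])

# Example: a 320x480 face occupies 320 * 480 / 2_457_600 = 0.0625 of the frame.
```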
  • an overall score of synthetic changes in images of human faces on video is formed based on the vectors obtained at steps (103) - (104). That is, the calculation of the synthetic image change probability estimate for each person's face in the video is performed based on the vectors of the temporal distribution, the spatial distribution, and the vector of probability estimates that the face image in the frame was subject to synthetic changes.
  • a separate machine learning model can be used to generate an overall estimate of synthetic changes in the face images of the j-th person.
  • the obtained vectors of the spatial and temporal distribution and the vector of estimates of synthetic changes in the face image are combined into a common two-dimensional matrix, presented in Table 1 for the example in FIG. 3A.
  • the resulting matrix is fed to the input of the machine learning model to form an overall estimate of the synthetic change of the j-th person's face in the video.
  • This model can be a recurrent neural network, a convolutional neural network, a fully connected neural network. Such a combination is recommended for the case when a person is present at different time intervals of the video, and not only in one consecutive series of frames.
  • Table 1 Two-dimensional matrix of the vector representation of the spatiotemporal distribution and the vector of estimates of synthetic changes in faces.
  • a vector of the spatial distribution of a person's face image and a vector of estimates of the presence of synthetic changes are combined into a common two-dimensional matrix; however, they are combined only over the frames in which the given person's face is present.
  • An example is shown in Table 2 for FIG. 3A. Such a combination is recommended for the case when a person is present in one consecutive series of frames.
  • Table 2 Two-dimensional matrix of the vector representation of the space-time distribution.
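A minimal sketch of how the per-frame vectors could be stacked into the two-dimensional matrices of Tables 1 and 2; the numeric values are purely illustrative.

```python
import numpy as np

scores   = np.array([0.10, 0.85, 0.90, 0.80])  # synthetic-change estimates per frame
spatial  = np.array([0.06, 0.07, 0.07, 0.05])  # face area / frame area per frame
temporal = np.array([1, 1, 0, 1])              # presence flags per frame

# Table 1 style: all three vectors stacked over the whole video.
matrix_full = np.vstack([scores, spatial, temporal])      # shape (3, num_frames)

# Table 2 style: only the frames where the face is present, without the
# temporal row (presence is implied by the frame selection itself).
present = temporal.astype(bool)
matrix_present = np.vstack([scores[present], spatial[present]])
```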
  • a score vector characterizing that the face image in the frame was subjected to synthetic changes is analyzed separately from the vectors of temporal and spatial distribution.
  • An example of this circuit is shown in Fig. 5.
  • the overall assessment of the synthetic change of the j-th person's face is based only on the vector of estimates of synthetic changes.
  • a separate machine learning model or a separate algorithm can be used to generate the overall score.
  • the score vector is input to the separately trained model.
  • This model can be a recurrent neural network, a convolutional neural network, a fully connected neural network.
  • a vector of a fixed length can be used. If the resulting vector of face image estimates is shorter than the specified length, it is padded with values, for example 0.5, from a given end. If the vector is longer than the given length, it is truncated from a given end.
  • the number of ratings at given intervals or the frequency of rating intervals is counted.
  • intervals are taken with a step of 0.1: [0-0.1; 0.1-0.2; 0.2-0.3; 0.3-0.4; 0.4-0.5; 0.5-0.6; 0.6-0.7; 0.7-0.8; 0.8-0.9; 0.9-1] and the frequency of estimates from the vector in these intervals is calculated.
  • the obtained values are fed to a machine learning model, for example a support vector machine (SVM), K-nearest neighbors, linear (or non-linear) regression, or a classification tree model.
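A minimal sketch of this interval-frequency variant, assuming scikit-learn's SVM as the downstream model; the randomly generated training data is only there to keep the example self-contained.

```python
import numpy as np
from sklearn.svm import SVC

def interval_frequencies(scores: np.ndarray) -> np.ndarray:
    # Frequencies of per-frame estimates falling into the ten 0.1-wide intervals.
    counts, _ = np.histogram(scores, bins=np.linspace(0.0, 1.0, 11))
    return counts / max(len(scores), 1)

# X: one row of interval frequencies per person; y: 1 = synthetically modified.
rng = np.random.default_rng(0)
X = np.vstack([interval_frequencies(rng.random(30)) for _ in range(100)])
y = rng.integers(0, 2, size=100)

clf = SVC(probability=True).fit(X, y)
overall_score = clf.predict_proba(interval_frequencies(rng.random(30))[None, :])[0, 1]
```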
  • the description of the invention does not limit the form of the machine learning model, but describes its application to the obtained score vector.
  • the overall assessment of synthetic changes in the images of a person's face is obtained by averaging the vector of estimates, or by taking the maximum value, either over the entire vector or over a part of it.
  • general spatial and temporal characteristics of human face images are built for further analysis.
  • the general spatial characteristic is calculated as an average over the spatial vector of a given person.
  • the overall temporal characteristic is obtained as the length of the temporal characteristic vector relative to the length of the video, that is, the proportion of time the person was present in the video out of the total video duration.
  • the maximum value or the minimum value is selected to calculate the overall spatial characteristic.
  • at step (106), a final estimate of the presence of synthetic face changes is calculated for the entire video. This estimate is built using the overall estimate of synthetic changes in the face images of each person.
  • at steps (104) - (105), synthetic change estimates are obtained for each person in the video (individual persons are identified at step (103)), and at step (106) an estimate for the entire video is calculated from the per-person estimates.
  • This stage of cumulative analysis of the estimates of all people in the video makes it possible to improve detection quality compared to existing solutions. For example, if there are many people in the video and all of them receive a high synthetic change estimate, the video under study is probably heavily compressed, and the models have produced false positives when analyzing the face images. The cumulative analysis of the estimates at step (106) would then generate a final estimate of the video as "video without synthetic changes".
  • the weighted average value of the estimates of the people's faces used is determined.
  • the weights for ratings can be the product of the average size of the images of the faces of a given person and the share of time of presence on the video.
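A minimal sketch of this weighted-average variant of step (106): the weights are the product of each person's average relative face size and their share of presence time; names and shapes are assumptions.

```python
import numpy as np

def final_video_score(person_scores, avg_face_sizes, presence_shares) -> float:
    scores = np.asarray(person_scores, dtype=float)
    weights = np.asarray(avg_face_sizes, dtype=float) * np.asarray(presence_shares, dtype=float)
    if scores.size == 0:
        return 0.0
    if weights.sum() == 0:
        return float(scores.max())
    return float(np.average(scores, weights=weights))

# A large, long-present face dominates the final per-video estimate:
final_video_score([0.9, 0.2], avg_face_sizes=[0.08, 0.01], presence_shares=[0.9, 0.1])
```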
  • alternatively, the maximum estimate among the estimates of the people's faces used is taken.
  • a trained model may be used for the above example.
  • This model can be a support vector machine (SVM), K-nearest neighbors, linear (or non-linear) regression, a classification tree model, or one or more neural networks.
  • Such a model can take as input a vector (vector representation of data) that characterizes the amount of use of intervals of estimates of synthetic changes by faces.
  • the general spatial and temporal characteristics of images of a person's face are formed.
  • these characteristics are compared with the corresponding boundary values. If, according to the final characteristic, the size of the face image or the time of its presence is less than the boundary value, then the assessment of this person's face is not taken into account when calculating the probability of a synthetic change in the video (None score).
  • the schematic of this example is shown in Fig. 6. The remaining estimates of synthetic changes in people's faces are analyzed further by the methods described above.
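A minimal sketch of the exclusion rule shown in FIG. 6; the boundary values are assumptions.

```python
MIN_FACE_FRACTION = 0.01    # assumed boundary for the overall spatial characteristic
MIN_PRESENCE_SHARE = 0.05   # assumed boundary for the overall temporal characteristic

def filtered_score(score: float, face_fraction: float, presence_share: float):
    # Faces that are too small or shown too briefly receive a None score and are
    # excluded from the per-video synthetic-change calculation.
    if face_fraction < MIN_FACE_FRACTION or presence_share < MIN_PRESENCE_SHARE:
        return None
    return score
```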
  • two-dimensional matrices of vector representations of the spatio-temporal distribution and estimates of synthetic changes in the faces of various people are fed to the input of stage (106), where the formation of the final assessment of the presence of synthetic changes in the faces of people in the video is performed.
  • This step can also be performed using a single machine learning model or an ensemble of trained models.
  • at step (107), an integral assessment of the presence of a synthetically modified face image in the video is formed based on the final assessment of the presence of synthetic changes in people's faces in the video. For this, at least one final assessment of the presence of synthetic changes in faces in the video is used, which is formed by a separate model for classifying and detecting synthetic changes in faces. Upon completion of step (107), a notification about the presence of a synthetically modified face in the video is generated.
  • the notification may be displayed directly in the graphical user interface, for example, during an online conference (Zoom, Skype, MS Teams). Also, the notification may be displayed directly in the synthetic face change detection area, for example, in the area with the image of a person's face.
  • An additional effect of the application of the invention may be its use in biometric control systems, for example, when receiving services (for example, banking services) or access (access control system, turnstile with a biometric sensor).
  • user authentication data may be additionally requested, selected from the group: login, code, password, two-factor authentication, or combinations thereof.
  • the claimed solution can be used in systems for monitoring the media space and analyzing social media and the media, to identify publicly known people (senior state officials, media personalities, famous people, etc.) whom attempts may be made to compromise.
  • Such systems will be the source of the received video for its subsequent analysis, and, if synthetic changes in the images of the faces of such people are detected, they or the corresponding service may be notified of the falsely generated information.
  • information about the time of the detected event, the source of the event can also be stored.
  • several models for detecting synthetic changes in face images may be used, at least one of which is trained on a different algorithm for generating synthetic changes.
  • an ensemble of models is trained for each algorithm for generating synthetic changes. Estimates from several models in this ensemble are averaged.
  • the obtained scores are processed by an integral classifier, which makes it possible to reveal hidden relationships between the model predictions for the various algorithms for generating synthetic changes. This makes it possible to achieve a super-additive (synergistic) effect and improve the quality of detecting videos containing synthetic changes in face images.
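A minimal sketch of the ensemble scheme: per-algorithm ensembles are averaged and an integral classifier combines the averages; the use of logistic regression and the random training data are assumptions made only to keep the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ensemble_scores(per_algorithm_scores: dict) -> np.ndarray:
    # per_algorithm_scores: {"generation_algo_A": [0.81, 0.76], "algo_B": [...], ...}
    # Each value is the list of scores from the models of that algorithm's ensemble.
    return np.array([np.mean(v) for v in per_algorithm_scores.values()])

rng = np.random.default_rng(0)
X = rng.random((200, 3))            # per-algorithm averaged scores per training video
y = rng.integers(0, 2, size=200)    # 1 = video contains synthetic face changes
integral_clf = LogisticRegression().fit(X, y)

video_features = ensemble_scores({"algo_A": [0.81, 0.76],
                                  "algo_B": [0.33, 0.41],
                                  "algo_C": [0.58, 0.62]})
p_synthetic = integral_clf.predict_proba(video_features[None, :])[0, 1]
```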
  • the general scheme is shown in Fig. 7.
  • a more detailed diagram is shown in FIG. 8.
  • the integral classifier generates not only an integral assessment of the presence of synthetic changes in people's faces in the video, but also an indication of the most probable algorithm by which these synthetic changes were created. This example is shown in FIG. 7.
  • FIG. 9 shows a general view of a computing device (600) suitable for implementing the claimed solution.
  • Device (600) may be, for example, a computer, a server, or another type of computing device that can be used to implement the claimed technical solution, including as part of a cloud computing platform.
  • the computing device (600) contains, connected by a common information exchange bus, one or more processors (601), memory means such as RAM (602) and ROM (603), input/output interfaces (604), input/output devices (605), and a networking device (606).
  • the processor (601) may be selected from a variety of devices currently in wide use, for example Intel™, AMD™, Apple™, Samsung Exynos™, MediaTEK™, Qualcomm Snapdragon™, etc.
  • the processor (601) can also be a graphics processor such as Nvidia, AMD, Graphcore, etc.
  • RAM (602) is a random access memory and is designed to store machine-readable instructions executable by the processor (601) to perform the necessary data logical processing operations.
  • RAM (602) typically contains the executable instructions of the operating system and associated software components (applications, program modules, etc.).
  • ROM (603) is one or more persistent storage devices, such as a hard disk drive (HDD), a solid-state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.
  • I/O interfaces (604) are used to organize the operation of the components of the device (600) and the operation of externally connected devices.
  • the choice of appropriate interfaces depends on the particular design of the computing device, which can be, but not limited to: PCI, AGP, PS/2, IrDa, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, Display Port, RJ45, RS232, etc.
  • various I/O means (605) are used, for example: a keyboard, a display (monitor), a touch screen, a touchpad, a joystick, a mouse, a light pen, a stylus, a touch panel, a trackball, speakers, a microphone, augmented reality devices, optical sensors, a tablet, indicator lights, a projector, a camera, biometric identification tools (a retinal scanner, a fingerprint scanner, a voice recognition module), etc.
  • the networking means (606) enables the communication of data by the device (600) via an internal or external computer network, such as an Intranet, Internet, LAN, and the like.
  • one or more means (606) can be used, including but not limited to: an Ethernet card, a GSM modem, a GPRS modem, an LTE modem, a 5G modem, a satellite communication module, an NFC module, a Bluetooth and/or BLE module, a Wi-Fi module, and others.
  • satellite navigation tools in the device (600) can also be used, for example, GPS, GLONASS, BeiDou, Galileo.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for determining synthetically modified face images in a video, which consists in: obtaining images from a video; isolating face images in the image; calculating a vector representation of the geometric characteristics of the isolated face images using a face reference point comparison algorithm in order to determine the face images of a person; using frame-by-frame analysis of the video, calculating the spatio-temporal significance of each face image of each person in the image, defined as a vector representation of the spatial characteristic of the face, characterizing the size of the face region relative to the frame, and a vector representation of the temporal characteristic of the face image, characterizing the display time of the analyzed face image on the video frames; calculating a vector of synthetic change probability estimates for the person's face images, characterizing the presence of synthetic changes in that person's face images in each frame; calculating an overall probability estimate of synthetic changes on the basis of the vector representations of the spatio-temporal distribution and the vector of synthetic change estimates for the face images of each person in the video; generating a final estimate of the presence in the video of a synthetic change of the face image; generating an integral estimate based on the final estimate of a model, and generating a notification about the presence of a synthetically modified face in the video.
PCT/RU2021/000445 2021-10-19 2021-10-19 Method and system for determining synthetically modified face images in a video WO2023068956A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2021130421A RU2768797C1 (ru) 2021-10-19 2021-10-19 Method and system for determining synthetically modified face images in video
RU2021130421 2021-10-19

Publications (1)

Publication Number Publication Date
WO2023068956A1 (fr) 2023-04-27

Family

ID=80819517

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2021/000445 WO2023068956A1 (fr) 2021-10-19 2021-10-19 Method and system for determining synthetically modified face images in a video

Country Status (2)

Country Link
RU (1) RU2768797C1 (fr)
WO (1) WO2023068956A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311720B (zh) * 2022-08-11 2023-06-06 山东省人工智能研究院 一种基于Transformer的deepfake生成方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2570195C2 (ru) * 2011-05-19 2015-12-10 Сони Компьютер Энтертэйнмент Инк. Устройство съемки движущихся изображений, система и устройство обработки информации и способ обработки изображений
US20200160502A1 (en) * 2018-11-16 2020-05-21 Artificial Intelligence Foundation, Inc. Identification of Neural-Network-Generated Fake Images
CN111444881A (zh) * 2020-04-13 2020-07-24 中国人民解放军国防科技大学 伪造人脸视频检测方法和装置
CN112069891A (zh) * 2020-08-03 2020-12-11 武汉大学 一种基于光照特征的深度伪造人脸鉴别方法

Also Published As

Publication number Publication date
RU2768797C1 (ru) 2022-03-24

Similar Documents

Publication Publication Date Title
US10650280B2 (en) Systems and methods for machine learning enhanced by human measurements
JP7078803B2 (ja) 顔写真に基づくリスク認識方法、装置、コンピュータ設備および記憶媒体
US10574883B2 (en) System and method for guiding a user to take a selfie
US7835549B2 (en) Learning method of face classification apparatus, face classification method, apparatus and program
US20110243431A1 (en) Detecting Objects Of Interest In Still Images
CN109271930B (zh) 微表情识别方法、装置与存储介质
Das et al. SSERBC 2017: Sclera segmentation and eye recognition benchmarking competition
US10936868B2 (en) Method and system for classifying an input data set within a data category using multiple data recognition tools
WO2021196721A1 (fr) Procédé et appareil de réglage d'environnement intérieur d'habitacle
CN111008971B (zh) 一种合影图像的美学质量评价方法及实时拍摄指导系统
US9355303B2 (en) Face recognition using multilayered discriminant analysis
WO2021088640A1 (fr) Technologie de reconnaissance faciale basée sur une transformation de nuage gaussien heuristique
Zhao et al. Applying contrast-limited adaptive histogram equalization and integral projection for facial feature enhancement and detection
Mayer et al. Adjusted pixel features for robust facial component classification
CN113436735A (zh) 基于人脸结构度量的体重指数预测方法、设备和存储介质
Szankin et al. Influence of thermal imagery resolution on accuracy of deep learning based face recognition
RU2768797C1 (ru) Способ и система для определения синтетически измененных изображений лиц на видео
Bekhouche Facial soft biometrics: extracting demographic traits
JP4708835B2 (ja) 顔検出装置、顔検出方法、及び顔検出プログラム
JP2006285959A (ja) 顔判別装置の学習方法、顔判別方法および装置並びにプログラム
JP2006244385A (ja) 顔判別装置およびプログラム並びに顔判別装置の学習方法
EA043568B1 (ru) Способ и система для определения синтетически измененных изображений лиц на видео
CN112837304A (zh) 一种皮肤检测方法、计算机存储介质及计算设备
Frieslaar Robust south african sign language gesture recognition using hand motion and shape
Shah Automatic Analysis and Recognition of Facial Expressions Invariant to Illumination

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21961542

Country of ref document: EP

Kind code of ref document: A1