CN111611873A - Face replacement detection method and device, electronic equipment and computer storage medium

Info

Publication number
CN111611873A
Authority
CN
China
Prior art keywords
image
features
pair
image pair
video stream
Prior art date
Legal status
Pending
Application number
CN202010353319.8A
Other languages
Chinese (zh)
Inventor
田金戈
徐国强
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010353319.8A
Publication of CN111611873A

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 3/045: Combinations of networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/269: Analysis of motion using gradient-based methods
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/30201: Face

Abstract

The present application discloses a face replacement detection method and apparatus, and relates to the technical field of artificial intelligence. The method comprises the following steps: extracting image frames from a video stream to be detected to obtain an image sequence corresponding to the video stream; taking two adjacent frames of images in the image sequence as an image pair, and extracting image pair features of each image pair in the image sequence; respectively acquiring optical flow features of each image pair according to the image pair features of each image pair, wherein the optical flow features reflect the pixel motion information of a next frame image relative to a previous frame image in each image pair; and, according to the image features of each frame of image in the image sequence and the optical flow features of each image pair in the image sequence, identifying face feature changes contained in consecutive images in the image sequence to obtain the probability of face replacement occurring in the video stream. The method and apparatus can thus predict the probability of face replacement in a video stream.

Description

Face replacement detection method and device, electronic equipment and computer storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for face replacement detection, an electronic device, and a computer-readable storage medium.
Background
With the development of artificial intelligence technology, and in particular the continuous progress of image generation technology, hyper-realistic videos of scenes that never existed can now be generated, and their authenticity is difficult for the human eye to distinguish. Improper use of this technology brings security risks of varying degrees: in a face replacement application scenario, for example, a user's private information is easily leaked, personal or collective reputations may be damaged, and the administration of justice may even be affected. There is therefore a need to verify the authenticity of faces in video.
At present, the authenticity of a face in a video is generally verified through face recognition and liveness detection: when a legitimate user is detected, the liveness of that user must be further confirmed. Such detection techniques, however, cannot be applied to a face replacement scenario.
Therefore, how to identify face replacement in a video is a technical problem that urgently needs to be solved.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present application and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
In order to solve the technical problem, the application provides a face replacement detection method and device, an electronic device, and a computer-readable storage medium.
The technical solutions disclosed in the present application include the following:
a face replacement detection method includes: extracting image frames from a video stream to be detected to obtain an image sequence corresponding to the video stream; taking two adjacent frames of images in the image sequence as an image pair, and extracting image pair features of each image pair in the image sequence; respectively acquiring optical flow features of each image pair according to the image pair features of each image pair, wherein the optical flow features reflect pixel motion information of a next frame image relative to a previous frame image in each image pair; and, according to the image features of each frame of image in the image sequence and the optical flow features of each image pair in the image sequence, identifying face feature changes contained in consecutive images in the image sequence to obtain the probability of face replacement occurring in the video stream.
In an exemplary embodiment, the extracting, with two adjacent frames of images in the image sequence as an image pair, the image pair feature of each image pair in the image sequence includes: obtaining the original image characteristics of each frame of image by extracting the characteristic information of each frame of image in the image sequence on the space dimension and the channel dimension; sequentially stacking original image features of images contained in each image pair in the image sequence to obtain the original image pair features of each image pair, wherein the original image pair features contain feature information of the image pairs in a time dimension; and sequentially carrying out feature separation processing on the time dimension and the channel dimension on the features of each original image pair to obtain the image pair features of each image pair.
In an exemplary embodiment, the respectively acquiring optical flow features of the respective image pairs according to the image pair features of the respective image pairs includes: performing feature compression processing on the image pair features to obtain compressed features of the image pair, wherein the compressed features comprise compressed optical flow information of the image pair; and performing feature amplification processing on the compressed features to obtain optical flow features of the image pair.
In an exemplary embodiment, the identifying, according to the image features of each frame of image in the image sequence and the optical flow features of each image pair in the image sequence, face feature changes contained in consecutive images in the image sequence to obtain the probability of face replacement occurring in the video stream includes: determining a target image containing face features for the first time in the video stream according to the image features of each frame of image in the image sequence, and acquiring the face features as target features; starting from the image pair where the target image is located, tracking the target features in the image sequence, and comparing the image feature difference between the previous frame image and the next frame image in each image pair according to the optical flow features of each image pair in the video stream, to obtain a comparison result corresponding to each image pair; and predicting the probability of face replacement occurring in the video stream according to the comparison result corresponding to each image pair.
In an exemplary embodiment, the extracting image frames from the video stream to be detected to obtain an image sequence corresponding to the video stream includes: in the process of extracting image frames of a video stream to be detected, identifying key images in the video stream according to the image characteristics of each extracted image frame; and acquiring an image set formed by the key images as an image sequence corresponding to the video stream.
In an exemplary embodiment, the step of recognizing changes of face features contained in consecutive images in the image sequence according to the image features of each frame of image in the image sequence and the optical flow features of each image pair in the image sequence to obtain the probability of face replacement occurring in the video stream is performed by a face replacement probability prediction model trained in advance, and the training method of the face replacement probability prediction model includes: acquiring a plurality of video positive samples containing real face features and video negative samples containing virtual face features; and performing iterative training of the face replacement probability prediction model by taking the mixed video positive sample and the mixed video negative sample as training sample data until the face replacement probability prediction model is converged.
A face replacement detection apparatus comprising: the image frame extraction module is used for extracting image frames of a video stream to be detected to obtain an image sequence corresponding to the video stream; the image pair feature extraction module is used for taking two adjacent frames of images in the image sequence as an image pair and extracting image pair features of each image pair in the image sequence; the optical flow feature extraction module is used for respectively acquiring the optical flow features of each image pair according to the image pair features of each image pair, wherein the optical flow features reflect the pixel motion information of a next frame image relative to a previous frame image in each image pair; and the face replacement prediction module is used for identifying face feature changes contained in continuous images in the image sequence according to the image features of each frame of image in the image sequence and the optical flow features of each image pair in the image sequence to obtain the probability of face replacement in the video stream.
In one exemplary embodiment, the face replacement prediction module includes: the target feature acquisition unit is used for determining a target image which contains human face features for the first time in the video stream according to the image features of each frame of image in the image sequence and acquiring the human face features as target features; an image feature comparison unit, configured to start with an image pair where the target image is located, perform tracking on the target feature in the video stream according to an optical flow feature of each image pair in the video stream, and compare an image feature difference between a previous frame image and a next frame image in each image pair to obtain a comparison result corresponding to each image pair; and the human face replacement probability prediction unit is used for predicting the probability of human face replacement in the video stream according to the comparison result corresponding to each image pair.
An electronic device comprising a processor and a memory, the memory having stored thereon computer-readable instructions that, when executed by the processor, implement a face replacement detection method as in any one of the preceding claims.
A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements a face replacement detection method as in any one of the preceding claims.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
in this technical solution, image frames are extracted from the video stream to be detected to obtain an image sequence corresponding to the video stream, and two adjacent frames of images in the image sequence are then taken as an image pair so that the image pair features of each image pair in the image sequence can be extracted. From the features of each image pair, the optical flow features of each image pair can further be acquired; because the optical flow features reflect the pixel changes between the two frames of an image pair, the possibility that the face features in the video stream have changed can be detected from the optical flow features of each image pair together with the image features of each frame of image in the image sequence, and the probability of face replacement occurring in the video stream can thus be predicted.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow diagram illustrating a face replacement detection method in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram for one embodiment of step 110 in the embodiment shown in FIG. 1;
FIG. 3 is a flow chart of one embodiment of step 120 in the embodiment shown in FIG. 1;
FIG. 4 is a schematic diagram illustrating feature separation of features for an original image in accordance with an exemplary embodiment;
FIG. 5 is a flow chart of one embodiment of step 130 in the embodiment of FIG. 1;
FIG. 6 is a flow chart of one embodiment of step 140 in the embodiment of FIG. 1;
FIG. 7 is a block diagram illustrating a face replacement detection apparatus according to an exemplary embodiment;
FIG. 8 is a diagram illustrating the hardware architecture of an electronic device in accordance with an exemplary embodiment;
while certain embodiments of the present application have been illustrated by the accompanying drawings and described in detail below, such drawings and description are not intended to limit the scope of the inventive concepts in any manner, but are rather intended to explain the concepts of the present application to those skilled in the art by reference to the particular embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with certain aspects of the present application, as detailed in the appended claims.
As described above, in prior art implementations, detecting the authenticity of a face in a video depends on face recognition technology and liveness detection technology: when a corresponding legitimate user can be determined from the face in the video, the liveness of that user is further confirmed, at which point the face in the video can be determined to be the real face of the legitimate user.
Face replacement technology substitutes a user's facial features and thereby changes how the user's face is rendered in the video. Even though face replacement can substitute the face image of user B for that of user A in a video, the behavior shown in the video is still the real behavior of user A.
Improper use of this technology can present security risks of varying degrees. For example, in the user-identification applications of internet systems such as online banking, an illegal actor can forge a video containing the face of a legitimate user with this technology and thereby bypass conventional face recognition and liveness detection to gain the internet system's trust; it is therefore very necessary to verify the authenticity of faces in video.
Based on this, embodiments of the present application provide a face replacement detection method, an apparatus, an electronic device, and a computer-readable storage medium, which are used to detect the possibility of face replacement in a video, thereby avoiding a security risk caused by improper use of a face replacement technology.
Fig. 1 is a flow chart illustrating a face replacement detection method according to an exemplary embodiment. As shown in fig. 1, the method comprises at least the following steps:
step 110, performing image frame extraction on the video stream to be detected to obtain an image sequence corresponding to the video stream.
Since face replacement is implemented by processing an original face image, a video in which face replacement has actually occurred inevitably contains a process in which facial features change, and this process is usually not easily noticed by the naked eye.
After a face is replaced, the facial features of the substituted face follow the movements of the original face in the video, such as motion, expression changes, and facial actions, and are substituted accordingly frame by frame. Detection of this face feature replacement process can therefore be performed on the video: if the video is detected to contain a face replacement process, face replacement has occurred in the video.
In this embodiment, image frames are extracted from the video stream to be detected to obtain the frames of the video stream, and whether face replacement exists in the video stream is then determined by analyzing how pixels change across those frames.
In the resulting image sequence corresponding to the video stream, the frames are arranged in the order in which they were extracted from the video stream.
Because face replacement is achieved by computer processing, the face feature replacement process is very short. To ensure the accuracy of face replacement detection, images can be extracted from the video stream frame by frame, that is, every frame in the video stream is extracted in order, so that no information characterizing the face feature replacement process is omitted from the image sequence corresponding to the video stream. This, however, requires hardware with relatively good processing performance; otherwise face replacement detection will be very slow.
To increase the speed of face replacement detection, the extraction rate of image frames can instead be adapted to the processing performance of the hardware, for example by periodically extracting image frames from the video stream at a set frame interval. The interval between extracted frames must not be too long, however, or the chance of missing facial feature replacement information in the image sequence increases greatly.
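As a concrete illustration of the frame extraction just described, here is a minimal OpenCV sketch; the function name and the `frame_interval` parameter are illustrative assumptions, not taken from the patent.

```python
import cv2

def extract_frames(video_path: str, frame_interval: int = 1):
    """Extract frames from a video stream at a fixed interval.

    frame_interval=1 extracts every frame (most accurate, as discussed
    above); a larger interval trades detection accuracy for speed.
    """
    capture = cv2.VideoCapture(video_path)
    sequence = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:                       # end of the stream
            break
        if index % frame_interval == 0:
            sequence.append(frame)       # frames kept in extraction order
        index += 1
    capture.release()
    return sequence
```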
And step 120, taking two adjacent frames of images in the image sequence as an image pair, and extracting image pair characteristics of each image pair in the image sequence.
A plurality of temporally ordered image pairs can be obtained by taking each two adjacent frames in the image sequence corresponding to the video stream as an image pair. The change over time in the image information reflected by each image pair can serve as one basis for judging whether the video stream contains a facial feature replacement.
Moreover, because each image pair consists of two adjacent frames that correspond to each other in time, the correlation between the two frames of each image pair can also serve as a basis for judging whether the video stream contains a facial feature replacement.
Therefore, the image pair features extracted in this embodiment for each image pair in the image sequence can reflect the motion information between adjacent frames, and by analyzing this motion information it can be determined whether the video stream contains image frames in which facial features have been replaced.
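As a small illustrative sketch (names assumed), the temporally ordered image pairs can be built by pairing each frame with its successor:

```python
def build_image_pairs(sequence):
    """Each two adjacent frames form an image pair:
    N frames yield N-1 temporally ordered pairs."""
    return list(zip(sequence[:-1], sequence[1:]))
```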
And step 130, respectively acquiring optical flow characteristics of each image pair according to the image pair characteristics of each image pair, wherein the optical flow characteristics reflect pixel movement information of a next frame image relative to a previous frame image in each image pair.
Here, optical flow is the apparent motion information caused by the movement of objects in the scene, the movement of the camera, or both.
Optical flow estimation for an image pair uses the change over time of the pixels of the two frames and the correlation between the two frames to find the correspondence that exists between them, and thereby computes the motion information between the two frames.
Thus, the optical flow features of an image pair capture the motion of the pixels of the next frame image relative to those of the previous frame image. From the motion of each pixel, the position in the next frame of each pixel of the previous frame can be known, and the change in facial features can then be judged by comparing pixel positions.
And step 140, identifying face feature changes contained in continuous images in the image sequence according to the image features of each frame of image in the image sequence and the optical flow features of each image pair in the image sequence, and obtaining the probability of face replacement in the video stream.
The facial features of the face that appears for the first time in the video stream can be determined from the image features of each frame of image in the image sequence. If those facial features are subsequently detected to have been replaced, it can be determined that face replacement has occurred in the video stream.
Since the optical flow characteristics of each image pair in the image sequence can reflect the motion information between two adjacent frames of images, in the case of determining the facial characteristics of the human face appearing for the first time in the video stream, the optical flow characteristics of each image pair in the image sequence can be combined to determine whether the facial characteristics are changed.
Therefore, according to the optical flow features of each image pair in the image sequence corresponding to the video stream and the image features of each frame of image in the image sequence, it can be recognized whether the facial features of the face that first appears in the video stream change in the time dimension, and it can thus be judged whether face replacement occurs in the video stream. The method provided by this embodiment can therefore avoid the security risks caused by improper use of face replacement technology.
Fig. 2 is a flow chart of step 110 in one exemplary embodiment of the embodiment shown in fig. 1. As shown in fig. 2, step 110 includes at least the following steps:
step 111, identifying key images in the video stream according to the image characteristics of each extracted frame image in the process of extracting image frames of the video stream to be detected;
and step 112, acquiring an image set formed by the key images as an image sequence corresponding to the video stream.
For details of this step, please refer to the corresponding description of step 110 in the embodiment shown in fig. 1, which is not repeated here.
A key image in the video stream is an image that contributes more to the face replacement detection process. Illustratively, the key images in the video stream include, but are not limited to: images whose image features differ greatly from those of adjacent images, the first frame image containing a face in the video stream, and images spaced at specified temporal intervals from that first frame image; the other images in the video stream are correspondingly called non-key images.
In determining key images in the video stream, the key images may also be marked to facilitate subsequent extraction of the key images.
It should be noted that, in the present embodiment, the number of determined key images should be much larger than the number of non-key images, so as to ensure that image information related to facial feature replacement is not missed in the image sequence corresponding to the video stream.
With the method provided by this embodiment, the amount of processing required of the hardware performing face replacement detection can be further reduced, the speed of face replacement detection increased, and the requirements on the hardware lowered.
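A plausible sketch of key image selection, assuming a simple mean absolute pixel difference as a stand-in for the "image feature difference" criterion above (the patent does not fix a specific measure, and the threshold value here is arbitrary):

```python
import numpy as np

def select_key_images(sequence, diff_threshold: float = 8.0):
    """Keep the first frame and every frame that differs enough from the
    last kept frame; all other frames are treated as non-key images."""
    if not sequence:
        return []
    key_images = [sequence[0]]
    for frame in sequence[1:]:
        prev = key_images[-1].astype(np.float32)
        diff = np.mean(np.abs(frame.astype(np.float32) - prev))
        if diff > diff_threshold:
            key_images.append(frame)
    return key_images
```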
Fig. 3 is a flow chart of step 120 in an exemplary embodiment of the embodiment shown in fig. 1. As shown in fig. 3, step 120 includes at least the following steps:
and step 121, obtaining the original image characteristics of each frame of image by extracting the characteristic information of each frame of image in the image sequence in the space dimension and the channel dimension.
Extraction of the feature information of each frame of image in the image sequence in the spatial dimension and the channel dimension is achieved by convolving the image data of each frame through a convolutional network. For example, the feature information of each frame may be extracted by a convolutional network such as ResNet (Residual Network), MobileNet, or ShuffleNet (the latter two being network models designed for mobile and embedded vision applications).
Image feature extraction is essentially a process in which the convolutional network performs neuron computations on image data such as the image size and the pixel value of each pixel. It should be understood that the spatial dimensions of an image correspond to the image size (which is related to the resolution of the image), and the channel dimension corresponds to the number of feature-extracting neurons in the convolutional network.
The original image features of each frame of image can be represented as (H, W, C), where H and W denote the height and width of the frame, respectively, and C denotes its channel information.
And step 122, stacking the original image features of the images contained in each image pair in the image sequence in sequence to obtain the original image pair features of each image pair, wherein the original image pair features contain feature information of the image pairs in a time dimension.
The time dimension of the original image pair features refers to the number of frame images contained in the corresponding image pair; in this embodiment, the time dimension of an image pair is 2.
By stacking the original image features of the images contained in each image pair in order, the resulting original image pair features can be represented as (H, W, C, T), where T denotes the time dimension of the image pair. Each original image pair feature thus contains time-series information.
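A minimal PyTorch sketch of steps 121 and 122, using a truncated ResNet (one of the backbones the patent names) for the spatial/channel feature extraction and stacking the two per-frame feature maps along a new time axis; the specific backbone, tensor shapes, and function names are assumptions for illustration.

```python
import torch
import torchvision.models as models

backbone = models.resnet18(weights=None)
# Drop the average-pool and classification head: keep (N, C, h, w) feature maps.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

def original_pair_features(frame_a: torch.Tensor, frame_b: torch.Tensor):
    """frame_a, frame_b: (1, 3, H, W) tensors for one image pair.
    Returns (1, C, T=2, h, w): per-frame features stacked on a time axis."""
    with torch.no_grad():
        feat_a = feature_extractor(frame_a)   # spatial + channel features
        feat_b = feature_extractor(frame_b)
    return torch.stack([feat_a, feat_b], dim=2)  # time dimension T = 2
```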
And 123, sequentially performing characteristic separation processing on the time dimension and the channel dimension on the characteristics of each original image pair to obtain the image pair characteristics of each image pair.
Performing feature separation in the time dimension and the channel dimension on each original image pair feature in turn means convolving each original image pair feature sequentially with a pointwise convolution network (Pointwise Convolution) and a depthwise convolution network (Depthwise Convolution).
The pointwise convolution network contains the same number of pointwise convolution kernels as the number of channels output by the convolutional network that extracts the original image features of each frame in step 121, and each pointwise convolution kernel performs a convolution operation over the time dimension of each original image pair feature; this can be understood as separating out the feature information of each original image pair feature in the time dimension.
The depthwise convolution network contains the same number of depthwise convolution kernels as the number of channels, in the channel dimension, of the intermediate features output by the pointwise convolution network, and each depthwise convolution kernel performs the convolution of the intermediate features on one channel; this can be understood as further separating out the feature information of each original image pair feature in the channel dimension.
As shown in fig. 4, in one exemplary embodiment, the original image pair features are represented as 1×1×3×2, where "1×1" is the spatial dimension, "3" the channel dimension, and "2" the time dimension of the original image pair features.
The pointwise convolution network contains 3 pointwise convolution kernels of size 1×1×3. A pointwise kernel takes one time point at a time and, at each time point, convolves all feature elements of the original image pair features at that time point, yielding the intermediate features. Compared with the original image pair features, the intermediate features have a changed channel dimension but an unchanged time dimension.
The depthwise convolution network contains 3 depthwise convolution kernels of size 1×1×3 and assigns a corresponding channel to each kernel, so that each depthwise kernel performs the convolution of the intermediate features on its assigned channel, yielding the image pair features. The depthwise convolution network changes neither the channel dimension nor the time dimension of the intermediate features; it only separates the intermediate features in the channel dimension.
For each image pair in the image sequence, the image pair features are the result of channel-dimension feature extraction performed on the intermediate features, which were themselves obtained by time-dimension feature extraction on the original image pair features; compared with the original image pair features, the image pair features therefore strengthen the expression of the pair's video feature information in the time dimension.
Accordingly, the image pair features obtained for each image pair in this embodiment enhance the temporal information between the two successive frames, which helps the subsequent steps identify face replacement in the video stream more accurately.
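The feature separation of step 123 can be sketched as a pointwise 3-D convolution followed by a depthwise (grouped) 3-D convolution over the stacked (C, T, H, W) features. This is one plausible reading of the description above; the kernel sizes and channel counts are illustrative, not the patent's.

```python
import torch
import torch.nn as nn

class PairFeatureSeparation(nn.Module):
    """Pointwise conv mixes all channels at each time point (time-dimension
    separation); the depthwise conv then processes each channel on its own
    (channel-dimension separation), leaving T and C unchanged."""
    def __init__(self, channels: int):
        super().__init__()
        # Pointwise: kernel size 1 over (T, H, W); time dimension preserved.
        self.pointwise = nn.Conv3d(channels, channels, kernel_size=1)
        # Depthwise: groups=channels gives one kernel per channel.
        self.depthwise = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                   padding=(0, 1, 1), groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W) original image pair features
        intermediate = self.pointwise(x)     # separated in the time dimension
        return self.depthwise(intermediate)  # separated in the channel dimension
```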
Fig. 5 is a flowchart of an exemplary embodiment of step 130 in the embodiment shown in fig. 1. As shown in fig. 5, step 130 includes at least the following steps:
step 131, performing feature compression processing on the image pair features to obtain compressed features of the image pair, wherein the compressed features contain the compressed optical flow information of the image pair;
step 132, feature amplification processing is performed on the compressed features to obtain optical flow features of the image pair.
In this embodiment, an optical flow method is used to predict the optical flow features of an image pair. An optical flow method calculates the motion information of objects between the two adjacent frames of an image pair by using the change of pixels in the time domain and the correlation between the two frames.
Because the prediction of optical flow features concerns the precise position of every pixel in the image, which involves not only the feature information of each frame but also the association of corresponding pixels between adjacent frames, the convolutional network used to predict optical flow features differs from an ordinary convolutional network.
The convolutional network for extracting optical flow features comprises a feature compression part and a feature amplification part. The feature compression part performs feature compression on the image pair features so as to extract, in depth, the optical flow information of the two frames of the image pair.
In one embodiment, the feature compression part may be implemented as a FlowNetSimple network model. The FlowNetSimple model has 9 convolutional layers, each with a convolution stride of 2, and the convolution kernels shrink progressively with depth; thus, as the model performs layer-by-layer convolution on the image features, the resolution of the two frames in the image pair is reduced and the optical flow information is compressed.
Because the feature compression part reduces the resolution of the two frames in the image pair, which affects the accuracy of optical flow prediction, the feature amplification part must perform feature amplification on the compressed features to restore the optical flow features to a high-resolution state; this further improves the accuracy of face replacement recognition in the video stream. Illustratively, the feature amplification part consists of an unpooling network (the inverse of pooling, used to enlarge features) and an ordinary convolutional network; by enlarging the compressed features, the optical flow information they contain can be restored to its normal state.
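The compression/amplification structure described above might be sketched as follows: a few strided convolutions compress the concatenated pair into low-resolution features, and an upsampling stage restores resolution and predicts a 2-channel flow field. The layer count, widths, and the use of bilinear upsampling in place of unpooling are simplifying assumptions, not the patent's exact FlowNetSimple configuration.

```python
import torch
import torch.nn as nn

class TinyFlowNet(nn.Module):
    """Feature compression (strided convolutions) followed by feature
    amplification (upsampling + convolutions) producing a (u, v) flow map."""
    def __init__(self, in_channels: int = 6):   # two RGB frames stacked on channels
        super().__init__()
        self.compress = nn.Sequential(           # resolution /8: flow info compressed
            nn.Conv2d(in_channels, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.amplify = nn.Sequential(             # restore a high-resolution flow field
            nn.Upsample(scale_factor=8, mode='bilinear', align_corners=False),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2, 3, padding=1),        # per-pixel (u, v) displacement
        )

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        # pair: (N, 6, H, W) = previous and next frame concatenated on channels
        return self.amplify(self.compress(pair))
```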
FIG. 6 is a flowchart of an exemplary embodiment of step 140 in the embodiment shown in FIG. 1. As shown in fig. 6, step 140 includes at least the following steps:
step 141, determining a target image containing a face feature for the first time in the video stream according to the image feature of each frame of image in the image sequence, and acquiring the face feature as the target feature.
As described above, face replacement is performed by processing an original face image, so the face that appears for the first time in the video stream is very likely the target of the replacement; if other faces appear later in the video stream, face replacement exists in the video stream.
Based on this, this embodiment determines a target image in the video stream that contains facial features for the first time and takes those facial features as the target features.
It should be noted that the face appearing for the first time in the video stream may be the face of a specific identified user, obtained by performing face recognition on the image features of each frame of image in the video stream; or it may be a face obtained by facial feature comparison that does not correspond to any identified user and merely has a certain facial form, for example, image features that contain facial features.
And 142, starting from the image pair where the target image is located, tracking the target features in the image sequence, and comparing the image feature difference between the previous frame image and the next frame image in each image pair according to the optical flow features of each image pair in the video stream to obtain a comparison result corresponding to each image pair.
Since the optical flow characteristics of each image pair in the image sequence can reflect the motion information between two adjacent frames of images, in the case of determining the facial characteristics of the human face appearing for the first time in the video stream, the optical flow characteristics of each image pair in the image sequence can be combined to determine whether the facial characteristics are changed.
For example, for the previous frame image in each image pair, the position of each of its pixels in the next frame image can be determined from the feature information in the time dimension, and the pixels at the determined positions can then be compared, for example by computing feature similarity.
Therefore, in this embodiment, starting from the image pair where the target image is located, the target features are tracked in the video stream, and the image feature difference between the previous frame image and the next frame image of each image pair is compared according to the optical flow features of each image pair in the video stream. From the corresponding comparison results, it can be detected whether the target features that first appeared in the video stream change in the time dimension, and it can thus be determined whether face replacement occurs in the video stream.
And step 143, predicting the probability of face replacement in the video stream according to the comparison result corresponding to each image pair.
The prediction of the probability of face replacement in the video stream is essentially a sample classification process: the probability that the target features have been replaced is predicted from the comparison result corresponding to each image pair, giving the probability of face replacement in the video stream. In one embodiment, if the obtained probability is greater than a set threshold, it may be determined that face replacement has occurred in the video stream; otherwise, it is determined that no face replacement has occurred in the video stream.
Therefore, the method provided by the embodiment can accurately obtain the possibility of face replacement in the video stream.
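To make steps 141 to 143 concrete, here is a hedged sketch assuming a dense per-pair flow field: the previous frame is warped to the next frame's coordinates, the warped and actual frames are compared pair by pair, and the differences are pooled into a replacement probability. The pooling rule and the sigmoid stand-in for the classifier are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(prev: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Move each pixel of the previous frame (N, C, H, W) to the position
    the flow field (N, 2, H, W) says it occupies in the next frame."""
    n, _, h, w = prev.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys), dim=-1).float().unsqueeze(0).expand(n, -1, -1, -1)
    pos = base + flow.permute(0, 2, 3, 1)               # positions in the next frame
    pos = torch.stack((2 * pos[..., 0] / (w - 1) - 1,   # normalize to [-1, 1]
                       2 * pos[..., 1] / (h - 1) - 1), dim=-1)
    return F.grid_sample(prev, pos, align_corners=True)

def face_replacement_probability(pairs, flows, threshold: float = 0.5):
    """Compare each warped previous frame with its actual next frame and
    pool the per-pair differences into a probability-like score."""
    diffs = []
    for (prev, nxt), flow in zip(pairs, flows):
        warped = warp_by_flow(prev, flow)
        diffs.append(F.l1_loss(warped, nxt).item())     # comparison result per pair
    prob = torch.sigmoid(torch.tensor(max(diffs))).item()  # crude classifier stand-in
    return prob, prob > threshold                       # probability, replaced?
```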
In another exemplary embodiment, the method shown in fig. 6 is executed by a face replacement probability prediction model trained in advance: by inputting the image features of each frame of image in the image sequence, the optical flow features of each image pair in the video stream, and other such information into the face replacement probability prediction model, the probability value of face replacement occurring in the video stream output by the model is correspondingly obtained.
In the training of the face replacement probability prediction model, a plurality of video positive samples containing real face features and video negative samples containing virtual face features need to be obtained, and the mixed video positive samples and video negative samples are used as training sample data to carry out iterative training of the face replacement probability prediction model until the face replacement probability prediction model is converged.
For example, a positive video sample may be a video of a user who has passed liveness detection, containing dynamic facial processes such as blinking and raising the head. A negative video sample may be a forged video in which the face in face-containing video data has been replaced with another, virtual face using an image generation technique, including but not limited to deep-learning-based image generation and conventional image processing methods such as PS (Photoshop).
In the iterative training of the face replacement probability prediction model, the loss value of each round is calculated from that round's training result, the model parameters are updated by back-propagating the loss value, and the updated model continues with the next round of training, until the obtained loss value is less than a set threshold, which indicates that the face replacement probability prediction model has converged.
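A sketch of this training loop under the stated convergence criterion; `model`, the data `loader`, the learning rate, and the loss threshold are assumptions for illustration (the patent does not specify an optimizer or loss function; binary cross-entropy is a natural choice for a two-class replaced/real decision).

```python
import torch
import torch.nn as nn

def train_until_converged(model: nn.Module, loader,
                          loss_threshold: float = 0.05, max_epochs: int = 100):
    """Iterate: compute the loss, back-propagate it, update the model
    parameters, and stop once the loss falls below the set threshold."""
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for features, labels in loader:
            # labels: 1.0 for a forged (negative) sample with a virtual face,
            # 0.0 for a liveness-verified (positive) sample with a real face
            optimizer.zero_grad()
            probs = model(features)        # predicted face replacement probability
            loss = criterion(probs, labels)
            loss.backward()                # back-propagate the loss value
            optimizer.step()               # update the model parameters
            epoch_loss += loss.item()
        if epoch_loss / max(len(loader), 1) < loss_threshold:
            break                          # the model is considered converged
    return model
```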
Fig. 7 is a block diagram illustrating a face replacement detection apparatus according to an exemplary embodiment. As shown in fig. 7, the apparatus includes at least an image frame extraction module 210, an image pair feature extraction module 220, an optical flow feature extraction module 230, and a face replacement prediction module 240.
The image frame extraction module 210 is configured to extract an image frame from a video stream to be detected, so as to obtain an image sequence corresponding to the video stream.
The image pair feature extraction module 220 is configured to extract image pair features of each image pair in the image sequence, where two adjacent frames of images in the image sequence are taken as an image pair.
The optical flow feature extraction module 230 is configured to obtain optical flow features of each image pair according to the image pair features of each image pair, where the optical flow features reflect pixel motion information of a subsequent frame image relative to a previous frame image in the image pair.
The face replacement prediction module 240 is configured to perform recognition on changes of face features included in consecutive images in the image sequence according to the image features of each frame of image in the image sequence and the optical flow features of each image pair in the image sequence, so as to obtain a probability of face replacement occurring in the video stream.
In another exemplary embodiment, the image frame extraction module 210 includes a key frame identification unit and an image sequence acquisition unit.
The key frame identification unit is used for identifying key images in the video stream according to the image characteristics of each extracted frame image in the process of extracting image frames of the video stream to be detected.
The image sequence acquisition unit is used for acquiring an image set formed by key images as an image sequence corresponding to the video stream.
In another exemplary embodiment, the image pair feature extraction module 220 includes a feature extraction unit, a feature stacking unit, and a feature separation unit.
The feature extraction unit is used for extracting feature information of each frame of image in the image sequence in a space dimension and a channel dimension to obtain original image features of each frame of image.
The feature stacking unit is used for sequentially stacking original image features of images contained in each image pair in the image sequence to obtain the original image pair features of each image pair, wherein the original image pair features contain feature information of the image pairs in a time dimension.
The characteristic separation unit is used for sequentially carrying out characteristic separation processing on the time dimension and the channel dimension on the characteristics of each original image pair to obtain the image pair characteristics of each image pair.
In another exemplary embodiment, the optical flow feature extraction module 230 includes a feature compression unit and a feature enlargement unit.
The feature compression unit is used for performing feature compression processing on the image pair features to obtain compressed features of the image pair, wherein the compressed features contain the compressed optical flow information of the image pair.
The feature amplification unit is used for performing feature amplification processing on the compressed features to obtain optical flow features of the image pair.
In another exemplary embodiment, the face replacement prediction module 240 includes a target feature acquisition unit, an image feature comparison unit, and a face replacement probability prediction unit.
The target feature acquiring unit is used for determining a target image which contains the face feature for the first time in the video stream according to the image feature of each frame of image in the image sequence, and acquiring the face feature as the target feature.
The image feature comparison unit is used for tracking the target features in the image sequence by taking the image pair where the target image is located as a start according to the optical flow features of each image pair in the video stream, comparing the image feature difference between the previous frame image and the next frame image in each image pair, and obtaining the comparison result corresponding to each image pair.
And the human face replacement probability prediction unit is used for predicting the probability of human face replacement in the video stream according to the comparison result corresponding to each image pair.
It should be noted that the apparatus provided in the foregoing embodiment and the method provided in the foregoing embodiment belong to the same concept, and the specific manner in which each module performs operations has been described in detail in the method embodiment, and is not described again here.
In an exemplary embodiment, the present application further provides an electronic device, which includes a processor and a memory, where the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, implement the face replacement detection method as described above.
FIG. 8 is a hardware schematic diagram of an electronic device shown in accordance with an example embodiment.
It should be noted that this electronic device is merely an example adapted to the present application and should not be considered as limiting the scope of use of the application in any way. Nor should the electronic device be construed as needing to rely on, or as having to include, one or more components of the exemplary electronic device illustrated in fig. 8.
The hardware structure of the electronic device may vary greatly with configuration or performance. As shown in fig. 8, the electronic device includes: a power supply 310, an interface 330, at least one memory 350, and at least one central processing unit (CPU) 370.
The power supply 310 is used for providing an operating voltage for each hardware device on the electronic device.
The interface 330 includes at least one wired or wireless network interface 331, at least one serial-to-parallel conversion interface 333, at least one input/output interface 335, and at least one USB interface 337, etc. for communicating with external devices.
The memory 350 may be a read-only memory, a random access memory, a magnetic or optical disk, etc. as a carrier for storing resources, such as an operating system 351, application programs 353, data 355, etc., in a transient or permanent manner.
The operating system 351 is used to manage and control the hardware devices and the application programs 353 on the electronic device, so that the central processing unit 370 can compute and process the mass data 355; it may be Windows Server, Mac OS X, Unix, Linux, etc. The application program 353 is a computer program that performs at least one specific task on top of the operating system 351 and may include at least one module (not shown in fig. 8), each of which may contain a series of computer-readable instructions for the electronic device. The data 355 may be neural network data stored on disk, and the like.
Central processor 370 may include one or more processors and is arranged to communicate with memory 350 via a bus for computing and processing mass data 355 in memory 350.
As described in detail above, an electronic device to which the present application applies implements the face replacement detection method described in the foregoing embodiments by having the central processing unit 370 read a series of computer-readable instructions stored in the memory 350.
Furthermore, the present application can also be implemented by hardware circuitry or by a combination of hardware circuitry and software instructions, and thus the implementation of the present application is not limited to any specific hardware circuitry, software, or combination of both.
In an exemplary embodiment, the present application further provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the face replacement detection method as described above.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A face replacement detection method, comprising:
extracting image frames of a video stream to be detected to obtain an image sequence corresponding to the video stream;
taking two adjacent frames of images in the image sequence as an image pair, and extracting image pair features of each image pair in the image sequence;
respectively acquiring optical flow characteristics of each image pair according to the image pair characteristics of each image pair, wherein the optical flow characteristics reflect pixel motion information of a next frame image relative to a previous frame image in each image pair;
and according to the image characteristics of each frame of image in the image sequence and the optical flow characteristics of each image pair in the image sequence, identifying the face characteristic change contained in the continuous images in the image sequence, and obtaining the probability of face replacement in the video stream.
2. The method of claim 1, wherein the extracting image pair features of each image pair in the image sequence by using two adjacent frames of images in the image sequence as an image pair comprises:
obtaining the original image characteristics of each frame of image by extracting the characteristic information of each frame of image in the image sequence on the space dimension and the channel dimension;
sequentially stacking original image features of images contained in each image pair in the image sequence to obtain the original image pair features of each image pair, wherein the original image pair features contain feature information of the image pairs in a time dimension;
and sequentially carrying out feature separation processing on the time dimension and the channel dimension on the features of each original image pair to obtain the image pair features of each image pair.
3. The method of claim 1, wherein said separately obtaining optical flow features for each of said image pairs from image pair features for each of said image pairs comprises:
performing feature compression processing on the image pair features to obtain compressed features of the image pair, wherein the compressed features comprise compressed optical flow information of the image pair;
and performing feature amplification processing on the compressed features to obtain optical flow features of the image pair.
4. The method according to claim 1, wherein the identifying the change of the face feature contained in the continuous images in the image sequence according to the image feature of each frame of image in the image sequence and the optical flow feature of each image pair in the image sequence to obtain the probability of face replacement in the video stream comprises:
determining a target image containing human face features for the first time in the video stream according to the image features of each frame of image in the image sequence, and acquiring the human face features as target features;
taking the image pair where the target image is located as a start, tracking the target feature in the image sequence, and comparing the image feature difference between the previous frame image and the next frame image in each image pair according to the optical flow feature of each image pair in the video stream to obtain a comparison result corresponding to each image pair;
and predicting the probability of face replacement in the video stream according to the comparison result corresponding to each image pair.
5. The method of any one of claims 1 to 4, wherein extracting image frames from the video stream to be detected to obtain the image sequence corresponding to the video stream comprises:
identifying, while extracting image frames from the video stream to be detected, the key images in the video stream according to the image features of each extracted frame;
and taking the set of key images as the image sequence corresponding to the video stream.
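Claim 5 does not define what makes an image "key". One common heuristic, sketched below, keeps a frame whenever its features drift far enough from the last kept frame; the distance metric and threshold are illustrative choices, not the patented criterion.

    import numpy as np

    def select_key_images(frames, features, threshold=0.5):
        """Keep a frame when its features drift from the last kept frame."""
        keys, last = [], None
        for frame, feat in zip(frames, features):
            if last is None or np.linalg.norm(feat - last) > threshold:
                keys.append(frame)
                last = feat
        return keys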
6. The method of claim 1, wherein the step of identifying the face feature changes contained in consecutive images of the image sequence according to the image features of each frame and the optical flow features of each image pair, to obtain the probability of face replacement in the video stream, is performed by a pre-trained face replacement probability prediction model, and the training method of the model comprises:
acquiring a plurality of positive video samples containing real face features and negative video samples containing virtual face features;
and iteratively training the face replacement probability prediction model on the mixed positive and negative video samples as training data until the model converges.
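A hedged sketch of the training procedure in claim 6: real-face clips labeled 1 and virtual-face clips labeled 0 are mixed into one loader, and training iterates until the epoch loss plateaus, which stands in for the unspecified convergence criterion. The model is assumed to output a sigmoid probability so that binary cross-entropy applies; the optimizer and hyperparameters are illustrative.

    import torch
    import torch.nn as nn

    def train_swap_model(model, mixed_loader, epochs=20, tolerance=1e-4):
        """Iterate over mixed positive/negative clips until the loss plateaus."""
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        criterion = nn.BCELoss()    # assumes the model outputs a probability
        previous = float("inf")
        for _ in range(epochs):
            total = 0.0
            for clips, labels in mixed_loader:   # real = 1.0, virtual = 0.0
                optimizer.zero_grad()
                loss = criterion(model(clips), labels)
                loss.backward()                  # backpropagate
                optimizer.step()
                total += loss.item()
            if abs(previous - total) < tolerance:  # treat plateau as convergence
                break
            previous = total
        return model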
7. A face replacement detection apparatus, comprising:
an image frame extraction module, configured to extract image frames from a video stream to be detected to obtain an image sequence corresponding to the video stream;
an image pair feature extraction module, configured to take each two adjacent frames in the image sequence as an image pair and extract the image pair features of each image pair in the image sequence;
an optical flow feature extraction module, configured to acquire the optical flow features of each image pair according to its image pair features, wherein the optical flow features reflect the pixel motion of the next frame relative to the previous frame in each image pair;
and a face replacement prediction module, configured to identify, according to the image features of each frame in the image sequence and the optical flow features of each image pair, the face feature changes contained in consecutive images of the image sequence, to obtain the probability of face replacement in the video stream.
8. The apparatus of claim 7, wherein the face replacement prediction module comprises:
a target feature acquisition unit, configured to determine, according to the image features of each frame in the image sequence, the target image in which a face feature first appears in the video stream, and to acquire that face feature as the target feature;
an image feature comparison unit, configured to track the target feature through the image sequence starting from the image pair containing the target image, and to compare the image feature difference between the previous and next frames of each image pair according to the optical flow features of each image pair in the video stream, to obtain a comparison result for each image pair;
and a face replacement probability prediction unit, configured to predict the probability of face replacement in the video stream according to the comparison results of the image pairs.
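Read together, claims 7 and 8 recast the method of claims 1 and 4 as cooperating modules. A composition sketch, with all class and callable names invented for illustration:

    class FaceReplacementDetector:
        """Wires the four modules of claim 7 into one detector."""

        def __init__(self, frame_extractor, pair_extractor,
                     flow_extractor, predictor):
            self.frame_extractor = frame_extractor   # claim 7, module 1
            self.pair_extractor = pair_extractor     # claim 7, module 2
            self.flow_extractor = flow_extractor     # claim 7, module 3
            self.predictor = predictor               # claim 7, module 4

        def detect(self, video_path):
            frames = self.frame_extractor(video_path)
            pair_features = self.pair_extractor(frames)
            flow_features = self.flow_extractor(pair_features)
            return self.predictor(frames, pair_features, flow_features)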
9. An electronic device, comprising:
a memory storing computer-readable instructions;
and a processor configured to read the computer-readable instructions stored in the memory to perform the method of any one of claims 1 to 6.
10. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor of a computer, cause the computer to perform the method of any one of claims 1 to 6.
CN202010353319.8A 2020-04-28 2020-04-28 Face replacement detection method and device, electronic equipment and computer storage medium Pending CN111611873A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010353319.8A CN111611873A (en) 2020-04-28 2020-04-28 Face replacement detection method and device, electronic equipment and computer storage medium


Publications (1)

Publication Number Publication Date
CN111611873A true CN111611873A (en) 2020-09-01

Family

ID=72199721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010353319.8A Pending CN111611873A (en) 2020-04-28 2020-04-28 Face replacement detection method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN111611873A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908140A (en) * 2010-07-29 2010-12-08 中山大学 Biopsy method for use in human face identification
CN105447432A (en) * 2014-08-27 2016-03-30 北京千搜科技有限公司 Face anti-fake method based on local motion pattern
US20160217338A1 (en) * 2015-01-26 2016-07-28 Alibaba Group Holding Limited Method and device for face in-vivo detection
CN108229329A (en) * 2017-03-16 2018-06-29 北京市商汤科技开发有限公司 Face false-proof detection method and system, electronic equipment, program and medium
CN107527337A (en) * 2017-08-07 2017-12-29 杭州电子科技大学 A kind of object video based on deep learning removes altering detecting method
CN110210393A (en) * 2019-05-31 2019-09-06 百度在线网络技术(北京)有限公司 The detection method and device of facial image
CN110245612A (en) * 2019-06-14 2019-09-17 百度在线网络技术(北京)有限公司 The detection method and device of facial image
CN110880172A (en) * 2019-11-12 2020-03-13 中山大学 Video face tampering detection method and system based on cyclic convolution neural network

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102157A (en) * 2020-09-09 2020-12-18 咪咕文化科技有限公司 Video face changing method, electronic device and computer readable storage medium
CN112101296A (en) * 2020-10-14 2020-12-18 杭州海康威视数字技术股份有限公司 Face registration method, face verification method, device and system
CN112101296B (en) * 2020-10-14 2024-03-08 杭州海康威视数字技术股份有限公司 Face registration method, face verification method, device and system
CN112528872A (en) * 2020-12-15 2021-03-19 中化资本数字科技有限公司 Training method and device of face detection model based on video stream and computing equipment
WO2022134418A1 (en) * 2020-12-24 2022-06-30 平安科技(深圳)有限公司 Video recognition method and related device
CN112686123A (en) * 2020-12-25 2021-04-20 科大讯飞股份有限公司 False video detection method and device, electronic equipment and storage medium
CN114760524A (en) * 2020-12-25 2022-07-15 深圳Tcl新技术有限公司 Video processing method and device, intelligent terminal and computer readable storage medium
CN112861671A (en) * 2021-01-27 2021-05-28 电子科技大学 Method for identifying deeply forged face image and video
CN113158818A (en) * 2021-03-29 2021-07-23 青岛海尔科技有限公司 Method, device and equipment for identifying fake video
CN113158818B (en) * 2021-03-29 2023-04-07 青岛海尔科技有限公司 Method, device and equipment for identifying fake video
TWI824892B (en) * 2022-08-03 2023-12-01 大陸商中國銀聯股份有限公司 A face manipulation detection method and its detection device based on optical flow analysis

Similar Documents

Publication Publication Date Title
CN111611873A (en) Face replacement detection method and device, electronic equipment and computer storage medium
CN108846355B (en) Image processing method, face recognition device and computer equipment
US11195037B2 (en) Living body detection method and system, computer-readable storage medium
CN108805047B (en) Living body detection method and device, electronic equipment and computer readable medium
EP3916627A1 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN111738735B (en) Image data processing method and device and related equipment
CN111160313B (en) Face representation attack detection method based on LBP-VAE anomaly detection model
CN111680672B (en) Face living body detection method, system, device, computer equipment and storage medium
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
CN107111755B (en) Video counterfeit detection method and system based on liveness evaluation
Smith-Creasey et al. Continuous face authentication scheme for mobile devices with tracking and liveness detection
CN111339897B (en) Living body identification method, living body identification device, computer device, and storage medium
CN111667001A (en) Target re-identification method and device, computer equipment and storage medium
Rehman et al. Enhancing deep discriminative feature maps via perturbation for face presentation attack detection
CN112001983A (en) Method and device for generating occlusion image, computer equipment and storage medium
CN114724218A (en) Video detection method, device, equipment and medium
CN110688878B (en) Living body identification detection method, living body identification detection device, living body identification detection medium, and electronic device
Ahmad et al. Person re-identification without identification via event anonymization
CN115424001A (en) Scene similarity estimation method and device, computer equipment and storage medium
Muhammad et al. Deep ensemble learning with frame skipping for face anti-spoofing
CN114596638A (en) Face living body detection method, device and storage medium
Ramkissoon et al. Scene and Texture Based Feature Set for DeepFake Video Detection
CN113128289B (en) Face recognition feature extraction calculation method and equipment
Pala A comparative study of deep learning based face recognition algorithms for video under adverse conditions
Ren et al. The Shallow Network with Learnable Attention Module For Face Anti-spoofing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination