CN114092864B - Fake video identification method and device, electronic equipment and computer storage medium


Info

Publication number: CN114092864B
Application number: CN202210057918.4A
Authority: CN (China)
Prior art keywords: feature map, face image, feature, image, frame
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN114092864A
Inventors: 宋旭军, 黄双龙, 杨智
Current Assignee: Hunan Xindatong Information Technology Co., Ltd.
Original Assignee: Hunan Xindatong Information Technology Co., Ltd.
Application filed by Hunan Xindatong Information Technology Co., Ltd.; priority to CN202210057918.4A; published as CN114092864A (application) and CN114092864B (grant)

Classifications

    • G06N3/045 Combinations of networks (G Physics; G06 Computing; Calculating or Counting; G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/08 Learning methods (G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and a device for identifying forged video, an electronic device, and a computer storage medium. The method comprises the following steps: acquiring a video to be processed that contains face images; performing feature extraction at least twice on each frame of face image in the video to obtain feature maps of at least two levels corresponding to that frame; for each frame of face image, connecting at least two of the at least two levels of feature maps in series at least once to obtain at least one series feature map; and performing forgery identification on the video to be processed according to the at least one series feature map and the at least two levels of feature maps corresponding to each frame, to obtain an identification result. By connecting feature maps drawn from at least two levels in series, the method strengthens the expression of the forged features, so the identification result determined from the at least one series feature map and the at least two levels of feature maps is more accurate.

Description

Fake video identification method and device, electronic equipment and computer storage medium
Technical Field
The invention relates to the field of image processing and machine learning, in particular to a method and a device for identifying a forged video, an electronic device and a computer storage medium.
Background
With the development of video synthesis technology, many forged videos in which the video images have been tampered with have appeared on the Internet. In these forged videos the images are distorted yet remain visually realistic, so it is difficult to distinguish real videos from forged ones manually, which creates security risks. An effective method for identifying forged video is therefore urgently needed.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a method and a device for identifying forged video, an electronic device, and a computer storage medium that can identify forged videos accurately.
In a first aspect, the technical solution for solving the above technical problem of the present invention is as follows: a method of identifying counterfeit video, the method comprising:
step 110, acquiring a video to be processed comprising a face image;
step 120, extracting a first feature map corresponding to each frame of face image in the video to be processed, performing at least one time of feature extraction on the first feature map to obtain at least one level feature map corresponding to the face image, and taking the first feature map and the at least one level feature map as at least two level feature maps corresponding to the face image;
step 130, for each frame of facial image, at least two feature maps in at least two levels of feature maps of the facial image are connected in series at least once to obtain at least one series feature map;
step 140, performing forgery identification on the video to be processed according to the at least one series feature map and the at least two levels of feature maps corresponding to each frame of face image, to obtain an identification result.
The invention has the following beneficial effects. When identifying a video to be processed, feature maps of at least two levels are first extracted for each frame of face image, so that forged features of different depths in the face image are expressed by feature maps of different levels. At least two feature maps among the at least two levels are then connected in series at least once to obtain at least one series feature map; because each series feature map is determined from feature maps of at least two levels, it strengthens the expression of the forged features. Finally, when forgery identification is performed on the video based on the at least one series feature map and the at least two levels of feature maps, this reuse of features from multiple levels makes the expression of the forged features stronger. The forged features in the video to be processed can therefore be identified more accurately, and the obtained identification result is more accurate.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the method also includes:
extracting a texture feature map of each frame of face image;
the step 140 specifically includes:
and performing forgery identification on the video to be processed according to at least one series characteristic image, at least two levels of characteristic images and texture characteristic images corresponding to each frame of face image to obtain an identification result.
This further scheme has the following advantage: artifacts in forged images are usually very prominent in the texture information of shallow feature maps, so a texture feature map can be extracted from each frame of face image to reflect the shallow forged features, while the at least one series feature map and the at least two levels of feature maps express the deep forged features. Forgery identification of the video to be processed can then be based on both the deep and the shallow forged features, which makes the identification result more accurate.
Further, the step 140 specifically includes:
for each frame of face image, at least one series feature image corresponding to the face image and at least one feature image in at least two levels of feature images are connected in series to obtain a detail feature image corresponding to the face image;
for each frame of face image, connecting a texture feature image and a detail feature image corresponding to the face image in series to obtain a forged feature image corresponding to the face image;
for each frame of face image, determining a classification result corresponding to the face image according to a forged feature map corresponding to the face image, wherein the classification result comprises a real image and a forged image;
and determining the identification result of the video to be processed according to the classification result corresponding to each frame of face image.
This has the following advantages. The at least one series feature map and the at least two levels of feature maps reflect the deep forged features of the face image, and connecting the at least one series feature map with at least one of the at least two levels of feature maps enhances the expression of those deep features, so the resulting detail feature map expresses the forged features more strongly. The detail feature map and the texture feature map then express the forged features at the deep and shallow levels respectively, so the face image can be accurately classified as a real image or a forged image based on the forged feature map, and the identification result of the video to be processed, determined from the classification result of each frame of face image, is more accurate.
Further, for each frame of face image, the above-mentioned connecting the texture feature map and the detail feature map in series to obtain a forged feature map, including:
connecting the texture feature map and the detail feature map in series to obtain an initial feature map;
dividing the initial feature map into at least two sub-feature maps;
respectively extracting the features of each sub-feature map to obtain a depth sub-feature map corresponding to each sub-feature map;
and connecting the depth sub-feature maps in series to obtain the feature maps after connection in series, and taking the feature maps after connection in series as the forged feature maps.
This has the following advantages. Connecting the texture feature map and the detail feature map in series yields an initial feature map that expresses the forged features in the face image, and further feature extraction on the initial feature map yields still deeper forged features. In addition, dividing the initial feature map into at least two sub-feature maps before feature extraction reduces the amount of data to process and improves processing efficiency.
Further, step 120, step 130, and, for each frame of face image, connecting the at least one series feature map with at least one of the at least two levels of feature maps in series to obtain the detail feature map, connecting the texture feature map and the detail feature map in series to obtain the forged feature map, and determining the classification result of the face image from its forged feature map, are all implemented by a forged face recognition model; the forged face recognition model is obtained by training a neural network model.
The technical scheme has the advantages that the purpose of automatically identifying whether the video to be processed is a forged video or not can be achieved by adopting the forged face recognition model trained by the neural network to forge and recognize the video to be processed, so that the recognition process is more intelligent.
Further, the forged face recognition model includes at least two convolution layers, and the step 120 specifically includes:
extracting feature maps of at least two levels of the facial image through at least two convolution layers for each frame of facial image in a video to be processed;
the above-mentioned forged face recognition model also includes the pooling layer, for every frame of facial image, extracts the texture characteristic map of the facial image, including:
for each frame of face image, performing first-time feature extraction on the face image through at least one convolution layer of at least two convolution layers to obtain a first feature map;
for each frame of face image, processing the first feature map through a pooling layer to obtain a pooling feature map;
for each frame of face image, determining a texture feature map of the face image according to the pooling feature map and the first feature map;
above-mentioned counterfeit face identification model still includes two at least convolution modules, and every convolution module is including at least one convolution layer of establishing ties in proper order, and the aforesaid carries out feature extraction to every sub-feature graph respectively, obtains the degree of depth sub-feature graph that every sub-feature graph corresponds, includes:
and for each sub-feature map, sequentially inputting the sub-feature map into each convolution module of at least two convolution modules to obtain a depth sub-feature map corresponding to the sub-feature map.
This has the following advantages. Feature maps of at least two levels of the face image can be extracted through the convolution layers, and the more convolution layers there are, the more levels can be extracted. Average pooling of the first feature map through the pooling layer reduces the dimensionality of the features output by the convolution layer, speeds up computation, and prevents overfitting. Extracting features from each sub-feature map through at least two convolution modules yields deeper features.
In a second aspect, the present invention provides an apparatus for identifying a counterfeit video, which includes:
the video acquisition module is used for acquiring a video to be processed comprising a face image;
the characteristic image extraction module is used for extracting a first characteristic image corresponding to each frame of face image in the video to be processed, performing at least one time of characteristic extraction on the first characteristic image to obtain at least one hierarchy characteristic image corresponding to the face image, and taking the first characteristic image and the at least one hierarchy characteristic image as at least two hierarchy characteristic images corresponding to the face image;
the characteristic image tandem connection module is used for performing tandem connection on at least two characteristic images in at least two levels of characteristic images of the facial images for each frame of facial image at least once to obtain at least one tandem characteristic image;
and the identification module is used for performing counterfeiting identification on the video to be processed according to the at least one series characteristic diagram and the at least two levels of characteristic diagrams corresponding to each frame of face image to obtain an identification result.
In a third aspect, the present invention provides an electronic device to solve the above technical problem, where the electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement the method for identifying a counterfeit video according to the present application.
In a fourth aspect, the present invention further provides a computer-readable storage medium for solving the above technical problems, the computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the method for identifying a counterfeit video according to the present application.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly described below.
Fig. 1 is a schematic flowchart of a method for identifying a counterfeit video according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network structure of a neural network model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a network structure of another neural network model according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an apparatus for identifying a counterfeit video according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with examples which are set forth to illustrate, but are not to be construed to limit the scope of the invention.
In the prior art, schemes that perform forgery identification on a video through a convolutional neural network mainly extract forged features by applying convolution operations to each frame of the video multiple times and then identify forgery based on those features. However, after many convolution layers the forged features become less and less obvious and may even be lost, which degrades model performance and the accuracy of the identification result. Further improving performance would require changing the inherent form of the convolution layer to force it to learn more stable forgery noise from the input image, that is, changing the network structure of the convolutional neural network, which is relatively complex to implement.
Aiming at these problems, an embodiment of the invention provides a method for identifying forged video. When identifying a video to be processed, feature maps of at least two levels are first extracted for each frame of face image, so that forged features of different depths are expressed by feature maps of different levels. At least two of these feature maps are then connected in series at least once to obtain at least one series feature map, which enhances the expression of the forged features. Finally, when forgery identification is performed on the video based on the at least one series feature map and the at least two levels of feature maps, this reuse of features from multiple levels makes the expression of the forged features stronger, so the forged features in the video can be identified more accurately and the identification result is more accurate. In addition, the scheme of the invention does not change the network structure of the neural network model, so it is relatively simple to implement.
The technical solution of the present invention and how to solve the above technical problems will be described in detail with specific embodiments below. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
The scheme provided by the embodiment of the invention can be applied in any scenario that requires identifying whether a video is forged. It can be executed by any electronic device, for example a user's terminal device, which may be any terminal device on which an application can be installed and which can access web pages through that application, including at least one of the following: smart phones, tablet computers, notebook computers, desktop computers, smart speakers, smart watches, smart televisions, and smart in-vehicle devices.
An embodiment of the present invention provides a possible implementation manner, and as shown in fig. 1, provides a flowchart of a method for identifying a counterfeit video, where the scheme may be executed by any electronic device, for example, may be a terminal device, or may be executed by both the terminal device and a server. For convenience of description, the method provided by the embodiment of the present invention will be described below by taking a server as an execution subject, and as shown in the flowchart shown in fig. 1, the method may include the following steps:
step 110, acquiring a video to be processed comprising a face image;
step 120, extracting a first feature map corresponding to each frame of face image in the video to be processed, performing at least one time of feature extraction on the first feature map to obtain at least one level feature map corresponding to the face image, and taking the first feature map and the at least one level feature map as at least two level feature maps corresponding to the face image;
step 130, for each frame of facial image, at least two feature maps in at least two levels of feature maps of the facial image are connected in series at least once to obtain at least one series feature map;
step 140, performing forgery identification on the video to be processed according to the at least one series feature map and the at least two levels of feature maps corresponding to each frame of face image, to obtain an identification result.
By this method, when a video to be processed is identified, feature maps of at least two levels are extracted for each frame of face image, so forged features of different depths can be expressed by feature maps of different levels. At least two of those feature maps are connected in series at least once to obtain at least one series feature map; because each series feature map is determined from feature maps of at least two levels, it enhances the expression of the forged features. Finally, when forgery identification is performed on the video based on the at least one series feature map and the at least two levels of feature maps, this feature reuse makes the expression of the forged features stronger, so the forged features in the video can be identified more accurately and the identification result is more accurate.
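To make steps 110 to 140 concrete, the following is a minimal sketch in PyTorch-style Python. The module, its channel widths, the choice of which maps to connect in series, and the classifier head are all illustrative assumptions rather than the patent's reference implementation; the point of the sketch is the feature reuse in the final concatenation, where the series feature map and every level's feature map feed the classifier together.

```python
# Hedged sketch of steps 110-140; shapes and the classifier are assumptions.
import torch
import torch.nn as nn

class MultiLevelExtractor(nn.Module):
    """Step 120: one feature map per level, each level one convolution deeper."""
    def __init__(self, channels=(3, 3, 6, 12)):  # assumed channel widths
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels[i], channels[i + 1], 3, padding=1)
            for i in range(len(channels) - 1)
        )

    def forward(self, x):
        levels = []
        for conv in self.convs:
            x = torch.relu(conv(x))
            levels.append(x)                      # feature map of one level
        return levels

def identify_frame(face, extractor, classifier):
    levels = extractor(face)                      # step 120: >= 2 levels
    series = torch.cat([levels[0], levels[1]], 1) # step 130: series feature map
    fused = torch.cat([series] + levels, dim=1)   # step 140: feature reuse
    return classifier(fused)                      # real-vs-forged score
```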
The following provides a further description of the solution of the present invention with reference to the following specific embodiments, in which the method for identifying a counterfeit video provided by this embodiment may include the following steps:
step S110, a video to be processed including a face image is obtained.
The video to be processed may be a video captured by a video acquisition device or a video uploaded by a user. Each frame of the video may contain both face regions and non-face regions; in the scheme of the invention, only the face images in the video are processed subsequently, in order to identify whether the faces in them are forged.
If the video to be processed contains too many frames of face images, a set number of frames can be selected for subsequent processing in order to reduce the amount of data to process; the identification result of the video can still be accurately determined from the identification results of that set number of frames.
Considering that the forged features are mainly concentrated in the face region, each frame of face image can be cropped before processing to obtain an image containing only the face region; using the cropped image narrows the data-processing range and focuses on the important region.
To preserve the forgery traces as much as possible and incorporate some spatial background information from the face image, the cropped image may be expanded outward, for example by a set amount along its width and height.
Because the face regions occupy different areas in different face images, the cropped images may differ in size from frame to frame. The cropped (or expanded) images can therefore be unified to a set size, which is usually larger than the maximum size among the cropped or expanded images.
After the above processing, each frame of face image mentioned later may be the above image unified to the set size.
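As an illustration of this preprocessing, the following minimal sketch crops a detected face box, expands it along width and height, and resizes every crop to one set size. The face detector, the expansion ratio, and the 224 × 224 target size are illustrative assumptions (224 × 224 matches the specific embodiment described later), not fixed by the claims.

```python
import cv2  # OpenCV, also used in the specific embodiment below

def preprocess_face(frame, box, expand=0.2, size=(224, 224)):
    """Crop the face region, expand it to keep forgery traces plus some
    spatial background, then resize the crop to a uniform set size."""
    x, y, w, h = box                        # face box from any face detector
    dx, dy = int(w * expand), int(h * expand)
    x0, y0 = max(x - dx, 0), max(y - dy, 0)
    y1 = min(y + h + dy, frame.shape[0])    # clamp to the frame borders
    x1 = min(x + w + dx, frame.shape[1])
    crop = frame[y0:y1, x0:x1]
    return cv2.resize(crop, size)           # unify every crop to the set size
```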
Step S120, extracting a first feature map corresponding to each frame of face image in the video to be processed, performing at least one time of feature extraction on the first feature map to obtain at least one level feature map corresponding to the face image, and taking the first feature map and the at least one level feature map as at least two level feature maps corresponding to the face image.
Feature maps of at least two levels can be obtained by performing feature extraction at least twice on each frame of face image: each round of feature extraction yields the feature map of one level, and a feature map obtained by two rounds of extraction is deeper, and of a higher level, than one obtained by a single round. It should be noted that in step S120 at least two levels of feature maps are extracted for each frame of face image.
In an optional aspect of the present invention, another implementation manner of the step S120 specifically includes:
extracting a first feature map corresponding to each frame of face image;
for each frame of face image, subtracting the face image from the first feature image to obtain a second feature image corresponding to the face image, wherein the second feature image represents the difference between the face image and the first feature image;
and for each frame of face image, taking the second feature map as a feature map of an initial level, and performing feature extraction on the second feature map at least once to obtain at least one level feature map corresponding to the face image, wherein the at least two level feature maps corresponding to the face image comprise at least one level feature map corresponding to the face image and the feature map of the initial level.
Subtracting the face image from the first feature map yields the second feature map, which reflects the forged features in the face image: the subtraction removes some non-forged content from the first feature map, so the forged features expressed by the second feature map are more discriminative.
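A minimal sketch of this alternative, under the assumption that the first feature map keeps the spatial size and channel count of the input (a 3-channel convolution with padding), so the subtraction is well defined:

```python
import torch.nn as nn

conv1 = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # first feature extraction

def second_feature_map(face):
    f1 = conv1(face)   # first feature map of the face image
    return f1 - face   # second feature map: non-forged content largely cancels
```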
Step S130, for each frame of facial image, at least two feature maps in at least two levels of feature maps of the facial image are connected in series at least once to obtain at least one series feature map.
From the at least two levels of feature maps, at least two feature maps are selected and connected in series; each series connection yields one series feature map, and any one feature map may be connected in series multiple times with other feature maps of the at least two levels. Each frame of face image may be a three-channel image, for example an RGB image, and the feature maps corresponding to it may likewise be three-channel feature maps. Taking two feature maps as an example, connecting them in series means concatenating the dimensions of their corresponding channels; in the alternative of the invention, it means concatenating the dimensions corresponding to the third (channel) dimension of the two feature maps. Feature maps obtained by feature extraction alone lack multi-scale information and reduce the diversity of receptive fields, so series connection of feature maps makes good use of the multi-level feature maps and improves the expression of the forged features.
As an example, for example, the feature map a and the feature map B are three-channel images, where the dimension corresponding to the feature map a is 224 × 224 × 3, and the dimension corresponding to the feature map B is 224 × 224 × 6, the feature map a and the feature map B are connected in series, where the third dimension may be connected in series, and the dimension of the connected feature map is 224 × 224 × 9.
The above-mentioned primary concatenation specifically means that at least two feature maps are concatenated at a time, that is, the feature map to be concatenated at a time may be two feature maps or may be more than two feature maps, and one concatenated feature map is obtained by being concatenated at a time. For example, if at least two feature maps are two feature maps, the two feature maps are connected in series once, which means that the two feature maps are connected in series to obtain a series feature map; if at least two feature maps are three feature maps, the step of serially connecting the three feature maps at a time means that the three feature maps are serially connected at a time to obtain a serial feature map.
As an example, for example, the feature map a, the feature map B, and the feature map C are three-channel images, where the dimension corresponding to the feature map a is 224 × 224 × 3, the dimension corresponding to the feature map B is 224 × 224 × 6, and the dimension corresponding to the feature map C is 224 × 224 × 3, and the dimension of the feature map after concatenation obtained by concatenating the three feature maps once is 224 × 224 × 12.
Similarly, the two-time concatenation refers to performing two-time concatenation on at least three feature maps of at least two levels of feature maps to obtain two concatenated feature maps.
As an example, for example, the feature map a, the feature map B, and the feature map C are three-channel images, the dimension of the feature map a is 224 × 224 × 3, the dimension of the feature map B is 224 × 224 × 6, and the dimension of the feature map C is 224 × 224 × 3, the feature map a and the feature map B are connected in series once to obtain a 224 × 224 × 9 series feature map, and the feature map a and the feature map C are connected in series once to obtain a 224 × 224 × 6 series feature map.
If feature maps are connected in series two at a time, each pairwise connection counts as one series connection, and the feature map obtained from each pairwise connection can be used as a series feature map.
It should be noted that not every feature map of the at least two levels needs to participate in a series connection: a given feature map may participate in no connection at all, and a feature map that does participate does so at least once.
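Concretely, the series connection above is a concatenation along the third (channel) dimension. A small sketch reproducing the dimensions from the examples, with a channels-last layout assumed purely for readability:

```python
import torch

a = torch.randn(224, 224, 3)   # feature map A
b = torch.randn(224, 224, 6)   # feature map B
c = torch.randn(224, 224, 3)   # feature map C

ab = torch.cat([a, b], dim=2)       # one series connection: 224 x 224 x 9
abc = torch.cat([a, b, c], dim=2)   # three maps connected at once: 224 x 224 x 12
```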
In the alternative of the invention, the number of the series connection times is 1, the number of the at least two levels of feature maps is 4, and experiments prove that the corresponding recognition results and the calculated amount are both appropriate when the number of the series connection times is 1 and the number of the at least two levels of feature maps is 4.
And step S140, performing forgery identification on the video to be processed according to at least one serial feature map and at least two levels of feature maps corresponding to each frame of face image to obtain an identification result.
In an alternative of the present invention, the above-mentioned performing, according to at least one series feature map and at least two hierarchical feature maps corresponding to each frame of face image, a forgery identification on the video to be processed to obtain an identification result includes:
for each frame of face image, at least one series feature image corresponding to the face image and at least one feature image in at least two levels of feature images are connected in series to obtain a detail feature image corresponding to the face image;
for each frame of face image, determining a classification result corresponding to the frame of face image according to a detail characteristic image of the frame of face image, wherein the classification result comprises a real image and a forged image;
and determining the recognition result of the video to be processed according to the recognition result of each frame of face image.
In other words, the detail feature map of each frame of face image represents the forged features of that frame, and whether the frame is a real image or a forged image is identified from its detail feature map. The same judgment is made for every frame to determine its identification result (real image or forged image), and the identification result of the video to be processed (real video or forged video) is then determined from the per-frame identification results.
The specific implementation manner of concatenating the at least one concatenated feature map corresponding to the face image and the at least one feature map in the at least two levels of feature maps is the same as the concatenation manner described above, and is not described herein again.
Optionally, the identification result of the video can be determined from the per-frame identification results as follows: if the proportion of frames identified as forged images exceeds a set ratio, the video to be processed is identified as a forged video; otherwise, it is identified as a real video. The set ratio can be configured according to actual requirements and is not limited in the scheme of the invention.
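A minimal sketch of this decision rule; the 0.5 default for the set ratio is only an assumed placeholder, since the patent leaves the ratio configurable:

```python
def video_verdict(frame_is_forged, set_ratio=0.5):
    """frame_is_forged: per-frame classification results, True = forged image."""
    forged_ratio = sum(frame_is_forged) / len(frame_is_forged)
    return "forged video" if forged_ratio > set_ratio else "real video"
```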
In an alternative aspect of the invention, the method further comprises:
extracting a texture feature map of each frame of face image;
the step 140 specifically includes:
and performing forgery identification on the video to be processed according to at least one series characteristic image, at least two levels of characteristic images and texture characteristic images corresponding to each frame of face image to obtain an identification result.
Because artifacts in forged images are usually very prominent in the texture information of shallow feature maps, in the scheme of the invention a texture feature map can be extracted from each frame of face image to reflect the shallow forged features, while the at least one series feature map and the at least two levels of feature maps express the deep forged features. When forgery identification is performed on the video to be processed, it can then be based on both the deep and the shallow forged features, so the identification result is more accurate.
In an alternative of the invention, extracting the texture feature map of each frame of face image may include: extracting the first feature map of each frame; pooling each first feature map, for example by local average pooling, to obtain a pooled feature map; and, for each frame, subtracting the pooled feature map from the corresponding first feature map, the resulting feature map being the texture feature map. The purpose of pooling the first feature map is to reduce the amount of computation and prevent overfitting.
In consideration of the texture feature map, the step 140 specifically includes: for each frame of face image, connecting a texture feature image and a detail feature image corresponding to the face image in series to obtain a forged feature image corresponding to the face image;
for each frame of face image, determining a classification result corresponding to the face image according to a forged feature map corresponding to the face image, wherein the classification result comprises a real image and a forged image;
and determining the identification result of the video to be processed according to the classification result corresponding to each frame of face image.
When the identification result of each frame of face image is determined, the texture feature map and the detail feature map can be considered together, so that the forged features are expressed better, the identification result of the face image determined from the forged feature map is more accurate, and the identification result of the video to be processed is therefore more accurate.
In an alternative of the present invention, the obtaining a forged feature map by concatenating the texture feature map and the detail feature map for each frame of face image includes:
connecting the texture feature map and the detail feature map in series to obtain an initial feature map;
dividing the initial feature map into at least two sub-feature maps;
respectively extracting the features of each sub-feature map to obtain a depth sub-feature map corresponding to each sub-feature map;
and connecting the depth sub-feature maps in series, and taking the connected feature maps as fake feature maps.
In the scheme of the invention, the texture feature map and the detail feature map are connected in series to form the initial feature map, which expresses the forged features in the face image; further feature extraction on the initial feature map then yields deeper forged features. When performing this extraction, the initial feature map can be divided into at least two sub-feature maps that are processed separately, which reduces the amount of data to process and improves processing efficiency.
For each sub-feature map, feature extraction is performed at least once to obtain a depth sub-feature map. A depth sub-feature map contains features at a deeper level than its sub-feature map and can express more detailed features. The series connection of the depth sub-feature maps is performed in the same way as the series connections described above and is not repeated here.
In an alternative of the invention, step 120, step 130, and, for each frame of face image, connecting the at least one series feature map with at least one of the at least two levels of feature maps in series to obtain the detail feature map, connecting the texture feature map and the detail feature map in series to obtain the forged feature map, and determining the classification result of the face image from its forged feature map, are all implemented by a forged face recognition model. The forged face recognition model is obtained by training in the following way:
obtaining a plurality of sample videos, wherein the plurality of sample videos comprise real videos and fake videos;
for each sample video, determining each frame of sample face image in the sample video, wherein each frame of sample face image corresponds to one classification and annotation result, and for each frame of sample face image, the classification and annotation result represents that the sample face image is a real image or a forged image;
the forged face recognition model is obtained by performing the following training steps:
inputting each frame of sample face image corresponding to each sample video into a neural network model to obtain a prediction classification result of each frame of sample face image;
for each frame of sample face image, determining a loss value corresponding to the sample face image according to a prediction classification result and a classification labeling result of the sample face image;
determining the total loss value of the neural network model according to the loss value corresponding to each frame of sample face image;
and if the total loss value meets the training end condition, taking the neural network model meeting the training end condition as a forged face recognition model, if the total loss value does not meet the training end condition, adjusting model parameters of the neural network model, and repeating the training step until the total loss value meets the training end condition.
For each sample video, the prediction classification result of each frame of sample face image is obtained in the same way as the identification result of each frame of face image in the video to be processed described above, and is not repeated here.
Optionally, the neural network model may be a convolutional neural network model, or may be other neural network models, such as a cyclic neural network model.
The neural network model can comprise at least two convolution layers, a pooling layer, and convolution modules, where each convolution module can comprise at least one convolution layer, one pooling layer, one activation-function layer, and one normalization layer connected in series. Each of the at least two convolution layers extracts the feature map of one level, and the pooling layer reduces the dimensionality of the feature maps output by the convolution layers so as to reduce the amount of data to process. The convolution modules extract features from each sub-feature map: the convolution layer in a module extracts the depth sub-feature map of the sub-feature map, the pooling layer in the module reduces the dimensionality of that convolution layer's output, and the normalization layer accelerates model training.
The trained forged face recognition model retains the network structure of the neural network model. When the forged face recognition model is applied, since it includes at least two convolution layers, step 120 specifically includes:
and for each frame of facial image in the video to be processed, extracting at least two levels of characteristic maps of the facial image through at least two convolution layers.
In step S120, the first feature map corresponding to each frame of face image may be extracted through at least one of the at least two convolution layers, and at least one further round of feature extraction may be performed on the first feature map through at least one of the at least two convolution layers.
The forged face recognition model further includes a pooling layer, and extracting the texture feature map of each frame of face image includes:
for each frame of face image, performing first-time feature extraction on the face image through at least one convolution layer of at least two convolution layers to obtain a first feature map;
for each frame of face image, processing the first feature map through a pooling layer to obtain a pooling feature map;
for each frame of face image, determining a texture feature map of the face image according to the pooling feature map and the first feature map;
the forged face recognition model further comprises at least two convolution modules, each convolution module comprises at least one convolution layer which is sequentially connected in series, feature extraction is carried out on each sub-feature graph respectively, and a depth sub-feature graph corresponding to each sub-feature graph is obtained, and the method comprises the following steps:
and for each sub-feature map, sequentially inputting the sub-feature map into each convolution module of at least two convolution modules to obtain a depth sub-feature map corresponding to the sub-feature map.
It should be noted that the specific implementations of steps S120 and S130 take place inside the forged face recognition model; in step S140, determining the classification result of each frame of face image from its forged feature map also takes place inside the model, while determining the identification result of the video from the per-frame classification results takes place outside the model.
For a better illustration and understanding of the principles of the method provided by the present invention, the solution of the invention is described below with reference to an alternative embodiment. It should be noted that the specific implementation manner of each step in this specific embodiment should not be construed as a limitation to the scheme of the present invention, and other implementation manners that can be conceived by those skilled in the art based on the principle of the scheme provided by the present invention should also be considered as within the protection scope of the present invention.
Firstly, to train a fake face recognition model, referring to the schematic network structure shown in fig. 2 and fig. 3, the training process of the model includes the following steps:
step 1, obtaining a plurality of sample videos, wherein the plurality of sample videos comprise real videos and fake videos.
For each sample video, every frame of sample face image corresponds to a classification label (classification labeling result) that represents the true recognition result of that frame, namely real image or forged image; every sample video is a video containing face images. For convenience of description, a sample face image is hereinafter referred to as a sample image.
Step 2: preprocess each frame of sample image in each training sample to obtain preprocessed sample images; for convenience of description, a preprocessed sample image is hereinafter called image a.
The preprocessing of each frame of sample image in a sample video is as follows. For each frame, face feature points are detected with a cascade classifier (Cascade Classifier) in OpenCV, and the sample image is aligned based on the detected feature points to obtain the face region in that frame. Because the forged positions are mainly concentrated in the face region, using the face region obtained after alignment greatly reduces the algorithm's processing range and focuses the forgery detection on the important region. To preserve the forgery traces as much as possible and incorporate some spatial background information, each face region can be expanded outward by a certain amount along its width and height; the face regions are then uniformly resized to 224 × 224. The sample image obtained in this step is denoted image a and serves as the input of the neural network model.
Step 3: for each sample video, input each frame image a in the video into the neural network model in sequence, and obtain the prediction classification result of each frame image a through the neural network model.
For each frame of image a, the concrete implementation process of obtaining the prediction classification result of the frame of image a through the neural network model is as follows:
step a, generating through a convolutional neural network CNN in a neural network modelAnd forming an enhanced feature map (human face feature map) of the image a, namely extracting the features of the image a through the CNN. The enhanced feature map comprises a texture feature map (texture feature map)f tex And characteristic diagram of forged trace (detailed characteristic diagram)f reu
Specifically, the CNN may be divided into two branches, which are used for extracting the two feature maps respectivelyf tex Andf reu the specific network structure of CNN is shown in fig. 2. Taking an image a as an example, the texture feature map of the image a is extracted through two branches of the CNN respectivelyf tex And a characteristic map of the counterfeit tracef reu Specifically, the following description is made:
(1) Extraction of the texture feature map f_tex
The extraction of f_tex is shown in the upper branch of fig. 2 (hereinafter the upper branch). First, image a is input into the first convolution layer Conv_1 of the upper branch, that is, a first round of feature extraction is performed on image a to obtain the first feature map f_L1. Then, f_L1 is down-sampled by local average pooling to obtain the pooled feature map f_pool. Finally, the texture feature map f_tex is obtained from f_pool and f_L1 as in the following formula (1):

f_tex = f_L1 - f_pool    (1)

One way to down-sample f_L1 by local average pooling is as follows: determine a pooling filter whose window size is 2 × 2 and whose step size is 2, and pool f_L1 according to this filter. The idea of local average pooling is to average the several feature values covered by the window and replace them with that average, sliding the window over f_L1 by the step size. Pooling reduces the dimensionality of the feature map output by the convolution layer (i.e., dimension reduction), speeds up computation, and prevents overfitting.
Down-sampling f_L1 by local average pooling filters out the noise information (forged features) in image a, while f_L1 itself still contains the forged features; therefore the texture feature map f_tex obtained by subtracting the pooled feature map f_pool from f_L1 reflects the forged characteristics of image a.
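A minimal sketch of this texture branch under the definitions above. Because the 2 × 2, step-2 pooling halves the spatial size, the pooled map is upsampled back to the size of f_L1 before the subtraction; this resizing step is an assumption added here to make formula (1) shape-consistent:

```python
import torch.nn.functional as F

def texture_feature_map(f_l1):
    f_pool = F.avg_pool2d(f_l1, kernel_size=2, stride=2)  # local average pooling
    f_pool = F.interpolate(f_pool, size=f_l1.shape[-2:])  # assumed upsampling step
    return f_l1 - f_pool   # formula (1): residue keeps the forged texture/noise
```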
(2) Extraction of the forged-trace feature map f_reu
The extraction of f_reu is shown in the lower branch of fig. 2 (hereinafter the lower branch). First, image a is input into the first convolution layer Conv_1, that is, a first round of feature extraction is performed on image a to obtain a first feature map, which reflects the forged features in image a. This first feature map of the lower branch is hereinafter called the second feature map f_L2; it is in fact the same as the first feature map.
Because the second feature map is obtained by a single round of feature extraction on image a, it can only reflect the shallow forged features and not the deeper (more detailed) ones. To extract the forged features in image a more accurately, further feature extraction is performed on the second feature map to obtain the forged-trace feature map f_reu, as follows:
The second feature map is passed through the second convolution layer Conv_2 and the third convolution layer Conv_3 of the lower branch in sequence (denoted Conv_2, 3 in fig. 2) to obtain the third feature map f_L3; f_L3 is then passed through the fourth convolution layer Conv_4 and the fifth convolution layer Conv_5 of the lower branch in sequence (denoted Conv_4, 5 in fig. 2) to obtain the fourth feature map f_L4. In this process, the forged features extracted in f_L3 contain more detail than those in f_L2, and likewise f_L4 contains more detail than f_L3; this extraction can also be described as an enhancement of the second feature map f_L2.
Feature maps extracted by convolution layers alone lack multi-scale information and reduce the diversity of receptive fields, so they cannot describe the forged features well. In the scheme of the invention, the second feature map f_L2 and the third feature map f_L3 are therefore connected in series (the concatenation shown after f_L3 in fig. 2), and the result is passed through the fourth convolution layer Conv_4 and the fifth convolution layer Conv_5 of the lower branch for further feature extraction to obtain the fifth feature map f_L5. As an example, if f_L2 has dimension 224 × 224 × 3 and f_L3 has dimension 224 × 224 × 6, their series connection has dimension 224 × 224 × 9; that is, the purpose of the series connection is to fuse the two feature maps along the third (channel) dimension, so that f_L5 can express the forged features more deeply along that dimension.
The second feature map f_L2, the fourth feature map f_L4, and the fifth feature map f_L5 are concatenated to obtain the forged trace feature map f_reu, see equation (2):

f_reu = f_L2 ⊕ f_L4 ⊕ f_L5    (2)

where ⊕ denotes concatenation along the channel dimension.
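A hedged sketch of the lower branch under the dimensions just given; the channel counts follow the example above, the padding values are assumptions, and where the prose is ambiguous about the input to Conv_4, the sketch feeds it the third feature map f_L3, as in the extraction step described earlier.

```python
import torch
import torch.nn as nn

# Lower branch of fig. 2 (sketch): Conv_2/Conv_3 take f_L2 (3 channels) to
# f_L3 (6 channels); Conv_4/Conv_5 take f_L3 to f_L4 (12 channels).
conv2_3 = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(6, 6, kernel_size=3, padding=1), nn.ReLU(),
)
conv4_5 = nn.Sequential(
    nn.Conv2d(6, 12, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(12, 12, kernel_size=3, padding=1), nn.ReLU(),
)

def forged_trace_map(f_l2: torch.Tensor) -> torch.Tensor:
    """f_l2: second feature map, shape (N, 3, 224, 224)."""
    f_l3 = conv2_3(f_l2)                          # (N, 6, 224, 224)
    f_l4 = conv4_5(f_l3)                          # (N, 12, 224, 224)
    f_l5 = torch.cat([f_l2, f_l3], dim=1)         # (N, 9, 224, 224)
    # Equation (2): f_reu = f_L2 (+) f_L4 (+) f_L5 -> (N, 24, 224, 224)
    return torch.cat([f_l2, f_l4, f_l5], dim=1)
```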
The parameters of each of the above convolutional layers are shown in Table 1. The input image a is uniformly resized to 224 × 224 (length and width), and each convolutional layer is followed by a ReLU activation layer. All convolution kernels are of size 3 × 3. The input image has three RGB channels, so the first convolutional layer Conv_1 has 3 kernels, and the resulting first feature map f_L1 has dimension 224 × 224 × 3; subtracting the locally average-pooled feature map f_pool yields a texture feature map of dimension 224 × 224 × 3. The second feature map f_L2 has dimension 224 × 224 × 3. After the second feature map f_L2 passes through the second convolutional layer Conv_2 and the third convolutional layer Conv_3, the resulting third feature map f_L3 has dimension 224 × 224 × 6; after the fourth convolutional layer Conv_4 and the fifth convolutional layer Conv_5, the resulting fourth feature map f_L4 has dimension 224 × 224 × 12. The final forged trace feature map f_reu therefore has dimension 224 × 224 × 24.
TABLE 1 Convolutional layer settings: all kernels are 3 × 3 with one ReLU layer after each convolution; Conv_1 outputs 3 channels, Conv_2 and Conv_3 output 6 channels, and Conv_4 and Conv_5 output 12 channels.
Step b: the texture feature map f_tex obtained in step a and the forged trace feature map f_reu are concatenated to obtain the initial feature map f_fin, also called the noise feature map f_fin. The noise feature map has dimension 224 × 224 × 27; it can represent the noise information of the input image a and provides an important clue for distinguishing real from fake images.
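In tensor terms, step b is a single channel-wise concatenation; a sketch continuing the ones above:

```python
import torch

def noise_feature_map(f_tex: torch.Tensor, f_reu: torch.Tensor) -> torch.Tensor:
    """Concatenate the 3-channel texture map and the 24-channel forged
    trace map into the 27-channel noise feature map f_fin."""
    return torch.cat([f_tex, f_reu], dim=1)   # (N, 27, 224, 224)
```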
Step c: although an ideal noise feature map f_fin has been obtained through step b, it is not used directly as the feature for forged face recognition. Instead, it is further input into the hierarchical feature extraction network of the forged face recognition model to learn higher-level, i.e., deeper, features and obtain the forged feature map.
The process by which the hierarchical feature extraction network extracts features from the noise feature map f_fin to obtain the forged feature map is as follows:
Because a wide convolution (one whose input has a large channel dimension) contains redundancy, the concatenated noise feature map f_fin is not convolved directly. Instead, in the manner of a group-wise convolution, the noise feature map f_fin is separated into groups that are convolved separately; this achieves similar accuracy with less computation, and an appropriate number of groups can significantly reduce FLOPs. Specifically, referring to the network structure in fig. 3, the noise feature map f_fin is first divided evenly by channel count into three sub-feature maps (the Split operation shown in fig. 3), each of dimension 224 × 224 × 9.
Each sub-feature map is then fed into three convolution modules. The first convolution module contains a convolutional layer (kernel size 3 × 3, 48 kernels, stride 1), a 2 × 2 max pooling layer, a ReLU activation layer, and a Batch Normalization (BN) layer. The second convolution module contains a convolutional layer (kernel size 3 × 3, 48 kernels, stride 1), a 2 × 2 max pooling layer, a ReLU activation layer, and a BN layer. The third convolution module contains a convolutional layer (kernel size 1 × 1, 64 kernels, stride 1), a 2 × 2 max pooling layer, and a ReLU activation layer. The 1 × 1 kernels learn linear combinations of features at the same spatial location but in different channels, and the BN layers accelerate model training. Weights are shared between the convolutional layers corresponding to the different sub-feature maps. In addition, a small stride extracts richer features than a large one, so the stride of every convolutional layer is set to 1.
For each convolution module, the network layers within the module are shown in different gray levels in fig. 3.
Finally, the resulting outputs of the three sub-feature maps are concatenated (the concatenation operator shown in fig. 3 and its corresponding processing) to obtain the forged feature map.
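The split-and-shared-modules structure of fig. 3 can be sketched as follows. The padding of the 3 × 3 convolutions is an assumption made to keep the dimension bookkeeping simple; weight sharing across the three groups is realized by applying the same module instances to every sub-feature map.

```python
import torch
import torch.nn as nn

# Three convolution modules as described above; layer order follows the
# text (conv -> 2x2 max pool -> ReLU -> BN where a BN layer is present).
module1 = nn.Sequential(
    nn.Conv2d(9, 48, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(2), nn.ReLU(), nn.BatchNorm2d(48),
)
module2 = nn.Sequential(
    nn.Conv2d(48, 48, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(2), nn.ReLU(), nn.BatchNorm2d(48),
)
module3 = nn.Sequential(
    nn.Conv2d(48, 64, kernel_size=1, stride=1),
    nn.MaxPool2d(2), nn.ReLU(),
)

def forged_feature_map(f_fin: torch.Tensor) -> torch.Tensor:
    """f_fin: noise feature map, shape (N, 27, 224, 224)."""
    subs = torch.split(f_fin, 9, dim=1)        # Split: three 9-channel groups
    deep = [module3(module2(module1(s))) for s in subs]  # shared weights
    return torch.cat(deep, dim=1)              # (N, 192, 28, 28)
```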
Step d: obtain the prediction classification result of image a from the forged feature map.
One way to obtain the prediction classification result of image a from the forged feature map is as follows. The forged feature map is passed in sequence through a classification module in the neural network model consisting of three fully-connected layers: the first two fully-connected layers, containing 500 and 300 neurons respectively, learn the associations between deep features; the output of the last fully-connected layer contains 2 nodes, one representing the classification result that image a is a forged image and the other that image a is a real image. The two node outputs of the last fully-connected layer are mapped by a Softmax function to class probabilities in the range 0-1 (normalization), giving the prediction classification result of image a. This result can be represented by two prediction probability values: one is the probability that image a is a real image (true in fig. 3), the other is the probability that image a is a fake image (false in fig. 3), and the two values sum to 1. The Softmax function is defined in equation (3) below:
S_i = exp(y_i) / Σ_j exp(y_j)    (3)

where y_i denotes the output of node i in the last fully-connected layer, i is the category index (i = 1 represents a real face image, i = 0 a forged face image), and S_i is the normalized output of node i.
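A sketch of the classification module and the Softmax of equation (3). The flattened input size (three 2 × 2 pooling stages take 224 down to 28, with 3 × 64 = 192 channels after concatenation) and the ReLU activations between the fully-connected layers are assumptions inferred from the description above.

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28 * 192, 500), nn.ReLU(),  # first FC layer, 500 neurons
    nn.Linear(500, 300), nn.ReLU(),            # second FC layer, 300 neurons
    nn.Linear(300, 2),                         # 2 output nodes: fake / real
)

def predict(forged_map: torch.Tensor) -> torch.Tensor:
    """Return (N, 2) class probabilities that sum to 1 per row."""
    return torch.softmax(classifier(forged_map), dim=1)  # equation (3)
```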
Step 4: for each sample video, determine the loss function value corresponding to that sample video from the prediction classification result and the classification label of each frame image a in the sample video.
The neural network model uses binary cross entropy as its loss function during training; it is defined in equation (4):

L = -(1/N) Σ_{j=1}^{N} [ y_j · log(p_j) + (1 - y_j) · log(1 - p_j) ]    (4)

where L is the loss function corresponding to a sample video, j indexes the sample frames of the video, y_j is the classification label of image j, p_j is the predicted probability that image j is a real image, and N is the number of sample frames in the sample video.
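Equation (4) corresponds to the standard binary cross entropy averaged over the N frames of a sample video; a small sketch:

```python
import torch
import torch.nn.functional as F

def video_loss(p_real: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """p_real: predicted probability that each frame is real, shape (N,).
    labels: per-frame classification labels, 1 = real, 0 = forged."""
    # binary_cross_entropy averages over the N frames by default,
    # matching the 1/N factor in equation (4).
    return F.binary_cross_entropy(p_real, labels.float())
```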
Step 5: determine the loss function value (total loss value) of the neural network model based on the loss function value of each sample video.
The total loss value of the neural network model is equal to the loss function L multiplied by N and then by the total number of sample videos.
Step 6: if the total loss value satisfies the training end condition, training ends and the neural network model at that point is taken as the forged face recognition model. If the total loss value does not satisfy the training end condition, the model parameters of the neural network model are adjusted and training continues with the adjusted parameters until the total loss value satisfies the training end condition.
After the forged face recognition model has been trained, it can be used in practice to identify whether a video to be processed is a forged video. The specific recognition process comprises the following steps:
Step 1: acquire a video to be processed that includes a face image, obtain each frame of the video, detect the face feature points in each frame with a cascade classifier in OpenCV, and align the frame based on the detected feature points to obtain the face region in the frame. Because forged content is concentrated mainly in the face region, using the aligned face region greatly reduces the processing range of the algorithm and focuses forgery detection on the important region.
Step 2: to retain as many forged traces as possible while incorporating some spatial background information, each face region is expanded outward by a certain margin along its width and height and uniformly resized to 224 × 224. The image obtained in this step is denoted image I and serves as the input of the forged face recognition model.
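Steps 1 and 2 might look as follows with OpenCV's cascade classifier. The Haar cascade file, the 20% margin, and the detector parameters are illustrative assumptions, and the landmark-based alignment is omitted for brevity.

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_crops(frame, margin=0.2):
    """Detect faces in a BGR frame, expand each box outward along its
    width and height, and resize the crop to the 224x224 model input."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    crops = []
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        dx, dy = int(w * margin), int(h * margin)   # outward expansion
        x0, y0 = max(x - dx, 0), max(y - dy, 0)
        x1 = min(x + w + dx, frame.shape[1])
        y1 = min(y + h + dy, frame.shape[0])
        crops.append(cv2.resize(frame[y0:y1, x0:x1], (224, 224)))
    return crops  # each crop is an image I for the recognition model
```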
Step 3: the forged face recognition model is obtained by training a convolutional neural network (CNN); the training process is described above and is not repeated here. Each frame image I is input into the forged face recognition model, and for each frame the CNN in the model generates an enhanced feature map (face feature map) of image I, i.e., the CNN extracts the features of image I. The enhanced feature map comprises the texture feature map f_tex and the forged trace feature map (detail feature map) f_reu.
Specifically, the CNN can be divided into two branches that extract the two feature maps f_tex and f_reu respectively; the concrete network structure of the CNN is shown in fig. 2. The extraction of the texture feature map f_tex and the forged trace feature map f_reu for each frame image I follows the same principle as the extraction of f_tex and f_reu for each frame image a described above, and is not repeated here.
Step 4: the texture feature map f_tex and the forged trace feature map f_reu obtained in step 3 are concatenated to obtain the initial feature map f_fin, also called the noise feature map f_fin, of dimension 224 × 224 × 27. The noise feature map f_fin can represent the noise information of the input image I and provides an important clue for distinguishing real from fake images.
Step 5: although an ideal noise feature map f_fin has been obtained through step 4, it is not used directly as the feature for forged face recognition. Instead, it is input into the hierarchical feature extraction network of the forged face recognition model to learn higher-level, i.e., deeper, features and obtain the forged feature map.
The process by which the hierarchical feature extraction network extracts features from the noise feature map f_fin to obtain the forged feature map is described above in the training of the forged face recognition model; the implementation principle is the same and is not repeated here.
Step 6: obtain the recognition result of image I from the forged feature map, for example the probability that the frame image I is a fake image or the probability that it is a real image.
Step 7: determine the recognition result of the video to be processed (forged video or real video) from the recognition result of each frame image I (forged image or real image). For example, if the proportion of images I recognized as forged exceeds a set threshold (for example 60%), the video to be processed is recognized as a forged video.
Extracting every frame of the entire video to be processed involves a considerable amount of computation; considering computation and time costs, several consecutive frames can instead be taken from the middle of the video for identification.
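Step 7's frame-vote aggregation, with the 60% proportion above as the default, can be sketched as:

```python
def video_verdict(frame_is_fake, ratio=0.6):
    """frame_is_fake: iterable of per-frame booleans (True = forged).
    Returns 'fake video' if the forged proportion exceeds the ratio."""
    flags = list(frame_is_fake)
    fake_share = sum(flags) / len(flags)
    return "fake video" if fake_share > ratio else "real video"
```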
Based on the same principle as the method shown in fig. 1, an embodiment of the present invention further provides a counterfeit video identification apparatus 20. As shown in fig. 4, the counterfeit video identification apparatus 20 may include a video acquisition module 210, a feature map extraction module 220, a feature map concatenation module 230, and an identification module 240, wherein:
a video obtaining module 210, configured to obtain a video to be processed including a face image;
the feature map extraction module 220 is configured to extract a first feature map corresponding to each frame of face image in the video to be processed, perform feature extraction at least once on the first feature map to obtain at least one level feature map corresponding to the face image, and use the first feature map and the at least one level feature map as at least two level feature maps corresponding to the face image;
the feature map concatenation module 230 is configured to concatenate at least two feature maps in at least two levels of feature maps of each frame of the face image at least once to obtain at least one concatenated feature map;
and the identification module 240 is configured to perform forgery identification on the video to be processed according to the at least one series feature map and the at least two hierarchical feature maps corresponding to each frame of the face image, so as to obtain an identification result.
Optionally, the apparatus further comprises:
the texture feature extraction module is used for extracting a texture feature image of each frame of face image;
the identification module 240 is specifically configured to:
and performing forgery identification on the video to be processed according to at least one series characteristic image, at least two levels of characteristic images and texture characteristic images corresponding to each frame of face image to obtain an identification result.
Optionally, the identification module 240 is specifically configured to:
for each frame of face image, at least one series feature image corresponding to the face image and at least one feature image in at least two levels of feature images are connected in series to obtain a detail feature image corresponding to the face image;
for each frame of face image, connecting a texture feature image and a detail feature image corresponding to the face image in series to obtain a forged feature image corresponding to the face image;
for each frame of face image, determining a classification result corresponding to the face image according to a forged feature map corresponding to the face image, wherein the classification result comprises a real image and a forged image;
and determining the identification result of the video to be processed according to the classification result corresponding to each frame of face image.
Optionally, for each frame of face image, when the texture feature map and the detail feature map are connected in series to obtain a forged feature map, the identifying module 240 is specifically configured to:
connecting the texture feature map and the detail feature map in series to obtain an initial feature map;
dividing the initial feature map into at least two sub-feature maps;
respectively extracting the features of each sub-feature map to obtain a depth sub-feature map corresponding to each sub-feature map;
and connecting the depth sub-feature maps in series, and taking the connected feature maps as fake feature maps.
Optionally, the processing implemented by the feature map extraction module 220 and the feature map concatenation module 230, together with the processing by which the recognition module 240, for each frame of face image, concatenates at least one concatenated feature map corresponding to the face image with at least one of the feature maps of at least two levels to obtain the detail feature map corresponding to the face image, concatenates the texture feature map and the detail feature map corresponding to the face image to obtain the forged feature map corresponding to the face image, and determines the classification result corresponding to the face image from the forged feature map, is implemented by a forged face recognition model; the forged face recognition model is obtained by training a neural network model.
Optionally, the forged face recognition model includes at least two convolution layers, and the feature map extraction module 220 is specifically configured to:
extracting feature maps of at least two levels of the facial image through at least two convolution layers for each frame of facial image in a video to be processed;
the forged face recognition model further comprises a pooling layer, and the texture feature extraction module is specifically used for:
for each frame of face image, performing first-time feature extraction on the face image through at least one convolution layer of at least two convolution layers to obtain a first feature map;
for each frame of face image, processing the first feature map through a pooling layer to obtain a pooling feature map;
for each frame of face image, determining a texture feature map of the face image according to the pooling feature map and the first feature map;
The forged face recognition model further comprises at least two convolution modules, each comprising at least one convolutional layer connected in series. When performing feature extraction on each sub-feature map to obtain the corresponding depth sub-feature map, the recognition module 240 is specifically configured to:
and for each sub-feature map, sequentially inputting the sub-feature map into each convolution module of at least two convolution modules to obtain a depth sub-feature map corresponding to the sub-feature map.
The counterfeit video identification apparatus of the embodiment of the present invention can execute the counterfeit video identification method provided by the embodiments of the present invention, and their implementation principles are similar. The actions performed by each module and unit of the apparatus correspond to the steps of the counterfeit video identification method of the embodiments of the present invention; for a detailed functional description of each module, reference may be made to the description of the corresponding counterfeit video identification method above, which is not repeated here.
The counterfeit video identification device may be a computer program (including program code) running in a computer device, for example, the counterfeit video identification device is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present invention.
In some embodiments, the counterfeit video identification apparatus provided by the embodiments of the present invention may be implemented by a combination of hardware and software, and as an example, the counterfeit video identification apparatus provided by the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the counterfeit video identification method provided by the embodiments of the present invention, for example, the processor in the form of the hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
In other embodiments, the apparatus for identifying a counterfeit video according to the embodiments of the present invention may be implemented in software, and fig. 4 illustrates the apparatus for identifying a counterfeit video stored in a memory, which may be software in the form of a program, a plug-in, and the like, and includes a series of modules, including a video acquisition module 210, a feature map extraction module 220, a feature map concatenation module 230, and an identification module 240, for implementing the method for identifying a counterfeit video according to the embodiments of the present invention.
The modules described in the embodiments of the present invention may be implemented by software or hardware. Wherein the name of a module in some cases does not constitute a limitation on the module itself.
Based on the same principle as the method shown in the embodiment of the present invention, an embodiment of the present invention also provides an electronic device, which may include but is not limited to: a processor and a memory; a memory for storing a computer program; a processor for executing the method according to any of the embodiments of the present invention by calling a computer program.
In an alternative embodiment, an electronic device is provided, as shown in fig. 5, the electronic device 30 shown in fig. 5 comprising: a processor 310 and a memory 330. Wherein the processor 310 is coupled to the memory 330, such as via a bus 320. Optionally, the electronic device 30 may further include a transceiver 340, and the transceiver 340 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. It should be noted that the transceiver 340 is not limited to one in practical application, and the structure of the electronic device 30 does not limit the embodiment of the present invention.
The Processor 310 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 310 may also be a combination of computing functions, for example a combination of one or more microprocessors, or of a DSP and a microprocessor.
Bus 320 may include a path that transfers information between the above components. The bus 320 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 320 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
The Memory 330 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 330 is used for storing application program codes (computer programs) for performing aspects of the present invention and is controlled to be executed by the processor 310. The processor 310 is configured to execute application program code stored in the memory 330 to implement the aspects illustrated in the foregoing method embodiments.
The electronic device may also be a terminal device, and the electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the application scope of the embodiment of the present invention.
Embodiments of the present invention provide a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.
According to another aspect of the invention, there is also provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the methods provided in the various embodiment implementations described above.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote computer case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be understood that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer readable storage medium provided by the embodiments of the present invention may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
The foregoing description is only a preferred embodiment of the invention and an illustration of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the disclosed concept, for example technical solutions in which the above features are replaced with (but not limited to) features having similar functions disclosed in the present invention.

Claims (4)

1. A method for identifying a counterfeit video, comprising:
step 110, acquiring a video to be processed comprising a face image;
step 120, extracting a first feature map corresponding to each frame of face image in the video to be processed, performing at least one time of feature extraction on the first feature map to obtain at least one level feature map corresponding to the face image, and taking the first feature map and the at least one level feature map as at least two level feature maps corresponding to the face image;
step 130, for each frame of the facial image, at least two feature maps in at least two levels of feature maps of the facial image are connected in series at least once to obtain at least one series feature map;
Step 140, performing forgery identification on the video to be processed according to the at least one series feature map corresponding to each frame of the face image and the feature maps of the at least two levels, to obtain an identification result;
before step 140, the method further includes:
extracting a texture feature map of each frame of the face image;
the step 140 specifically includes:
for each frame of face image, at least one series feature map corresponding to the face image and at least one feature map in at least two levels of feature maps are connected in series to obtain a detail feature map corresponding to the face image;
connecting the texture feature map and the detail feature map in series to obtain an initial feature map;
dividing the initial feature map into at least two sub-feature maps;
respectively extracting the features of each sub-feature map to obtain a depth sub-feature map corresponding to each sub-feature map;
connecting the depth sub-feature maps in series to obtain a feature map after the connection in series, and taking the feature map after the connection in series as a fake feature map;
for each frame of face image, determining a classification result corresponding to the face image according to a forged feature map corresponding to the face image, wherein the classification result comprises a real image and a forged image;
determining the recognition result of the video to be processed according to the classification result corresponding to the face image of each frame;
step 120, step 130, the processing procedure of obtaining, for each frame of the face image, the forged feature map corresponding to the face image from the at least one series feature map corresponding to the face image and at least one of the feature maps of the at least two levels, and the processing procedure of determining the classification result corresponding to the face image from the forged feature map corresponding to the face image, are implemented by a forged face recognition model; the forged face recognition model is obtained by training a neural network model;
the forged face recognition model includes at least two convolution layers, and the step 120 specifically includes:
extracting feature maps of at least two levels of the facial image through the at least two convolution layers for each frame of facial image in the video to be processed;
the forged face recognition model further comprises a pooling layer, and for each frame of the face image, extracting a texture feature map of the face image, wherein the pooling layer comprises:
for each frame of the face image, performing first-time feature extraction on the face image through at least one convolution layer of the at least two convolution layers to obtain a first feature map;
for each frame of face image, processing the first feature map through the pooling layer to obtain a pooling feature map;
for each frame of face image, determining a texture feature map of the face image according to the pooling feature map and the first feature map;
the forged face recognition model further comprises at least two convolution modules, each convolution module comprises at least one convolution layer which is sequentially connected in series, feature extraction is carried out on each sub-feature graph respectively, and a depth sub-feature graph corresponding to each sub-feature graph is obtained, and the method comprises the following steps:
and for each sub-feature map, sequentially inputting the sub-feature map into each convolution module of the at least two convolution modules to obtain a depth sub-feature map corresponding to the sub-feature map.
2. An apparatus for identifying a counterfeit video, comprising:
the video acquisition module is used for acquiring a video to be processed comprising a face image;
the feature map extraction module is used for extracting a first feature map corresponding to each frame of face image in the video to be processed, performing at least one time of feature extraction on the first feature map to obtain at least one level feature map corresponding to the face image, and taking the first feature map and the at least one level feature map as at least two level feature maps corresponding to the face image;
the feature map series module is used for carrying out at least one series connection on at least two feature maps in at least two levels of feature maps of the face image for each frame of the face image to obtain at least one series feature map;
the identification module is used for carrying out counterfeiting identification on the video to be processed according to at least one series characteristic diagram corresponding to each frame of the face image and the characteristic diagrams of the at least two levels to obtain an identification result;
the device further comprises:
the texture feature extraction module is used for extracting a texture feature image of each frame of face image;
the identification module is specifically configured to:
for each frame of face image, at least one series feature map corresponding to the face image and at least one feature map in at least two levels of feature maps are connected in series to obtain a detail feature map corresponding to the face image;
connecting the texture feature map and the detail feature map in series to obtain an initial feature map;
dividing the initial feature map into at least two sub-feature maps;
respectively extracting the features of each sub-feature map to obtain a depth sub-feature map corresponding to each sub-feature map;
connecting the depth sub-feature maps in series to obtain a feature map after the connection in series, and taking the feature map after the connection in series as a fake feature map;
for each frame of face image, determining a classification result corresponding to the face image according to a forged feature map corresponding to the face image, wherein the classification result comprises a real image and a forged image;
determining the recognition result of the video to be processed according to the classification result corresponding to the face image of each frame;
the processing implemented by the feature map extraction module and the feature map series module, the processing by which the identification module obtains, for each frame of face image, the forged feature map corresponding to the face image from the at least one series feature map corresponding to the face image and at least one of the feature maps of the at least two levels, and the processing of determining the classification result corresponding to the face image from the forged feature map corresponding to the face image, are implemented by a forged face recognition model; the forged face recognition model is obtained by training a neural network model;
the forged face recognition model comprises at least two convolution layers, and the feature map extraction module is specifically used for:
extracting feature maps of at least two levels of the facial image through the at least two convolution layers for each frame of facial image in the video to be processed;
the forged face recognition model further comprises a pooling layer, and the texture feature extraction module is specifically used for:
for each frame of the face image, performing first-time feature extraction on the face image through at least one convolution layer of the at least two convolution layers to obtain a first feature map;
for each frame of face image, processing the first feature map through the pooling layer to obtain a pooling feature map;
for each frame of face image, determining a texture feature map of the face image according to the pooling feature map and the first feature map;
the forged face recognition model further comprises at least two convolution modules, each convolution module comprises at least one convolution layer which is sequentially connected in series, and the recognition module is specifically used for performing feature extraction on each sub-feature graph to obtain a depth sub-feature graph corresponding to each sub-feature graph:
and for each sub-feature map, sequentially inputting the sub-feature map into each convolution module of the at least two convolution modules to obtain a depth sub-feature map corresponding to the sub-feature map.
3. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of claim 1 when executing the computer program.
4. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method of claim 1.
CN202210057918.4A 2022-01-19 2022-01-19 Fake video identification method and device, electronic equipment and computer storage medium Active CN114092864B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210057918.4A CN114092864B (en) 2022-01-19 2022-01-19 Fake video identification method and device, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210057918.4A CN114092864B (en) 2022-01-19 2022-01-19 Fake video identification method and device, electronic equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN114092864A CN114092864A (en) 2022-02-25
CN114092864B true CN114092864B (en) 2022-04-12

Family

ID=80308587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210057918.4A Active CN114092864B (en) 2022-01-19 2022-01-19 Fake video identification method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN114092864B (en)


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679158B (en) * 2013-12-31 2017-06-16 北京天诚盛业科技有限公司 Face authentication method and device
CN105844245A (en) * 2016-03-25 2016-08-10 广州市浩云安防科技股份有限公司 Fake face detecting method and system for realizing same
GB201710560D0 (en) * 2017-06-30 2017-08-16 Norwegian Univ Of Science And Tech (Ntnu) Detection of manipulated images
CN108304801B (en) * 2018-01-30 2021-10-08 亿慧云智能科技(深圳)股份有限公司 Anti-cheating face recognition method, storage medium and face recognition device
CN110633677B (en) * 2019-09-18 2023-05-26 威盛电子股份有限公司 Face recognition method and device
CN111160216B (en) * 2019-12-25 2023-05-12 开放智能机器(上海)有限公司 Living body face recognition method with multiple characteristics and multiple models
CN112686191B (en) * 2021-01-06 2024-05-03 中科海微(北京)科技有限公司 Living body anti-counterfeiting method, system, terminal and medium based on three-dimensional information of human face
CN112528969B (en) * 2021-02-07 2021-06-08 中国人民解放军国防科技大学 Face image authenticity detection method and system, computer equipment and storage medium
CN112990166B (en) * 2021-05-19 2021-08-24 北京远鉴信息技术有限公司 Face authenticity identification method and device and electronic equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372615A (en) * 2016-09-19 2017-02-01 厦门中控生物识别信息技术有限公司 Face anti-counterfeiting identification method and apparatus
CN108171182A (en) * 2017-12-29 2018-06-15 广东欧珀移动通信有限公司 Electronic device, face identification method and Related product
KR20210049570A (en) * 2019-10-25 2021-05-06 건국대학교 산학협력단 Apparatus and method for detecting forgery or alteration of the face

Also Published As

Publication number Publication date
CN114092864A (en) 2022-02-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant