CN113298747A - Picture and video detection method and device - Google Patents

Picture and video detection method and device

Info

Publication number
CN113298747A
Authority
CN
China
Prior art keywords
picture
face
area
credibility
local feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010102318.6A
Other languages
Chinese (zh)
Inventor
赵汉青
赵苗苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202010102318.6A
Publication of CN113298747A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/0002 - Inspection of images, e.g. flaw detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G06V40/171 - Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 - Spoof detection, e.g. liveness detection
    • G06V40/45 - Detection of the body part being alive
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10004 - Still image; Photographic image
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person
    • G06T2207/30201 - Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a picture and video detection method and device, relating to the field of computer technology. The picture detection method comprises the following steps: inputting a picture into a detection network model to detect the face region in the picture and the local feature regions within the face; respectively inputting the face region and the local feature regions into a discrimination network model to obtain a credibility prediction value of the face region and credibility prediction values of the local feature regions; and identifying whether the picture is a synthesized picture according to the credibility prediction value of the face region and the credibility prediction values of the local feature regions. Through these steps, the identification accuracy of synthesized pictures can be improved.

Description

Picture and video detection method and device
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for detecting pictures and videos.
Background
With the development of AI face-swapping technology, ordinary users can easily synthesize face pictures and videos in batches. As a result, the credibility of face pictures and videos has been greatly reduced. In processes such as face-scanning payment and photo submission, accurately detecting whether a collected picture is a synthesized face picture or video is key to preventing users from exploiting false information for fraudulent gain.
In the prior art, there are mainly the following two composite picture detection algorithms: a composite picture detection algorithm based on dual-stream Fast R-CNN, and a composite picture detection algorithm based on blink recognition.
In the process of implementing the invention, the inventors found at least the following problems in the prior art. The first algorithm is mainly aimed at pictures synthesized by traditional means: it can effectively identify traces such as smearing and splicing left by tools such as PS (Photoshop), and thereby judge whether a picture is artificially synthesized. However, in the deep-learning-based composite picture generation algorithms that have emerged in recent years, composite pictures are obtained through the autonomous learning and training of generative adversarial networks and show no obvious local differences, so this algorithm has difficulty detecting such composite pictures accurately. In addition, over the update iterations of AI face-swapping technologies such as DeepFake, the problem that synthesized videos could not blink has gradually been solved, and the latest synthesis results can simulate a realistic blink frequency, so the second algorithm can no longer reliably detect composite videos by blink frequency.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for detecting a picture and a video, which can improve the accuracy of identifying a composite picture and a composite video.
To achieve the above object, according to a first aspect of the present invention, a picture detection method is provided.
The picture detection method comprises the following steps: inputting a picture into a detection network model to detect a face area in the picture and a local feature area in the face; respectively inputting the face area in the picture and the local feature area in the face into a discrimination network model to obtain a credibility predicted value of the face area and a credibility predicted value of the local feature area in the face; and identifying whether the picture is a synthesized picture or not according to the credibility predicted value of the face area and the credibility predicted value of the local feature area.
Optionally, the identifying whether the picture is a composite picture according to the reliability prediction value of the face region and the reliability prediction value of the local feature region includes: determining a credibility predicted value of the picture according to the credibility predicted value of the face area and the credibility predicted value of the local feature area; comparing the credibility prediction value of the picture with a preset threshold value; determining that the picture is not a synthesized picture under the condition that the credibility prediction value of the picture is greater than a preset threshold value; and under the condition that the credibility prediction value of the picture is less than or equal to a preset threshold value, confirming that the picture is a synthesized picture.
Optionally, the determining, according to the reliability prediction value of the face region and the reliability prediction value of the local feature region, the reliability prediction value of the picture includes: mapping to obtain a space vector point corresponding to the picture according to the credibility predicted value of the face area and the credibility predicted value of the local feature area; and calculating a weighted Euclidean distance between the space vector point and the origin of the coordinate system, and taking the weighted Euclidean distance as a credibility prediction value of the picture.
Optionally, the local feature region in the face comprises: eyebrow area, eye area, nose area, and mouth area.
To achieve the above object, according to a second aspect of the present invention, a video detection method is provided.
The video detection method of the invention comprises the following steps: inputting a plurality of frames of pictures in a video into a detection network model to detect a face area and a local feature area in the face of each frame of picture in the plurality of frames of pictures; respectively inputting the face area in each frame of picture and the local feature area in the face into a discrimination network model to obtain a credibility prediction value of the face area in the frame of picture and a credibility prediction value of the local feature area in the face; and identifying whether the video is a composite video or not according to the credibility predicted value of the face area and the credibility predicted value of the local characteristic area in each frame of picture.
Optionally, the identifying whether the video is a composite video according to the reliability prediction value of the face region and the reliability prediction value of the local feature region in each frame of picture comprises: mapping to obtain a space vector point corresponding to each frame of picture according to the credibility predicted value of the face area and the credibility predicted value of the local characteristic area in each frame of picture; determining the dispersion of the space vector points corresponding to the multi-frame pictures; confirming that the video is a composite video under the condition that the dispersion is larger than a preset threshold value; and confirming that the video is not a composite video when the dispersion is less than or equal to a preset threshold value.
Optionally, the determining the dispersion of the space vector points corresponding to the multi-frame picture includes: taking each space vector point as a central point of a cluster, calculating the sum of distances from other space vector points to the space vector point, and taking the sum of distances as cluster dispersion with the space vector point as the central point; and taking the minimum value in the clustering dispersion taking each space vector point as a central point as the dispersion of the space vector points corresponding to the multi-frame pictures.
Optionally, the local feature region in the face comprises: eyebrow area, eye area, nose area, and mouth area.
To achieve the above object, according to a third aspect of the present invention, there is provided a picture detection apparatus.
The picture detection device of the present invention includes: the detection module is used for inputting a picture into a detection network model so as to detect a face area in the picture and a local feature area in the face; the judging module is used for respectively inputting the face area in the picture and the local feature area in the face into a judging network model so as to obtain a credibility predicted value of the face area and a credibility predicted value of the local feature area in the face; and the picture identification module is used for identifying whether the picture is a synthesized picture according to the credibility predicted value of the face area and the credibility predicted value of the local characteristic area.
To achieve the above object, according to a fourth aspect of the present invention, there is provided a video detection apparatus.
The video detection device of the present invention includes: the detection module is used for inputting a plurality of frames of pictures in a video into a detection network model so as to detect a face area and a local characteristic area in the face of each frame of picture in the plurality of frames of pictures; the judging module is used for respectively inputting the face area in each frame of picture and the local feature area in the face into a judging network model so as to obtain a credibility predicted value of the face area in the frame of picture and a credibility predicted value of the local feature area in the face; and the video identification module is used for identifying whether the video is a synthesized video or not according to the credibility predicted value of the face area and the credibility predicted value of the local characteristic area in each frame of picture.
To achieve the above object, according to a fifth aspect of the present invention, there is provided an electronic apparatus.
The electronic device of the present invention includes: one or more processors; and a storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the picture detection method of the present invention or the video detection method of the present invention.
To achieve the above object, according to a sixth aspect of the present invention, there is provided a computer-readable medium.
The computer-readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements the picture detection method of the present invention or the video detection method of the present invention.
One embodiment of the above invention has the following advantage or benefit: a picture is input into a detection network model to detect the face region in the picture and the local feature regions within the face; the face region and the local feature regions are then respectively input into a discrimination network model to obtain a credibility prediction value of the face region and credibility prediction values of the local feature regions; and whether the picture is a synthesized picture is identified according to these credibility prediction values. In this way, the identification accuracy of synthesized pictures can be improved.
Further effects of the above optional implementations will be described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic main flowchart of a picture detection method according to a first embodiment of the present invention;
FIG. 2 is a schematic main flowchart of a picture detection method according to a second embodiment of the present invention;
FIG. 3 is a schematic main flow chart of a video detection method according to a third embodiment of the present invention;
FIG. 4 is a schematic main flow chart of a video detection method according to a fourth embodiment of the present invention;
FIG. 5 is a schematic diagram of main blocks of a picture detection apparatus according to a fifth embodiment of the present invention;
FIG. 6 is a schematic diagram of the main blocks of a video detection apparatus according to a sixth embodiment of the present invention;
FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 8 is a schematic block diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention provides a new picture and video detection scheme: by introducing both local facial features and global facial features and fusing them through multi-module weighting, the discrimination capability of the algorithm is enhanced, the identification accuracy of synthesized face images is improved, and the problem of low identification accuracy for composite images in the prior art is solved. The picture and video detection scheme provided by the invention can be applied to application scenarios such as real-name authentication, face-scanning payment and identity document uploading in an e-commerce system, and to any other application scenario in which facial images need to be examined.
Fig. 1 is a schematic main flow chart of a picture detection method according to a first embodiment of the present invention. As shown in fig. 1, the picture detection method according to the embodiment of the present invention includes:
step S101, inputting a picture into a detection network model to detect a face area in the picture and a local feature area in the face.
Illustratively, the detection network model may be a pre-trained MTCNN (Multi-Task Cascaded Convolutional Neural Network) model. The MTCNN model jointly performs face region detection and local facial feature detection, with a network structure containing three 3 × 3 convolutional layers and a 2× pooling layer. By inputting a picture into the pre-trained MTCNN model, the face region and the local feature regions within the face can be segmented from the picture. Wherein the local feature regions in the face may comprise at least one of: eyebrow area, eye area, nose area, and mouth area.
In addition, the detection network model can also be a facial and facial key feature point detection network model such as a FaceNet model, a YOLO v3 model, a PCN model, a RetinaNet model and the like. In specific implementation, the detection network model can be flexibly selected according to the resolution of different pictures and the computing capacity of the equipment.
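By way of illustration, the region extraction of step S101 might be sketched as follows. This is a minimal sketch under stated assumptions, not part of the original disclosure: `detect_face_and_landmarks` is a hypothetical stand-in for a pre-trained detector such as MTCNN, and the patch-size heuristic is an assumption.

```python
# Hypothetical sketch of step S101: segment the face region and the four
# local feature regions from a picture. `detect_face_and_landmarks` stands
# in for a pre-trained detector (e.g. MTCNN) and is not a real library call.
import numpy as np

def crop(img, cx, cy, half):
    """Crop a square patch centered on (cx, cy), clipped to the image bounds."""
    h, w = img.shape[:2]
    x0, x1 = max(0, int(cx - half)), min(w, int(cx + half))
    y0, y1 = max(0, int(cy - half)), min(h, int(cy + half))
    return img[y0:y1, x0:x1]

def extract_regions(img, detect_face_and_landmarks):
    # The detector is assumed to return a face box (x0, y0, x1, y1) plus a
    # landmark point per facial part, mirroring the output of step S101.
    box, landmarks = detect_face_and_landmarks(img)
    x0, y0, x1, y1 = box
    face = img[y0:y1, x0:x1]
    half = (x1 - x0) // 8  # assumed patch size, relative to face width
    regions = {name: crop(img, *landmarks[name], half)
               for name in ("eyebrow", "eye", "nose", "mouth")}
    return face, regions
```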
And step S102, respectively inputting the face area in the picture and the local feature area in the face into a discrimination network model to obtain a credibility predicted value of the face area and a credibility predicted value of the local feature area in the face.
When reliability prediction is performed on the face region in the picture and the local feature regions within the face, discrimination network models of different structures can be adopted. For example, when predicting the credibility of the face region in the picture, a VGG16 network model can be adopted; when predicting the credibility of a local feature region within the face, a network model consisting of three convolutional layers and a fully-connected layer can be used. In addition, when predicting the credibility of the face region in the picture, network models of different depths such as AlexNet, GoogLeNet and ResNet can also be adopted. Moreover, without affecting the implementation of the invention, discrimination network models of the same structure can be adopted when predicting the credibility of the face region in the picture and of the local feature regions within the face.
In an optional example, the face region in the picture is input into a trained VGG16 network model to obtain the truth-degree prediction value p0 of the face region; the four local feature regions where the eyebrows, eyes, nose and mouth are located are respectively input into a trained network model consisting of three convolutional layers and a fully-connected layer, to obtain the truth-degree prediction value p1 of the eyebrow region, the truth-degree prediction value p2 of the eye region, the truth-degree prediction value p3 of the nose region and the truth-degree prediction value p4 of the mouth region. Wherein the truth-degree prediction value p0 of the face region is the result of judging the consistency features (such as illumination, latitude, saturation and smoothness) of the face-region image: the higher p0 is, the more credible the basic attributes of the face-region image are. The values p1 to p4 are the results of judging the appearance features of the local images such as the eyebrows and eyes: the higher p1 to p4 are, the more credible the local features of the face are.
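By way of illustration, the small-scale discriminator described above (three convolutional layers followed by a fully-connected layer emitting one credibility score) might be sketched in PyTorch as follows. The channel widths and the 64 × 64 input size are assumptions; the disclosure does not specify them.

```python
# Minimal PyTorch sketch of a "small-scale" discriminator: three convolutional
# layers plus one fully-connected layer producing a credibility score in [0, 1].
# Channel widths and the 64x64 input resolution are assumptions.
import torch
import torch.nn as nn

class SmallScaleDiscriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 8 * 8, 1)  # 64x64 input -> 8x8 feature map

    def forward(self, x):
        x = self.features(x).flatten(1)
        return torch.sigmoid(self.fc(x))  # truth-degree prediction value
```

One such network per facial part (eyebrow, eye, nose, mouth) would yield the values p1 through p4, while the large-scale VGG16 discriminator yields p0.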
And step S103, identifying whether the picture is a synthesized picture or not according to the credibility predicted value of the face area and the credibility predicted value of the local feature area.
Further, step S103 may specifically include: determining a credibility predicted value of the picture according to the credibility predicted value of the face area and the credibility predicted value of the local feature area; comparing the credibility prediction value of the picture with a preset threshold value; determining that the picture is not a synthesized picture under the condition that the credibility prediction value of the picture is greater than a preset threshold value; and under the condition that the credibility prediction value of the picture is less than or equal to a preset threshold value, confirming that the picture is a synthesized picture. Under the condition of not influencing the implementation of the invention, the preset threshold value can be flexibly set according to the specific application scene.
In the embodiment of the invention, by detecting the face region in the picture and the local feature regions within the face, respectively inputting them into the discrimination network models to obtain the credibility prediction value of the face region and of each local feature region, and identifying whether the picture is a synthesized picture according to these credibility prediction values, the identification accuracy of synthesized pictures can be improved, solving the problem of low detection accuracy for face-synthesized images in the prior art.
Fig. 2 is a schematic main flow chart of a picture detection method according to a second embodiment of the invention. As shown in fig. 2, the picture detection method according to the embodiment of the present invention includes:
step S201, inputting a picture into a detection network model to detect a face area in the picture and a local feature area in the face.
Illustratively, the detection network model may be a pre-trained MTCNN (Multi-Task Cascaded Convolutional Neural Network) model. The MTCNN model jointly performs face region detection and local facial feature detection, with a network structure containing three 3 × 3 convolutional layers and a 2× pooling layer. By inputting a picture into the pre-trained MTCNN model, the face region and the local feature regions within the face can be segmented from the picture. In addition, the detection network model can also be a facial and facial key feature point detection network model such as a FaceNet model, a YOLO v3 model, a PCN model, a RetinaNet model and the like. In specific implementation, the detection network model can be flexibly selected according to the resolution of different pictures and the computing capacity of the equipment.
In the embodiment of the present invention, considering that most face-swapping algorithms do not involve synthesis of the ears, and that the ears are often occluded in actual pictures, the local feature regions in the face include: the eyebrow area, the eye area, the nose area, and the mouth area.
Step S202, inputting the face area in the picture and the local feature area in the face into a discrimination network model respectively to obtain a credibility prediction value of the face area and a credibility prediction value of the local feature area in the face.
The discriminant network model for processing the face region in the picture may be referred to as a "large-scale discriminant network" for short, and the discriminant network model for processing the local feature region in the face may be referred to as a "small-scale discriminant network" for short.
In an alternative example, the large-scale discriminator network may employ a trained VGG16 network model, and the small-scale discriminator networks may adopt trained network models consisting of three convolutional layers and a fully-connected layer. Further, in this optional example, the face region in the picture is input into the trained VGG16 network model to obtain the truth-degree prediction value p0 of the face region; the four local feature regions where the eyebrows, eyes, nose and mouth are located are respectively input into trained network models consisting of three convolutional layers and a fully-connected layer, to obtain the truth-degree prediction value p1 of the eyebrow region, the truth-degree prediction value p2 of the eye region, the truth-degree prediction value p3 of the nose region and the truth-degree prediction value p4 of the mouth region. Wherein the truth-degree prediction value p0 of the face region is the result of judging the consistency features (such as illumination, latitude, saturation and smoothness) of the face-region image: the higher p0 is, the more credible the basic attributes of the face-region image are. The values p1 to p4 are the results of judging the appearance features of the local images such as the eyebrows and eyes: the higher p1 to p4 are, the more credible the local features of the face are.
The training process of the small scale discriminator network is explained below. The training process of the small-scale discriminator network mainly comprises steps 1 to 4.
Step 1, constructing a training sample set. Illustratively, 50 real pictures and 50 composite pictures may be included in each batch of the training sample set. Wherein the training sample set can be divided into four categories: a first training sample set which is composed of pictures of the area where the eyebrows are located; a second training sample set consisting of pictures of the region where the eyes are located; a third training sample set, which is composed of pictures of the region where the nose is located; and a fourth training sample set which is composed of pictures of the area where the mouth is located.
Step 2, respectively inputting each batch of training samples into the corresponding small-scale discriminator network, and training each small-scale discriminator network separately to obtain prediction results. For example, the first training sample set is input into a first small-scale discriminator network, the second training sample set into a second small-scale discriminator network, the third training sample set into a third small-scale discriminator network, and the fourth training sample set into a fourth small-scale discriminator network. During training, the small-scale discriminator networks do not share parameters and are trained separately, so as to ensure that each small-scale discriminator network learns to discriminate the independent features of its facial part.
Step 3, comparing the obtained prediction results with the ground-truth labels, and calculating the loss function.
Step 4, back-propagating the loss function by stochastic gradient descent to optimize the network parameters of the small-scale discriminators; then inputting the next batch of training samples for training, until the loss function converges.
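By way of illustration, steps 1 to 4 might be sketched as the following training loop for one small-scale discriminator. The stochastic gradient descent and back-propagation follow the text; the binary cross-entropy loss, learning rate and batch format are assumptions.

```python
# Hypothetical sketch of steps 2-4 for one small-scale discriminator. Each of
# the four networks is trained separately on its own region (no shared
# parameters). `batches` yields (patches, labels); labels: 1 = real, 0 = synthetic.
import torch
import torch.nn as nn

def train_discriminator(model, batches, epochs=10, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # stochastic gradient descent
    loss_fn = nn.BCELoss()  # compare predictions with ground-truth labels (step 3)
    for _ in range(epochs):  # in practice, iterate until the loss converges
        for patches, labels in batches:
            pred = model(patches).squeeze(1)  # step 2: prediction results
            loss = loss_fn(pred, labels.float())
            opt.zero_grad()
            loss.backward()  # step 4: back-propagate the loss function
            opt.step()       # step 4: update the network parameters
    return model
```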
And step S203, mapping to obtain a space vector point corresponding to the picture according to the credibility predicted value of the face area and the credibility predicted value of the local feature area.
Illustratively, assume that the credibility prediction value of the face region is p0, the truth-degree prediction value of the eyebrow region is p1, that of the eye region is p2, that of the nose region is p3, and that of the mouth region is p4. These five values can then be used as five-dimensional coordinates to obtain the space vector point P = (p0, p1, p2, p3, p4) corresponding to the picture.
And S204, calculating a weighted Euclidean distance between the space vector point and the origin of the coordinate system, and taking the weighted Euclidean distance as a credibility prediction value of the picture.
In this step, assume that the space vector point corresponding to the picture is P = (p0, p1, p2, p3, p4). The weighted Euclidean distance between the space vector point and the origin of the coordinate system can then be calculated according to the following formula:

$$d = \sqrt{\sum_{i=0}^{4} \alpha_i p_i^2}$$

wherein $\alpha_i$ (i = 0, ..., 4) represents a weight coefficient, and d represents the weighted Euclidean distance between the space vector point and the origin of the coordinate system, i.e., the credibility prediction value of the picture.
It should be noted that, in other embodiments of the present invention, the credibility prediction value of the picture may also be calculated in other manners, for example as a weighted average:

$$D = \frac{\sum_{i=0}^{4} \alpha_i p_i}{\sum_{i=0}^{4} \alpha_i}$$

wherein $\alpha_i$ (i = 0, ..., 4) represents a weight coefficient, and D represents the credibility prediction value of the picture. In a specific implementation, different weight coefficients may be assigned to the parameters according to the application scenario. For example, considering that the global features of the face and the eye and mouth regions typically vary to a greater degree than the eyebrow and nose regions, $\alpha_0$, $\alpha_2$ and $\alpha_4$ can be set slightly larger; for example, $\alpha_0$ through $\alpha_4$ can be set to 3, 1, 3, 1 and 2, respectively.
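By way of illustration, steps S203 to S207 might be computed as follows, using the example weights given above; the input scores and the threshold value are assumptions chosen for the example.

```python
# Hypothetical worked example: map the five prediction values to a space
# vector point, take its weighted Euclidean distance to the origin as the
# picture's credibility, and compare against an assumed threshold.
import math

def picture_credibility(p, alpha=(3, 1, 3, 1, 2)):
    # weighted Euclidean distance from P = (p0, ..., p4) to the origin
    return math.sqrt(sum(a * pi ** 2 for a, pi in zip(alpha, p)))

p = (0.9, 0.8, 0.7, 0.9, 0.6)  # assumed (face, eyebrow, eye, nose, mouth) scores
d = picture_credibility(p)     # ~2.46 for these scores
threshold = 2.0                # assumed; set per application scenario / risk level
print("genuine" if d > threshold else "synthesized")
```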
And step S205, judging whether the reliability prediction value of the picture is greater than a preset threshold value.
The preset threshold value can be flexibly set according to a specific application scene. For example, different preset thresholds may be set according to different risk levels. Executing step S206 when the reliability prediction value of the picture is greater than a preset threshold; if the reliability prediction value of the picture is less than or equal to the preset threshold, step S207 is executed.
Step S206, determining that the picture is not a composite picture.
And step S207, determining that the picture is a composite picture.
In the embodiment of the invention, through steps such as segmenting the face region and the local feature regions from the picture, determining the credibility prediction value of the face region and of each local feature region through the discrimination network models, mapping these values to a space vector point, and taking the weighted Euclidean distance between the space vector point and the coordinate origin as the credibility prediction value of the picture, the identification accuracy of synthesized pictures can be improved, solving the problem of low detection accuracy for face-synthesized pictures in the prior art.
Fig. 3 is a schematic main flow chart of a video detection method according to a third embodiment of the present invention. As shown in fig. 3, the video detection method according to the embodiment of the present invention includes:
step S301, inputting a plurality of frames of pictures in a video into a detection network model to detect a face area and a local feature area in the face of each frame of picture in the plurality of frames of pictures.
Illustratively, the detection network model may be a pre-trained MTCNN (Multi-Task Cascaded Convolutional Neural Network) model. The MTCNN model jointly performs face region detection and local facial feature detection, with a network structure containing three 3 × 3 convolutional layers and a 2× pooling layer. By inputting the multiple frames of pictures in the video into the pre-trained MTCNN model, the face region and the local feature regions within the face can be segmented from each frame. Wherein the local feature regions in the face may comprise at least one of: eyebrow area, eye area, nose area, and mouth area.
In addition, the detection network model can also be a facial and facial key feature point detection network model such as a FaceNet model, a YOLO v3 model, a PCN model, a RetinaNet model and the like. In specific implementation, the detection network model can be flexibly selected according to the resolution of different pictures and the computing capacity of the equipment.
Step S302, inputting the face region in each frame of picture and the local feature region in the face into a discrimination network model respectively, so as to obtain a reliability prediction value of the face region in the frame of picture and a reliability prediction value of the local feature region in the face.
When reliability prediction is performed on the face region in a frame of picture and the local feature regions within the face, discrimination network models of different structures can be adopted. For example, when predicting the credibility of the face region in a frame of picture, a VGG16 network model can be adopted; when predicting the credibility of a local feature region of the face in that frame, a network model consisting of three convolutional layers and a fully-connected layer can be used. In addition, when predicting the credibility of the face region in a frame of picture, network models of different depths such as AlexNet, GoogLeNet and ResNet can also be adopted. Moreover, without affecting the implementation of the invention, discrimination network models of the same structure can be adopted when predicting the credibility of the face region in a frame of picture and of the local feature regions within the face.
In an optional example, the face region in one frame of picture is input into a trained VGG16 network model to obtain the truth-degree prediction value p0 of the face region; the four local feature regions of that frame where the eyebrows, eyes, nose and mouth are located are respectively input into a trained network model consisting of three convolutional layers and a fully-connected layer, to obtain the truth-degree prediction value p1 of the eyebrow region, the truth-degree prediction value p2 of the eye region, the truth-degree prediction value p3 of the nose region and the truth-degree prediction value p4 of the mouth region. Wherein the truth-degree prediction value p0 of the face region is the result of judging the consistency features (such as illumination, latitude, saturation and smoothness) of the face-region image: the higher p0 is, the more credible the basic attributes of the face-region image are. The values p1 to p4 are the results of judging the appearance features of the local images such as the eyebrows and eyes: the higher p1 to p4 are, the more credible the local features of the face are.
Step S303, identifying whether the video is a composite video according to the credibility predicted value of the face area and the credibility predicted value of the local feature area in each frame of picture.
In the embodiment of the invention, when detecting a composite video, not only the global features of the face across the multiple frames but also its local features are considered. The face region and the local feature regions in each frame are respectively input into the discrimination network models to obtain the credibility prediction value of the face region and of each local feature region in that frame, and composite-video detection is then performed according to these prediction values. In this way, the identification accuracy of composite videos can be improved, solving the problem of low detection accuracy for face-synthesized videos in the prior art.
Fig. 4 is a schematic main flow chart of a video detection method according to a fourth embodiment of the present invention. As shown in fig. 4, the video detection method according to the embodiment of the present invention includes:
step S401, inputting multiple frames of pictures in the video into a detection network model to detect a face region and a local feature region in the face of each frame of picture in the multiple frames of pictures.
Illustratively, the detection network model may be a pre-trained MTCNN (Multi-Task Cascaded Convolutional Neural Network) model. The MTCNN model jointly performs face region detection and local facial feature detection, with a network structure containing three 3 × 3 convolutional layers and a 2× pooling layer. By inputting the multiple frames of pictures in the video into the pre-trained MTCNN model, the face region and the local feature regions within the face can be segmented from each frame. In addition, the detection network model can also be a facial and facial key feature point detection network model such as a FaceNet model, a YOLO v3 model, a PCN model, a RetinaNet model and the like. In specific implementation, the detection network model can be flexibly selected according to the resolution of different pictures and the computing capacity of the equipment.
In the embodiment of the present invention, considering that most face-swapping algorithms do not involve synthesis of the ears, and that the ears are often occluded in actual pictures, the local feature regions in the face include: the eyebrow area, the eye area, the nose area, and the mouth area.
Step S402, inputting the face region and the local feature region in the face in each frame of picture into a discrimination network model, respectively, to obtain a reliability prediction value of the face region in the frame of picture and a reliability prediction value of the local feature region in the face.
The discriminant network model for processing the face region in one frame of picture may be referred to as a "large-scale discriminant network" for short, and the discriminant network model for processing the local feature region of the face in one frame of picture may be referred to as a "small-scale discriminant network" for short.
In an alternative example, the large-scale discriminator network may employ a trained VGG16 network model, and the small-scale discriminator networks may adopt trained network models consisting of three convolutional layers and a fully-connected layer. Further, in this optional example, the face region in one frame of picture is input into the trained VGG16 network model to obtain the truth-degree prediction value p0 of the face region; the four local feature regions of that frame where the eyebrows, eyes, nose and mouth are located are respectively input into trained network models consisting of three convolutional layers and a fully-connected layer, to obtain the truth-degree prediction value p1 of the eyebrow region, the truth-degree prediction value p2 of the eye region, the truth-degree prediction value p3 of the nose region and the truth-degree prediction value p4 of the mouth region. Wherein the truth-degree prediction value p0 of the face region is the result of judging the consistency features (such as illumination, latitude, saturation and smoothness) of the face-region image: the higher p0 is, the more credible the basic attributes of the face-region image are. The values p1 to p4 are the results of judging the appearance features of the local images such as the eyebrows and eyes: the higher p1 to p4 are, the more credible the local features of the face are.
Step S403, mapping to obtain a space vector point corresponding to each frame of picture according to the reliability prediction value of the face region and the reliability prediction value of the local feature region in each frame of picture.
Illustratively, assume that the credibility prediction value of the face region in one frame of picture is p0, the truth-degree prediction value of the eyebrow region is p1, that of the eye region is p2, that of the nose region is p3, and that of the mouth region is p4. These five values can then be used as five-dimensional coordinates to obtain the space vector point P = (p0, p1, p2, p3, p4) corresponding to that frame.
And S404, determining the dispersion of the space vector points corresponding to the multi-frame pictures.
In an alternative example, step S404 may specifically include: taking each space vector point as a central point of a cluster, calculating the sum of distances from other space vector points to the space vector point, and taking the sum of distances as cluster dispersion with the space vector point as the central point; and taking the minimum value in the clustering dispersion taking each space vector point as a central point as the dispersion of the space vector points corresponding to the multi-frame pictures.
For example, assuming that three frames of pictures exist in the video to be detected, and the space vector points corresponding to the three frames are represented as P1, P2 and P3, the sum V1 of the distances from P2 and P3 to P1 can be calculated by taking P1 as the cluster center point; the sum V2 of the distances from P1 and P3 to P2 by taking P2 as the cluster center point; and the sum V3 of the distances from P1 and P2 to P3 by taking P3 as the cluster center point. The minimum of V1, V2 and V3 is then taken as the dispersion of the space vector points corresponding to the three frames. When calculating the distance between space vector points, the Mahalanobis distance formula or another distance formula may be used. In addition, without affecting the implementation of the invention, besides the method in this optional example, a hierarchical clustering algorithm, a graph-cut algorithm or the like may be used to perform cluster analysis and dispersion calculation on the space vector points corresponding to the multiple frames.
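By way of illustration, the dispersion calculation of this optional example might be sketched as follows; the plain Euclidean distance is used between space vector points, though, as noted above, a Mahalanobis or other distance formula may equally be adopted.

```python
# Hypothetical sketch of step S404: try each frame's space vector point as a
# cluster center and take the minimum summed distance to all other points as
# the dispersion of the video's frames.
import numpy as np

def dispersion(points):
    pts = np.asarray(points, dtype=float)        # shape: (num_frames, 5)
    diffs = pts[:, None, :] - pts[None, :, :]    # pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)       # pairwise Euclidean distances
    return dists.sum(axis=1).min()               # min over candidate centers

frames = [(0.90, 0.80, 0.70, 0.90, 0.60),
          (0.88, 0.82, 0.69, 0.91, 0.58),
          (0.50, 0.40, 0.30, 0.60, 0.20)]        # one inconsistent frame
print(dispersion(frames))  # large dispersion suggests a composite video
```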
Step S405, judging whether the dispersion is larger than a preset threshold value.
Wherein, the lower the value of the dispersion is, the higher the video credibility is; the higher the value of the dispersion, the lower the confidence of the video. The preset threshold value can be flexibly set according to a specific application scene. For example, different preset thresholds may be set according to different risk levels. Executing step S406 if the dispersion of the space vector points corresponding to the multi-frame picture is greater than a preset threshold; if the dispersion of the space vector points corresponding to the multi-frame picture is less than or equal to the preset threshold, step S407 is executed.
Step S406, determining that the video is a composite video.
Step S407, determining that the video is not a composite video.
In the embodiment of the invention, through steps such as segmenting the face region and the local feature regions from the multiple frames of the video, determining the credibility prediction value of the face region and of each local feature region in every frame through the discrimination network models, mapping each frame to its corresponding space vector point, and calculating the dispersion of the space vector points of the multiple frames, the identification accuracy of composite videos can be improved, solving the problem of low detection accuracy for face-synthesized videos in the prior art.
FIG. 5 is a schematic diagram of main blocks of a picture detection apparatus according to a fifth embodiment of the present invention. As shown in fig. 5, the picture detection apparatus 500 according to the embodiment of the present invention includes: the device comprises a detection module 501, a judgment module 502 and a picture identification module 503.
A detecting module 501, configured to input a picture into a detection network model to detect a face region in the picture and a local feature region in the face.
Illustratively, the detection network model may be a pre-trained MTCNN (Multi-Task Cascaded Convolutional Neural Network) model. The MTCNN model jointly performs face region detection and local facial feature detection, with a network structure containing three 3 × 3 convolutional layers and a 2× pooling layer. By inputting a picture into the pre-trained MTCNN model, the face region and the local feature regions within the face can be segmented from the picture. Wherein the local feature regions in the face may comprise at least one of: eyebrow area, eye area, nose area, and mouth area.
In addition, the detection network model can also be a facial and facial key feature point detection network model such as a FaceNet model, a YOLO v3 model, a PCN model, a RetinaNet model and the like. In specific implementation, the detection network model can be flexibly selected according to the resolution of different pictures and the computing capacity of the equipment.
The determining module 502 is configured to input the face region in the picture and the local feature region in the face into a determining network model respectively, so as to obtain a reliability prediction value of the face region and a reliability prediction value of the local feature region in the face.
When reliability prediction is performed on the face region in the picture and the local feature regions within the face, discrimination network models of different structures can be adopted. For example, when predicting the credibility of the face region in the picture, a VGG16 network model can be adopted; when predicting the credibility of a local feature region within the face, a network model consisting of three convolutional layers and a fully-connected layer can be used. In addition, when predicting the credibility of the face region in the picture, network models of different depths such as AlexNet, GoogLeNet and ResNet can also be adopted. Moreover, without affecting the implementation of the invention, discrimination network models of the same structure can be adopted when predicting the credibility of the face region in the picture and of the local feature regions within the face.
In an optional example, the face region in the picture is input into a trained VGG16 network model to obtain the truth-degree prediction value p0 of the face region; the four local feature regions where the eyebrows, eyes, nose and mouth are located are respectively input into a trained network model consisting of three convolutional layers and a fully-connected layer, to obtain the truth-degree prediction value p1 of the eyebrow region, the truth-degree prediction value p2 of the eye region, the truth-degree prediction value p3 of the nose region and the truth-degree prediction value p4 of the mouth region. Wherein the truth-degree prediction value p0 of the face region is the result of judging the consistency features (such as illumination, latitude, saturation and smoothness) of the face-region image: the higher p0 is, the more credible the basic attributes of the face-region image are. The values p1 to p4 are the results of judging the appearance features of the local images such as the eyebrows and eyes: the higher p1 to p4 are, the more credible the local features of the face are.
The picture identification module 503 is configured to identify whether the picture is a synthesized picture according to the reliability prediction value of the face region and the reliability prediction value of the local feature region.
For example, the identifying, by the picture identifying module 503, whether the picture is a synthesized picture according to the reliability prediction value of the face region and the reliability prediction value of the local feature region may specifically include: the picture identification module 503 determines the reliability prediction value of the picture according to the reliability prediction value of the face region and the reliability prediction value of the local feature region; the picture identification module 503 compares the reliability prediction value of the picture with a preset threshold; when the reliability prediction value of the picture is greater than a preset threshold, the picture identification module 503 determines that the picture is not a synthesized picture; in the case that the reliability prediction value of the picture is less than or equal to the preset threshold, the picture identification module 503 confirms that the picture is a composite picture. Under the condition of not influencing the implementation of the invention, the preset threshold value can be flexibly set according to the specific application scene.
In the device of the embodiment of the invention, the detection module detects the face region in the picture and the local feature regions within the face; the discrimination module respectively inputs the face region and the local feature regions into the discrimination network models to obtain the credibility prediction value of the face region and of each local feature region; and the picture identification module identifies whether the picture is a synthesized picture according to these credibility prediction values. In this way, the identification accuracy of synthesized pictures can be improved, solving the problem of low face-synthesized-image detection accuracy in the prior art.
Fig. 6 is a schematic block diagram of a video detection apparatus according to a sixth embodiment of the present invention. As shown in fig. 6, the video detection apparatus 600 according to the embodiment of the present invention includes: a detection module 601, a discrimination module 602, and a video identification module 603.
The detecting module 601 is configured to input multiple frames of pictures in a video into a detection network model to detect a face region and a local feature region in the face of each frame of picture in the multiple frames of pictures.
Illustratively, the detection network model may be a pre-trained MTCNN (Multi-Task Cascaded Convolutional Neural Network) model. The MTCNN model jointly performs face region detection and local facial feature detection, with a network structure containing three 3 × 3 convolutional layers and a 2× pooling layer. By inputting the multiple frames of pictures in the video into the pre-trained MTCNN model, the face region and the local feature regions within the face can be segmented from each frame. Wherein the local feature regions in the face may comprise at least one of: eyebrow area, eye area, nose area, and mouth area.
In addition, the detection network model can also be a facial and facial key feature point detection network model such as a FaceNet model, a YOLO v3 model, a PCN model, a RetinaNet model and the like. In specific implementation, the detection network model can be flexibly selected according to the resolution of different pictures and the computing capacity of the equipment.
The determining module 602 is configured to input the face region in each frame of picture and the local feature region in the face into a determining network model respectively, so as to obtain a reliability prediction value of the face region in the frame of picture and a reliability prediction value of the local feature region in the face.
When reliability prediction is performed on the face region in a frame of picture and the local feature regions within the face, discrimination network models of different structures can be adopted. For example, when predicting the credibility of the face region in a frame of picture, a VGG16 network model can be adopted; when predicting the credibility of a local feature region of the face in that frame, a network model consisting of three convolutional layers and a fully-connected layer can be used. In addition, when predicting the credibility of the face region in a frame of picture, network models of different depths such as AlexNet, GoogLeNet and ResNet can also be adopted. Moreover, without affecting the implementation of the invention, discrimination network models of the same structure can be adopted when predicting the credibility of the face region in a frame of picture and of the local feature regions within the face.
In an optional example, the face region in one frame of picture is input into a trained VGG16 network model to obtain the truth-degree prediction value p0 of the face region; the four local feature regions of that frame where the eyebrows, eyes, nose and mouth are located are respectively input into a trained network model consisting of three convolutional layers and a fully-connected layer, to obtain the truth-degree prediction value p1 of the eyebrow region, the truth-degree prediction value p2 of the eye region, the truth-degree prediction value p3 of the nose region and the truth-degree prediction value p4 of the mouth region. Wherein the truth-degree prediction value p0 of the face region is the result of judging the consistency features (such as illumination, latitude, saturation and smoothness) of the face-region image: the higher p0 is, the more credible the basic attributes of the face-region image are. The values p1 to p4 are the results of judging the appearance features of the local images such as the eyebrows and eyes: the higher p1 to p4 are, the more credible the local features of the face are.
The video identification module 603 is configured to identify whether the video is a composite video according to the credibility prediction value of the face region and the credibility prediction values of the local feature regions in each frame of picture.
For example, the identifying, by the video identification module 603, whether the video is a composite video according to the credibility prediction value of the face region and the credibility prediction values of the local feature regions in each frame of picture may specifically include: the video identification module 603 maps the credibility prediction value of the face region and the credibility prediction values of the local feature regions in each frame of picture to a space vector point corresponding to that frame of picture; the video identification module 603 determines the dispersion of the space vector points corresponding to the multiple frames of pictures; when the dispersion is greater than a preset threshold, the video identification module 603 confirms that the video is a composite video; when the dispersion is less than or equal to the preset threshold, the video identification module 603 confirms that the video is not a composite video.
The determining, by the video identification module 603, of the dispersion of the space vector points corresponding to the multiple frames of pictures may specifically include: the video identification module 603 takes each space vector point in turn as the center point of a cluster, calculates the sum of the distances from the other space vector points to that point, and takes this sum of distances as the cluster dispersion with that point as the center; the video identification module 603 then takes the minimum of these cluster dispersions as the dispersion of the space vector points corresponding to the multiple frames of pictures.
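A minimal sketch of this dispersion test follows, assuming each frame's five scores are stacked into a NumPy array of shape (n_frames, 5); the stacking order [p0, p1, p2, p3, p4] and the threshold value are hypothetical.

    import numpy as np

    def dispersion(points: np.ndarray) -> float:
        """points: (n_frames, 5) array, one row of scores per frame."""
        # Pairwise Euclidean distances between the space vector points.
        d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        # Cluster dispersion with each point as center: sum of distances to all others.
        cluster_dispersion = d.sum(axis=1)
        # The dispersion of the whole set is the minimum cluster dispersion.
        return float(cluster_dispersion.min())

    THRESHOLD = 1.5  # hypothetical preset threshold
    frame_scores = np.random.rand(8, 5)  # stand-in for per-frame [p0..p4] scores
    is_composite = dispersion(frame_scores) > THRESHOLD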
In the embodiment of the invention, when detecting a composite video, not only the global features of the face in the multiple frames of pictures but also the local features of the face are considered. The discrimination module respectively inputs the face region in each frame of picture and the local feature regions in the face into the discrimination network model, so as to obtain the credibility prediction value of the face region in the frame of picture and the credibility prediction values of the local feature regions in the face, and the video identification module performs composite-video detection according to these credibility prediction values for each frame of picture. The identification accuracy for composite videos can thereby be improved, which solves the problem of low detection accuracy for face-synthesized videos in the prior art.
Fig. 7 shows an exemplary system architecture 700 to which the picture detection method or the video detection method or the picture detection apparatus or the video detection apparatus of the embodiments of the present invention can be applied.
As shown in fig. 7, the system architecture 700 may include terminal devices 701, 702, and 703, a network 704, and a server 705. The network 704 serves to provide a medium for communication links between the terminal devices 701, 702, 703 and the server 705. The network 704 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal devices 701, 702, and 703 to interact with the server 705 over the network 704 to receive or send messages and the like. Various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software, may be installed on the terminal devices 701, 702, and 703.
The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 705 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 701, 702, and 703. The background management server may analyze and perform other processing on the received data such as the face picture or the video, and feed back a processing result (for example, a face picture detection result) to the terminal device.
It should be noted that the picture detection method or the video detection method provided by the embodiment of the present invention is generally executed by the server 705, and accordingly, the picture detection apparatus or the video detection apparatus is generally disposed in the server 705.
It should be understood that the number of terminal devices, networks, and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use in implementing an electronic device of an embodiment of the present invention. The computer system illustrated in FIG. 8 is only one example and should not impose any limitations on the scope of use or functionality of embodiments of the invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. When executed by the central processing unit (CPU) 801, the computer program performs the above-described functions defined in the system of the present invention.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a detection module, a discrimination module, and an identification module. The names of these modules do not in some cases constitute a limitation on the module itself, and for example, the detection module may also be described as a "module that segments a face region and a local feature region in the face from a picture".
As another aspect, the present invention also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: input a picture into a detection network model to detect a face region in the picture and the local feature regions in the face; respectively input the face region in the picture and the local feature regions in the face into a discrimination network model to obtain a credibility prediction value of the face region and credibility prediction values of the local feature regions in the face; and identify whether the picture is a synthesized picture according to the credibility prediction value of the face region and the credibility prediction values of the local feature regions.
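Tying these steps together for the single-picture case, the sketch below maps the five scores to a space vector point and takes its weighted Euclidean distance from the origin as the picture's credibility prediction value, as in the method described above; the weights and threshold shown are placeholders.

    import numpy as np

    def picture_credibility(p, w):
        """Weighted Euclidean distance between the space vector point p and the origin."""
        p, w = np.asarray(p), np.asarray(w)
        return float(np.sqrt(np.sum(w * p ** 2)))

    p = [0.9, 0.8, 0.85, 0.7, 0.75]    # [p0..p4] from the discrimination network models
    w = [0.4, 0.15, 0.15, 0.15, 0.15]  # hypothetical per-region weights
    THRESHOLD = 0.6                    # hypothetical preset threshold

    is_synthesized = picture_credibility(p, w) <= THRESHOLD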
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A picture detection method, the method comprising:
inputting a picture into a detection network model to detect a face area in the picture and a local feature area in the face;
respectively inputting the face area in the picture and the local feature area in the face into a discrimination network model to obtain a credibility prediction value of the face area and a credibility prediction value of the local feature area in the face;
and identifying whether the picture is a synthesized picture or not according to the credibility prediction value of the face area and the credibility prediction value of the local feature area.
2. The method according to claim 1, wherein identifying whether the picture is a synthesized picture according to the credibility prediction value of the face area and the credibility prediction value of the local feature area comprises:
determining a credibility prediction value of the picture according to the credibility prediction value of the face area and the credibility prediction value of the local feature area; comparing the credibility prediction value of the picture with a preset threshold value; determining that the picture is not a synthesized picture under the condition that the credibility prediction value of the picture is greater than the preset threshold value; and confirming that the picture is a synthesized picture under the condition that the credibility prediction value of the picture is less than or equal to the preset threshold value.
3. The method according to claim 2, wherein determining the credibility prediction value of the picture according to the credibility prediction value of the face area and the credibility prediction value of the local feature area comprises:
mapping the credibility prediction value of the face area and the credibility prediction value of the local feature area to obtain a space vector point corresponding to the picture; and calculating a weighted Euclidean distance between the space vector point and the origin of the coordinate system, and taking the weighted Euclidean distance as the credibility prediction value of the picture.
4. The method of claim 1, wherein the local feature areas in the face comprise: eyebrow area, eye area, nose area, and mouth area.
5. A method for video detection, the method comprising:
inputting a plurality of frames of pictures in a video into a detection network model to detect a face area and a local feature area in the face of each frame of picture in the plurality of frames of pictures;
respectively inputting the face area in each frame of picture and the local feature area in the face into a discrimination network model to obtain a credibility prediction value of the face area in the frame of picture and a credibility prediction value of the local feature area in the face;
and identifying whether the video is a composite video or not according to the credibility prediction value of the face area and the credibility prediction value of the local feature area in each frame of picture.
6. The method according to claim 5, wherein identifying whether the video is a composite video according to the credibility prediction value of the face region and the credibility prediction value of the local feature region in each frame of picture comprises:
mapping the credibility prediction value of the face area and the credibility prediction value of the local feature area in each frame of picture to obtain a space vector point corresponding to that frame of picture; determining the dispersion of the space vector points corresponding to the multiple frames of pictures; confirming that the video is a composite video under the condition that the dispersion is greater than a preset threshold value; and confirming that the video is not a composite video under the condition that the dispersion is less than or equal to the preset threshold value.
7. The method of claim 6, wherein determining the dispersion of the space vector points corresponding to the multiple frames of pictures comprises:
taking each space vector point as the center point of a cluster, calculating the sum of the distances from the other space vector points to that space vector point, and taking the sum of distances as the cluster dispersion with that space vector point as the center point; and taking the minimum value among the cluster dispersions, each with a space vector point as its center point, as the dispersion of the space vector points corresponding to the multiple frames of pictures.
8. The method of claim 5, wherein the local feature areas in the face comprise: eyebrow area, eye area, nose area, and mouth area.
9. A picture detection apparatus, characterized in that the apparatus comprises:
the detection module is used for inputting a picture into a detection network model so as to detect a face area in the picture and a local feature area in the face;
the discrimination module is used for respectively inputting the face area in the picture and the local feature area in the face into a discrimination network model so as to obtain a credibility prediction value of the face area and a credibility prediction value of the local feature area in the face;
and the picture identification module is used for identifying whether the picture is a synthesized picture according to the credibility prediction value of the face area and the credibility prediction value of the local feature area.
10. A video detection apparatus, characterized in that the apparatus comprises:
the detection module is used for inputting a plurality of frames of pictures in a video into a detection network model so as to detect a face area and a local feature area in the face of each frame of picture in the plurality of frames of pictures;
the discrimination module is used for respectively inputting the face area in each frame of picture and the local feature area in the face into a discrimination network model so as to obtain a credibility prediction value of the face area in the frame of picture and a credibility prediction value of the local feature area in the face;
and the video identification module is used for identifying whether the video is a composite video or not according to the credibility prediction value of the face area and the credibility prediction value of the local feature area in each frame of picture.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 4 and 5 to 8.
12. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 4 and 5 to 8.
CN202010102318.6A 2020-02-19 2020-02-19 Picture and video detection method and device Pending CN113298747A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010102318.6A CN113298747A (en) 2020-02-19 2020-02-19 Picture and video detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010102318.6A CN113298747A (en) 2020-02-19 2020-02-19 Picture and video detection method and device

Publications (1)

Publication Number Publication Date
CN113298747A true CN113298747A (en) 2021-08-24

Family

ID=77317462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010102318.6A Pending CN113298747A (en) 2020-02-19 2020-02-19 Picture and video detection method and device

Country Status (1)

Country Link
CN (1) CN113298747A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018121428A1 (en) * 2016-12-30 2018-07-05 腾讯科技(深圳)有限公司 Living body detection method, apparatus, and storage medium
CN107292299A (en) * 2017-08-14 2017-10-24 河南工程学院 Side face identification method based on kernel specification correlation analysis
US20190228211A1 (en) * 2017-08-17 2019-07-25 Ping An Technology (Shenzhen) Co., Ltd. Au feature recognition method and device, and storage medium
CN110795975A (en) * 2018-08-03 2020-02-14 浙江宇视科技有限公司 Face false detection optimization method and device
CN109684945A (en) * 2018-12-10 2019-04-26 平安科技(深圳)有限公司 Biopsy method, device, server and storage medium based on pupil

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHAVDAR ALEXANDROV ET AL.: "Sentinel-1 SAR Images of Inland Waterways Traffic", IEEE, 26 August 2018 (2018-08-26)
LIU BIN; HAN YIGANG; SU GUOFANG: "Liveness identification based on intelligent micro-movement detection of facial expressions" (基于面部表情微动智能检测的活体鉴别), Video Engineering (电视技术), no. 09, 5 May 2019 (2019-05-05)
HAO LANYU; MENG XIN; LI RONGBIN; ZHU HEGUI: "Image authenticity identification algorithm based on ELBP and Adaboost" (基于ELBP和Adaboost的图像真实性鉴别算法), Computer & Digital Engineering (计算机与数字工程), no. 02, 20 February 2018 (2018-02-20)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114298997A (en) * 2021-12-23 2022-04-08 北京瑞莱智慧科技有限公司 Method and device for detecting forged picture and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination