CN112699707A - Video detection method, device and storage medium

Info

Publication number: CN112699707A (granted publication: CN112699707B)
Application number: CN201911006641.7A
Authority: CN (China)
Prior art keywords: frame, image, feature, steganographic
Legal status: granted; currently active
Other languages: Chinese (zh)
Inventors: 谢榛, 吴西
Original and current assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd, with priority to CN201911006641.7A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/0021: Image watermarking
    • G06T2201/0065: Extraction of an embedded watermark; Reliable detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide a video detection method, a video detection device, and a storage medium. The method includes: acquiring at least one frame of image from a video to be detected; extracting, from the at least one frame of image, steganographic features and visual features associated with a specified object; analyzing the temporal feature differences of the at least one frame of image according to the steganographic features and the visual features associated with the specified object; and detecting whether the video to be detected has been altered according to the steganographic features in the at least one frame of image, the visual features associated with the specified object, and the temporal feature differences of the at least one frame of image. In these embodiments, the video to be detected can be examined in both the spatial dimension and the temporal dimension, so that whether it has been altered can be determined by combining the findings of the two dimensions, which effectively improves the comprehensiveness and accuracy of video detection.

Description

Video detection method, device and storage medium
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video detection method, a video detection device, and a storage medium.
Background
Current video forgery techniques can arbitrarily fabricate the behavior of people in a video by replacing faces, altering expressions, grafting or splicing content, and similar means.
Video forgery seriously interferes with fields such as judicial investigation, insurance verification, and network security, where the authenticity of video is particularly important. Detecting whether a video is authentic has therefore become an urgent problem to be solved.
Disclosure of Invention
Aspects of the present disclosure provide a video detection method, apparatus, and storage medium to detect whether a video is altered.
The embodiment of the application provides a video detection method, which comprises the following steps:
acquiring at least one frame of image in a video to be detected;
extracting steganographic features and visual features associated with a specified object in the at least one frame of image;
analyzing the time-sequence feature difference of the at least one frame of image according to the steganographic features in the at least one frame of image and the visual features associated with the specified objects;
and detecting whether the video to be detected is changed or not according to the steganographic features in the at least one frame of image, the visual features related to the specified object and the feature difference of the at least one frame of image in time sequence.
The embodiment of the application also provides a computing device, which comprises a memory and a processor;
the memory is to store one or more computer instructions;
the processor is coupled with the memory for executing the one or more computer instructions for:
acquiring at least one frame of image in a video to be detected;
extracting steganographic features and visual features associated with a specified object in the at least one frame of image;
analyzing the time-sequence feature difference of the at least one frame of image according to the steganographic features in the at least one frame of image and the visual features associated with the specified objects;
and detecting whether the video to be detected is changed or not according to the steganographic features in the at least one frame of image, the visual features related to the specified object and the feature difference of the at least one frame of image in time sequence.
Embodiments of the present application also provide a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the aforementioned video detection method.
In the embodiment of the application, in the process of detecting a video to be detected, the steganographic features in the single-frame image and the visual features associated with the designated object can be extracted, and the feature difference between the frame images in time sequence can be analyzed according to the steganographic features in the single-frame image and the visual features associated with the designated object, so that whether the designated object in the video to be detected is changed or not can be determined according to the steganographic features in the single-frame image, the visual features associated with the designated object and the feature difference between the frame images in time sequence. Accordingly, in this embodiment, the video to be detected can be detected in the spatial dimension based on the steganographic features in the single-frame image and the visual features associated with the designated object, and the video to be detected can be detected in the time sequence dimension based on the feature difference between the frame images in the time sequence, so that whether the video to be detected is changed or not can be determined by combining the detection conditions of the spatial dimension and the time dimension, which can effectively improve the comprehensiveness and accuracy of video detection.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1a is a schematic flowchart of a video detection method according to an embodiment of the present application;
fig. 1b is a schematic view of a video detection process according to an embodiment of the present application;
fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computing device according to another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, video detection technology is incomplete and cannot effectively determine whether a video has been forged. To solve this problem in the prior art, in some embodiments of the present application, the video to be detected is examined in the spatial dimension based on the steganographic features in single-frame images and the visual features associated with the designated object, and in the temporal dimension based on the feature differences between frame images over time, so that whether the video to be detected has been altered can be determined by combining the findings of the spatial and temporal dimensions, which effectively improves the comprehensiveness and accuracy of video detection.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1a is a schematic flowchart of a video detection method according to an embodiment of the present disclosure. As shown in fig. 1a, the method comprises:
100. acquiring at least one frame of image in a video to be detected;
101. extracting steganographic features and visual features associated with a specified object in at least one frame of image;
102. analyzing the characteristic difference of at least one frame of image in time sequence according to the steganographic characteristics in at least one frame of image and the visual characteristics related to the specified object;
103. and detecting whether the video to be detected is changed or not according to the steganographic features in the at least one frame of image, the visual features related to the specified object and the feature difference of the at least one frame of image in time sequence.
The video detection method provided by the embodiment can be applied to various scenes needing video detection, such as judicial investigation, insurance authentication, network security and the like. In various application scenarios, the designated objects may be set as desired. In this embodiment, the designated object may be a human face, a license plate number, a trademark, or other objects that may be changed, and is not limited herein.
Fig. 1b is a schematic view of a video detection process according to an embodiment of the present application, and corresponds to fig. 1a. As shown in fig. 1a and 1b, at least one frame of image in the video to be detected can be acquired. The at least one frame of image may be a group of consecutive frames from the video to be detected, a group of frames selected by sampling, or frames, extracted from the video to be detected, in which the designated object appears; this embodiment is not limited in this respect. In addition, the video to be detected may be a finished video or a video still being edited.
On this basis, steganographic feature extraction and visual feature extraction can be performed on the acquired at least one frame of image, so as to obtain the steganographic features and the visual features associated with the specified object in the at least one frame of image.
Steganographic features refer to inherent characteristic information hidden in the data during the digital imaging process. The steganographic features may be, for example, neighborhood correlation, color filter array (CFA) characteristics, lens characteristics, or JPEG compression characteristics, which is not limited in this embodiment. Neighborhood correlation represents the relationship between a pixel in the image and its surrounding pixels.
A visual feature (Frame feature) is image attribute information of the region in which the specified object is located within an image. The visual feature may be, for example, edge contrast, which is not limited in this embodiment.
It should be noted that, as mentioned above, the at least one frame of image may be a set of consecutive frames from the video to be detected, so some of these frames may not contain the designated object. For such frames, the operation of extracting the visual feature associated with the designated object may be skipped to simplify processing; of course, this embodiment does not limit this, and the operation may also be performed.
Accordingly, steganographic features and visual features associated with the specified object in at least one frame of image can be extracted. For example, when the designated object is a human face, the steganographic feature is a neighborhood correlation, and the visual feature is an edge contrast, the neighborhood correlation and the edge contrast associated with the human face in at least one frame of image may be extracted based on step 101.
In this embodiment, based on the temporal relationship among the acquired frames, the temporal feature differences of the at least one frame of image may be analyzed according to the steganographic features in the at least one frame of image and the visual features associated with the designated object.
Specifically, the temporal analysis may be performed on the steganographic features in each frame of image according to the steganographic features in at least one frame of image, and the temporal analysis may be performed on the visual features associated with the designated object in each frame of image according to the visual features associated with the designated object in at least one frame of image, so as to determine the feature difference of each frame of image in temporal.
Continuing the example above, the neighborhood correlation and the face-associated edge contrast in each frame of image can be analyzed over time to determine the temporal feature differences of the frames.
On the basis, whether the at least one frame of image is changed or not can be detected according to the steganographic features in the at least one frame of image, the visual features related to the specified object and the feature difference of the at least one frame of image in time sequence, and then whether the video to be detected is changed or not can be detected.
When the steganographic features in the at least one frame of image are abnormal, the at least one frame of image may have been altered.
When the visual features associated with the specified object in the at least one frame of image are abnormal, the at least one frame of image may have been altered.
When the temporal feature differences of the at least one frame of image are abnormal, the at least one frame of image may have been altered.
Accordingly, in the embodiment, whether the at least one frame of image is changed or not can be detected by using the steganographic feature in the at least one frame of image, the visual feature associated with the specified object, and the feature difference of the at least one frame of image in time sequence, so as to determine whether the video to be detected is changed or not.
Alteration of the video to be detected refers to operations performed on it such as face replacement, license plate replacement, face filtering, trademark replacement, license plate masking, or text replacement. Of course, the present embodiment is not limited thereto.
In addition, it should be noted that, in this embodiment, at least one frame of image may be used as a detection unit for video detection. It should be understood that, in this embodiment, a plurality of groups of frame images may be obtained from the video to be detected, and whether each group of frame images has been altered may be detected, so as to determine whether the video to be detected has been altered. Of course, the embodiment is not limited thereto; enough frame images may also be obtained from the video to be detected, and whether the video has been altered may be determined directly from the detection result for those frame images.
In this embodiment, in the process of detecting a video to be detected, the steganographic features and the visual features associated with the designated object in the single-frame image may be extracted, and the feature difference between the frame images in the time sequence may be analyzed according to the steganographic features and the visual features associated with the designated object in the single-frame image, so that whether the designated object in the video to be detected is modified or not may be determined according to the steganographic features in the single-frame image, the visual features associated with the designated object, and the feature difference between the frame images in the time sequence. Accordingly, in this embodiment, the video to be detected can be detected in the spatial dimension based on the steganographic features in the single-frame image and the visual features associated with the designated object, and the video to be detected can be detected in the time sequence dimension based on the feature difference between the frame images in the time sequence, so that whether the video to be detected is changed or not can be determined by combining the detection conditions of the spatial dimension and the time dimension, which can effectively improve the comprehensiveness and accuracy of video detection.
In the above or below embodiments, the at least one frame of image may be input to a convolutional neural network, and the steganographic features and the visual features associated with the specified object in the at least one frame of image may be extracted using the convolutional neural network.
In this embodiment, a convolutional neural network is used to perform extraction processing of steganographic features and visual features associated with a specified object.
In practical application, at least one frame of image can be input into a first convolution neural network, and the first convolution neural network is utilized to carry out convolution processing on the at least one frame of image so as to extract a feature map reflecting steganographic features from the at least one frame of image; and inputting the at least one frame of image into a second convolution neural network, and performing convolution processing on the at least one frame of image by using the second convolution neural network so as to extract a feature map reflecting visual features associated with the specified object from the at least one frame of image.
In this embodiment, a greater or lesser number of convolutional neural networks may also be used to perform the extraction processing of the steganographic features and the visual features associated with the designated object, which is not limited in this embodiment.
As mentioned above, the present embodiment does not limit what kind of steganographic features and visual features are used. In this regard, in this embodiment, different first convolutional neural networks may be used for different steganographic features, and different second convolutional neural networks may be used for different visual features.
Taking the case that the steganographic features adopt neighborhood correlation as an example, for each frame of image, performing convolution processing on the frame of image in sequence by utilizing a plurality of convolution layers of a first convolution neural network so as to extract a feature map reflecting the neighborhood correlation from the frame of image; in the first convolution layer of the first convolution neural network, convolution processing is carried out on the frame image by using a convolution kernel with the average value of 0.
In this example, the convolution kernels in the first convolutional layer of the first convolutional neural network are constrained so that the mean of each kernel in that layer is 0. As a result, when the frame image is convolved in this layer, if the values within a kernel's receptive field change gently, that is, they are close to one another, the convolved output will be close to 0; if the values within the receptive field change sharply, that is, they differ greatly, the convolved output will be large. The first convolutional layer therefore extracts the noise pattern of the frame image and restricts the subsequent convolution processing to the domain of neighborhood correlation.
The subsequent convolutional layers of the first convolutional neural network then continue to convolve the output of the first convolutional layer, combining and abstracting the shallow features extracted by that layer, thereby obtaining the neighborhood correlation of the frame image.
Accordingly, in this example, the constraints may be designed in a data-driven fashion such that the first convolutional neural network may adaptively extract neighborhood correlations in at least one frame of the image.
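For illustration only, the mean-zero kernel constraint described above can be sketched in PyTorch as follows; the layer sizes, channel counts, and class names are assumptions of this sketch rather than the exact architecture of the embodiment:

    import torch
    import torch.nn as nn

    class ZeroMeanConv2d(nn.Conv2d):
        """First convolution layer whose kernels are re-centred to zero mean, so that
        flat neighbourhoods produce outputs near 0 and abrupt local changes produce
        large responses, i.e. the layer acts as a noise / residual extractor."""

        def forward(self, x):
            # Subtract each kernel's mean so every kernel sums to zero before convolving.
            w = self.weight - self.weight.mean(dim=(1, 2, 3), keepdim=True)
            return nn.functional.conv2d(x, w, self.bias, self.stride,
                                        self.padding, self.dilation, self.groups)

    class SteganalysisCNN(nn.Module):
        """Illustrative first convolutional branch (steganographic features): a
        constrained first layer followed by ordinary layers that combine and
        abstract the shallow noise features."""

        def __init__(self):
            super().__init__()
            self.constrained = ZeroMeanConv2d(3, 16, kernel_size=5, padding=2, bias=False)
            self.backbone = nn.Sequential(
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            )

        def forward(self, x):                            # x: (batch, 3, H, W) frame images
            return self.backbone(self.constrained(x))    # feature map of steganographic cues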
Taking the visual feature as an example of the edge contrast, the frame image may be sequentially convolved by the plurality of convolution layers of the second convolutional neural network for each frame image, so as to extract a feature map reflecting the edge contrast associated with the specified object from the frame image. The specific convolution process is not described in detail herein.
It is worth noting that in this embodiment, multiple steganographic features and multiple visual features in at least one frame of image can be extracted simultaneously, different first convolutional neural networks can be configured for different steganographic features, and different second convolutional neural networks can be configured for different visual features.
Accordingly, in this embodiment, the constrained first convolutional neural network can be used to extract local, low-level noise in the frame image, so that steganographic features in the frame image are obtained adaptively, while a second convolutional neural network is used to extract the visual features associated with the specified object. The two kinds of features are complementary, so alteration traces in the frame image can be found more comprehensively based on the steganographic features together with the visual features associated with the specified object.
In the above or below embodiments, the steganographic features and the visual features associated with the designated object in the at least one frame of image may be respectively subjected to feature integration within the frame of image, so as to analyze the feature difference of the at least one frame of image in time sequence based on the integrated features.
In practical application, spatial feature vectors corresponding to at least one frame of image can be respectively generated according to steganographic features in at least one frame of image and visual features associated with a specified object; sequencing the space characteristic vectors corresponding to at least one frame of image according to the time sequence relation among at least one frame of image to form a space characteristic sequence; and analyzing the characteristic difference of at least one frame of image in time sequence according to the spatial characteristic sequence.
Wherein the spatial feature vector is used for representing the change trace in the frame image as a whole.
In this embodiment, the spatial feature sequence may be input into a recurrent neural network, and the recurrent neural network is used to analyze the temporal feature differences of the at least one frame of image. When the temporal feature differences of the at least one frame of image are significant, the at least one frame of image may have been altered.
For example, if a certain frame in the at least one frame of image has been altered, the feature difference between that frame and its neighboring frames will be significantly higher than the differences among the other frames; this is only one possibility, and the feature difference between that frame and its neighboring frames may also be significantly higher than a preset value. The recurrent neural network is therefore used to analyze the temporal feature differences of the at least one frame of image from multiple perspectives.
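A minimal sketch of this temporal analysis, assuming the spatial feature vectors have already been formed and ordered (the GRU and the dimensions below are illustrative choices, not mandated by the embodiment):

    import torch
    import torch.nn as nn

    class TemporalAnalyzer(nn.Module):
        """Recurrent network that consumes the spatial feature sequence (one vector
        per frame, in temporal order) and emits one output per time step, from which
        temporal feature differences between frames can be read off."""

        def __init__(self, feat_dim=128, hidden_dim=128):
            super().__init__()
            self.rnn = nn.GRU(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)

        def forward(self, spatial_seq):            # (batch, num_frames, feat_dim)
            outputs, _ = self.rnn(spatial_seq)     # (batch, num_frames, hidden_dim)
            return outputs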
In this embodiment, multiple implementation manners may be adopted to generate the spatial feature vectors corresponding to the at least one frame of image.
In an optional implementation manner, for each frame image, global pooling processing can be performed on the steganographic features extracted from the frame image and the visual features associated with the designated object respectively to obtain steganographic feature vectors and visual feature vectors corresponding to the frame image; splicing the steganographic feature vector and the visual feature vector corresponding to the frame image, and inputting a splicing result into a first full-connection network; in the first full-connection network, the steganographic feature vector and the visual feature vector corresponding to the frame image are fused based on preset weight parameters to obtain a spatial feature vector corresponding to the frame image.
On the basis of extracting the steganographic features and the visual features by using the convolutional neural network mentioned in the foregoing embodiment, in this implementation, the feature map output by the first convolutional neural network and reflecting the steganographic features and the feature map output by the second convolutional neural network and reflecting the visual features associated with the specified object may be subjected to global pooling processing to obtain steganographic feature vectors and visual feature vectors corresponding to the frame image.
If the steganographic feature vector corresponding to the frame image is [a] and the visual feature vector is [b], the two vectors can be spliced as [a, b].
A weight parameter is preset in the first fully-connected network and determines how much attention is paid to the steganographic features and the visual features, respectively, during video detection. After the splicing result is input into the first fully-connected network, the steganographic feature vector and the visual feature vector corresponding to the frame image can be fused by the first fully-connected network, yielding the spatial feature vector corresponding to the frame image.
Of course, the above implementation is only exemplary, and the embodiment is not limited thereto.
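As one possible sketch of the global pooling, splicing, and first fully-connected fusion described above (channel counts and the output dimension are assumptions of this sketch):

    import torch
    import torch.nn as nn

    class SpatialFusion(nn.Module):
        """Per-frame fusion: globally pool the steganographic and visual feature maps,
        splice the resulting vectors, and fuse them with a fully-connected layer whose
        learned weights set the relative attention paid to the two feature types."""

        def __init__(self, steg_channels=64, vis_channels=64, out_dim=128):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)               # global pooling
            self.fc1 = nn.Linear(steg_channels + vis_channels, out_dim)

        def forward(self, steg_map, vis_map):                 # feature maps: (batch, C, H, W)
            a = self.pool(steg_map).flatten(1)                # steganographic feature vector [a]
            b = self.pool(vis_map).flatten(1)                 # visual feature vector [b]
            return self.fc1(torch.cat([a, b], dim=1))         # spatial feature vector per frame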
In this embodiment, a scheme for fusing a steganographic feature and a visual feature is provided. In the spatial dimension, the steganographic features and the visual features are respectively extracted by different convolutional neural networks and then are fused into final spatial features through global pooling, feature splicing and full-connection networks. In the time dimension, different frame images can share a recurrent neural network, the fused spatial feature sequence is used as the input of the recurrent neural network, and the recurrent neural network can be used for analyzing additional time sequence features.
In the above or following embodiments, the steganographic features in the at least one frame of image, the visual features associated with the specified object, and the temporal feature differences of the at least one frame of image may be input into a second fully-connected network, and the second fully-connected network may be used to predict the probability that the video to be detected has been altered, so as to determine whether it has been altered.
The second fully-connected network is preconfigured with a mapping between input information and classification results. In this embodiment, the input information includes the steganographic features in the at least one frame of image, the visual features associated with the designated object, and the temporal feature differences of the at least one frame of image; the classification results may include "altered" and "unaltered".
Accordingly, the second fully-connected network may be utilized to predict a probability that at least one frame of the image is altered.
As mentioned above, in this embodiment, multiple groups of frame images from the video to be detected can be examined, and whether the video has been altered can then be determined from the alteration probabilities of the individual groups, for example from their average or maximum. When the average or maximum satisfies the alteration condition, the video to be detected is determined to have been altered.
In the foregoing embodiment, on the basis of analyzing the feature difference of at least one frame of image in time sequence by using the recurrent neural network, in this embodiment, all output nodes of the recurrent neural network may be connected to the second fully-connected network, and the output of the recurrent neural network at each time may be provided to the second fully-connected network; in addition, based on the memory characteristics of the recurrent neural network, the recurrent neural network can not only input the characteristic difference of the at least one frame of image in time sequence into the second fully-connected network, but also provide the steganographic characteristics and the visual characteristics associated with the specified object in the at least one frame of image memorized by the recurrent neural network into the second fully-connected network.
Of course, this connection structure is merely exemplary, and in this embodiment, the convolutional neural network may also be connected to a second fully-connected network to provide the steganographic features and the visual features associated with the designated object in the at least one frame of image to the second fully-connected network. This embodiment is not limited to this.
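A sketch of the second fully-connected network and the group-level aggregation is given below; the temporal average pooling, the two-class head, and the 0.5 threshold follow the examples in this description, while the dimensions and helper names are illustrative assumptions:

    import torch
    import torch.nn as nn

    class AlterationHead(nn.Module):
        """Second fully-connected network: maps the recurrent network's outputs for one
        group of frames to a probability that the group has been altered."""

        def __init__(self, hidden_dim=128):
            super().__init__()
            self.fc2 = nn.Linear(hidden_dim, 2)               # classes: unaltered / altered

        def forward(self, rnn_outputs):                       # (batch, num_frames, hidden_dim)
            pooled = rnn_outputs.mean(dim=1)                  # temporal average pooling
            return self.fc2(pooled).softmax(dim=1)[:, 1]      # probability of "altered"

    def video_is_altered(group_probs, threshold=0.5):
        """Aggregate the per-group probabilities for a video, here by the maximum:
        a single strongly suspicious group of frames flags the whole video."""
        return max(group_probs) > threshold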
In the above or below embodiments, the specified object in at least one frame of image may also be detected; carrying out alignment processing on a specified object in at least one frame of image; and inputting the at least one frame of image subjected to the alignment processing into a convolutional neural network so as to extract the steganographic features and the visual features associated with the specified object in the at least one frame of image by using the convolutional neural network.
Taking a human face as the designated object as an example, in this embodiment a face detection technique may be used to detect the face in the at least one frame of image and to perform an alignment operation on it.
Specifically, key feature points of the face, such as the eyes, nose tip, mouth corners, eyebrows, and contour points of the facial parts, can be located, and the faces in the at least one frame of image can be aligned through an affine transformation based on the located key feature points.
In this way, even if the position of the face shifts between frames, the alignment operation saves the convolutional neural network from having to treat that positional shift as a distinguishing feature, so the influence of such a feature on the detection result is avoided.
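The alignment operation can be sketched with OpenCV as follows, assuming the key feature points have already been located by some face detector; the template coordinates and crop size are illustrative assumptions:

    import cv2
    import numpy as np

    # Canonical (x, y) positions of five landmarks (eye centres, nose tip, mouth
    # corners) in a 224x224 aligned crop; the exact values are illustrative.
    TEMPLATE = np.float32([[74, 90], [150, 90], [112, 128], [84, 168], [140, 168]])

    def align_face(frame, landmarks, size=(224, 224)):
        """Warp `frame` so that the detected `landmarks` (a 5x2 array from any face
        detector) line up with TEMPLATE, removing frame-to-frame position offsets."""
        matrix, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), TEMPLATE)
        return cv2.warpAffine(frame, matrix, size)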
In the above or following embodiments, the video detection scheme may be implemented by a video detection system that includes the convolutional neural networks, the recurrent neural network, and the fully-connected networks mentioned in the above embodiments: the outputs of the first and second convolutional neural networks are connected to the first fully-connected network, the output of the first fully-connected network is connected to the recurrent neural network, and the output of the recurrent neural network is connected to the second fully-connected network.
Based on the above framework of the video detection system, in this embodiment, the video detection system can be trained to determine parameters in the video detection system.
In practical application, an end-to-end mode can be adopted to carry out overall training on the video detection system.
Training samples are input into the first and second convolutional neural networks: the first convolutional neural network extracts the steganographic features of each frame of image, the second convolutional neural network extracts the visual features of each frame, and the first fully-connected network fuses the steganographic and visual features of each frame. The fused features of all frames are then input into the recurrent neural network in temporal order, the outputs of the recurrent neural network are averaged over time (temporal average pooling), and the pooled result is fed into the second fully-connected network.
The output of the second fully-connected network is measured with a cross-entropy loss function, and the parameters of the whole network are updated accordingly. The training samples are frame images for which it is known whether they have been altered.
Of course, the above training process is only exemplary, and the embodiment is not limited thereto.
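One way to sketch an end-to-end training step, assuming a module (here called VideoDetector-style `detector`, a hypothetical name) that wires together the branches described above and returns two-class logits:

    import torch
    import torch.nn as nn

    def train_step(detector, optimizer, frames, labels):
        """One end-to-end update. `detector` is an assumed nn.Module chaining
        CNN1/CNN2 -> first fully-connected fusion -> recurrent network -> temporal
        average pooling -> second fully-connected network, returning raw two-class
        logits. `frames`: (batch, num_frames, 3, H, W); `labels`: integer tensor
        with 0 = unaltered, 1 = altered (known in advance for training samples)."""
        logits = detector(frames)                              # (batch, 2)
        loss = nn.functional.cross_entropy(logits, labels)     # cross-entropy loss
        optimizer.zero_grad()
        loss.backward()                                        # gradients flow through all sub-networks
        optimizer.step()
        return loss.item()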
In the above or following embodiments, it may also be determined whether the video to be detected carries pre-buried feature prompt information. In practice, a designated field in the video to be detected can be read to judge whether it contains the pre-buried feature prompt information.
If the video to be detected carries the pre-buried feature prompt information, the pre-buried features of the at least one frame of image are extracted.
The pre-buried feature is a user-defined feature which is hidden and configured in the video to be detected. For example, the pre-buried feature may be a watermark feature, a digital signature feature, or a white noise feature, among others.
In addition, in this embodiment, a convolutional neural network may be used to extract the pre-buried features, which is not limited to this embodiment.
Based on the method, whether the video to be detected is changed or not can be detected according to the pre-buried feature extraction result.
In this embodiment, the pre-buried feature extraction result is at least one of the following: the pre-buried feature is not extracted, the extracted pre-buried feature is determined to be abnormal, or the extracted pre-buried feature is determined to be normal. In practice, tamper detection may be performed on the extracted pre-buried features to determine whether they are abnormal, which is not limited in this embodiment.
Accordingly, in this embodiment, whether the video to be detected is modified can be detected according to the steganographic features in the at least one frame of image, the visual features associated with the designated object, the feature difference of the at least one frame of image in the time sequence, and the pre-buried feature extraction result.
The specific detection rule can be set according to actual needs.
For example, the probability that the video to be detected is changed can be predicted according to the steganographic features in at least one frame of image, the visual features related to the specified object and the feature difference of at least one frame of image in time sequence, and if the pre-buried features are not extracted or the extracted pre-buried features are determined to be abnormal in the pre-buried feature extraction result, the probability can be linearly compensated to improve the probability.
For another example, it may be determined that the video to be detected has been altered when either the probability or the pre-buried feature extraction result fails to meet the preset standard; for instance, when the probability exceeds the preset standard of 0.5 or the pre-buried features are not extracted, the video to be detected is determined to have been altered.
For yet another example, it may be determined that the video to be detected has been altered only when both the probability and the pre-buried feature extraction result fail to meet the preset standard.
The above are exemplary, and the present embodiment is not limited thereto.
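The first example above (linearly compensating the predicted probability when the pre-buried feature is missing or abnormal) can be sketched as follows; the compensation factor and the result labels are illustrative assumptions:

    def combine_with_pre_buried_result(prob_altered, pre_buried_result, boost=0.3):
        """Raise the model's altered-probability when the pre-buried feature is not
        extracted or is abnormal. `pre_buried_result` is one of 'not_extracted',
        'abnormal', 'normal'; the factor 0.3 is an illustrative choice."""
        if pre_buried_result in ("not_extracted", "abnormal"):
            prob_altered = prob_altered + boost * (1.0 - prob_altered)   # linear compensation
        return min(prob_altered, 1.0)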
Fig. 2 is a schematic view of an application scenario of video detection according to an embodiment of the present application. As shown in fig. 2, the video detection scheme provided by the present embodiment is used to detect whether a face in a video is modified.
In the application scenario shown in fig. 2, the input is a video to be detected, and the output is whether the video to be detected is altered.
The following describes the video detection process with reference to fig. 2 by taking 5 consecutive frame images in the video to be detected as an example.
After the 5 consecutive frame images are preprocessed by face detection, extraction, and alignment, the preprocessed images are input into CNN1 and CNN2 respectively: CNN1 extracts, for each of the 5 frames, a feature map reflecting neighborhood correlation, and CNN2 extracts, for each of the 5 frames, a feature map reflecting the face-associated edge contrast.
The outputs of CNN1 and CNN2 are globally pooled, spliced, and input into the first fully-connected network. The first fully-connected network fuses the neighborhood correlation and the face-associated edge contrast of the 5 consecutive frames into 5 spatial feature vectors, which are ordered according to the temporal relation of the 5 frames to form a spatial feature sequence.
And inputting the spatial feature sequence into the RNN to analyze the feature difference of 5 continuous frame images in time sequence by using the RNN.
The RNN then feeds the analyzed temporal feature differences of the 5 consecutive frames, together with the memorized neighborhood correlation and face-associated edge contrast of those frames, into the second fully-connected network.
The second fully-connected network outputs the probability that the 5 consecutive frame images have been altered.
After the video detection process is carried out on other frame images in the video to be detected, whether the video to be detected is changed or not can be determined. For example, whether the video to be detected is altered can be defined by setting a probability threshold.
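For illustration, the scenario above can be wired together from the sketch modules given earlier (all sizes, and the use of a plain convolutional stack for CNN2, are assumptions of this sketch):

    import torch
    import torch.nn as nn

    cnn1 = SteganalysisCNN()                                          # neighbourhood correlation (CNN1)
    cnn2 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())   # face-associated edge contrast (CNN2)
    fusion, rnn, head = SpatialFusion(), TemporalAnalyzer(), AlterationHead()

    frames = torch.randn(5, 3, 224, 224)           # 5 consecutive, face-aligned frame images
    steg, vis = cnn1(frames), cnn2(frames)         # per-frame feature maps
    spatial_seq = fusion(steg, vis).unsqueeze(0)   # (1, 5, 128) spatial feature sequence
    prob = head(rnn(spatial_seq))                  # probability that the 5 frames were altered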
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may all be the same device, or different devices may serve as the execution subjects of different steps. For example, the execution subject of steps 101 to 102 may be device A; alternatively, the execution subject of steps 101 and 102 may be device A while the execution subject of step 103 is device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 101, 102, etc., are merely used for distinguishing different operations, and the sequence numbers do not represent any execution order per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
Fig. 3 is a schematic structural diagram of a computing device according to another embodiment of the present application. As shown in fig. 3, the computing device includes: a memory 30 and a processor 31.
The memory 30 is used to store computer programs and may be configured to store other various data to support operations on the computing device. Examples of such data include instructions for any application or method operating on the computing device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 30 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 31, coupled to the memory 30, for executing the computer program in the memory 30 for:
acquiring at least one frame of image in a video to be detected;
extracting steganographic features and visual features associated with a specified object in at least one frame of image;
analyzing the characteristic difference of at least one frame of image in time sequence according to the steganographic characteristics in at least one frame of image and the visual characteristics related to the specified object;
and detecting whether the video to be detected is changed or not according to the steganographic features in the at least one frame of image, the visual features related to the specified object and the feature difference of the at least one frame of image in time sequence.
In an alternative embodiment, the processor 31, when extracting the steganographic features and the visual features associated with the designated object in at least one frame of the image, is configured to:
and inputting the at least one frame of image into a convolutional neural network, and extracting the steganographic features and the visual features associated with the specified object in the at least one frame of image by using the convolutional neural network.
In an alternative embodiment, the processor 31, when inputting the at least one frame of image into the convolutional neural network, and extracting the steganographic features and the visual features associated with the designated object in the at least one frame of image using the convolutional neural network, is configured to:
inputting at least one frame of image into a first convolution neural network, and performing convolution processing on the at least one frame of image by using the first convolution neural network so as to extract a feature map reflecting steganographic features from the at least one frame of image;
and inputting the at least one frame of image into a second convolution neural network, and performing convolution processing on the at least one frame of image by using the second convolution neural network so as to extract a feature map reflecting visual features associated with the specified object from the at least one frame of image.
In an alternative embodiment, the steganographic feature is a neighborhood correlation, and the processor 31, when performing convolution processing on the at least one frame of image by using the first convolution neural network to extract a feature map reflecting the steganographic feature from the at least one frame of image, is configured to:
and sequentially performing convolution processing on each frame image by utilizing a plurality of convolution layers of the first convolution neural network so as to extract a characteristic diagram reflecting neighborhood correlation from the frame image, wherein in the first convolution layer of the first convolution neural network, convolution kernel with the average value of 0 is used for performing convolution processing on the frame image.
In an alternative embodiment, the processor 31, when inputting the at least one frame of image into the convolutional neural network, is configured to:
detecting a designated object in at least one frame of image;
carrying out alignment processing on a specified object in at least one frame of image;
and inputting the at least one frame of image subjected to the alignment processing into a convolutional neural network so as to extract the steganographic features and the visual features associated with the specified object in the at least one frame of image by using the convolutional neural network.
In an alternative embodiment, the processor 31 is configured to, when analyzing feature variability of at least one frame of image in time sequence according to the steganographic features in the at least one frame of image and the visual features associated with the designated object:
respectively generating spatial feature vectors corresponding to the at least one frame of image according to the steganographic features in the at least one frame of image and the visual features associated with the designated object;
sequencing the space characteristic vectors corresponding to at least one frame of image according to the time sequence relation among at least one frame of image to form a space characteristic sequence;
and analyzing the characteristic difference of at least one frame of image in time sequence according to the spatial characteristic sequence.
In an optional embodiment, the processor 31, when generating the spatial feature vectors corresponding to the at least one frame of image respectively according to the steganographic features in the at least one frame of image and the visual features associated with the designated object, is configured to:
aiming at each frame of image, global pooling processing is respectively carried out on the steganographic features extracted from the frame image and the visual features associated with the designated object so as to obtain steganographic feature vectors and visual feature vectors corresponding to the frame image;
splicing the steganographic feature vector and the visual feature vector corresponding to the frame image, and inputting a splicing result into a first full-connection network;
in the first full-connection network, the steganographic feature vector and the visual feature vector corresponding to the frame image are fused based on preset weight parameters to obtain a spatial feature vector corresponding to the frame image.
In an alternative embodiment, the processor 31, when analyzing the temporal feature difference of at least one frame of image according to the spatial feature sequence, is configured to:
and inputting the spatial feature sequence into a recurrent neural network, and analyzing the feature difference of at least one frame of image in time sequence by using the recurrent neural network.
In an optional embodiment, the processor 31, when detecting whether the video to be detected is altered according to the steganographic features in the at least one frame of image and the difference between the visual features associated with the specified object and the features of the at least one frame of image in time sequence, is configured to:
inputting steganographic features in the at least one frame of image and visual features associated with the designated object and feature differences of the at least one frame of image in time sequence into a second fully-connected network;
and detecting the probability of the video to be detected being changed by utilizing the second full-connection network so as to determine whether the video to be detected is changed.
In an alternative embodiment, the designated object is a face, a license plate number, a text or a trademark.
In an alternative embodiment, the processor 31 is further configured to:
if the pre-buried characteristic prompt information is carried in the video to be detected, extracting pre-buried characteristics of the at least one frame of image, wherein the pre-buried characteristics are self-defined characteristics which are hidden and configured in the video to be detected;
and detecting whether the video to be detected is changed or not according to the pre-buried feature extraction result.
In an optional embodiment, the pre-buried feature extraction result is that the pre-buried feature is not extracted, the extracted pre-buried feature is determined to have abnormality, or the extracted pre-buried feature is determined to have no abnormality.
In an optional embodiment, the pre-buried feature is one or more of a watermark feature, a digital signature feature, or a white noise feature.
Further, as shown in fig. 3, the computing device further includes: communication components 32, power components 33, and the like. Only some of the components are schematically shown in fig. 3, and the computing device is not meant to include only the components shown in fig. 3.
Wherein the communication component 32 is configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device in which the communication component is located may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies to facilitate short-range communications.
The power supply unit 33 supplies power to various components of the device in which the power supply unit is installed. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program can implement the steps that can be executed by a computing device in the foregoing method embodiments when executed.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (27)

1. A video detection method, comprising:
acquiring at least one frame of image in a video to be detected;
extracting steganographic features and visual features associated with a specified object in the at least one frame of image;
analyzing the feature difference of the at least one frame of image in time sequence according to the steganographic features in the at least one frame of image and the visual features associated with the specified object;
and detecting whether the video to be detected is changed or not according to the steganographic features in the at least one frame of image, the visual features related to the specified object and the feature difference of the at least one frame of image in time sequence.
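(By way of a non-limiting illustration of the frame-acquisition step above, the sketch below — Python with OpenCV; the function name, frame count and sampling scheme are assumptions, not part of the claim — reads a few roughly evenly spaced frames from the video to be detected.)

import cv2

def sample_frames(video_path, max_frames=16):
    """Grab up to max_frames roughly evenly spaced frames from the video to be detected."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // max_frames, 1)
    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0 and len(frames) < max_frames:
            frames.append(frame)  # BGR image, shape (H, W, 3), dtype uint8
        index += 1
    cap.release()
    return frames

A single frame is enough for the spatial checks, while several frames enable the time-sequence analysis of the later steps.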
2. The method of claim 1, wherein extracting the steganographic features and the visual features associated with the designated object in the at least one frame of image comprises:
and inputting the at least one frame of image into a convolutional neural network, and extracting the steganographic features and the visual features associated with the specified object in the at least one frame of image by using the convolutional neural network.
3. The method of claim 2, wherein inputting the at least one frame of image into a convolutional neural network, extracting steganographic features and visual features associated with a specified object in the at least one frame of image using the convolutional neural network, comprises:
inputting the at least one frame of image into a first convolution neural network, and performing convolution processing on the at least one frame of image by using the first convolution neural network so as to extract a feature map reflecting steganographic features from the at least one frame of image;
and inputting the at least one frame of image into a second convolutional neural network, and performing convolution processing on the at least one frame of image by using the second convolutional neural network so as to extract a feature map reflecting visual features associated with a specified object from the at least one frame of image.
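(One way the two parallel branches of claim 3 could be realized is sketched below in PyTorch; the layer counts and channel sizes are illustrative assumptions rather than the patented configuration.)

import torch.nn as nn

class TwoBranchExtractor(nn.Module):
    """First CNN produces a feature map reflecting steganographic features,
    second CNN produces a feature map reflecting visual features of the object."""
    def __init__(self):
        super().__init__()
        self.stego_cnn = nn.Sequential(            # first convolutional neural network
            nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.visual_cnn = nn.Sequential(           # second convolutional neural network
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, frame):                      # frame: (N, 3, H, W) float tensor
        return self.stego_cnn(frame), self.visual_cnn(frame)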
4. The method of claim 3, wherein the steganographic features are neighborhood correlations, and wherein the convolving the at least one frame of image with the first convolutional neural network to extract a feature map reflecting the steganographic features from the at least one frame of image comprises:
for each frame image, sequentially performing convolution processing on the frame image by utilizing a plurality of convolution layers of the first convolution neural network so as to extract a feature map reflecting neighborhood correlation from the frame image;
and performing convolution processing on the frame image by using a convolution kernel with the mean value of 0 in a first convolution layer of the first convolution neural network.
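(The zero-mean kernel of claim 4 behaves like a high-pass residual filter: subtracting the kernel mean suppresses smooth image content so that neighborhood correlations stand out. A minimal PyTorch sketch; the class name and the choice to re-center the weights on every forward pass are assumptions.)

import torch.nn as nn
import torch.nn.functional as F

class ZeroMeanConv2d(nn.Conv2d):
    """Conv2d whose kernels are re-centered to zero mean before every use,
    as a possible first layer of the first convolutional neural network."""
    def forward(self, x):
        weight = self.weight - self.weight.mean(dim=(1, 2, 3), keepdim=True)
        return F.conv2d(x, weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

Such a layer could replace the first convolution of the stego_cnn branch sketched under claim 3, e.g. ZeroMeanConv2d(3, 16, kernel_size=5, padding=2, bias=False).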
5. The method of claim 2, wherein said inputting said at least one frame of image into a convolutional neural network comprises:
detecting a designated object in the at least one frame of image;
carrying out alignment processing on a specified object in the at least one frame of image;
inputting the at least one frame of image subjected to the alignment processing into the convolutional neural network so as to extract the steganographic features and the visual features associated with the specified object in the at least one frame of image by using the convolutional neural network.
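(A rough illustration of the align-then-feed step of claim 5, assuming an external detector has already produced a few landmark points for the specified object in the frame; the canonical coordinates, output size and helper name are invented for the sketch.)

import cv2
import numpy as np

# Canonical landmark layout the detected object is warped onto (illustrative values,
# e.g. two eye centers and a mouth center when the specified object is a face).
CANONICAL_POINTS = np.float32([[38, 52], [90, 52], [64, 96]])

def align_object(frame, landmarks, out_size=(128, 128)):
    """Estimate a similarity transform from the detected landmarks to the canonical
    layout and warp the frame, so every frame enters the CNN in the same pose."""
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), CANONICAL_POINTS)
    return cv2.warpAffine(frame, matrix, out_size)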
6. The method according to claim 1, wherein analyzing the feature difference of the at least one frame of image in time sequence according to the steganographic features in the at least one frame of image and the visual features associated with the designated object comprises:
respectively generating spatial feature vectors corresponding to the at least one frame of image according to the steganographic features in the at least one frame of image and the visual features associated with the designated object;
sorting the spatial feature vectors corresponding to the at least one frame of image according to the time sequence relation among the at least one frame of image to form a spatial feature sequence;
and analyzing the feature difference of the at least one frame of image in time sequence according to the spatial feature sequence.
7. The method according to claim 6, wherein the generating spatial feature vectors corresponding to the at least one frame of image respectively according to the steganographic features in the at least one frame of image and the visual features associated with the designated object comprises:
for each frame of image, performing global pooling respectively on the steganographic features extracted from the frame of image and on the visual features associated with the specified object, to obtain a steganographic feature vector and a visual feature vector corresponding to the frame of image;
concatenating the steganographic feature vector and the visual feature vector corresponding to the frame of image, and inputting the concatenation result into a first fully connected network;
and in the first fully connected network, fusing the steganographic feature vector and the visual feature vector corresponding to the frame of image based on preset weight parameters to obtain the spatial feature vector corresponding to the frame of image.
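(Claim 7 in PyTorch terms, as a sketch: global average pooling collapses each feature map to a vector, the two vectors are concatenated, and a first fully connected layer — whose learned weights stand in for the preset weight parameters — fuses them into the per-frame spatial feature vector. The dimensions are assumptions.)

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialFusion(nn.Module):
    """Pool, concatenate and fuse the steganographic and visual feature maps of one frame."""
    def __init__(self, stego_channels=32, visual_channels=64, out_dim=128):
        super().__init__()
        self.fc = nn.Linear(stego_channels + visual_channels, out_dim)  # first fully connected network

    def forward(self, stego_map, visual_map):
        s = F.adaptive_avg_pool2d(stego_map, 1).flatten(1)    # steganographic feature vector
        v = F.adaptive_avg_pool2d(visual_map, 1).flatten(1)   # visual feature vector
        return torch.relu(self.fc(torch.cat([s, v], dim=1)))  # spatial feature vector of the frame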
8. The method of claim 6, wherein analyzing the feature difference of the at least one frame of image in time sequence according to the spatial feature sequence comprises:
inputting the spatial feature sequence into a recurrent neural network, and analyzing the feature difference of the at least one frame of image in time sequence by using the recurrent neural network.
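(Claims 6 and 8 together, sketched with a GRU standing in for the recurrent neural network; the hidden size and the choice of GRU over LSTM are assumptions. The input would be the per-frame vectors from the SpatialFusion sketch, stacked in timestamp order.)

import torch
import torch.nn as nn

class TemporalAnalyzer(nn.Module):
    """Consume the chronologically ordered spatial feature vectors and summarize
    the frame-to-frame feature differences of the video."""
    def __init__(self, feat_dim=128, hidden_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, spatial_sequence):        # (N, T, feat_dim), frames sorted by timestamp
        _, last_hidden = self.rnn(spatial_sequence)
        return last_hidden[-1]                  # (N, hidden_dim) temporal difference summary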
9. The method according to any one of claims 1 to 8, wherein detecting whether the video to be detected is altered according to the steganographic features in the at least one frame of image, the visual features associated with the designated object, and the feature difference of the at least one frame of image in time sequence comprises:
inputting the steganographic features in the at least one frame of image, the visual features associated with the designated object, and the feature difference of the at least one frame of image in time sequence into a second fully connected network;
and detecting, by using the second fully connected network, the probability that the video to be detected is changed, so as to determine whether the video to be detected is changed.
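(A possible form of the second fully connected network of claim 9, again a sketch with assumed dimensions: it combines the frame-level spatial evidence with the temporal summary and outputs the probability that the video has been altered.)

import torch
import torch.nn as nn

class TamperHead(nn.Module):
    """Second fully connected network mapping spatial and temporal evidence
    to an altered-video probability."""
    def __init__(self, feat_dim=128, hidden_dim=64):
        super().__init__()
        self.fc = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, spatial_sequence, temporal_summary):
        spatial_evidence = spatial_sequence.mean(dim=1)              # average over frames
        score = self.fc(torch.cat([spatial_evidence, temporal_summary], dim=1))
        return torch.sigmoid(score)                                  # probability in [0, 1]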
10. The method according to any one of claims 1 to 8, wherein the designated object is a face, a letter, a license plate number, or a trademark.
11. The method of claim 1, further comprising:
if pre-buried feature prompt information is carried in the video to be detected, extracting pre-buried features from the at least one frame of image, wherein the pre-buried features are custom features configured in a hidden manner in the video to be detected;
and detecting whether the video to be detected is changed or not according to the pre-buried feature extraction result.
12. The method of claim 11, wherein the pre-buried feature extraction result comprises: no pre-buried features being extracted, the extracted pre-buried features being determined to be abnormal, or the extracted pre-buried features being determined to be not abnormal.
13. The method of claim 11, wherein the pre-buried features are one or more of watermark features, digital signature features, or white noise features.
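(Claims 11 to 13 leave the pre-buried feature open; one concrete possibility — purely an assumed example, using an HMAC-style digital signature shared between embedder and detector — could be verified as follows, with the three return values matching the extraction results of claim 12.)

import hashlib
import hmac

def check_embedded_signature(frame_bytes, extracted_signature, shared_key):
    """Recompute the expected signature of the frame payload and compare it with the
    signature extracted from the video; failure to extract, or a mismatch, counts as
    an abnormal pre-buried feature extraction result."""
    if extracted_signature is None:
        return "no pre-buried feature extracted"
    expected = hmac.new(shared_key, frame_bytes, hashlib.sha256).digest()
    if hmac.compare_digest(expected, extracted_signature):
        return "pre-buried feature not abnormal"
    return "pre-buried feature abnormal"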
14. A computing device comprising a memory and a processor;
the memory is to store one or more computer instructions;
the processor is coupled with the memory for executing the one or more computer instructions for:
acquiring at least one frame of image in a video to be detected;
extracting steganographic features and visual features associated with a specified object in the at least one frame of image;
analyzing the feature difference of the at least one frame of image in time sequence according to the steganographic features in the at least one frame of image and the visual features associated with the specified object;
and detecting whether the video to be detected is changed or not according to the steganographic features in the at least one frame of image, the visual features related to the specified object and the feature difference of the at least one frame of image in time sequence.
15. The device of claim 14, wherein the processor, in extracting the steganographic features and the visual features associated with the designated object in the at least one frame of image, is configured to:
and inputting the at least one frame of image into a convolutional neural network, and extracting the steganographic features and the visual features associated with the specified object in the at least one frame of image by using the convolutional neural network.
16. The apparatus of claim 15, wherein the processor, in inputting the at least one frame of image into a convolutional neural network, extracting steganographic features and visual features associated with a specified object in the at least one frame of image using the convolutional neural network, is configured to:
inputting the at least one frame of image into a first convolution neural network, and performing convolution processing on the at least one frame of image by using the first convolution neural network so as to extract a feature map reflecting steganographic features from the at least one frame of image;
and inputting the at least one frame of image into a second convolutional neural network, and performing convolution processing on the at least one frame of image by using the second convolutional neural network so as to extract a feature map reflecting visual features associated with a specified object from the at least one frame of image.
17. The apparatus of claim 16, wherein the steganographic feature is a neighborhood correlation, and wherein the processor, when convolving the at least one frame of image with the first convolution neural network to extract a feature map from the at least one frame of image that reflects the steganographic feature, is configured to:
for each frame of image, sequentially performing convolution processing on the frame image by using a plurality of convolution layers of the first convolution neural network so as to extract a feature map reflecting neighborhood correlation from the frame image, wherein in the first convolution layer of the first convolution neural network, convolution processing is performed on the frame image by using a convolution kernel with a mean value of 0.
18. The apparatus of claim 15, wherein the processor, when inputting the at least one frame of image into the convolutional neural network, is configured to:
detecting a designated object in the at least one frame of image;
carrying out alignment processing on a specified object in the at least one frame of image;
inputting the at least one frame of image subjected to the alignment processing into the convolutional neural network so as to extract the steganographic features and the visual features associated with the specified object in the at least one frame of image by using the convolutional neural network.
19. The apparatus of claim 14, wherein the processor, when analyzing the feature difference of the at least one frame of image in time sequence according to the steganographic features in the at least one frame of image and the visual features associated with the designated object, is configured to:
respectively generating spatial feature vectors corresponding to the at least one frame of image according to the steganographic features in the at least one frame of image and the visual features associated with the designated object;
sorting the spatial feature vectors corresponding to the at least one frame of image according to the time sequence relation among the at least one frame of image to form a spatial feature sequence;
and analyzing the feature difference of the at least one frame of image in time sequence according to the spatial feature sequence.
20. The apparatus of claim 19, wherein the processor, when generating spatial feature vectors corresponding to the at least one frame of image respectively according to the steganographic features in the at least one frame of image and the visual features associated with the designated object, is configured to:
for each frame of image, performing global pooling respectively on the steganographic features extracted from the frame of image and on the visual features associated with the specified object, so as to obtain a steganographic feature vector and a visual feature vector corresponding to the frame of image;
concatenating the steganographic feature vector and the visual feature vector corresponding to the frame of image, and inputting the concatenation result into a first fully connected network;
and in the first fully connected network, fusing the steganographic feature vector and the visual feature vector corresponding to the frame of image based on preset weight parameters to obtain the spatial feature vector corresponding to the frame of image.
21. The apparatus of claim 19, wherein the processor, when analyzing the feature difference of the at least one frame of image in time sequence according to the spatial feature sequence, is configured to:
inputting the spatial feature sequence into a recurrent neural network, and analyzing the feature difference of the at least one frame of image in time sequence by using the recurrent neural network.
22. The apparatus according to any one of claims 14-21, wherein the processor, when detecting whether the video to be detected is altered according to the steganographic features in the at least one frame of image, the visual features associated with the designated object, and the feature difference of the at least one frame of image in time sequence, is configured to:
inputting the steganographic features in the at least one frame of image, the visual features associated with the designated object, and the feature difference of the at least one frame of image in time sequence into a second fully connected network;
and detecting, by using the second fully connected network, the probability that the video to be detected is changed, so as to determine whether the video to be detected is changed.
23. The apparatus according to any one of claims 14 to 21, wherein the designated object is a face, a letter, a license plate number, or a trademark.
24. The device of claim 14, wherein the processor is further configured to:
if pre-buried feature prompt information is carried in the video to be detected, extracting pre-buried features from the at least one frame of image, wherein the pre-buried features are custom features configured in a hidden manner in the video to be detected;
and detecting whether the video to be detected is changed or not according to the pre-buried feature extraction result.
25. The apparatus of claim 24, wherein the pre-buried feature extraction result is that no pre-buried features are extracted, that the extracted pre-buried features are determined to be abnormal, or that the extracted pre-buried features are determined to be not abnormal.
26. The apparatus of claim 24, wherein the pre-buried feature is one or more of a watermark feature, a digital signature feature, or a white noise feature.
27. A computer-readable storage medium storing computer instructions, which when executed by one or more processors, cause the one or more processors to perform the video detection method of any one of claims 1-13.
CN201911006641.7A 2019-10-22 Video detection method, device and storage medium Active CN112699707B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911006641.7A CN112699707B (en) 2019-10-22 Video detection method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911006641.7A CN112699707B (en) 2019-10-22 Video detection method, device and storage medium

Publications (2)

Publication Number Publication Date
CN112699707A true CN112699707A (en) 2021-04-23
CN112699707B CN112699707B (en) 2024-05-28

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100091981A1 (en) * 2008-04-14 2010-04-15 Yun-Qing Shi Steganalysis of Suspect Media
CN104519361A (en) * 2014-12-12 2015-04-15 天津大学 Video steganography analysis method based on space-time domain local binary pattern
CN104837011A (en) * 2015-05-04 2015-08-12 中国科学院信息工程研究所 Content self-adaptive video steganalysis method
US20190005657A1 (en) * 2017-06-30 2019-01-03 Baidu Online Network Technology (Beijing) Co., Ltd . Multiple targets-tracking method and apparatus, device and storage medium
CN110312138A (en) * 2019-01-04 2019-10-08 北京大学 A kind of high embedding capacity video steganography method and system based on the modeling of time series error convolution

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SAMANEH SHAFEE; BOSHRA RAJAEI: "A secure steganography algorithm using compressive sensing based on HVS feature", IEEE, 2 November 2017 (2017-11-02) *
YE JUNCHEN; HUANG XIAOSA; WANG SHILIN: "nsF5 Steganography Detection Based on Convolutional Neural Networks", 通信技术 (Communications Technology), no. 03, 10 March 2019 (2019-03-10) *
ZHOU YANG; HE YONGJIAN; TANG XIANGHONG; LU YU; JIANG GANGYI: "Stereoscopic Video Saliency Detection Fusing Binocular Multi-dimensional Perceptual Features", 中国图象图形学报 (Journal of Image and Graphics), no. 03, 16 March 2017 (2017-03-16) *

Similar Documents

Publication Publication Date Title
US10936919B2 (en) Method and apparatus for detecting human face
KR20200014725A (en) Vehicle insurance image processing method, apparatus, server and system
CN110348439B (en) Method, computer readable medium and system for automatically identifying price tags
JP6731529B1 (en) Single-pixel attack sample generation method, device, equipment and storage medium
CN109116129B (en) Terminal detection method, detection device, system and storage medium
CN111415336B (en) Image tampering identification method, device, server and storage medium
CN111738349B (en) Detection effect evaluation method and device of target detection algorithm, storage medium and equipment
CN112633255B (en) Target detection method, device and equipment
CN111067522A (en) Brain addiction structural map assessment method and device
WO2023123981A1 (en) Video processing method and apparatus, computer device and storage medium
CN112101200A (en) Human face anti-recognition method, system, computer equipment and readable storage medium
CN111241873A (en) Image reproduction detection method, training method of model thereof, payment method and payment device
CN112699758A (en) Sign language translation method and device based on dynamic gesture recognition, computer equipment and storage medium
CN112699707B (en) Video detection method, device and storage medium
CN112699707A (en) Video detection method, device and storage medium
CN113255766A (en) Image classification method, device, equipment and storage medium
CN113869367A (en) Model capability detection method and device, electronic equipment and computer readable medium
CN113837174A (en) Target object identification method and device and computer equipment
CN110956102A (en) Bank counter monitoring method and device, computer equipment and storage medium
CN113542866B (en) Video processing method, device, equipment and computer readable storage medium
CN113763283B (en) Detection method and device for defogging of image and intelligent device
Andrews et al. Conditional adversarial camera model anonymization
US20220327849A1 (en) Computer-Implemented Method For Optical Character Recognition
CN110781348B (en) Video file analysis method
US20240037915A1 (en) Method and system for preprocessing optimization of streaming video data using machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant