CN110163061B - Method, apparatus, device and computer readable medium for extracting video fingerprint

Info

Publication number
CN110163061B
Authority
CN
China
Prior art keywords
video
fingerprint
neural network
frames
features
Prior art date
Legal status
Active
Application number
CN201811353102.6A
Other languages
Chinese (zh)
Other versions
CN110163061A (en)
Inventor
叶燕罡
沈小勇
陈忠磊
马子扬
戴宇榮
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201811353102.6A priority Critical patent/CN110163061B/en
Publication of CN110163061A publication Critical patent/CN110163061A/en
Application granted granted Critical
Publication of CN110163061B publication Critical patent/CN110163061B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Collating Specific Patterns (AREA)

Abstract

A method, apparatus, device and computer readable medium for extracting a video fingerprint of a video are disclosed. The method comprises the following steps: extracting a plurality of video frames in a video; for each of the plurality of video frames, processing the video frame by using a neural network with a plurality of layers, wherein the plurality of layers comprise at least one convolution layer, and each convolution layer is used for performing convolution processing on the output of the previous layer; taking the convolution processing result output by the middle layer of the neural network as the fingerprint characteristic of the video frame; and processing the fingerprint features of the plurality of video frames to generate a video fingerprint of the video.

Description

Method, apparatus, device and computer readable medium for extracting video fingerprint
Technical Field
The present disclosure relates to the field of video processing, and in particular, to a method, apparatus, device, and computer-readable medium for extracting a video fingerprint of a video.
Background
In recent years, the number of network videos has increased very rapidly. In order to achieve efficient video recognition, the video frames contained in a video may be processed as images to generate fingerprint features of those video frames. A video fingerprint of the video can then be obtained by processing these fingerprint features. A video fingerprint obtained in this way is a unique feature of the video and can represent the video file. Because video fingerprints describe video content, they enable applications such as comparison and clustering of similar videos.
Because video fingerprints are now widely used for video comparison and clustering, a new video fingerprint extraction method is needed that enables both fast extraction and fast comparison of video fingerprints.
Disclosure of Invention
To this end, the present disclosure provides a method, apparatus, device and computer readable medium for extracting a video fingerprint of a video.
According to an aspect of the present disclosure, there is provided a method for extracting a video fingerprint of a video, comprising: extracting a plurality of video frames in a video; processing each of the plurality of video frames by using a neural network with a plurality of layers, wherein the plurality of layers comprise at least one convolutional layer, each convolutional layer is used for performing convolution processing on the output of the previous layer, and the convolution processing result output by the middle layer of the neural network is used as the fingerprint feature of the video frame; and processing the fingerprint features of the plurality of video frames to generate a video fingerprint of the video.
In some embodiments, processing the fingerprint features of the plurality of video frames to generate the video fingerprint of the video comprises: respectively executing dimension reduction processing on the fingerprint characteristics of the plurality of video frames to obtain a plurality of dimension reduced fingerprint characteristics; stitching the plurality of reduced-dimension video fingerprint features to generate a video fingerprint of the video.
In some embodiments, performing dimension reduction processing on the fingerprint features of the plurality of video frames comprises at least one of: performing pooling processing on the fingerprint features of the video frames respectively to obtain a plurality of pooled fingerprint features, and performing principal component analysis on the pooled fingerprint features to obtain a plurality of dimension-reduced fingerprint features.
In some embodiments, extracting the plurality of video frames in the video comprises: selecting a plurality of frames as the plurality of video frames at equal intervals in the video.
In some embodiments, the neural network transforms the video frame into image data having a plurality of channels, the method further comprising performing a convolution operation on the image data of the plurality of channels in parallel.
In some embodiments, the neural network is a Mobilenet network.
According to another aspect of the present disclosure, there is also provided an apparatus for extracting a video fingerprint, including: a video frame extraction unit configured to extract a plurality of video frames in a video; a fingerprint feature extraction unit configured to process, for each of the plurality of video frames, the video frame by using a neural network having a plurality of layers, wherein the plurality of layers includes at least one convolutional layer, and each of the at least one convolutional layer is used for performing convolution processing on an output of a previous layer; taking the convolution processing result output by the middle layer of the neural network as the fingerprint characteristic of the video frame; and a video fingerprint generation unit configured to process fingerprint features of the plurality of video frames to generate a video fingerprint of the video.
In some embodiments, the video fingerprint generation unit further comprises: the dimension reduction subunit is configured to respectively perform dimension reduction processing on the fingerprint features of the video frames to obtain a plurality of dimension reduction fingerprint features; and a stitching subunit configured to stitch the plurality of dimension-reduced video fingerprint features to generate a video fingerprint of the video.
In some embodiments, the dimension reduction process comprises at least one of: performing pooling processing on the fingerprint features of the plurality of video frames respectively to obtain a plurality of pooled fingerprint features, and performing principal component analysis on the pooled fingerprint features to obtain a plurality of dimension-reduced fingerprint features.
In some embodiments, extracting the plurality of video frames in the video comprises: selecting a plurality of frames as the plurality of video frames at equal intervals in the video.
In some embodiments, the neural network transforms the video frame into image data having a plurality of channels, the apparatus being further configured to perform a convolution operation on the image data of the plurality of channels in parallel.
In some embodiments, the neural network is a Mobilenet network.
According to another aspect of the present disclosure, there is also provided an apparatus for extracting video fingerprints, comprising a processor and a memory, the memory having stored therein program instructions which, when executed by the processor, cause the apparatus to perform the method for extracting a video fingerprint of a video as described above.
According to another aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the method for extracting video fingerprints of a video as described above.
According to the method, apparatus, device and computer readable medium for extracting a video fingerprint provided by the present disclosure, the fingerprint features of video frames in a video can be determined by a novel fingerprint feature extraction method, and the video fingerprint of the video can then be determined from these fingerprint features. The video fingerprints obtained in this way have a better clustering effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description are only some embodiments of the present disclosure, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without making creative efforts. The following drawings are not intended to be drawn to scale in actual dimensions, with emphasis instead being placed upon illustrating the principles of the disclosure.
FIG. 1 shows a schematic diagram of an exemplary video fingerprint extraction system according to the present disclosure;
FIG. 2 shows a schematic flow diagram of a method for video fingerprint extraction according to an embodiment of the present disclosure;
FIG. 3 shows a schematic flow chart of the video fingerprint feature generation step according to an embodiment of the present disclosure;
FIG. 4 shows a schematic block diagram of an apparatus for extracting video fingerprints according to an embodiment of the present disclosure; and
FIG. 5 illustrates an architecture of an exemplary computing device according to the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more clear, the present disclosure is further described in detail by referring to the following examples. It is to be understood that the described embodiments are merely illustrative of some, and not restrictive, of the embodiments of the disclosure. All other embodiments, which can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
A video fingerprint of a video is a unique video feature generated by analyzing the image information in the video, and it can represent the corresponding video file. For example, because video fingerprints describe video content and can help locate similar videos, they can be used to implement video search. As another example, a fast comparison of two videos may be achieved using video fingerprints. For two videos with the same content but different frame rates and resolutions, the MD5 message digest of the video files cannot determine whether the two videos are the same. However, since video fingerprints describe video content, two videos with the same content can be found by comparing their video fingerprints. With such characteristics, video fingerprints may enable operations such as deduplication of identical videos. In addition, by using the descriptive property of video fingerprints, the videos watched by a user can be clustered according to their video fingerprints, so that similar videos in categories the user is interested in can be recommended. Furthermore, in the process of video content distribution, for example on a video sharing platform, the user who uploaded the earliest copy among identical or highly similar videos can be determined from the video fingerprints and upload times, so that the originator of the video content can be identified and pirated copies can be combated.
Because video fingerprints are now widely used in video comparison and clustering, a new video fingerprint extraction method is needed that enables both fast extraction and fast comparison of video fingerprints.
Fig. 1 shows a schematic diagram of an exemplary video fingerprint extraction system according to the present disclosure. As shown in fig. 1, the video fingerprint extraction system 100 may include one or more clients 110, a network 120, a server 130, and a database 140. For convenience of description, in the present disclosure, the video fingerprint extraction system 100 may be simply referred to as the system 100.
Clients 110 may include, but are not limited to, one or more stationary electronic devices or mobile electronic devices. For example, stationary electronic devices may include, but are not limited to, desktop computers, smart home devices, and the like. Mobile electronic devices may include, but are not limited to, one or more of a smartphone, a smartwatch, a laptop, a tablet, a gaming device, and the like. The client 110 may communicate with a server, a database, or another client over the network 120, for example by sending videos stored locally on the client, or videos captured by the client, to the server 130 or to another client via the network. For example, the video may be captured using a camera application running on the client, or through other programs such as a browser, a scan-code function built into an instant messaging (IM) application, or another camera program. In some embodiments of the present disclosure, the client 110 may be configured to perform the methods provided by the present disclosure for extracting video fingerprints. For example, the client 110 may perform these methods on locally stored videos or on videos received from other clients and/or databases over the network.
Network 120 may be a single network or a combination of multiple different networks. For example, the network 120 may include, but is not limited to, one or a combination of local area networks, wide area networks, the Internet, and the like. Network 120 may be used to enable data exchange between clients 110, servers 130, and databases 140.
Server 130 is a system that can perform analytical processing on the data to generate analytical results. The server 130 may be a single server or a group of servers, each server in the group being connected via a wired or wireless network. In an embodiment of the present disclosure, the server 130 may be configured to perform the method for extracting video fingerprints provided by the present disclosure. The server 130 may perform the methods provided by the present disclosure for extracting video fingerprints on videos received from the client 110 and/or the database 140.
Database 140 may generally refer to a device having a storage function. Database 140 is primarily used to store data collected from clients 110 and various data utilized, generated, and output in the operation of server 130. For example, the database 140 may store algorithm parameters involved in the methods described below for generating video fingerprints. The database 140 may also be used to store video fingerprints generated using the methods provided by the present disclosure. The database 140 may be local or remote. The database 140 may be a non-persistent memory, or a persistent memory. The above mentioned storage devices are only examples and the storage devices that the system can use are not limited to these. Database 140 may be interconnected or in communication with network 120, or directly interconnected or in communication with system 100 or a portion thereof (e.g., server 130), or a combination thereof. In some embodiments, the database 140 may be located in the background of the server 130. In some embodiments, database 140 may be separate and directly connected to network 120. The connections or communications between the database 140 and the other devices of the system may be wired or wireless.
It should be noted that in addition to the above-described system including a network, embodiments of the present disclosure may also be implemented in a separate local computer.
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings.
Fig. 2 shows a schematic flow diagram of a method for video fingerprint extraction of a video according to an embodiment of the present disclosure. The method 200 shown in fig. 2 may be implemented with the client 110 or the server 130 shown in fig. 1.
As shown in fig. 2, the method 200 may include step S202. In step S202, a plurality of video frames in the video may be extracted. In the following description, the extraction of image features may be performed with respect to the image information of the plurality of video frames extracted in step S202, and the image features respectively extracted with respect to the plurality of video frames may be respectively taken as fingerprint features of the corresponding video frames. The fingerprint feature may be a feature map output by any layer of a neural network after processing a video frame by using the neural network having a plurality of layers. Hereinafter, these video frames for extracting fingerprint features are also referred to as key frames.
In step S202, an effective video segment for extracting a video fingerprint may be determined from the video, and key frames may be extracted from the effective video segment.
In some embodiments, a portion of the video may be determined to be a valid video segment. For example, the video may be cut into segments of a preset time length (e.g., 15s, or any other possible preset time length), and the cut segments are used as valid video segments. For example, a section of 0 th to 15 th seconds of the video may be determined as an effective video section, and a section of a preset time length starting at an arbitrary position in the video may also be determined as an effective video section. In other embodiments, the entire video may be determined to be a valid video segment.
Key frames may then be extracted from the determined valid video segments. In some embodiments, each frame in the valid video segment may be determined to be a key frame. In other embodiments, shot segmentation of the valid video segment may be performed by analyzing each frame in the segment, and key frames may be extracted for each shot based on the results of the shot segmentation. For example, the first frame of each shot determined by shot segmentation may be taken as a key frame. In still other embodiments, video frames may be selected as key frames by sampling the video. For example, the valid video segment may be sampled at equal intervals (e.g., every 5 s, or any other predetermined time interval), and the sampled video frames may be used as key frames. For a valid video segment 15 seconds long, sampling at 5-second intervals yields 3 key frames. As another example, the valid video segment may be sampled at arbitrary intervals, and the sampled video frames may be used as key frames.
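As an illustration of the equal-interval sampling described above, the following sketch shows one possible way to obtain the key frames; OpenCV is an assumed dependency, and the 15-second segment length and 5-second interval are simply the example values mentioned above, not a prescribed implementation.

import cv2

def sample_key_frames(video_path, segment_seconds=15, interval_seconds=5):
    # Take the first segment_seconds of the video as the valid video segment
    # and sample one frame every interval_seconds within it.
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS is unavailable
    key_frames = []
    for second in range(0, segment_seconds, interval_seconds):
        capture.set(cv2.CAP_PROP_POS_FRAMES, int(second * fps))
        ok, frame = capture.read()
        if not ok:
            break
        key_frames.append(frame)
    capture.release()
    return key_frames  # e.g. 3 key frames for a 15 s segment sampled every 5 s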
Then, in step S204, for each of the plurality of video frames, the video frame is processed by using a neural network having a plurality of layers, and an intermediate layer feature map output by an intermediate layer of the neural network is taken as a fingerprint feature of the video frame. The fingerprint features may be used to generate a video fingerprint.
In some embodiments, the neural network may be a neural network trained using an image classification task. Such neural networks may be comprised of one or more of convolutional layers, pooling layers, fully-connected layers, and activation functions. For example, the neural network having a plurality of layers used in the present disclosure is a deep neural network constructed by stacking a plurality of convolutional layers and at least one pooling layer. Each convolution layer is used for performing convolution processing on the output of the previous layer and outputting the result after the convolution processing. For example, the neural network used in the embodiments of the present disclosure may be implemented using structures such as Alexnet, vggnet, resnet, googlenet, mobilenet, and the like.
Training the neural network with a suitable image data set may enable the trained neural network to perform image classification functions. That is, the last layer of the neural network trained by the image classification task may output the classification image features and the predicted classification results of the input images of the neural network.
According to the method provided by the present disclosure, the neural network trained by the image classification task may be used to process the key frame, and the feature output by the intermediate layer of the neural network is used as the fingerprint feature of the currently processed key frame.
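A minimal sketch of this idea is given below, assuming PyTorch and a torchvision MobileNetV2 pretrained on ImageNet as a stand-in for the Mobilenet of Table 1; the layer at which the feature map is tapped is illustrative and is not necessarily the layer selected in the present disclosure.

import torch
from torchvision import models, transforms

# Stand-in backbone; the trained Mobilenet described above would be used instead.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
model.eval()

# Keep everything up to an intermediate block; its feature map, rather than the
# final classification output, serves as the per-frame fingerprint feature.
intermediate = torch.nn.Sequential(*list(model.features.children())[:-2])

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # match the network's expected input size
    transforms.ToTensor(),
])

def frame_fingerprint_feature(frame_pil):
    with torch.no_grad():
        x = preprocess(frame_pil).unsqueeze(0)  # 1 x 3 x 224 x 224
        return intermediate(x)                  # intermediate feature map, e.g. 1 x C x 7 x 7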
Taking the Mobilenet structure as an example, an exemplary structure of a Mobilenet network is shown in Table 1. Hereinafter, each row in Table 1 is referred to as a layer of the Mobilenet network. For example, the first layer of the Mobilenet network shown in Table 1 is a convolutional layer with a shape of 3 × 3 × 3 × 32, and the second layer is a depthwise convolutional layer with a shape of 3 × 3 × 32. By analogy, the penultimate layer is a fully connected layer FC with a shape of 1024 × 1000, and the last layer is the activation function Softmax.
TABLE 1
[Table 1: layer-by-layer structure of the Mobilenet network; reproduced as an image in the original publication.]
To accommodate multi-label tasks, the Mobilenet structure shown in Table 1 can be trained using the sigmoid cross-entropy function as the loss function.
The input key frame can be subjected to convolution processing using the Mobilenet structure shown in table 1, and image features for describing image information of the key frame are generated. In fact, the output of each layer of the neural network may be used to describe the input image. However, in general, in a deep neural network, image features output at a deeper level are more descriptive of the input image than image features output at a shallower level.
In addition, as described above, video fingerprints are used to describe video content and can further be used for matching and clustering similar videos. Therefore, in order to determine which layer of the neural network outputs features with the best clustering effect, the outputs of the layers of the neural network can be subjected to cluster analysis, and the most suitable output can be selected as the fingerprint feature of the video frame according to the result of the cluster analysis.
For example, which layer's output image features are selected as the fingerprint features of the key frame may be determined by performing cluster analysis on the image features output by each layer of the neural network structure shown in Table 1. For example, a data set comprising a plurality of pictures may be selected, and the pictures in the data set may be processed using the trained neural network. Since the picture content included in the data set is known, the image clustering effect achieved by the image features output by the intermediate layers of the neural network can be verified by running a clustering algorithm (e.g., K-means clustering, mean-shift clustering, a density-based clustering method, etc.) on the image features output by different intermediate layers (e.g., the penultimate layer, or any other intermediate layer). That is, the effectiveness of using the output of an intermediate layer of the neural network to identify similar pictures can be examined. In some embodiments, a clustering algorithm may be used to cluster the features output by each intermediate layer of the neural network, and the consistency of the image data within each class obtained by the clustering may be determined.
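The sketch below illustrates one way such a consistency check could be scored, using K-means from scikit-learn and majority-class purity as the consistency measure; the purity measure is an assumption for illustration and is not necessarily the criterion used in the present disclosure.

import numpy as np
from sklearn.cluster import KMeans

def cluster_purity(features, labels, n_clusters):
    # features: (N, D) array of one candidate layer's outputs, flattened per image
    # labels:   (N,) integer class labels of the evaluation images (known in advance)
    assignments = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    consistent = 0
    for cluster_id in range(n_clusters):
        members = labels[assignments == cluster_id]
        if members.size:
            consistent += np.bincount(members).max()  # size of the dominant class
    return consistent / len(labels)  # higher means more consistent clusters

# The intermediate layer whose features score highest would then be chosen to
# output the per-frame fingerprint features.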
For the exemplary neural network structure shown in Table 1, it can be determined, for example, that the image features output by the fifth-to-last layer are the most consistent, i.e., give the best clustering effect; therefore, as indicated in Table 1, the image features output by this layer can be used as the fingerprint features of the processed key frames.
It will be appreciated that the above describes an exemplary method of determining fingerprint features of a key frame, using only the Mobilenet structure shown in Table 1 as an example. The intermediate layer used to output the fingerprint features of key frames is not limited to the fifth-to-last layer of the neural network. It will be appreciated by those skilled in the art that, when other neural network structures are selected for extracting image features of key frames, the output of one intermediate layer of the neural network used may be selected as the fingerprint feature of the processed key frame according to the method described above. A person skilled in the art can select, according to the actual situation, the image features output by the intermediate layer with the best clustering effect as the fingerprint features of the processed key frame. In addition, depending on the usage scenario of the video fingerprint, a person skilled in the art can also verify and select the image features output by any intermediate layer of the neural network structure as the fingerprint features of the processed key frame according to other criteria.
The neural network used in step S204 may include a depthwise convolution layer. For example, as shown in Table 1, the 2nd, 4th, 6th, and similar layers are depthwise convolution layers. In a deep neural network, the processed input image may be transformed into image features, i.e., image data having multiple channels. In a depthwise convolution layer, the number of channels of the convolution kernels equals the number of channels of the image features to be processed, and the convolution kernel of each channel convolves the image data of the corresponding channel in the image features. The outputs of the per-channel convolutions can then be combined using a 1 × 1 convolution kernel, such as in the 3rd, 5th, and 7th layers shown in Table 1. Replacing the traditional standard convolution in the neural network with a depthwise separable convolution formed by a depthwise convolution and a pointwise convolution improves the computational efficiency of the neural network while reducing the number of parameters in the network structure.
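A small sketch of this depthwise-separable pattern is shown below (a per-channel 3 × 3 depthwise convolution followed by a 1 × 1 pointwise convolution); the channel counts are illustrative and do not reproduce the exact layers of Table 1.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # groups=in_channels gives one 3x3 kernel per input channel (depthwise convolution)
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1, groups=in_channels)
        # 1x1 convolution (pointwise convolution) recombines the per-channel outputs
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

block = DepthwiseSeparableConv(32, 64)
print(block(torch.randn(1, 32, 112, 112)).shape)  # torch.Size([1, 64, 112, 112])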
When a depthwise convolution structure is included in the neural network architecture, in some embodiments, the single-instruction-multiple-data (SIMD) capability of the processor (e.g., a CPU) may be used to speed up the depthwise convolution. When a convolution operation needs to be executed, the convolution instructions executed on the data of different channels of the processed image features are identical, so SIMD can be used to make each instruction operate on the data of multiple channels simultaneously and in parallel, which improves processor throughput. In addition, when the size of the convolution kernel is larger than 1, the kernel windows partially overlap as they slide over the feature map; keeping the data of the overlapping part in registers reduces the number of memory accesses and thus improves computational efficiency. The SIMD capability of the processor may be invoked using assembly instructions.
Optionally, before the image feature extraction of the key frame by the neural network, a size transformation (e.g., up-sampling or down-sampling) may be performed on the key frame so that the size of the key frame conforms to the input size of the neural network. Taking the structure shown in table 1 as an example, the size of the input image frame may be 224 × 224 × 3.
Fingerprint features of all key frames of the video extracted in step S202 can be obtained by using step S204. Then, as shown in fig. 2, in step S206, the fingerprint features of the plurality of video frames may be processed to generate a video fingerprint of the video. In some embodiments, the video fingerprint of the video may be determined by fusing fingerprint features of the plurality of video frames. That is, the video fingerprint includes information of fingerprint features of all key frames.
Fig. 3 shows a schematic flow chart of the video fingerprint feature generation step S206 according to an embodiment of the present disclosure.
As shown in fig. 3, step S206 may further include step S2062. In step S2062, dimension reduction processing may be performed on the fingerprint features of the plurality of video frames, respectively, to obtain a plurality of dimension-reduced fingerprint features.
In some embodiments, the dimensionality reduction process may include a pooling step and a principal component analysis step. For example, pooling may be performed on the fingerprint features of the plurality of video frames, respectively, to obtain a plurality of pooled fingerprint features. The pooling may be one or more of average pooling, maximum pooling, or minimum pooling. Taking the Mobilenet network structure shown in Table 1 as an example, as described above, the feature output by the fifth-to-last layer of the network can be taken as the fingerprint feature of the processed key frame. As shown in Table 1, the size of the fingerprint feature output by this layer is 7 × 7 × 1024. The fingerprint feature output by this layer can then be reduced in dimension by pooling; for example, the size of the pooled fingerprint feature may be 1 × 1 × 1024.
The pooled fingerprint features may then be subjected to principal component analysis to obtain the dimension-reduced fingerprint features. The principal component analysis may include performing a matrix multiplication of the plurality of pooled fingerprint features with a preset projection matrix, so as to reduce the dimensionality of the fingerprint features. In some embodiments, the projection matrix may be converted into the form of a corresponding convolution kernel and convolved with the pooled fingerprint features to achieve the same result as the matrix multiplication. In some embodiments, the projection matrix for principal component analysis may be determined using a preset image data set. For example, the covariance matrix of a d-dimensional image data set comprising multiple types of image information may be calculated using that data set. In the examples of the present disclosure described above, d = 1024. Then, the eigenvalues of the covariance matrix and the corresponding eigenvectors can be computed, and the eigenvectors corresponding to the k largest eigenvalues can be selected, where k is an integer less than d. A projection matrix can be constructed from these k eigenvectors. Using the projection matrix, the d-dimensional fingerprint features can be reduced to k dimensions. In one example, k = 128. It is understood that those skilled in the art can set the value of k according to the actual situation, thereby obtaining dimension-reduced fingerprint features of different sizes.
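The numpy sketch below illustrates this dimension reduction under the example values above (a 7 × 7 × 1024 feature map, global average pooling, and a projection from d = 1024 to k = 128); the reference feature set used to build the projection matrix is an assumed input.

import numpy as np

def build_projection(reference_features, k=128):
    # reference_features: (N, d) pooled features computed on a preset image data set
    centered = reference_features - reference_features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)                # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # eigenvectors of the k largest eigenvalues
    return top_k                                        # d x k projection matrix

def reduce_fingerprint_feature(feature_map, projection):
    # feature_map: (7, 7, 1024) intermediate-layer output of one key frame
    pooled = feature_map.mean(axis=(0, 1))              # global average pooling -> (1024,)
    return pooled @ projection                          # (k,) dimension-reduced fingerprint feature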
Although the dimension-reduced fingerprint features lose some image feature information, it has been verified that the accuracy of the dimension-reduced fingerprint features generated by this method drops by only 0.05% compared with fingerprint features without dimensionality reduction, and a comparison accuracy of 96% is still achieved. Thus, in terms of describing video content, the dimension-reduced fingerprint features may be considered equivalent to the full fingerprint features without dimensionality reduction. Because the data volume of the dimension-reduced fingerprint features is greatly reduced, the speed of comparing fingerprints between videos can be increased.
Step S206 may also include step S2064. In step S2064, the plurality of dimension-reduced fingerprint features may be stitched to generate the video fingerprint of the video. For example, when the number of key frames of the video is 3, the dimension-reduced fingerprint features generated for the 3 key frames may be concatenated. Taking 128-dimensional fingerprint features as an example, a 384-dimensional feature is obtained after concatenation. The stitched feature may be used as the video fingerprint of the video.
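As a sketch of this step, the reduced per-frame features can be concatenated as follows; the cosine similarity shown for comparing two resulting fingerprints is only one plausible comparison metric and is not prescribed by the present disclosure.

import numpy as np

def make_video_fingerprint(reduced_features):
    # reduced_features: list of (k,)-dimensional per-key-frame vectors, e.g. three 128-d vectors
    return np.concatenate(reduced_features)  # e.g. 3 x 128 -> a 384-dimensional video fingerprint

def fingerprint_similarity(fp_a, fp_b):
    # cosine similarity between two video fingerprints (illustrative comparison metric)
    return float(fp_a @ fp_b / (np.linalg.norm(fp_a) * np.linalg.norm(fp_b) + 1e-12))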
Alternatively, in step S2064, a transformation may also be performed on the plurality of dimension-reduced video fingerprint features to generate a video fingerprint of the video. For example, a feature hash transform may be performed on the plurality of dimension-reduced video fingerprint features, and a result of the feature hash may be determined as the video fingerprint of the video.
By using the method for extracting a video fingerprint provided by the present disclosure, the generation speed of the video fingerprint can be increased because fingerprint feature extraction is performed only on some of the video frames in the video. In addition, by examining the clustering effect of the image features output by each layer of the neural network, the output of the intermediate layer with the best clustering effect can be selected for generating the video fingerprint, so that the generated video fingerprint performs better when identifying similar videos. Furthermore, by performing principal component analysis on the output of the neural network, the accuracy of matching the video fingerprint against videos with similar content can be maintained, and, excluding the time spent decoding the video, a single CPU (central processing unit) core needs only 160 milliseconds to generate the fingerprint of an entire video. The reduced dimensionality of the generated video fingerprint features greatly increases the speed of comparing fingerprints between videos and can support content-based video search applications.
Fig. 4 shows a schematic block diagram of an apparatus for extracting video fingerprints according to an embodiment of the present disclosure. The client and/or server shown in Fig. 1 may be implemented as the apparatus for extracting video fingerprints shown in Fig. 4.
As shown in fig. 4, the apparatus 400 may include a video frame extraction unit 410 configured to extract a plurality of video frames in a video. The apparatus 400 may perform extraction of image features for the image information of the plurality of video frames extracted by the video frame extraction unit 410, and use the image features respectively extracted from the plurality of video frames as fingerprint features of the corresponding video frames. These video frames used to extract fingerprint features are also referred to as key frames.
In some embodiments, the video frame extraction unit 410 may be configured to determine an active video segment for extracting video fingerprints from the video and extract key frames from the active video segment. In some embodiments, a portion of the video may be determined to be a valid video segment. For example, the video may be cut into segments of a preset time length, and the cut segments may be used as valid video segments. In other embodiments, the entire video may be determined to be a valid video segment.
The video frame extraction unit 410 may be further configured to extract key frames from the determined active video segments. In some embodiments, each frame in the active video clip may be determined to be a key frame. In other embodiments, shot segmentation for the active video segments may be achieved by analyzing each frame in the active video segments, and determining a key frame for each shot based on the results of the shot segmentation. In still other embodiments, video frames may be selected as key frames within the active video segment by sampling. For example, the valid video segments may be sampled at equal intervals, and the sampled video frames may be used as key frames. For another example, the valid video segment may be sampled at arbitrary intervals, and the video frame obtained by sampling may be used as the key frame.
The apparatus 400 may further include a fingerprint feature extraction unit 420, which may be configured to process each of the plurality of video frames by using a neural network having a plurality of layers, where the plurality of layers includes at least one convolutional layer, and each of the at least one convolutional layer is configured to perform convolution processing on an output of a previous layer, and use a convolution processing result output from an intermediate layer of the neural network as a fingerprint feature of the video frame.
In some embodiments, the neural network may be a neural network trained using an image classification task. Such neural networks may be comprised of one or more of convolutional layers, pooling layers, fully-connected layers, and activation functions. For example, the neural network having a plurality of layers used in the present disclosure is a deep neural network constructed by stacking a plurality of convolutional layers and at least one pooling layer. For example, the neural network used in the embodiments of the present disclosure may be implemented using structures such as Alexnet, vggnet, resnet, googlenet, mobilenet, and the like. In the examples described below, the apparatus for extracting video fingerprints provided by the present disclosure is described by taking the Mobilenet network shown in table 1 as an example.
As described earlier, which layer's output image features are selected as the fingerprint features of the key frame is determined by performing cluster analysis on the image features output by each layer of the neural network structure shown in Table 1. That is, the image clustering effect achieved by the image features output by an intermediate layer of the neural network can be tested by running a clustering algorithm (e.g., K-means clustering, mean-shift clustering, a density-based clustering method, etc.) on the image features output by different intermediate layers (e.g., the second-to-last layer, the fifth-to-last layer, or any other intermediate layer). As described above, for the exemplary neural network structure shown in Table 1, it can be determined that the image features output by its fifth-to-last layer give the best clustering effect; therefore, as indicated in Table 1, the image features output by this layer can be used as the fingerprint features of the processed key frame.
It will be appreciated by those skilled in the art that, when other neural network structures are selected for extracting image features of key frames, the output of one intermediate layer of the neural network used may be selected as the fingerprint feature of the processed key frame according to the method described above. A person skilled in the art can select, according to the actual situation, the image features output by the intermediate layer with the best clustering effect as the fingerprint features of the processed key frame. In addition, depending on the usage scenario of the video fingerprint, a person skilled in the art can also verify and select the image features output by any intermediate layer of the neural network structure as the fingerprint features of the processed key frame according to other criteria.
In some embodiments, the fingerprint feature extraction unit 420 may also be configured to use the single-instruction-multiple-data (SIMD) capability of the processor (e.g., a CPU) to speed up the depthwise convolution. When a convolution operation needs to be executed, the convolution instructions executed on the data of different channels of the processed image features are identical, so SIMD can be used to make each instruction operate on the data of multiple channels simultaneously and in parallel, which improves processor throughput. In addition, when the size of the convolution kernel is larger than 1, the kernel windows partially overlap as they slide over the feature map; keeping the data of the overlapping part in the processor's registers reduces the number of memory accesses and thus improves computational efficiency. The SIMD capability of the processor may be invoked using assembly instructions.
Optionally, before the image feature extraction of the key frame by using the neural network, the fingerprint feature extraction unit 420 may be further configured to perform a size transformation (e.g., up-sampling or down-sampling) on the key frame so that the size of the key frame conforms to the input size of the neural network.
The apparatus 400 may further comprise a video fingerprint generation unit 430, which may be configured to process the fingerprint features of the plurality of video frames to generate the video fingerprint of the video. In some embodiments, the video fingerprint of the video may be determined by fusing the fingerprint features of the plurality of video frames. That is, the video fingerprint includes information from the fingerprint features of all key frames.
As shown in fig. 4, the video fingerprint generation unit 430 may include a dimensionality reduction subunit 431 and a stitching subunit 432. The dimensionality reduction subunit 431 may be configured to perform dimension reduction processing on the fingerprint features of the plurality of video frames, respectively, to obtain a plurality of dimension-reduced fingerprint features.
In some embodiments, the dimension reduction process may include a pooling step and a principal component analysis step. For example, pooling may be performed on the fingerprint features of the plurality of video frames, respectively, to obtain a plurality of pooled fingerprint features. The pooling may be one or more of average pooling, maximum pooling, or minimum pooling. As shown in Table 1, the size of the fingerprint feature output by the fifth-to-last layer is 7 × 7 × 1024. The fingerprint feature output by this layer may then be reduced to a size of 1 × 1 × 1024 by pooling.
The principal component analysis step may comprise performing principal component analysis on the plurality of pooled fingerprint features using a projection matrix for principal component analysis, to obtain a plurality of dimension-reduced fingerprint features. In some embodiments, the projection matrix for principal component analysis is determined using a preset image data set. For example, the covariance matrix of a d-dimensional image data set comprising multiple types of image information may be calculated using that data set. Then, the eigenvalues and corresponding eigenvectors of the covariance matrix may be computed, and the eigenvectors corresponding to the k largest eigenvalues may be selected, where k is an integer less than d. A projection matrix is constructed from these k eigenvectors. Using the projection matrix, the d-dimensional fingerprint features can be reduced to k dimensions. It is understood that those skilled in the art can set the value of k according to the actual situation, thereby obtaining dimension-reduced fingerprint features of different sizes.
The stitching subunit 432 may be configured to stitch the plurality of reduced-dimension video fingerprint features to generate a video fingerprint of the video. For example, when the number of key frames of the video is 3, the dimension-reduced fingerprint features generated for the 3 key frames respectively may be stitched. Taking the example that the fingerprint feature obtained after dimension reduction is 128 dimensions, a 384-dimensional feature can be obtained after splicing. The stitched features may be used as video fingerprints of the video.
Alternatively, the stitching subunit 432 may also be configured to perform a transformation on the plurality of reduced-dimension video fingerprint features to generate a video fingerprint of the video. For example, a feature hash transform may be performed on the plurality of dimension-reduced video fingerprint features, and a result of the feature hash may be determined as the video fingerprint of the video.
With the apparatus for extracting video fingerprints provided by the present disclosure, the generation speed of the video fingerprint can be increased because fingerprint feature extraction is performed only on some of the video frames in the video. In addition, by examining the clustering effect of the image features output by each layer of the neural network, the output of the intermediate layer with the best clustering effect can be selected for generating the video fingerprint, so that the generated video fingerprint performs better when identifying similar videos. Furthermore, by performing principal component analysis on the output of the neural network, the accuracy of matching the video fingerprint against videos with similar content can be maintained, and, excluding the time spent decoding the video, a single CPU (central processing unit) core needs only 160 milliseconds to generate the fingerprint of an entire video. The reduced dimensionality of the generated video fingerprint features greatly increases the speed of comparing fingerprints between videos and can support content-based video search applications.
Furthermore, the method or apparatus according to embodiments of the present disclosure may also be implemented by means of the architecture of a computing device as shown in fig. 5. Fig. 5 illustrates an architecture of the computing device. As shown in fig. 5, computing device 500 may include a bus 510, one or more CPUs 520, a Read Only Memory (ROM) 530, a Random Access Memory (RAM) 540, a communication port 550 connected to a network, an input/output component 560, a hard disk 570, and the like. A storage device in the computing device 500, such as the ROM 530 or hard disk 570, may store various data or files used by the processing and/or communication of the methods for video fingerprint extraction provided by the present disclosure, as well as program instructions executed by the CPU. Computing device 500 may also include a user interface 580. Of course, the architecture shown in FIG. 5 is merely exemplary, and one or more components of the computing device shown in FIG. 5 may be omitted as needed in implementing different devices.
Embodiments of the present disclosure may also be implemented as a computer-readable storage medium. A computer readable storage medium according to an embodiment of the present disclosure has computer readable instructions stored thereon. The computer readable instructions, when executed by a processor, may perform methods according to embodiments of the present disclosure as described with reference to the above figures. The computer-readable storage medium includes, but is not limited to, volatile memory and/or non-volatile memory, for example. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc.
Those skilled in the art will appreciate that the disclosure of the present disclosure is susceptible to numerous variations and modifications. For example, the various devices or components described above may be implemented in hardware, or may be implemented in software, firmware, or a combination of some or all of the three.
Furthermore, as used in this disclosure and in the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
Further, while the present disclosure makes various references to certain elements of a system according to embodiments of the present disclosure, any number of different elements may be used and run on a client and/or server. The units are illustrative only, and different aspects of the systems and methods may use different units.
Furthermore, flow charts are used in this disclosure to illustrate operations performed by systems according to embodiments of the disclosure. It should be understood that the preceding or following operations are not necessarily performed precisely in the order presented. Rather, various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to or removed from these processes.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the claims. It is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the claims and their equivalents.

Claims (14)

1. A method for extracting a video fingerprint of a video, comprising:
extracting a plurality of video frames in a video;
for each of the plurality of video frames, processing the video frame by using a neural network with a plurality of layers, wherein the plurality of layers comprises at least one convolutional layer, and each convolutional layer is used for performing convolution processing on the output of the previous layer;
taking the convolution processing result output by the middle layer of the neural network as the fingerprint characteristic of the video frame; and
processing fingerprint features of the plurality of video frames to generate a video fingerprint of the video,
the method for processing the video frame by taking the convolution processing result output by the middle layer of the neural network as the fingerprint feature of the video frame comprises the following steps:
and performing cluster analysis on the convolution processing results output by each middle layer of the neural network, and selecting the convolution processing result output by the middle layer with the best clustering effect as the fingerprint characteristic of the video frame according to the result of the cluster analysis.
2. The method of claim 1, wherein processing the fingerprint features of the plurality of video frames to generate the video fingerprint of the video comprises:
performing dimension reduction processing on the fingerprint features of the video frames respectively to obtain a plurality of dimension reduced fingerprint features;
stitching the plurality of dimension-reduced video fingerprint features to generate a video fingerprint of the video.
3. The method of claim 2, wherein performing dimension reduction processing on the fingerprint features of the plurality of video frames comprises at least one of:
performing pooling processing on the fingerprint features of the plurality of video frames, respectively, to obtain a plurality of pooled fingerprint features, an
Performing principal component analysis on the plurality of pooled fingerprint features to obtain a plurality of dimension-reduced fingerprint features.
4. The method of claim 1, wherein extracting a plurality of video frames in a video comprises:
selecting a plurality of frames as the plurality of video frames at equal intervals in the video.
5. The method of claim 1, wherein the neural network transforms the video frame into image data having a plurality of channels, the method further comprising performing a convolution operation on the image data of the plurality of channels in parallel.
6. The method according to any one of claims 1-5, wherein the neural network is a Mobilenet network.
7. An apparatus for extracting video fingerprints of a video, comprising:
a video frame extraction unit configured to extract a plurality of video frames in a video;
a fingerprint feature extraction unit, configured to, for each of the plurality of video frames, process the video frame by using a neural network having a plurality of layers, where the plurality of layers includes at least one convolutional layer, each of the at least one convolutional layer is used for performing convolution processing on an output of a previous layer, and a convolution processing result output by an intermediate layer of the neural network is used as a fingerprint feature of the video frame; and
a video fingerprint generation unit configured to process fingerprint features of the plurality of video frames to generate video fingerprints of the video,
wherein the fingerprint feature extraction unit is further configured to:
and performing cluster analysis on the convolution processing results output by each middle layer of the neural network, and selecting the convolution processing result output by the middle layer with the best clustering effect as the fingerprint characteristic of the video frame according to the result of the cluster analysis.
8. The apparatus of claim 7, wherein the video fingerprint generation unit further comprises:
the dimension reduction subunit is configured to respectively perform dimension reduction processing on the fingerprint features of the video frames to obtain a plurality of dimension reduction fingerprint features; and
a stitching subunit configured to stitch the plurality of dimension-reduced video fingerprint features to generate a video fingerprint of the video.
9. The apparatus of claim 8, wherein the dimension reduction process comprises at least one of:
performing pooling processing on the fingerprint features of the plurality of video frames, respectively, to obtain a plurality of pooled fingerprint features, and
performing principal component analysis on the plurality of pooled fingerprint features to obtain a plurality of dimension-reduced fingerprint features.
10. The apparatus of claim 7, wherein extracting a plurality of video frames in a video comprises:
selecting a plurality of frames as the plurality of video frames at equal intervals in the video.
11. The apparatus of claim 7, wherein the neural network transforms the video frame into image data having a plurality of channels, the apparatus further configured to perform a convolution operation on the image data of the plurality of channels in parallel.
12. The apparatus according to any one of claims 7-11, wherein the neural network is a MobileNet network.
13. An apparatus for extracting a video fingerprint, comprising a processor and a memory, the memory having stored therein program instructions which, when executed by the processor, cause the processor to perform the method for extracting a video fingerprint of a video according to any one of claims 1-6.
14. A computer-readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the method for extracting a video fingerprint of a video according to any one of claims 1-6.
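As one way the cluster-analysis selection of claim 1 might be realized, the sketch below clusters the outputs of each candidate middle layer over a small set of frames and keeps the layer whose outputs cluster best. The use of k-means and the silhouette score as the measure of "clustering effect", the helper name select_best_layer, and the shape of layer_outputs are assumptions for illustration only; the claim does not fix these choices.

```python
# Minimal sketch of the layer selection in claim 1: cluster each candidate middle
# layer's outputs and keep the layer whose outputs cluster best.
# Assumptions: k-means plus the silhouette score stand in for the unspecified
# "cluster analysis" and "best clustering effect"; more frames than clusters.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_best_layer(layer_outputs, n_clusters=8):
    """layer_outputs: dict mapping layer name -> (num_frames, feature_dim) array."""
    best_layer, best_score = None, -1.0
    for name, feats in layer_outputs.items():
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
        score = silhouette_score(feats, labels)  # higher = tighter, better-separated clusters
        if score > best_score:
            best_layer, best_score = name, score
    return best_layer
```

In practice such a selection would typically be run once, offline, over a validation set of frames; at extraction time only the chosen layer's output would be used.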
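For claims 2 and 3, a minimal sketch of the dimension-reduction and stitching steps: each frame's feature map is average-pooled to a channel vector, the pooled vectors are reduced with principal component analysis, and the reduced vectors are concatenated into one video fingerprint. The function name, the choice of average pooling, and the output dimensionality are illustrative assumptions; in practice the PCA projection would more likely be fitted once over a corpus of frames rather than per video.

```python
# Minimal sketch of claims 2-3: pooling -> principal component analysis -> stitching.
import numpy as np
from sklearn.decomposition import PCA

def reduce_and_stitch(frame_features, n_components=64):
    """frame_features: list of (channels, height, width) arrays, one per sampled frame."""
    # Global average pooling: collapse each frame's feature map to a channel vector.
    pooled = np.stack([f.mean(axis=(1, 2)) for f in frame_features])
    # Principal component analysis reduces each pooled vector's dimensionality.
    n_components = min(n_components, pooled.shape[0], pooled.shape[1])
    reduced = PCA(n_components=n_components).fit_transform(pooled)
    # Stitching: concatenate the per-frame vectors into a single video fingerprint.
    return reduced.reshape(-1)
```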
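For claim 4, a short sketch of equal-interval frame sampling using OpenCV; the default of 10 frames and the function name sample_frames are assumptions, not values taken from the claims.

```python
# Minimal sketch of claim 4: select frames at equal intervals across the video.
import cv2
import numpy as np

def sample_frames(video_path, num_frames=10):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Equally spaced frame indices over the whole video.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```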
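For claims 5 and 6, the sketch below captures a middle-layer activation of torchvision's MobileNetV2 with a forward hook after the frame has been turned into multi-channel image data. MobileNetV2, the block index 7, and the unspecified weights are stand-ins chosen for illustration; the claims only require a MobileNet-style network and do not fix which middle layer is used.

```python
# Minimal sketch of claims 5-6: use a middle-layer activation of a MobileNet-style
# network as the fingerprint feature of one video frame.
import torch
from torchvision import models

model = models.mobilenet_v2()            # illustrative stand-in; weights unspecified here
model.eval()

captured = {}
def _hook(module, inputs, output):
    # Store the convolution result produced by the chosen middle layer.
    captured["feat"] = output.detach()

# Block index 7 is an arbitrary illustrative choice of "middle layer".
model.features[7].register_forward_hook(_hook)

frame = torch.randn(1, 3, 224, 224)      # a preprocessed frame as 3-channel image data
with torch.no_grad():
    model(frame)                         # channels are convolved in parallel
fingerprint_feature = captured["feat"]   # feature map taken as the frame's fingerprint feature
```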
CN201811353102.6A 2018-11-14 2018-11-14 Method, apparatus, device and computer readable medium for extracting video fingerprint Active CN110163061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811353102.6A CN110163061B (en) 2018-11-14 2018-11-14 Method, apparatus, device and computer readable medium for extracting video fingerprint

Publications (2)

Publication Number Publication Date
CN110163061A CN110163061A (en) 2019-08-23
CN110163061B true CN110163061B (en) 2023-04-07

Family

ID=67645250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811353102.6A Active CN110163061B (en) 2018-11-14 2018-11-14 Method, apparatus, device and computer readable medium for extracting video fingerprint

Country Status (1)

Country Link
CN (1) CN110163061B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110830938B (en) * 2019-08-27 2021-02-19 武汉大学 Fingerprint positioning quick implementation method for indoor signal source deployment scheme screening
CN110781818B (en) * 2019-10-25 2023-04-07 Oppo广东移动通信有限公司 Video classification method, model training method, device and equipment
CN110996124B (en) * 2019-12-20 2022-02-08 北京百度网讯科技有限公司 Original video determination method and related equipment
CN111614991B (en) * 2020-05-09 2022-11-22 咪咕文化科技有限公司 Video progress determination method and device, electronic equipment and storage medium
CN113313065A (en) * 2021-06-23 2021-08-27 北京奇艺世纪科技有限公司 Video processing method and device, electronic equipment and readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008395B (en) * 2014-05-20 2017-06-27 中国科学技术大学 A kind of bad video intelligent detection method based on face retrieval
CN107392224A (en) * 2017-06-12 2017-11-24 天津科技大学 A kind of crop disease recognizer based on triple channel convolutional neural networks
CN107665261B (en) * 2017-10-25 2021-06-18 北京奇虎科技有限公司 Video duplicate checking method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108431823A (en) * 2015-11-05 2018-08-21 脸谱公司 With the system and method for convolutional neural networks process content
CN106686472A (en) * 2016-12-29 2017-05-17 华中科技大学 High-frame-rate video generation method and system based on depth learning
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A kind of video content detection method based on light stream and neural network
CN108280233A (en) * 2018-02-26 2018-07-13 南京邮电大学 A kind of VideoGIS data retrieval method based on deep learning

Also Published As

Publication number Publication date
CN110163061A (en) 2019-08-23

Similar Documents

Publication Publication Date Title
CN110163061B (en) Method, apparatus, device and computer readable medium for extracting video fingerprint
CN111738244B (en) Image detection method, image detection device, computer equipment and storage medium
US11908238B2 (en) Methods and systems for facial point-of-recognition (POR) provisioning
CN109871490B (en) Media resource matching method and device, storage medium and computer equipment
CN110362677B (en) Text data category identification method and device, storage medium and computer equipment
CN109255392B (en) Video classification method, device and equipment based on non-local neural network
WO2021237570A1 (en) Image auditing method and apparatus, device, and storage medium
CN106575280B (en) System and method for analyzing user-associated images to produce non-user generated labels and utilizing the generated labels
JP2015170358A (en) Method of extracting low-rank descriptor of video acquired from scene
Pedronette et al. Multimedia retrieval through unsupervised hypergraph-based manifold ranking
CN111651636A (en) Video similar segment searching method and device
CN111522996A (en) Video clip retrieval method and device
CN111935487B (en) Image compression method and system based on video stream detection
KR20200020107A (en) Method and system for authenticating stroke-based handwritten signature using machine learning
CN114332500A (en) Image processing model training method and device, computer equipment and storage medium
CN110008922B (en) Image processing method, device, apparatus, and medium for terminal device
JP6460926B2 (en) System and method for searching for an object in a captured image
WO2024027347A9 (en) Content recognition method and apparatus, device, storage medium, and computer program product
CN112257689A (en) Training and recognition method of face recognition model, storage medium and related equipment
CN111767421A (en) Method, device, electronic equipment and computer readable medium for retrieving image
Klym et al. Face detection using an implementation running in a web browser
CN114882334B (en) Method for generating pre-training model, model training method and device
CN116310462A (en) Image clustering method and device based on rank constraint self-expression
CN115984977A (en) Living body detection method and system
Kaur et al. Deep CNN based online image deduplication technique for cloud storage system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant