CN110163061A - Method, apparatus, device, and computer-readable medium for extracting a video fingerprint - Google Patents

Method, apparatus, device, and computer-readable medium for extracting a video fingerprint

Info

Publication number
CN110163061A
CN110163061A (application CN201811353102.6A)
Authority
CN
China
Prior art keywords
video
fingerprint
fingerprint feature
video frame
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811353102.6A
Other languages
Chinese (zh)
Other versions
CN110163061B (en)
Inventor
叶燕罡
沈小勇
陈忠磊
马子扬
戴宇榮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201811353102.6A (granted as CN110163061B)
Publication of CN110163061A
Application granted
Publication of CN110163061B
Active legal status
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Collating Specific Patterns (AREA)

Abstract

Disclosed are a method, apparatus, device, and computer-readable medium for extracting a video fingerprint of a video. The method includes: extracting a plurality of video frames from the video; for each of the plurality of video frames, processing the video frame with a neural network having a plurality of layers, wherein the plurality of layers include at least one convolutional layer, each of the at least one convolutional layer performing convolution on the output of its preceding layer; taking the convolution result output by an intermediate layer of the neural network as the fingerprint feature of the video frame; and processing the fingerprint features of the plurality of video frames to generate the video fingerprint of the video.

Description

Method, apparatus, device, and computer-readable medium for extracting a video fingerprint
Technical field
The present disclosure relates to the field of video processing, and in particular to a method, apparatus, device, and computer-readable medium for extracting a video fingerprint of a video.
Background
In recent years, the number of online videos has grown rapidly. To enable effective video identification, image processing can be applied to multiple video frames contained in a video to generate fingerprint features for those frames. By processing these fingerprint features, a video fingerprint of the video can be obtained. A video fingerprint obtained in this way can serve as a unique feature of the video and represent the video file. Because a video fingerprint describes the video content, it enables applications such as comparison and clustering of similar videos.
Because video fingerprints are widely used in video comparison and clustering, new video fingerprint extraction methods are needed to achieve fast extraction of video fingerprints and fast comparison between them.
Summary of the invention
To this end, the present disclosure provides a method, apparatus, device, and computer-readable medium for extracting a video fingerprint of a video.
According to one aspect of the disclosure, a method for extracting a video fingerprint of a video is provided, comprising: extracting a plurality of video frames from the video; for each of the plurality of video frames, processing the video frame with a neural network having a plurality of layers, wherein the plurality of layers include at least one convolutional layer, each of the at least one convolutional layer performing convolution on the output of its preceding layer, and taking the convolution result output by an intermediate layer of the neural network as the fingerprint feature of the video frame; and processing the fingerprint features of the plurality of video frames to generate the video fingerprint of the video.
In some embodiments, processing the fingerprint features of the plurality of video frames to generate the video fingerprint of the video comprises: performing dimension reduction on the fingerprint features of the plurality of video frames to obtain a plurality of reduced fingerprint features; and concatenating the plurality of reduced fingerprint features to generate the video fingerprint of the video.
In some embodiments, the dimension reduction comprises at least one of the following steps: performing pooling on the fingerprint features of the plurality of video frames to obtain a plurality of pooled fingerprint features, and performing principal component analysis on the plurality of pooled fingerprint features to obtain the plurality of reduced fingerprint features.
In some embodiments, extracting the plurality of video frames from the video comprises selecting a plurality of frames at equal intervals in the video as the plurality of video frames.
In some embodiments, the neural network transforms the video frame into image data having a plurality of channels, and the method further comprises performing convolution operations on the image data of the plurality of channels in parallel.
In some embodiments, the neural network is a Mobilenet network.
According to another aspect of the disclosure, an apparatus for extracting a video fingerprint is provided, comprising: a video frame extraction unit configured to extract a plurality of video frames from a video; a fingerprint feature extraction unit configured to, for each of the plurality of video frames, process the video frame with a neural network having a plurality of layers, wherein the plurality of layers include at least one convolutional layer, each of the at least one convolutional layer performing convolution on the output of its preceding layer, and to take the convolution result output by an intermediate layer of the neural network as the fingerprint feature of the video frame; and a video fingerprint generation unit configured to process the fingerprint features of the plurality of video frames to generate the video fingerprint of the video.
In some embodiments, the video fingerprint generation unit further comprises: a dimension reduction subunit configured to perform dimension reduction on the fingerprint features of the plurality of video frames to obtain a plurality of reduced fingerprint features; and a concatenation subunit configured to concatenate the plurality of reduced fingerprint features to generate the video fingerprint of the video.
In some embodiments, the dimension reduction comprises at least one of the following steps: performing pooling on the fingerprint features of the plurality of video frames to obtain a plurality of pooled fingerprint features, and performing principal component analysis on the plurality of pooled fingerprint features to obtain the plurality of reduced fingerprint features.
In some embodiments, extracting the plurality of video frames from the video comprises selecting a plurality of frames at equal intervals in the video as the plurality of video frames.
In some embodiments, the neural network transforms the video frame into image data having a plurality of channels, and the apparatus is configured to perform convolution operations on the image data of the plurality of channels in parallel.
In some embodiments, the neural network is a Mobilenet network.
According to another aspect of the disclosure, a device for extracting a video fingerprint is provided, comprising a processor and a memory storing program instructions, wherein when the program instructions are executed by the processor, the processor is configured to perform the method for extracting a video fingerprint of a video as described above.
According to another aspect of the disclosure, a computer-readable storage medium is provided, on which instructions are stored, the instructions, when executed by a processor, causing the processor to perform the method for extracting a video fingerprint of a video as described above.
With the method, apparatus, device, and computer-readable medium for extracting a video fingerprint provided by the present disclosure, the fingerprint features of frames in a video can be determined by a new fingerprint feature extraction method, and the video fingerprint of the video can further be determined from these fingerprint features. Video fingerprints obtained with the method provided by the disclosure yield better clustering results.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings required for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the disclosure, and those of ordinary skill in the art can derive other drawings from them without creative effort. The drawings are not deliberately drawn to scale; the emphasis is on illustrating the gist of the disclosure.
Fig. 1 shows a schematic diagram of an illustrative video fingerprint extraction system according to the disclosure;
Fig. 2 shows a schematic flowchart of a method for video fingerprint extraction according to an embodiment of the disclosure;
Fig. 3 shows a schematic flowchart of the video fingerprint generation step according to an embodiment of the disclosure;
Fig. 4 shows a schematic block diagram of an apparatus for extracting a video fingerprint according to an embodiment of the disclosure; and
Fig. 5 shows an illustrative architecture of a computing device according to the disclosure.
Detailed description of embodiments
To make the objectives, technical solutions, and advantages of the disclosure clearer, the disclosure is further described in detail through the following embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the disclosure. Based on the embodiments of the disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the disclosure.
A video fingerprint is a unique video feature generated by analyzing the image information in a video; it can represent the corresponding video file. Video fingerprints enable video search: because a fingerprint describes the video content, it can help locate similar videos, enabling search-by-video applications. As another example, video fingerprints enable fast comparison of two videos. For two videos with identical content but different frame rates or resolutions, the MD5 message digests of the video files cannot determine whether the two videos are the same; because a video fingerprint describes the content, comparing fingerprints can reveal that the two videos have identical content. With this property, video fingerprints can be used, for example, to deduplicate copies of the same video. Furthermore, because video fingerprints describe video content, the videos a user has watched can be clustered by their fingerprints, so that similar videos in categories the user is interested in can be recommended. In addition, during the distribution of video content, for example on a video sharing platform, the user who uploaded the earliest of a set of identical or highly similar videos can be identified by video fingerprint and upload time, thereby determining the author of the content and combating pirated videos.
Because video fingerprints are widely used in video comparison and clustering, new video fingerprint extraction methods are needed to achieve fast extraction of video fingerprints and fast comparison between them.
Fig. 1 shows a schematic diagram of an illustrative video fingerprint extraction system according to the disclosure. As shown in Fig. 1, the video fingerprint extraction system 100 may include one or more clients 110, a network 120, a server 130, and a database 140. For convenience of description, the video fingerprint extraction system 100 is referred to in this disclosure as the system 100.
The client 110 may include, but is not limited to, one or more stationary or mobile electronic devices. Stationary electronic devices may include, but are not limited to, desktop computers, smart home devices, and the like. Mobile electronic devices may include, but are not limited to, one or more of smartphones, smartwatches, laptops, tablets, gaming devices, and the like. The client 110 can communicate with the server, the database, or other clients through the network 120, for example by sending a locally stored video, or a video shot by the client, to the server 130 or to other clients via the network. A video may be shot, for example, using a camera program running on the client, or other programs such as a browser, a scanning function built into an instant messaging (IM) application, or a camera application. In some embodiments of the disclosure, the client 110 may be configured to perform the method for extracting a video fingerprint provided by the disclosure, for example on a locally stored video or on a video received over the network from other clients and/or the database.
The network 120 may be a single network or a combination of multiple different networks. For example, the network 120 may include, but is not limited to, one or more of a local area network, a wide area network, the Internet, and the like. The network 120 can be used to exchange data among the client 110, the server 130, and the database 140.
The server 130 is a system that can analyze and process data to generate analysis results. The server 130 may be a single server or a server cluster, with the servers in the cluster connected by wired or wireless networks. In embodiments of the disclosure, the server 130 may be configured to perform the method for extracting a video fingerprint provided by the disclosure, for example on videos received from the client 110 and/or the database 140.
The database 140 may refer to a device with a storage function. The database 140 is mainly used to store data collected from the client 110 and the various data used, produced, and output by the server 130, for example the algorithm parameters used to generate video fingerprints in the methods described below, or the video fingerprints generated by the methods provided by the disclosure. The database 140 may be local or remote, and may be volatile or persistent storage. The storage devices listed above are only examples; the storage devices usable by the system are not limited to them. The database 140 may be interconnected or communicate with the network 120, or directly interconnected or communicate with the system 100 or a part thereof (for example, the server 130), or a combination of the two. In some embodiments, the database 140 may be deployed in the back end of the server 130. In some embodiments, the database 140 may be standalone and directly connected to the network 120. The connections or communications between the database 140 and other devices of the system may be wired or wireless.
It should be noted that, in addition to the networked system described above, embodiments of the disclosure may also be implemented on a single local computer.
In the following, embodiments of the disclosure are described with reference to the drawings.
Fig. 2 shows a schematic flowchart of a method for video fingerprint extraction according to an embodiment of the disclosure. The method 200 shown in Fig. 2 may be implemented by the client 110 or the server 130 shown in Fig. 1.
As shown in Fig. 2, the method 200 may include step S202. In step S202, a plurality of video frames may be extracted from the video. In the following description, image feature extraction is performed on the image information of the video frames extracted in step S202, and the image features extracted for these frames are used as the fingerprint features of the corresponding frames. A fingerprint feature here may be the feature map output by any layer of a neural network having a plurality of layers after the network processes the video frame. Hereinafter, the video frames used for fingerprint feature extraction are also referred to as key frames.
In step S202, an effective video segment for extracting the video fingerprint may be determined from the video, and key frames may be extracted from the effective video segment.
In some embodiments, a part of the video may be determined as the effective video segment. For example, the video may be cut into segments of a predetermined length (such as 15 s, or any other possible predetermined length), and a resulting segment may be used as the effective video segment. For example, the segment from second 0 to second 15 of the video may be determined as the effective video segment, or a segment of predetermined length starting at any position in the video may be used. In other embodiments, the entire video may be determined as the effective video segment.
Key frames can then be extracted from the determined effective video segment. In some embodiments, every frame in the effective video segment may be determined as a key frame. In other embodiments, each frame in the effective video segment may be analyzed to segment the effective video segment into shots, and a key frame may be extracted for each shot according to the shot segmentation result; for example, the first frame of each shot may be determined as a key frame. In still other embodiments, video frames may be selected as key frames by sampling. For example, the effective video segment may be sampled at equal intervals (such as every 5 s, or any preset interval), and the sampled frames used as key frames; for a 15-second effective video segment sampled at 5-second intervals, 3 frames are obtained as key frames. As another example, the effective video segment may be sampled at arbitrary intervals, and the sampled frames used as key frames.
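The equal-interval sampling described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the video is characterized only by its duration and frame rate, and the function name and signature are ours, not from the patent.

```python
def keyframe_indices(duration_s: float, fps: float, interval_s: float = 5.0) -> list[int]:
    """Return frame indices sampled at equal intervals from an effective segment.

    One frame is taken at t = interval_s, 2*interval_s, ... within the segment,
    matching the example in the text (a 15 s segment at 5 s intervals -> 3 frames).
    """
    indices = []
    t = interval_s
    while t <= duration_s:
        indices.append(int(round(t * fps)))
        t += interval_s
    return indices

# A 15-second effective segment at an assumed 25 fps, sampled every 5 seconds,
# yields 3 key frames, as in the text.
print(keyframe_indices(15.0, 25.0))  # → [125, 250, 375]
```

In practice the returned indices would be used to seek and decode frames from the video file; the sketch only computes which frames to take.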
Then, in step S204, each of the plurality of video frames is processed with a neural network having a plurality of layers, and the feature map output by an intermediate layer of the neural network is used as the fingerprint feature of the video frame. The fingerprint feature can be used to generate the video fingerprint.
In some embodiments, the neural network may be a neural network trained on an image classification task. Such a network may be composed of one or more of convolutional layers, pooling layers, fully connected layers, and activation functions. For example, the neural network with a plurality of layers used in this disclosure is a deep neural network built by stacking multiple convolutional layers and at least one pooling layer, where each convolutional layer performs convolution on the output of its preceding layer and outputs the convolution result. Structures such as Alexnet, Vggnet, Resnet, Googlenet, or Mobilenet may be used to implement the neural network used in embodiments of the disclosure.
Training the neural network with a suitable image dataset enables the trained network to perform image classification. That is, the last layer of the neural network trained on the image classification task can output the classification features of the input image and the predicted classification result.
According to the method provided by the disclosure, the neural network trained on the image classification task can be used to process the key frames, and the feature output by an intermediate layer of the network is used as the fingerprint feature of the key frame currently being processed.
Taking the Mobilenet structure as an example, an illustrative structure of a Mobilenet network is shown in Table 1. In the following, each row of Table 1 is referred to as one layer of the Mobilenet network. For example, the first layer of the Mobilenet network shown in Table 1 is a convolutional layer of shape 3 × 3 × 3 × 32, and the second layer is a depthwise convolutional layer of shape 3 × 3 × 32. By analogy, the second-to-last layer is a fully connected layer FC of shape 1024 × 1000, and the last layer is the activation function Softmax.
Table 1
To adapt to multi-label tasks, a sigmoid cross-entropy function can be used as the loss function when training the Mobilenet structure shown in Table 1.
The Mobilenet structure shown in Table 1 can perform convolution on an input key frame and generate image features that describe the image information of the key frame. In fact, the output of every layer of the neural network can be used to describe the input image; in general, however, in a deep neural network the image features output by deeper layers describe the input image much better than those output by shallow layers.
In addition, as mentioned above, a video fingerprint serves to describe video content and can further be used for comparing and clustering similar videos. Therefore, to determine which layer of the neural network outputs the feature with the best clustering performance, cluster analysis can be performed on the output of each layer of the network, and the most suitable output can be selected as the fingerprint feature of the video frame according to the results of the analysis.
For example, cluster analysis can be performed on the image features output by each layer of the neural network structure shown in Table 1 to determine which layer's output to select as the fingerprint feature of the key frame. A dataset containing many pictures can be selected and processed with the trained neural network. Because the content of the images in the dataset is known, a clustering algorithm (such as k-means, mean shift, or density-based clustering) can be run on the image features output by different intermediate layers of the network (such as the second-to-last layer, the fifth-to-last layer, or any other intermediate layer) to examine the clustering performance achieved with each layer's features. That is, the output of each intermediate layer can be evaluated on how well it identifies similar pictures. In some embodiments, a clustering algorithm can be applied to the features output by each intermediate layer, and the consistency of the image data within each resulting cluster can be judged.
For the illustrative neural network structure shown in Table 1, it may be determined, for example, that the image features output by its fifth-to-last layer have the highest consistency, i.e., the best clustering performance; therefore, as shown in Table 1, the image features output by that layer can be used as the fingerprint feature of the processed key frame.
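The "consistency" check described above can be illustrated with a simple purity score. The patent does not specify a metric, so this is one plausible sketch: given known image labels and the cluster assignments produced from one layer's features, it measures how often members of a cluster share the same ground-truth label; the cluster assignments below are hypothetical.

```python
from collections import Counter

def cluster_purity(cluster_ids, true_labels):
    """Weighted purity: for each cluster, the count of members sharing the
    majority ground-truth label, summed over clusters and divided by the
    number of samples. Higher means the layer's features group known-similar
    images more consistently."""
    clusters = {}
    for cid, label in zip(cluster_ids, true_labels):
        clusters.setdefault(cid, []).append(label)
    correct = sum(Counter(members).most_common(1)[0][1] for members in clusters.values())
    return correct / len(true_labels)

# Hypothetical cluster assignments from two candidate layers on a labelled set:
labels        = ["cat", "cat", "dog", "dog", "car", "car"]
deep_layer    = [0, 0, 1, 1, 2, 2]   # consistent clusters
shallow_layer = [0, 1, 1, 2, 2, 0]   # mixed clusters
print(cluster_purity(deep_layer, labels))     # → 1.0
print(cluster_purity(shallow_layer, labels))  # → 0.5
```

Comparing such scores across candidate layers would support the kind of layer selection the text describes.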
It is to be understood that the above only describes an illustrative way of determining the fingerprint feature of a key frame for the Mobilenet structure shown in Table 1. The intermediate layer used to output the fingerprint feature of a key frame is not limited to the fifth-to-last layer of the network. Those skilled in the art will understand that when another neural network structure is selected for extracting the image features of key frames, the output of an intermediate layer of that network, selected by the method above, can be used as the fingerprint feature of the processed key frame. Those skilled in the art can, depending on the actual situation, select the image features output by the intermediate layer with the best clustering performance as the fingerprint feature of the processed key frame. Furthermore, those skilled in the art can also, depending on the usage scenario of the video fingerprint, use other criteria to examine and select the image features output by any intermediate layer of the neural network structure as the fingerprint feature of the processed key frame.
The neural network used in step S204 may include depthwise convolution layers. For example, as shown in Table 1, the 2nd, 4th, 6th, and similar layers are depthwise convolutional layers. In a deep neural network, the processed input image can be transformed into image features having multiple channels. In a depthwise convolution, the number of channels of the convolution kernels equals the number of channels of the processed image features, and the kernel of each channel convolves the image data of the corresponding channel in those features. A 1 × 1 convolution kernel can then be used to combine the per-channel convolution outputs, as in the 3rd, 5th, 7th, and similar layers shown in Table 1. Depthwise separable convolution, composed of a depthwise convolution and a pointwise convolution, can replace the traditional standard convolution in a neural network structure, improving the computational efficiency of the network while reducing the number of parameters in it.
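The depthwise-then-pointwise structure described above can be sketched with plain numpy. This is an illustrative reference implementation under stated assumptions (stride 1, no padding), not the patent's or Mobilenet's actual kernels; the shapes mirror the text's convention of one spatial kernel per channel followed by a 1 × 1 channel mix.

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Depthwise separable convolution (stride 1, no padding).

    x: (H, W, C) input feature map.
    dw_kernels: (k, k, C), one spatial kernel per input channel.
    pw_kernels: (C, C_out), the 1x1 pointwise combination.
    """
    h, w, c = x.shape
    k = dw_kernels.shape[0]
    oh, ow = h - k + 1, w - k + 1
    # Depthwise step: each channel is convolved only with its own kernel.
    dw_out = np.zeros((oh, ow, c))
    for i in range(oh):
        for j in range(ow):
            patch = x[i:i + k, j:j + k, :]            # (k, k, C)
            dw_out[i, j, :] = np.sum(patch * dw_kernels, axis=(0, 1))
    # Pointwise step: a 1x1 convolution mixes the channels.
    return dw_out @ pw_kernels                         # (oh, ow, C_out)

x = np.random.rand(7, 7, 4)
dw = np.random.rand(3, 3, 4)
pw = np.random.rand(4, 8)
print(depthwise_separable_conv(x, dw, pw).shape)  # → (5, 5, 8)
```

The parameter saving is visible in the shapes: the depthwise and pointwise kernels together hold k·k·C + C·C_out weights, versus k·k·C·C_out for a standard convolution with the same output channels.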
When the neural network structure includes depthwise convolution, in some embodiments the single-instruction-multiple-data (SIMD) capability of the processor (such as a CPU) can be used to accelerate the depthwise convolution. When a convolution operation needs to be performed, the instructions executing the convolution on the data in different channels of the processed image features are identical; therefore, with SIMD, each instruction can operate on the data of multiple channels in parallel, improving processor throughput. In addition, when the kernel size is greater than 1, the kernel partially overlaps itself as it slides over the feature map; keeping the data of the overlapping part in registers reduces the number of memory accesses and thus improves computational efficiency. The SIMD capability of the processor can be invoked with assembly instructions.
Optionally, before extracting image features from a key frame with the neural network, a size transformation (such as upsampling or downsampling) can be applied to the key frame so that its size matches the input size of the neural network. For the structure shown in Table 1, for example, the size of the input image frame may be 224 × 224 × 3.
With step S204, the fingerprint features of all the key frames of the video extracted in step S202 are obtained. Then, as shown in Fig. 2, in step S206 the fingerprint features of the plurality of video frames can be processed to generate the video fingerprint of the video. In some embodiments, the video fingerprint of the video can be determined by merging the fingerprint features of the plurality of video frames; that is, the video fingerprint contains information from the fingerprint features of all the key frames.
Fig. 3 shows a schematic flowchart of the video fingerprint generation step S206 according to an embodiment of the disclosure.
As shown in Fig. 3, step S206 may further include step S2062. In step S2062, dimension reduction can be performed on the fingerprint features of the plurality of video frames to obtain a plurality of reduced fingerprint features.
In some embodiments, the dimension reduction may include a pooling step and a principal component analysis step. For example, pooling may be performed on the fingerprint features of the plurality of video frames to obtain a plurality of pooled fingerprint features. The pooling may be one or more of average pooling, max pooling, or min pooling. Taking the Mobilenet network structure shown in Table 1 as an example, as mentioned above, the feature output by the fifth-to-last layer of the network can be used as the fingerprint feature of the processed key frame. As shown in Table 1, the size of the fingerprint feature output by that layer is 7 × 7 × 1024. Pooling can then reduce the dimensionality of this fingerprint feature; for example, the size of the pooled fingerprint feature may be 1 × 1 × 1024.
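The 7 × 7 × 1024 → 1 × 1 × 1024 pooling above, taking average pooling as the example, can be sketched as:

```python
import numpy as np

def global_average_pool(feature_map):
    """Average-pool a (H, W, C) intermediate feature map over its spatial
    dimensions, e.g. (7, 7, 1024) -> (1024,), matching the pooled fingerprint
    feature size in the text."""
    return feature_map.mean(axis=(0, 1))

fp = np.random.rand(7, 7, 1024)
pooled = global_average_pool(fp)
print(pooled.shape)  # → (1024,)
```

Max or min pooling, the other options the text mentions, would use `feature_map.max(axis=(0, 1))` or `feature_map.min(axis=(0, 1))` in the same way.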
Then, principal component analysis may be performed on the multiple pooled fingerprint features to obtain multiple reduced fingerprint features. The principal component analysis may include multiplying each of the pooled fingerprint features by a preset projection matrix to reduce its dimensionality. In some embodiments, the projection matrix may be converted into the form of a corresponding convolution kernel and convolved with the pooled fingerprint features, which yields the same result as the matrix multiplication. In some embodiments, the projection matrix used for the principal component analysis may be determined from a preset image data set. For example, a d-dimensional image data set containing multiple types of image information may be used to compute a covariance matrix for the data set. In the example described above in this disclosure, d = 1024. The eigenvalues and corresponding eigenvectors of the covariance matrix may then be computed, and the eigenvectors corresponding to the k largest eigenvalues may be selected, where k is an integer smaller than d. The k eigenvectors corresponding to the k largest eigenvalues form a projection matrix that reduces a d-dimensional fingerprint feature to k dimensions. In one example, k = 128. It can be understood that those skilled in the art may set the value of k according to the actual situation, to obtain reduced fingerprint features of different sizes.
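The construction of the projection matrix from the top-k eigenvectors of the covariance matrix can be sketched as follows (a toy-scale sketch: d = 32 and k = 8 stand in for the 1024 and 128 of the example above, and the Gaussian data stands in for the preset image data set):

```python
import numpy as np

def fit_pca_projection(features: np.ndarray, k: int) -> np.ndarray:
    """Build a d x k projection matrix from the eigenvectors of the
    covariance matrix of an (n_samples, d) feature set that correspond
    to the k largest eigenvalues."""
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return top_k                                  # d x k projection matrix

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 32))                 # stand-in for the image data set
proj = fit_pca_projection(data, k=8)
reduced = data @ proj                             # each 32-dim feature -> 8-dim
print(reduced.shape)                              # (500, 8)
```

Multiplying a 1 × 1 × d pooled feature by `proj` is equivalent to a 1 × 1 convolution with k output channels whose weights are the k eigenvectors, which is the convolution-kernel formulation mentioned above.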
Although the reduced fingerprint features lose some image feature information, it has been verified that, when comparing fingerprints of similar videos, the accuracy of successful matching using the reduced fingerprint features generated by the above method drops by only 0.05% compared with the non-reduced fingerprint features, still achieving a matching accuracy of 96%. It can therefore be considered that the reduced fingerprint features describe the video content essentially as well as the complete, non-reduced fingerprint features. Since the amount of data in the reduced fingerprint features is greatly decreased, the speed of comparing fingerprints between videos can be increased accordingly.
Step S206 may also include step S2064. In step S2064, the multiple reduced fingerprint features may be concatenated to generate the video fingerprint of the video. For example, when the number of key frames of the video is 3, the reduced fingerprint features generated for the 3 key frames may be concatenated. Taking 128-dimensional fingerprint features as an example, the concatenation yields a 384-dimensional feature, which may serve as the video fingerprint of the video.
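The concatenation in step S2064 amounts to the following (toy values: three random 128-dimensional vectors stand in for the reduced fingerprint features of 3 key frames):

```python
import numpy as np

# Hypothetical reduced fingerprint features, one per key frame.
frame_fingerprints = [np.random.rand(128) for _ in range(3)]

# Concatenating the per-frame features yields the 384-dim video fingerprint.
video_fingerprint = np.concatenate(frame_fingerprints)
print(video_fingerprint.shape)  # (384,)
```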
Alternatively, in step S2064, a transformation may be applied to the multiple reduced fingerprint features to generate the video fingerprint of the video. For example, a feature hashing transformation may be applied to the multiple reduced fingerprint features, and the result of the feature hashing may be determined as the video fingerprint of the video.
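The disclosure does not fix a particular feature hash; one common choice (shown here purely as an assumption) is a random-hyperplane sign hash, which maps a real-valued fingerprint to a compact binary code:

```python
import numpy as np

def sign_hash(fingerprint: np.ndarray, n_bits: int, seed: int = 0) -> np.ndarray:
    """Map a real-valued fingerprint to an n_bits binary code by taking the
    sign of projections onto random hyperplanes (a locality-sensitive hash)."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(fingerprint.shape[0], n_bits))
    return (fingerprint @ planes > 0).astype(np.uint8)

fp = np.random.rand(384)          # a hypothetical concatenated fingerprint
code = sign_hash(fp, n_bits=64)
print(code.shape)                 # (64,)
```

With a fixed seed the hash is deterministic, so the same fingerprint always yields the same code, as a comparable video fingerprint requires.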
With the method for extracting a video fingerprint provided by the present disclosure, the fingerprint feature extraction is performed on only some of the video frames in a video, which increases the speed at which the video fingerprint is generated. In addition, by examining the clustering effect of the image features output by each layer of the neural network, the output of the intermediate layer with the best clustering effect can be selected to generate the video fingerprint, so that the generated video fingerprint performs better when identifying similar videos. Further, by performing principal component analysis on the output of the neural network, the accuracy of matching videos with similar content is preserved while, excluding video decoding time, generating the fingerprint of an entire video takes only 160 milliseconds on a single CPU core. Reducing the dimensionality of the generated video fingerprint features greatly increases the speed of comparing fingerprints between videos, and can therefore support content-based video search applications.
Fig. 4 shows a schematic diagram of a device for extracting a video fingerprint according to an embodiment of the present disclosure. The client and/or server shown in Fig. 1 may be implemented as the device for extracting a video fingerprint shown in Fig. 4.
As shown in Fig. 4, the device 400 may include a video frame extraction unit 410 configured to extract multiple video frames from a video. The device 400 may extract image features from the image information of the multiple video frames extracted by the video frame extraction unit 410, and use the image features extracted from the multiple video frames as the fingerprint features of the corresponding video frames. The video frames used for fingerprint feature extraction are also referred to as key frames.
In some embodiments, the video frame extraction unit 410 may be configured to determine, from the video, an effective video segment used for extracting the video fingerprint, and to extract key frames from the effective video segment. In some embodiments, a part of the video may be determined as the effective video segment. For example, the video may be cut into segments of a predetermined duration, and the resulting segments used as effective video segments. In other embodiments, the entire video may be determined as the effective video segment.
The video frame extraction unit 410 may be configured to extract key frames from the determined effective video segment. In some embodiments, every frame in the effective video segment may be determined as a key frame. In other embodiments, each frame in the effective video segment may be analyzed to segment the effective video segment into shots, and a key frame may be determined for each shot according to the result of the shot segmentation. In yet other embodiments, video frames may be selected as key frames by sampling the effective video segment. For example, the effective video segment may be sampled at equal intervals, and the sampled video frames used as key frames. As another example, the effective video segment may be sampled at arbitrary intervals, and the sampled video frames used as key frames.
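The equal-interval sampling variant can be sketched in a few lines (the frame count and interval are illustrative values, not prescribed by the disclosure):

```python
def sample_key_frames(n_frames: int, interval: int) -> list[int]:
    """Return the indices of key frames picked from an effective video
    segment at a fixed sampling interval."""
    return list(range(0, n_frames, interval))

# A 300-frame segment sampled every 30 frames yields 10 key frames.
print(sample_key_frames(300, 30))  # [0, 30, 60, ..., 270]
```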
The device 400 may also include a fingerprint feature extraction unit 420, which may be configured to process each of the multiple video frames using a neural network having multiple layers, wherein the multiple layers include at least one convolutional layer, each of the at least one convolutional layer performs convolution on the output of its preceding layer, and the convolution result output by an intermediate layer of the neural network is used as the fingerprint feature of the video frame.
In some embodiments, the neural network may be a neural network trained on an image classification task. Such a neural network may be composed of one or more of convolutional layers, pooling layers, fully connected layers, and activation functions. For example, the neural network with multiple layers used in the present disclosure may be a deep neural network formed by stacking multiple convolutional layers and at least one pooling layer. Structures such as Alexnet, Vggnet, Resnet, Googlenet, and Mobilenet may be used to implement the neural network used in embodiments of the present disclosure. In the example described below, the Mobilenet network shown in Table 1 is used as an example to describe the device for extracting a video fingerprint provided by the present disclosure.
As described above, a clustering analysis is performed on the image features output by each layer of the neural network structure shown in Table 1 to determine which layer's output to select as the fingerprint feature for the key frame. That is, a clustering algorithm (such as K-means clustering, mean-shift clustering, or a density-based clustering method) may be applied to the image features output by different intermediate layers of the neural network (such as the second-from-last layer, the fifth-from-last layer, or any other intermediate layer) to examine the image clustering effect achieved by the image features output by each intermediate layer. As described above, for the illustrative neural network structure shown in Table 1, it may be determined, for example, that the image features output by its fifth-from-last layer have the best clustering effect; therefore, as shown in Table 1, the image features output by this layer may be used as the fingerprint feature of the processed key frame.
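The layer-selection idea can be illustrated with a toy experiment (entirely assumed data: the two feature sets stand in for per-layer features of labelled probe images, and cluster purity stands in for a clustering-quality measure, which the disclosure does not prescribe). The layer whose features cluster the labelled images more cleanly should score higher:

```python
import numpy as np

def kmeans2(x: np.ndarray, iters: int = 50) -> np.ndarray:
    """Minimal 2-cluster K-means; centers are seeded deterministically
    from the first and last rows of x."""
    centers = x[[0, -1]].astype(float)
    for _ in range(iters):
        labels = np.argmin(((x[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in (0, 1):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def purity(labels: np.ndarray, true: np.ndarray) -> float:
    """Fraction of points whose cluster's majority class matches their label
    (hard-coded for two clusters and two classes)."""
    return sum(max((labels[true == c] == j).sum() for c in (0, 1))
               for j in (0, 1)) / len(true)

rng = np.random.default_rng(1)
true = np.repeat([0, 1], 50)          # labels of 100 hypothetical probe images
layer_feats = {                        # "layer_b" separates the classes far better
    "layer_a": rng.normal(size=(100, 8)) + true[:, None] * 0.3,
    "layer_b": rng.normal(size=(100, 8)) + true[:, None] * 5.0,
}
scores = {name: purity(kmeans2(f), true) for name, f in layer_feats.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Under this toy setup "layer_b" should win, mirroring how the fifth-from-last layer is selected for the structure of Table 1.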
Those skilled in the art will understand that, when another neural network structure is selected for extracting the image features of the key frames, an intermediate layer of the selected neural network whose output is to be used as the fingerprint feature of the processed key frames may be chosen according to the above method. Those skilled in the art may, according to the actual situation, select the image features output by the intermediate layer with the best clustering effect as the fingerprint feature of the processed key frames. In addition, depending on the application scenario of the video fingerprint, those skilled in the art may apply other criteria to select the image features output by any intermediate layer of the neural network structure as the fingerprint feature of the processed key frames.
In some embodiments, the fingerprint feature extraction unit 420 may be configured to use the single-instruction, multiple-data (SIMD) capability of a processor (such as a CPU) to accelerate the depthwise convolution. When a convolution operation needs to be performed, the instructions that perform the convolution on the data in different channels of the processed image features are identical; therefore, the SIMD capability can be used so that each instruction operates on the data of multiple channels in parallel, improving the throughput of the processor. In addition, when the size of the convolution kernel is greater than 1, the kernel positions overlap as the kernel slides over the feature map; keeping the data of the overlapping part in the processor's registers reduces the number of memory accesses and thus improves computational efficiency. The SIMD capability of the processor may be invoked using assembly instructions.
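The channel-parallel arithmetic described above can be illustrated in NumPy, whose vectorized channel operations serve only as an analogy for SIMD (not actual assembly): each arithmetic step below touches the data of all channels at once, and each kernel tap reads an overlapping window of the feature map.

```python
import numpy as np

def depthwise_conv_valid(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Depthwise 'valid' convolution of an H x W x C input with a
    kh x kw x C kernel. The inner arithmetic operates on whole channel
    vectors, mirroring one SIMD instruction acting on many channels."""
    h, w, c = x.shape
    kh, kw, _ = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1, c))
    for i in range(kh):          # each kernel tap (i, j) reads an overlapping
        for j in range(kw):      # window of the feature map
            out += x[i:i + out.shape[0], j:j + out.shape[1], :] * k[i, j, :]
    return out

x = np.random.rand(7, 7, 1024)   # feature map sized as in Table 1
k = np.random.rand(3, 3, 1024)   # illustrative 3 x 3 depthwise kernel
print(depthwise_conv_valid(x, k).shape)  # (5, 5, 1024)
```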
Optionally, before image feature extraction is performed on a key frame using the neural network, the fingerprint feature extraction unit 420 may be configured to apply a size transformation (for example, up-sampling or down-sampling) to the key frame so that its size matches the input size of the neural network.
The device 400 may also include a video fingerprint generation unit 430, which may be configured to process the fingerprint features of the multiple video frames to generate the video fingerprint of the video. In some embodiments, the video fingerprint of the video may be determined by merging the fingerprint features of the multiple video frames. That is, the video fingerprint includes the information of the fingerprint features of all of the key frames.
As shown in Fig. 4, the video fingerprint generation unit 430 may include a dimensionality reduction subunit 431 and a concatenation subunit 432. The dimensionality reduction subunit 431 may be configured to perform dimensionality reduction separately on the fingerprint features of the multiple video frames to obtain multiple reduced fingerprint features.
In some embodiments, the dimensionality reduction may include a pooling step and a principal component analysis step. For example, pooling may be performed separately on the fingerprint features of the multiple video frames to obtain multiple pooled fingerprint features. The pooling may be one or more of average pooling, maximum pooling, and minimum pooling. As shown in Table 1, the size of the fingerprint feature output by the fifth-from-last layer is 7 × 7 × 1024. The fingerprint feature output by this layer may then be reduced by pooling to a size of 1 × 1 × 1024.
The principal component analysis step may include performing principal component analysis on the multiple pooled fingerprint features using a projection matrix for principal component analysis, to obtain multiple reduced fingerprint features. In some embodiments, the projection matrix for the principal component analysis is determined from a preset image data set. For example, a d-dimensional image data set containing multiple types of image information may be used to compute a covariance matrix for the data set. The eigenvalues and corresponding eigenvectors of the covariance matrix may then be computed, and the eigenvectors corresponding to the k largest eigenvalues may be selected, where k is an integer smaller than d. The k eigenvectors corresponding to the k largest eigenvalues form a projection matrix that reduces a d-dimensional fingerprint feature to k dimensions. It can be understood that those skilled in the art may set the value of k according to the actual situation, to obtain reduced fingerprint features of different sizes.
The concatenation subunit 432 may be configured to concatenate the multiple reduced fingerprint features to generate the video fingerprint of the video. For example, when the number of key frames of the video is 3, the reduced fingerprint features generated for the 3 key frames may be concatenated. Taking 128-dimensional reduced fingerprint features as an example, the concatenation yields a 384-dimensional feature, which may serve as the video fingerprint of the video.
Alternatively, the concatenation subunit 432 may also be configured to apply a transformation to the multiple reduced fingerprint features to generate the video fingerprint of the video. For example, a feature hashing transformation may be applied to the multiple reduced fingerprint features, and the result of the feature hashing may be determined as the video fingerprint of the video.
With the device for extracting a video fingerprint provided by the present disclosure, the fingerprint feature extraction is performed on only some of the video frames in a video, which increases the speed at which the video fingerprint is generated. In addition, by examining the clustering effect of the image features output by each layer of the neural network, the output of the intermediate layer with the best clustering effect can be selected to generate the video fingerprint, so that the generated video fingerprint performs better when identifying similar videos. Further, by performing principal component analysis on the output of the neural network, the accuracy of matching videos with similar content is preserved while, excluding video decoding time, generating the fingerprint of an entire video takes only 160 milliseconds on a single CPU core. Reducing the dimensionality of the generated video fingerprint features greatly increases the speed of comparing fingerprints between videos, and can therefore support content-based video search applications.
In addition, a method or device according to an embodiment of the present disclosure may also be implemented by means of the architecture of the computing device shown in Fig. 5. As shown in Fig. 5, the computing device 500 may include a bus 510, one or more CPUs 520, a read-only memory (ROM) 530, a random access memory (RAM) 540, a communication port 550 connected to a network, an input/output component 560, a hard disk 570, and the like. A storage device in the computing device 500, such as the ROM 530 or the hard disk 570, may store various data or files used in the processing and/or communication of the method for extracting a video fingerprint provided by the present disclosure, as well as program instructions executed by the CPU. The computing device 500 may also include a user interface 580. Of course, the architecture shown in Fig. 5 is exemplary; when implementing different devices, one or more components of the computing device shown in Fig. 5 may be omitted according to actual needs.
Embodiments of the present disclosure may also be implemented as a computer-readable storage medium. Computer-readable instructions are stored on the computer-readable storage medium according to an embodiment of the present disclosure. When the computer-readable instructions are executed by a processor, the method according to the embodiments of the present disclosure described with reference to the above figures may be performed. The computer-readable storage medium includes, but is not limited to, for example, volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory.
Those skilled in the art will understand that various variations and improvements may be made to the content disclosed in the present disclosure. For example, the various devices or components described above may be implemented by hardware, software, firmware, or a combination of some or all of the three.
In addition, as used in the present disclosure and the claims, unless the context clearly indicates otherwise, the words "a", "an", and/or "the" do not specifically refer to the singular and may also include the plural. In general, the terms "comprise" and "include" indicate only the explicitly identified steps and elements; these steps and elements do not constitute an exclusive list, and a method or device may also include other steps or elements.
In addition, although the present disclosure makes various references to certain units in a system according to an embodiment of the present disclosure, any number of different units may be used and run on a client and/or server. The units are merely illustrative, and different units may be used in different aspects of the systems and methods.
In addition, flowcharts are used in the present disclosure to illustrate the operations performed by systems according to embodiments of the present disclosure. It should be understood that the preceding or following operations are not necessarily performed precisely in order; on the contrary, the various steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more steps may be removed from them.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meanings as commonly understood by those of ordinary skill in the art to which this invention belongs. It should also be understood that terms such as those defined in common dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant art, and should not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The above is a description of the invention and should not be considered a limitation thereof. Although several exemplary embodiments of the invention have been described, those skilled in the art will readily appreciate that many modifications can be made to the exemplary embodiments without materially departing from the teachings and advantages of the invention. Accordingly, all such modifications are intended to be included within the scope of the invention as defined by the claims. It should be understood that the above describes the invention and should not be considered limited to the particular embodiments disclosed; modifications to the disclosed embodiments and other embodiments are intended to be included within the scope of the appended claims. The invention is defined by the claims and their equivalents.

Claims (14)

1. A method for extracting a video fingerprint of a video, comprising:
extracting multiple video frames from a video;
for each of the multiple video frames, processing the video frame using a neural network having multiple layers, wherein the multiple layers include at least one convolutional layer, and each of the at least one convolutional layer performs convolution on the output of its preceding layer;
using the convolution result output by an intermediate layer of the neural network as the fingerprint feature of the video frame; and
processing the fingerprint features of the multiple video frames to generate the video fingerprint of the video.
2. The method according to claim 1, wherein processing the fingerprint features of the multiple video frames to generate the video fingerprint of the video comprises:
performing dimensionality reduction separately on the fingerprint features of the multiple video frames to obtain multiple reduced fingerprint features; and
concatenating the multiple reduced fingerprint features to generate the video fingerprint of the video.
3. The method according to claim 2, wherein performing dimensionality reduction on the fingerprint features of the multiple video frames comprises at least one of the following steps:
performing pooling separately on the fingerprint features of the multiple video frames to obtain multiple pooled fingerprint features, and
performing principal component analysis on the multiple pooled fingerprint features to obtain multiple reduced fingerprint features.
4. The method according to any one of claims 1-3, wherein extracting multiple video frames from a video comprises:
selecting multiple frames from the video at equal intervals as the multiple video frames.
5. The method according to any one of claims 1-4, wherein the neural network transforms the video frame into image data having multiple channels, and the method further comprises performing the convolution operation on the image data of the multiple channels in parallel.
6. The method according to any one of claims 1-5, wherein the neural network is a Mobilenet network.
7. A device for extracting a video fingerprint of a video, comprising:
a video frame extraction unit configured to extract multiple video frames from a video;
a fingerprint feature extraction unit configured to, for each of the multiple video frames, process the video frame using a neural network having multiple layers, wherein the multiple layers include at least one convolutional layer, each of the at least one convolutional layer performs convolution on the output of its preceding layer, and the convolution result output by an intermediate layer of the neural network is used as the fingerprint feature of the video frame; and
a video fingerprint generation unit configured to process the fingerprint features of the multiple video frames to generate the video fingerprint of the video.
8. The device according to claim 7, wherein the video fingerprint generation unit further comprises:
a dimensionality reduction subunit configured to perform dimensionality reduction separately on the fingerprint features of the multiple video frames to obtain multiple reduced fingerprint features; and
a concatenation subunit configured to concatenate the multiple reduced fingerprint features to generate the video fingerprint of the video.
9. The device according to claim 8, wherein the dimensionality reduction comprises at least one of the following steps:
performing pooling separately on the fingerprint features of the multiple video frames to obtain multiple pooled fingerprint features, and
performing principal component analysis on the multiple pooled fingerprint features to obtain multiple reduced fingerprint features.
10. The device according to any one of claims 7-9, wherein extracting multiple video frames from a video comprises:
selecting multiple frames from the video at equal intervals as the multiple video frames.
11. The device according to any one of claims 7-10, wherein the neural network transforms the video frame into image data having multiple channels, and the device is configured to perform the convolution operation on the image data of the multiple channels in parallel.
12. The device according to any one of claims 7-11, wherein the neural network is a Mobilenet network.
13. A device for extracting a video fingerprint, comprising a processor and a memory, wherein program instructions are stored in the memory, and when the program instructions are executed by the processor, the processor is configured to perform the method for extracting a video fingerprint of a video according to any one of claims 1-6.
14. A computer-readable storage medium having instructions stored thereon which, when executed by a processor, cause the processor to perform the method for extracting a video fingerprint of a video according to any one of claims 1-6.
CN201811353102.6A 2018-11-14 2018-11-14 Method, apparatus, device and computer readable medium for extracting video fingerprint Active CN110163061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811353102.6A CN110163061B (en) 2018-11-14 2018-11-14 Method, apparatus, device and computer readable medium for extracting video fingerprint


Publications (2)

Publication Number Publication Date
CN110163061A true CN110163061A (en) 2019-08-23
CN110163061B CN110163061B (en) 2023-04-07

Family

ID=67645250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811353102.6A Active CN110163061B (en) 2018-11-14 2018-11-14 Method, apparatus, device and computer readable medium for extracting video fingerprint

Country Status (1)

Country Link
CN (1) CN110163061B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781818A (en) * 2019-10-25 2020-02-11 Oppo广东移动通信有限公司 Video classification method, model training method, device and equipment
CN110830938A (en) * 2019-08-27 2020-02-21 武汉大学 Fingerprint positioning quick implementation method for indoor signal source deployment scheme screening
CN110996124A (en) * 2019-12-20 2020-04-10 北京百度网讯科技有限公司 Original video determination method and related equipment
CN111614991A (en) * 2020-05-09 2020-09-01 咪咕文化科技有限公司 Video progress determination method and device, electronic equipment and storage medium
CN113313065A (en) * 2021-06-23 2021-08-27 北京奇艺世纪科技有限公司 Video processing method and device, electronic equipment and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008395A (en) * 2014-05-20 2014-08-27 中国科学技术大学 Intelligent bad video detection method based on face retrieval
CN106686472A (en) * 2016-12-29 2017-05-17 华中科技大学 High-frame-rate video generation method and system based on depth learning
CN107392224A (en) * 2017-06-12 2017-11-24 天津科技大学 A kind of crop disease recognizer based on triple channel convolutional neural networks
CN107665261A (en) * 2017-10-25 2018-02-06 北京奇虎科技有限公司 Video duplicate checking method and device
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A kind of video content detection method based on light stream and neural network
CN108280233A (en) * 2018-02-26 2018-07-13 南京邮电大学 A kind of VideoGIS data retrieval method based on deep learning
CN108431823A (en) * 2015-11-05 2018-08-21 脸谱公司 With the system and method for convolutional neural networks process content



Also Published As

Publication number Publication date
CN110163061B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN110163061A (en) Method, apparatus, device and computer-readable medium for extracting video fingerprints
CN103927387B (en) Image retrieval system and related method and device
CN110889855B (en) Certificate photo matting method and system based on an end-to-end convolutional neural network
CN109117781B (en) Multi-attribute identification model establishing method and device and multi-attribute identification method
Kucer et al. Leveraging expert feature knowledge for predicting image aesthetics
Bianco et al. Predicting image aesthetics with deep learning
CN109413510B (en) Video abstract generation method and device, electronic equipment and computer storage medium
Tan et al. Selective dependency aggregation for action classification
Zhu et al. Depth2action: Exploring embedded depth for large-scale action recognition
CN104346630B (en) A cloud-based flower recognition method using heterogeneous feature fusion
CN108198177A (en) Image acquisition method, device, terminal and storage medium
Douze et al. The 2021 image similarity dataset and challenge
TWI719512B (en) Method and system for an algorithm using a pixel-channel shuffle convolutional neural network
CN110958469A (en) Video processing method and device, electronic equipment and storage medium
CN108492160A (en) Information recommendation method and device
CN111985281A (en) Image generation model generation method and device and image generation method and device
CN112132279A (en) Convolutional neural network model compression method, device, equipment and storage medium
CN108875751A (en) Image processing method and device, the training method of neural network, storage medium
CN113673613A (en) Multi-modal data feature expression method, device and medium based on contrast learning
Mejjati et al. Look here! a parametric learning based approach to redirect visual attention
CN115687670A (en) Image searching method and device, computer readable storage medium and electronic equipment
Luo et al. Image steganalysis with convolutional vision transformer
CN116206334A (en) Wild animal identification method and device
CN117372782A (en) Small sample image classification method based on frequency domain analysis
CN109697240A (en) A feature-based image search method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant