CN107977461A - Video feature extraction method and device - Google Patents

Video feature extraction method and device

Info

Publication number
CN107977461A
CN107977461A (application CN201711390947.8A)
Authority
CN
China
Prior art keywords
video
feature
hash
frame
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711390947.8A
Other languages
Chinese (zh)
Inventor
刘旭
丁大钧
赵丽丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meitu Technology Co Ltd
Original Assignee
Xiamen Meitu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meitu Technology Co Ltd
Priority to CN201711390947.8A
Publication of CN107977461A
Status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783: Retrieval characterised by using metadata automatically derived from the content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/751: Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present application provides a video feature extraction method and device. Multiple video frames are extracted from a target video to obtain a first video frame set, and the adjacent frame of each of those video frames is extracted to obtain a second video frame set. The first video frame set and the second video frame set are then combined to form adjacent-frame pairs and input into a deep convolutional network, which extracts the feature of each video frame. Finally, the extracted feature of each video frame is fed into a video feature hash layer composed of a sigmoid activation function, an adjacent-frame pair loss function, and a categorical cross-entropy loss function for computation, and the hash feature of the target video is obtained. This makes the feature representation of the video better reflect the video's content information and effectively improves the precision and utilization efficiency of the video feature representation.

Description

Video feature extraction method and device
Technical field
This application relates to the technical field of video processing, and in particular to a video feature extraction method and device.
Background technology
Hash coding of video can be applied in video-related fields such as video retrieval, video clustering, and video compression. Whether the hash feature obtained by an existing video hash feature extraction method is effective is determined by two factors: one is whether the extracted video features effectively represent the video content, and the other is the accuracy of the hash algorithm. Therefore, how to design a video hash feature extraction method so that the feature representation of a video better reflects the video's content information, and so that the precision and utilization of the video feature representation are improved, is a major problem that currently needs to be studied.
Summary of the invention
In view of this, the purpose of the application is to provide a video feature extraction method and device, so that the feature representation of a video better reflects the video's content information and the precision and utilization efficiency of the video feature representation are effectively improved.
In order to achieve the above object, the embodiments of the present application adopt the following technical solutions:
In one aspect, the application provides a video feature extraction method, including:
extracting multiple video frames from a target video to obtain a first video frame set, and extracting the adjacent frame of each video frame to obtain a second video frame set;
combining the first video frame set and the second video frame set to form adjacent-frame pairs, and inputting the pair set into a deep convolutional network to extract the feature of each video frame;
feeding the extracted feature of each video frame into a video feature hash layer composed of a sigmoid activation function, an adjacent-frame pair loss function, and a categorical cross-entropy loss function for computation, to obtain the hash feature of the target video;
wherein, in the adjacent-frame pair loss function, f1 denotes the feature representation of the first video frame set in the pair set, f2 denotes the feature representation of the second video frame set in the pair set, and m is a preset constraint factor.
In another aspect, the application provides a video feature extraction device, including:
an adjacent frame extraction module, configured to extract multiple video frames from a target video to obtain a first video frame set and to extract the adjacent frame of each video frame to obtain a second video frame set;
a convolutional network processing module, configured to combine the first video frame set and the second video frame set to form adjacent-frame pairs and input the pair set into a deep convolutional network to extract the feature of each video frame; and
a hash feature computation module, configured to feed the extracted feature of each video frame into a video feature hash layer composed of a sigmoid activation function, an adjacent-frame pair loss function, and a categorical cross-entropy loss function for computation, to obtain the hash feature of the target video;
wherein, in the adjacent-frame pair loss function, f1 denotes the feature representation of the first video frame set in the pair set, f2 denotes the feature representation of the second video frame set in the pair set, and m is a preset constraint factor.
Compared with the prior art, the video feature extraction method and device provided by the embodiments of the present application exploit the similarity of adjacent frame images in a video by designing an adjacent-frame pair loss function that minimizes the difference between the feature representations of adjacent frames, so that the feature representation of the video better reflects the video's content information. Secondly, based on the category of the video, a brand-new method is proposed for finding the hash coding positions that contribute most to the feature representation of the current category, which effectively improves the precision and utilization efficiency of the video feature representation.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the application and therefore should not be regarded as limiting its scope; those of ordinary skill in the art can obtain other relevant drawings from these drawings without creative effort.
Fig. 1 is a flow chart of a video feature extraction method provided by an embodiment of the present application.
Fig. 2 is a sub-flowchart of step S101 shown in Fig. 1.
Fig. 3 is a schematic diagram of the process of extracting video features through the deep hash network structure, provided by an embodiment of the present application.
Fig. 4 is a schematic diagram of the intra-pair loss function provided by an embodiment of the present application.
Fig. 5 is a schematic diagram of performing a similar-video retrieval task according to the Hamming distance of hash features, provided by an embodiment of the present application.
Fig. 6 is a block diagram of the video processing equipment used to realize the above video feature extraction method, provided by an embodiment of the present application.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the application. The components of the embodiments of the application generally described and illustrated in the drawings here can be arranged and designed in a variety of configurations.
Therefore, the following detailed description of the embodiments of the application provided in the accompanying drawings is not intended to limit the claimed scope of the application, but merely represents selected embodiments of the application. Based on the embodiments in the application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the application.
It should be noted that similar labels and letters represent similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings.
Fig. 1 is a flow chart of a video feature extraction method provided by an embodiment of the present application. With reference to Fig. 1, the steps S101 to S103 of the video feature extraction method are described in detail below.
Step S101: extract multiple video frames from the target video to obtain the first video frame set, and extract the adjacent frame of each of those video frames to obtain the second video frame set.
In detail, as shown in Fig. 2, step S101 can be realized through the following sub-steps S111 and S112.
In step S111, multiple video frames are extracted from the target video at a preset frame interval to obtain the first video frame set.
In step S112, the adjacent frame of each video frame in the first video frame set is extracted from the target video according to the preset frame interval to form the second video frame set.
In one example, as shown in Fig. 3, four video frames A, B, C, and D can first be extracted from the target video at the preset frame interval; these four video frames form the first video frame set. Then, the adjacent frames A', B', C', and D' of the four video frames A, B, C, and D are extracted from the target video, again according to the preset frame interval, to form the second video frame set. For example, the adjacent frame A' of video frame A (such as the frame immediately before or after A) can be found first according to video frame A, and then the adjacent frames B', C', and D' of B, C, and D are found starting from A' according to the preset frame interval, so that the combination forms the second video frame set.
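A minimal sketch of this sampling step, assuming OpenCV-style frame indexing; the frame interval, frame count, and neighbor offset values are hypothetical, since the patent does not specify them:

```python
import cv2  # OpenCV for video decoding

def extract_frame_pairs(video_path, interval=30, num_frames=4, neighbor_offset=1):
    """Sample num_frames frames at a fixed interval (first set) and the
    adjacent frame of each sampled frame (second set), per steps S111/S112."""
    cap = cv2.VideoCapture(video_path)
    first_set, second_set = [], []
    for i in range(num_frames):
        # Frame for the first set, sampled at the preset interval.
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * interval)
        ok, frame = cap.read()
        if not ok:
            break
        first_set.append(frame)
        # Its adjacent frame (e.g. the next frame) for the second set.
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * interval + neighbor_offset)
        ok, neighbor = cap.read()
        if ok:
            second_set.append(neighbor)
    cap.release()
    return first_set, second_set
```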
Step S102: combine the first video frame set and the second video frame set to form adjacent-frame pairs, and input the pair set into the deep convolutional network to extract the feature of each video frame. As shown in Fig. 3, after the first video frame set and the second video frame set are obtained, they are combined to form adjacent video frame pairs and input into the deep convolutional network, which extracts the video frame features.
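A sketch of this pairing step; the patent does not name a specific architecture, so a torchvision resnet18 is only an assumed stand-in for the deep convolutional network:

```python
import torch
from torchvision.models import resnet18

# Shared CNN backbone (placeholder for the patent's unspecified network).
backbone = resnet18(weights=None)
backbone.fc = torch.nn.Identity()  # expose the 512-d feature vector

def extract_pair_features(first_set, second_set):
    """Run both frame sets of each adjacent pair through the same network."""
    f1 = backbone(first_set)   # features of the first video frame set
    f2 = backbone(second_set)  # features of the adjacent (second-set) frames
    return f1, f2

# Usage with dummy batches of 4 frames, 3x224x224 each (assumed preprocessing).
f1, f2 = extract_pair_features(torch.randn(4, 3, 224, 224),
                               torch.randn(4, 3, 224, 224))
```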
Step S103: feed the extracted feature of each video frame into the video feature hash layer composed of the sigmoid activation function, the adjacent-frame pair loss function, and the categorical cross-entropy loss function for computation, to obtain the hash feature of the target video.
Wherein, in the adjacent-frame pair loss function, f1 denotes the feature representation of the first video frame set in the pair set, f2 denotes the feature representation of the second video frame set in the pair set, and m is a preset constraint factor.
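The formula itself is rendered only as an image in the published text; as an assumption consistent with the stated goal (the difference between adjacent-frame features is constrained by the preset factor m), a margin form such as the following is plausible:

```latex
L_{\mathrm{pair}} = \max\left(0,\ \lVert f_1 - f_2 \rVert_2^2 - m\right)
```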
In the embodiment of the present invention, the adjacent-frame pair loss function is a function specially designed for the hash feature extraction procedure, also referred to as the intra-pair loss function. The function imposes a constraint on the feature representations of adjacent frames, so that the feature representations of adjacent frames of the same video are as similar as possible. Fig. 4 is a schematic diagram of the intra-pair loss function. In it, grids 1-8 represent the feature representations of frames cut from the video at intervals, in order; adjacent frames are extracted to form frame groups, the feature of each frame group is produced as intra-pair input to the network, and the difference between each pair of features is constrained to be as small as possible. Because the content information of adjacent frames in a video is essentially the same while their shallow image information differs, video features constrained by the intra-pair loss function better reflect the content-level information of the video and are insensitive to shallow information.
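A minimal PyTorch-style sketch of this constraint, under the assumed margin form given above; the feature dimension and margin value are placeholders, not values from the patent:

```python
import torch
import torch.nn.functional as F

def intra_pair_loss(f1, f2, m=0.5):
    """Constrain adjacent-frame feature pairs: penalize pairs whose squared
    L2 distance exceeds the preset constraint factor m (assumed margin form)."""
    d = (f1 - f2).pow(2).sum(dim=1)   # squared feature distance per pair
    return F.relu(d - m).mean()       # zero loss once a pair is within m

# Usage: features of the two frame sets produced by the shared backbone.
f1 = torch.randn(4, 128)
f2 = f1 + 0.01 * torch.randn(4, 128)  # adjacent frames: nearly identical content
loss = intra_pair_loss(f1, f2)        # close to zero for similar pairs
```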
In detail, in step S103, the extracted feature of each video frame can be mapped to the interval 0 to 1 by the sigmoid activation function, and the output of the sigmoid activation function is mapped to a binary code according to a set threshold to form the hash code, which is then output. The set threshold can be 0.5.
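A short sketch of this binarization step, using the 0.5 threshold mentioned above; the 512-bit code length is borrowed from the example given later in the text:

```python
import torch

def to_hash_code(features, threshold=0.5):
    """Map frame features through sigmoid to (0, 1), then binarize at the
    set threshold to form the output hash code."""
    probs = torch.sigmoid(features)             # each bit now lies in (0, 1)
    return (probs > threshold).to(torch.uint8)  # 0/1 hash bits

codes = to_hash_code(torch.randn(4, 512))  # e.g. a 512-bit hash per frame
```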
After the hash feature of the target video is obtained by the above extraction, the application stage of the hash feature can proceed in the following manner.
First, the hash code output by the sigmoid activation function is passed through a category mask matrix to generate the target binary hash code. The category mask matrix is an M*N matrix, where M is the number of video categories obtained in advance and N is the hash code length. Each video category corresponds to a weight parameter vector whose length is the hash code length. These weight parameters are used to obtain the key positions corresponding to the most important hash features of the current video; the key positions are related to the category the video belongs to, and the key positions corresponding to each category are different.
Then, the category of the target video is obtained, and the corresponding weight parameters are obtained according to that category. After the corresponding weight parameters are obtained, their absolute values can be taken and sorted; finally, the positions of a predetermined number of hash code bits in the hash feature of the target video are obtained according to the sorting result and serve as the target positions of the hash feature representation of the target video.
In detail, the weight parameters can be sorted in descending order of absolute value, and the hash code positions of the parameters with larger absolute values are regarded as the more important positions (the target positions) of the hash feature expression of the current video (such as the target video). Finally, a specific ratio can be set according to the sorting result to take out the leading key positions of a specific length as the target positions. For example, the hash code positions corresponding to the top 20% of weight parameters can be taken as the target positions.
The shape of the category matrix is the category count * hash code length. For example, if there are 101 categories and the hash code length is 512, the shape is 101*512. Then, the number of weight parameters taken out for a certain category (such as the category of the target video) is 1*512. After these 512 parameters are sorted by absolute value, the target positions corresponding to some of the weight parameters can be taken out according to the specific ratio.
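A sketch of this key-position selection, using the 101-category, 512-bit example above; the mask matrix values and the category index are placeholders:

```python
import numpy as np

def target_positions(mask_matrix, category, ratio=0.2):
    """Select the hash bit positions whose weights for the given category
    have the largest absolute values (e.g. the top 20%)."""
    weights = mask_matrix[category]        # one 1*N weight row per category
    order = np.argsort(-np.abs(weights))   # positions by descending |weight|
    k = int(len(weights) * ratio)          # keep the leading specific ratio
    return np.sort(order[:k])              # key (target) positions

mask = np.random.randn(101, 512)  # placeholder 101*512 category mask matrix
positions = target_positions(mask, category=7)  # the ~102 most important bits
```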
Finally, based on the target positions determined above, in the application stage of the video hash feature, as shown in Fig. 5, a similar-video retrieval task can be performed according to the Hamming distance of the hash features. Only the most important hash feature positions of the query video and of the videos in the retrieval data set are compared, which can effectively improve retrieval precision.
For example, retrieval can be carried out by calculating the Hamming distance between the hash feature corresponding to the target positions in the target video (the query video) and the hash feature corresponding to the target positions in each video in the retrieval data set, and then querying the retrieval data set for videos related to the target video according to the calculated Hamming distances.
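A sketch of this masked Hamming-distance retrieval, reusing the hypothetical target_positions helper assumed above; the database contents are dummy data:

```python
import numpy as np

def hamming_retrieve(query_code, db_codes, positions, top_k=5):
    """Rank database videos by Hamming distance computed only over the
    category's target positions, per the masked retrieval described above."""
    q = query_code[positions]                      # restrict query to target positions
    d = (db_codes[:, positions] != q).sum(axis=1)  # per-video Hamming distance
    return np.argsort(d)[:top_k]                   # indices of the closest videos

query = np.random.randint(0, 2, 512, dtype=np.uint8)
database = np.random.randint(0, 2, (1000, 512), dtype=np.uint8)
nearest = hamming_retrieve(query, database, positions)  # reuses positions above
```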
The inventors verified this with real cases, analyzing the precision-recall curves obtained when hash codes generated by category mask matrices with different weight coefficients are used for retrieval on the UCF101 [1] video data set. They found that when the weight coefficient is 0.3 to 0.4, the retrieval effect of the hash codes is best, far above the retrieval effect without the category mask matrix. When the weight coefficient takes other values, the retrieval effect is also better than without the category mask matrix.
Fig. 6 is a schematic diagram of the video processing equipment 100 provided by an embodiment of the present invention for realizing the video feature extraction method.
The video processing equipment 100 may be, but is not limited to, a personal computer (PC), a laptop, a server, or other computer equipment with video analysis and processing capabilities.
The video processing equipment 100 further includes a video feature extraction device 11, a memory 12, and a processor 13. In the preferred embodiment of the present invention, the video feature extraction device 11 includes at least one software function module that can be stored in the memory 12 in the form of software or firmware or solidified in the operating system (OS) of the video processing equipment 100. The processor 13 is used to execute the executable software modules stored in the memory 12, such as the software function modules and computer programs included in the video feature extraction device 11. In this embodiment, the video feature extraction device 11 can also be integrated in the operating system as a part of it. Specifically, the video feature extraction device 11 includes an adjacent frame extraction module 111, a convolutional network processing module 112, and a hash feature computation module 113. It should be noted that in other embodiments, some of the above function modules included in the video feature extraction device 11 can be omitted, or other, additional function modules can be included. The above function modules are described in detail below.
The adjacent frame extraction module 111 is used to extract multiple video frames from the target video to obtain the first video frame set and to extract the adjacent frame of each video frame to obtain the second video frame set.
In detail, the adjacent frame extraction module 111 can be used to perform the above step S101. The adjacent frame extraction module 111 can first extract multiple video frames from the target video at the preset frame interval to obtain the first video frame set, and then extract the adjacent frame of each video frame in the first video frame set from the target video according to the preset frame interval to form the second video frame set.
The convolutional network processing module 112 is used to combine the first video frame set and the second video frame set to form adjacent-frame pairs and input the pair set into the deep convolutional network to extract the feature of each video frame. In detail, the convolutional network processing module 112 can be used to perform the above step S102; for details of this module, refer to the description of step S102.
The hash feature computation module 113 is used to feed the extracted feature of each video frame into the video feature hash layer composed of the sigmoid activation function, the adjacent-frame pair loss function, and the categorical cross-entropy loss function for computation, to obtain the hash feature of the target video.
Wherein, in the adjacent-frame pair loss function, f1 denotes the feature representation of the first video frame set in the pair set, f2 denotes the feature representation of the second video frame set in the pair set, and m is a preset constraint factor.
In this embodiment, the extracted feature of each video frame can be mapped to the interval 0 to 1 by the sigmoid activation function, and the output of the sigmoid activation function is mapped to a binary code according to the set threshold to form the hash code, which is then output. The set threshold can be 0.5.
In detail, the hash feature computation module 113 can be used to perform the above step S103; for details of this module, refer to the description of step S103.
In this embodiment, as shown in Fig. 6, the video feature extraction device 11 can also include a hash feature retrieval module 114, used to pass the hash code output by the sigmoid activation function through a category mask matrix to generate the target binary hash code, where the category mask matrix is an M*N matrix, M is the number of video categories obtained in advance, N is the hash code length, and each category corresponds to a weight parameter vector whose length is the hash code length; to obtain the corresponding weight parameters according to the category of the target video; and, after sorting the weight parameters by absolute value, to obtain the positions of a predetermined number of hash code bits in the hash feature of the target video according to the sorting result as the target positions of the hash feature representation of the target video.
In addition, in this embodiment, the video feature extraction device 11 can also include a video query module 115, used to calculate the Hamming distance between the hash feature corresponding to the target positions in the target video (the query video) and the hash feature corresponding to the target positions in each video in the retrieval data set, and then to query the retrieval data set for videos related to the target video according to the calculated Hamming distances.
In conclusion video feature extraction method and device provided by the embodiments of the present application, for contiguous frames figure in video As similar characteristic, design contiguous frames are to loss function so that the feature representation difference of neighbouring interframe minimizes, so that video Feature representation more can reflecting video content information.Secondly, the classification based on video, it is proposed that a brand-new method is used for looking for To the precision for for the maximum Hash coding site of current class feature representation contribution, effectively raising video features expression and Utilization ratio.
In the embodiments provided by the application, it should be understood that the disclosed apparatus and method can also be realized in other ways. The embodiments described above are only schematic; for example, the flow charts and block diagrams in the drawings show the possible architectures, functions, and operations of the devices, methods, and computer program products according to the embodiments of the application. In this regard, each block in a flow chart or block diagram can represent a module, program segment, or part of code, which contains one or more executable instructions for realizing the corresponding logic function.
Furthermore, it should also be noted that in some alternative implementations, the functions marked in the blocks can also occur in an order different from that marked in the drawings. For example, two consecutive blocks can actually be executed substantially in parallel, and they can sometimes also be executed in the opposite order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flow charts, and combinations of blocks in the block diagrams and/or flow charts, can be realized by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the function modules in the embodiments of the application can be integrated to form an independent part, the modules can exist separately, or two or more modules can be integrated to form an independent part.
If the functions are realized in the form of software function modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for making a computer device (which can be a personal computer, a server, a network device, etc.) perform all or part of the steps of the methods of the embodiments of the application. The foregoing storage medium includes various media that can store program code, such as a USB flash disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any such actual relation or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or equipment including a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article, or equipment. In the absence of further restrictions, an element limited by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or equipment that includes the element.
Finally, it should be noted that the above are merely preferred embodiments of the application and are not intended to limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the application shall be included within the scope of protection of the application.

Claims (10)

  1. A video feature extraction method, characterized in that the method includes:
    extracting multiple video frames from a target video to obtain a first video frame set, and extracting the adjacent frame of each video frame to obtain a second video frame set;
    combining the first video frame set and the second video frame set to form adjacent-frame pairs, and inputting the pair set into a deep convolutional network to extract the feature of each video frame;
    feeding the extracted feature of each video frame into a video feature hash layer composed of a sigmoid activation function, an adjacent-frame pair loss function, and a categorical cross-entropy loss function for computation, to obtain the hash feature of the target video;
    wherein, in the adjacent-frame pair loss function, f1 denotes the feature representation of the first video frame set in the pair set, f2 denotes the feature representation of the second video frame set in the pair set, and m is a preset constraint factor.
  2. The video feature extraction method as claimed in claim 1, characterized in that extracting multiple video frames from the target video to obtain the first video frame set and extracting the adjacent frame of each video frame to obtain the second video frame set include:
    extracting multiple video frames from the target video at a preset frame interval to obtain the first video frame set;
    extracting the adjacent frame of each video frame in the first video frame set from the target video according to the preset frame interval to form the second video frame set.
  3. The video feature extraction method as claimed in claim 1, characterized in that the step of feeding the extracted feature of each video frame into the video feature hash layer composed of the sigmoid activation function, the adjacent-frame pair loss function, and the categorical cross-entropy loss function for computation, to obtain the hash feature of the target video, includes:
    mapping the extracted feature of each video frame to the interval 0 to 1 by the sigmoid activation function, and mapping the output of the sigmoid activation function to a binary code according to a set threshold to form the hash code and output it.
  4. The video feature extraction method as claimed in claim 3, characterized in that the method further includes:
    passing the hash code output by the sigmoid activation function through a category mask matrix to generate a target binary hash code, the category mask matrix being an M*N matrix, where M is the number of video categories obtained in advance, N is the hash code length, and each category corresponds to a weight parameter vector whose length is the hash code length;
    obtaining the category of the target video, and obtaining the corresponding weight parameters according to the category;
    sorting the weight parameters after taking their absolute values; and
    obtaining the positions of a predetermined number of hash code bits in the hash feature of the target video according to the sorting result, as the target positions of the hash feature representation of the target video.
  5. The video feature extraction method as claimed in any one of claims 1-4, characterized in that the method further includes:
    calculating the Hamming distance between the hash feature corresponding to the target positions in the target video and the hash feature corresponding to the target positions in each video in a retrieval data set; and
    querying the retrieval data set for videos related to the target video according to the calculated Hamming distances.
  6. A video feature extraction device, characterized by including:
    an adjacent frame extraction module, configured to extract multiple video frames from a target video to obtain a first video frame set and to extract the adjacent frame of each video frame to obtain a second video frame set;
    a convolutional network processing module, configured to combine the first video frame set and the second video frame set to form adjacent-frame pairs and input the pair set into a deep convolutional network to extract the feature of each video frame; and
    a hash feature computation module, configured to feed the extracted feature of each video frame into a video feature hash layer composed of a sigmoid activation function, an adjacent-frame pair loss function, and a categorical cross-entropy loss function for computation, to obtain the hash feature of the target video;
    wherein, in the adjacent-frame pair loss function, f1 denotes the feature representation of the first video frame set in the pair set, f2 denotes the feature representation of the second video frame set in the pair set, and m is a preset constraint factor.
  7. The video feature extraction device as claimed in claim 6, characterized in that the adjacent frame extraction module obtains the first video frame set by extracting multiple video frames from the target video at a preset frame interval, and forms the second video frame set by extracting the adjacent frame of each video frame in the first video frame set from the target video according to the preset frame interval.
  8. The video feature extraction device as claimed in claim 6, characterized in that the hash feature computation module maps the extracted feature of each video frame to the interval 0 to 1 by the sigmoid activation function, and maps the output of the sigmoid activation function to a binary code according to a set threshold to form the hash code and output it.
  9. The video feature extraction device as claimed in claim 8, characterized by further including:
    a hash feature retrieval module, used to:
    pass the hash code output by the sigmoid activation function through a category mask matrix to generate a target binary hash code, the category mask matrix being an M*N matrix, where M is the number of video categories obtained in advance, N is the hash code length, and each category corresponds to a weight parameter vector whose length is the hash code length;
    obtain the category of the target video, and obtain the corresponding weight parameters according to the category;
    sort the weight parameters after taking their absolute values; and
    obtain the positions of a predetermined number of hash code bits in the hash feature of the target video according to the sorting result, as the target positions of the hash feature representation of the target video.
  10. The video feature extraction device as claimed in any one of claims 6-9, characterized by further including:
    a video query module, used to calculate the Hamming distance between the hash feature corresponding to the target positions in the target video and the hash feature corresponding to the target positions in each video in a retrieval data set, and to query the retrieval data set for videos related to the target video according to the calculated Hamming distances.
CN201711390947.8A 2017-12-21 2017-12-21 Video feature extraction method and device Pending CN107977461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711390947.8A CN107977461A (en) Video feature extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711390947.8A CN107977461A (en) Video feature extraction method and device

Publications (1)

Publication Number Publication Date
CN107977461A true CN107977461A (en) 2018-05-01

Family

ID=62007123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711390947.8A Pending CN107977461A (en) Video feature extraction method and device

Country Status (1)

Country Link
CN (1) CN107977461A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101635843A (en) * 2008-07-23 2010-01-27 北京大学 Method and system for extracting, seeking and comparing visual patterns based on frame-to-frame variation characteristics
US20160358036A1 (en) * 2011-05-18 2016-12-08 Microsoft Technology Licensing, Llc Searching for Images by Video
US20170083770A1 (en) * 2014-12-19 2017-03-23 Amazon Technologies, Inc. Video segmentation techniques
CN106407352A (en) * 2016-09-06 2017-02-15 广东顺德中山大学卡内基梅隆大学国际联合研究院 Traffic image retrieval method based on depth learning
CN106886768A (en) * 2017-03-02 2017-06-23 杭州当虹科技有限公司 A kind of video fingerprinting algorithms based on deep learning

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI750498B (en) * 2019-02-14 2021-12-21 開曼群島商創新先進技術有限公司 Method and device for processing video stream
CN110689023A (en) * 2019-08-15 2020-01-14 平安科技(深圳)有限公司 Reliable combination feature extraction method and device, computer equipment and storage medium
CN110689023B (en) * 2019-08-15 2024-01-16 平安科技(深圳)有限公司 Reliable combination feature extraction method, device, computer equipment and storage medium
CN111737519A (en) * 2020-06-09 2020-10-02 北京奇艺世纪科技有限公司 Method and device for identifying robot account, electronic equipment and computer-readable storage medium
CN111737519B (en) * 2020-06-09 2023-10-03 北京奇艺世纪科技有限公司 Method and device for identifying robot account, electronic equipment and computer readable storage medium
CN113343025A (en) * 2021-08-05 2021-09-03 中南大学 Sparse attack resisting method based on weighted gradient Hash activation thermodynamic diagram
CN113343025B (en) * 2021-08-05 2021-11-02 中南大学 Sparse attack resisting method based on weighted gradient Hash activation thermodynamic diagram

Similar Documents

Publication Publication Date Title
CN107977461A (en) Video feature extraction method and device
CN112434169B (en) Knowledge graph construction method and system and computer equipment thereof
CN104915447B (en) A kind of much-talked-about topic tracking and keyword determine method and device
Goloboff et al. Phylogenetic morphometrics (II): algorithms for landmark optimization
CN107895038B (en) Link prediction relation recommendation method and device
CN106909643A (en) The social media big data motif discovery method of knowledge based collection of illustrative plates
CN109166615B (en) Medical CT image storage and retrieval method based on random forest hash
CN111325030A (en) Text label construction method and device, computer equipment and storage medium
CN111475622A (en) Text classification method, device, terminal and storage medium
Chatterjee et al. Single document extractive text summarization using genetic algorithms
Winter et al. Fast indexing strategies for robust image hashes
CN107679135A (en) The topic detection of network-oriented text big data and tracking, device
CN104156464A (en) Micro-video retrieval method and device based on micro-video feature database
CN113240111B (en) Pruning method based on discrete cosine transform channel importance score
CN112559747A (en) Event classification processing method and device, electronic equipment and storage medium
Wei et al. Semantic pixel labelling in remote sensing images using a deep convolutional encoder-decoder model
Rusdi et al. Reconstruction of medical images using artificial bee colony algorithm
CN107070932B (en) Anonymous method for preventing label neighbor attack in social network dynamic release
CN115577701A (en) Risk behavior identification method, device, equipment and medium for big data security
CN110968802B (en) Analysis method and analysis device for user characteristics and readable storage medium
CN108090117A (en) A kind of image search method and device, electronic equipment
CN110019763B (en) Text filtering method, system, equipment and computer readable storage medium
Gao et al. A neural network classifier based on prior evolution and iterative approximation used for leaf recognition
EP2219121A1 (en) Efficient computation of ontology affinity matrices
Neelima et al. Optimal clustering based outlier detection and cluster center initialization algorithm for effective tone mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180501