CN108734106A

CN108734106A - Fast comparison-based method for identifying violent and terrorist videos

Info

Publication number: CN108734106A
Application number: CN201810366397.4A
Authority: CN
Inventors: 李兵; 胡卫明; 原春锋; 王博; 赵永帅; 刘琴
Original assignee: Institute of Automation, Chinese Academy of Sciences
Current assignee: Institute of Automation, Chinese Academy of Sciences
Priority date: 2018-04-23
Filing date: 2018-04-23
Publication date: 2018-11-02
Anticipated expiration: 2038-04-23
Also published as: CN108734106B

Abstract

The present invention relates to the field of video classification and proposes a fast comparison-based method for identifying violent and terrorist videos. It aims to solve the problem that, in the identification of such videos from visual features, the limited descriptive power of feature descriptors leads to low precision and recall. The method includes: performing shot segmentation on a video to be detected in order to select its key frames; using a pre-built violent and terrorist video identification model to perform hash coding on each key frame of the video to be detected, obtaining the hash code of each key frame; comparing the hash code of each key frame with pre-stored hash codes of video frames of violent and terrorist videos to determine the video frames similar to each key frame; and, if the number of video frames similar to the key frames exceeds a set threshold, determining that the video to be detected is a violent and terrorist video. The invention can quickly and accurately identify violent and terrorist videos among a large number of videos.

Description

Fast comparison-based method for identifying violent and terrorist videos

Technical field

The present invention relates to the technical field of computer vision, more particularly to the field of video classification, and specifically to a fast comparison-based method for identifying violent and terrorist videos.

Background technology

Violent and terrorist videos are videos whose content advocates violence and terrorism, religious extremism, ethnic separatism and the like. With the rapid development of network technology and the arrival of the mobile Internet era, more and more multimedia data is presented to people, and violent and terrorist videos have likewise been able to spread widely. At present, such videos are detected mainly through manual review and labeling, which consumes substantial financial and material resources. Faced with the ever-growing volume of data on the Internet, a new technology is therefore needed that automatically filters terrorist video and image content and supports monitoring and early warning in important public places.

The visual features currently applied to violent and terrorist video detection fall into two main classes: static features and dynamic features. Static features describe the content within a video frame, including color, texture and structure. They effectively reflect information such as the background, the environment and the appearance of the main subject; MPEG-7 is a typical example, providing visual descriptors such as CLD, CSD, SC and EH. Dynamic features describe inter-frame characteristics, including the amplitude, direction and frequency of motion, and effectively reflect how the main subject in the video moves. Dynamic features mostly use corner detection algorithms to extract trajectories, for example HOG, HOF and MoSIFT. MoSIFT detects local features, but this descriptor can only extract features where there is sufficient motion. The descriptive power of all these feature descriptors is limited, however, and they struggle to describe the content of video images comprehensively and accurately, especially because violent and terrorist videos require detection of specific targets. As a result, the precision and recall of detection are relatively low.

Invention content

To solve the above problem in the prior art, namely that when two videos contain multiple copied segments, copies of videos that have been edited cannot be accurately detected and the positions of the copied video segments cannot be accurately located, the present application provides a fast comparison-based method for identifying violent and terrorist videos.

The method provided by the present application comprises the following steps: performing shot segmentation on a video to be detected, on which violent and terrorist identification is to be carried out, in order to select the key frames of the video to be detected; using a pre-built violent and terrorist video identification model to perform hash coding on each key frame of the video to be detected, obtaining the hash code of each key frame, wherein the identification model is built on a hash network, takes a video frame as input and outputs the hash code of the input video frame; comparing the hash code of each key frame with the pre-stored hash codes of video frames of violent and terrorist videos to determine the video frames similar to each key frame; and counting the similar frames and, if the number of video frames similar to the key frames exceeds a set threshold, determining that the video to be detected is a violent and terrorist video.

In some examples, "performing shot segmentation on a video to be detected, on which violent and terrorist identification is to be carried out, in order to select the key frames of the video to be detected" includes: extracting the histogram of every video frame of the video to be detected and comparing the histograms of adjacent video frames to determine the shot boundaries of the video to be detected; and, according to the determined shot boundaries, selecting the start frame and/or end frame of each shot of the video to be detected as a key frame.

In some examples, "comparing the hash code of each key frame with the pre-stored hash codes of video frames of violent and terrorist videos to determine the video frames similar to each key frame" includes: comparing the hash code of each key frame with the hash codes of the video frames of the violent and terrorist videos in a video library; calculating the Hamming distance between the hash code of the key frame and the hash code of the video frame; and confirming a key frame and a video frame whose Hamming distance lies within a set range as similar frames.

In some examples, the violent and terrorist video identification model is trained as follows: the preset training sample pictures are classified into positive sample data and negative sample data, wherein the positive sample data are pairs of two violent and terrorist pictures and the negative sample data are pairs of one violent and terrorist picture and one non-violent picture; the training sample pictures are resized, and a region of a set size is randomly cropped from each resized training sample picture and subjected to sample mean processing; and an initial violent and terrorist video identification model is trained on the processed pictures, yielding the violent and terrorist video identification model based on a hash network.

In some examples, the network structure of the initial violent and terrorist video identification model includes an input layer, convolutional layers and fully connected layers, wherein the first layer is the input layer, the second to sixth layers are convolutional layers, and the seventh to ninth layers are fully connected layers.

In some examples, when the violent and terrorist video identification model is trained, the input layer receives the training sample pictures after sample mean processing.

In some examples, each convolutional layer receives the output of the preceding layer, applies the convolution of that layer, and outputs the result after applying that layer's activation function; each fully connected layer likewise receives the output of the preceding layer, applies the convolution of that layer, and outputs the result after applying that layer's activation function.

In some examples, the activation function of the second to eighth layers of the network structure of the initial violent and terrorist video identification model is:

ReLU(x) = max(0, x)

where ReLU(x) is the activation function and x is the output of that layer's convolution.

In some examples, the activation function of the ninth layer of the network structure of the initial violent and terrorist video identification model is:

where δ(x) is the result of taking the partial derivative with respect to b_i,j.

In some examples, the loss function used to train the violent and terrorist video identification model is:

where y_i indicates whether the sample pair is similar, that is, y_i = 1 means the two samples are similar and otherwise they are dissimilar; the Euclidean distance is taken between the binary codes of the two samples in the pair; ‖|b_i,1| − 1‖₁ and ‖|b_i,2| − 1‖₁ are the Manhattan distances between each sample's binary code and the all-ones vector; L_r is the loss function; m (m > 0) is the margin threshold parameter; α is the scaling factor; b_i,1 is the hash code of sample 1 and b_i,2 is the hash code of sample 2; N is the total number of training sample pairs; and k is the dimension of the hash code.
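The formula itself does not survive in this text. As a hedged reconstruction only, one form consistent with every symbol defined above is the pairwise loss familiar from deep supervised hashing, written here with this document's label convention (y_i = 1 for a similar pair). The exact weighting of the terms and the use of the squared distance are assumptions, not something the text confirms:

```latex
L_r = \sum_{i=1}^{N} \Big\{
      \tfrac{1}{2}\, y_i \,\lVert b_{i,1} - b_{i,2} \rVert_2^2
    + \tfrac{1}{2}\,(1 - y_i)\,\max\!\big(m - \lVert b_{i,1} - b_{i,2} \rVert_2^2,\; 0\big)
    + \alpha \big( \lVert\, |b_{i,1}| - \mathbf{1} \,\rVert_1
                 + \lVert\, |b_{i,2}| - \mathbf{1} \,\rVert_1 \big)
      \Big\},
\qquad b_{i,j} \in \mathbb{R}^{k}
```

Under this reading, the first term pulls the codes of similar pairs together, the second pushes the codes of dissimilar pairs apart until their squared distance reaches the margin m, and the third encourages every component of a code to stay close to +1 or −1 so that the output is nearly binary.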

In the fast comparison-based method for identifying violent and terrorist videos provided by the present application, the video to be checked is first subjected to structured analysis and its key frames are extracted; next, the violent and terrorist video identification model based on a hash network determines the hash code of each key frame of the video; then the hash codes of the key frames of the video to be detected are matched against the pre-stored hash codes of key frames of violent and terrorist videos to determine whether the video to be detected is a violent and terrorist video. Extracting key frames through structured analysis of the video to be detected achieves a good balance between the accuracy and the speed of shot detection. Comparing the hash codes of the key frames with the pre-stored hash codes makes it possible to judge quickly whether the video to be detected is a violent and terrorist video. Moreover, the pre-stored hash codes occupy little space and can be searched quickly. The invention can therefore identify violent and terrorist videos quickly and accurately.

Description of the drawings

Fig. 1 is a diagram of an exemplary system architecture to which the present application can be applied;

Fig. 2 is a flow diagram of one embodiment of the fast comparison-based method for identifying violent and terrorist videos of the present application;

Fig. 3 is a schematic diagram of the network structure of the hash network model in an embodiment of the fast comparison-based method for identifying violent and terrorist videos of the present application;

Fig. 4 is a flow diagram of an application example of the fast comparison-based method for identifying violent and terrorist videos of the present application.

Specific implementation mode

Preferred embodiments of the present invention are described below with reference to the drawings. Those skilled in the art should understand that these embodiments serve only to explain the technical principles of the invention and are not intended to limit its scope of protection.

It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.

Fig. 1 shows a schematic diagram of an exemplary system architecture to which an embodiment of the fast comparison-based method for identifying violent and terrorist videos of the present application can be applied.

As shown in Fig. 1, the system architecture may include a terminal device 101, a network 102 and a server 103. The network 102 is the medium that provides a communication link between the terminal device 101 and the server 103. The network 102 may include various connection types, such as wired links, wireless communication links or fiber-optic cables.

A user may use the terminal device 101 to interact with the server 103 over the network 102, for example to receive or send messages. Various communication client applications may be installed on the terminal device 101, such as web browser applications, video browsing applications, video upload applications and social platform software.

The terminal device 101 may be any of various electronic devices that have a display screen and support video browsing or video upload, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.

The server 103 may be a server providing various services, for example a video processing server or application platform that performs violent and terrorist identification on videos uploaded by the terminal device 101. The video processing server may analyze and otherwise process the video data uploaded by each terminal device connected to it over the network, and feed the processing result (for example, the violent and terrorist identification result for a video) back to the terminal device or to a third party.

It should be noted that the fast comparison-based method for identifying violent and terrorist videos provided by the embodiments of the present application is generally executed by the server 103; correspondingly, an apparatus to which the method of the present application can be applied is generally located in the server 103.

It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative. Depending on implementation needs, there may be any number of terminal devices, networks and servers.

With continued reference to Fig. 2, the flow of one embodiment of the fast comparison-based method for identifying violent and terrorist videos according to the present application is shown. The method includes the following steps:

Step 201: perform shot segmentation on the video to be detected, on which violent and terrorist identification is to be carried out, in order to select the key frames of the video to be detected.

In this embodiment, an electronic device (such as the server in Fig. 1) or an application platform to which the method can be applied obtains the video to be detected. The electronic device or application platform performs shot segmentation on the obtained video to extract its key frames. As an example, the video to be detected may be obtained from a terminal device connected to the electronic device or application platform: after a user of a terminal device connected to the server or application platform over the network uploads a video, the server or application platform takes that video as the video to be detected.

Specifically, "performing shot segmentation on the video to be detected, on which violent and terrorist identification is to be carried out, in order to select the key frames of the video to be detected" includes: extracting the histogram of every video frame of the video to be detected and comparing the histograms of adjacent video frames to determine the shot boundaries of the video to be detected; and, according to the determined shot boundaries, selecting the start frame and/or end frame of each shot of the video to be detected as a key frame. The histogram extracted for each video frame may be a grayscale histogram or a color histogram. That is, after the video to be detected has been split into a series of shots, the first frame or the last frame of each shot may be used as the key frame of that shot, or both the first frame and the last frame may be used as key frames.
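The histogram-based shot segmentation of step 201 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's own implementation: frames are taken to be flattened lists of grayscale pixel values (0 to 255), and the bin count, the difference measure (sum of absolute bin differences) and the cut threshold `cut_threshold` are hypothetical choices that the text does not specify.

```python
from typing import List, Sequence

Frame = Sequence[int]  # flattened grayscale pixels, values 0..255 (assumed format)


def gray_histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Normalized grayscale histogram of one frame."""
    hist = [0.0] * bins
    for pixel in frame:
        hist[pixel * bins // 256] += 1.0
    total = float(len(frame))
    return [count / total for count in hist]


def histogram_difference(h1: List[float], h2: List[float]) -> float:
    """Sum of absolute bin differences; lies in [0, 2] for normalized histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def shot_starts(frames: List[Frame], cut_threshold: float = 0.5) -> List[int]:
    """Indices of frames that start a new shot (frame 0 always starts one)."""
    if not frames:
        return []
    hists = [gray_histogram(f) for f in frames]
    starts = [0]
    for i in range(1, len(frames)):
        # A large histogram change between adjacent frames marks a shot boundary.
        if histogram_difference(hists[i - 1], hists[i]) > cut_threshold:
            starts.append(i)
    return starts


def key_frame_indices(frames: List[Frame], cut_threshold: float = 0.5,
                      use_first: bool = True, use_last: bool = False) -> List[int]:
    """Pick the first and/or last frame of every shot as key frames."""
    starts = shot_starts(frames, cut_threshold)
    if not starts:
        return []
    ends = [s - 1 for s in starts[1:]] + [len(frames) - 1]
    keys = set()
    for start, end in zip(starts, ends):
        if use_first:
            keys.add(start)
        if use_last:
            keys.add(end)
    return sorted(keys)
```

For example, a clip of three dark frames followed by two bright frames splits into two shots, so `key_frame_indices` selects frames 0 and 3 when only first frames are used.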

Step 202: use the pre-built violent and terrorist video identification model to perform hash coding on each key frame of the video to be detected, obtaining the hash code of each key frame.

In this embodiment, based on the key frames of the video to be detected selected in step 201, the electronic device or application platform uses the pre-built hash network model to compute the hash code of each key frame. Here, the violent and terrorist video identification model may be a deep convolutional neural network model, for example a Siamese network model; the Siamese network model, together with the designed hash loss, performs the hash coding of the key frames of the video to be detected. The identification model is built on a hash network; its input is a video frame and its output is the hash code of the input video frame.

To determine the hash code of a key frame, the identification model may evaluate the input frame picture and run the optimized deep convolutional neural network to complete the hash coding of the input key frame (picture). The model may operate on the features of the key frame. These may be static features, such as color, texture and structure, which reflect information such as the background, the environment and the appearance of the main subject, as well as dynamic features, such as the amplitude, direction and frequency of motion, which reflect how the main subject in the video moves. The hash code of the key frame is determined from these features.
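How the network's real-valued outputs become a compact hash code is not spelled out in the text. A minimal sketch, assuming the network is trained so that each of its k outputs lies near +1 or −1 and that the code is obtained by thresholding each output at zero, is given below; the function names `to_hash_code` and `code_to_bits` are hypothetical.

```python
from typing import Sequence


def to_hash_code(outputs: Sequence[float]) -> int:
    """Pack k real-valued outputs into a k-bit integer.

    Assumption: a positive output maps to bit 1, anything else to bit 0.
    The first output becomes the most significant bit.
    """
    code = 0
    for value in outputs:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def code_to_bits(code: int, k: int) -> str:
    """Render a k-bit hash code as a zero-padded bit string."""
    return format(code, "0{}b".format(k))
```

Storing each code as an integer keeps the pre-stored library small and lets two codes be compared with a single XOR followed by a bit count.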

Step 203: compare the hash code of each key frame with the pre-stored hash codes of video frames of violent and terrorist videos to determine the video frames similar to each key frame.

In this embodiment, based on the hash codes of the key frames of the video to be detected computed by the identification model in step 202, the electronic device or application platform compares them with the pre-stored hash codes to determine whether a key frame of the video to be detected is similar to a video frame of a violent and terrorist video. The pre-stored hash codes may be the hash codes of video frames of violent and terrorist videos.

Here, the pre-stored hash codes are obtained as follows: first, violent and terrorist videos are extracted from a video library; then key video frames are extracted, offline or online, from all the extracted violent and terrorist videos; finally, the extracted key video frames are input to the violent and terrorist video identification model based on a hash network, the hash codes of the violent and terrorist videos are computed, and the resulting hash codes are stored.

The hash code comparison may consist of computing the Hamming distance between the hash code of a key frame and a pre-stored hash code, and determining from the Hamming distance whether the key frame is similar to a video frame of a violent and terrorist video.

In some optional implementations of this embodiment, "comparing the hash code of each key frame with the pre-stored hash codes of video frames of violent and terrorist videos to determine the video frames similar to each key frame" includes: comparing the hash code of each key frame with the hash codes of the video frames of the violent and terrorist videos in the video library; calculating the Hamming distance between the hash code of the key frame and the hash code of the video frame; and confirming a key frame and a video frame whose Hamming distance lies within a set range as similar frames. Specifically, two frame pictures whose Hamming distance is within 2 may be confirmed as similar frames.
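The Hamming-distance comparison of step 203 can be illustrated with a short sketch. It assumes that hash codes are stored as Python integers and that "within 2" means a distance of at most 2; the function names are hypothetical.

```python
from typing import List


def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which two hash codes differ."""
    return bin(code_a ^ code_b).count("1")


def find_similar_frames(key_code: int, stored_codes: List[int], radius: int = 2) -> List[int]:
    """Indices of pre-stored codes within the given Hamming radius of key_code."""
    return [
        index
        for index, stored in enumerate(stored_codes)
        if hamming_distance(key_code, stored) <= radius
    ]
```

For instance, the 8-bit codes 10110010 and 10110111 differ in two positions, so the corresponding frames would be treated as similar under a radius of 2, whereas 10110010 and 01001101 differ in all eight positions and would not.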

Step 204: count the similar frames; if the number of video frames similar to the key frames exceeds the set threshold, determine that the video to be detected is a violent and terrorist video.

In this embodiment, step 203 determines which key frames are similar to video frames of the violent and terrorist videos in the violent and terrorist video database. The number of key frames of the video to be detected that are similar to video frames of those violent and terrorist videos is then counted, and if this number exceeds the set threshold, the video to be detected is determined to be a violent and terrorist video. Specifically, if three or more key frames of the video to be detected are similar to video frames of violent and terrorist videos in the video library, the video to be detected is confirmed to be a violent and terrorist video.
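The counting rule of step 204 can be sketched as below, using the example values from the text (Hamming radius 2, at least three similar key frames). Hash codes are assumed to be integers, a key frame is counted once however many stored frames it matches, and the function name `is_flagged_video` is hypothetical.

```python
from typing import Iterable, List


def is_flagged_video(key_codes: Iterable[int], stored_codes: List[int],
                     radius: int = 2, min_similar_key_frames: int = 3) -> bool:
    """Return True when enough key frames match the pre-stored library."""
    similar = 0
    for code in key_codes:
        # A key frame counts as similar if any stored code is within the radius.
        if any(bin(code ^ stored).count("1") <= radius for stored in stored_codes):
            similar += 1
    return similar >= min_similar_key_frames
```

The linear scan over `stored_codes` is the simplest possible lookup; a production system would likely index the library, but the text does not describe how.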

In some optional implementations of this embodiment, the violent and terrorist video identification model based on a hash network is trained as follows: the preset training sample pictures are classified into positive sample data and negative sample data, wherein the positive sample data are pairs of two violent and terrorist pictures and the negative sample data are pairs of one violent and terrorist picture and one non-violent picture; the training sample pictures are resized, and a region of a set size is randomly cropped from each resized training sample picture and subjected to sample mean processing; and the initial violent and terrorist video identification model is trained on the processed pictures, yielding the violent and terrorist video identification model based on a hash network. Specifically, the training data may be divided into two groups, positive sample data and negative sample data. The positive sample data may be pairs of two violent and terrorist pictures, with the label set to 1; the negative sample data may be pairs of one violent and terrorist picture and one non-violent picture, with the label set to 0. In this way the hash codes of violent and terrorist videos are made as similar to one another as possible, and the hash codes of non-violent videos are pushed as far as possible from those of violent and terrorist videos.

To adjust the training sample pictures, each picture is resized to 256 × 256, a region of 227 × 227 is then randomly cropped from it, and the sample mean is subtracted; the result is the processed sample picture, which can be input directly to the initial hash network model for training. The sample mean is the average value over all pixels of the sample picture. Subtracting the sample mean before training and testing improves training speed and test accuracy.
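The cropping and mean subtraction described above can be sketched in plain Python. Several points are assumptions: the picture is taken to be already resized to 256 × 256 (resizing is not implemented here), it is represented as rows of (R, G, B) tuples, and the mean is computed per channel over the whole resized picture before cropping, which is only one possible reading of "sample mean".

```python
import random
from typing import List, Optional, Tuple

Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]  # rows of (R, G, B) tuples (assumed representation)


def channel_mean(image: Image) -> Pixel:
    """Per-channel mean over all pixels of one picture."""
    count = len(image) * len(image[0])
    sums = [0.0, 0.0, 0.0]
    for row in image:
        for pixel in row:
            for c in range(3):
                sums[c] += pixel[c]
    return (sums[0] / count, sums[1] / count, sums[2] / count)


def random_crop(image: Image, size: int, rng: random.Random) -> Image:
    """Cut a size x size window at a random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def preprocess(image: Image, crop_size: int = 227,
               rng: Optional[random.Random] = None) -> Image:
    """Random crop followed by mean subtraction."""
    rng = rng or random.Random()
    mean = channel_mean(image)
    crop = random_crop(image, crop_size, rng)
    return [
        [(p[0] - mean[0], p[1] - mean[1], p[2] - mean[2]) for p in row]
        for row in crop
    ]
```

Passing a seeded `random.Random` makes the crop position reproducible, which is convenient when debugging a training pipeline.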

A pair of pictures from the positive sample data (a first violent and terrorist picture and a second violent and terrorist picture) or a pair of pictures from the negative sample data (one violent and terrorist picture and one non-violent picture) is input to the initial hash network model for training.

In some optional implementations of this embodiment, the network structure of the initial violent and terrorist video identification model includes an input layer, convolutional layers and fully connected layers; Fig. 3 shows a schematic diagram of the network structure of the hash network model. The first layer is the input layer, the second to sixth layers are convolutional layers, and the seventh to ninth layers are fully connected layers. The input layer receives the processed training sample pictures, which are a pair of three-channel RGB pictures. The convolutional layers from the second to the sixth layer are denoted conv1 to conv5 in Fig. 3, and the fully connected layers from the seventh to the ninth layer are denoted fc1 to fc3 in Fig. 3. The loss function (loss) attached to the fully connected layers has two major characteristics: it is "discriminative" and it yields "binary-like" codes.

Each convolutional layer receives the output of the preceding layer, applies the convolution of that layer, and outputs the result after applying that layer's activation function; each fully connected layer likewise receives the output of the preceding layer, applies the convolution of that layer, and outputs the result after applying that layer's activation function. Specifically:

The second layer is a convolutional layer with 64 convolution kernels, each of size 11 × 11, with a convolution stride of 4 and padding = 0. The output feature map is followed by an activation layer, a down-sampling layer and a normalization layer. The activation layer uses the ReLU function. The down-sampling layer uses max pooling with a 3 × 3 kernel and a stride of 2. The normalization layer uses the LRN (local response normalization) method with a kernel size of 5, alpha set to 0.00001 and beta set to 0.75, where alpha is the scaling factor and beta is the exponent. The second layer takes the output of the first layer; the convolution produces C₁, C₁ is fed to the down-sampling layer to obtain P₁, P₁ is fed to the activation layer to obtain A₁, A₁ is fed to the normalization layer to obtain L₁, and L₁ is finally output to the third layer.

The third layer is a convolutional layer with 256 convolution kernels, each of size 5 × 5, with a convolution stride of 1 and padding = 2. The output feature map is followed by an activation layer, a down-sampling layer and a normalization layer. The activation layer uses the ReLU function. The down-sampling layer uses max pooling with a 3 × 3 kernel and a stride of 2. The normalization layer uses the LRN method with a kernel size of 5, alpha set to 0.00001 and beta set to 0.75. The third layer takes the output of the second layer; the convolution produces C₂, C₂ is fed to the down-sampling layer to obtain P₂, P₂ is fed to the activation layer to obtain A₂, A₂ is fed to the normalization layer to obtain L₂, and L₂ is finally output to the fourth layer.

The fourth layer is a convolutional layer with 256 convolution kernels, each of size 3 × 3, with a convolution stride of 1 and padding = 1. The output feature map is followed by an activation layer, which uses the ReLU function. The fourth layer takes the output of the third layer; the convolution produces C₃, C₃ is fed to the activation layer to obtain A₃, and A₃ is finally output to the fifth layer.

The fifth layer is a convolutional layer with 256 convolution kernels, each of size 3 × 3, with a convolution stride of 1 and padding = 1. The output feature map is followed by an activation layer, which uses the ReLU function. The fifth layer takes the output of the fourth layer; the convolution produces C₄, C₄ is fed to the activation layer to obtain A₄, and A₄ is finally output to the sixth layer.

The sixth layer is a convolutional layer with 256 convolution kernels, each of size 3 × 3, with a convolution stride of 1 and padding = 1. The output feature map is followed by an activation layer and a down-sampling layer. The activation layer uses the ReLU function. The down-sampling layer uses max pooling with a 3 × 3 kernel and a stride of 2. The sixth layer takes the output of the fifth layer; the convolution produces C₅, C₅ is fed to the down-sampling layer to obtain P₅, P₅ is fed to the activation layer to obtain A₅, and A₅ is finally output to the seventh layer.

The seventh layer is a fully connected layer with 4096 convolution kernels, each of size 1 × 1, with a stride of 1. The output feature map is followed by an activation layer, which uses the ReLU function. The seventh layer takes the output of the sixth layer; the convolution produces C₆, C₆ is fed to the activation layer to obtain A₆, and A₆ is finally output to the eighth layer.

The eighth layer is a fully connected layer with 4096 convolution kernels, each of size 1 × 1, with a stride of 1. The output feature map is followed by an activation layer, which uses the ReLU function. The eighth layer takes the output of the seventh layer; the convolution produces C₇, C₇ is fed to the activation layer to obtain A₇, and A₇ is finally output to the last layer.

The ninth layer is a fully connected layer whose number of convolution kernels is determined by the required hash code length; each kernel is of size 1 × 1 with a stride of 1. The output feature map is followed by a hash loss layer, which uses a hash function. The ninth layer takes the output of the eighth layer; the convolution produces C₈, and C₈ is fed to the hash loss layer, which outputs the binary hash codes (b_i,1, b_i,2) of the sample pair.
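The layer parameters above determine the spatial size of each feature map. The sketch below traces those sizes for a 227 × 227 input, assuming the standard output-size formula floor((n + 2p − k) / s) + 1 for both convolution and pooling and no padding in the pooling layers; the stage names follow the conv1 to conv5 labels of Fig. 3, while the pool names are introduced here for illustration only.

```python
from typing import List, Tuple


def out_size(n: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling stage."""
    return (n + 2 * padding - kernel) // stride + 1


def trace_spatial_sizes(input_size: int = 227) -> List[Tuple[str, int]]:
    """Feature-map side length after each size-changing stage."""
    # (name, kernel, stride, padding); ReLU and LRN leave the size unchanged.
    stages = [
        ("conv1", 11, 4, 0), ("pool1", 3, 2, 0),
        ("conv2", 5, 1, 2), ("pool2", 3, 2, 0),
        ("conv3", 3, 1, 1),
        ("conv4", 3, 1, 1),
        ("conv5", 3, 1, 1), ("pool5", 3, 2, 0),
    ]
    sizes = []
    n = input_size
    for name, kernel, stride, padding in stages:
        n = out_size(n, kernel, stride, padding)
        sizes.append((name, n))
    return sizes


if __name__ == "__main__":
    for name, size in trace_spatial_sizes():
        print(name, size)
```

Under these assumptions the side length goes 55, 27, 27, 13, 13, 13, 13 and finally 6, so the first fully connected layer would receive a 6 × 6 map with 256 channels, that is, 9216 values per picture.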

Each of the above layers includes an activation function. The activation function of the second to eighth layers is:

ReLU(x) = max(0, x)

The activation function of the ninth layer of the network structure of the initial violent and terrorist video identification model is:

where δ(x) is the result of taking the partial derivative with respect to b_i,j.

The loss function used to train the violent and terrorist video identification model is:

where y_i indicates whether the sample pair is similar, that is, y_i = 1 means the two samples are similar and otherwise they are dissimilar; the Euclidean distance is taken between the binary codes of the two samples in the pair; ‖|b_i,1| − 1‖₁ and ‖|b_i,2| − 1‖₁ are the Manhattan distances between each sample's binary code and the all-ones vector; L_r is the loss function; m (m > 0) is the margin threshold parameter; α is the scaling factor; b_i,1 and b_i,2 are the hash codes of sample 1 and sample 2; N is the total number of training sample pairs; and k is the dimension of the hash code.
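Because the formula is not reproduced in this text, the following is only a numerical sketch of a pairwise hashing loss of the kind the symbol definitions suggest. It assumes a contrastive form on the squared Euclidean distance with margin m, plus an α-weighted L1 penalty that pulls each code component toward ±1, and uses the label convention stated above (y_i = 1 for similar pairs). The function names and the factor of one half are assumptions.

```python
from typing import Iterable, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float], int]  # (b_i1, b_i2, y_i)


def pair_loss(b1: Sequence[float], b2: Sequence[float], similar: bool,
              margin: float, alpha: float) -> float:
    """Loss contribution of one sample pair (assumed form)."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(b1, b2))
    if similar:
        # Similar pairs are penalized for being far apart.
        fit = 0.5 * sq_dist
    else:
        # Dissimilar pairs are penalized only while closer than the margin.
        fit = 0.5 * max(margin - sq_dist, 0.0)
    # L1 distance of |b| from the all-ones vector keeps outputs near +1 or -1.
    reg = sum(abs(abs(u) - 1.0) for u in b1) + sum(abs(abs(v) - 1.0) for v in b2)
    return fit + alpha * reg


def total_loss(pairs: Iterable[Pair], margin: float, alpha: float) -> float:
    """Sum of pair losses over all N training pairs."""
    return sum(pair_loss(b1, b2, y == 1, margin, alpha) for b1, b2, y in pairs)
```

With this form, two identical codes with components exactly ±1 incur zero loss when labeled similar, and two codes whose squared distance already exceeds the margin incur zero loss when labeled dissimilar.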

As an example, referring to Fig. 4, Fig. 4 shows a schematic diagram of comparison-based fast violent-terrorist video identification. As shown in Fig. 4, on the one hand, key frames of violent-terrorist videos are extracted in advance from the video database, and the violent-terrorist video identification model is used to generate the hash code of each key frame. On the other hand, key frames of the video to be detected are extracted, and the hash network model is used to generate the hash code of each key frame. The Hamming distances between the hash codes of the key frames of the video to be detected and the hash codes of the violent-terrorist video key frames are then computed and compared. Two frames whose Hamming distance is within a radius of 2 are confirmed as similar frames. Finally, if 3 or more key frames of the video to be detected are similar to violent-terrorist video key frames in the video library, the video is considered to be a violent-terrorist video.
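The comparison step in this example can be sketched as follows. Hash codes are represented as Python integers (one bit per code dimension), which is an assumption about storage, and the function names `hamming`, `similar_keyframes`, and `is_violent_terrorist_video` are hypothetical. The radius of 2 and the count of 3 are the values given in the example above.

```python
def hamming(a, b):
    """Number of bit positions in which two bit-packed hash codes differ."""
    return bin(a ^ b).count("1")


def similar_keyframes(query_codes, library_codes, radius=2):
    """Key-frame codes that lie within the Hamming radius of any library code."""
    return [
        q for q in query_codes
        if any(hamming(q, c) <= radius for c in library_codes)
    ]


def is_violent_terrorist_video(query_codes, library_codes, radius=2, min_similar=3):
    """Flag the video when at least `min_similar` key frames have a match."""
    return len(similar_keyframes(query_codes, library_codes, radius)) >= min_similar


# Hypothetical 8-bit codes, for illustration only.
library = [0b10110010, 0b01001101]
query = [0b10110011, 0b10110111, 0b01001101, 0b11111111]
flagged = is_violent_terrorist_video(query, library)
```

A linear scan is used here for clarity; because the pre-stored codes are short binary strings, the same comparison also lends itself to indexed lookup when the library is large.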

In the method provided by the above embodiment of the present application, the hash codes of the key frames of the video to be detected are matched against the hash codes of violent-terrorist video key frames to confirm the similar frames of the key frames of the video to be detected, and whether the video to be detected is a violent-terrorist video is determined from the number of its key frames that are similar to key frames in the video database. Extracting video key frames by shot segmentation achieves a good balance between the accuracy and the speed of shot detection; comparing the key-frame hash codes with pre-stored hash codes makes it possible to judge quickly whether the video to be detected is a violent-terrorist video; the pre-stored hash codes occupy little space and can be retrieved quickly; and the hash network model yields the hash codes of the key frames accurately and quickly. Therefore, the method provided by the present invention can identify violent-terrorist videos quickly and accurately.
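The shot-segmentation step, in which histograms of adjacent frames are compared to locate shot boundaries and the start frame of each shot is taken as a key frame, can be sketched as below. The grayscale frame representation (a flat list of pixel values from 0 to 255), the bin count, the L1 histogram difference, the threshold value, and the function names are all assumptions chosen for illustration.

```python
def histogram(frame, bins=16):
    """Normalised grayscale histogram of a frame given as a flat list of 0-255 values."""
    hist = [0] * bins
    for pixel in frame:
        hist[pixel * bins // 256] += 1
    return [count / len(frame) for count in hist]


def hist_diff(h1, h2):
    """L1 difference between two normalised histograms (ranges from 0 to 2)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def shot_boundaries(frames, threshold=0.5):
    """Indices i where frame i starts a new shot relative to frame i - 1."""
    hists = [histogram(f) for f in frames]
    return [
        i for i in range(1, len(frames))
        if hist_diff(hists[i - 1], hists[i]) > threshold
    ]


def key_frame_indices(frames, threshold=0.5):
    """Start frame of every shot; the first frame always opens a shot."""
    if not frames:
        return []
    return [0] + shot_boundaries(frames, threshold)


# Two synthetic shots: three dark frames followed by two bright frames.
frames = [[10] * 64 for _ in range(3)] + [[200] * 64 for _ in range(2)]
keys = key_frame_indices(frames)
```

Choosing the end frame of each shot instead, or both frames, only changes which indices are returned from the detected boundaries.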

So far, the technical solution of the present invention has been described with reference to the preferred embodiments shown in the drawings; however, those skilled in the art will readily understand that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principle of the present invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions shall fall within the protection scope of the present invention.

Claims

1. A comparison-based fast violent-terrorist video identification method, characterized in that the method comprises:

performing shot segmentation on a video to be detected, on which violent-terrorist identification is to be carried out, so as to select key frames of the video to be detected;

using a pre-built violent-terrorist video identification model to perform a hash code operation on each key frame of the video to be detected to obtain the hash code of each key frame, wherein the violent-terrorist video identification model is built on a hash network, its input is a video frame, and its output is the hash code of the input video frame;

comparing the hash code of each key frame with pre-stored hash codes of video frames of violent-terrorist videos to determine the video frames similar to each key frame;

counting the number of similar frames, and if the number of video frames similar to the key frames exceeds a set threshold, determining that the video to be detected is a violent-terrorist video.

2. The comparison-based fast violent-terrorist video identification method according to claim 1, characterized in that "performing shot segmentation on a video to be detected, on which violent-terrorist identification is to be carried out, so as to select key frames of the video to be detected" comprises:

extracting the histogram of every video frame of the video to be detected, and comparing the differences between the histograms of adjacent video frames to determine the shot boundaries of the video to be detected;

according to the determined shot boundaries, selecting the start frame and/or the end frame of each shot of the video to be detected as key frames.

3. The comparison-based fast violent-terrorist video identification method according to claim 1, characterized in that "comparing the hash code of each key frame with pre-stored hash codes of video frames of violent-terrorist videos to determine the video frames similar to each key frame" comprises:

comparing the hash code of each key frame with the hash codes of the video frames of the violent-terrorist videos in the video library;

calculating the Hamming distance between the hash code of the key frame and the hash code of the video frame;

confirming a key frame and a video frame whose Hamming distance radius is within a set range as similar frames.

4. The comparison-based fast violent-terrorist video identification method according to claim 3, characterized in that the training method of the violent-terrorist video identification model comprises:

classifying preset training sample pictures into positive sample data and negative sample data, wherein the positive sample data are pairs of a violent-terrorist picture with a violent-terrorist picture, and the negative sample data are pairs of a violent-terrorist picture with a non-violent-terrorist picture;

adjusting the size of the training sample pictures, randomly cropping a region of a set size from the adjusted training sample pictures, and performing sample mean processing;

training an initial violent-terrorist video identification model with the processed pictures to obtain the hash-network-based violent-terrorist video identification model.

5. The comparison-based fast violent-terrorist video identification method according to claim 4, characterized in that the network structure of the initial violent-terrorist video identification model comprises an input layer, convolutional layers, and fully connected layers, wherein the first layer is the input layer, the second to sixth layers are convolutional layers, and the seventh to ninth layers are fully connected layers.

6. The comparison-based fast violent-terrorist video identification method according to claim 5, characterized in that, in training the violent-terrorist video identification model, the input of the input layer is the training sample pictures after sample mean processing.

7. The comparison-based fast violent-terrorist video identification method according to claim 5, characterized in that each convolutional layer receives the output of the preceding layer and, after convolution in this layer, outputs the result activated by the activation function of this layer; each fully connected layer receives the output of the preceding layer and, after convolution in this layer, outputs the result activated by the activation function of this layer.

8. The comparison-based fast violent-terrorist video identification method according to claim 7, characterized in that the activation function of the second to eighth layers of the network structure of the initial violent-terrorist video identification model is the ReLU function:

f(x) = max(0, x)

9. The comparison-based fast violent-terrorist video identification method according to claim 7, characterized in that the activation function of the ninth layer of the network structure of the initial violent-terrorist video identification model is:

where δ(x) is the result of taking the partial derivative with respect to b_{i,j}.

10. The comparison-based fast violent-terrorist video identification method according to any one of claims 4 to 9, characterized in that the loss function for training the violent-terrorist video identification model is:

s.t. b_{i,j} ∈ {-1, +1}^k, i ∈ {1, ..., N}, j ∈ {1, 2}

where y_i indicates whether the sample pair is similar, i.e. y_i = 1 means the two samples are similar and otherwise they are dissimilar; ‖b_{i,1} - b_{i,2}‖₂ is the Euclidean distance between the two binary codes of the sample pair; ‖|b_{i,1}| - 1‖₁ and ‖|b_{i,2}| - 1‖₁ are the Manhattan distances between the sample binary codes and the unit (all-ones) vector 1; L_r is the loss function; m (m > 0) is the margin threshold parameter; α is the scaling factor; b_{i,1} and b_{i,2} are the hash codes of sample 1 and sample 2; N is the total number of training sample pairs; and k is the dimension of the hash codes.