CN105389558A - Method and apparatus for detecting video - Google Patents


Publication number
CN105389558A
CN105389558A (application CN201510764366.0A)
Authority
CN
China
Prior art keywords
video
subsegment
text
testing result
bad
Prior art date
Legal status
Pending
Application number
CN201510764366.0A
Other languages
Chinese (zh)
Inventor
李邵梅
黄海
于洪涛
王凯
高超
黄雅静
李印海
Current Assignee
PLA Information Engineering University
Original Assignee
PLA Information Engineering University
Priority date
Filing date
Publication date
Application filed by PLA Information Engineering University filed Critical PLA Information Engineering University
Priority application: CN201510764366.0A
Publication: CN105389558A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/48: Matching video sequences

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a method and an apparatus for detecting a video. A video to be detected is segmented into a plurality of subsegment videos based on the similarity of adjacent frame images in the video; image detection, text detection and voice keyword detection are then performed on each subsegment video, so that a detection result for the video to be detected, i.e., whether it is a bad video, can be determined from the image detection result, the text detection result and the voice detection result of each subsegment video. According to the above technical solution, the method and apparatus judge whether the video is a bad video based on image, text and voice together. Compared with the purely image-based detection of the prior art, they examine the video from multiple aspects, analyze it more comprehensively, and improve the accuracy of video detection.

Description

Video detection method and apparatus
Technical field
The present invention belongs to the technical field of image recognition, and in particular relates to a video detection method and apparatus.
Background art
A bad video is a video whose manner of distribution or content is illegal or improper. Bad videos currently fall mainly into two types, pirated videos and other bad videos, where the other bad videos chiefly comprise reactionary videos, terrorism videos, fraud videos and pornographic videos. Widely distributed over public networks, these bad videos have become a major cause of social harm.
In order to purify the Internet environment, researchers have proposed a variety of methods for detecting bad videos. Detection of pirated videos is relatively mature, while the main detection method for reactionary, terrorism, fraud and pornographic videos is content-based detection, which proceeds as follows:
First, the visual objects in known bad videos are obtained, and the feature values of those visual objects are extracted as matching templates. Second, after a video to be matched is acquired, every frame image in the video is divided into sub-regions and the feature value of each sub-region is extracted region by region. Then the feature value of each sub-region is compared with the template feature values by a distance-based similarity calculation; if the distance is less than a specified threshold, the video is judged to be a bad video. However, a video is a combination of images, text and speech, and judging whether a video is a bad video purely by image detection may make the detection inaccurate.
Summary of the invention
In view of this, an object of the present invention is to provide a video detection method and apparatus for improving the accuracy of video detection.
The present invention provides a video detection method, the method comprising:
segmenting a video to be detected into a plurality of subsegment videos based on the similarity of adjacent frame images in the video to be detected;
performing image detection, text detection and voice keyword detection on each subsegment video to obtain an image detection result, a text detection result and a voice detection result for each subsegment video, wherein the image detection result indicates the result of detecting the subsegment video based on image, the text detection result indicates the result of detecting the subsegment video based on text, and the voice detection result indicates the result of detecting the subsegment video based on voice keywords;
obtaining the detection result of each subsegment video based on its image detection result, text detection result and voice detection result;
obtaining the detection result of the video to be detected based on the detection results of the subsegment videos.
Preferably, obtaining the detection result of a subsegment video based on its image detection result, text detection result and voice detection result comprises:
when any one of the image detection result, the text detection result and the voice detection result of the subsegment video indicates that a target object is detected and the grade of the target object is level one, obtaining a detection result indicating that the subsegment video is a bad video subsegment;
when at least two of the image detection result, the text detection result and the voice detection result of the subsegment video indicate that a target object is detected and the grade of the target object is level two, obtaining a detection result indicating that the subsegment video is a bad video subsegment, the importance of level two being lower than that of level one;
when any one of the image detection result, the text detection result and the voice detection result of the subsegment video indicates that a target object is detected and the grade of the target object is level two, obtaining a detection result indicating that the subsegment video is a suspected bad video subsegment.
Preferably, obtaining the detection result of the video to be detected based on the detection results of the subsegment videos comprises:
obtaining, based on the detection results, a first number of subsegment videos that are bad video subsegments and a second number of subsegment videos that are suspected bad video subsegments;
when the ratio of the first number to the total number of subsegment videos is greater than a first threshold, obtaining a detection result indicating that the video to be detected is a bad video;
when the ratio of the second number to the total number of subsegment videos is greater than a second threshold, obtaining a detection result indicating that the video to be detected is a bad video, the first threshold being less than the second threshold.
Preferably, performing image detection on a subsegment video to obtain its image detection result comprises:
extracting the visual features of the detection regions of every frame image in the subsegment video;
performing matching analysis between the extracted visual features and a pre-established image object model to obtain the bad objects in every frame image and their grades, the image detection result comprising the bad objects in every frame image and their grades.
Preferably, performing text detection on a subsegment video to obtain its text detection result comprises:
determining the text regions in every frame image of the subsegment video;
performing text recognition on the determined text regions to obtain the text they contain;
matching the obtained text against a pre-established text library to obtain the bad text in every frame image and its grade, the text detection result comprising the bad text in every frame image and its grade.
Preferably, performing voice detection on a subsegment video to obtain its voice detection result comprises:
extracting the audio data from the subsegment video and obtaining the speech feature sequence of the audio data;
comparing the obtained speech feature sequence with the speech feature sequence of each keyword in a pre-established speech library to obtain the distance between the obtained speech feature sequence and the speech feature sequence of each keyword;
when the distance between the obtained speech feature sequence and the speech feature sequence of any keyword is less than a distance threshold, determining that the subsegment video contains bad speech;
obtaining the keywords whose distance is less than the distance threshold, and determining the grade of the bad speech from the grade of those keywords, the voice detection result comprising the bad speech and its grade.
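The distance comparison in the voice detection steps above resembles classic dynamic time warping (DTW, to which Fig. 8 of the drawings refers). A minimal sketch, assuming scalar feature values standing in for the per-frame speech features (the real feature sequences would be vector-valued, and all names here are hypothetical):

```python
def dtw_distance(seq, keyword_seq):
    """DTW distance between a speech feature sequence and a keyword's
    feature sequence; smaller means more similar."""
    n, m = len(seq), len(keyword_seq)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq[i - 1] - keyword_seq[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]


def matching_keywords(seq, keyword_library, dist_thresh):
    """Return the keywords whose DTW distance falls below the threshold;
    a non-empty result means the subsegment contains bad speech."""
    return [kw for kw, kw_seq in keyword_library.items()
            if dtw_distance(seq, kw_seq) < dist_thresh]
```

DTW tolerates differing speaking rates, which is why identical content stretched over more frames still yields distance zero in the sketch.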
The present invention also provides a video detection apparatus, the apparatus comprising:
a segmentation unit configured to segment a video to be detected into a plurality of subsegment videos based on the similarity of adjacent frame images in the video to be detected;
a detection unit configured to perform image detection, text detection and voice keyword detection on each subsegment video to obtain an image detection result, a text detection result and a voice detection result for each subsegment video, wherein the image detection result indicates the result of detecting the subsegment video based on image, the text detection result indicates the result of detecting the subsegment video based on text, and the voice detection result indicates the result of detecting the subsegment video based on voice keywords;
a first processing unit configured to obtain the detection result of each subsegment video based on its image detection result, text detection result and voice detection result;
a second processing unit configured to obtain the detection result of the video to be detected based on the detection results of the subsegment videos.
Preferably, the first processing unit is configured to: when any one of the image detection result, the text detection result and the voice detection result of a subsegment video indicates that a target object is detected and the grade of the target object is level one, obtain a detection result indicating that the subsegment video is a bad video subsegment; when at least two of those results indicate that a target object is detected and the grade of the target object is level two, obtain a detection result indicating that the subsegment video is a bad video subsegment, the importance of level two being lower than that of level one; and when any one of those results indicates that a target object is detected and the grade of the target object is level two, obtain a detection result indicating that the subsegment video is a suspected bad video subsegment.
Preferably, the second processing unit comprises an acquisition subunit and a processing subunit;
the acquisition subunit is configured to obtain, based on the detection results, a first number of subsegment videos that are bad video subsegments and a second number of subsegment videos that are suspected bad video subsegments;
the processing subunit is configured to obtain a detection result indicating that the video to be detected is a bad video when the ratio of the first number to the total number of subsegment videos is greater than a first threshold, and to obtain a detection result indicating that the video to be detected is a bad video when the ratio of the second number to the total number of subsegment videos is greater than a second threshold, the first threshold being less than the second threshold.
Preferably, the detection unit comprises an image detection subunit, a text detection subunit and a voice detection subunit;
the image detection subunit is configured to extract the visual features of the detection regions of every frame image in the subsegment video and perform matching analysis between the extracted visual features and a pre-established image object model, so as to obtain the bad objects in every frame image and their grades, the image detection result comprising the bad objects in every frame image and their grades;
the text detection subunit is configured to determine the text regions in every frame image of the subsegment video, perform text recognition on the determined text regions to obtain the text they contain, and match the obtained text against a pre-established text library, so as to obtain the bad text in every frame image and its grade, the text detection result comprising the bad text in every frame image and its grade;
the voice detection subunit is configured to extract the audio data from the subsegment video, obtain the speech feature sequence of the audio data, and compare the obtained speech feature sequence with the speech feature sequence of each keyword in a pre-established speech library to obtain the distance between them; when the distance between the obtained speech feature sequence and the speech feature sequence of any keyword is less than a distance threshold, to determine that the subsegment video contains bad speech; and to obtain the keywords whose distance is less than the distance threshold and determine the grade of the bad speech from the grade of those keywords, the voice detection result comprising the bad speech and its grade.
Compared with the prior art, the technical solutions provided by the present invention have the following advantages:
According to the technical solutions provided by the present invention, a video to be detected can be segmented into a plurality of subsegment videos based on the similarity of adjacent frame images, and image detection, text detection and voice keyword detection are then performed on each subsegment video, so that whether the video to be detected is a bad video can be judged from the image detection result, the text detection result and the voice detection result of each subsegment video. That is, the present invention judges whether the video to be detected is a bad video based on all three of image, text and voice keywords. Compared with the purely image-based detection of the prior art, the present invention detects the video from multiple aspects, analyzes it more comprehensively, and thereby improves the accuracy of video detection.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the video detection method provided by an embodiment of the present invention;
Fig. 2 shows the optical flow trajectory diagrams provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of performing image detection on a subsegment video according to an embodiment of the present invention;
Fig. 4 shows a unit detection region of a target object according to an embodiment of the present invention;
Fig. 5 is a gradient histogram provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of performing text detection on a subsegment video according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of performing voice detection on a subsegment video according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of DTW-based sequence alignment provided by an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of the video detection apparatus provided by an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of the detection unit in the video detection apparatus provided by an embodiment of the present invention.
Embodiments
One of the core ideas of the video detection method and apparatus provided by the embodiments of the present invention is to judge whether a video to be detected is a bad video by performing image detection, text detection and voice keyword detection on each subsegment video of the video to be detected. Compared with the purely image-based detection of the prior art, the method and apparatus provided by the embodiments of the present invention detect the video from multiple aspects, analyze it more comprehensively, and improve the accuracy of video detection.
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, the flowchart of the video detection method provided by an embodiment of the present invention may comprise the following steps:
101: Segment the video to be detected into a plurality of subsegment videos based on the similarity of adjacent frame images in the video to be detected. Here the similarity of adjacent frame images refers to their degree of resemblance; in the embodiments of the present invention, whether adjacent frame images are similar can be determined from optical flow trajectories. The process is as follows:
First, an optical flow vector is set for each pixel in every frame image, an optical flow computation method based on the optical flow vectors is used to extract the motion trajectory of each pixel between adjacent frame images, and the motion trajectories are plotted as two-dimensional coordinate figures to obtain an optical flow trajectory diagram, as shown in the third sub-figure of Fig. 2, which is computed by comparing the corresponding pixel values of the first two sub-figures, i.e., two consecutive frame images extracted from the same video. Then the number of pixels in the optical flow trajectory diagram whose motion speed exceeds a certain motion speed threshold is counted; when the ratio of that number to the total number of pixels is greater than a preset pixel threshold, the two frame images are judged similar, this pair of frames can be used as a segmentation boundary, and the video to be detected is segmented at that boundary into subsegment videos.
For example, when the two frame images forming the boundary are the 3rd and 4th frame images, segmentation at this boundary places the 3rd frame image and the frames before it, i.e., the 1st and 2nd frame images, into one subsegment video, and the 4th frame image and the images after it into another subsegment video; if qualifying image pairs still exist among the 4th frame image and the images after it, that subsegment video can be further segmented to obtain multiple subsegment videos.
In the embodiments of the present invention, the optical flow computation method based on optical flow vectors may be any existing optical flow method, such as the LK (Lucas-Kanade) algorithm, and the motion speed threshold and preset pixel threshold can be set according to actual conditions; the embodiments of the present invention do not limit their specific values.
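The boundary test and splitting described in step 101 can be sketched as follows (a minimal pure-Python illustration, not the patent's implementation; function names and threshold values are assumptions, and the per-pixel speeds are taken as already computed from the optical flow trajectories):

```python
def is_boundary(pixel_speeds, speed_thresh, ratio_thresh):
    """Decide whether a frame pair forms a segmentation boundary.

    pixel_speeds: motion speed of each pixel between the two frames,
    derived from the optical flow trajectory diagram.
    """
    fast = sum(1 for s in pixel_speeds if s > speed_thresh)
    return fast / len(pixel_speeds) > ratio_thresh


def split_into_subsegments(num_frames, boundaries):
    """Split frame indices 1..num_frames at each boundary pair (i, i+1)."""
    segments, start = [], 1
    for i in sorted(boundaries):
        segments.append(list(range(start, i + 1)))
        start = i + 1
    segments.append(list(range(start, num_frames + 1)))
    return segments
```

For instance, with a boundary between frames 3 and 4 of an 8-frame video, `split_into_subsegments(8, [3])` yields `[[1, 2, 3], [4, 5, 6, 7, 8]]`, matching the frame-splitting example above.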
102: Perform image detection, text detection and voice keyword detection on each subsegment video to obtain the image detection result, the text detection result and the voice detection result of each subsegment video, where the image detection result indicates the result of detecting the subsegment video based on image, the text detection result indicates the result of detecting the subsegment video based on text, and the voice detection result indicates the result of detecting the subsegment video based on voice keywords.
That is, the image detection result, the text detection result and the voice detection result may be used to indicate whether the corresponding subsegment video contains a target object. Taking the image detection result as an example: if image detection of a subsegment video finds the portrait of a terrorist leader or a terrorist organization's badge, the image detection result indicates that the subsegment video contains a target object.
103: Obtain the detection result of each subsegment video based on its image detection result, text detection result and voice detection result. Because each of these three results is obtained by detecting only one aspect of the subsegment video, none of them can fully indicate whether the subsegment video is a bad video subsegment, so in the embodiments of the present invention the detection result of the subsegment video needs to be obtained from the three results together.
In the embodiments of the present invention, one way to obtain the detection result of a subsegment video is as follows: when any one of the image detection result, the text detection result and the voice detection result indicates that a target object is detected and the grade of the target object is level one, a detection result indicating that the subsegment video is a bad video subsegment is obtained;
when at least two of the three results indicate that a target object is detected and the grade of the target object is level two, a detection result indicating that the subsegment video is a bad video subsegment is obtained;
when only one of the three results indicates that a target object is detected and the grade of the target object is level two, a detection result indicating that the subsegment video is a suspected bad video subsegment is obtained;
in cases other than those above, in which the three results indicate a bad or suspected bad video subsegment, a detection result indicating that the subsegment video is a normal video subsegment is obtained.
In the embodiments of the present invention, a target object is harmful content contained in a subsegment video. In an image detection result, the target object may be, for example, the portrait of a terrorist leader, a terrorist organization's badge, or nudity in a pornographic video; in a text detection result, the target object may be, for example, "holy war" in a terrorism video, a "pornographic collection" caption in a pornographic video, or "cured once, never recurs" in a fraudulent advertisement video; in a voice detection result, the target object may be, for example, "massacre and suicide" in a terrorism video or "100% cure rate" in a fraudulent advertisement video. The grade of a target object indicates its importance: the higher the importance, the more likely the target object is harmful content, and in the embodiments of the present invention the importance of level two is lower than that of level one.
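A minimal sketch of this fusion rule (a hypothetical illustration; the function name and result encoding are assumptions, not the patent's implementation):

```python
def classify_subsegment(image, text, voice):
    """Fuse the three per-aspect results into a subsegment verdict.

    Each argument is None (no target object detected) or the grade of
    the detected target object: 1 (level one) or 2 (level two).
    """
    grades = [g for g in (image, text, voice) if g is not None]
    if 1 in grades:              # any level-one detection
        return "bad"
    if grades.count(2) >= 2:     # at least two level-two detections
        return "bad"
    if grades.count(2) == 1:     # a single level-two detection
        return "suspected"
    return "normal"              # no target object detected
```

Note that the level-one rule is checked first, so a level-one detection marks the subsegment bad regardless of the other two results.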
104: Obtain the detection result of the video to be detected based on the detection results of the subsegment videos. One feasible way is as follows: based on the detection results, obtain a first number of subsegment videos that are bad video subsegments and a second number of subsegment videos that are suspected bad video subsegments; when the ratio of the first number to the total number of subsegment videos is greater than a first threshold, obtain a detection result indicating that the video to be detected is a bad video; when the ratio of the second number to the total number of subsegment videos is greater than a second threshold, obtain a detection result indicating that the video to be detected is a bad video, the first threshold being less than the second threshold.
For example, the first threshold may be 60% and the second threshold 80%. It should be noted that 60% and 80% are only illustrative; the first and second thresholds can be set to different values in different situations.
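The video-level decision of step 104 can be sketched as follows (a hypothetical illustration; the 60%/80% defaults are the example values from the text, and the verdict strings are assumptions):

```python
def classify_video(subsegment_verdicts, t1=0.60, t2=0.80):
    """Judge the whole video from per-subsegment verdicts.

    subsegment_verdicts: list of "bad", "suspected" or "normal".
    t1: first threshold (ratio of bad subsegments); t2: second
    threshold (ratio of suspected subsegments); t1 < t2 per the method.
    """
    total = len(subsegment_verdicts)
    bad_ratio = subsegment_verdicts.count("bad") / total
    suspected_ratio = subsegment_verdicts.count("suspected") / total
    if bad_ratio > t1 or suspected_ratio > t2:
        return "bad video"
    return "normal video"
```

The lower threshold for confirmed bad subsegments reflects that a smaller proportion of definite detections is enough to condemn the whole video, while suspected subsegments must dominate before the same verdict is reached.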
As can be seen from the above technical solution, the video detection method provided by the embodiments of the present invention segments a video to be detected into a plurality of subsegment videos based on the similarity of adjacent frame images, then performs image detection, text detection and voice keyword detection on each subsegment video, and judges whether the video to be detected is a bad video from the image detection result, the text detection result and the voice detection result of each subsegment video. That is, the present invention judges whether the video to be detected is a bad video based on all three of image, text and voice keywords; compared with the purely image-based detection of the prior art, the present invention detects the video from multiple aspects, analyzes it more comprehensively, and improves the accuracy of video detection.
The feasible ways in which the embodiments of the present invention perform image detection, text detection and voice keyword detection on a subsegment video are now described in detail. As shown in Fig. 3, performing image detection on a subsegment video to obtain its image detection result may comprise the following steps:
1021: Extract the visual features of the detection regions of every frame image in the subsegment video. In the embodiments of the present invention, a detection region of a frame image is a sub-region of that frame image. An existing saliency detection method can be used to locate, in every frame image, the detection region containing an object that may be a target object, after which the visual features of that detection region are extracted; alternatively, a sliding-window method can be used to extract the visual features of each detection region over the whole frame image, region by region.
Wherein visual signature can adopt HOG (HistogramofOrientedGradient, histograms of oriented gradients) feature, its leaching process is: for each surveyed area by the horizontal direction gradient of pixel extraction pixel and vertical gradient, horizontal direction gradient is G h(x, y)=f (x+1, y)-f (x-1, y), vertical gradient is G v(x, y)=f (x, y+1)-f (x, y-1), f (x, y) are the pixel values at (x, y) place;
Then the gradient magnitude M(x, y) and gradient direction θ(x, y) of each pixel are calculated from the above horizontal and vertical gradients:

M(x, y) = √(G_h(x, y)² + G_v(x, y)²)

θ(x, y) = arctan(G_h(x, y) / G_v(x, y)), where the gradient direction is limited to (0 ~ 180°):

θ(x, y) = θ(x, y) + π if θ(x, y) < 0, otherwise θ(x, y).
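The gradient step above can be sketched as follows. This is a minimal illustration only, not part of the claimed embodiment; it assumes Python with NumPy and a grayscale image array, and the function name `gradient_features` is a placeholder.

```python
import numpy as np

def gradient_features(f):
    """Per-pixel gradient magnitude and direction for a grayscale image f,
    following the central-difference definitions above."""
    f = f.astype(np.float64)
    # Horizontal gradient G_h(x, y) = f(x+1, y) - f(x-1, y)
    gh = np.zeros_like(f)
    gh[:, 1:-1] = f[:, 2:] - f[:, :-2]
    # Vertical gradient G_v(x, y) = f(x, y+1) - f(x, y-1)
    gv = np.zeros_like(f)
    gv[1:-1, :] = f[2:, :] - f[:-2, :]
    # Magnitude M(x, y) = sqrt(G_h^2 + G_v^2)
    mag = np.sqrt(gh ** 2 + gv ** 2)
    # Direction arctan(G_h / G_v), folded into [0, 180) degrees as above
    theta = np.arctan2(gh, gv)   # radians in (-pi, pi]
    theta[theta < 0] += np.pi    # theta + pi when theta < 0
    return mag, np.degrees(theta) % 180.0
```

For a frame whose pixel values ramp horizontally, every interior pixel has magnitude 2 and direction 90°, which matches the definitions above.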
1022: Perform matching analysis on the extracted visual feature against a pre-established image object model to obtain the bad object in each frame image and the grade of the bad object, where the image detection result comprises the bad object in each frame image and the grade of the bad object.
In the embodiment of the present invention, the pre-established image object model may be an SVM (Support Vector Machine) model. After the visual feature is extracted, it can be substituted into the optimal classification function of the SVM model, g(x_t) = sgn(Σ_i a_i y_i (x_i · x_t) + b), where sgn(·) is the sign function and x_t is the extracted visual feature; if g(x_t) = 1, the detection region is judged to contain a bad object, otherwise it does not. After a bad object is obtained, it is compared with each target object in the bad-video object library to determine its grade. The bad-video object library is a library obtained from existing bad videos: it stores the multiple target objects identified as bad objects in existing bad videos and, according to the importance of the target objects, divides them into two tiers. For example, tier one consists of target objects exclusive to bad videos, such as the portraits and badges of terrorist leaders in terror videos, or nudity in pornographic videos; tier two consists of target objects that occur frequently in bad videos but may also appear in other videos, such as explosives in terror videos.
The SVM model can be built on HOG features. Its establishment process is: collect a number of labelled images containing the above target objects and extract the HOG features of the regions where the target objects lie, labelled as the positive sample set; likewise collect a number of images that do not contain the target objects and extract the HOG features of arbitrary regions, labelled as the negative sample set. The positive and negative sample sets are fed into the SVM model, and training minimises the objective function of the model, in the standard soft-margin form min (1/2)‖w‖² + C Σ_i ξ_i subject to y_i(w · x_i + b) ≥ 1 − ξ_i, where N is the total number of samples (positive plus negative), x_i is the HOG feature of each sample, and y_i is the label of the sample, +1 for a positive sample and −1 for a negative sample. Solving this minimisation yields the relevant parameters of the SVM model: w, a and b. An SVM model is trained for each target object in the bad-video object library, and together these form the bad-video object model library. Concretely, the HOG feature of each target object in the bad-video object library is extracted first; then, based on the HOG feature of each target object, an SVM model is established for each target object, each established SVM model being the optimal classification function composed of the parameters (w, a, b).
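The training and classification steps above can be sketched with scikit-learn's `SVC` standing in for the unspecified SVM implementation. This is illustrative only: the 8-dimensional random vectors stand in for real 1764-dimensional HOG features, and the data is synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-ins for HOG features: positive samples (regions containing a
# target object) cluster away from negative samples. The dimension is
# reduced from 1764 to 8 purely to keep the example small.
pos = rng.normal(loc=2.0, size=(50, 8))   # label +1 (positive sample set)
neg = rng.normal(loc=-2.0, size=(50, 8))  # label -1 (negative sample set)
X = np.vstack([pos, neg])
y = np.array([1] * 50 + [-1] * 50)

model = SVC(kernel="linear", C=1.0).fit(X, y)  # soft-margin SVM training

# Classifying a new detection region: prediction +1 means the optimal
# classification function judges that a bad object is present.
x_t = rng.normal(loc=2.0, size=(1, 8))
print(model.predict(x_t)[0])
```

One such model would be trained per target object in the library, as the text describes.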
The process of extracting the HOG feature of each target object is: normalise the target-object region to 224 × 224 pixels and divide it into cells of 8 × 8 pixels, with every 4 cells forming 1 block, as shown in Figure 4; each target-object region is thus divided into 49 blocks;
For each cell, obtain the gradient magnitude M(x, y) and gradient direction θ(x, y) of each pixel, and accumulate the gradient magnitudes over the gradient directions of the pixels in the cell to form a gradient histogram, as shown in Figure 5. The gradient-histogram features of the 4 cells in each block are concatenated to form a 4 × 9 = 36-dimensional feature; finally, the gradient-histogram features of all blocks are concatenated to form a 36 × 49 = 1764-dimensional HOG feature, i.e. the HOG feature of the target object. The magnitude represented on the horizontal axis in Figure 5 is computed from pixel values, whose range is [0, 255], and is unitless.
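The cell/block pipeline above can be sketched as follows. This is a simplified reading, not the claimed extractor: it uses non-overlapping 2 × 2-cell blocks (which gives 196 blocks on a 224 × 224 region rather than the 49 counted in the text, since block tiling conventions vary) and omits the block normalisation that practical HOG extractors apply.

```python
import numpy as np

def hog_descriptor(region, cell=8, bins=9):
    """Minimal HOG sketch: 9-bin unsigned-orientation histograms over
    8x8-pixel cells, cells grouped 2x2 into blocks and concatenated."""
    f = region.astype(np.float64)
    gh = np.zeros_like(f); gh[:, 1:-1] = f[:, 2:] - f[:, :-2]
    gv = np.zeros_like(f); gv[1:-1, :] = f[2:, :] - f[:-2, :]
    mag = np.hypot(gh, gv)
    ang = np.degrees(np.arctan2(gh, gv)) % 180.0   # unsigned, [0, 180)

    h, w = f.shape
    cy, cx = h // cell, w // cell
    hists = np.zeros((cy, cx, bins))
    for i in range(cy):
        for j in range(cx):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            idx = np.minimum((a / (180.0 / bins)).astype(int), bins - 1)
            np.add.at(hists[i, j], idx, m)   # magnitude-weighted histogram

    # Concatenate each 2x2-cell block (4 x 9 = 36 dims) into one vector
    blocks = [hists[i:i+2, j:j+2].ravel()
              for i in range(0, cy - 1, 2) for j in range(0, cx - 1, 2)]
    return np.concatenate(blocks)

feat = hog_descriptor(np.random.default_rng(1).random((224, 224)) * 255)
print(feat.shape)  # 196 blocks x 36 dims = (7056,) with this tiling
```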
It should be noted here that the embodiment of the present invention uses the HOG feature and SVM model only as an illustration; in practical applications, other visual features and image object models may also be adopted, which the embodiment of the present invention does not enumerate one by one.
Refer to Figure 6, which illustrates the process, in the video detection method provided by the embodiment of the present invention, of performing text detection on a subsegment video to obtain the text detection result of the subsegment video. The process can comprise the following steps:
1023: Determine the text region in each frame image of the subsegment video. The text region is a region of the frame that may contain text. In the embodiment of the present invention, regions containing captions or scene text can be located by detecting the MSERs (Maximally Stable Extremal Regions) in each frame image. The MSER detection process is: binarise each frame image with multiple grey thresholds to obtain a binary image corresponding to each grey threshold; for the binary image obtained at each grey threshold, obtain its black and white regions; when the binary images corresponding to multiple consecutive grey thresholds all contain a region of similar shape, that region is regarded as a shape-stable region, and this stable region is an MSER.
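The threshold-sweep idea above can be illustrated with a toy sketch (not a full MSER implementation — real detectors such as OpenCV's track all extremal regions efficiently; here we follow a single seeded region, using SciPy's connected-component labelling, on a synthetic frame).

```python
import numpy as np
from scipy import ndimage

def stable_region_areas(gray, thresholds, seed):
    """Binarise the image at a sweep of grey thresholds and track the
    area of the connected dark component containing `seed`. A region
    whose area stays nearly constant across consecutive thresholds is
    'maximally stable' in the sense described above."""
    areas = []
    for t in thresholds:
        dark = gray < t                     # binary image at this threshold
        labels, _ = ndimage.label(dark)     # connected components
        lab = labels[seed]
        areas.append(int((labels == lab).sum()) if lab else 0)
    return areas

# Synthetic frame: a dark 10x10 "character" on a bright background.
img = np.full((40, 40), 200, dtype=np.uint8)
img[15:25, 15:25] = 30
areas = stable_region_areas(img, thresholds=range(60, 180, 20), seed=(20, 20))
print(areas)  # area stays flat at 100 across the sweep => stable region
```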
1024: Perform text recognition on the determined text region to obtain the text contained in the text region. In the embodiment of the present invention, OCR (Optical Character Recognition) technology can be used to recognise the text region; the text region may further be enhanced before recognition, so that the OCR technology can recognise the text more reliably.
1025: Match the obtained text against a pre-established text library to obtain the bad text in each frame image and the grade of the bad text, where the text detection result comprises the bad text in each frame image and the grade of the bad text.
The pre-established text library is built from the sensitive words in the typical scene text and captions contained in existing bad videos and, according to the importance of the typical scene text and sensitive words, is divided into two tiers. Tier one consists of text exclusive to bad videos, such as "crusade" in terror videos, "pornographic select-elite" in pornographic videos, or "once effect a radical cure and do not recur" in fraudulent advertising videos; tier two consists of text that occurs frequently in bad videos but may also appear in other videos, such as "Koran" in terror videos, "temptation" in pornographic videos, or "invalid reimbursement" in fraudulent advertising videos. Thus, after it is determined that an image contains bad text, comparing the bad text with the texts at each tier in the pre-established text library yields the grade of the bad text.
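The two-tier matching step can be sketched as a simple lookup. The library entries and grades below are illustrative placeholders taken from the examples in the text, and `grade_recognised_text` is a hypothetical helper name.

```python
# Hypothetical two-tier sensitive-text library in the spirit of the
# description above; entries and grades are illustrative only.
TEXT_LIBRARY = {
    "crusade": 1,             # tier one: exclusive to bad videos
    "pornographic select-elite": 1,
    "temptation": 2,          # tier two: frequent in bad videos, not exclusive
    "invalid reimbursement": 2,
}

def grade_recognised_text(text):
    """Match OCR output against the library; return (bad_text, grade)
    pairs, i.e. the per-frame text detection result."""
    hits = []
    for phrase, grade in TEXT_LIBRARY.items():
        if phrase in text.lower():
            hits.append((phrase, grade))
    return hits

print(grade_recognised_text("Banner read: CRUSADE tonight"))  # [('crusade', 1)]
```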
Refer to Figure 7, which illustrates the process, in the video detection method provided by the embodiment of the present invention, of performing speech detection on a subsegment video to obtain the speech detection result of the subsegment video. The process can comprise the following steps:
1026: Extract the audio data in the subsegment video and obtain the speech feature sequence of the audio data. The speech feature sequence may be an MFCC (Mel-Frequency Cepstral Coefficient) feature sequence, whose extraction process is: frame the audio data at a certain time interval to obtain multiple frames of speech data; perform an FFT (Fast Fourier Transform) on each frame of speech data and feed the FFT result into a pre-divided Mel filterbank to obtain the output of each filter; take the logarithm of the filter outputs and apply a DCT (Discrete Cosine Transform) to obtain the 12-dimensional MFCC feature of the speech data.
In the embodiment of the present invention, the time interval is a preset value that can be set according to the practical situation, which the embodiment of the present invention does not limit. The Mel filterbank is an array formed by the upper and lower boundaries of a series of divisions of the Mel frequency range; it consists of a given number of triangular band-pass filters, whose centre frequencies and bandwidths are uniformly distributed over the Mel-scale frequencies corresponding to the range [0, 4000] Hz. The Mel frequency is proposed based on the auditory properties of the human ear; it has a non-linear correspondence with the Hz frequency, with the conversion formula Mel(f) = 2595 log₁₀(1 + f/700).
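The Mel conversion and the construction of a triangular filterbank can be sketched as follows. The filter count, FFT size and sample rate are illustrative choices, not values from the patent.

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700), the non-linear mapping above."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=8000, f_max=4000.0):
    """Triangular band-pass filters whose centre frequencies are spaced
    uniformly on the Mel scale over [0, f_max] Hz, as described above."""
    mel_pts = np.linspace(0.0, hz_to_mel(f_max), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)  # inverse mapping
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                 # rising edge of the triangle
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge of the triangle
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

print(round(float(hz_to_mel(700.0)), 1))  # 2595*log10(2) -> 781.2
```

Multiplying a frame's FFT power spectrum by this matrix, taking logs and applying a DCT yields the MFCC features described in step 1026.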
1027: Compare the obtained speech feature sequence with the speech feature sequence of each keyword in a pre-established speech library, and obtain the distance between the obtained speech feature sequence and the speech feature sequence of each keyword.
In the embodiment of the present invention, each keyword in the pre-established speech library is obtained from the representative keywords contained in existing bad videos and, according to the importance of these keywords, is divided into two tiers. Tier one consists of keywords exclusive to bad videos, such as "massacre and suicide" in terror videos or "cure rate 100%" in fraudulent advertising videos; tier two consists of keywords that occur frequently in bad videos but may also appear in other videos, such as "getting into heaven" in terror videos or "extensive clinical verification" in fraudulent advertising videos.
For the keywords in the speech library, the speech feature sequence may likewise be an MFCC feature sequence, and the extraction process can refer to the explanation in step 1026 above, which is not elaborated again in the embodiment of the present invention.
1028: When the value of the distance between the obtained speech feature sequence and the speech feature sequence of any one keyword is less than a distance threshold, determine that the subsegment video contains bad speech.
When comparing the obtained speech feature sequence with the speech feature sequence of each keyword in the pre-established speech library, a sliding-window method can be adopted and the two sequences compared by DTW (Dynamic Time Warping). DTW is a classical method for comparing sequences: it computes the distances between corresponding elements of the two sequences and accumulates them, and its output is the distance between the two sequences; the smaller the distance, the greater the similarity between the two sequences. When the value of the distance is less than the distance threshold, the subsegment video is determined to contain bad speech.
The distance between two sequences output by DTW is the minimum of the inter-sequence path distances. Figure 8 is a schematic diagram of DTW-based feature-sequence comparison, in which the horizontal and vertical axes are the two feature sequences being compared, of lengths M and N respectively; the marks {1, 2, ..., M} and {1, 2, ..., N} on the axes denote the labels of the feature values in the two sequences, and the diamond-shaped area in the figure constrains the paths over which the distance is computed. The distances of all paths within the diamond-shaped area from the start point (0, 0) to the target point (M, N) are computed, i.e. the distances between the pairs of feature points the path passes through; the point pairs along a path are marked as the curved path in the figure. For each point pair, its positions in the two feature sequences are found, the feature values at those positions are taken, and the Euclidean distance between the feature values is computed, as indicated by the first two dotted boxes in Figure 8, which show how the path distance is calculated. Finally, the accumulated distances of all paths are compared, and the minimum among them is taken as the distance between the two feature sequences.
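The minimum-path computation above is the classic DTW dynamic programme, which can be sketched as follows (a plain O(MN) recurrence without the diamond-shaped path constraint shown in Figure 8, for brevity):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping: D[i][j] accumulates the local (Euclidean)
    distance between elements a[i], b[j] plus the cheapest of the three
    admissible predecessors, so D[M-1][N-1] is the minimum total
    alignment cost over all monotone paths from (0, 0) to (M, N)."""
    a = np.atleast_2d(np.asarray(a, dtype=float))
    b = np.atleast_2d(np.asarray(b, dtype=float))
    if a.shape[0] == 1: a = a.T   # treat 1-D input as a scalar sequence
    if b.shape[0] == 1: b = b.T
    M, N = len(a), len(b)
    D = np.full((M, N), np.inf)
    for i in range(M):
        for j in range(N):
            d = np.linalg.norm(a[i] - b[j])
            if i == 0 and j == 0:
                D[i, j] = d
            else:
                prev = min(D[i-1, j] if i else np.inf,
                           D[i, j-1] if j else np.inf,
                           D[i-1, j-1] if i and j else np.inf)
                D[i, j] = d + prev
    return D[-1, -1]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # -> 0.0: the sequences align exactly
```

In the scheme above, `a` would be a window of the subsegment's MFCC sequence and `b` a keyword's MFCC sequence, with a hit declared when the returned distance falls below the distance threshold.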
1029: Obtain the keyword whose distance value is less than the distance threshold, and determine the grade of the bad speech from the tier to which that keyword belongs; the speech detection result comprises the bad speech and the grade of the bad speech.
For each of the foregoing method embodiments, for simplicity of description, each is expressed as a series of combinations of actions. However, those skilled in the art should know that the present invention is not limited by the described order of actions, because according to the present invention some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Corresponding to the above method embodiments, the embodiment of the present invention also provides a video detection apparatus, whose schematic structure is shown in Figure 9. It can comprise: a segmentation unit 11, a detection unit 12, a first processing unit 13 and a second processing unit 14.
The segmentation unit 11 is configured to segment the video to be detected into multiple subsegment videos based on the similarity of adjacent frame images in the video to be detected. The similarity of adjacent frame images refers to their degree of similarity; in the embodiment of the present invention, whether adjacent frame images are similar can be determined by optical-flow trajectories, and the detailed process can refer to the related description in the above method embodiment, which is not elaborated again.
The detection unit 12 is configured to perform image detection, text detection and voice keyword detection on each subsegment video respectively, and obtain the image detection result, text detection result and speech detection result of each subsegment video, where the image detection result is used to indicate the detection result of the subsegment video obtained by image detection, the text detection result is used to indicate the detection result of the subsegment video obtained by text detection, and the speech detection result is used to indicate the detection result of the subsegment video obtained by voice keyword detection.
That is, the above image detection result, text detection result and speech detection result can be used to indicate whether the corresponding subsegment video contains a target object. Taking the image detection result as an example: if image detection of a subsegment video finds the portrait and badge of a terrorist leader, the image detection result indicates that the subsegment video contains a target object.
The first processing unit 13 is configured to obtain the detection result of the corresponding subsegment video based on the image detection result, text detection result and speech detection result of each subsegment video. Because the image detection result, text detection result and speech detection result of a subsegment video are each obtained by detecting only one aspect of the subsegment video, none of them can fully indicate whether the subsegment video is a bad video segment; therefore, in the embodiment of the present invention, the detection result of the corresponding subsegment video needs to be obtained from these three detection results.
In the embodiment of the present invention, one way for the first processing unit 13 to obtain the detection result of the corresponding subsegment video is: when any one of the image detection result, text detection result and speech detection result of the subsegment video indicates that a target object is detected and the grade of the target object is tier one, obtain a detection result indicating that the subsegment video is a bad video subsegment;

when at least two of the image detection result, text detection result and speech detection result of the subsegment video indicate that a target object is detected and the grade of the target object is tier two, obtain a detection result indicating that the subsegment video is a bad video subsegment;

when any one of the image detection result, text detection result and speech detection result of the subsegment video indicates that a target object is detected and the grade of the target object is tier two, obtain a detection result indicating that the subsegment video is a suspected bad video subsegment;

apart from the cases above in which the image detection result, text detection result and speech detection result indicate that the subsegment video is a bad video subsegment or a suspected bad video subsegment, in other cases a detection result indicating that the subsegment video is a normal video subsegment is obtained.
In the embodiment of the present invention, a target object is undesirable content contained in the subsegment video. In the image detection result, the target object can be, for example, the portrait or badge of a terrorist leader, or nudity in a pornographic video; in the text detection result, it can be "crusade" in terror videos, "pornographic select-elite" in pornographic videos, or "once effect a radical cure and do not recur" in fraudulent advertising videos; and in the speech detection result, it can be "massacre and suicide" in terror videos or "cure rate 100%" in fraudulent advertising videos. The grade of a target object indicates its importance: the higher the importance, the more likely the target object is undesirable content; in the embodiment of the present invention, the importance of tier two is lower than the importance of tier one.
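The per-subsegment decision rules above can be sketched as a small function. This is illustrative only; `classify_subsegment` is a hypothetical helper, and each argument is the grade of the target object found by one detection channel (1 = tier one, 2 = tier two, `None` = nothing found).

```python
def classify_subsegment(image_res, text_res, speech_res):
    """Combine the three per-subsegment detection results according to
    the rules above."""
    results = [r for r in (image_res, text_res, speech_res) if r is not None]
    if any(r == 1 for r in results):
        return "bad"            # any tier-one hit => bad video subsegment
    tier2 = sum(1 for r in results if r == 2)
    if tier2 >= 2:
        return "bad"            # two or more tier-two hits => bad subsegment
    if tier2 == 1:
        return "suspected"      # exactly one tier-two hit => suspected
    return "normal"             # all other cases => normal subsegment

print(classify_subsegment(1, None, None))  # -> bad
print(classify_subsegment(None, 2, 2))     # -> bad
print(classify_subsegment(None, None, 2))  # -> suspected
```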
The second processing unit 14 is configured to obtain the detection result of the video to be detected based on the detection result of each subsegment video. Specifically, the second processing unit comprises: an acquisition subunit and a processing subunit.
The acquisition subunit is configured to obtain, based on the detection results, the first subsegment video count, i.e. the number of bad video subsegments, and the second subsegment video count, i.e. the number of suspected bad video subsegments.
The processing subunit is configured to obtain a detection result indicating that the video to be detected is a bad video when the ratio of the first subsegment video count to the total number of subsegment videos is greater than a first threshold, and to obtain a detection result indicating that the video to be detected is a bad video when the ratio of the second subsegment video count to the total number of subsegment videos is greater than a second threshold, where the first threshold is less than the second threshold. For example, the first threshold is 60% and the second threshold is 80%. It should be noted here that 60% and 80% are only illustrative; the first and second thresholds can be set to different values in different situations.
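The video-level decision of the processing subunit can be sketched as follows, using the illustrative 60%/80% thresholds from the text; `classify_video` is a hypothetical helper name.

```python
def classify_video(subsegment_results, first_threshold=0.6, second_threshold=0.8):
    """Video-level decision from per-subsegment results: flag the video
    as bad when the share of bad subsegments exceeds the first threshold,
    or the share of suspected subsegments exceeds the (larger) second
    threshold."""
    assert first_threshold < second_threshold
    total = len(subsegment_results)
    bad = subsegment_results.count("bad") / total
    suspected = subsegment_results.count("suspected") / total
    if bad > first_threshold or suspected > second_threshold:
        return "bad"
    return "normal"

print(classify_video(["bad"] * 7 + ["normal"] * 3))        # 70% bad -> bad
print(classify_video(["suspected"] * 7 + ["normal"] * 3))  # 70% suspected -> normal
```

Using a lower threshold for confirmed bad subsegments than for merely suspected ones matches the intent above: weaker evidence must cover more of the video before the whole video is flagged.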
As can be seen from the above technical scheme, the video detection apparatus provided by the embodiment of the present invention can segment a video to be detected into multiple subsegment videos based on the similarity of adjacent frame images in the video, and then perform image detection, text detection and voice keyword detection on each subsegment video. The detection result of the video to be detected can then be judged from the image detection result of each subsegment video obtained by image detection, the text detection result of each subsegment video obtained by text detection, and the speech detection result obtained by voice keyword detection; that is, whether the video to be detected is a bad video can be judged. In other words, the present invention judges whether the video to be detected is a bad video on the basis of all three aspects: image, text and voice keywords. Compared with the simple image detection of the prior art, the present invention detects the video to be detected from multiple aspects, thereby analysing it more comprehensively and improving the accuracy of video detection.
In the embodiment of the present invention, the schematic structure of the above detection unit 12 is shown in Figure 10 and can comprise: an image detection subunit 121, a text detection subunit 122 and a speech detection subunit 123.
The image detection subunit 121 is configured to extract the visual feature of the detection region of each frame image in the subsegment video and perform matching analysis on the extracted visual feature against the pre-established image object model to obtain the bad object in each frame image and the grade of the bad object, where the image detection result comprises the bad object in each frame image and the grade of the bad object.
The text detection subunit 122 is configured to determine the text region in each frame image of the subsegment video, perform text recognition on the determined text region to obtain the text contained in the text region, and match the obtained text against the pre-established text library to obtain the bad text in each frame image and the grade of the bad text, where the text detection result comprises the bad text in each frame image and the grade of the bad text.
The speech detection subunit 123 is configured to extract the audio data in the subsegment video, obtain the speech feature sequence of the audio data, compare the obtained speech feature sequence with the speech feature sequence of each keyword in the pre-established speech library, and obtain the distance between the obtained speech feature sequence and the speech feature sequence of each keyword. When the value of the distance between the obtained speech feature sequence and the speech feature sequence of any one keyword is less than the distance threshold, it determines that the subsegment video contains bad speech, obtains the keyword whose distance value is less than the distance threshold, and determines the grade of the bad speech from the tier to which that keyword belongs; the speech detection result comprises the bad speech and the grade of the bad speech.
The concrete implementations of the above image detection subunit 121, text detection subunit 122 and speech detection subunit 123 can refer to the related descriptions in the above method embodiment, which are not elaborated again in the embodiment of the present invention.
It should be noted that each embodiment in this specification is described in a progressive manner, each embodiment emphasising its differences from the others; for the identical or similar parts between embodiments, refer to one another. As for the apparatus-class embodiments, since they are basically similar to the method embodiments, their description is relatively simple, and for relevant parts refer to the description of the method embodiments.
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the existence of other identical elements in the process, method, article or device comprising that element.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention shall not be limited to the embodiments shown herein, but shall conform to the widest scope consistent with the principles and novel features disclosed herein.
The above is only the preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can also be made without departing from the principles of the present invention, and these improvements and modifications should also be regarded as within the protection scope of the present invention.

Claims (10)

1. A video detection method, characterised in that the method comprises:
segmenting a video to be detected into multiple subsegment videos based on the similarity of adjacent frame images in the video to be detected;
performing image detection, text detection and voice keyword detection on each subsegment video respectively, and obtaining the image detection result, text detection result and speech detection result of each subsegment video, wherein the image detection result is used to indicate the detection result of the subsegment video obtained by image detection, the text detection result is used to indicate the detection result of the subsegment video obtained by text detection, and the speech detection result is used to indicate the detection result of the subsegment video obtained by voice keyword detection;
obtaining the detection result of the corresponding subsegment video based on the image detection result, text detection result and speech detection result of each subsegment video;
obtaining the detection result of the video to be detected based on the detection result of each subsegment video.
2. The method according to claim 1, characterised in that obtaining the detection result of the corresponding subsegment video based on the image detection result, text detection result and speech detection result of each subsegment video comprises:
when any one of the image detection result, text detection result and speech detection result of the subsegment video indicates that a target object is detected and the grade of the target object is tier one, obtaining a detection result indicating that the subsegment video is a bad video subsegment;
when at least two of the image detection result, text detection result and speech detection result of the subsegment video indicate that a target object is detected and the grade of the target object is tier two, obtaining a detection result indicating that the subsegment video is a bad video subsegment, wherein the importance of the tier two is lower than the importance of the tier one;
when any one of the image detection result, text detection result and speech detection result of the subsegment video indicates that a target object is detected and the grade of the target object is tier two, obtaining a detection result indicating that the subsegment video is a suspected bad video subsegment.
3. The method according to claim 2, characterised in that obtaining the detection result of the video to be detected based on the detection result of each subsegment video comprises:
obtaining, based on the detection results, a first subsegment video count being the number of bad video subsegments and a second subsegment video count being the number of suspected bad video subsegments;
when the ratio of the first subsegment video count to the total number of subsegment videos is greater than a first threshold, obtaining a detection result indicating that the video to be detected is a bad video;
when the ratio of the second subsegment video count to the total number of subsegment videos is greater than a second threshold, obtaining a detection result indicating that the video to be detected is a bad video, wherein the first threshold is less than the second threshold.
4. The method according to claim 1, characterised in that performing image detection on a subsegment video to obtain the image detection result of the subsegment video comprises:
extracting the visual feature of the detection region of each frame image in the subsegment video;
performing matching analysis on the extracted visual feature against a pre-established image object model to obtain the bad object in each frame image and the grade of the bad object, wherein the image detection result comprises the bad object in each frame image and the grade of the bad object.
5. The method according to claim 1, characterised in that performing text detection on a subsegment video to obtain the text detection result of the subsegment video comprises:
determining the text region in each frame image of the subsegment video;
performing text recognition on the determined text region to obtain the text contained in the text region;
matching the obtained text against a pre-established text library to obtain the bad text in each frame image and the grade of the bad text, wherein the text detection result comprises the bad text in each frame image and the grade of the bad text.
6. The method according to claim 1, characterised in that performing speech detection on a subsegment video to obtain the speech detection result of the subsegment video comprises:
extracting the audio data in the subsegment video and obtaining the speech feature sequence of the audio data;
comparing the obtained speech feature sequence with the speech feature sequence of each keyword in a pre-established speech library, and obtaining the distance between the obtained speech feature sequence and the speech feature sequence of each keyword;
when the value of the distance between the obtained speech feature sequence and the speech feature sequence of any one keyword is less than a distance threshold, determining that the subsegment video contains bad speech;
obtaining the keyword whose distance value is less than the distance threshold, and determining the grade of the bad speech from the tier to which the keyword belongs, wherein the speech detection result comprises the bad speech and the grade of the bad speech.
7. A video detection apparatus, characterised in that the apparatus comprises:
a segmentation unit, configured to segment a video to be detected into multiple subsegment videos based on the similarity of adjacent frame images in the video to be detected;
a detection unit, configured to perform image detection, text detection and voice keyword detection on each subsegment video respectively, and obtain the image detection result, text detection result and speech detection result of each subsegment video, wherein the image detection result is used to indicate the detection result of the subsegment video obtained by image detection, the text detection result is used to indicate the detection result of the subsegment video obtained by text detection, and the speech detection result is used to indicate the detection result of the subsegment video obtained by voice keyword detection;
First processing unit, for the text hegemony result based on the image testing result of each subsegment video, the text detection result of each subsegment video and each subsegment video, obtains the testing result of corresponding subsegment video;
Second processing unit, for the testing result based on each subsegment video, obtains the testing result of described video to be detected.
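The segmentation unit of claim 7 cuts the video wherever the similarity of adjacent frames drops, i.e. at shot boundaries. A minimal sketch, modeling each frame as a grayscale histogram and assuming normalized histogram intersection as the similarity measure (the patent does not fix a particular measure); the threshold value is illustrative:

```python
def histogram_similarity(h1, h2):
    """Normalized histogram intersection in [0, 1]."""
    inter = sum(min(a, b) for a, b in zip(h1, h2))
    return inter / max(sum(h1), 1)

def segment_video(frame_histograms, threshold=0.7):
    """Split frame indices into subsegments at low-similarity boundaries."""
    segments, current = [], [0]
    for i in range(1, len(frame_histograms)):
        if histogram_similarity(frame_histograms[i - 1], frame_histograms[i]) < threshold:
            segments.append(current)  # adjacent frames dissimilar: start a new subsegment
            current = []
        current.append(i)
    segments.append(current)
    return segments

# Two dark frames followed by two bright frames -> two subsegments
print(segment_video([[10, 0], [10, 0], [0, 10], [0, 10]]))  # → [[0, 1], [2, 3]]
```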
8. The apparatus according to claim 7, wherein the first processing unit is configured to: when any one of the image detection result, the text detection result, and the speech detection result of a subsegment video indicates that a target object is detected and the grade of the target object is level one, obtain a detection result indicating that the subsegment video is a bad video subsegment; when at least two of the image detection result, the text detection result, and the speech detection result of the subsegment video indicate that a target object is detected and the grade of the target object is level two, obtain a detection result indicating that the subsegment video is a bad video subsegment, wherein the severity of level two is lower than the severity of level one; and when any one of the image detection result, the text detection result, and the speech detection result of the subsegment video indicates that a target object is detected and the grade of the target object is level two, obtain a detection result indicating that the subsegment video is a suspected bad video subsegment.
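The fusion rule of claim 8 can be sketched directly. Each of the three detector outputs is modeled here as `None` (nothing found) or a grade (1 or 2, with grade 1 the more severe level); the function name and return labels are illustrative:

```python
def classify_subsegment(image_grade, text_grade, speech_grade):
    """Fuse the three per-subsegment detector grades into one label."""
    grades = [g for g in (image_grade, text_grade, speech_grade) if g is not None]
    if any(g == 1 for g in grades):
        return "bad"        # any single level-1 detection suffices
    if sum(1 for g in grades if g == 2) >= 2:
        return "bad"        # at least two level-2 detections also suffice
    if any(g == 2 for g in grades):
        return "suspected"  # exactly one level-2 detection
    return "clean"
```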
9. The apparatus according to claim 8, wherein the second processing unit comprises an obtaining subunit and a processing subunit;
the obtaining subunit is configured to obtain, based on the detection results, a first number of subsegment videos that are bad video subsegments and a second number of subsegment videos that are suspected bad video subsegments; and
the processing subunit is configured to: when the ratio of the first number of subsegment videos to the total number of subsegment videos is greater than a first threshold, obtain a detection result indicating that the video to be detected is a bad video; and when the ratio of the second number of subsegment videos to the total number of subsegment videos is greater than a second threshold, obtain a detection result indicating that the video to be detected is a bad video, wherein the first threshold is less than the second threshold.
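The video-level decision of claim 9 thresholds the fractions of bad and suspected subsegments, with the bad-fraction threshold required to be the smaller of the two. A sketch with illustrative threshold values (the patent does not specify them):

```python
def classify_video(labels, first_threshold=0.2, second_threshold=0.5):
    """labels: per-subsegment labels ("bad" / "suspected" / "clean").

    first_threshold must be less than second_threshold, per claim 9.
    """
    total = len(labels)
    n_bad = sum(1 for l in labels if l == "bad")
    n_suspected = sum(1 for l in labels if l == "suspected")
    if n_bad / total > first_threshold:
        return "bad video"
    if n_suspected / total > second_threshold:
        return "bad video"
    return "not flagged"
```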
10. The apparatus according to claim 7, wherein the detection unit comprises an image detection subunit, a text detection subunit, and a speech detection subunit;
the image detection subunit is configured to extract visual features of a detection region of each frame image in the subsegment video, and to perform matching analysis between the extracted visual features and an image object model established in advance, to obtain the bad objects in each frame image and the grades of the bad objects, wherein the image detection result comprises the bad objects in each frame image and the grades of the bad objects;
the text detection subunit is configured to determine a text region in each frame image of the subsegment video, perform text recognition on the determined text region to obtain the text contained in the text region, and match the obtained text against a text library established in advance, to obtain the bad text in each frame image and the grade of the bad text, wherein the text detection result comprises the bad text in each frame image and the grade of the bad text; and
the speech detection subunit is configured to extract the audio data in the subsegment video, obtain a speech feature sequence of the audio data, and compare the obtained speech feature sequence with the speech feature sequence of each keyword in a speech library established in advance, to obtain the distance between the obtained speech feature sequence and the speech feature sequence of each keyword; when the distance between the obtained speech feature sequence and the speech feature sequence of any keyword is less than a distance threshold, determine that the subsegment video comprises bad speech; and obtain the keyword whose distance is less than the distance threshold, and determine a grade of the bad speech based on the grade of that keyword, wherein the speech detection result comprises the bad speech and the grade of the bad speech.
CN201510764366.0A 2015-11-10 2015-11-10 Method and apparatus for detecting video Pending CN105389558A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510764366.0A CN105389558A (en) 2015-11-10 2015-11-10 Method and apparatus for detecting video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510764366.0A CN105389558A (en) 2015-11-10 2015-11-10 Method and apparatus for detecting video

Publications (1)

Publication Number Publication Date
CN105389558A true CN105389558A (en) 2016-03-09

Family

ID=55421830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510764366.0A Pending CN105389558A (en) 2015-11-10 2015-11-10 Method and apparatus for detecting video

Country Status (1)

Country Link
CN (1) CN105389558A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819638A (en) * 2010-04-12 2010-09-01 中国科学院计算技术研究所 Method for building a pornography detection model and pornography detection method
CN102236796A (en) * 2011-07-13 2011-11-09 Tcl集团股份有限公司 Method and system for classifying objectionable content in digital video
CN102509084A (en) * 2011-11-18 2012-06-20 中国科学院自动化研究所 Multiple-instance-learning-based method for identifying horror video scenes
CN102930553A (en) * 2011-08-10 2013-02-13 中国移动通信集团上海有限公司 Method and device for identifying objectionable video content
CN103400155A (en) * 2013-06-28 2013-11-20 西安交通大学 Pornographic video detection method based on semi-supervised learning of images
CN103473299A (en) * 2013-09-06 2013-12-25 北京锐安科技有限公司 Method and device for obtaining a website's objectionable-content likelihood

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Dong Shoubin (董守斌) et al.: "Network Information Retrieval" (《网络信息检索》), 30 April 2010 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250837A (en) * 2016-07-27 2016-12-21 腾讯科技(深圳)有限公司 Video recognition method, apparatus, and system
CN106250837B (en) * 2016-07-27 2019-06-18 腾讯科技(深圳)有限公司 Video recognition method, apparatus, and system
WO2018023454A1 (en) * 2016-08-02 2018-02-08 步晓芳 Automatic pornography identification method, and recognition system
WO2018023453A1 (en) * 2016-08-02 2018-02-08 步晓芳 Patent information pushing method performed during automatic pornography identification, and recognition system
WO2018023452A1 (en) * 2016-08-02 2018-02-08 步晓芳 Method for collecting usage condition of adult shot identification technique, and recognition system
CN107784521A (en) * 2017-10-24 2018-03-09 中国移动通信集团公司 Advertisement playing method, apparatus, and storage medium
CN108040262A (en) * 2018-01-25 2018-05-15 湖南机友科技有限公司 Method and apparatus for real-time pornography screening of live audio and video
US10650240B2 (en) 2018-09-19 2020-05-12 International Business Machines Corporation Movie content rating
CN109168024A (en) * 2018-09-26 2019-01-08 平安科技(深圳)有限公司 Target information recognition method and device
CN109168024B (en) * 2018-09-26 2022-05-27 平安科技(深圳)有限公司 Target information recognition method and device
CN109831665A (en) * 2019-01-16 2019-05-31 深圳壹账通智能科技有限公司 Video quality inspection method, system, and terminal device
CN109831665B (en) * 2019-01-16 2022-07-08 深圳壹账通智能科技有限公司 Video quality inspection method, system, and terminal device
CN109889882A (en) * 2019-01-24 2019-06-14 北京亿幕信息技术有限公司 Video clip synthesis method and system
CN109889882B (en) * 2019-01-24 2021-06-18 深圳亿幕信息科技有限公司 Video clip synthesis method and system
CN109934172B (en) * 2019-03-14 2021-10-15 中南大学 GPS-free full-line visual fault detection and localization method for high-speed train pantographs
CN109934172A (en) * 2019-03-14 2019-06-25 中南大学 GPS-free full-line visual fault detection and localization method for high-speed train pantographs
CN110287960B (en) * 2019-07-02 2021-12-10 中国科学院信息工程研究所 Method for detecting and recognizing curved text in natural scene images
CN110287960A (en) * 2019-07-02 2019-09-27 中国科学院信息工程研究所 Method for detecting and recognizing curved text in natural scene images
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short-video violation judgment device and method based on cross-modal recognition technology

Similar Documents

Publication Publication Date Title
CN105389558A (en) Method and apparatus for detecting video
Ahmed et al. Vision based hand gesture recognition using dynamic time warping for Indian sign language
US10679067B2 (en) Method for detecting violent incident in video based on hypergraph transition
CN111191695A (en) Website picture tampering detection method based on deep learning
US20120183212A1 (en) Identifying descriptor for person or object in an image
CN105335726B (en) Recognition of face confidence level acquisition methods and system
CN107909033A Suspect's fast-tracking method based on surveillance video
CN112016605B (en) Target detection method based on corner alignment and boundary matching of bounding box
CN107977656A Pedestrian re-identification method and system
CN103632159B (en) Method and system for training classifier and detecting text area in image
CN106530200A (en) Deep-learning-model-based steganography image detection method and system
CN104268586A (en) Multi-visual-angle action recognition method
CN110070090A Logistics label information detection method and system based on handwriting recognition
CN102254183B (en) Face detection method based on AdaBoost algorithm
CN105389562A (en) Secondary optimization method for monitoring video pedestrian re-identification result based on space-time constraint
Yang et al. Spatiotemporal trident networks: detection and localization of object removal tampering in video passive forensics
CN108537143B Face recognition method and system based on key-region feature comparison
CN105279772A (en) Trackability distinguishing method of infrared sequence image
CN110796101A (en) Face recognition method and system of embedded platform
TW202030683A (en) Method and apparatus for extracting claim settlement information, and electronic device
CN102663777A (en) Target tracking method and system based on multi-view video
CN110879985B (en) Anti-noise data face recognition model training method
Chen et al. A video-based method with strong-robustness for vehicle detection and classification based on static appearance features and motion features
Liu et al. A crack detection system of subway tunnel based on image processing
Andiani et al. Face recognition for work attendance using multitask convolutional neural network (MTCNN) and pre-trained facenet

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160309