WO2014194481A1

WO2014194481A1 - Video classifier construction method with consideration of characteristic reliability

Info

Publication number: WO2014194481A1
Application number: PCT/CN2013/076757
Authority: WO
Inventors: 吴偶; 胡卫明; 祝守宇; 王麒深
Original assignee: 中国科学院自动化研究所
Priority date: 2013-06-05
Filing date: 2013-06-05
Publication date: 2014-12-11

Abstract

The present invention provides a video classifier construction method that takes feature reliability into account, comprising: extracting video features of each video sample in a video sample set to obtain a video feature set; assigning a label to each video sample to indicate whether the video sample belongs to a first category or a second category; performing a reliability evaluation on each video sample to obtain a reliability factor for each video sample; and, based on the video feature set, the label of each video sample and the reliability factor of each video sample, using a weighted support vector machine algorithm to obtain a video classifier. The method is applicable to filtering harmful Internet video and to video supervision services, so as to maintain the safety of content on the Internet.

Description

Video classifier construction method considering feature reliability

The present invention relates to the field of computer application technologies, and in particular to a video classifier construction method that considers feature reliability.

Background art

With the rapid development of Internet technology, multimedia applications continue to emerge: digital libraries, remote education, video on demand, digital video broadcasting, interactive television and the like all generate and use large amounts of multimedia data. Without leaving home, people can learn, access information, and enjoy a variety of entertainment over the Internet. However, alongside information that is useful for work, study and daily life, the openness of the Internet also allows a great deal of harmful information to spread. Harmful information on the Internet has had a serious impact on society, particularly on minors, and its effects have attracted increasing attention worldwide.

Video sites have risen rapidly in recent years. In 2006, YouTube, the largest video website outside China, was acquired by Google for $1.65 billion, and that year came to be called the first year of online video. Over the same period, a large number of video websites appeared in China, such as Youku, Tudou, Cool6 and 56.com, and well-known domestic portals and search engines launched their own video sites. The number of online videos has surged: more and more people upload videos to share them, and more and more people search for videos that interest them. However, the Internet also carries many unhealthy videos, in particular large amounts of violent, horror and pornographic video that are harmful to children's development. These videos need to be identified effectively and then controlled on the basis of the recognition results.

Existing techniques for identifying harmful video on the network fall mainly into two categories. (1) Recognition methods based on single-modal features. These methods mainly extract the visual features of the video and construct a classifier from them. For example, in violent video recognition, common features are video motion vectors, colors, textures and shapes. (2) Recognition methods based on multimodal feature fusion. These methods extract features from several modalities of the video and fuse them to construct a classifier. For example, in violent video recognition, many methods extract audio features such as short-term energy and bursty sounds in addition to visual features. Some methods also consider the text surrounding the network video and extract features from that text for fusion-based recognition. A large body of research and practice has shown that recognition based on multimodal feature fusion outperforms recognition based on single-modal features. Network video data, however, is usually complicated. Across the three modalities of text, vision and audio, some videos have abundant text and others very little; some have high visual quality and others low; some have clear audio signals and others very noisy ones. Features extracted from a poor-quality modality have low reliability and often do not truly reflect the characteristics of the video. Current recognition methods based on multimodal feature fusion do not consider feature reliability, and therefore cannot achieve accurate and reliable video recognition and classification.

Summary of the invention

In view of this, it is a primary object of the present invention to provide a video classifier construction method that takes into account feature reliability.

To achieve the above object, according to an aspect of the present invention, a video classifier construction method that considers video feature reliability includes: extracting video features of each video sample in a video sample set to obtain a video feature set; assigning a label to each video sample to indicate whether the video sample belongs to a first category or a second category; performing a reliability evaluation on each video sample to obtain a reliability factor of the video sample; and, based on the video feature set, the label of each video sample, and the reliability factor of each video sample, using a weighted support vector machine algorithm to obtain a video classifier.

Optionally, each video sample includes a video and text surrounding the video.

Optionally, the video features include visual features, audio features, and text features.

Optionally, performing reliability assessment for each video sample includes separately performing a reliability assessment on visual information, audio information, and text information for each sample.

Optionally, the reliability factor includes: a visual feature reliability factor, obtained by performing a reliability evaluation on the visual information; an audio feature reliability factor, obtained by performing a reliability evaluation on the audio information; and a text feature reliability factor, obtained by performing a reliability evaluation on the text information.

Optionally, the first category is a harmful video and the second category is a normal video.

Optionally, performing the reliability evaluation on the visual information of each video sample comprises: evaluating the visual information of each video sample with a non-reference video objective quality assessment method to obtain an evaluation value; determining the maximum evaluation value over the visual information of all video samples; and dividing the evaluation value of the visual information of each video sample by the maximum evaluation value to obtain the visual feature reliability factor of each video sample.

Optionally, the non-reference video objective quality assessment method includes a method based on the peak signal-to-noise ratio metric or a measurement algorithm based on block effects.

Optionally, performing the reliability evaluation on the audio information of each video sample comprises: evaluating the audio information of each video sample with an audio objective quality assessment method to obtain an evaluation value; determining the maximum evaluation value over the audio information of all video samples; and dividing the evaluation value of the audio information of each video sample by the maximum evaluation value to obtain the audio feature reliability factor of each video sample.

Optionally, the audio objective quality assessment method comprises: a Bark spectral distortion measure, a normalized block measure, or a perceptual analysis measure.

Optionally, performing the reliability evaluation on the text information of each video sample includes: counting the total number of words $l_1$ of the text and the average number of words per sentence $l_2$; and calculating the text feature reliability factor $r_t$ by:

$$r_t = 0.5\,\min(1,\ l_1/200) + 0.5\,\min(1,\ l_2/20).$$

Optionally, using the weighted support vector machine algorithm to obtain the video classifier based on the video feature set, the label of each video sample, and the reliability factor of each video sample includes: representing the video feature set as

$$\{(x_{v1}, x_{a1}, x_{t1}), \ldots, (x_{vi}, x_{ai}, x_{ti}), \ldots, (x_{vN}, x_{aN}, x_{tN})\},$$

where $x_{vi}$ is the visual feature of the $i$-th video sample, $x_{ai}$ is the audio feature of the $i$-th video sample, $x_{ti}$ is the text feature of the $i$-th video sample, and $N$ is the total number of video samples; $y_i$ denotes the label of the $i$-th video sample, with $y_i = 1$ when the $i$-th video sample belongs to the first category and $y_i = -1$ when it belongs to the second category; $r_{vi}$ denotes the visual feature reliability factor of the $i$-th video sample, $r_{ai}$ its audio feature reliability factor, and $r_{ti}$ its text feature reliability factor; and $s_i$ denotes $r_{vi} + r_{ai} + r_{ti}$. The video classifier is obtained by solving

$$\min_{w_v, w_a, w_t, b_v, b_a, b_t, \xi}\ \frac{1}{2}\left(\|w_v\|^2 + \|w_a\|^2 + \|w_t\|^2\right) + C\sum_{i=1}^{N}\xi_i$$

$$\text{s.t.}\ \forall i:\ y_i\left[\frac{r_{vi}}{s_i}\left(w_v^T x_{vi} + b_v\right) + \frac{r_{ai}}{s_i}\left(w_a^T x_{ai} + b_a\right) + \frac{r_{ti}}{s_i}\left(w_t^T x_{ti} + b_t\right)\right] \ge 1 - \xi_i,$$

$$\xi_i \ge 0,$$

where $w_v, w_a, w_t, b_v, b_a, b_t$ are the video classifier parameters, $\xi_i$ is the relaxation (slack) factor, and $C$ is the balance factor. In the process of solving, the cross-validation method is used to select $C$.
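The text states only that $C$ is chosen by cross-validation and does not give a procedure. One possible k-fold search is sketched below in Python. The function names, the contiguous fold layout, and the `train` and `accuracy` callables are assumptions introduced for this illustration, not part of the described method, and the samples are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def select_C(samples, candidates, train, accuracy, k=5):
    """Return the candidate C with the highest mean held-out accuracy.

    `train(train_set, C)` must return a model and `accuracy(model, test_set)`
    a number; both are hypothetical hooks supplied by the caller.
    """
    folds = k_fold_indices(len(samples), k)
    best_C, best_score = None, float("-inf")
    for C in candidates:
        scores = []
        for held_out in folds:
            held = set(held_out)
            train_set = [s for i, s in enumerate(samples) if i not in held]
            test_set = [samples[i] for i in held_out]
            model = train(train_set, C)
            scores.append(accuracy(model, test_set))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_C, best_score = C, mean_score
    return best_C
```

A typical call would pass a small grid such as `[0.01, 0.1, 1, 10, 100]` as `candidates`; the grid itself is also an assumption, since the text gives no candidate values.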

Optionally, the method according to the embodiment of the present invention further includes: extracting visual features, audio features, and text features from the video to be classified and obtaining the corresponding visual feature reliability factor, audio feature reliability factor, and text feature reliability factor; and, using the video classifier parameters $w_v, w_a, w_t, b_v, b_a, b_t$, calculating

$$s = r_v + r_a + r_t,$$

$$y = \frac{r_v}{s}\left(w_v^T x_v + b_v\right) + \frac{r_a}{s}\left(w_a^T x_a + b_a\right) + \frac{r_t}{s}\left(w_t^T x_t + b_t\right),$$

where $x_v$ denotes the visual feature of the video to be classified, $x_a$ its audio feature, and $x_t$ its text feature; $r_v$ denotes the visual feature reliability factor of the video to be classified, $r_a$ its audio feature reliability factor, and $r_t$ its text feature reliability factor. If $y > 0$, the network video sample is determined to belong to the first category; otherwise it is determined to belong to the second category.

As can be seen from the above technical solutions, the present invention has the following advantages:

1. The video classifier construction method considering feature reliability provided by the present invention can classify video accurately and reliably, for example to identify harmful video on the network. The present invention analyzes the reliability of the extracted features based on the characteristics of the network video samples and incorporates these reliability factors into the construction of the network harmful video classifier. Network video samples are complicated. Across the three modalities of text, vision and audio, some videos have abundant text and others very little; some have high visual quality and others very low quality with heavy noise; some have clear audio signals and others heavily distorted ones. These factors clearly affect the reliability of the extracted features. Existing methods for constructing network harmful video classifiers based on multimodal feature fusion do not consider these practical factors. The invention calculates the reliability of the features of each modality from the characteristics of that modality's information itself, so the constructed classifier matches the characteristics of network video better than classifiers constructed by existing methods.

2. The weighted support vector machine algorithm proposed by the present invention effectively integrates the three feature reliability factors of each network video sample, so that when the trained classifier identifies a network video sample it fuses information adaptively according to that sample's three feature reliability factors, which is more reasonable.

Brief description of the drawings

FIG. 1 shows a flow chart of a video classifier construction method considering feature reliability according to an embodiment of the present invention;

FIG. 2 shows the operation of the video classification method according to an embodiment of the present invention.

Detailed description

The present invention will be further described in detail below with reference to the specific embodiments of the invention.

The execution environment of the present invention is a Pentium 4 computer with a 3.0 GHz central processing unit and 2 GB of memory. A network harmful video classifier construction program was written in C++, implementing the video classifier construction method considering feature reliability of the present invention. The invention may also be implemented in other computing environments, which are not described further here.

FIG. 1 is a flowchart of a method for constructing a video classifier considering feature reliability according to the present invention, and the steps are as follows:

At step 101, video features of each video sample in the video sample set are extracted to obtain a video feature set. Optionally, each video sample includes a video and text surrounding the video. A computer can be used to collect network video and text around each network video to form a network video sample set. This video sample set can also be provided in other ways.

According to an embodiment of the invention, the video features may include visual features, audio features, and text features. Which features are selected depends on the specific category of the video. Taking violent video as an example: in visual feature extraction, features such as motion vectors, colors, textures and shapes are extracted; in audio feature extraction, audio features related to violence are extracted, such as short-term energy, zero-crossing rate and pitch period; and in text feature extraction, conventional text feature extraction algorithms are used, such as document frequency, information gain and mutual information.
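Two of the audio features named above, short-term energy and zero-crossing rate, can be illustrated with a small Python sketch. The text does not define either feature precisely, so the definitions used here (mean squared amplitude per frame, and the fraction of adjacent sample pairs whose sign differs) are common conventions adopted as assumptions, and the function names and framing parameters are hypothetical.

```python
def frames(signal, frame_len, hop):
    """Yield consecutive frames of `frame_len` samples, advancing by `hop`."""
    for start in range(0, len(signal) - frame_len + 1, hop):
        yield signal[start:start + frame_len]


def short_term_energy(frame):
    """Mean squared amplitude of one frame (assumed definition)."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (assumed definition)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)
```

Per-frame values would then typically be summarised (for example by their mean and variance) into the audio feature vector of a sample; that aggregation step is likewise not specified in the text.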

At step 102, each video sample is assigned a label corresponding to its category to indicate that the video sample belongs to the first category or the second category. For example, the first category can be a harmful (e.g., containing violent content) category, and the second category can be a normal category. According to an embodiment of the present invention, it is possible to manually identify whether a video is harmful, and then assign a label to the video sample accordingly. Alternatively, existing harmful video sample sets and normal video sample sets can also be utilized and tagged in batch mode.

At step 103, a reliability evaluation is performed for each video sample to obtain a reliability factor for the video sample. The reliability factor may represent how reliable a video feature is when it is used to classify the video. According to an embodiment of the present invention, the reliability factor includes: a visual feature reliability factor, obtained by performing a reliability evaluation on the visual information; an audio feature reliability factor, obtained by performing a reliability evaluation on the audio information; and a text feature reliability factor, obtained by performing a reliability evaluation on the text information.

At step 104, a video classifier is obtained using a weighted support vector machine algorithm based on the video feature set, the label of each video sample, and the reliability factor of each video sample.

Optionally, the method may further include: extracting visual features, audio features, and text features from the video to be classified and obtaining the corresponding visual feature reliability factor, audio feature reliability factor, and text feature reliability factor; and using the video classifier to classify the video to be classified into the first category or the second category.

It should be noted that the above numbering of the steps is for illustrative purposes only and does not limit the order in which the steps are executed. Without departing from the spirit and scope of the invention, the order of execution of the steps may be changed, an individual step may be separated into multiple steps, multiple steps may be combined into a single step, or a portion of one step may be combined with another step or with a portion of another step and performed as a single step. The present invention explicitly contemplates these cases, and they are included in the scope of the present invention.

According to an embodiment of the present invention, optionally, in step 103, performing the reliability evaluation on the visual information of each video sample comprises: evaluating the visual information of each video sample with a non-reference video objective quality assessment method to obtain an evaluation value; determining the maximum evaluation value over the visual information of all video samples; and dividing the evaluation value of the visual information of each video sample by the maximum evaluation value to obtain the visual feature reliability factor of each video sample, wherein the value of the visual feature reliability factor is between 0 and 1, and the larger the value, the higher the reliability of the visual feature.

Optionally, the non-reference video objective quality assessment method includes a method based on the peak signal-to-noise ratio metric or a measurement algorithm based on block effects.

Optionally, performing the reliability evaluation on the audio information of each video sample comprises: evaluating the audio information of each video sample with an audio objective quality assessment method to obtain an evaluation value; determining the maximum evaluation value over the audio information of all video samples; and dividing the evaluation value of the audio information of each video sample by the maximum evaluation value to obtain the audio feature reliability factor of each video sample, wherein the value of the audio feature reliability factor is between 0 and 1, and the larger the value, the higher the reliability of the audio feature.
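The same max-normalization is used for both the visual and the audio reliability factors, and it can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that the quality scores are non-negative with larger values meaning better quality, which the text implies but does not state outright.

```python
def reliability_factors(quality_scores):
    """Divide each sample's quality score by the largest score in the set.

    Assumes non-negative scores where larger means better quality, so the
    results lie in [0, 1] and the best sample receives a factor of 1.
    """
    max_score = max(quality_scores)
    if max_score <= 0:
        raise ValueError("expected at least one positive quality score")
    return [score / max_score for score in quality_scores]
```

Because the factor is relative to the best sample in the set, adding or removing samples can change every factor; a video to be classified later would need to be scored against the same maximum, which is a detail the text leaves open.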

Optionally, performing the reliability evaluation on the text information of each video sample includes: counting the total number of words $l_1$ of the text and the average number of words per sentence $l_2$; and calculating the text feature reliability factor $r_t$ by the following formula:

$$r_t = 0.5\,\min(1,\ l_1/200) + 0.5\,\min(1,\ l_2/20),$$

where the value of the reliability factor is between 0 and 1, and the larger the value, the higher the reliability of the text feature.
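The text reliability formula can be written directly in Python. In the sketch below only the formula itself comes from the text; the helper that counts words and sentences is an assumption (whitespace-separated words, sentences ending in `.`, `!` or `?`) and would need a proper tokenizer for languages such as Chinese.

```python
import re


def text_reliability(total_words, avg_words_per_sentence):
    """Text feature reliability factor computed from the two word counts."""
    return (0.5 * min(1.0, total_words / 200)
            + 0.5 * min(1.0, avg_words_per_sentence / 20))


def count_words(text):
    """Return (total words, average words per sentence) for `text`.

    Assumed tokenization: sentences split on '.', '!' or '?', words on
    whitespace.
    """
    sentences = [s.split() for s in re.split(r"[.!?]+", text) if s.strip()]
    total = sum(len(words) for words in sentences)
    average = total / len(sentences) if sentences else 0.0
    return total, average
```

For example, a text of 100 words averaging 10 words per sentence gives 0.5 * 0.5 + 0.5 * 0.5 = 0.5, while any text with at least 200 words and at least 20 words per sentence saturates at 1.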

According to an embodiment of the present invention, optionally, in step 104, using the weighted support vector machine algorithm to obtain the video classifier based on the video feature set, the label of each video sample, and the reliability factor of each video sample includes: representing the video feature set as

$$\{(x_{v1}, x_{a1}, x_{t1}), \ldots, (x_{vi}, x_{ai}, x_{ti}), \ldots, (x_{vN}, x_{aN}, x_{tN})\},$$

where $x_{vi}$ is the visual feature of the $i$-th video sample, $x_{ai}$ is the audio feature of the $i$-th video sample, $x_{ti}$ is the text feature of the $i$-th video sample, and $N$ is the total number of video samples; $y_i$ denotes the label of the $i$-th video sample, with $y_i = 1$ when the $i$-th video sample belongs to the first category and $y_i = -1$ when it belongs to the second category; $r_{vi}$ denotes the visual feature reliability factor of the $i$-th video sample, $r_{ai}$ its audio feature reliability factor, and $r_{ti}$ its text feature reliability factor; and $s_i$ denotes $r_{vi} + r_{ai} + r_{ti}$. The video classifier is obtained by solving

$$\min_{w_v, w_a, w_t, b_v, b_a, b_t, \xi}\ \frac{1}{2}\left(\|w_v\|^2 + \|w_a\|^2 + \|w_t\|^2\right) + C\sum_{i=1}^{N}\xi_i$$

$$\text{s.t.}\ \forall i:\ y_i\left[\frac{r_{vi}}{s_i}\left(w_v^T x_{vi} + b_v\right) + \frac{r_{ai}}{s_i}\left(w_a^T x_{ai} + b_a\right) + \frac{r_{ti}}{s_i}\left(w_t^T x_{ti} + b_t\right)\right] \ge 1 - \xi_i,$$

$$\xi_i \ge 0,$$

where $w_v, w_a, w_t, b_v, b_a, b_t$ are the video classifier parameters, $\xi_i$ is the relaxation (slack) factor, and $C$ is the balance factor. In the process of solving, $C$ can be selected by cross-validation.
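The text does not say how the weighted support vector machine problem is solved. As one illustration, the equivalent unconstrained hinge-loss form, with a conventional factor of one half on the squared-norm regularizer, can be minimized by batch subgradient descent, sketched below in pure Python. The solver choice, the regularizer scaling, the sample layout (dictionaries keyed by `"v"`, `"a"`, `"t"`), the step size and the epoch count are all assumptions of this sketch; a quadratic programming solver would be the more usual route. Each sample is assumed to have at least one positive reliability factor.

```python
MODALITIES = ("v", "a", "t")  # visual, audio, text


def decision_value(params, sample):
    """Reliability-weighted sum of the three per-modality linear scores."""
    s = sum(sample["r"].values())
    total = 0.0
    for m in MODALITIES:
        w, b = params["w"][m], params["b"][m]
        score = sum(wj * xj for wj, xj in zip(w, sample["x"][m])) + b
        total += sample["r"][m] / s * score
    return total


def train_weighted_svm(samples, C=1.0, lr=0.01, epochs=200):
    """Batch subgradient descent on 0.5 * sum ||w_m||^2 + C * sum hinge loss."""
    params = {
        "w": {m: [0.0] * len(samples[0]["x"][m]) for m in MODALITIES},
        "b": {m: 0.0 for m in MODALITIES},
    }
    for _ in range(epochs):
        # The regularizer contributes w itself to the subgradient of each w.
        grad_w = {m: list(params["w"][m]) for m in MODALITIES}
        grad_b = {m: 0.0 for m in MODALITIES}
        for smp in samples:
            if smp["y"] * decision_value(params, smp) < 1.0:  # hinge is active
                s = sum(smp["r"].values())
                for m in MODALITIES:
                    coef = C * smp["y"] * smp["r"][m] / s
                    for j, xj in enumerate(smp["x"][m]):
                        grad_w[m][j] -= coef * xj
                    grad_b[m] -= coef
        for m in MODALITIES:
            params["w"][m] = [
                wj - lr * gj for wj, gj in zip(params["w"][m], grad_w[m])
            ]
            params["b"][m] -= lr * grad_b[m]
    return params
```

A sample is passed as, for example, `{"x": {"v": [...], "a": [...], "t": [...]}, "r": {"v": 0.9, "a": 0.4, "t": 0.7}, "y": 1}`. Because each modality's score is scaled by its share of the total reliability, a low-quality modality contributes less both to the margin constraint and to the gradient for that sample.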

According to an embodiment of the present invention, optionally, visual features, audio features and text features are extracted from the video to be classified, and the corresponding visual feature reliability factor, audio feature reliability factor and text feature reliability factor are obtained, using the same methods described above for extracting features and obtaining reliability factors from the video samples; the specific process is not repeated here.

According to an embodiment of the present invention, optionally, classifying the video to be classified by using the video classifier comprises: calculating, from the video classifier parameters $w_v, w_a, w_t, b_v, b_a, b_t$ obtained above,

$$s = r_v + r_a + r_t,$$

$$y = \frac{r_v}{s}\left(w_v^T x_v + b_v\right) + \frac{r_a}{s}\left(w_a^T x_a + b_a\right) + \frac{r_t}{s}\left(w_t^T x_t + b_t\right),$$

where $x_v$ denotes the visual feature of the video to be classified, $x_a$ its audio feature, and $x_t$ its text feature; $r_v$ denotes the visual feature reliability factor of the video to be classified, $r_a$ its audio feature reliability factor, and $r_t$ its text feature reliability factor. If $y > 0$, the network video sample is determined to belong to the first category; otherwise it is determined to belong to the second category.
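The decision rule can be sketched in Python as follows. The function name, the dictionary layout keyed by `"v"`, `"a"` and `"t"`, and the return convention (the score together with a category number 1 or 2) are assumptions made for this illustration.

```python
def classify(features, reliabilities, weights, biases):
    """Return (y, category) for one video to be classified.

    All four arguments are dicts keyed by "v" (visual), "a" (audio) and
    "t" (text). Category 1 is returned when y > 0, otherwise category 2.
    """
    s = sum(reliabilities.values())
    if s <= 0:
        raise ValueError("at least one reliability factor must be positive")
    y = 0.0
    for m in ("v", "a", "t"):
        score = sum(w * x for w, x in zip(weights[m], features[m])) + biases[m]
        y += reliabilities[m] / s * score
    return y, (1 if y > 0 else 2)
```

With unit weights, zero biases, and features 4, -2 and -2 for the visual, audio and text modalities, reliability factors (1.0, 0.5, 0.5) give y = 0.5 * 4 - 0.25 * 2 - 0.25 * 2 = 1 and hence the first category, whereas factors (0.2, 0.9, 0.9) shift the weight to the audio and text scores and give a negative y, hence the second category.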

When an embodiment according to the present invention is applied to the identification of harmful video of the network, the network video and the text surrounding each network video may be collected to constitute the above video sample set, and the first category is harmful video, and the second category is normal. video.

FIG. 2 illustrates the operation of a video classification method in accordance with an embodiment of the present invention. As shown in FIG. 2, video sample set 201 includes N video samples. According to an embodiment of the invention, each video sample may include a video and the text surrounding the video. When applied to network harmful video recognition, the video sample set can be collected from the network. Video features are extracted from each video sample $i$ ($i = 1, 2, \ldots, N$) to obtain a video feature set 202. According to an embodiment of the invention, the video features may include a visual feature $x_{vi}$, an audio feature $x_{ai}$ and a text feature $x_{ti}$.

Each video sample is given a tag 203 corresponding to its category to indicate that it belongs to the first category or the second category. For example, it is possible to manually identify whether a video is harmful, and then assign a label to the video sample accordingly. Alternatively, existing sets of harmful video samples and normal video samples can also be utilized and tagged in batch mode.

For each video sample, the reliability factor 204 is calculated in the manner described above: a reliability evaluation related to visual quality is performed on the visual information to obtain the visual feature reliability factor $r_{vi}$; a reliability evaluation related to audio quality is performed on the audio information to obtain the audio feature reliability factor $r_{ai}$; and a reliability evaluation related to the total number of words and the average number of words per sentence is performed on the text information to obtain the text feature reliability factor $r_{ti}$.

Video classifier 206 is obtained using weighted support vector machine algorithm 205 based on video feature set 202, tag 203 for each video sample, and video feature reliability factor 204 for each video sample.

For a video to be classified, its video features ($x_v$, $x_a$, $x_t$) and reliability factors ($r_v$, $r_a$, $r_t$) are calculated in the same manner as the video features and reliability factors of each video sample, and the video is then classified by the video classifier 206.

Although the present invention has been described above for network video classification, the present invention is not limited to application to network video, but can be applied to various video classifications including visual, audio, and text information. The invention is also not limited to the identification of harmful video, but can be applied to identify various videos containing specific features.

The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any alteration or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention should be construed as being included in the scope of the invention.

Claims


1. A video classifier construction method that considers the reliability of video features, comprising:

extracting the video features of each video sample in a video sample set to obtain a video feature set; assigning a label to each video sample to indicate whether the video sample belongs to a first category or a second category; performing a reliability evaluation on each video sample to obtain a reliability factor of the video sample; and, based on the video feature set, the label of each video sample and the reliability factor of each video sample, using a weighted support vector machine algorithm to obtain a video classifier.

2. The method of claim 1, wherein each video sample includes a video and text surrounding the video.

3. The method according to claim 2, wherein the video features include visual features, audio features and text features.

4. The method according to claim 3, wherein conducting a reliability assessment on each video sample includes separately conducting a reliability assessment on the visual information, audio information and text information of each sample.

5. The method according to claim 4, wherein the reliability factor includes:

The visual feature reliability factor is obtained by evaluating the reliability of visual information to obtain the visual feature reliability factor;

The audio feature reliability factor is obtained by conducting a reliability assessment on audio information to obtain the audio feature reliability factor; and

The text feature reliability factor is obtained by evaluating the reliability of the text information.

6. The method according to claim 1, wherein the first category is harmful videos and the second category is normal videos.

7. The method of claim 5, wherein assessing the reliability of the visual information of each video sample includes:

Use the non-reference video objective quality assessment method to evaluate the visual information of each video sample and obtain an evaluation value;

Determine the maximum evaluation value of visual information for all video samples; and

The evaluation value of the visual information of each video sample is divided by the maximum evaluation value to obtain the visual feature reliability factor of each video sample.

8. The method according to claim 7, wherein the non-reference video objective quality assessment method includes a method based on the peak signal-to-noise ratio metric or a measurement algorithm based on block effects.

9. The method of claim 5, wherein assessing the reliability of the audio information of each video sample includes:

Use the objective audio quality assessment method to evaluate the audio information of each video sample to obtain an evaluation value;

Determine the maximum evaluation value of audio information for all video samples; and

Divide the evaluation value of the audio information of each video sample by the maximum evaluation value to obtain the audio feature reliability factor of each video sample.

10. The method according to claim 9, wherein the audio objective quality assessment method includes: a Bark spectral distortion measure, a normalized block measure, or a perceptual analysis measure.

11. The method according to claim 5, wherein the reliability assessment of the text information of each video sample includes:

counting the total number of words $l_1$ of the text and the average number of words per sentence $l_2$; and

calculating the text feature reliability factor $r_t$ by the following formula:

$$r_t = 0.5\,\min(1,\ l_1/200) + 0.5\,\min(1,\ l_2/20).$$

12. The method according to claim 5, wherein based on the video feature set, the label of each video sample and the reliability factor of each video sample, the weighted support vector machine algorithm is used to obtain the video classifier including:

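representing the video feature set as $\{(x_{v1}, x_{a1}, x_{t1}), \ldots, (x_{vi}, x_{ai}, x_{ti}), \ldots, (x_{vN}, x_{aN}, x_{tN})\}$, where $x_{vi}$ is the visual feature of the $i$-th video sample, $x_{ai}$ is the audio feature of the $i$-th video sample, $x_{ti}$ is the text feature of the $i$-th video sample, and $N$ represents the total number of video samples;

the label of the $i$-th video sample is denoted $y_i$, where $y_i = 1$ when the $i$-th video sample belongs to the first category and $y_i = -1$ when the $i$-th video sample belongs to the second category;

$r_{vi}$ represents the visual feature reliability factor of the $i$-th video sample, $r_{ai}$ represents the audio feature reliability factor of the $i$-th video sample, and $r_{ti}$ represents the text feature reliability factor of the $i$-th video sample;

$s_i$ denotes $r_{vi} + r_{ai} + r_{ti}$, and the parameters of the video classifier are obtained by solving the following problem:

$$\min_{w_v, w_a, w_t, b_v, b_a, b_t, \xi}\ \frac{1}{2}\left(\|w_v\|^2 + \|w_a\|^2 + \|w_t\|^2\right) + C\sum_{i=1}^{N}\xi_i$$

$$\text{s.t.}\ \forall i:\ y_i\left[\frac{r_{vi}}{s_i}\left(w_v^T x_{vi} + b_v\right) + \frac{r_{ai}}{s_i}\left(w_a^T x_{ai} + b_a\right) + \frac{r_{ti}}{s_i}\left(w_t^T x_{ti} + b_t\right)\right] \ge 1 - \xi_i,$$

$$\xi_i \ge 0,$$

where $w_v, w_a, w_t, b_v, b_a, b_t$ are the video classifier parameters, $\xi_i$ is the relaxation (slack) factor, and $C$ is the balance factor; in the solution process, $C$ is selected through the cross-validation method.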

13. The method of claim 12, further comprising:

Extract visual features, audio features and text features from the video to be classified and obtain the corresponding visual feature reliability factors, audio feature reliability factors and text feature reliability factors; and

calculating, according to the video classifier parameters $w_v, w_a, w_t, b_v, b_a, b_t$,

$$s = r_v + r_a + r_t,$$

$$y = \frac{r_v}{s}\left(w_v^T x_v + b_v\right) + \frac{r_a}{s}\left(w_a^T x_a + b_a\right) + \frac{r_t}{s}\left(w_t^T x_t + b_t\right),$$

where $x_v$ represents the visual feature of the video to be classified, $x_a$ represents the audio feature of the video to be classified, $x_t$ represents the text feature of the video to be classified, $r_v$ represents the visual feature reliability factor of the video to be classified, $r_a$ represents the audio feature reliability factor of the video to be classified, and $r_t$ represents the text feature reliability factor of the video to be classified; if $y > 0$, the network video sample is judged to belong to the first category, otherwise it is judged to belong to the second category.