WO2014194481A1 - Video classifier construction method with consideration of characteristic reliability - Google Patents

Video classifier construction method with consideration of characteristic reliability

Info

Publication number
WO2014194481A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
reliability
feature
sample
audio
Prior art date
Application number
PCT/CN2013/076757
Other languages
French (fr)
Chinese (zh)
Inventor
吴偶
胡卫明
祝守宇
王麒深
Original Assignee
中国科学院自动化研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院自动化研究所
Priority to PCT/CN2013/076757
Publication of WO2014194481A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N21/63 Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/647 Control signaling between network components and server or clients; Network processes for video distribution between server and clients, e.g. controlling the quality of the video stream, by dropping packets, protecting content from unauthorised alteration within the network, monitoring of network load, bridging between two different networks, e.g. between IP and wireless
    • H04N21/64784 Data processing by the network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N21/61 Network physical structure; Signal processing
    • H04N21/6106 Network physical structure; Signal processing specially adapted to the downstream path of the transmission network
    • H04N21/6125 Network physical structure; Signal processing specially adapted to the downstream path of the transmission network involving transmission via Internet

Definitions

  • The present invention relates to the field of computer application technologies, and in particular to a video classifier construction method that considers feature reliability.
  • Video websites have proliferated in recent years. In 2006, YouTube, the largest video website outside China, was acquired by Google for $1.65 billion, and that year has been called the first year of online video. Over the same period a large number of video websites appeared in China, such as Youku, Tudou, Ku6, and 56.com, and well-known Chinese portals and search engines launched video sites of their own. The number of online videos has grown explosively: more and more people upload videos to share with others, and more people search for videos that interest them. However, the Internet also carries a large volume of unhealthy video; violent, horror, and pornographic videos in particular are harmful to children's development. Such videos need to be identified effectively so that they can be controlled on the basis of the recognition results.
  • Recognition methods based on single-modal features: methods of this type mainly extract the visual features of a video and construct a classifier from them. In violent video recognition, for example, common features are video motion vectors, colors, textures, and shapes.
  • Recognition methods based on multimodal feature fusion: methods of this type extract features from several modalities of a video and fuse them to construct a classifier. In violent video recognition, for example, many methods extract audio features in addition to visual features, including short-time energy and sudden sounds. Some methods also consider the text surrounding a network video and extract further features from that text for fused recognition.
  • A video classifier construction method that considers video feature reliability comprises: extracting the video features of each video sample in a video sample set to obtain a video feature set; assigning each video sample a label indicating whether it belongs to a first category or a second category; performing a reliability evaluation on each video sample to obtain a reliability factor for that sample; and obtaining a video classifier with a weighted support vector machine algorithm from the video feature set, the label of each video sample, and the reliability factor of each video sample.
  • Each video sample includes a video and the text surrounding that video.
  • The video features include visual features, audio features, and text features.
  • Performing the reliability evaluation on each video sample includes evaluating the reliability of the sample's visual information, audio information, and text information separately.
  • The reliability factor includes: a visual feature reliability factor, obtained by evaluating the reliability of the visual information; an audio feature reliability factor, obtained by evaluating the reliability of the audio information; and a text feature reliability factor, obtained by evaluating the reliability of the text information.
  • The first category is harmful video and the second category is normal video.
  • Evaluating the reliability of the visual information of each video sample comprises: evaluating the visual information of each video sample with a no-reference objective video quality assessment method to obtain an evaluation value; determining the maximum evaluation value over the visual information of all video samples; and dividing the evaluation value of each video sample's visual information by that maximum to obtain the visual feature reliability factor of each video sample.
  • The no-reference objective video quality assessment method includes a method based on the peak signal-to-noise ratio metric or a measurement algorithm based on blocking artifacts.
  • Evaluating the reliability of the audio information of each video sample comprises: evaluating the audio information of each video sample with an objective audio quality assessment method to obtain an evaluation value; determining the maximum evaluation value over the audio information of all video samples; and dividing the evaluation value of each video sample's audio information by that maximum to obtain the audio feature reliability factor of each video sample.
  • The objective audio quality assessment method includes the Bark spectral distortion measure, the normalized block measure, or the perceptual analysis measure.
  • Evaluating the reliability of the text information of each video sample includes: counting the total number of words in the text and the average number of words per sentence; and calculating the text feature reliability factor $r_t$ by:
  • $w_v$, $w_a$, $w_t$, $b_v$, $b_a$, and $b_t$ are the video classifier parameters; the remaining symbols are the slack (relaxation) factor and the balance factor $C$.
  • Cross-validation is used to select $C$.
  • The method according to the embodiment of the present invention further includes: extracting visual features, audio features, and text features from the video to be classified and obtaining the corresponding visual, audio, and text feature reliability factors; and calculating a decision value $y$ from the video classifier parameters $w_v$, $w_a$, $w_t$, $b_v$, $b_a$, and $b_t$, where $x_v$, $x_a$, and $x_t$ denote the visual, audio, and text features of the video to be classified, and $r_v$, $r_a$, and $r_t$ denote its visual, audio, and text feature reliability factors. If $y > 0$, the network video sample is assigned to the first category; otherwise it is assigned to the second category.
  • The present invention has the following advantages:
  • The video classifier construction method considering feature reliability provided by the present invention can classify video accurately and reliably, for example to identify harmful video on the network.
  • The present invention can analyze the reliability of the extracted features from the characteristics of the network video samples and incorporate the resulting reliability factors into the construction of the network harmful video classifier.
  • Network video samples are relatively complex. Across the three modalities of text, vision, and audio, some videos are surrounded by rich text while others have very little; some videos have high visual quality while others have very low quality or heavy noise; some audio signals are very clear while others are badly distorted. These factors clearly affect the reliability of the extracted features.
  • Existing methods for constructing network harmful video classifiers based on multimodal feature fusion do not take these practical factors into account.
  • The invention calculates the reliability of each modality's features from the characteristics of that modality's own information, so the constructed classifier matches the characteristics of network video better than classifiers constructed by existing methods.
  • The weighted support vector machine algorithm proposed by the present invention effectively integrates the three feature reliability factors of each network video sample, so that the trained classifier, when identifying a network video sample, fuses information adaptively according to that sample's three feature reliability factors, which is more reasonable.
  • FIG. 1 shows a flow chart of a video classifier construction method considering feature reliability according to an embodiment of the present invention.
  • FIG. 2 shows the operation of the video classification method according to an embodiment of the present invention.
  • The execution environment of the present invention is a Pentium 4 computer with a 3.0 GHz central processing unit and 2 GB of memory; a network harmful video classifier construction program written in C++ implements the video classifier construction method considering feature reliability of the present invention.
  • The invention may also be implemented in other computer environments, which are not described here.
  • FIG. 1 is a flowchart of a method for constructing a video classifier considering feature reliability according to the present invention, and the steps are as follows:
  • Each video sample in the video sample set includes a video and the text surrounding that video.
  • A computer can be used to collect network videos and the text around each network video to form a network video sample set. The video sample set can also be provided in other ways.
  • The video features may include visual features, audio features, and text features. Which features are selected depends on the specific category of video. Violent video is used below as an example to illustrate which features are extracted.
  • Visual feature extraction: features such as motion vectors, colors, textures, and shapes are extracted.
  • Audio feature extraction: audio features related to violence are mainly extracted, such as short-time energy, zero-crossing rate, and pitch period.
  • Text feature extraction: features are mainly extracted with conventional text feature extraction algorithms such as document frequency, information gain, and mutual information.
  • Each video sample is assigned a label corresponding to its category, indicating whether the sample belongs to the first category or the second category.
  • The first category can be a harmful category (e.g., containing violent content), and the second category can be a normal category.
  • Existing harmful video sample sets and normal video sample sets can also be used and labeled in batches.
  • A reliability evaluation is performed on each video sample to obtain the reliability factor of the video sample.
  • The reliability factor may represent how reliable a video feature is when it is used to classify the video.
  • The reliability factor includes: a visual feature reliability factor, obtained by evaluating the reliability of the visual information; an audio feature reliability factor, obtained by evaluating the reliability of the audio information; and a text feature reliability factor, obtained by evaluating the reliability of the text information.
  • A video classifier is obtained with a weighted support vector machine algorithm from the video feature set, the label of each video sample, and the reliability factor of each video sample.
  • The method may further include: extracting visual features, audio features, and text features from a video to be classified and obtaining the corresponding visual, audio, and text feature reliability factors; and using the video classifier to classify that video into the first category or the second category.
  • The steps are presented for illustrative purposes only, and their presentation does not limit the order in which they are executed. Without departing from the spirit and scope of the invention, the order of execution may be changed, an individual step may be split into multiple steps, multiple steps may be combined into a single step, or part of one step may be combined with another step or with part of another step.
  • The present invention explicitly contemplates these variations, and they are included in its scope.
  • Evaluating the reliability of the visual information of each video sample comprises: evaluating the visual information of each video sample with a no-reference objective video quality assessment method to obtain an evaluation value; determining the maximum evaluation value over the visual information of all video samples; and dividing the evaluation value of each video sample's visual information by that maximum to obtain the visual feature reliability factor of each video sample. The visual feature reliability factor lies between 0 and 1, and a larger value indicates more reliable visual features.
  • The no-reference objective video quality assessment method includes a method based on the peak signal-to-noise ratio metric or a measurement algorithm based on blocking artifacts.
  • Evaluating the reliability of the audio information of each video sample comprises: evaluating the audio information of each video sample with an objective audio quality assessment method to obtain an evaluation value; determining the maximum evaluation value over the audio information of all video samples; and dividing the evaluation value of each video sample's audio information by that maximum to obtain the audio feature reliability factor of each video sample. The audio feature reliability factor lies between 0 and 1, and a larger value indicates more reliable audio features.
  • The objective audio quality assessment method includes the Bark spectral distortion measure, the normalized block measure, or the perceptual analysis measure.
  • The video classifier is obtained with a weighted support vector machine algorithm from the video feature set, the label of each video sample, and the reliability factor of each video sample.
  • $w_v$, $w_a$, $w_t$, $b_v$, $b_a$, and $b_t$ are the video classifier parameters; the remaining symbols are the slack (relaxation) factor and the balance factor $C$. When solving, $C$ can be selected by cross-validation.
  • The visual, audio, and text features of the video to be classified, and the corresponding visual, audio, and text feature reliability factors, are obtained with the same method used to extract these features and reliability factors from the video samples; the specific process is not repeated here.
  • Classifying the video to be classified with the video classifier comprises: calculating a decision value $y$ from the video classifier parameters $w_v$, $w_a$, $w_t$, $b_v$, $b_a$, and $b_t$ obtained above, where $x_v$, $x_a$, and $x_t$ denote the visual, audio, and text features of the video to be classified, and $r_v$, $r_a$, and $r_t$ denote its visual, audio, and text feature reliability factors. If $y > 0$, the network video sample is assigned to the first category; otherwise it is assigned to the second category.
  • Network videos and the text surrounding each network video may be collected to constitute the above video sample set, with harmful video as the first category and normal video as the second category.
  • FIG. 2 illustrates the operation of a video classification method in accordance with an embodiment of the present invention.
  • Video sample set 201 includes N video samples.
  • Each video sample may include a video and the text around it.
  • The video sample set can be collected from the network.
  • The video features may include a visual feature $x_v$, an audio feature $x_a$, and a text feature $x_t$.
  • Each video sample is given a label 203 corresponding to its category to indicate that it belongs to the first category or the second category. For example, a person can judge whether a video is harmful and then assign the label to the video sample. Alternatively, existing sets of harmful video samples and normal video samples can be used and labeled in batches.
  • The reliability factor 204 is calculated in the manner described above. A reliability evaluation related to visual quality is performed on the visual information to obtain the visual feature reliability factor $r_v$; a reliability evaluation related to audio quality is performed on the audio information to obtain the audio feature reliability factor $r_a$; and a reliability evaluation related to the total number of words and the average number of words per sentence is performed on the text information to obtain the text feature reliability factor $r_t$.
  • Video classifier 206 is obtained using weighted support vector machine algorithm 205 based on video feature set 202, label 203 of each video sample, and video feature reliability factor 204 of each video sample.
  • The video features ($x_v$, $x_a$, $x_t$) and reliability factors ($r_v$, $r_a$, $r_t$) of the video to be classified are computed in the same way that features are extracted and reliability factors are calculated for each video sample, and the video is then classified by video classifier 206.
  • Although the present invention has been described above for network video classification, it is not limited to network video and can be applied to the classification of any video that includes visual, audio, and text information.
  • The invention is also not limited to the identification of harmful video; it can be applied to identify various videos containing specific features.
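
The classification step described above computes a value y and assigns the first category when y > 0, but the expression for y did not survive in this text. The Python sketch below therefore rests on an assumption: that y is the reliability-weighted sum over modalities, y = r_v(w_v·x_v + b_v) + r_a(w_a·x_a + b_a) + r_t(w_t·x_t + b_t). The function names, the dictionary layout, and every numeric value are hypothetical and serve only to show how a low reliability factor damps one modality's contribution.

```python
from typing import Dict, List, Tuple

MODALITIES = ("visual", "audio", "text")

# Per-modality parameters: (weight vector w_m, bias b_m).
Params = Dict[str, Tuple[List[float], float]]


def fused_score(params: Params,
                features: Dict[str, List[float]],
                reliability: Dict[str, float]) -> float:
    """ASSUMED form: y = sum over modalities m of r_m * (w_m . x_m + b_m)."""
    y = 0.0
    for m in MODALITIES:
        w, b = params[m]
        x = features[m]
        if len(w) != len(x):
            raise ValueError(f"dimension mismatch for modality {m!r}")
        y += reliability[m] * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return y


def classify(params: Params,
             features: Dict[str, List[float]],
             reliability: Dict[str, float]) -> str:
    """Rule from the text: y > 0 gives the first category, otherwise the second."""
    return "first" if fused_score(params, features, reliability) > 0 else "second"


# Hypothetical parameters and inputs, chosen only to illustrate the rule.
params = {"visual": ([1.0, -1.0], 0.0), "audio": ([0.5], -0.25), "text": ([2.0], 0.0)}
features = {"visual": [2.0, 1.0], "audio": [1.0], "text": [-1.0]}

# Text is barely trusted (r_t = 0.25), so its negative vote is damped.
print(classify(params, features, {"visual": 1.0, "audio": 0.5, "text": 0.25}))  # first
# With fully trusted text (r_t = 1.0) the same features flip the decision.
print(classify(params, features, {"visual": 1.0, "audio": 0.5, "text": 1.0}))   # second
```

Under this assumed form, a modality whose reliability factor is 0 drops out of the decision entirely, which is one way to read the adaptive information fusion that the text attributes to the classifier.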

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a video classifier construction method with consideration of characteristic reliability, comprising: extracting video characteristics of each video sample in a video sample set to obtain a video characteristic set; granting a label to each video sample to indicate whether the video sample belongs to a first category or a second category; carrying out reliability evaluation on each video sample to obtain a reliability factor of each video sample; according to the video characteristic set, the label of each video sample and the reliability factor of each video sample, utilizing a weighted support vector machine algorithm to obtain a video classifier. The video classifier construction method is applicable to services of Internet harmful video filtration and video supervision so as to maintain the safety of content on the Internet.

Description

Video classifier construction method considering feature reliability
Technical Field
The present invention relates to the field of computer application technologies, and in particular to a video classifier construction method that considers feature reliability.
Background Art
With the rapid development of Internet technology, multimedia applications of all kinds keep emerging; digital libraries, distance education, video on demand, digital video broadcasting, interactive television, and similar services all generate and use large amounts of multimedia data. Without leaving home, people can learn, look up information, and enjoy all kinds of entertainment over the Internet. However, alongside information that is useful for work, study, and daily life, the openness of the Internet also lets a great deal of harmful information spread across the network. Harmful information on the Internet has had a serious impact on society, and its adverse effects on minors in particular are frequently reported in the press. The harm that objectionable Internet content does to society has drawn increasingly wide attention around the world.
Video websites have proliferated in recent years. In 2006, YouTube, the largest video website outside China, was acquired by Google for $1.65 billion, and that year has been called the first year of online video. Over the same period a large number of video websites appeared in China, such as Youku, Tudou, Ku6, and 56.com, and well-known Chinese portals and search engines launched video sites of their own. The number of online videos has grown explosively: more and more people upload videos to share with others, and more people search for videos that interest them. However, the Internet also carries a large volume of unhealthy video; violent, horror, and pornographic videos in particular are harmful to children's development. Such videos need to be identified effectively so that they can be controlled on the basis of the recognition results.
Existing techniques for identifying harmful network video fall into two main classes. (1) Recognition methods based on single-modal features. These methods mainly extract the visual features of a video and construct a classifier from them. In violent video recognition, for example, common features are video motion vectors, colors, textures, and shapes. (2) Recognition methods based on multimodal feature fusion. These methods extract features from several modalities of a video and fuse them to construct a classifier. In violent video recognition, for example, many methods extract audio features in addition to visual features, including short-time energy and sudden sounds. Some methods also consider the text surrounding a network video and extract further features from that text for fused recognition. Extensive research and practice show that recognition methods based on multimodal feature fusion outperform those based on single-modal features. Network video data, however, is usually complex. Across the three modalities of text, vision, and audio, some videos are surrounded by rich text while others have very little; some videos have high visual quality while others have very low quality; some videos have very clear audio while others are very noisy. Features extracted from a poor-quality modality are unreliable and usually fail to reflect the true characteristics of the video. Current recognition methods based on multimodal feature fusion do not consider feature reliability, and as a result cannot achieve accurate and reliable video recognition and classification.
Summary of the Invention
In view of this, the main object of the present invention is to provide a video classifier construction method that considers feature reliability.
To achieve the above object, one aspect of the present invention provides a video classifier construction method that considers video feature reliability, comprising: extracting the video features of each video sample in a video sample set to obtain a video feature set; assigning each video sample a label indicating whether it belongs to a first category or a second category; performing a reliability evaluation on each video sample to obtain a reliability factor for that sample; and obtaining a video classifier with a weighted support vector machine algorithm from the video feature set, the label of each video sample, and the reliability factor of each video sample.
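The paragraph above names a weighted support vector machine, but this part of the text does not give its optimization problem. The Python sketch below is therefore an illustration under stated assumptions, not the patent's formulation: it assumes a linear model per modality, a fused score of the form sum over modalities of r_m(w_m·x_m + b_m), and a standard soft-margin objective 0.5·Σ‖w_m‖² + C·Σ hinge loss, minimized here by plain batch subgradient descent. All names, the toy data, the learning rate, and the epoch count are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

MODALITIES = ("visual", "audio", "text")
# One sample: (features per modality, reliability per modality, label in {+1, -1}).
Sample = Tuple[Dict[str, List[float]], Dict[str, float], int]


def score(w, b, features, reliability) -> float:
    """ASSUMED fused score: sum over modalities of r_m * (w_m . x_m + b_m)."""
    total = 0.0
    for m in MODALITIES:
        linear = sum(wi * xi for wi, xi in zip(w[m], features[m])) + b[m]
        total += reliability[m] * linear
    return total


def train(samples: Sequence[Sample], C: float = 1.0, lr: float = 0.1, epochs: int = 50):
    """Batch subgradient descent on 0.5 * sum_m ||w_m||^2 + C * sum_i hinge_i."""
    first_features = samples[0][0]
    w = {m: [0.0] * len(first_features[m]) for m in MODALITIES}
    b = {m: 0.0 for m in MODALITIES}
    for _ in range(epochs):
        # Start from the regulariser's gradient (w itself); biases are not regularised.
        grad_w = {m: list(w[m]) for m in MODALITIES}
        grad_b = {m: 0.0 for m in MODALITIES}
        for features, reliability, label in samples:
            if label * score(w, b, features, reliability) < 1.0:  # margin violated
                for m in MODALITIES:
                    r = reliability[m]
                    for j, xj in enumerate(features[m]):
                        grad_w[m][j] -= C * label * r * xj
                    grad_b[m] -= C * label * r
        for m in MODALITIES:
            w[m] = [wj - lr * gj for wj, gj in zip(w[m], grad_w[m])]
            b[m] -= lr * grad_b[m]
    return w, b


# Toy data (hypothetical): label +1 stands for the first category, -1 for the second.
samples: List[Sample] = [
    ({"visual": [1.0], "audio": [1.0], "text": [1.0]},
     {"visual": 1.0, "audio": 0.5, "text": 0.25}, +1),
    ({"visual": [-1.0], "audio": [-1.0], "text": [-1.0]},
     {"visual": 1.0, "audio": 0.5, "text": 0.25}, -1),
    ({"visual": [2.0], "audio": [1.0], "text": [1.0]},
     {"visual": 0.5, "audio": 1.0, "text": 1.0}, +1),
    ({"visual": [-2.0], "audio": [-1.0], "text": [-1.0]},
     {"visual": 0.5, "audio": 1.0, "text": 1.0}, -1),
]

w, b = train(samples)
for features, reliability, label in samples:
    print(label, score(w, b, features, reliability) > 0)
```

In this sketch a sample's reliability factors scale both its contribution to the score and its pull on each modality's parameters, so an unreliable modality influences training less. A production solver (for example a quadratic-programming SVM trainer) and cross-validated selection of C would replace the fixed-step loop used here.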
Optionally, each video sample includes a video and the text surrounding that video.
Optionally, the video features include visual features, audio features, and text features.
Optionally, performing the reliability evaluation on each video sample includes evaluating the reliability of the sample's visual information, audio information, and text information separately.
Optionally, the reliability factor includes: a visual feature reliability factor, obtained by evaluating the reliability of the visual information; an audio feature reliability factor, obtained by evaluating the reliability of the audio information; and a text feature reliability factor, obtained by evaluating the reliability of the text information.
Optionally, the first category is harmful video and the second category is normal video.
Optionally, evaluating the reliability of the visual information of each video sample includes: evaluating the visual information of each video sample with a no-reference objective video quality assessment method to obtain an evaluation value; determining the maximum evaluation value of the visual information over all video samples; and dividing the evaluation value of each video sample's visual information by the maximum evaluation value to obtain the visual feature reliability factor of that sample.
Optionally, the no-reference objective video quality assessment method includes a method based on the peak signal-to-noise ratio (PSNR) metric or a blockiness-based measurement algorithm.
Optionally, evaluating the reliability of the audio information of each video sample includes: evaluating the audio information of each video sample with an objective audio quality assessment method to obtain an evaluation value; determining the maximum evaluation value of the audio information over all video samples; and dividing the evaluation value of each video sample's audio information by the maximum evaluation value to obtain the audio feature reliability factor of that sample.
Optionally, the objective audio quality assessment method includes: a Bark spectral distortion measure, a normalized block measure, or a perceptual analysis measure.
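The max-normalization used for the visual and audio reliability factors can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `reliability_from_quality` is hypothetical, and it assumes the quality scores are non-negative with larger values meaning better quality, so that the resulting factors fall between 0 and 1.

```python
from typing import List, Sequence


def reliability_from_quality(scores: Sequence[float]) -> List[float]:
    """Divide each quality score by the largest score in the sample set.

    Assumption: scores are non-negative and "higher is better", so the
    best-quality sample gets a reliability factor of exactly 1.0.
    """
    if not scores:
        return []
    if any(s < 0 for s in scores):
        raise ValueError("quality scores are assumed to be non-negative")
    max_score = max(scores)
    if max_score == 0:
        raise ValueError("at least one quality score must be positive")
    return [s / max_score for s in scores]


# Hypothetical quality scores for three video samples.
visual_factors = reliability_from_quality([2.0, 4.0, 1.0])
```

The same helper would apply unchanged to audio quality scores, since the text describes the same divide-by-maximum step for both modalities.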
Optionally, evaluating the reliability of the text information of each video sample includes: counting the total number of words $J_1$ in the text and the average number of words per sentence $J_2$; and computing the text feature reliability factor $r_t$ as:
$$r_t = 0.5 \cdot \min(1, J_1/200) + 0.5 \cdot \min(1, J_2/20)$$
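As a rough sketch of the text reliability formula, the snippet below computes $r_t$ from $J_1$ and $J_2$. The helper `count_words` is a hypothetical stand-in for whatever tokenizer is actually used: it assumes whitespace-separated words and sentences ending in `.`, `!`, or `?`, which would not suit Chinese text without a word segmenter.

```python
import re
from typing import Tuple


def count_words(text: str) -> Tuple[int, float]:
    """Return (J1, J2): total words and average words per sentence.

    Assumption: whitespace tokenization and simple sentence punctuation.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0, 0.0
    total = sum(len(s.split()) for s in sentences)
    return total, total / len(sentences)


def text_reliability_factor(total_words: float, avg_words: float) -> float:
    """r_t = 0.5*min(1, J1/200) + 0.5*min(1, J2/20), as given in the text."""
    return 0.5 * min(1.0, total_words / 200) + 0.5 * min(1.0, avg_words / 20)


j1, j2 = count_words("one two three. four five.")
r_t = text_reliability_factor(j1, j2)
```

Both terms saturate at 1, so a text with at least 200 words and at least 20 words per sentence on average receives the maximum factor of 1.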
Optionally, obtaining the video classifier with the weighted support vector machine algorithm, based on the video feature set, the label of each video sample, and the reliability factors of each video sample, includes the following. The video feature set is written as $X = \{(x_{v1}, x_{a1}, x_{t1}), \ldots, (x_{vi}, x_{ai}, x_{ti}), \ldots, (x_{vN}, x_{aN}, x_{tN})\}$, where $x_{vi}$ is the visual feature of the $i$-th video sample, $x_{ai}$ is its audio feature, $x_{ti}$ is its text feature, and $N$ is the total number of video samples. The label of the $i$-th video sample is denoted $y_i$, with $y_i = 1$ when the sample belongs to the first category and $y_i = -1$ when it belongs to the second category. $r_{vi}$, $r_{ai}$, and $r_{ti}$ denote the visual, audio, and text feature reliability factors of the $i$-th video sample, and $s_i$ denotes $r_{vi} + r_{ai} + r_{ti}$. The video classifier is obtained by solving:
$$\min_{w_v, w_a, w_t, b_v, b_a, b_t, \xi} \left( \|w_v\|^2 + \|w_a\|^2 + \|w_t\|^2 \right) + C \sum_{i=1}^{N} \xi_i$$
$$\text{s.t. } \forall i:\ y_i \left[ \frac{r_{vi}}{s_i}\left(w_v^T x_{vi} + b_v\right) + \frac{r_{ai}}{s_i}\left(w_a^T x_{ai} + b_a\right) + \frac{r_{ti}}{s_i}\left(w_t^T x_{ti} + b_t\right) \right] \ge 1 - \xi_i,$$
$$\xi_i \ge 0,$$
where $w_v, w_a, w_t, b_v, b_a, b_t$ are the parameters of the video classifier, $\xi_i$ is the slack variable, and $C$ is the trade-off parameter, which is selected by cross-validation during solving.
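One way to see how such a problem could be solved is to fold the per-sample weights into the inputs: each sample becomes a stacked vector of the three modality features (each with a constant 1 appended for its bias) scaled by that modality's share of the total reliability, after which the constraint is an ordinary linear SVM margin. The sketch below minimizes the equivalent hinge-loss objective by plain subgradient descent. It is an illustration under stated assumptions, not the patent's solver: the function names, the learning rate, and the epoch count are hypothetical, squared norms are assumed in the objective, and every sample is assumed to have a positive total reliability.

```python
import numpy as np


def stack_weighted(xv, xa, xt, r):
    """Build one row per sample: [(r_v/s)(x_v, 1), (r_a/s)(x_a, 1), (r_t/s)(x_t, 1)]."""
    r = np.asarray(r, dtype=float)
    share = r / r.sum(axis=1, keepdims=True)  # assumes each row sum s_i > 0
    blocks = []
    for k, x in enumerate((xv, xa, xt)):
        x = np.asarray(x, dtype=float)
        x1 = np.hstack([x, np.ones((x.shape[0], 1))])  # constant input for the bias
        blocks.append(share[:, [k]] * x1)
    return np.hstack(blocks)


def train_weighted_svm(xv, xa, xt, r, y, C=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on ||w_v||^2 + ||w_a||^2 + ||w_t||^2 + C * sum(hinge)."""
    z = stack_weighted(xv, xa, xt, r)
    y = np.asarray(y, dtype=float)
    dims = [np.asarray(x).shape[1] for x in (xv, xa, xt)]
    # 1 marks weight entries (regularized), 0 marks bias entries (not regularized).
    mask = np.concatenate([np.r_[np.ones(d), 0.0] for d in dims])
    theta = np.zeros(z.shape[1])  # layout: [w_v, b_v, w_a, b_a, w_t, b_t]
    for _ in range(epochs):
        active = y * (z @ theta) < 1.0  # samples violating the margin
        grad = 2.0 * mask * theta - C * (z[active].T @ y[active])
        theta -= lr * grad
    return theta


def decision_values(theta, xv, xa, xt, r):
    """Reliability-weighted score; positive means first category."""
    return stack_weighted(xv, xa, xt, r) @ theta
```

A quadratic programming solver would give the exact optimum; subgradient descent is used here only because it keeps the sketch short and dependency-free beyond NumPy.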
Optionally, the method according to an embodiment of the invention further includes: extracting visual, audio, and text features from a video to be classified and obtaining the corresponding visual, audio, and text feature reliability factors; and, using the video classifier parameters $w_v, w_a, w_t, b_v, b_a, b_t$, computing
$$s = r_v + r_a + r_t$$
$$y = \frac{r_v}{s}\left(w_v^T x_v + b_v\right) + \frac{r_a}{s}\left(w_a^T x_a + b_a\right) + \frac{r_t}{s}\left(w_t^T x_t + b_t\right)$$
where $x_v$, $x_a$, and $x_t$ are the visual, audio, and text features of the video to be classified, and $r_v$, $r_a$, and $r_t$ are its visual, audio, and text feature reliability factors. If $y > 0$, the video is assigned to the first category; otherwise it is assigned to the second category.
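The decision rule can be written out directly. The following is a small sketch in plain Python; the function name `classify` and the toy parameter values are hypothetical, and it assumes at least one reliability factor is positive so that the sum is nonzero.

```python
from typing import Sequence, Tuple


def _dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def classify(xv, xa, xt, rv, ra, rt, wv, wa, wt, bv, ba, bt) -> Tuple[float, int]:
    """Return (y, category): category 1 if y > 0, otherwise category 2."""
    s = rv + ra + rt
    if s <= 0:
        raise ValueError("at least one reliability factor must be positive")
    y = ((rv / s) * (_dot(wv, xv) + bv)
         + (ra / s) * (_dot(wa, xa) + ba)
         + (rt / s) * (_dot(wt, xt) + bt))
    return y, (1 if y > 0 else 2)


# Toy parameters: the visual score is +2 and the audio score is -1.
params = dict(wv=[1.0], wa=[1.0], wt=[0.0], bv=0.0, ba=0.0, bt=0.0)
both = classify([2.0], [-1.0], [0.0], rv=1.0, ra=1.0, rt=0.0, **params)
audio_only = classify([2.0], [-1.0], [0.0], rv=0.0, ra=1.0, rt=0.0, **params)
```

With equal visual and audio reliability the two scores are averaged and the video lands in the first category; when the visual reliability drops to zero, only the audio score counts and the decision flips. This is the adaptive fusion that the reliability factors are meant to provide.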
As the above technical solution shows, the invention has the following advantages:
1. The invention provides a method for constructing a video classifier that takes feature reliability into account and can classify videos accurately and reliably, for example to identify harmful videos on the Internet. The invention analyzes the reliability of the extracted features based on the characteristics of each online video sample and incorporates these reliability factors into the construction of the harmful-video classifier. Online video samples are complex. Across the three modalities of text, vision, and audio: some videos have rich surrounding text, while others have very little; some have high visual quality, while others are poor and noisy; some have very clear audio, while others are severely distorted. These factors clearly affect the reliability of the extracted features. No existing method for constructing harmful online video classifiers based on multimodal feature fusion takes these practical factors into account. The invention computes the reliability of each modality's features from the characteristics of that modality's own information, so the resulting classifier fits the characteristics of online video better than classifiers built with existing methods.
2. The weighted support vector machine algorithm proposed by the invention effectively incorporates the three feature reliability factors of each online video sample, so that when the trained classifier classifies a sample it fuses information adaptively according to that sample's three reliability factors, which is more reasonable.
Brief Description of the Drawings
FIG. 1 is a flowchart of a method for constructing a video classifier that takes feature reliability into account, according to an embodiment of the invention; and
FIG. 2 shows the working process of a video classification method according to an embodiment of the invention.
Detailed Description
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
The invention was implemented on a Pentium 4 computer with a 3.0 GHz CPU and 2 GB of memory. A program for constructing a harmful online video classifier was written in C++, realizing the classifier construction method of the invention. The invention may also be implemented in other computing environments, which are not described further here.
FIG. 1 is a flowchart of the method for constructing a video classifier that takes feature reliability into account. The steps are as follows:
In step 101, video features are extracted from each video sample in the video sample set to obtain a video feature set. Optionally, each video sample includes a video and the text surrounding the video. A computer may be used to collect online videos together with the text surrounding each of them to form an online video sample set. The video sample set may also be provided in other ways.
According to an embodiment of the invention, the video features may include visual features, audio features, and text features. Which features are chosen depends mainly on the specific category of video. Violent video is used below as an example. For visual features, the features extracted are mainly those that reflect violent content, such as motion vectors, color, texture, and shape. For audio features, the features extracted are mainly those related to violence, such as short-time energy, zero-crossing rate, and pitch period. For text features, conventional text feature extraction methods are mainly used, such as document frequency, information gain, and mutual information.
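Two of the audio features named above, short-time energy and zero-crossing rate, are simple enough to sketch. The snippet below uses one common definition of each (sum of squared samples per frame, and the fraction of adjacent sample pairs whose sign differs); the source does not specify frame length, hop size, or the exact definitions, so those choices and the function names are assumptions.

```python
import numpy as np


def frame_signal(x, frame_len: int, hop: int) -> np.ndarray:
    """Split a 1-D signal into frames of frame_len samples, hop samples apart."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])


def short_time_energy(frames: np.ndarray) -> np.ndarray:
    """Sum of squared samples in each frame."""
    return (frames ** 2).sum(axis=1)


def zero_crossing_rate(frames: np.ndarray) -> np.ndarray:
    """Fraction of adjacent sample pairs in each frame whose sign differs."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return (np.abs(np.diff(signs, axis=1)) > 0).mean(axis=1)


# A toy signal: one rapidly alternating frame followed by one constant frame.
frames = frame_signal([1, -1, 1, -1, 1, 1, 1, 1], frame_len=4, hop=4)
energy = short_time_energy(frames)
zcr = zero_crossing_rate(frames)
```

In the toy signal both frames carry the same energy, but only the alternating frame has a high zero-crossing rate, which is why the two features are typically used together.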
In step 102, each video sample is assigned a label corresponding to its category, indicating whether the sample belongs to the first category or the second category. For example, the first category may be the harmful category (for example, containing violent content) and the second category may be the normal category. According to an embodiment of the invention, a person may judge whether each video is harmful and label the video samples accordingly. Alternatively, an existing set of harmful video samples and an existing set of normal video samples may be used and labeled in batch.
In step 103, a reliability evaluation is performed on each video sample to obtain reliability factors for the sample. A reliability factor indicates how reliable a video feature is when it is used to classify the video. According to an embodiment of the invention, the reliability factors include: a visual feature reliability factor, obtained by evaluating the reliability of the visual information; an audio feature reliability factor, obtained by evaluating the reliability of the audio information; and a text feature reliability factor, obtained by evaluating the reliability of the text information.
In step 104, a video classifier is obtained with a weighted support vector machine algorithm, based on the video feature set, the label of each video sample, and the reliability factors of each video sample.
Optionally, the method may further include: extracting visual, audio, and text features from a video to be classified and obtaining the corresponding visual, audio, and text feature reliability factors; and using the video classifier to classify the video into the first category or the second category.
Note that the step numbers above are for illustration only and do not limit the order in which the steps are performed. Without departing from the spirit and scope of the invention, the order of the steps may be changed, a single step may be split into several steps, several steps may be combined into one, or part of one step may be combined with another step or part of another step and performed as a single step. The invention expressly contemplates these cases and includes them within its scope.
According to an embodiment of the invention, optionally, in step 103, evaluating the reliability of the visual information of each video sample includes: evaluating the visual information of each video sample with a no-reference objective video quality assessment method to obtain an evaluation value; determining the maximum evaluation value of the visual information over all video samples; and dividing the evaluation value of each video sample's visual information by the maximum evaluation value to obtain the visual feature reliability factor of that sample. The visual feature reliability factor lies between 0 and 1, and a larger value indicates more reliable visual features.
Optionally, the no-reference objective video quality assessment method includes a method based on the peak signal-to-noise ratio (PSNR) metric or a blockiness-based measurement algorithm.
Optionally, evaluating the reliability of the audio information of each video sample includes: evaluating the audio information of each video sample with an objective audio quality assessment method to obtain an evaluation value; determining the maximum evaluation value of the audio information over all video samples; and dividing the evaluation value of each video sample's audio information by the maximum evaluation value to obtain the audio feature reliability factor of that sample. The audio feature reliability factor lies between 0 and 1, and a larger value indicates more reliable audio features.
Optionally, the objective audio quality assessment method includes: a Bark spectral distortion measure, a normalized block measure, or a perceptual analysis measure.
Optionally, evaluating the reliability of the text information of each video sample includes: counting the total number of words $J_1$ in the text and the average number of words per sentence $J_2$; and computing the text feature reliability factor as $r_t = 0.5 \cdot \min(1, J_1/200) + 0.5 \cdot \min(1, J_2/20)$. This reliability factor lies between 0 and 1, and a larger value indicates more reliable text features.
According to an embodiment of the invention, optionally, in step 104, obtaining the video classifier with the weighted support vector machine algorithm, based on the video feature set, the label of each video sample, and the reliability factors of each video sample, includes the following. The video feature set is written as $X = \{(x_{v1}, x_{a1}, x_{t1}), \ldots, (x_{vi}, x_{ai}, x_{ti}), \ldots, (x_{vN}, x_{aN}, x_{tN})\}$, where $x_{vi}$ is the visual feature of the $i$-th video sample, $x_{ai}$ is its audio feature, $x_{ti}$ is its text feature, and $N$ is the total number of video samples. The label of the $i$-th video sample is denoted $y_i$, with $y_i = 1$ when the sample belongs to the first category and $y_i = -1$ when it belongs to the second category. $r_{vi}$, $r_{ai}$, and $r_{ti}$ denote the visual, audio, and text feature reliability factors of the $i$-th video sample, and $s_i$ denotes $r_{vi} + r_{ai} + r_{ti}$. The video classifier is obtained by solving:
$$\min_{w_v, w_a, w_t, b_v, b_a, b_t, \xi} \left( \|w_v\|^2 + \|w_a\|^2 + \|w_t\|^2 \right) + C \sum_{i=1}^{N} \xi_i$$
$$\text{s.t. } \forall i:\ y_i \left[ \frac{r_{vi}}{s_i}\left(w_v^T x_{vi} + b_v\right) + \frac{r_{ai}}{s_i}\left(w_a^T x_{ai} + b_a\right) + \frac{r_{ti}}{s_i}\left(w_t^T x_{ti} + b_t\right) \right] \ge 1 - \xi_i,$$
$$\xi_i \ge 0,$$
where $w_v, w_a, w_t, b_v, b_a, b_t$ are the parameters of the video classifier, $\xi_i$ is the slack variable, and $C$ is the trade-off parameter, which may be selected by cross-validation during solving.
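Selecting the trade-off parameter by cross-validation could look roughly like the sketch below. The source does not say how many folds or which candidate values are used, so `k`, the candidate list, and the `fit_and_score` callback are all hypothetical. The callback stands for "train the classifier with this C on the training indices and return its accuracy on the held-out indices".

```python
import random
from typing import Callable, List, Sequence


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_C(n_samples: int,
             candidates: Sequence[float],
             fit_and_score: Callable[[float, List[int], List[int]], float],
             k: int = 5,
             seed: int = 0) -> float:
    """Return the candidate C with the highest mean held-out score."""
    folds = k_fold_indices(n_samples, k, seed)
    best_C, best_score = candidates[0], float("-inf")
    for C in candidates:
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(n_samples) if i not in held]
            scores.append(fit_and_score(C, train, held_out))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_C, best_score = C, mean_score
    return best_C


# Stand-in scorer for illustration only: pretends accuracy peaks at C = 1.
def toy_scorer(C, train_idx, test_idx):
    return 1.0 - abs(C - 1.0)


chosen = select_C(10, [0.1, 1.0, 10.0], toy_scorer, k=5)
```

In practice `fit_and_score` would wrap the weighted SVM training and the decision rule, and the candidate values are often spaced logarithmically.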
According to an embodiment of the invention, optionally, the visual, audio, and text features of the video to be classified, and the corresponding visual, audio, and text feature reliability factors, are obtained in a manner similar to that described above for the video samples. The details are not repeated here.
According to an embodiment of the invention, optionally, classifying the video to be classified with the video classifier includes: using the video classifier parameters $w_v, w_a, w_t, b_v, b_a, b_t$ computed above, computing
$$s = r_v + r_a + r_t$$
$$y = \frac{r_v}{s}\left(w_v^T x_v + b_v\right) + \frac{r_a}{s}\left(w_a^T x_a + b_a\right) + \frac{r_t}{s}\left(w_t^T x_t + b_t\right)$$
where $x_v$, $x_a$, and $x_t$ are the visual, audio, and text features of the video to be classified, and $r_v$, $r_a$, and $r_t$ are its visual, audio, and text feature reliability factors. If $y > 0$, the video is assigned to the first category; otherwise it is assigned to the second category.
When an embodiment of the invention is applied to identifying harmful online video, online videos and the text surrounding each of them may be collected to form the video sample set, with the first category being harmful video and the second category being normal video.
FIG. 2 shows the working process of a video classification method according to an embodiment of the invention. As shown in FIG. 2, the video sample set 201 contains $N$ video samples. According to an embodiment of the invention, each video sample may include a video and the text surrounding the video. When the method is applied to identifying harmful online video, the video sample set may be collected from the Internet. Video features are extracted from each video sample $i$ ($i = 1, 2, \ldots, N$) to obtain the video feature set 202. According to an embodiment of the invention, the video features may include visual features $x_{vi}$, audio features $x_{ai}$, and text features $x_{ti}$.
Each video sample is assigned a label 203 corresponding to its category, indicating whether it belongs to the first category or the second category. For example, a person may judge whether each video is harmful and label the video samples one by one. Alternatively, an existing set of harmful video samples and an existing set of normal video samples may be used and labeled in batch.
For each video sample $i$, the reliability factors 204 are computed as described above. The visual information is evaluated for reliability in terms of visual quality to obtain the visual feature reliability factor $r_{vi}$; the audio information is evaluated for reliability in terms of audio quality to obtain the audio feature reliability factor $r_{ai}$; and the text information is evaluated for reliability in terms of the total number of words and the average number of words per sentence to obtain the text feature reliability factor $r_{ti}$.
Based on the video feature set 202, the label 203 of each video sample, and the reliability factors 204 of each video sample, the video classifier 206 is obtained with the weighted support vector machine algorithm 205.
For a video to be classified, its video features $(x_v, x_a, x_t)$ and reliability factors $(r_v, r_a, r_t)$ are computed in the same way as for the video samples, and the video classifier 206 is used to classify it.
Although the invention has been described above for online video classification, it is not limited to online video and may be applied to classifying any video that contains visual, audio, and text information. Nor is the invention limited to identifying harmful video; it may be applied to identifying any video with particular characteristics.
The above are only specific embodiments of the invention, and the scope of protection of the invention is not limited to them. Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention falls within the scope of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims

1. A method for constructing a video classifier that takes video feature reliability into account, comprising:
extracting video features from each video sample in a video sample set to obtain a video feature set;
assigning a label to each video sample to indicate whether the video sample belongs to a first category or a second category;
performing a reliability evaluation on each video sample to obtain reliability factors for the video sample; and
obtaining a video classifier with a weighted support vector machine algorithm, based on the video feature set, the label of each video sample, and the reliability factors of each video sample.
2. The method of claim 1, wherein each video sample includes a video and the text surrounding the video.
3. The method of claim 2, wherein the video features include visual features, audio features, and text features.
4. The method of claim 3, wherein performing the reliability evaluation on each video sample includes separately evaluating the reliability of the visual information, the audio information, and the text information of each sample.
5. The method of claim 4, wherein the reliability factors include:
a visual feature reliability factor, obtained by evaluating the reliability of the visual information;
an audio feature reliability factor, obtained by evaluating the reliability of the audio information; and
a text feature reliability factor, obtained by evaluating the reliability of the text information.
6. The method of claim 1, wherein the first category is harmful video and the second category is normal video.
7. The method of claim 5, wherein evaluating the reliability of the visual information of each video sample includes:
evaluating the visual information of each video sample with a no-reference objective video quality assessment method to obtain an evaluation value;
determining the maximum evaluation value of the visual information over all video samples; and
dividing the evaluation value of the visual information of each video sample by the maximum evaluation value to obtain the visual feature reliability factor of each video sample.
8. The method of claim 7, wherein the no-reference objective video quality assessment method includes a method based on the peak signal-to-noise ratio (PSNR) metric or a blockiness-based measurement algorithm.
9. The method of claim 5, wherein evaluating the reliability of the audio information of each video sample includes:
evaluating the audio information of each video sample with an objective audio quality assessment method to obtain an evaluation value;
determining the maximum evaluation value of the audio information over all video samples; and
dividing the evaluation value of the audio information of each video sample by the maximum evaluation value to obtain the audio feature reliability factor of each video sample.
10. The method of claim 9, wherein the objective audio quality assessment method includes: a Bark spectral distortion measure, a normalized block measure, or a perceptual analysis measure.
11. The method of claim 5, wherein evaluating the reliability of the text information of each video sample includes:
counting the total number of words $J_1$ in the text and the average number of words per sentence $J_2$; and
computing the text feature reliability factor $r_t$ as:
$$r_t = 0.5 \cdot \min(1, J_1/200) + 0.5 \cdot \min(1, J_2/20)$$
12. The method of claim 5, wherein obtaining the video classifier with the weighted support vector machine algorithm, based on the video feature set, the label of each video sample, and the reliability factors of each video sample, includes:
writing the video feature set as $X = \{(x_{v1}, x_{a1}, x_{t1}), \ldots, (x_{vi}, x_{ai}, x_{ti}), \ldots, (x_{vN}, x_{aN}, x_{tN})\}$, where $x_{vi}$ is the visual feature of the $i$-th video sample, $x_{ai}$ is the audio feature of the $i$-th video sample, $x_{ti}$ is the text feature of the $i$-th video sample, and $N$ is the total number of video samples;
denoting the label of the $i$-th video sample by $y_i$, with $y_i = 1$ when the $i$-th video sample belongs to the first category and $y_i = -1$ when it belongs to the second category;
denoting by $r_{vi}$ the visual feature reliability factor, by $r_{ai}$ the audio feature reliability factor, and by $r_{ti}$ the text feature reliability factor of the $i$-th video sample; and
denoting $r_{vi} + r_{ai} + r_{ti}$ by $s_i$, and obtaining the parameters of the video classifier by solving:
$$\min_{w_v, w_a, w_t, b_v, b_a, b_t, \xi} \left( \|w_v\|^2 + \|w_a\|^2 + \|w_t\|^2 \right) + C \sum_{i=1}^{N} \xi_i$$
$$\text{s.t. } \forall i:\ y_i \left[ \frac{r_{vi}}{s_i}\left(w_v^T x_{vi} + b_v\right) + \frac{r_{ai}}{s_i}\left(w_a^T x_{ai} + b_a\right) + \frac{r_{ti}}{s_i}\left(w_t^T x_{ti} + b_t\right) \right] \ge 1 - \xi_i,$$
$$\xi_i \ge 0,$$
where $w_v, w_a, w_t, b_v, b_a, b_t$ are the parameters of the video classifier, $\xi_i$ is the slack variable, and $C$ is the trade-off parameter, which is selected by cross-validation during solving.
13. The method of claim 12, further comprising:
extracting visual features, audio features, and text features from a video to be classified and obtaining the corresponding visual feature reliability factor, audio feature reliability factor, and text feature reliability factor; and
using the video classifier parameters $w_v, w_a, w_t, b_v, b_a, b_t$, computing
$$s = r_v + r_a + r_t$$
$$y = \frac{r_v}{s}\left(w_v^T x_v + b_v\right) + \frac{r_a}{s}\left(w_a^T x_a + b_a\right) + \frac{r_t}{s}\left(w_t^T x_t + b_t\right)$$
where $x_v$ is the visual feature, $x_a$ the audio feature, and $x_t$ the text feature of the video to be classified, and $r_v$ is the visual feature reliability factor, $r_a$ the audio feature reliability factor, and $r_t$ the text feature reliability factor of the video to be classified; if $y > 0$, the online video sample is determined to belong to the first category, and otherwise to the second category.
PCT/CN2013/076757 2013-06-05 2013-06-05 Video classifier construction method with consideration of characteristic reliability WO2014194481A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2013/076757 WO2014194481A1 (en) 2013-06-05 2013-06-05 Video classifier construction method with consideration of characteristic reliability

Publications (1)

Publication Number Publication Date
WO2014194481A1 (en) 2014-12-11

Family

ID=52007410

Country Status (1)

Country Link
WO (1) WO2014194481A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008016903A (en) * 2006-07-03 2008-01-24 Nippon Telegr & Teleph Corp <Ntt> Motion vector reliability measurement method, moving frame determination method, moving picture coding method, apparatuses for them, and programs for them and recording medium thereof
CN101990093A (en) * 2009-08-06 2011-03-23 索尼株式会社 Method and device for detecting replay section in video
CN102508923A (en) * 2011-11-22 2012-06-20 北京大学 Automatic video annotation method based on automatic classification and keyword marking
CN102509084A (en) * 2011-11-18 2012-06-20 中国科学院自动化研究所 Multi-examples-learning-based method for identifying horror video scene
CN103294811A (en) * 2013-06-05 2013-09-11 中国科学院自动化研究所 Visual classifier construction method with consideration of characteristic reliability

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169061A (en) * 2017-05-02 2017-09-15 广东工业大学 A kind of text multi-tag sorting technique for merging double information sources
CN107169061B (en) * 2017-05-02 2020-12-11 广东工业大学 Text multi-label classification method fusing double information sources
GB2608803A (en) * 2021-07-09 2023-01-18 Milestone Systems As A video processing apparatus, method and computer program
GB2608803B (en) * 2021-07-09 2023-11-08 Milestone Systems As A video processing apparatus, method and computer program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 13886612; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 13886612; Country of ref document: EP; Kind code of ref document: A1