CN112507164B - Barrage filtering method, device and storage medium based on content and user ID - Google Patents


Info

Publication number
CN112507164B
CN112507164B (application CN202011417368.XA)
Authority
CN
China
Prior art keywords
bullet screen
user
text
word
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011417368.XA
Other languages
Chinese (zh)
Other versions
CN112507164A (en)
Inventor
吴渝
李芊
王利
于磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011417368.XA priority Critical patent/CN112507164B/en
Publication of CN112507164A publication Critical patent/CN112507164A/en
Application granted granted Critical
Publication of CN112507164B publication Critical patent/CN112507164B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a bullet screen filtering method and device based on content and user identification. The method includes: preprocessing the bullet screen data and user data crawled from bullet screen video websites by a python crawler; expanding the bullet screen short texts with a short text representation method driven jointly by word embeddings, word similarity, word-topic probability, and label-topic probability; constructing user platform features; and concatenating the expanded text features with the platform features as input to a classification model, which outputs the bullet screen classification result. The invention combines the advantages of external corpus expansion and expansion from the short text's own content features, introduces word vectors into feature expansion to maximize the semantic expansion of the original text, and adds user platform features to the bullet screen feature space, enriching the feature space and improving the bullet screen recognition rate.

Figure 202011417368


Description

Barrage Filtering Method, Device and Storage Medium Based on Content and User ID

Technical Field

The present invention relates to the technical field of video bullet screens, and in particular to a bullet screen filtering method, device and storage medium based on content and user identification.

Background

In recent years, bullet screen (danmaku) video sharing websites (such as Station A, Station B, and Station C) have developed rapidly and attracted a large audience of young viewers. These websites rely on the high interactivity and freedom of bullet screens to attract users and to increase view counts and reach. However, platform management and moderation of bullet screen content is often inadequate, so vulgar, violent, and negative comments appear in many videos, seriously degrading the viewing experience. Moreover, because the audience of bullet screens is mainly adolescents, non-standard, violent, or vulgar bullet screen language can adversely affect their language habits, values, and mental development.

After preprocessing, a bullet screen text retains very few feature words and is a typical ultra-short text. This feature sparsity makes traditional text classification methods perform poorly on bullet screens. To address the sparsity of short text features, researchers at home and abroad use feature expansion, mainly either expansion via an external corpus or expansion based on the content features of the short text itself. Each approach has strengths and weaknesses: external corpus expansion depends heavily on corpus quality, is computationally expensive, and is strongly text-dependent, while expansion from the short text's own content mainly mines its internal semantic features and is prone to overfitting. A new text classification method can therefore be proposed that combines the advantages of both.

Summary of the Invention

The present invention provides a bullet screen filtering method and device based on content and user identification. It combines external corpus expansion with expansion based on the content features of the short text itself, introduces word vectors into feature expansion, and proposes a new text classification method to solve the problem of sparse bullet screen text features on bullet screen video websites.

The present invention is realized through the following technical solutions:

Because traditional text classification methods perform poorly on bullet screens, and because external corpus expansion alone or expansion from the short text's own content features each has shortcomings, the present invention combines the advantages of both methods, introduces word vectors into feature expansion, and proposes a bullet screen content classification method based on content and user identification to accomplish bullet screen filtering, comprising steps S1-S4:

S1. Use a python crawler to collect the bullet screen data and user data of a bullet screen video website, clean the collected data, and label the bullet screen data as normal bullet screens or bad bullet screens; the bullet screen data includes the bullet screen short text, i.e., the textual content of the bullet screen, and the user data includes the user's gender, number of fans, number of followed accounts, and user level;

S2. Expand the labeled bullet screen short texts from step S1 and optimize the textual feature representation of the expanded short texts to obtain expanded text features;

S3. Construct user platform features: analyze the user data and, on the basis of its original features, construct the new user reputation level feature and user identity credibility feature;

S4. Divide the labeled bullet screen data from step S1 into a training set and a test set, train an SVM model with five-fold cross-validation to construct a bullet screen content classification model, concatenate the expanded text features from step S2 with the user platform features from step S3 as input to the model, output the bullet screen classification result, and finally filter the bullet screens according to that result.

Further, the labeling of the bullet screen data in step S1 proceeds as follows: according to its content, a bullet screen containing uncivil language such as violence, threats, or pornography, meaningless single characters, or nothing but emoticons is labeled 1, and all other bullet screens are labeled 0, where 0 denotes a normal bullet screen and 1 a bad bullet screen.
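As a hedged illustration of this labeling rule, the sketch below uses a hypothetical keyword list and a crude symbols-only check (the patent does not publish its lexicon, and `BAD_WORDS` is a placeholder):

```python
import re

# Hypothetical lexicon of uncivil terms; stand-in for the patent's actual list.
BAD_WORDS = {"violence_term", "threat_term", "porn_term"}

# Crude "nothing but emoticons/symbols" check: no word characters at all.
SYMBOLS_ONLY_RE = re.compile(r"^[\W_]+$")

def label_danmaku(text: str) -> int:
    """Return 1 for a bad bullet screen, 0 for a normal one (step S1)."""
    stripped = text.strip()
    if len(stripped) <= 1:                       # meaningless single character
        return 1
    if SYMBOLS_ONLY_RE.match(stripped):          # nothing but emoticons/symbols
        return 1
    if any(w in stripped for w in BAD_WORDS):    # uncivil language
        return 1
    return 0
```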

Further, step S2 specifically includes steps S21-S24:

S21. Pre-train a Word2Vec model on an external corpus;

S22. Construct the optimal feature space and the label-topic feature space;

S23. Based on the constructed optimal feature space and label-topic feature space, use the pre-trained Word2Vec model to expand the qualifying words in the bullet screen short text, obtaining the expanded short text;

S24. Improve the text representation by introducing an expansion impact factor into the expanded short text to express how strongly each expansion word influences the bullet screen short text, maximizing the semantic expansion of the original bullet screen text and obtaining the expanded text features.

Further, the external corpus for training the Word2Vec model comes from the comment data below videos on the bullet screen video website and from the bullet screen data in the videos. The corpus content has domain similarity with the bullet screen data to be classified, which ensures a higher word coverage rate than the commonly used Wikipedia and Sogou News word vectors.

Further, constructing the optimal feature space and the label-topic feature space specifically includes steps S221-S223:

S221. Use the chi-square test to extract feature words with category tendency from the bullet screen short texts and construct the optimal feature space;

S222. Adopt an aggregation strategy: merge all bullet screen short texts under each label into a long text, then feed the long text of each label into an LDA topic model for training;

S223. Obtain the text-topic probability matrix from the LDA topic model, giving the probability of each label under each topic, and select the top n topics with the highest probability under each label to construct the label-topic feature space.
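The chi-square scoring of step S221 can be sketched from a 2×2 term/class contingency table; the counts below are toy values, not from the patent's corpus:

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one term/class pair from a 2x2 table:
    a: class docs containing the term, b: other docs containing the term,
    c: class docs without the term,   d: other docs without the term."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

def select_features(term_tables, top_k):
    """Keep the top_k terms with the highest chi-square score
    (the 'optimal feature space' of step S221)."""
    return sorted(term_tables,
                  key=lambda t: chi_square(*term_tables[t]),
                  reverse=True)[:top_k]
```

A term evenly spread over both classes scores 0 and is discarded; a term concentrated in one class scores high and enters the optimal feature space.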

Further, expanding the qualifying words specifically includes the following steps:

S231. Based on the constructed label-topic feature space, obtain the topic-term probability matrix from the LDA topic model and select the top n terms with the highest probability under each topic to form a topic-term file;

S232. Traverse the words in the bullet screen short text; if a word belongs to the optimal feature space, compute its maximum-probability topic from the topic-term distribution matrix;

S233. Check against the topic-term file whether the word is a topic term of that topic. If not, the feature word carries no strong topic information and expanding it would easily introduce irrelevant feature words, so the word is not expanded;

S234. If it is, further check whether the maximum-probability topic belongs to the label-topic feature space. If it does, use the Word2Vec model to add the top k most similar words to the bullet screen short text as expansion words; if it does not, the maximum-probability topic has no label discriminability and the word is not expanded.
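The decision chain of S232-S234 can be sketched as follows; all lookup tables here are toy stand-ins for the trained LDA topic-term matrix and the Word2Vec similarity lists:

```python
def expand_word(word, optimal_features, topic_word_prob, topic_terms,
                label_topics, similar, k=2):
    """Return the expansion words for `word` per steps S232-S234,
    or an empty list when any condition fails."""
    if word not in optimal_features:                 # S232: must be a selected feature word
        return []
    # S232: maximum-probability topic from the topic-term distribution
    topic = max(topic_word_prob,
                key=lambda t: topic_word_prob[t].get(word, 0.0))
    if word not in topic_terms[topic]:               # S233: not a topic term -> no expansion
        return []
    if topic not in label_topics:                    # S234: topic not label-discriminative
        return []
    return similar[word][:k]                         # S234: top-k most similar words
```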

Further, the text representation is improved through the combined action of word embeddings, word similarity, word-topic probability, and label-topic probability. Specifically, the short text vector is constructed by directly adding the bullet screen short text vector and the expanded short text vectors:

C(d) = Σ_i C(w_i) + Σ_i Σ_j C(w_{i,j})

C(w_{i,j}) = sim(w_i, w_{i,j}) * P(w_i, topic_m) * P(topic_m, class) * D(w_{i,j})

Here, C(d) denotes the word-vector-based representation of short text d; w_i is the i-th word in d; C(w_i) is the word vector of w_i; w_{i,j} is the j-th expansion word of the i-th word; C(w_{i,j}) is the final weighted vector of the expansion word; D(w_{i,j}) is the word vector of the j-th expansion word of the i-th word; sim(w_i, w_{i,j}) is the semantic similarity between w_i and its j-th expansion word; P(w_i, topic_m) is the probability of w_i in its maximum topic topic_m; and P(topic_m, class) is the probability that label class generates the maximum topic topic_m.
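The weighting above can be sketched with plain lists as vectors (the numbers in the test are toy values; the real model uses Word2Vec embeddings and LDA probabilities):

```python
def scale(vec, s):
    """Multiply a vector by a scalar."""
    return [s * x for x in vec]

def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

def expanded_text_vector(word_vecs, expansions):
    """C(d): sum of the original word vectors plus the weighted expansion vectors.
    expansions[i] is a list of (sim, p_word_topic, p_topic_class, vec) tuples
    for the expansion words of the i-th word."""
    dim = len(word_vecs[0])
    c_d = [0.0] * dim
    for i, w_vec in enumerate(word_vecs):
        c_d = add(c_d, w_vec)                    # C(w_i)
        for sim, p_wt, p_tc, d_vec in expansions[i]:
            # C(w_ij) = sim(w_i, w_ij) * P(w_i, topic_m) * P(topic_m, class) * D(w_ij)
            c_d = add(c_d, scale(d_vec, sim * p_wt * p_tc))
    return c_d
```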

Further, the user reputation level is calculated as follows:

The user reputation level I_credit-rating is calculated from the user's historical bullet screen posting behavior with the following formula:

[Formula image BDA0002820568220000032: I_credit-rating computed from N_total and N_bad]

where N_total is the total number of bullet screens the user has posted and N_bad is the number of bad bullet screens posted; the user reputation level is updated periodically.

The user identity credibility is calculated as follows:

The user identity credibility I_identity-credibility is obtained from the user's platform level and whether the user is a VIP, with the following formula:

[Formula image BDA0002820568220000033: I_identity-credibility computed from I_level and I_vip]

where I_level is the normalized value of the user's platform level and I_vip indicates whether the user is a VIP.
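The formula images above are not reproduced in this text, so the exact expressions are unknown. The sketch below assumes plausible forms (reputation as the share of non-bad posts, identity credibility as an equal-weight blend of normalized level and VIP flag); both are assumptions and should be replaced by the patent's actual formulas:

```python
def credit_rating(n_total: int, n_bad: int) -> float:
    """Assumed form of I_credit-rating: share of posts that were not bad."""
    if n_total == 0:
        return 1.0   # assumption: a user with no history starts at full credit
    return 1.0 - n_bad / n_total

def identity_credibility(level: int, max_level: int, is_vip: bool) -> float:
    """Assumed form of I_identity-credibility: average of the normalized
    platform level I_level and the VIP indicator I_vip."""
    i_level = level / max_level
    i_vip = 1.0 if is_vip else 0.0
    return (i_level + i_vip) / 2.0
```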

In addition, the present invention proposes a bullet screen filtering device based on content and user identification. The device supports the above bullet screen filtering method and includes a data preprocessing module, a text expansion module, a user platform feature construction module, and a classification module, wherein:

Data preprocessing module: cleans the missing data in the bullet screen data and user data crawled from the bullet screen video website with a python crawler, and labels the bullet screen data as normal or bad bullet screens; the bullet screen data includes the bullet screen short text, i.e., the textual content of the bullet screen.

Text expansion module: constructs the optimal feature space and the label-topic feature space to expand the labeled bullet screen short texts, and improves the text representation to optimize the feature representation of the expanded short texts, obtaining the expanded text features.

User platform feature construction module: constructs user platform features by analyzing the user data and, on the basis of its original features, building the new user reputation level feature and user identity credibility feature.

Classification module: divides the labeled bullet screen data set into a training set and a test set, trains an SVM model with five-fold cross-validation to construct the classification model, concatenates the expanded text features with the user platform features as input to the classification model, and outputs the bullet screen classification result.
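The classification module's feature concatenation and five-fold cross-validation can be sketched with scikit-learn; the features passed in here are synthetic stand-ins for the expanded text features and user platform features described above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def build_classifier(text_features, platform_features, labels):
    """Concatenate expanded text features with user platform features,
    score an SVM with five-fold cross-validation, then fit the final model."""
    X = np.hstack([text_features, platform_features])
    clf = SVC(kernel="rbf")
    scores = cross_val_score(clf, X, labels, cv=5)   # five-fold CV
    clf.fit(X, labels)                               # final model on all data
    return clf, scores.mean()
```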

A computer-readable storage medium on which a computer program is stored, the computer program, when executed, implementing the above bullet screen filtering method based on content and user identification.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

The bullet screen filtering method and device based on content and user identification combines the advantages of external corpus expansion and expansion from the short text's own content features, introduces word vectors into feature expansion, and proposes an improved text expansion method. The method assigns different weights to expansion words of different topic granularity and different degrees of semantic complementarity, maximizing the semantic expansion of the original text. It further proposes a text representation in which word similarity, word-topic probability, label-topic probability, and word embeddings act together; the improved representation learns richer semantic information and optimizes the feature representation of the text. In the user platform feature construction module, two new features, user reputation level and user identity credibility, are constructed. Bullet screen texts are short and carry little information, so recognition based on text content alone has a single dimension; adding user features to the bullet screen feature space further increases the information dimension, enriches the feature space, improves the bullet screen recognition rate, and optimizes the bullet screen classification algorithm.

Description of the Drawings

The accompanying drawings described here are provided for further understanding of the embodiments of the present invention and constitute a part of this application; they do not limit the embodiments of the present invention. In the drawings:

Fig. 1 is a schematic flow chart of the method of the present invention;

Fig. 2 is a schematic diagram of the text expansion method;

Fig. 3 is a flow chart of constructing the optimal feature space and the label-topic feature space;

Fig. 4 shows the text expansion method of one embodiment.

Detailed Description

To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the embodiments and the accompanying drawings. The exemplary embodiments and their description are only used to explain the present invention and are not intended to limit it.

In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. It will be apparent, however, to one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known structures, circuits, materials, or methods are not described in detail to avoid obscuring the present invention.

Throughout this specification, a reference to "one embodiment", "an embodiment", "one example", or "an example" means that a particular feature, structure, or characteristic described in connection with that embodiment or example is included in at least one embodiment of the present invention. Thus, appearances of these phrases in various places throughout this specification do not necessarily all refer to the same embodiment or example. Furthermore, particular features, structures, or characteristics may be combined in any suitable combination and/or sub-combination in one or more embodiments or examples. Those of ordinary skill in the art will also appreciate that the drawings provided here are for illustrative purposes and are not necessarily drawn to scale. As used here, the term "and/or" includes any and all combinations of one or more of the associated listed items.

In the description of the present invention, it should be understood that orientation or position terms such as "front", "rear", "left", "right", "upper", "lower", "vertical", "horizontal", "high", "low", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings; they are only for convenience and simplicity of description and do not indicate or imply that the referenced devices or elements must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the scope of the present invention.

Embodiment 1

Because traditional text classification methods perform poorly on bullet screens, and because external corpus expansion alone or expansion from the short text's own content features each has shortcomings, the present invention combines the advantages of both methods, introduces word vectors into feature expansion, and proposes a bullet screen filtering method based on content and user identification. The bullet screen video websites mentioned below include, but are not limited to, bullet screen video sharing websites such as Station A, Station B, and Station C. As shown in Fig. 1, the overall flow of the bullet screen filtering method based on content and user identification includes steps S1-S4:

S1. Use a python crawler to collect the bullet screen data and user data of a bullet screen video website, clean the collected data, and label the bullet screen data as normal bullet screens or bad bullet screens; the bullet screen data includes the bullet screen short text, i.e., the textual content of the bullet screen, and the user data includes the user's gender, number of fans, number of followed accounts, and user level;

S2. Expand the labeled bullet screen short texts from step S1 and optimize the textual feature representation of the expanded short texts to obtain expanded text features;

S3. Construct user platform features: analyze the user data and, on the basis of its original features, construct the new user reputation level feature and user identity credibility feature;

S4. Divide the labeled bullet screen data from step S1 into a training set and a test set, train an SVM model with five-fold cross-validation to construct a bullet screen content classification model, concatenate the expanded text features from step S2 with the user platform features from step S3 as input to the model, output the bullet screen classification result, and finally filter the bullet screens according to that result.

Specifically, the labeling of the bullet screen data in step S1 proceeds as follows: according to its content, a bullet screen containing uncivil language such as violence, threats, or pornography, meaningless single characters, or nothing but emoticons is labeled 1, and all other bullet screens are labeled 0, where 0 denotes a normal bullet screen and 1 a bad bullet screen.

Specifically, to maximize the semantic expansion of the original bullet screen short text, the features of the bullet screen short text are expanded as shown in Fig. 2, including steps S21-S24:

S21. Pre-train a Word2Vec model on an external corpus;

S22. Construct the optimal feature space and the label-topic feature space;

S23. Using the constructed optimal feature space and label-topic feature space, expand the qualifying words in the short bullet screen text with the pre-trained Word2Vec model to obtain the expanded short text;

S24. Improve the text representation method: introduce an expansion influence factor into the expanded short text to express how strongly each expansion word affects the original short text, realizing the semantic expansion of the original bullet screen text to the greatest extent and yielding the extended text features.

The external corpus used to train the Word2Vec model comes from the comments below videos on the bullet screen video website and from the bullet screens within those videos. Jieba is used to segment the corpus and remove stop words, and the corpus is kept as large as possible. A word-vector dictionary is built from the trained and optimized Word2Vec model. Because the external corpus is from the same domain as the bullet screen data to be classified, its word coverage is higher than that of the commonly used Wikipedia and Sogou News word vectors.
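To illustrate how such a word-vector dictionary supports similarity lookups (in the patent this dictionary would come from the trained Word2Vec model, e.g. via gensim), here is a toy sketch; the vectors and vocabulary are entirely made up:

```python
from math import sqrt

# Toy stand-in for the word-vector dictionary built from the trained model.
WORD_VECS = {
    "violence": [0.9, 0.1, 0.0],
    "threat":   [0.8, 0.2, 0.1],
    "funny":    [0.0, 0.9, 0.4],
    "cute":     [0.1, 0.8, 0.5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def most_similar(word, k=1):
    """Top-k most similar dictionary words, as a Word2Vec model would return."""
    v = WORD_VECS[word]
    scored = [(w, cosine(v, u)) for w, u in WORD_VECS.items() if w != word]
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:k]
```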

Specifically, constructing the optimal feature space and the label-topic feature space assigns different weights to expansion words of different topic granularity and different degrees of semantic complementarity, as shown in Figure 3, in steps S221-S223:

S221. Extract category-indicative feature words from the short bullet screen texts with the chi-square test and construct the optimal feature space;

S222. Using an aggregation strategy, merge all short bullet screen texts under each label into one long text, then feed the long text of each label into an LDA topic model for training;

S223. Obtain the text-topic probability matrix from the LDA topic model, derive each label's probability under every topic, and select the n most probable topics per label to construct the label-topic feature space.
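Step S221's chi-square scoring can be sketched for a single candidate word; the 2x2 contingency counts below are assumed values, not data from the patent:

```python
def chi_square(a, b, c, d):
    """chi^2 for a 2x2 table: a = bad docs containing the word, b = normal docs
    containing it, c = bad docs without it, d = normal docs without it."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

# A word concentrated in the bad class scores far higher than one spread evenly
skewed = chi_square(30, 5, 20, 45)   # assumed counts for a category-indicative word
even = chi_square(10, 10, 40, 40)    # assumed counts for a neutral word
```

Words whose score clears a threshold would enter the optimal feature space.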

In practice, not every word of the collected bullet screen data needs to be expanded. Only words that qualify against the optimal feature space and label-topic feature space constructed above are expanded, which improves classification efficiency. As shown in Figure 4, the procedure comprises the following steps:

S231. From the constructed label-topic feature space, obtain the topic-term probability matrix with the LDA topic model and select the n most probable terms under each topic to form a topic-term file;

S232. Traverse the words of the short bullet screen text; if a word belongs to the optimal feature space, compute its most probable topic from the topic-term distribution matrix;

S233. Check against the topic-term file whether the word is a topic term of that topic. If not, the word carries no strong topic information and expanding it would easily introduce irrelevant feature words, so it is not expanded;

S234. If it is, further check whether the most probable topic belongs to the label-topic feature space. If so, use the Word2Vec model to add the k most similar words to the short bullet screen text as expansion words; if not, the topic is not label-discriminative and the word is not expanded.
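The decision procedure of steps S231-S234 can be sketched as follows; the feature space, topic assignments, topic-term file, label-topic space, and expansion lists are all toy stand-ins for the real artifacts:

```python
OPTIMAL_FEATURES = {"violence", "funny"}                          # chi-square space (S221)
MAX_PROB_TOPIC = {"violence": "t_abuse", "funny": "t_humor"}      # word -> argmax topic (S232)
TOPIC_TERMS = {"t_abuse": {"violence", "threat"}, "t_humor": {"lol"}}  # topic-term file (S231)
LABEL_TOPICS = {"t_abuse"}                                        # label-topic feature space
EXPANSIONS = {"violence": ["threat", "attack"]}                   # Word2Vec top-k stand-in

def expand(words, k=2):
    out = list(words)
    for w in words:
        if w not in OPTIMAL_FEATURES:            # S232: must be in the optimal space
            continue
        topic = MAX_PROB_TOPIC[w]
        if w not in TOPIC_TERMS.get(topic, ()):  # S233: must be a topic term of its topic
            continue
        if topic not in LABEL_TOPICS:            # S234: topic must be label-discriminative
            continue
        out.extend(EXPANSIONS.get(w, [])[:k])    # add the top-k similar words
    return out
```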

Specifically, the text representation method is improved by the joint action of word embedding, word similarity, word-topic probability, and label-topic probability. The improved representation learns richer levels of semantic information and optimizes the text's feature representation. Concretely, the short-text vector is constructed by directly adding the vectors of the short bullet screen text and the expanded short text:

C(d) = Σ_i C(w_i) + Σ_i Σ_j C(w_{i,j})

C(w_{i,j}) = sim(w_i, w_{i,j}) · P(w_i, topic_m) · P(topic_m, class) · D(w_{i,j})

where C(d) is the word-vector-based representation of the short text d; w_i is the i-th word in d; C(w_i) is the word vector representation of w_i; w_{i,j} is the j-th expansion word of the i-th word; C(w_{i,j}) is the final weighted vector representation of the expansion word; D(w_{i,j}) is the word vector representation of the j-th expansion word of the i-th word; sim(w_i, w_{i,j}) is the semantic similarity between w_i and its j-th expansion word; P(w_i, topic_m) is the probability of w_i in its most probable topic topic_m; and P(topic_m, class) is the probability that the label class generates the topic topic_m.
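A numerical sketch of this composition, with made-up vectors and probabilities, might look like:

```python
def weighted_expansion_vec(D, sim, p_word_topic, p_topic_class):
    """C(w_ij) = sim(w_i, w_ij) * P(w_i, topic_m) * P(topic_m, class) * D(w_ij)."""
    w = sim * p_word_topic * p_topic_class
    return [w * x for x in D]

def short_text_vec(word_vecs, expansion_vecs):
    """C(d): element-wise sum of the original word vectors and the weighted
    expansion-word vectors (the direct-addition composition described above)."""
    dim = len(word_vecs[0])
    total = [0.0] * dim
    for v in word_vecs + expansion_vecs:
        for i in range(dim):
            total[i] += v[i]
    return total
```

This way a weakly related expansion word (low similarity or low topic probability) contributes only a small increment to C(d).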

Because bullet screen texts are short and carry little information, recognition based on text content alone has a single dimension. Adding user features to the bullet screen feature space further increases the information dimensions, enriches the feature space, and improves the recognition rate. The user data are analyzed and two new features, user reputation level and user identity credibility, are constructed on the basis of the original user features. The user reputation level is computed as follows:

The user reputation level I_credit-rating is computed from the user's history of published bullet screens, with the following formula:

(equation image: formula for I_credit-rating in terms of N_total and N_bad)

where N_total is the total number of bullet screens the user has published and N_bad is the number of bad ones; the user reputation level is updated periodically.
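The reputation formula itself sits in an equation image above; a natural reading of the surrounding text is a good-behavior ratio, so the sketch below assumes I_credit-rating = (N_total - N_bad) / N_total, which is an assumption rather than the patent's exact formula:

```python
def credit_rating(n_total: int, n_bad: int) -> float:
    """Assumed form: share of a user's bullet screens that were not bad."""
    if n_total == 0:
        return 1.0   # no history yet: treat as fully reputable (assumption)
    return (n_total - n_bad) / n_total
```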

The user identity credibility is computed as follows:

The user identity credibility I_identity-credibility is obtained from the user's platform level and VIP status, with the following formula:

(equation image: formula for I_identity-credibility in terms of I_level and I_vip)

where I_level is the user's normalized platform level and I_vip indicates whether the user is a VIP.
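The credibility formula is likewise an equation image; the sketch below assumes a simple convex combination of the normalized level and the VIP flag, with an illustrative weight alpha that is not taken from the patent:

```python
def identity_credibility(i_level: float, is_vip: bool, alpha: float = 0.7) -> float:
    """Assumed form: weighted blend of normalized platform level and VIP flag;
    alpha is an illustrative weight, not a value from the patent."""
    i_vip = 1.0 if is_vip else 0.0
    return alpha * i_level + (1.0 - alpha) * i_vip
```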

Embodiment 2

A specific embodiment of the present invention further provides a bullet screen filtering device based on content and user identification, comprising a data preprocessing module, a text expansion module, a user platform class feature construction module, and a classification module, wherein:

Data preprocessing module: cleans missing entries from the bullet screen data and user data crawled from the bullet screen video website with a Python crawler, and labels the bullet screen data as normal or bad;

Text expansion module: constructs the optimal feature space and the label-topic feature space to expand the labeled short bullet screen texts, and applies the improved text representation method to optimize the feature representation of the expanded texts, yielding the extended text features;

User platform class feature construction module: constructs user platform class features by analyzing the user data and newly building the user reputation level feature and the user identity credibility feature on the basis of the original user features;

Classification module: divides the labeled bullet screen data set into a training set and a test set, trains an SVM model with five-fold cross-validation to construct the classification model, concatenates the extended text features with the user platform class features as input to the model, outputs the bullet screen classification result, and filters the bullet screens accordingly.

This device supports the content- and user-identification-based bullet screen filtering method described in Embodiment 1, which is not repeated here.

Embodiment 3

A specific embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed, implements the content- and user-identification-based bullet screen filtering method described in Embodiment 1.

Those skilled in the art will further appreciate that the modules and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate this interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of functionality. Whether these functions are executed in hardware or software depends on the particular application and design constraints of the technical solution. Skilled artisans may implement the described functionality differently for each particular application, but such implementations should not be considered beyond the scope of the present invention.

The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

It can be understood that the method provided by the present invention combines the advantages of external-corpus expansion and of expansion from the short text's own content features, and introduces word vectors into feature expansion, yielding an improved text expansion method. The method assigns different weights to expansion words of different topic granularity and different degrees of semantic complementarity, expanding the semantics of the original text to the greatest extent. A text representation method is also proposed in which word similarity, word-topic probability, label-topic probability, and word embedding act jointly; the improved representation learns richer semantic information and optimizes the text's feature representation. In the user platform class feature construction module, two new features, user reputation level and user identity credibility, are built. Since bullet screen texts are short, carry little information, and offer a single recognition dimension when only content is used, adding user features to the feature space further increases the information dimensions, enriches the bullet screen feature space, raises the recognition rate, and optimizes the bullet screen classification algorithm.

The specific embodiments described above further detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its protection scope; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A bullet screen filtering method based on content and user identification is characterized by comprising the following steps:
s1, crawling bullet screen data and user data of a bullet screen video website by using Python crawler software, cleaning the crawled data, and marking the bullet screen data as common bullet screens or bad bullet screens, wherein the bullet screen data comprise short bullet screen texts;
s2, expanding the bullet screen short texts marked in step S1, and optimizing the text feature representation of the expanded short texts to obtain the extended text features;
s3, constructing user platform class features: analyzing the user data and newly constructing a user reputation level feature and a user identity credibility feature on the basis of the original features of the user data;
s4, dividing the bullet screen data marked in step S1 into a training set and a test set, training an SVM model with five-fold cross-validation to construct a bullet screen content classification model, concatenating the extended text features obtained in step S2 with the user platform features obtained in step S3 as input to the bullet screen content classification model, and outputting bullet screen classification results.
2. The bullet screen filtering method based on content and user identification according to claim 1, wherein marking the bullet screen data in step S1 comprises marking, according to the bullet screen content, content containing uncivil wording, meaningless single characters, or nothing but emoticons as 1, and all other bullet screen content as 0, where 0 represents a normal bullet screen and 1 represents a bad bullet screen.
3. The bullet screen filtering method based on the content and the user identifier as claimed in claim 1, wherein the step S2 specifically includes the following steps:
s21, pre-training a Word2Vec model according to an external corpus;
s22, constructing an optimal feature space and a label theme feature space;
s23, expanding words meeting conditions in the bullet screen short text based on a pre-trained Word2Vec model according to the constructed optimal feature space and the label theme feature space to obtain an expanded short text;
and S24, improving a text representation method, and introducing an expansion influence factor into the expanded short text to represent the influence degree of the expansion words on the bullet screen short text to obtain the characteristics of the expanded text.
4. The bullet screen filtering method based on content and user identification as claimed in claim 3, wherein the external corpus includes comment data below the bullet screen video website video and bullet screen data in the video.
5. The bullet screen filtering method based on the content and the user identification as claimed in claim 3, wherein the constructing of the optimal feature space and the tag topic feature space specifically comprises the following steps:
s221, extracting feature words with category tendentiousness in the bullet screen short text by using a chi-square test method, and constructing an optimal feature space;
s222, combining all the bullet screen short texts under each label into a long text by adopting an aggregation strategy, and inputting the long text under each label into an LDA topic model for training;
s223, obtaining a text-theme probability matrix by using the LDA theme model, obtaining the probability of each label under each theme, and selecting the first n themes with high probability under each label to construct a label theme feature space.
6. The bullet screen filtering method based on the content and the user identification as claimed in claim 3, wherein the step of expanding the qualified vocabulary specifically comprises the steps of:
s231, forming a subject word file according to the subjects in the constructed tag subject feature space;
s232, traversing the words in the bullet screen short text, and calculating the maximum probability theme of the words based on a theme-subject word distribution matrix if the words belong to the optimal feature space;
s233, checking whether the vocabulary belongs to the theme words of the corresponding theme according to the theme word file, if not, not expanding the vocabulary;
s234, if it does, checking whether the maximum probability topic of the word belongs to the label topic feature space; if so, adding the first k words with high similarity as expansion words into the bullet screen short text by using the Word2Vec model; if not, the maximum probability topic has no label discriminability and the vocabulary is not expanded.
7. The bullet screen filtering method based on the content and the user identification as claimed in claim 3, wherein the text representation method is improved by using the combined action of word embedding, word similarity, word and topic probability and tag topic probability, and the specific process comprises the following steps of constructing a short text vector by using a method of directly adding bullet short text and an extended short text vector:
C(d) = Σ_i C(w_i) + Σ_i Σ_j C(w_{i,j})
C(w_{i,j}) = sim(w_i, w_{i,j}) · P(w_i, topic_m) · P(topic_m, class) · D(w_{i,j})
wherein C(d) represents the word-vector-based representation of the short text d, w_i is the i-th word present in the short text d, C(w_i) is the word vector representation of the word w_i, w_{i,j} is the j-th expansion word of the i-th word, C(w_{i,j}) is the final weighted vector representation of the expansion word vector, D(w_{i,j}) is the word vector representation of the j-th expansion word of the i-th word, sim(w_i, w_{i,j}) represents the semantic similarity between w_i and its j-th expansion word, P(w_i, topic_m) represents the probability of w_i in its maximum topic topic_m, and P(topic_m, class) represents the probability that the class label generates the maximum topic topic_m.
8. The bullet screen filtering method based on the content and the user identification according to claim 1, wherein the user reputation level is calculated by the steps of:
calculating the user reputation level I_credit-rating according to the historical bullet screen behavior of the user, with the following formula:
(equation image: formula for I_credit-rating in terms of N_total and N_bad)
wherein N_total represents the total number of published bullet screens and N_bad represents the number of bad bullet screens published, the user reputation level being updated periodically;
the user identity credibility calculation steps are as follows:
obtaining the user identity credibility I_identity-credibility according to the user platform level and whether the user is a VIP, with the following formula:
(equation image: formula for I_identity-credibility in terms of I_level and I_vip)
wherein I_level represents the normalized platform level of the user and I_vip indicates whether the user is a VIP.
9. A bullet screen filtering device based on content and user identification, which supports the bullet screen filtering method based on content and user identification according to any one of claims 1-8, and comprises a data preprocessing module, a text extension module, a user platform class feature construction module and a classification module, wherein,
a data preprocessing module: the system is used for cleaning missing data of bullet screen data and user data of a bullet screen video website crawled by python crawler software, and marking the bullet screen data into a common bullet screen and a bad bullet screen;
a text extension module: the method is used for constructing an optimal feature space and a label subject feature space to expand the marked bullet screen short text, and the improved text representation method is used for carrying out text feature representation optimization on the expanded bullet screen short text to obtain the expanded text features;
the user platform class feature construction module: the system is used for constructing user platform class characteristics, analyzing user data and newly constructing user reputation grade characteristics and user identity credibility characteristics on the basis of user data original characteristics;
a classification module: the method is used for dividing the marked bullet screen data set into a training set and a testing set, constructing a classification model by utilizing a five-fold cross validation training SVM model, splicing and inputting the expanded text characteristics and the user platform type characteristics into the classification model, and outputting bullet screen classification results.
10. A computer-readable storage medium, on which a computer program is stored which, when executed, implements the method of any one of claims 1-8.
CN202011417368.XA 2020-12-07 2020-12-07 Barrage filtering method, device and storage medium based on content and user ID Active CN112507164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011417368.XA CN112507164B (en) 2020-12-07 2020-12-07 Barrage filtering method, device and storage medium based on content and user ID


Publications (2)

Publication Number Publication Date
CN112507164A (en) 2021-03-16
CN112507164B (en) 2022-04-12

Family

ID=74970884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011417368.XA Active CN112507164B (en) 2020-12-07 2020-12-07 Barrage filtering method, device and storage medium based on content and user ID

Country Status (1)

Country Link
CN (1) CN112507164B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114022744A (en) * 2021-11-04 2022-02-08 北京香侬慧语科技有限责任公司 Automatic illegal barrage detection method, device, system, medium and equipment

Citations (12)

Publication number Priority date Publication date Assignee Title
CN105956200A (en) * 2016-06-24 2016-09-21 武汉斗鱼网络科技有限公司 Filtration and conversion-based popup screen interception method and apparatus
CN106210770A (en) * 2016-07-11 2016-12-07 北京小米移动软件有限公司 A kind of method and apparatus showing barrage information
CN106897422A (en) * 2017-02-23 2017-06-27 百度在线网络技术(北京)有限公司 Text handling method, device and server
CN108650546A (en) * 2018-05-11 2018-10-12 武汉斗鱼网络科技有限公司 Barrage processing method, computer readable storage medium and electronic equipment
CN108763348A (en) * 2018-05-15 2018-11-06 南京邮电大学 A kind of classification improved method of extension short text word feature vector
CN108846431A (en) * 2018-06-05 2018-11-20 成都信息工程大学 Based on the video barrage sensibility classification method for improving Bayesian model
CN110517121A (en) * 2019-09-23 2019-11-29 重庆邮电大学 Commodity recommendation method and commodity recommendation device based on comment text sentiment analysis
CN111061866A (en) * 2019-08-20 2020-04-24 河北工程大学 Bullet screen text clustering method based on feature extension and T-oBTM
CN111163359A (en) * 2019-12-31 2020-05-15 腾讯科技(深圳)有限公司 Bullet screen generation method and device and computer readable storage medium
CN111614986A (en) * 2020-04-03 2020-09-01 威比网络科技(上海)有限公司 Bullet screen generation method, system, equipment and storage medium based on online education
CN111625718A (en) * 2020-05-19 2020-09-04 辽宁工程技术大学 User portrait construction method based on user search keyword data
CN111930943A (en) * 2020-08-12 2020-11-13 中国科学技术大学 Method and device for detecting pivot bullet screen

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US7752272B2 (en) * 2005-01-11 2010-07-06 Research In Motion Limited System and method for filter content pushed to client device
US10284806B2 (en) * 2017-01-04 2019-05-07 International Business Machines Corporation Barrage message processing


Non-Patent Citations (3)

Title
BoosTexter: A Boosting-based System for Text Categorization; Schapire R E; Machine Learning; 2000; pp. 135-168 *
Research on health question classification based on keyword word-vector feature expansion; Tang Xiaobo et al.; Data Analysis and Knowledge Discovery; 2020-07-25 (No. 07); pp. 66-75 *
Automatic construction of a spam bullet screen blocking dictionary based on seed words and data sets; Wang Ge et al.; Computer Engineering and Science; 2020-07-15 (No. 07); pp. 1302-1308 *

Also Published As

Publication number Publication date
CN112507164A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
Guo et al. Mixed graph neural network-based fake news detection for sustainable vehicular social networks
CN111079444B (en) Network rumor detection method based on multi-modal relationship
US9147154B2 (en) Classifying resources using a deep network
CN109446331B (en) Text emotion classification model establishing method and text emotion classification method
KR101605430B1 (en) System and method for constructing a questionnaire database, search system and method using the same
Van Hee et al. We usually don’t like going to the dentist: Using common sense to detect irony on twitter
CN110909529B (en) User emotion analysis and prejudgment system of company image promotion system
Ramakrishnan et al. Question answering via Bayesian inference on lexical relations
CN105183833A (en) User model based microblogging text recommendation method and recommendation apparatus thereof
Alvari et al. Less is more: Semi-supervised causal inference for detecting pathogenic users in social media
CN103646097B (en) A kind of suggestion target based on restriction relation and emotion word associating clustering method
Tang et al. Learning sentence representation for emotion classification on microblogs
CN111737427A (en) A MOOC forum post recommendation method integrating forum interaction behavior and user reading preference
CN110287314A (en) Long text credibility evaluation method and system based on unsupervised clustering
CN107590558A (en) A kind of microblogging forwarding Forecasting Methodology based on multilayer integrated study
CN112699831B (en) Video hotspot segment detection method and device based on barrage emotion and storage medium
Zhou et al. Automated hate speech detection and span extraction in underground hacking and extremist forums
Marathe et al. Approaches for mining youtube videos metadata in cyber bullying detection
Zhang et al. Rumor detection with hierarchical representation on bipartite ad hoc event trees
CN112507164B (en) Barrage filtering method, device and storage medium based on content and user ID
CN116049393A (en) A GCN-based aspect-level text sentiment classification method
Vila-López et al. Opinion leaders on sporting events for country branding
KR102344804B1 (en) Method for user feedback information management using AI-based monitoring technology
CN112507115B (en) Method and device for classifying emotion words in barrage text and storage medium
CN118349681A (en) Rumor Detection Method Based on Bidirectional Graph Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant