CN112507164B - Barrage filtering method, device and storage medium based on content and user ID
- Publication number
- CN112507164B (application number CN202011417368.XA)
- Authority
- CN
- China
- Prior art keywords
- bullet screen
- user
- text
- word
- topic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The invention discloses a bullet screen (barrage) filtering method and device based on content and user identification. The method includes: preprocessing the bullet screen data and user data crawled from bullet screen video websites with a Python crawler; expanding the bullet screen short texts with a short-text representation method in which word embeddings, word similarity, word-topic probability and label-topic probability act jointly; constructing user platform features; and concatenating the expanded text features with the platform features as input to a classification model that outputs the bullet screen classification result. The invention combines the advantages of expansion with an external corpus and expansion from the short text's own content features, introduces word vectors into the feature expansion to maximize the semantic expansion of the original text, and adds user platform features to the bullet screen feature space, enriching that space and improving the bullet screen recognition rate.
Description
Technical Field
The present invention relates to the technical field of video bullet screens (danmaku), and in particular to a bullet screen filtering method, device and storage medium based on content and user identification.
Background Art
In recent years, bullet screen (Bullet Subtitle) video sharing websites (such as Station A, Station B and Station C) have developed rapidly and attracted a large following among young viewers. These websites rely on the high interactivity and freedom of bullet screens to attract users and to increase view counts and reach. However, platforms do not manage and monitor bullet screen content adequately, so vulgar, violent and negative comments appear across videos and seriously harm the viewing experience. Because the audience of bullet screens consists mainly of adolescents, non-standard, violent and vulgar bullet screen language can adversely affect their language literacy, value orientation and mental development.
After preprocessing, a bullet screen text retains very few feature words and is a typical ultra-short text. This feature sparsity makes traditional text classification methods perform poorly on bullet screens. To address it, researchers at home and abroad use feature expansion, chiefly expansion with an external corpus or expansion from the content features of the short text itself. Each approach has strengths and weaknesses: external-corpus expansion depends heavily on corpus quality, is computationally expensive and strongly text-dependent, while expansion from the short text's own content mainly mines the text's own semantic features and is prone to overfitting. A new text classification method can therefore be built by combining the advantages of both.
Summary of the Invention
The present invention provides a bullet screen filtering method and device based on content and user identification. It combines external-corpus expansion with expansion from the short text's own content features, introduces word vectors into the feature expansion, and proposes a new text classification method that solves the problem of sparse bullet screen text features on bullet screen video websites.
The present invention is realized through the following technical solution:
Because traditional text classification methods classify bullet screens poorly, and because text expansion using an external corpus alone or using the short text's own content features alone each has shortcomings, the present invention combines the advantages of the two methods, introduces word vectors into the feature expansion, and proposes a bullet screen content classification method based on content and user identification to accomplish bullet screen filtering, comprising steps S1-S4:
S1. Use a Python crawler to crawl the bullet screen data and user data of a bullet screen video website, clean the crawled data, and label each bullet screen as normal or bad. The bullet screen data includes the bullet screen short text, i.e. the text content of the bullet screen; the user data includes the user's gender, number of fans, number of followings and user level.
S2. Expand the bullet screen short texts labeled in step S1 and optimize the text feature representation of the expanded short texts to obtain expanded text features.
S3. Construct user platform features: analyze the user data and, on the basis of its original features, construct new user credit rating and user identity credibility features.
S4. Divide the bullet screen data labeled in step S1 into a training set and a test set, train an SVM model with five-fold cross-validation to build a bullet screen content classification model, concatenate the expanded text features from step S2 with the user platform features from step S3 as input to the classification model, output the bullet screen classification result, and finally filter bullet screens according to that result.
Further, in step S1 the bullet screen data is labeled according to its content: bullet screens containing violent, threatening, pornographic or otherwise uncivil language, meaningless single characters, or nothing but emoji are labeled 1, and all other bullet screens are labeled 0, where 0 denotes a normal bullet screen and 1 denotes a bad bullet screen.
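To make this cleaning and labeling step concrete, the sketch below shows one possible implementation. The column names, file paths and the seed lexicon of uncivil words are assumptions for illustration; only the 0/1 rule itself comes from the description above.

```python
# Illustrative sketch of the cleaning and labeling rule of step S1
# (assumed column names, file paths and seed lexicon).
import re
import pandas as pd

def label_danmaku(text: str, bad_words: set) -> int:
    """Return 1 for a bad bullet screen, 0 for a normal one."""
    stripped = text.strip()
    if len(stripped) <= 1:                                # meaningless single character
        return 1
    if re.fullmatch(r"[\W_]+", stripped):                 # nothing but symbols / emoji
        return 1
    if any(word in stripped for word in bad_words):       # violent / threatening / pornographic terms
        return 1
    return 0

danmaku = pd.read_csv("danmaku_crawled.csv")              # hypothetical crawler output
users = pd.read_csv("users_crawled.csv").dropna()         # user gender, fans, followings, level
danmaku = danmaku.dropna(subset=["text", "user_id"])      # clean records with missing fields
bad_lexicon = {"placeholder_bad_word_1", "placeholder_bad_word_2"}   # assumed seed lexicon
danmaku["label"] = danmaku["text"].apply(label_danmaku, bad_words=bad_lexicon)
```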
Further, step S2 specifically includes steps S21-S24:
S21. Pre-train a Word2Vec model on an external corpus.
S22. Construct the optimal feature space and the label-topic feature space.
S23. Using the constructed optimal feature space and label-topic feature space, expand the qualifying words in each bullet screen short text with the pre-trained Word2Vec model to obtain an expanded short text.
S24. Improve the text representation method by introducing an expansion influence factor into the expanded short text to express how strongly each expansion word affects the bullet screen short text, maximizing the semantic expansion of the original bullet screen text and yielding the expanded text features.
Further, the external corpus used to train the Word2Vec model comes from the comment data below videos on the bullet screen video website and from the bullet screen data in the videos. Because the corpus belongs to the same domain as the bullet screen data to be classified, its word coverage is higher than that of the commonly used Wikipedia and Sogou News word vectors.
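A minimal sketch of this pre-training step is shown below, assuming a plain-text corpus file, Jieba segmentation and a hypothetical stop-word list; the vector size and window are illustrative choices rather than values fixed by the description.

```python
# Pre-train Word2Vec on the domain corpus (video comments + bullet screens), step S21.
import jieba
from gensim.models import Word2Vec

def load_sentences(path: str, stopwords: set):
    """Segment each line with Jieba and drop stop words."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = [w for w in jieba.lcut(line.strip()) if w and w not in stopwords]
            if tokens:
                yield tokens

stopwords = set(open("stopwords.txt", encoding="utf-8").read().split())   # assumed stop-word list
sentences = list(load_sentences("external_corpus.txt", stopwords))        # assumed corpus file
w2v = Word2Vec(sentences, vector_size=200, window=5, min_count=2, workers=4)
w2v.save("danmaku_w2v.model")
```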
Further, constructing the optimal feature space and the label-topic feature space specifically includes steps S221-S223:
S221. Use the chi-square test to extract feature words with class tendency from the bullet screen short texts and build the optimal feature space.
S222. Using an aggregation strategy, merge all bullet screen short texts under each label into one long text, and feed the long text of each label into an LDA topic model for training.
S223. Obtain the text-topic probability matrix from the LDA topic model, i.e. the probability of each label under every topic, and select the top n topics with the highest probability under each label to build the label-topic feature space.
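The sketch below illustrates steps S221-S223 with scikit-learn, assuming whitespace-joined segmented texts and illustrative values for the number of topics and the top-n cut-off; none of these hyper-parameters are fixed by the description.

```python
# Build the optimal feature space (chi-square) and the label-topic feature space (LDA).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from sklearn.decomposition import LatentDirichletAllocation

def build_spaces(texts, labels, n_features=1000, n_topics=20, top_n=3):
    # texts: segmented bullet screens joined by spaces; labels: 0/1 array
    vec = CountVectorizer()
    X = vec.fit_transform(texts)
    scores, _ = chi2(X, labels)                                   # class tendency of each word (S221)
    vocab = np.array(vec.get_feature_names_out())
    optimal_space = set(vocab[np.argsort(scores)[::-1][:n_features]])

    # S222: merge all short texts under each label into one long document.
    long_docs = [" ".join(t for t, y in zip(texts, labels) if y == c) for c in (0, 1)]
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(vec.transform(long_docs))       # label-topic probabilities (S223)
    label_topic_space = {c: set(np.argsort(doc_topic[i])[::-1][:top_n])
                         for i, c in enumerate((0, 1))}
    return optimal_space, label_topic_space, lda, vec
```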
Further, expanding the qualifying words specifically includes the following steps:
S231. Using the constructed label-topic feature space, obtain the topic-topic-word probability matrix from the LDA topic model and, for every topic, select the top n topic words with the highest probability to form a topic-word file.
S232. Traverse the words of the bullet screen short text; when a word belongs to the optimal feature space, compute from the topic-topic-word distribution matrix the topic to which that feature word belongs with maximum probability.
S233. Check in the topic-word file whether the word is a topic word of that topic. If it is not, the feature word carries no strong topic information and expanding it would easily introduce irrelevant feature words, so the word is not expanded.
S234. If it is, check further whether the maximum-probability topic belongs to the label-topic feature space. If it does, use the Word2Vec model to add the top k most similar words to the bullet screen short text as expansion words; if it does not, the maximum-probability topic is not label-discriminative and the word is not expanded.
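A condensed sketch of this per-word decision procedure follows. The data structures (topic-word matrix, topic-word file, label-topic space) are assumed to come from the previous steps, and k is an illustrative choice.

```python
# Expand one bullet screen short text according to steps S231-S234.
import numpy as np

def expand_short_text(tokens, optimal_space, topic_word_mat, word_index,
                      topic_words, label_topics, w2v, k=3):
    """topic_word_mat: (n_topics, vocab_size) word probabilities per topic;
    word_index: word -> column; topic_words: topic -> set of its top-n topic words;
    label_topics: set of topic ids forming the label-topic feature space."""
    expanded = []
    for w in tokens:
        if w not in optimal_space or w not in word_index:
            continue
        topic = int(np.argmax(topic_word_mat[:, word_index[w]]))  # max-probability topic of w (S232)
        if w not in topic_words.get(topic, set()):
            continue            # no strong topic information -> do not expand (S233)
        if topic not in label_topics:
            continue            # topic is not label-discriminative -> do not expand (S234)
        if w in w2v.wv:         # top-k most similar words become expansion words
            expanded.extend(s for s, _ in w2v.wv.most_similar(w, topn=k))
    return tokens + expanded
```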
Further, the text representation method is improved through the joint action of word embeddings, word similarity, word-topic probability and label-topic probability. Specifically, the short text vector is constructed by directly adding the vectors of the bullet screen short text and of the expanded short text, with each expansion word weighted as
C(w_{i,j}) = sim(w_i, w_{i,j}) × P(w_i, topic_m) × P(topic_m, class) × D(w_{i,j})
where C(d) denotes the representation of short text d synthesized from word vectors, w_i is the i-th word in short text d, C(w_i) is the word vector of w_i, w_{i,j} is the j-th expansion word of the i-th word, C(w_{i,j}) is the final weighted vector of that expansion word, and D(w_{i,j}) is the word vector of the j-th expansion word of the i-th word. sim(w_i, w_{i,j}) is the semantic similarity between w_i and its j-th expansion word, P(w_i, topic_m) is the probability of w_i in its maximum-probability topic topic_m, and P(topic_m, class) is the probability that the label class generates the maximum topic topic_m.
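The fragment below sketches how such a representation can be assembled: original word vectors are summed, and each expansion word's vector is scaled by the three-factor weight before being added. How the per-word probabilities are looked up is left to assumed helper callables, since the description gives the weighting formula but not an implementation.

```python
# Build the weighted short-text vector C(d) from original and expansion word vectors.
import numpy as np

def text_vector(tokens, expansions, w2v, sim, p_word_topic, p_topic_class):
    """expansions: {w_i: [w_ij, ...]}; sim, p_word_topic and p_topic_class are assumed
    callables returning sim(w_i, w_ij), P(w_i, topic_m) and P(topic_m, class)."""
    c_d = np.zeros(w2v.vector_size)
    for w in tokens:
        if w in w2v.wv:
            c_d += w2v.wv[w]                                   # add C(w_i)
        for ext in expansions.get(w, []):
            if ext in w2v.wv:
                weight = sim(w, ext) * p_word_topic(w) * p_topic_class(w)
                c_d += weight * w2v.wv[ext]                    # add C(w_ij) = weight * D(w_ij)
    return c_d
```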
Further, the user credit rating is computed as follows:
The user credit rating I_credit-rating is calculated from the user's historical bullet screen publishing behaviour, as a formula over
N_total, the total number of bullet screens the user has published, and N_bad, the number of bad bullet screens published; the user credit rating is updated periodically.
The user identity credibility is computed as follows:
The user identity credibility I_identity-credibility is obtained from the user's platform level and VIP status, as a formula over
I_level, the normalized value of the user's platform level, and I_vip, which indicates whether the user is a VIP.
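The exact formulas appear only as figures in the original patent and are not reproduced in this text, so the sketch below uses assumed expressions that merely match the stated inputs: the share of non-bad bullet screens for the credit rating, and an average of the normalized platform level and a VIP indicator for identity credibility.

```python
# Assumed stand-ins for the two constructed user platform features; the real
# formulas are given in the patent figures and may differ from these.
def credit_rating(n_total: int, n_bad: int) -> float:
    """I_credit-rating from a user's publishing history, refreshed periodically."""
    return 1.0 if n_total == 0 else (n_total - n_bad) / n_total

def identity_credibility(level: int, is_vip: bool, max_level: int = 6) -> float:
    """I_identity-credibility from normalized platform level I_level and VIP flag I_vip."""
    i_level = level / max_level
    i_vip = 1.0 if is_vip else 0.0
    return (i_level + i_vip) / 2.0
```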
In addition, the present invention provides a bullet screen filtering device based on content and user identification that supports the above method. The device comprises a data preprocessing module, a text expansion module, a user platform feature construction module and a classification module, wherein:
Data preprocessing module: cleans records with missing data from the bullet screen data and user data crawled from bullet screen video websites with a Python crawler, and labels the bullet screen data as normal or bad; the bullet screen data includes the bullet screen short text, i.e. the text content of the bullet screen.
Text expansion module: builds the optimal feature space and the label-topic feature space to expand the labeled bullet screen short texts, and applies the improved text representation method to optimize the feature representation of the expanded short texts, yielding the expanded text features.
User platform feature construction module: constructs the user platform features by analyzing the user data and, on the basis of its original features, building new user credit rating and user identity credibility features.
Classification module: divides the labeled bullet screen data set into a training set and a test set, trains an SVM model with five-fold cross-validation to build the classification model, concatenates the expanded text features with the user platform features as input to the classification model, and outputs the bullet screen classification result.
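A compact sketch of the classification module's training logic follows; the SVM kernel, regularization constant and the 80/20 split are illustrative assumptions, while the five-fold cross-validation and the concatenation of the two feature blocks come from the description above.

```python
# Train the bullet screen content classification model (SVM, five-fold cross-validation).
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

def train_danmaku_classifier(text_vecs, platform_feats, labels):
    X = np.hstack([text_vecs, platform_feats])              # splice text and platform features
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2,
                                              stratify=labels, random_state=0)
    clf = SVC(kernel="rbf", C=1.0)                          # assumed kernel and C
    cv_acc = cross_val_score(clf, X_tr, y_tr, cv=5).mean()  # five-fold cross-validation
    clf.fit(X_tr, y_tr)
    return clf, cv_acc, clf.score(X_te, y_te)
```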
A computer-readable storage medium stores a computer program which, when run, implements the above bullet screen filtering method based on content and user identification.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
The bullet screen filtering method and device based on content and user identification combine the advantages of external-corpus expansion and expansion from the short text's own content features, introduce word vectors into the feature expansion, and thereby propose an improved text expansion method. This method assigns different weights to expansion words with topics of different granularity and different degrees of semantic supplementation, maximizing the semantic expansion of the original text. A text representation method in which word similarity, word-topic probability, label-topic probability and word embeddings act jointly is also proposed; the improved representation learns richer levels of semantic information and optimizes the feature representation of the text. In the user platform feature construction module, two new features are built: user credit rating and user identity credibility. Bullet screen texts are short and carry little information, so recognition based on text content alone has only a single dimension; adding user features to the bullet screen feature space further increases the information dimension, enriches the feature space, improves the bullet screen recognition rate and optimizes the bullet screen classification algorithm.
Brief Description of the Drawings
The drawings described here are provided to give a further understanding of the embodiments of the present invention and form part of the present application; they do not limit the embodiments of the present invention. In the drawings:
Fig. 1 is a schematic flow chart of the method of the present invention;
Fig. 2 is a schematic diagram of the text expansion method;
Fig. 3 is a flow chart of constructing the optimal feature space and the label-topic feature space;
Fig. 4 shows the text expansion method of one embodiment.
Detailed Description of the Embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the embodiments and the accompanying drawings; the exemplary embodiments and their descriptions are used only to explain the present invention and are not intended to limit it.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. It will be apparent to those of ordinary skill in the art, however, that the present invention may be practised without these specific details. In other instances, well-known structures, circuits, materials or methods are not described in detail so as not to obscure the present invention.
Throughout this specification, references to "one embodiment", "an embodiment", "one example" or "an example" mean that a particular feature, structure or characteristic described in connection with that embodiment or example is included in at least one embodiment of the present invention. Appearances of these phrases in various places throughout this specification therefore do not necessarily all refer to the same embodiment or example. Furthermore, particular features, structures or characteristics may be combined in any suitable combination and/or sub-combination in one or more embodiments or examples. Those of ordinary skill in the art will also appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In the description of the present invention, it should be understood that orientation or position terms such as "front", "rear", "left", "right", "upper", "lower", "vertical", "horizontal", "high", "low", "inner" and "outer" are based on the orientations or positions shown in the drawings, are used only to facilitate and simplify the description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation; they therefore do not limit the scope of protection of the present invention.
Embodiment 1
Because traditional text classification methods classify bullet screens poorly, and because text expansion using an external corpus alone or using the short text's own content features alone each has shortcomings, the present invention combines the advantages of the two methods, introduces word vectors into the feature expansion, and proposes a bullet screen filtering method based on content and user identification. The bullet screen video websites mentioned below include, but are not limited to, bullet screen video sharing websites such as Station A, Station B and Station C. As shown in Fig. 1, the overall flow of the method comprises steps S1-S4:
S1. Use a Python crawler to crawl the bullet screen data and user data of a bullet screen video website, clean the crawled data, and label each bullet screen as normal or bad. The bullet screen data includes the bullet screen short text, i.e. the text content of the bullet screen; the user data includes the user's gender, number of fans, number of followings and user level.
S2. Expand the bullet screen short texts labeled in step S1 and optimize the text feature representation of the expanded short texts to obtain expanded text features.
S3. Construct user platform features: analyze the user data and, on the basis of its original features, construct new user credit rating and user identity credibility features.
S4. Divide the bullet screen data labeled in step S1 into a training set and a test set, train an SVM model with five-fold cross-validation to build a bullet screen content classification model, concatenate the expanded text features from step S2 with the user platform features from step S3 as input to the classification model, output the bullet screen classification result, and finally filter bullet screens according to that result.
Specifically, in step S1 the bullet screen data is labeled according to its content: bullet screens containing violent, threatening, pornographic or otherwise uncivil language, meaningless single characters, or nothing but emoji are labeled 1, and all other bullet screens are labeled 0, where 0 denotes a normal bullet screen and 1 denotes a bad bullet screen.
Specifically, to maximize the semantic expansion of the original bullet screen short text, the features of the bullet screen short text are expanded as shown in Fig. 2, through steps S21-S24:
S21. Pre-train a Word2Vec model on an external corpus.
S22. Construct the optimal feature space and the label-topic feature space.
S23. Using the constructed optimal feature space and label-topic feature space, expand the qualifying words in each bullet screen short text with the pre-trained Word2Vec model to obtain an expanded short text.
S24. Improve the text representation method by introducing an expansion influence factor into the expanded short text to express how strongly each expansion word affects the bullet screen short text, maximizing the semantic expansion of the original bullet screen text and yielding the expanded text features.
The external corpus used to train the Word2Vec model comes from the comment data below videos on the bullet screen video website and from the bullet screen data in the videos. Jieba is used to segment the corpus and remove stop words, and the corpus is kept as large as possible. A word vector dictionary is built from the trained and optimized Word2Vec model. Because the corpus belongs to the same domain as the bullet screen data to be classified, its word coverage is higher than that of the commonly used Wikipedia and Sogou News word vectors.
Specifically, constructing the optimal feature space and the label-topic feature space assigns different weights to expansion words with topics of different granularity and different degrees of semantic supplementation, as shown in Fig. 3, and includes steps S221-S223:
S221. Use the chi-square test to extract feature words with class tendency from the bullet screen short texts and build the optimal feature space.
S222. Using an aggregation strategy, merge all bullet screen short texts under each label into one long text, and feed the long text of each label into an LDA topic model for training.
S223. Obtain the text-topic probability matrix from the LDA topic model, i.e. the probability of each label under every topic, and select the top n topics with the highest probability under each label to build the label-topic feature space.
In practice, not every word of the obtained bullet screen data needs to be expanded. Only qualifying words are expanded, according to the optimal feature space and label-topic feature space constructed above, which improves the efficiency of bullet screen classification. As shown in Fig. 4, this involves the following steps:
S231. Using the constructed label-topic feature space, obtain the topic-topic-word probability matrix from the LDA topic model and, for every topic, select the top n topic words with the highest probability to form a topic-word file.
S232. Traverse the words of the bullet screen short text; when a word belongs to the optimal feature space, compute from the topic-topic-word distribution matrix the topic to which that feature word belongs with maximum probability.
S233. Check in the topic-word file whether the word is a topic word of that topic. If it is not, the feature word carries no strong topic information and expanding it would easily introduce irrelevant feature words, so the word is not expanded.
S234. If it is, check further whether the maximum-probability topic belongs to the label-topic feature space. If it does, use the Word2Vec model to add the top k most similar words to the bullet screen short text as expansion words; if it does not, the maximum-probability topic is not label-discriminative and the word is not expanded.
Specifically, the text representation method is improved through the joint action of word embeddings, word similarity, word-topic probability and label-topic probability; the improved representation learns richer levels of semantic information and optimizes the feature representation of the text. The short text vector is constructed by directly adding the vectors of the bullet screen short text and of the expanded short text, with each expansion word weighted as
C(w_{i,j}) = sim(w_i, w_{i,j}) × P(w_i, topic_m) × P(topic_m, class) × D(w_{i,j})
where C(d) denotes the representation of short text d synthesized from word vectors, w_i is the i-th word in short text d, C(w_i) is the word vector of w_i, w_{i,j} is the j-th expansion word of the i-th word, C(w_{i,j}) is the final weighted vector of that expansion word, and D(w_{i,j}) is the word vector of the j-th expansion word of the i-th word; sim(w_i, w_{i,j}) is the semantic similarity between w_i and its j-th expansion word, P(w_i, topic_m) is the probability of w_i in its maximum-probability topic topic_m, and P(topic_m, class) is the probability that the label class generates the maximum topic topic_m.
Because bullet screen texts are short and carry little information, recognition based on text content alone has only a single dimension. Adding user features to the bullet screen feature space further increases the information dimension, enriches the feature space and improves the bullet screen recognition rate. The user data is therefore analyzed and, on the basis of its original features, two new features are constructed: user credit rating and user identity credibility. The user credit rating is computed as follows:
The user credit rating I_credit-rating is calculated from the user's historical bullet screen publishing behaviour, as a formula over
N_total, the total number of bullet screens the user has published, and N_bad, the number of bad bullet screens published; the user credit rating is updated periodically.
The user identity credibility is computed as follows:
The user identity credibility I_identity-credibility is obtained from the user's platform level and VIP status, as a formula over
I_level, the normalized value of the user's platform level, and I_vip, which indicates whether the user is a VIP.
Embodiment 2
A specific embodiment of the present invention further provides a bullet screen filtering device based on content and user identification, comprising a data preprocessing module, a text expansion module, a user platform feature construction module and a classification module, wherein:
Data preprocessing module: cleans records with missing data from the bullet screen data and user data crawled from bullet screen video websites with a Python crawler, and labels the bullet screen data as normal or bad.
Text expansion module: builds the optimal feature space and the label-topic feature space to expand the labeled bullet screen short texts, and applies the improved text representation method to optimize the feature representation of the expanded short texts, yielding the expanded text features.
User platform feature construction module: constructs the user platform features by analyzing the user data and, on the basis of its original features, building new user credit rating and user identity credibility features.
Classification module: divides the labeled bullet screen data set into a training set and a test set, trains an SVM model with five-fold cross-validation to build the classification model, concatenates the expanded text features with the user platform features as input to the classification model, outputs the bullet screen classification result, and filters bullet screens according to that result.
This device supports the bullet screen filtering method based on content and user identification described in Embodiment 1, which is not repeated here.
Embodiment 3
A specific embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when run, the computer program implements the bullet screen filtering method based on content and user identification described in Embodiment 1.
Those skilled in the art will further appreciate that the modules and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate this interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
It will be understood that the method provided by the present invention combines the advantages of external-corpus expansion and expansion from the short text's own content features, introduces word vectors into the feature expansion, and thereby provides an improved text expansion method. This method assigns different weights to expansion words with topics of different granularity and different degrees of semantic supplementation, maximizing the semantic expansion of the original text. A text representation method in which word similarity, word-topic probability, label-topic probability and word embeddings act jointly is also proposed; the improved representation learns richer levels of semantic information and optimizes the feature representation of the text. In the user platform feature construction module, two new features are built: user credit rating and user identity credibility. Bullet screen texts are short and carry little information, so recognition based on text content alone has only a single dimension; adding user features to the bullet screen feature space further increases the information dimension, enriches the feature space, improves the bullet screen recognition rate and optimizes the bullet screen classification algorithm.
The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011417368.XA CN112507164B (en) | 2020-12-07 | 2020-12-07 | Barrage filtering method, device and storage medium based on content and user ID |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011417368.XA CN112507164B (en) | 2020-12-07 | 2020-12-07 | Barrage filtering method, device and storage medium based on content and user ID |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112507164A (en) | 2021-03-16 |
CN112507164B (en) | 2022-04-12 |
Family
ID=74970884
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011417368.XA Active CN112507164B (en) | 2020-12-07 | 2020-12-07 | Barrage filtering method, device and storage medium based on content and user ID |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112507164B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114022744A (en) * | 2021-11-04 | 2022-02-08 | 北京香侬慧语科技有限责任公司 | Automatic illegal barrage detection method, device, system, medium and equipment |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7752272B2 (en) * | 2005-01-11 | 2010-07-06 | Research In Motion Limited | System and method for filter content pushed to client device |
US10284806B2 (en) * | 2017-01-04 | 2019-05-07 | International Business Machines Corporation | Barrage message processing |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105956200A (en) * | 2016-06-24 | 2016-09-21 | 武汉斗鱼网络科技有限公司 | Filtration and conversion-based popup screen interception method and apparatus |
CN106210770A (en) * | 2016-07-11 | 2016-12-07 | 北京小米移动软件有限公司 | A kind of method and apparatus showing barrage information |
CN106897422A (en) * | 2017-02-23 | 2017-06-27 | 百度在线网络技术(北京)有限公司 | Text handling method, device and server |
CN108650546A (en) * | 2018-05-11 | 2018-10-12 | 武汉斗鱼网络科技有限公司 | Barrage processing method, computer readable storage medium and electronic equipment |
CN108763348A (en) * | 2018-05-15 | 2018-11-06 | 南京邮电大学 | A kind of classification improved method of extension short text word feature vector |
CN108846431A (en) * | 2018-06-05 | 2018-11-20 | 成都信息工程大学 | Based on the video barrage sensibility classification method for improving Bayesian model |
CN111061866A (en) * | 2019-08-20 | 2020-04-24 | 河北工程大学 | Bullet screen text clustering method based on feature extension and T-oBTM |
CN110517121A (en) * | 2019-09-23 | 2019-11-29 | 重庆邮电大学 | Commodity recommendation method and commodity recommendation device based on comment text sentiment analysis |
CN111163359A (en) * | 2019-12-31 | 2020-05-15 | 腾讯科技(深圳)有限公司 | Bullet screen generation method and device and computer readable storage medium |
CN111614986A (en) * | 2020-04-03 | 2020-09-01 | 威比网络科技(上海)有限公司 | Bullet screen generation method, system, equipment and storage medium based on online education |
CN111625718A (en) * | 2020-05-19 | 2020-09-04 | 辽宁工程技术大学 | User portrait construction method based on user search keyword data |
CN111930943A (en) * | 2020-08-12 | 2020-11-13 | 中国科学技术大学 | Method and device for detecting pivot bullet screen |
Non-Patent Citations (3)
Title |
---|
Schapire R E; BoosTexter: A Boosting-based System for Text Categorization; Machine Learning; 2000-12-31; pp. 135-168 * |
唐晓波 et al.; Research on Classifying Health Questions Based on Keyword Word-Vector Feature Expansion; Data Analysis and Knowledge Discovery; 2020-07-25 (No. 07); pp. 66-75 * |
汪舸 et al.; Automatic Construction of a Spam Bullet-Screen Blocking Dictionary Based on Seed Words and Data Sets; Computer Engineering & Science; 2020-07-15 (No. 07); pp. 1302-1308 * |
Also Published As
Publication number | Publication date |
---|---|
CN112507164A (en) | 2021-03-16 |
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant