CN112507164B - Bullet screen filtering method and device based on content and user identification and storage medium - Google Patents

Bullet screen filtering method and device based on content and user identification and storage medium

Info

Publication number
CN112507164B
Authority
CN
China
Prior art keywords
bullet screen
text
user
word
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011417368.XA
Other languages
Chinese (zh)
Other versions
CN112507164A (en)
Inventor
吴渝
李芊
王利
于磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011417368.XA priority Critical patent/CN112507164B/en
Publication of CN112507164A publication Critical patent/CN112507164A/en
Application granted granted Critical
Publication of CN112507164B publication Critical patent/CN112507164B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a bullet screen filtering method and device based on content and user identification. The method comprises: preprocessing the bullet screen data and user data crawled from a bullet screen video website with a Python crawler; expanding the short bullet screen texts with a short-text representation method driven jointly by word embeddings, word similarity, word-topic probability and label-topic probability; constructing user platform features; and concatenating the expanded text features with the platform features, feeding them into a classification model, and outputting the bullet screen classification result. The method combines the advantages of external-corpus expansion and expansion from the short text's own content features, introduces word vectors into the feature expansion to maximize the semantic expansion of the original text, and adds user platform features to the bullet screen feature space, which enriches the feature space and improves the bullet screen recognition rate.

Description

Bullet screen filtering method and device based on content and user identification and storage medium
Technical Field
The invention relates to the technical field of video bullet screens (danmaku), and in particular to a bullet screen filtering method and device based on content and user identification, and a storage medium.
Background
In recent years, bullet screen (danmaku) video sharing websites (such as Site A, Site B and Site C) have developed rapidly and attracted wide attention from young audiences. These websites draw users with the high interactivity and freedom of the bullet screen, which raises the sites' play counts and reach. However, the platforms' management and monitoring of bullet screen content is inadequate, so vulgar, violent, negative and similar comments appear in different videos and seriously harm the viewing experience. Moreover, because the bullet screen audience is mainly teenagers, problems such as non-standard language and violent or morbid content can adversely affect teenagers' language literacy, value formation and mental development.
After a bullet screen text is preprocessed, few feature words remain; bullet screens are typical ultra-short texts. Because of this feature sparsity, traditional text classification methods fall far short of an ideal bullet screen classification result. To address the sparsity of short-text features, researchers at home and abroad have adopted feature expansion, realizing text expansion mainly through external corpora or through the short text's own content features. Each approach has advantages and disadvantages: external-corpus expansion depends mainly on the quality of the corpus, is computationally heavy and strongly text-dependent, while expansion from the short text's own content mines the text's own semantic features and is prone to overfitting. A new text classification method can therefore be proposed that combines the advantages of the two.
Disclosure of Invention
The invention provides a bullet screen filtering method and device based on content and user identification that combine external-corpus expansion with expansion from the short text's own content features while introducing word vectors into the feature expansion, providing a new text classification method that addresses the sparsity of bullet screen text features on bullet screen video websites.
The invention is realized by the following technical scheme:
because the traditional text classification method has poor bullet screen classification effect and has defects of realizing text expansion by means of external corpus expansion or short text self content characteristic, the invention combines the advantages of the external corpus expansion and the short text self content characteristic expansion, introduces word vectors into the characteristic expansion, and provides a bullet screen content classification method based on content and user identification, thereby completing the filtration of bullet screens, and the method comprises the steps of S1-S4:
S1, crawl the bullet screen data and user data of a bullet screen video website with a Python crawler, clean the crawled data, and label the bullet screen data as normal or bad bullet screens; the bullet screen data comprise the short bullet screen texts, i.e. the text content of each bullet screen, and the user data comprise the user's gender, number of followers, number of accounts followed and user level;
S2, expand the short bullet screen texts labeled in step S1 and optimize the text feature representation of the expanded texts to obtain the expanded text features;
S3, construct user platform features: analyze the user data and build the new user reputation grade and user identity credibility features on top of the original user-data features;
S4, divide the labeled bullet screen data from step S1 into a training set and a test set, train an SVM with five-fold cross validation to build the bullet screen content classification model, concatenate the expanded text features from step S2 with the user platform features from step S3, feed them into the classification model, output the bullet screen classification result, and finally filter bullet screens according to that result.
Further, the bullet screen data in step S1 are labeled according to their content: bullet screen content containing uncivil wording such as violence, threats or pornography, meaningless single characters, or consisting entirely of emoticons is labeled 1, and all other bullet screen content is labeled 0, where 0 denotes a normal bullet screen and 1 denotes a bad bullet screen.
Further, the step S2 specifically includes steps S21-S24:
s21, pre-training a Word2Vec model according to an external corpus;
S22, construct the optimal feature space and the label topic feature space;
S23, expand the qualifying words of the short bullet screen text with the pre-trained Word2Vec model, according to the constructed optimal feature space and label topic feature space, to obtain the expanded short text;
S24, improve the text representation method by introducing an expansion influence factor into the expanded short text to represent the degree to which each expansion word influences the original short bullet screen text, maximizing the semantic expansion of the original text and obtaining the expanded text features.
Further, the external corpus used to train the Word2Vec model comes from the comment data below videos on the bullet screen video website and from the bullet screen data within the videos; because its content shares the domain of the bullet screen data to be classified, its word coverage is higher than that of the commonly used Wikipedia and Sogou news word vectors.
Further, the step of constructing the optimal feature space and the label topic feature space specifically includes steps S221 to S223:
s221, extracting feature words with category tendentiousness in the bullet screen short text by using a chi-square test method, and constructing an optimal feature space;
S222, combine all short bullet screen texts under each label into one long text using an aggregation strategy, and feed the long text of each label into an LDA topic model for training;
S223, use the LDA topic model to obtain the text-topic probability matrix, obtain the probability of each label under each topic, and select the n topics with the highest probability under each label to construct the label topic feature space.
Further, expanding the qualifying words specifically comprises the following steps:
S231, using the LDA topic model, obtain the topic-topic-word probability matrix and, according to the constructed label topic feature space, select the n topic words with the highest probability under each topic to form a topic-word file;
S232, traverse the words of the short bullet screen text and, for each word that belongs to the optimal feature space, calculate its maximum-probability topic from the topic-topic-word distribution matrix;
S233, check against the topic-word file whether the word is a topic word of that topic; if not, the word carries no strong topic information and expanding it would easily introduce irrelevant feature words, so it is not expanded;
S234, if the word is a topic word, check whether its maximum-probability topic belongs to the label topic feature space; if it does, use the Word2Vec model to add the k most similar words to the short bullet screen text as expansion words; if not, the maximum-probability topic has no label-discriminating power and the word is not expanded.
Further, the text representation method is improved through the joint action of word embeddings, word similarity, word-topic probability and label-topic probability. Specifically, the short-text vector is constructed by directly adding the vectors of the original short bullet screen text and of the expanded short text:
C(d) = Σ_i C(w_i) + Σ_i Σ_j C(w_i,j)
C(w_i,j) = sim(w_i, w_i,j) × P(w_i, topic_m) × P(topic_m, class) × D(w_i,j)
where C(d) is the word-vector-based representation of short text d; w_i is the i-th word appearing in d and C(w_i) its word-vector representation; w_i,j is the j-th expansion word of the i-th word, C(w_i,j) the final weighted vector representation of that expansion word, and D(w_i,j) its word-vector representation; sim(w_i, w_i,j) is the semantic similarity between w_i and its j-th expansion word; P(w_i, topic_m) is the probability of w_i under its maximum topic topic_m; and P(topic_m, class) is the probability of topic_m under the class label.
Further, the user reputation grade is calculated as follows:
the user reputation grade I_credit-rating is obtained from the user's history of posted bullet screens:
[formula given as an image in the original: Figure BDA0002820568220000032]
where N_total denotes the total number of bullet screens the user has posted and N_bad the number of bad bullet screens posted; the user reputation grade is updated periodically.
The user identity credibility is calculated as follows:
the user identity credibility I_identity-credibility is obtained from the user's platform level and whether the user is a VIP:
[formula given as an image in the original: Figure BDA0002820568220000033]
where I_level is the user's normalized platform level and I_vip indicates whether the user is a VIP.
In addition, the invention provides a bullet screen filtering device based on content and user identification that supports the above bullet screen filtering method and comprises a data preprocessing module, a text expansion module, a user platform feature construction module and a classification module, where:
a data preprocessing module: used for cleaning missing values from the bullet screen data and user data crawled from a bullet screen video website with a Python crawler, and for labeling the bullet screen data as normal or bad bullet screens; the bullet screen data comprise the short bullet screen texts, i.e. the text content of each bullet screen;
a text expansion module: used for constructing the optimal feature space and the label topic feature space to expand the labeled short bullet screen texts, and for applying the improved text representation method to optimize the feature representation of the expanded texts to obtain the expanded text features;
a user platform feature construction module: used for constructing user platform features by analyzing the user data and building the new user reputation grade and user identity credibility features on top of the original user-data features;
a classification module: used for dividing the labeled bullet screen data set into a training set and a test set, training an SVM with five-fold cross validation to build the classification model, concatenating the expanded text features with the user platform features, feeding them into the classification model, and outputting the bullet screen classification result.
A computer-readable storage medium, on which a computer program is stored, which, when run, implements the above bullet screen filtering method based on content and user identification.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention discloses a bullet screen filtering method and device based on content and user identification, which combines the advantages of external corpus expansion and short text self content feature expansion and introduces word vectors into feature expansion at the same time, thereby providing an improved text expansion method. The method gives different weights to the expansion words with different granularity themes and different semantic supplementation degrees, and realizes the original text semantic expansion to the maximum extent. Meanwhile, a text representation method with combined action of word similarity, word-to-subject probability, label subject probability and word embedding is provided, the improved text representation can learn richer semantic information, and the feature representation of the text is optimized. In a user platform class feature construction module, two new features of user reputation grade and user identity credibility are constructed. The bullet screen text is short and short, the information quantity is small, the user characteristics are added into the bullet screen characteristic space only based on the single identification dimension of the text content, the information dimension can be further increased, the bullet screen characteristic space is enriched, the bullet screen identification rate is improved, and the bullet screen classification algorithm is optimized.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of a text expansion method;
FIG. 3 is a flow chart of constructing an optimal feature space and a label topic feature space;
FIG. 4 is a text expansion method of an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that: it is not necessary to employ these specific details to practice the present invention. In other instances, well-known structures, circuits, materials, or methods have not been described in detail so as not to obscure the present invention.
Throughout the specification, reference to "one embodiment," "an embodiment," "one example," or "an example" means: the particular features, structures, or characteristics described in connection with the embodiment or example are included in at least one embodiment of the invention. Thus, the appearances of the phrases "one embodiment," "an embodiment," "one example" or "an example" in various places throughout this specification are not necessarily all referring to the same embodiment or example. Furthermore, the particular features, structures, or characteristics may be combined in any suitable combination and/or sub-combination in one or more embodiments or examples. Further, those of ordinary skill in the art will appreciate that the illustrations provided herein are for illustrative purposes and are not necessarily drawn to scale. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In the description of the present invention, it is to be understood that the terms "front", "rear", "left", "right", "upper", "lower", "vertical", "horizontal", "high", "low", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and therefore, are not to be construed as limiting the scope of the present invention.
Example 1
Because traditional text classification methods classify bullet screens poorly, and because text expansion through an external corpus alone or through the short text's own content features alone is insufficient, the invention combines the advantages of both and introduces word vectors into the feature expansion, providing a bullet screen filtering method based on content and user identification. The bullet screen video websites mentioned below include, but are not limited to, bullet screen video sharing sites such as Site A, Site B and Site C. As shown in FIG. 1, the overall flow of the method comprises steps S1-S4:
S1, crawl the bullet screen data and user data of a bullet screen video website with a Python crawler, clean the crawled data, and label the bullet screen data as normal or bad bullet screens; the bullet screen data comprise the short bullet screen texts, i.e. the text content of each bullet screen, and the user data comprise the user's gender, number of followers, number of accounts followed and user level;
S2, expand the short bullet screen texts labeled in step S1 and optimize the text feature representation of the expanded texts to obtain the expanded text features;
S3, construct user platform features: analyze the user data and build the new user reputation grade and user identity credibility features on top of the original user-data features;
S4, divide the labeled bullet screen data from step S1 into a training set and a test set, train an SVM with five-fold cross validation to build the bullet screen content classification model, concatenate the expanded text features from step S2 with the user platform features from step S3, feed them into the classification model, output the bullet screen classification result, and finally filter bullet screens according to that result; a minimal code sketch of this classification step follows.
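The sketch below illustrates step S4 with scikit-learn: the expanded text features and the user platform features are concatenated, an SVM is evaluated with five-fold cross validation, and a held-out test set is scored. The randomly generated feature arrays and the parameter values are stand-ins for the outputs of steps S2 and S3, not the patent's actual data or settings.

```python
# Minimal sketch of step S4: concatenate expanded-text features with user platform
# features and train an SVM with five-fold cross validation (scikit-learn).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
n_samples = 200
text_features = rng.normal(size=(n_samples, 100))      # expanded text vectors (stand-in for S2)
platform_features = rng.normal(size=(n_samples, 6))    # user platform features (stand-in for S3)
labels = rng.integers(0, 2, size=n_samples)            # 0 = normal, 1 = bad bullet screen

# Splice (concatenate) the two feature groups along the feature axis.
features = np.hstack([text_features, platform_features])

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)   # five-fold cross validation
print("5-fold CV accuracy:", cv_scores.mean())

clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```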
Specifically, the bullet screen data in step S1 are labeled according to their content: bullet screen content containing uncivil wording such as violence, threats or pornography, meaningless single characters, or consisting entirely of emoticons is labeled 1, and all other bullet screen content is labeled 0, where 0 denotes a normal bullet screen and 1 denotes a bad bullet screen, as illustrated by the sketch below.
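A minimal sketch of this labeling rule is given below. The offensive-word lexicon and the emoticon check are hypothetical placeholders; the patent does not publish its actual word lists.

```python
# Illustrative sketch of the labeling rule in S1: mark a bullet screen as 1 (bad) if it
# contains offensive wording, is a meaningless single character, or consists entirely
# of emoticons/symbols; otherwise mark it 0.
import re

BAD_WORDS = {"暴力词示例", "威胁词示例"}          # placeholder offensive-word lexicon
EMOTICON_ONLY = re.compile(r"^[\W_]+$")           # rough "all symbols/emoticons" check

def label_danmaku(text: str) -> int:
    stripped = text.strip()
    if len(stripped) <= 1:                        # meaningless single character
        return 1
    if EMOTICON_ONLY.match(stripped):             # composed entirely of emoticons/symbols
        return 1
    if any(word in stripped for word in BAD_WORDS):
        return 1
    return 0

print(label_danmaku("好看"))   # 0, normal bullet screen
print(label_danmaku("!!!"))    # 1, emoticon/symbol-only bullet screen
```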
Specifically, to maximize the semantic expansion of the original short bullet screen text, the method for expanding the bullet screen short-text features comprises, as shown in FIG. 2, steps S21-S24:
s21, pre-training a Word2Vec model according to an external corpus;
S22, construct the optimal feature space and the label topic feature space;
S23, expand the qualifying words of the short bullet screen text with the pre-trained Word2Vec model, according to the constructed optimal feature space and label topic feature space, to obtain the expanded short text;
S24, improve the text representation method by introducing an expansion influence factor into the expanded short text to represent the degree to which each expansion word influences the original short bullet screen text, maximizing the semantic expansion of the original text and obtaining the expanded text features.
The external corpus used to train the Word2Vec model comes from the comment data below videos on the bullet screen video website and from the bullet screen data within the videos. The corpus is segmented with Jieba and stop words are removed, the corpus is kept as large as possible, and a word-vector dictionary is built from the trained and tuned Word2Vec model. Because the content of the external corpus shares the domain of the bullet screens to be classified, its word coverage is higher than that of the commonly used Wikipedia and Sogou news word vectors. A minimal training sketch follows.
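The sketch below shows one way to pre-train the Word2Vec model on such an external corpus with Jieba segmentation and stop-word removal, using gensim. The tiny corpus, the stop-word list and the hyperparameters are illustrative assumptions only.

```python
# Minimal sketch of pre-training the Word2Vec model on the external corpus
# (video comments plus in-video bullet screens) with Jieba segmentation.
import jieba
from gensim.models import Word2Vec

corpus = ["这个视频太好看了", "弹幕内容需要过滤", "评论区的数据也可以利用"]
stop_words = {"的", "了", "也"}                     # placeholder stop-word list

# Segment each document and drop stop words.
sentences = [[w for w in jieba.lcut(doc) if w not in stop_words] for doc in corpus]

# Skip-gram Word2Vec; parameter values are illustrative, not the patent's settings.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=20)

# The trained model later supplies the top-k similar words used as expansion words.
print(w2v.wv.most_similar("视频", topn=3))
```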
Specifically, the optimal feature space and the label topic feature space are constructed so that expansion words with topics of different granularity and different degrees of semantic complement receive different weights; as shown in FIG. 3, and sketched in code after this list, the construction comprises steps S221 to S223:
s221, extracting feature words with category tendentiousness in the bullet screen short text by using a chi-square test method, and constructing an optimal feature space;
S222, combine all short bullet screen texts under each label into one long text using an aggregation strategy, and feed the long text of each label into an LDA topic model for training;
S223, use the LDA topic model to obtain the text-topic probability matrix, obtain the probability of each label under each topic, and select the n topics with the highest probability under each label to construct the label topic feature space.
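A minimal sketch of steps S221-S223 follows, assuming scikit-learn for the chi-square test and gensim for the LDA topic model; the toy texts, labels and parameter values are placeholders.

```python
# Sketch of S221-S223: chi-square selection of category-indicative words (the
# "optimal feature space") and an LDA model trained on label-aggregated long texts
# to obtain the label topic feature space.
import jieba
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from gensim import corpora
from gensim.models import LdaModel

texts = ["这个视频真好看", "主播太暴力了", "画面很精彩", "满屏都是威胁的话"]
labels = [0, 1, 0, 1]
tokenized = [jieba.lcut(t) for t in texts]

# S221: chi-square test over a bag-of-words representation; keep the top-scoring words.
vec = CountVectorizer(analyzer=lambda toks: toks)
X = vec.fit_transform(tokenized)
scores, _ = chi2(X, labels)
top_idx = np.argsort(scores)[::-1][:10]
optimal_feature_space = {vec.get_feature_names_out()[i] for i in top_idx}

# S222: aggregate all short texts of each label into one long text and train LDA.
long_texts = {}
for toks, y in zip(tokenized, labels):
    long_texts.setdefault(y, []).extend(toks)
dictionary = corpora.Dictionary(long_texts.values())
bow = [dictionary.doc2bow(doc) for doc in long_texts.values()]
lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=4, passes=20, random_state=0)

# S223: per-label topic probabilities; keep the n most probable topics of each label.
n = 2
label_topic_space = {
    label: sorted(lda.get_document_topics(doc, minimum_probability=0.0),
                  key=lambda p: -p[1])[:n]
    for label, doc in zip(long_texts.keys(), bow)
}
print(optimal_feature_space)
print(label_topic_space)
```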
In practice, not all words of the acquired bullet screen data need to be expanded; only the words that satisfy the conditions of the constructed optimal feature space and label topic feature space are expanded, which improves bullet screen classification efficiency. As shown in FIG. 4 (and in the code sketch after this list), the procedure comprises the following steps:
S231, using the LDA topic model, obtain the topic-topic-word probability matrix and, according to the constructed label topic feature space, select the n topic words with the highest probability under each topic to form a topic-word file;
S232, traverse the words of the short bullet screen text and, for each word that belongs to the optimal feature space, calculate its maximum-probability topic from the topic-topic-word distribution matrix;
S233, check against the topic-word file whether the word is a topic word of that topic; if not, the word carries no strong topic information and expanding it would easily introduce irrelevant feature words, so it is not expanded;
S234, if the word is a topic word, check whether its maximum-probability topic belongs to the label topic feature space; if it does, use the Word2Vec model to add the k most similar words to the short bullet screen text as expansion words; if not, the maximum-probability topic has no label-discriminating power and the word is not expanded.
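The decision logic of S231-S234 can be sketched as follows. It assumes the objects built in the earlier sketches: a trained gensim LdaModel `lda` with its `dictionary`, the Word2Vec model `w2v`, the set `optimal_feature_space`, and a set `label_topic_ids` of topic ids taken from the label topic feature space; these names and the parameter values are illustrative.

```python
# Sketch of the expansion rule in S231-S234.
def expand_short_text(tokens, lda, dictionary, w2v, optimal_feature_space,
                      label_topic_ids, top_topic_words=20, k=3):
    expanded = list(tokens)
    for word in tokens:
        if word not in optimal_feature_space:            # S232: only category-indicative words
            continue
        if word not in dictionary.token2id or word not in w2v.wv:
            continue
        # Most probable topic of this word from the topic-word distribution.
        topic_probs = lda.get_term_topics(dictionary.token2id[word],
                                          minimum_probability=0.0)
        if not topic_probs:
            continue
        best_topic, _ = max(topic_probs, key=lambda p: p[1])
        # S233: the word must itself be a topic word of that topic.
        topic_words = {w for w, _ in lda.show_topic(best_topic, topn=top_topic_words)}
        if word not in topic_words:
            continue
        # S234: the topic must belong to the label topic feature space.
        if best_topic not in label_topic_ids:
            continue
        # Add the k most similar words from Word2Vec as expansion words.
        expanded.extend(w for w, _ in w2v.wv.most_similar(word, topn=k))
    return expanded
```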
Specifically, the text representation method is improved through the joint action of word embeddings, word similarity, word-topic probability and label-topic probability; the improved representation learns richer semantic information and optimizes the feature representation of the text. The short-text vector is constructed by directly adding the vectors of the original short bullet screen text and of the expanded short text:
C(d) = Σ_i C(w_i) + Σ_i Σ_j C(w_i,j)
C(w_i,j) = sim(w_i, w_i,j) × P(w_i, topic_m) × P(topic_m, class) × D(w_i,j)
where C(d) is the word-vector-based representation of short text d; w_i is the i-th word appearing in d and C(w_i) its word-vector representation; w_i,j is the j-th expansion word of the i-th word, C(w_i,j) the final weighted vector representation of that expansion word, and D(w_i,j) its word-vector representation; sim(w_i, w_i,j) is the semantic similarity between w_i and its j-th expansion word; P(w_i, topic_m) is the probability of w_i under its maximum topic topic_m; and P(topic_m, class) is the probability of topic_m under the class label.
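The weighted composition described by the formulas above can be sketched as follows; the dictionaries `p_word_topic` and `p_topic_class` are assumed to hold, for each word, P(w_i, topic_m) for its maximum topic and P(topic_m, class) for that topic under the class label, as obtained from the LDA model.

```python
# Sketch of the improved text representation: each expansion word's vector is weighted
# by word similarity, word-topic probability and label-topic probability before being
# added to the short-text vector.
import numpy as np

def compose_text_vector(words, expansions, w2v, p_word_topic, p_topic_class):
    """words: tokens of the short text; expansions: {word: [expansion words]}."""
    dim = w2v.wv.vector_size
    c_d = np.zeros(dim)
    for w in words:
        if w not in w2v.wv:
            continue
        c_d += w2v.wv[w]                                   # C(w_i)
        for e in expansions.get(w, []):
            if e not in w2v.wv:
                continue
            weight = (w2v.wv.similarity(w, e)              # sim(w_i, w_i,j)
                      * p_word_topic.get(w, 0.0)           # P(w_i, topic_m)
                      * p_topic_class.get(w, 0.0))         # P(topic_m, class)
            c_d += weight * w2v.wv[e]                      # C(w_i,j) = weight * D(w_i,j)
    return c_d
```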
Because bullet screen texts are short and carry little information, identification based on text content alone is one-dimensional; adding user features to the bullet screen feature space further increases the information dimensions, enriches the feature space and improves the recognition rate. The user data are therefore analyzed, and two new features, user reputation grade and user identity credibility, are built on top of the original user-data features. The user reputation grade is calculated as follows:
the user reputation grade I_credit-rating is obtained from the user's history of posted bullet screens:
[formula given as an image in the original: Figure BDA0002820568220000072]
where N_total denotes the total number of bullet screens the user has posted and N_bad the number of bad bullet screens posted; the user reputation grade is updated periodically.
The user identity credibility is calculated as follows:
the user identity credibility I_identity-credibility is obtained from the user's platform level and whether the user is a VIP:
[formula given as an image in the original: Figure BDA0002820568220000081]
where I_level is the user's normalized platform level and I_vip indicates whether the user is a VIP.
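A sketch of the two constructed platform features follows. Since both formulas are given only as images in the original, the concrete forms below (the share of normal bullet screens among all posted, and the average of the normalized platform level with the VIP indicator) are assumptions used purely for illustration.

```python
# Sketch of the two constructed user platform features (assumed formula forms).
def user_reputation_grade(n_total: int, n_bad: int) -> float:
    # Assumed form: share of normal bullet screens among all posted bullet screens.
    if n_total == 0:
        return 1.0
    return 1.0 - n_bad / n_total

def user_identity_credibility(platform_level: int, is_vip: bool,
                              max_level: int = 60) -> float:
    # Assumed form: normalized platform level averaged with the VIP indicator.
    i_level = min(platform_level, max_level) / max_level
    i_vip = 1.0 if is_vip else 0.0
    return (i_level + i_vip) / 2.0

print(user_reputation_grade(n_total=120, n_bad=6))                  # 0.95
print(user_identity_credibility(platform_level=30, is_vip=True))    # 0.75
```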
Example 2
The specific embodiment of the invention also provides a bullet screen filtering device based on content and user identification, comprising a data preprocessing module, a text expansion module, a user platform feature construction module and a classification module, where:
a data preprocessing module: used for cleaning missing values from the bullet screen data and user data crawled from a bullet screen video website with a Python crawler, and for labeling the bullet screen data as normal or bad bullet screens;
a text expansion module: used for constructing the optimal feature space and the label topic feature space to expand the labeled short bullet screen texts, and for applying the improved text representation method to optimize the feature representation of the expanded texts to obtain the expanded text features;
a user platform feature construction module: used for constructing user platform features by analyzing the user data and building the new user reputation grade and user identity credibility features on top of the original user-data features;
a classification module: used for dividing the labeled bullet screen data set into a training set and a test set, training an SVM with five-fold cross validation to build the classification model, concatenating the expanded text features with the user platform features, feeding them into the classification model, outputting the bullet screen classification result, and filtering bullet screens according to that result.
The apparatus supports the bullet screen filtering method based on content and user identification described in embodiment 1, which is not described herein again.
Example 3
The embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when running, implements the bullet screen filtering method based on content and user identification described in embodiment 1.
Those of skill would further appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
It can be understood that the method provided by the invention combines the advantages of external-corpus expansion and expansion from the short text's own content features while introducing word vectors into the feature expansion, giving an improved text expansion method. The method assigns different weights to expansion words of different topic granularity and different degrees of semantic complement, maximizing the semantic expansion of the original text. At the same time, a text representation method driven jointly by word similarity, word-topic probability, label-topic probability and word embeddings is proposed; the improved representation learns richer semantic information and optimizes the feature representation of the text. In the user platform feature construction module, two new features, user reputation grade and user identity credibility, are constructed. Because bullet screen texts are short and carry little information, identification based on text content alone is one-dimensional; adding user features to the bullet screen feature space further increases the information dimensions, enriches the feature space, improves the bullet screen recognition rate and optimizes the bullet screen classification algorithm.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A bullet screen filtering method based on content and user identification is characterized by comprising the following steps:
s1, crawling bullet screen data and user data of a bullet screen video website by using python crawler software, cleaning the crawled data, marking the bullet screen data into a common bullet screen and a bad bullet screen, wherein the bullet screen data comprise short bullet screen texts;
s2, expanding the barrage short text marked in the step S1, and optimizing the feature representation of the text of the expanded barrage short text to obtain the feature of the expanded text;
s3, constructing user platform class characteristics, analyzing user data, and newly constructing user reputation grade characteristics and user identity credibility characteristics on the basis of original characteristics of the user data;
and S4, dividing the marked bullet screen data in the step S1 into a training set and a testing set, constructing a bullet screen content classification model by utilizing a five-fold cross validation training SVM model, splicing the extended text features obtained in the step S2 and the user platform features obtained in the step S3, inputting the spliced extended text features and the user platform features into the bullet screen content classification model, and outputting bullet screen classification results.
2. The bullet screen filtering method based on content and user identification according to claim 1, wherein the specific process of labeling the bullet screen data in step S1 is labeling the bullet screen data according to the bullet screen content: bullet screen content containing uncivil wording, meaningless single characters, or consisting entirely of emoticons is labeled 1, and all other bullet screen content is labeled 0, where 0 denotes a normal bullet screen and 1 denotes a bad bullet screen.
3. The bullet screen filtering method based on the content and the user identifier as claimed in claim 1, wherein the step S2 specifically includes the following steps:
s21, pre-training a Word2Vec model according to an external corpus;
s22, constructing an optimal feature space and a label theme feature space;
s23, expanding words meeting conditions in the bullet screen short text based on a pre-trained Word2Vec model according to the constructed optimal feature space and the label theme feature space to obtain an expanded short text;
and S24, improving a text representation method, and introducing an expansion influence factor into the expanded short text to represent the influence degree of the expansion words on the bullet screen short text to obtain the characteristics of the expanded text.
4. The bullet screen filtering method based on content and user identification as claimed in claim 3, wherein the external corpus includes comment data below the bullet screen video website video and bullet screen data in the video.
5. The bullet screen filtering method based on the content and the user identification as claimed in claim 3, wherein the constructing of the optimal feature space and the tag topic feature space specifically comprises the following steps:
s221, extracting feature words with category tendentiousness in the bullet screen short text by using a chi-square test method, and constructing an optimal feature space;
S222, combining all the short bullet screen texts under each label into one long text by adopting an aggregation strategy, and inputting the long text under each label into an LDA topic model for training;
s223, obtaining a text-theme probability matrix by using the LDA theme model, obtaining the probability of each label under each theme, and selecting the first n themes with high probability under each label to construct a label theme feature space.
6. The bullet screen filtering method based on the content and the user identification as claimed in claim 3, wherein the step of expanding the qualified vocabulary specifically comprises the steps of:
s231, forming a subject word file according to the subjects in the constructed tag subject feature space;
s232, traversing the words in the bullet screen short text, and calculating the maximum probability theme of the words based on a theme-subject word distribution matrix if the words belong to the optimal feature space;
s233, checking whether the vocabulary belongs to the theme words of the corresponding theme according to the theme word file, if not, not expanding the vocabulary;
s234, if the Word belongs to the label topic feature space, checking whether the maximum probability topic of the Word belongs to the label topic feature space, and if the Word belongs to the label topic feature space, adding the first k words with high similarity as extension words into the bullet screen short text by using a Word2Vec model; if not, the vocabulary is not expanded.
7. The bullet screen filtering method based on content and user identification according to claim 3, wherein the text representation method is improved through the joint action of word embeddings, word similarity, word-topic probability and label-topic probability, and the short-text vector is constructed by directly adding the vectors of the original short bullet screen text and of the expanded short text:
C(d) = Σ_i C(w_i) + Σ_i Σ_j C(w_i,j)
C(w_i,j) = sim(w_i, w_i,j) × P(w_i, topic_m) × P(topic_m, class) × D(w_i,j)
where C(d) is the word-vector-based representation of short text d; w_i is the i-th word appearing in d and C(w_i) its word-vector representation; w_i,j is the j-th expansion word of the i-th word, C(w_i,j) the final weighted vector representation of that expansion word, and D(w_i,j) its word-vector representation; sim(w_i, w_i,j) is the semantic similarity between w_i and its j-th expansion word; P(w_i, topic_m) is the probability of w_i under its maximum topic topic_m; and P(topic_m, class) is the probability of topic_m under the class label.
8. The bullet screen filtering method based on content and user identification according to claim 1, wherein the user reputation grade is calculated as follows:
the user reputation grade I_credit-rating is obtained from the user's history of posted bullet screens:
[formula given as an image in the original: Figure FDA0002820568210000022]
where N_total denotes the total number of bullet screens the user has posted and N_bad the number of bad bullet screens posted, and the user reputation grade is updated periodically;
the user identity credibility is calculated as follows:
the user identity credibility I_identity-credibility is obtained from the user's platform level and whether the user is a VIP:
[formula given as an image in the original: Figure FDA0002820568210000031]
where I_level is the user's normalized platform level and I_vip indicates whether the user is a VIP.
9. A bullet screen filtering device based on content and user identification, which supports the bullet screen filtering method based on content and user identification according to any one of claims 1-8, and comprises a data preprocessing module, a text extension module, a user platform class feature construction module and a classification module, wherein,
a data preprocessing module: the system is used for cleaning missing data of bullet screen data and user data of a bullet screen video website crawled by python crawler software, and marking the bullet screen data into a common bullet screen and a bad bullet screen;
a text extension module: the method is used for constructing an optimal feature space and a label subject feature space to expand the marked bullet screen short text, and the improved text representation method is used for carrying out text feature representation optimization on the expanded bullet screen short text to obtain the expanded text features;
the user platform class feature construction module: the system is used for constructing user platform class characteristics, analyzing user data and newly constructing user reputation grade characteristics and user identity credibility characteristics on the basis of user data original characteristics;
a classification module: the method is used for dividing the marked bullet screen data set into a training set and a testing set, constructing a classification model by utilizing a five-fold cross validation training SVM model, splicing and inputting the expanded text characteristics and the user platform type characteristics into the classification model, and outputting bullet screen classification results.
10. A computer-readable storage medium, on which a computer program is stored which, when executed, implements the method of any one of claims 1-8.
CN202011417368.XA 2020-12-07 2020-12-07 Bullet screen filtering method and device based on content and user identification and storage medium Active CN112507164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011417368.XA CN112507164B (en) 2020-12-07 2020-12-07 Bullet screen filtering method and device based on content and user identification and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011417368.XA CN112507164B (en) 2020-12-07 2020-12-07 Bullet screen filtering method and device based on content and user identification and storage medium

Publications (2)

Publication Number Publication Date
CN112507164A CN112507164A (en) 2021-03-16
CN112507164B true CN112507164B (en) 2022-04-12

Family

ID=74970884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011417368.XA Active CN112507164B (en) 2020-12-07 2020-12-07 Bullet screen filtering method and device based on content and user identification and storage medium

Country Status (1)

Country Link
CN (1) CN112507164B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956200A (en) * 2016-06-24 2016-09-21 武汉斗鱼网络科技有限公司 Filtration and conversion-based popup screen interception method and apparatus
CN106210770A (en) * 2016-07-11 2016-12-07 北京小米移动软件有限公司 A kind of method and apparatus showing barrage information
CN106897422A (en) * 2017-02-23 2017-06-27 百度在线网络技术(北京)有限公司 Text handling method, device and server
CN108650546A (en) * 2018-05-11 2018-10-12 武汉斗鱼网络科技有限公司 Barrage processing method, computer readable storage medium and electronic equipment
CN108763348A (en) * 2018-05-15 2018-11-06 南京邮电大学 A kind of classification improved method of extension short text word feature vector
CN108846431A (en) * 2018-06-05 2018-11-20 成都信息工程大学 Based on the video barrage sensibility classification method for improving Bayesian model
CN110517121A (en) * 2019-09-23 2019-11-29 重庆邮电大学 Method of Commodity Recommendation and the device for recommending the commodity based on comment text sentiment analysis
CN111061866A (en) * 2019-08-20 2020-04-24 河北工程大学 Bullet screen text clustering method based on feature extension and T-oBTM
CN111163359A (en) * 2019-12-31 2020-05-15 腾讯科技(深圳)有限公司 Bullet screen generation method and device and computer readable storage medium
CN111614986A (en) * 2020-04-03 2020-09-01 威比网络科技(上海)有限公司 Bullet screen generation method, system, equipment and storage medium based on online education
CN111625718A (en) * 2020-05-19 2020-09-04 辽宁工程技术大学 User portrait construction method based on user search keyword data
CN111930943A (en) * 2020-08-12 2020-11-13 中国科学技术大学 Method and device for detecting pivot bullet screen

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7752272B2 (en) * 2005-01-11 2010-07-06 Research In Motion Limited System and method for filter content pushed to client device
US10284806B2 (en) * 2017-01-04 2019-05-07 International Business Machines Corporation Barrage message processing

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956200A (en) * 2016-06-24 2016-09-21 武汉斗鱼网络科技有限公司 Filtration and conversion-based popup screen interception method and apparatus
CN106210770A (en) * 2016-07-11 2016-12-07 北京小米移动软件有限公司 A kind of method and apparatus showing barrage information
CN106897422A (en) * 2017-02-23 2017-06-27 百度在线网络技术(北京)有限公司 Text handling method, device and server
CN108650546A (en) * 2018-05-11 2018-10-12 武汉斗鱼网络科技有限公司 Barrage processing method, computer readable storage medium and electronic equipment
CN108763348A (en) * 2018-05-15 2018-11-06 南京邮电大学 A kind of classification improved method of extension short text word feature vector
CN108846431A (en) * 2018-06-05 2018-11-20 成都信息工程大学 Based on the video barrage sensibility classification method for improving Bayesian model
CN111061866A (en) * 2019-08-20 2020-04-24 河北工程大学 Bullet screen text clustering method based on feature extension and T-oBTM
CN110517121A (en) * 2019-09-23 2019-11-29 重庆邮电大学 Method of Commodity Recommendation and the device for recommending the commodity based on comment text sentiment analysis
CN111163359A (en) * 2019-12-31 2020-05-15 腾讯科技(深圳)有限公司 Bullet screen generation method and device and computer readable storage medium
CN111614986A (en) * 2020-04-03 2020-09-01 威比网络科技(上海)有限公司 Bullet screen generation method, system, equipment and storage medium based on online education
CN111625718A (en) * 2020-05-19 2020-09-04 辽宁工程技术大学 User portrait construction method based on user search keyword data
CN111930943A (en) * 2020-08-12 2020-11-13 中国科学技术大学 Method and device for detecting pivot bullet screen

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BoosTexter: A Boosting-based System for Text Categorization; Schapire R E; Machine Learning; 2000-12-31; pp. 135-168 *
Research on health question classification based on keyword word-vector feature expansion; 唐晓波 et al.; Data Analysis and Knowledge Discovery; 2020-07-25 (No. 07); pp. 66-75 *
Automatic construction of a spam bullet-screen blocking dictionary based on seed words and datasets; 汪舸 et al.; Computer Engineering & Science; 2020-07-15 (No. 07); pp. 1302-1308 *

Also Published As

Publication number Publication date
CN112507164A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
Parikh et al. Media-rich fake news detection: A survey
US9449271B2 (en) Classifying resources using a deep network
CN105183833B (en) Microblog text recommendation method and device based on user model
CN111914085B (en) Text fine granularity emotion classification method, system, device and storage medium
CN113055386B (en) Method and device for identifying and analyzing attack organization
CN110597962B (en) Search result display method and device, medium and electronic equipment
CN110377900A (en) Checking method, device, computer equipment and the storage medium of Web content publication
WO2024036840A1 (en) Open-domain dialogue reply method and system based on topic enhancement
CN110287314B (en) Long text reliability assessment method and system based on unsupervised clustering
CN112507711A (en) Text abstract extraction method and system
Wang et al. Semi-supervised recursive autoencoders for social review spam detection
CN110222172A (en) A kind of multi-source network public sentiment Topics Crawling method based on improvement hierarchical clustering
Yi et al. Method of profanity detection using word embedding and LSTM
CN108595466B (en) Internet information filtering and internet user information and network card structure analysis method
CN111291551A (en) Text processing method and device, electronic equipment and computer readable storage medium
Chen et al. Sentiment analysis of animated film reviews using intelligent machine learning
CN113569118A (en) Self-media pushing method and device, computer equipment and storage medium
Chen et al. Identifying Cantonese rumors with discriminative feature integration in online social networks
CN112507164B (en) Bullet screen filtering method and device based on content and user identification and storage medium
CN111985223A (en) Emotion calculation method based on combination of long and short memory networks and emotion dictionaries
Yang [Retracted] Application of English Vocabulary Presentation Based on Clustering in College English Teaching
Mahalakshmi et al. Twitter sentiment analysis using conditional generative adversarial network
CN112507115B (en) Method and device for classifying emotion words in barrage text and storage medium
CN115391522A (en) Text topic modeling method and system based on social platform metadata
CN113076424A (en) Data enhancement method and system for unbalanced text classified data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant