CN107633077B - System and method for cleaning social media text data by multiple strategies - Google Patents
- Publication number: CN107633077B
- Application number: CN201710873539.1A
- Authority: CN (China)
- Prior art keywords: text, texts, social media, marketing, users
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a system and a method for cleaning social media text data by multiple strategies. The method comprises the following steps: calculating the similarity of social media texts; identifying marketing texts based on the characteristics of online marketing texts and an SVM classifier, and recording the users who publish the marketing texts; and, based on the first two steps, recording a blacklist of users who publish marketing texts and repeated texts. The beneficial technical effects of the invention are as follows: social media data cleaning is achieved within a single scheme, and different types of junk text are filtered step by step using multiple strategies. Compared with methods that perform only text spam filtering or only spam user identification, the invention has better applicability and broader application prospects.
Description
Technical Field
The invention relates to a system and a method for cleaning social media text data by multiple strategies, belonging to the field of data mining technology.
Background
At present, social media has become the most popular network communication platform. Anyone can publish text at any time from a computer, mobile phone or other device, and that text can spread across the entire internet. Social media is one of the important platforms for publishing public opinion information, and it is characterized by fast updates and a high degree of freedom, so more and more marketing advertisements are spread through it. This not only seriously disturbs normal browsing by users, but also hinders public opinion analysis and control by the relevant institutions. Junk text cannot be filtered and blocked by relying only on the supervision functions of existing social media platforms.
Current spam filtering work usually targets a specific application scenario, such as spam filtering or spam web page detection. On a social media platform, spam text includes advertising, pornography, violence and other unhealthy information. Existing social media data cleaning focuses on the analysis and monitoring of spam users (e.g., zombie fans), the identification of spam comments, and the filtering of spam content. The main approach is to treat the problem as classification of junk content: features of junk text are selected and a classifier is trained with a machine learning model, commonly naive Bayes, AdaBoost, a decision tree or a support vector machine. Besides marketing advertisements, the junk data on a social platform also includes repeated texts and similar texts. The prior art generally adopts only one strategy for cleaning and therefore cannot meet users' needs.
Disclosure of Invention
The invention aims to provide a system and a method for cleaning social media text data by multiple strategies, which specifically analyze the characteristics of advertising and marketing junk text, accurately identify marketing users and marketing advertisement texts, filter them out, and thereby remedy the deficiencies of the prior art.
The invention is realized by adopting the following technical scheme:
A system for multi-strategy cleaning of social media text data, the system comprising:
a similar text recognition module: the module segments web text into words, removes stop words, constructs a word set S for the text, performs feature selection on the word set S to form a vector D of weighted words, and maps each text into a 64-bit fingerprint code G; it then calculates the similarity of the fingerprint codes G of different texts by using cosine distance, determines the texts to be repeated texts if the similarity is greater than a threshold value, records the publishing users of the repeated web texts, and stores those users in a blacklist;
a marketing text recognition module:
the module introduces a machine learning classifier, summarizes the marketing characteristics of common web texts, and recognizes marketing texts by means of an SVM classifier, wherein the features selected for the SVM classifier comprise content features and external features;
a junk user identification module: on the basis of the similar text recognition module and the marketing text recognition module, the module records the users who publish similar texts and marketing texts to form a user blacklist, counts the frequency with which the users in the blacklist publish similar texts and marketing texts, judges users with a high publishing frequency to be junk users, and filters all social media data published by the junk users.
Further, the vector D of weighted words is initialized, and a 64-dimensional vector V is initialized with every element set to 0. For each word in the word set S, a hash function is applied to the word to obtain a 64-bit signature f, and each bit of f is traversed: if the i-th bit is 0, the weight D[word] of the word is subtracted from the i-th dimension of V. After all words in S have been processed, the text is mapped into a 64-dimensional vector g. If the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint is set to 1; otherwise it is set to 0.
Further, the content features include:
digit ratio: the proportion of digits in the social media text relative to the total text length;
symbol length: the total length of emoticons and punctuation marks in the text;
number of hyperlinks: the number of hyperlinks contained in the text;
the external features include:
noun and verb length: the summed length of the nouns and verbs that remain after the text has been segmented and stop words removed;
text length: the total length of the original social media text;
number of forwards: the number of times the social media text is forwarded;
number of comments: the number of times the social media text is commented on;
number of likes: the number of times the social media text is liked;
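The feature list above can be illustrated with a short Python sketch. It is only one possible reading of the features: the `Post` record, the `extract_features` function, the URL pattern, and the decision to exclude hyperlink characters from the symbol count are all assumptions made for illustration, and the noun and verb tokens are assumed to come from an external word segmenter with part-of-speech tagging.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for hyperlinks; the source does not define one.
URL_RE = re.compile(r"https?://\S+")


@dataclass
class Post:
    """Hypothetical record for one social media text and its counters."""
    text: str
    forwards: int = 0
    comments: int = 0
    likes: int = 0


def extract_features(post, noun_verb_tokens=()):
    """Build the content and external features for one post.

    noun_verb_tokens: nouns and verbs left after segmentation and
    stop-word removal (assumed to be produced by an external tool).
    """
    text = post.text
    total = len(text)
    digits = sum(ch.isdigit() for ch in text)
    urls = URL_RE.findall(text)
    # Assumption: hyperlink characters are not counted as symbols.
    stripped = URL_RE.sub("", text)
    symbols = sum(1 for ch in stripped
                  if not ch.isalnum() and not ch.isspace())
    return {
        # content features
        "digit_ratio": digits / total if total else 0.0,
        "symbol_length": symbols,
        "hyperlink_count": len(urls),
        # external features
        "noun_verb_length": sum(len(t) for t in noun_verb_tokens),
        "text_length": total,
        "forwards": post.forwards,
        "comments": post.comments,
        "likes": post.likes,
    }
```

The resulting dictionary can be flattened into a numeric vector in a fixed key order before it is passed to a classifier.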
A method for multi-strategy cleaning of social media text data, characterized by comprising the following steps:
step A: calculating the similarity of social media texts based on the simhash algorithm, setting a threshold value to delete social media texts with a high degree of repetition, and recording the publishing users of the repeated texts;
step B: identifying marketing texts based on the characteristics of online marketing texts and an SVM classifier, and recording the users who publish the marketing texts;
step C: based on the previous two steps, recording a blacklist of users who publish marketing texts and repeated texts, counting the frequency with which the users in the blacklist publish junk texts, judging users with a higher frequency to be junk users, and deleting the social media data published by those users.
Further, step A includes the following substeps:
substep A1: segmenting the social media text into words, removing stop words, and constructing a word set S of the text;
substep A2: performing feature selection (tf-idf) on the word set S to form a vector D of weighted words;
substep A3: initializing a 64-dimensional vector V with every element set to 0, and processing each word in the word set S as follows: a hash function is applied to the word to obtain a 64-bit signature f, each bit of f is traversed, and if the i-th bit is 0, the weight D[word] of the word is subtracted from the i-th dimension of V. After all words in S have been processed, the text is mapped into a 64-dimensional vector g;
substep A4: if the i-th dimension of g is greater than 0, setting the i-th bit of the 64-bit fingerprint to 1, and otherwise setting it to 0, so that a piece of social media text is mapped into a 64-bit fingerprint code G;
substep A5: calculating the similarity of the fingerprint codes G of different texts by using cosine distance, determining the texts to be repeated texts if the similarity is greater than a threshold value, and recording the publishing users of the repeated texts in a blacklist.
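Substeps A3 to A5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text above states only the subtraction case for a 0 bit, so the sketch assumes the usual simhash counterpart of adding the weight when the bit is 1. The MD5-derived hash, the function names, and the 0.9 threshold are likewise illustrative choices that do not come from the source.

```python
import hashlib
import math


def hash64(word):
    """64-bit signature f of a word. MD5 is an illustrative choice;
    the source only says 'a hash function'."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash_vector(weights):
    """Substep A3: accumulate the 64-dimensional vector g from D."""
    v = [0.0] * 64
    for word, w in weights.items():
        f = hash64(word)
        for i in range(64):
            # Bit i is counted from the left (most significant bit).
            if (f >> (63 - i)) & 1:
                v[i] += w  # assumed counterpart, not stated in the source
            else:
                v[i] -= w  # case stated in the source
    return v


def fingerprint(weights):
    """Substep A4: bit i of G is 1 if g[i] > 0, otherwise 0."""
    return [1 if x > 0 else 0 for x in simhash_vector(weights)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_repeated(weights_a, weights_b, threshold=0.9):
    """Substep A5: texts whose fingerprints exceed the threshold are repeats.
    The threshold value is an assumption."""
    sim = cosine_similarity(fingerprint(weights_a), fingerprint(weights_b))
    return sim > threshold
```

Here `weights` plays the role of D: a mapping from each word in S to its weight, for example `{"discount": 2.0, "buy": 1.0}`.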
Furthermore, the marketing text recognition module uses an SVM model to recognize and classify spam text. The selected features comprise content features and external features of marketing text, such as third-party contact details and character features. Recognized marketing texts are stored in a spam text corpus so that the training samples of the model are continuously expanded, and the publishing users of the marketing texts are recorded and added to the user blacklist.
Further, step C includes: counting how often the users in the user blacklist publish repeated texts and marketing texts, and judging users with a high frequency to be junk users; and, for texts that are neither repeated texts nor marketing texts, checking the publishing user to determine whether that user is a junk user, and filtering all social media data published by junk users.
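Step C can be sketched as a frequency count over the blacklist. The record layout (`id` and `user` keys), the label strings, and the `min_count` cut-off are hypothetical; the source says only that users with a "high" frequency are judged to be junk users and gives no numeric threshold.

```python
from collections import Counter


def find_junk_users(posts, flagged, min_count=3):
    """Count flagged posts per user and return users at or above min_count.

    posts:   list of dicts with hypothetical keys 'id' and 'user'
    flagged: dict mapping post id -> 'repeat' or 'marketing'
    min_count is an assumed stand-in for 'high frequency'.
    """
    counts = Counter(p["user"] for p in posts if p["id"] in flagged)
    return {user for user, n in counts.items() if n >= min_count}


def clean(posts, flagged, junk_users):
    """Drop every post by a junk user, plus every flagged post.

    Dropping flagged posts from users who are not junk users is an
    assumption about how the earlier steps filter their results.
    """
    return [
        p for p in posts
        if p["user"] not in junk_users and p["id"] not in flagged
    ]
```

Posts by a junk user are removed even when the individual post was not flagged, which matches the statement that all social media data published by junk users is filtered.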
The beneficial technical effects of the invention are as follows: the social media data cleaning is realized by one means, and different types of junk texts are filtered step by step in a multi-strategy mode. Compared with a single text spam filtering and spam user identification method, the method has better applicability and wider application prospect.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
FIG. 2 is a detailed flow chart of similar text recognition.
Detailed Description
The present invention is mainly directed at cleaning junk social media text data. The following description of the embodiments is intended to help the public understand the invention, but the specific embodiments given by the applicant should not be construed as limiting the technical solution of the invention. Any change to the definition of components or technical features, or to the form of the overall structure, that is not an essential change falls within the protection scope defined by the technical solution of the invention.
As shown in FIG. 1, similarity comparison is first performed on web texts to filter out repeated texts with high similarity. The similarity comparison is based on the simhash algorithm, with Hamming distance replaced by cosine distance; although this increases the computational cost, it improves the efficiency of feature comparison.
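To make the difference between the two measures concrete, the sketch below computes both on fingerprints stored as Python integers. The function names and the integer representation are illustrative assumptions; the source does not specify how fingerprints are stored or compared.

```python
import math


def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return popcount(a ^ b)


def cosine_similarity_bits(a, b):
    """Cosine similarity of two fingerprints viewed as 0/1 vectors.

    The dot product is the number of shared 1 bits, and each norm is
    the square root of the fingerprint's own 1-bit count.
    """
    ones_a, ones_b = popcount(a), popcount(b)
    if ones_a == 0 or ones_b == 0:
        return 0.0
    return popcount(a & b) / math.sqrt(ones_a * ones_b)
```

For example, `hamming_distance(0b1100, 0b1010)` is 2 and `cosine_similarity_bits(0b1100, 0b1010)` is 0.5. On 0/1 vectors the cosine measure rewards only positions where both bits are 1, whereas Hamming distance treats matching 0 bits and matching 1 bits alike.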
Second, marketing text recognition is performed on the social media data. The marketing text recognition part analyzes the characteristics of common marketing network texts and utilizes an SVM classifier for training and testing. Meanwhile, the identified marketing text data is used iteratively, so that the adaptability of the classifier is enhanced.
Finally, on the basis of the similar text recognition module and the marketing text recognition module, the junk user identification module builds a user blacklist of the users who publish repeated texts and marketing texts. The frequency with which the users in the blacklist publish junk texts is analyzed statistically, users with a high frequency are judged to be junk users, and all social media data published by those users is filtered out to complete the cleaning.
Compared with the existing junk web text cleaning method, the invention designs a method for filtering junk texts by various strategies from multiple angles, and specifically comprises text similarity comparison, marketing text recognition and junk user recognition. Compared with single text spam filtering and spam user identification, the method has better applicability and wider application prospect.
The recognition of similar web text is shown in FIG. 2:
First, the text is segmented into words and common stop words are removed to obtain the word set S of the text.
Second, feature selection (tf-idf) is performed on S to form a vector D of weighted words. If feature selection is not applied, D is a vector of words whose weights are all 1. A 64-dimensional vector V is initialized with every element set to 0.
Then, each word in the word set S is processed as follows: a hash function is applied to the word to obtain a 64-bit signature f, each bit of f is traversed, and if the i-th bit is 0, the weight D[word] of the word is subtracted from the i-th dimension of V. After all words in S have been processed, the text is mapped into a 64-dimensional vector g.
If the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint, counted from the left, is set to 1; otherwise it is set to 0. In this way a piece of text is finally mapped into a 64-bit fingerprint code G.
The similarity of the fingerprint codes G of different texts is calculated by using cosine distance, and texts whose similarity is greater than the threshold value are judged to be similar texts.
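The weighting in the second step could be computed along the following lines. The source names tf-idf but gives no formula, so the smoothed idf used here, the function names, and the representation of the corpus as a list of word sets are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus):
    """Return D as a dict mapping each word of one text to a tf-idf weight.

    doc_tokens: words of the text after segmentation and stop-word removal
    corpus:     list of word sets, one per text in the collection
    The smoothed idf below is one common variant, chosen as an assumption.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in tf.items():
        # Document frequency: number of texts that contain the word.
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[word] = (count / len(doc_tokens)) * idf
    return weights


def uniform_weights(doc_tokens):
    """Fallback when feature selection is not applied: every weight is 1."""
    return {word: 1.0 for word in set(doc_tokens)}
```

Either dictionary can serve as the weight lookup D[word] when the 64-dimensional vector is accumulated.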
The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof, and it is therefore intended that all such changes and modifications as fall within the true spirit and scope of the invention be considered as within the following claims.
Claims (7)
1. A system for multi-strategy cleaning of social media text data, the system comprising: a similar text recognition module, which segments web text into words, removes stop words, constructs a word set S of a text, performs feature selection on the word set S to form a vector D of weighted words, maps each text into a 64-bit fingerprint code G, calculates the similarity of the fingerprint codes G of different texts by using cosine distance, determines the texts to be repeated texts if the similarity is greater than a threshold value, records the publishing users of the repeated web texts, and stores the publishing users in a blacklist; a marketing text recognition module, which introduces a machine learning classifier, summarizes the marketing characteristics of common web texts, and recognizes marketing texts by means of an SVM classifier, wherein the features selected for the SVM classifier comprise content features and external features; and a junk user identification module, which, on the basis of the similar text recognition module and the marketing text recognition module, records the users who publish similar texts and marketing texts to form a user blacklist, counts the frequency with which the users in the blacklist publish similar texts and marketing texts, judges users with a high publishing frequency to be junk users, and filters all social media data published by the junk users.
2. The system for multi-strategy cleaning of social media text data according to claim 1, wherein the vector D of weighted words is initialized; a 64-dimensional vector V is initialized with every element set to 0; for each word in the word set S, a hash function is applied to the word to obtain a 64-bit signature f, each bit of f is traversed, and if the i-th bit is 0, the weight D[word] of the word is subtracted from the i-th dimension of V; after all words in S have been processed, the text is mapped into a 64-dimensional vector g; and if the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint is set to 1, otherwise it is set to 0.
3. The system for multi-strategy cleaning of social media text data according to claim 1, wherein the content features comprise: digit ratio: the proportion of digits in the social media text relative to the total text length; symbol length: the total length of emoticons and punctuation marks in the text; number of hyperlinks: the number of hyperlinks contained in the text; and the external features comprise: noun and verb length: the summed length of the nouns and verbs that remain after the text has been segmented and stop words removed; text length: the total length of the original social media text; number of forwards: the number of times the social media text is forwarded; number of comments: the number of times the social media text is commented on; number of likes: the number of times the social media text is liked.
4. A method for multi-strategy cleaning of social media text data, characterized by comprising the following steps: step A: calculating the similarity of social media texts based on the simhash algorithm, setting a threshold value to delete social media texts with a high degree of repetition, and recording the publishing users of the repeated texts; step B: identifying marketing texts based on the characteristics of online marketing texts and an SVM classifier, and recording the users who publish the marketing texts; step C: based on the previous two steps, recording a blacklist of users who publish marketing texts and repeated texts, counting the frequency with which the users in the blacklist publish junk texts, judging users with a higher frequency to be junk users, and deleting the social media data published by those users.
5. The method for multi-strategy cleaning of social media text data according to claim 4, wherein: substep A1: segmenting the social media text into words, removing stop words, and constructing a word set S of the text; substep A2: performing feature selection on the word set S to form a vector D of weighted words; substep A3: initializing a 64-dimensional vector V with every element set to 0, and processing each word in the word set S as follows: a hash function is applied to the word to obtain a 64-bit signature f, each bit of f is traversed, and if the i-th bit is 0, the weight D[word] of the word is subtracted from the i-th dimension of V; after all words in S have been processed, the text is mapped into a 64-dimensional vector g; substep A4: if the i-th dimension of g is greater than 0, setting the i-th bit of the 64-bit fingerprint to 1, and otherwise setting it to 0, so that a piece of social media text is mapped into a 64-bit fingerprint code G; substep A5: calculating the similarity of the fingerprint codes G of different texts by using cosine distance, determining the texts to be repeated texts if the similarity is greater than a threshold value, and recording the publishing users of the repeated texts in a blacklist.
6. The method for multi-strategy cleaning of social media text data according to claim 4, wherein the marketing text recognition module uses an SVM model to recognize and classify spam text, the selected features comprise content features and external features of marketing text, the recognized marketing texts are stored in a spam text corpus so that the training samples of the model are continuously expanded, and the publishing users of the marketing texts are recorded and added to a user blacklist.
7. The method for multi-strategy cleaning of social media text data according to claim 4, wherein step C comprises: counting how often the users in the user blacklist publish repeated texts and marketing texts, and judging users with a high frequency to be junk users; and, for texts that are neither repeated texts nor marketing texts, checking the publishing user to determine whether that user is a junk user, and filtering all social media data published by junk users.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710873539.1A CN107633077B (en) | 2017-09-25 | 2017-09-25 | System and method for cleaning social media text data by multiple strategies |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710873539.1A CN107633077B (en) | 2017-09-25 | 2017-09-25 | System and method for cleaning social media text data by multiple strategies |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107633077A (en) | 2018-01-26 |
CN107633077B (en) | 2020-12-18 |
Family ID: 61103475
Family Applications (1)
Application Number | Status | Publication | Title | Priority Date | Filing Date |
---|---|---|---|---|---|
CN201710873539.1A | Active | CN107633077B (en) | System and method for cleaning social media text data by multiple strategies | 2017-09-25 | 2017-09-25 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107633077B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108650546B (en) * | 2018-05-11 | 2021-07-23 | 武汉斗鱼网络科技有限公司 | Barrage processing method, computer-readable storage medium and electronic device |
CN108959484B (en) * | 2018-06-21 | 2020-07-28 | 中国人民解放军战略支援部队信息工程大学 | Multi-strategy media data stream filtering method and device for event detection |
CN110008282A (en) * | 2019-03-12 | 2019-07-12 | 平安信托有限责任公司 | Transaction data synchronization interconnection method, device, computer equipment and storage medium |
CN110516066B (en) * | 2019-07-23 | 2022-04-15 | 同盾控股有限公司 | Text content safety protection method and device |
CN111198992A (en) * | 2020-01-07 | 2020-05-26 | 精硕科技(北京)股份有限公司 | Identification method and identification device for mother and infant crowd, electronic equipment and storage medium |
CN112699949B (en) * | 2021-01-05 | 2023-05-26 | 百威投资(中国)有限公司 | Potential user identification method and device based on social platform data |
CN116932526B (en) * | 2023-09-19 | 2023-11-24 | 天泽智慧科技(成都)有限公司 | Text deduplication method for open source information |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160057159A1 (en) * | 2014-08-22 | 2016-02-25 | Syracuse University | Semantics-aware android malware classification |
US9928556B2 (en) * | 2014-12-31 | 2018-03-27 | Facebook, Inc. | Content quality evaluation and classification |
CN105956184B (en) * | 2016-06-01 | 2017-05-31 | 西安交通大学 | Collaborative and organized junk information issue the recognition methods of group in a kind of microblogging community network |
KR101773911B1 (en) * | 2016-09-27 | 2017-09-01 | 주식회사 케이앤컴퍼니 | Apparatus for estimating market price of real estate using market price and officially assessed price and method thereof |
Also Published As
Publication number | Publication date |
---|---|
CN107633077A (en) | 2018-01-26 |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |