CN107633077B - System and method for cleaning social media text data by multiple strategies - Google Patents


Info

Publication number
CN107633077B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710873539.1A
Other languages
Chinese (zh)
Other versions
CN107633077A (en)
Inventor
薛涵凛
王颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Andlinks Data Technology Co ltd
Original Assignee
Nanjing Andlinks Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Andlinks Data Technology Co ltd
Priority to CN201710873539.1A
Publication of CN107633077A
Application granted
Publication of CN107633077B
Legal status: Active

Abstract

The invention discloses a system and a method for cleaning social media text data by multiple strategies. The method comprises the following steps: calculating the similarity of social media texts; identifying marketing texts based on the characteristics of online marketing texts and an SVM classifier, and recording the users who publish the marketing texts; and, based on the first two steps, recording a blacklist of users who publish marketing texts and repeated texts. The beneficial technical effects of the invention are as follows: social media data cleaning is realized in one scheme that filters different types of junk text step by step using multiple strategies. Compared with methods that perform only junk text filtering or only junk user identification, the method has better applicability and wider application prospects.

Description

System and method for cleaning social media text data by multiple strategies
Technical Field
The invention relates to a system and a method for cleaning social media text data by multiple strategies, belonging to the field of data mining technology.
Background
At present, social media has become the most popular network communication platform. Anyone can post at any time from a computer, mobile phone, or other device, and the post can spread across the entire internet. Social media is one of the important platforms for publishing public opinion information, and because it is updated quickly and offers a high degree of freedom, more and more marketing advertisements are spread through it. This not only seriously disrupts users' normal browsing but also hinders public opinion analysis and control by the relevant institutions. Filtering and blocking junk text cannot be achieved by relying only on the supervision functions of existing social media platforms.
Current junk data filtering work is usually aimed at a specific application scenario, such as spam e-mail filtering or spam web page detection. On a social media platform, junk text includes advertising, pornography, violence, and other unhealthy information. Existing social media data cleaning focuses on the analysis and monitoring of junk users (e.g., zombie followers), the identification of junk comments, and the filtering of junk content. These methods mainly classify junk content: features of junk text are selected and a classifier is trained with a machine learning model, commonly naive Bayes, AdaBoost, a decision tree, or a support vector machine. Besides marketing advertisements, junk data on a social platform also includes repeated and near-duplicate texts, yet the prior art generally adopts only one strategy for cleaning and therefore cannot meet practical needs.
Disclosure of Invention
The invention aims to provide a system and a method for cleaning social media text data by multiple strategies, which specifically analyze the characteristics of junk advertising and marketing text, accurately identify marketing users and marketing advertisement texts, and filter them, thereby overcoming the shortcomings of the prior art.
The invention is realized by adopting the following technical scheme:
A system for cleaning social media text data by multiple strategies, the system comprising:
a similar text recognition module: this module segments the web text into words, removes stop words, and constructs a word set S of the text; performs feature selection on the word set S to form a vector D of weighted words; maps each text to a 64-bit fingerprint code G; calculates the similarity between the fingerprint codes G of different texts using cosine distance; judges texts whose similarity is greater than a threshold to be repeated texts; and records the publishing users of the repeated texts in a blacklist;
a marketing text recognition module: this module introduces a machine learning classifier, summarizes the marketing characteristics of common web texts, and recognizes marketing texts by means of an SVM classifier, wherein the features selected for the SVM classifier comprise content features and external features;
a junk user identification module: on the basis of the similar text recognition module and the marketing text recognition module, this module records the users who publish similar texts and marketing texts to form a user blacklist, counts how frequently the users in the blacklist publish similar texts and marketing texts, judges users with a high publishing frequency to be junk users, and filters all social media data published by the junk users.
Further, after the vector D of weighted words is formed, a 64-dimensional vector V is initialized with every element set to 0. For each word in the word set S, a hash function is applied to the word to obtain a 64-bit signature f, and each bit of f is traversed: if the i-th bit of f is 1, the weight D[word] of the word is added to the i-th dimension of V; if the i-th bit is 0, D[word] is subtracted from the i-th dimension of V. After all words in S have been processed, the text is mapped to a 64-dimensional vector g. If the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint is set to 1; otherwise it is set to 0.
Further, the content features include:
digit ratio: the proportion of digits in the total length of the social media text;
symbol length: the length of the emoticons and punctuation marks in the text;
number of hyperlinks: the number of hyperlinks contained in the text;
the external features include:
noun and verb length: the summed length of the nouns and verbs remaining after the text is segmented and stop words are removed;
text length: the total length of the original social media text;
number of forwards: the number of times the social media text is forwarded;
number of comments: the number of times the social media text is commented on;
number of likes: the number of times the social media text is liked.
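The three content features listed above could be computed roughly as in the following sketch. The function name `content_features`, the two regular expressions, and the decision to exclude hyperlink characters from the symbol count are assumptions made for illustration; the patent does not specify how symbols or hyperlinks are detected.

```python
import re

# Assumption: a hyperlink is any http:// or https:// token up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")
# Assumption: a "symbol" is any character that is neither a word character
# nor whitespace (punctuation marks, emoji, and similar).
SYMBOL_RE = re.compile(r"[^\w\s]")


def content_features(text):
    """Return the content features of one social media text as a dict."""
    urls = URL_RE.findall(text)
    # Strip hyperlinks before counting symbols so that "://" and "." inside
    # a URL are not counted as punctuation (an illustrative choice).
    body = URL_RE.sub("", text)
    digit_count = sum(ch.isdigit() for ch in text)
    return {
        "digit_ratio": digit_count / len(text) if text else 0.0,
        "symbol_length": len(SYMBOL_RE.findall(body)),
        "hyperlink_count": len(urls),
    }


features = content_features("Sale!!! Call 13800000000 now http://t.cn/abc")
print(features)
```

The external features (forward, comment, and like counts) would come directly from the platform's metadata for each post rather than from the text itself.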
A method for cleaning social media text data by multiple strategies, characterized by comprising the following steps:
step A: calculating the similarity of the social media texts based on the simhash algorithm, setting a threshold to delete social media texts with a high degree of repetition, and recording the publishing users of the repeated texts;
step B: identifying marketing texts based on the characteristics of online marketing texts and an SVM classifier, and recording the users who publish the marketing texts;
step C: based on the previous two steps, recording a blacklist of users who publish marketing texts and repeated texts, counting how frequently the users in the blacklist publish junk texts, judging users with a higher frequency to be junk users, and deleting the social media data published by those users.
Further, step A comprises:
substep A1: segmenting the social media text into words, removing stop words, and constructing a word set S of the text;
substep A2: performing feature selection (tf-idf) on the word set S to form a vector D of weighted words;
substep A3: initializing a 64-dimensional vector V with every element set to 0. Each word in the word set S is processed as follows: a hash function is applied to the word to obtain a 64-bit signature f, and each bit of f is traversed; if the i-th bit of f is 1, the weight D[word] of the word is added to the i-th dimension of V, and if the i-th bit is 0, D[word] is subtracted from the i-th dimension of V. After all words in S have been processed, the text is mapped to a 64-dimensional vector g;
substep A4: if the i-th dimension of g is greater than 0, setting the i-th bit of the 64-bit fingerprint to 1, otherwise setting it to 0, so that a piece of social media text is mapped to a 64-bit fingerprint code G;
substep A5: calculating the similarity between the fingerprint codes G of different texts using cosine distance; if the similarity is greater than a threshold, judging the texts to be repeated texts and recording the publishing users of the repeated texts in a blacklist.
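Substeps A3 to A5 can be sketched as follows. This is a minimal illustration under several assumptions: the 64-bit signature is taken from the first 8 bytes of an MD5 digest (the patent only says "a hash function"), the weight is added when a signature bit is 1 as in the standard simhash algorithm, the fingerprint is compared as a 0/1 vector, and the threshold value 0.9 is hypothetical.

```python
import hashlib
import math


def hash64(word):
    """64-bit signature f of a word (assumption: first 8 bytes of MD5)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash_fingerprint(weights):
    """Map a dict {word: weight} (the vector D) to a list of 64 bits (G)."""
    v = [0.0] * 64  # the vector V, one accumulator per bit position
    for word, weight in weights.items():
        f = hash64(word)
        for i in range(64):
            # Bit i is counted from the left (most significant bit first).
            if (f >> (63 - i)) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    # Positive dimensions become 1, all others 0.
    return [1 if x > 0 else 0 for x in v]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


THRESHOLD = 0.9  # hypothetical value; the patent does not give one

g1 = simhash_fingerprint({"discount": 2.0, "buy": 1.5, "now": 0.5})
g2 = simhash_fingerprint({"discount": 2.0, "buy": 1.5, "today": 0.5})
is_repeat = cosine_similarity(g1, g2) > THRESHOLD
print(is_repeat)
```

Because the two example texts share their heaviest words, most fingerprint bits agree; whether the pair crosses the threshold depends on the hash values and on the threshold chosen.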
Furthermore, in step B, an SVM model is adopted for junk text recognition and classification. The selected features comprise content features and external features of marketing text, such as third-party contact details and character features. Recognized marketing texts are stored in a junk text corpus so that the training samples of the model are continuously expanded, and the publishing users of the marketing texts are recorded and added to the user blacklist.
Further, step C includes: counting how frequently the users in the user blacklist publish repeated texts and marketing texts, and judging users with a high frequency to be junk users; and, for texts that are neither repeated texts nor marketing texts, checking whether the publishing user is a junk user, and filtering all social media data published by junk users.
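Step C amounts to a frequency count over the blacklist followed by a filter on the publishing user. The sketch below assumes that flagged posts arrive as `(user_id, label)` pairs and that a user becomes a junk user after `min_count` flagged posts; both the data layout and the cut-off of 3 are hypothetical, since the patent only speaks of a "high frequency".

```python
from collections import Counter


def find_junk_users(flagged_posts, min_count=3):
    """Return the set of users with at least min_count flagged posts.

    flagged_posts: iterable of (user_id, label) pairs, where label is
    "repeat" or "marketing" (output of steps A and B).
    """
    counts = Counter(user for user, _label in flagged_posts)
    return {user for user, n in counts.items() if n >= min_count}


def clean(posts, junk_users):
    """Drop every post whose publisher is a junk user."""
    return [post for post in posts if post["user"] not in junk_users]


flagged = [
    ("u1", "marketing"),
    ("u1", "repeat"),
    ("u1", "marketing"),
    ("u2", "repeat"),
]
junk = find_junk_users(flagged)
posts = [
    {"user": "u1", "text": "ordinary-looking post"},
    {"user": "u3", "text": "hello"},
]
print(clean(posts, junk))
```

Note that the post by `u1` is removed even though it was not itself flagged, which matches the requirement to filter all data published by junk users.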
The beneficial technical effects of the invention are as follows: social media data cleaning is realized in one scheme that filters different types of junk text step by step using multiple strategies. Compared with methods that perform only junk text filtering or only junk user identification, the method has better applicability and wider application prospects.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a specific flow of similar text recognition.
Detailed Description
The present invention mainly aims at cleaning junk social media text data. The following description of the embodiments is intended to help the public understand the invention, but the specific embodiments given by the applicant should not be construed as limiting the technical solution of the invention; any change in the definition of components or technical features, or any change in the form of the overall structure that is not a substantive change, should be regarded as falling within the protection scope defined by the technical solution of the invention.
As shown in FIG. 1, similarity comparison is first performed on the web texts to filter out repeated texts with high similarity. The comparison is based on the simhash algorithm, with the Hamming distance replaced by cosine distance; although this increases the computational cost, it improves the efficiency of feature comparison.
Second, marketing text recognition is performed on the social media data. The marketing text recognition part analyzes the characteristics of common online marketing texts and uses an SVM classifier for training and testing. Meanwhile, the identified marketing text data is used iteratively, which enhances the adaptability of the classifier.
Finally, on the basis of the similar text recognition module and the marketing text recognition module, the junk user identification module establishes a user blacklist of the users who publish repeated texts and marketing texts. The frequency with which the users in the blacklist publish junk texts is statistically analyzed, users with a high frequency are judged to be junk users, and all social media data published by those users is filtered to complete the cleaning.
Compared with the existing junk web text cleaning method, the invention designs a method for filtering junk texts by various strategies from multiple angles, and specifically comprises text similarity comparison, marketing text recognition and junk user recognition. Compared with single text spam filtering and spam user identification, the method has better applicability and wider application prospect.
The recognition of similar web text is shown in FIG. 2:
First, the text is segmented into words and common stop words are removed to obtain the word set S of the text.
Second, feature selection (tf-idf) is performed on S to form a vector D of weighted words. If feature selection is not used, a vector D in which every word has weight 1 is formed. A 64-dimensional vector V is initialized with every element set to 0.
Then, each word in the word set S is processed as follows: a hash function is applied to the word to obtain a 64-bit signature f, and each bit of f is traversed; if the i-th bit is 1, the weight D[word] of the word is added to the i-th dimension of V, and if the i-th bit is 0, D[word] is subtracted from the i-th dimension of V. After all words in S have been processed, the text is mapped to a 64-dimensional vector g.
If the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint, counted from the left, is set to 1; otherwise it is set to 0. In this way a piece of text is finally mapped to a 64-bit fingerprint code G.
Finally, the similarity between the fingerprint codes G of different texts is calculated using cosine distance, and texts whose similarity is greater than the threshold are judged to be similar texts.
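The tf-idf weighting used to build the vector D in the second step above could be computed as in this sketch. The smoothed idf formula and the function name `tfidf_weights` are assumptions for illustration; the patent names tf-idf but does not give a formula.

```python
import math
from collections import Counter


def tfidf_weights(doc_words, corpus):
    """Return {word: weight} (the vector D) for one segmented text.

    doc_words: list of words of the text after stop-word removal.
    corpus: list of word lists, one per text in the collection.
    """
    term_counts = Counter(doc_words)
    n_docs = len(corpus)
    weights = {}
    for word, count in term_counts.items():
        # Number of texts in the collection that contain the word.
        doc_freq = sum(1 for doc in corpus if word in doc)
        # Assumption: smoothed idf, so that words occurring in every text
        # keep a positive weight instead of dropping to zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        weights[word] = (count / len(doc_words)) * idf
    return weights


def uniform_weights(doc_words):
    """Fallback when feature selection is not used: every weight is 1."""
    return {word: 1.0 for word in doc_words}


corpus = [["buy", "now"], ["hello", "now"]]
print(tfidf_weights(corpus[0], corpus))
```

Here "now" appears in both texts and so receives a lower weight than "buy", which means "buy" contributes more strongly to each dimension of V.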
The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof, and it is therefore intended that all such changes and modifications as fall within the true spirit and scope of the invention be considered as within the following claims.

Claims (7)

1. A system for cleaning social media text data by multiple strategies, the system comprising:
a similar text recognition module: this module segments the web text into words, removes stop words, and constructs a word set S of the text; performs feature selection on the word set S to form a vector D of weighted words; maps each text to a 64-bit fingerprint code G; calculates the similarity between the fingerprint codes G of different texts using cosine distance; judges texts whose similarity is greater than a threshold to be repeated texts; and records the publishing users of the repeated texts in a blacklist;
a marketing text recognition module: this module introduces a machine learning classifier, summarizes the marketing characteristics of common web texts, and recognizes marketing texts by means of an SVM classifier, wherein the features selected for the SVM classifier comprise content features and external features;
a junk user identification module: on the basis of the similar text recognition module and the marketing text recognition module, this module records the users who publish similar texts and marketing texts to form a user blacklist, counts how frequently the users in the blacklist publish similar texts and marketing texts, judges users with a high publishing frequency to be junk users, and filters all social media data published by the junk users.
2. The system for cleaning social media text data by multiple strategies according to claim 1, wherein, after the vector D of weighted words is formed, a 64-dimensional vector V is initialized with every element set to 0; for each word in the word set S, a hash function is applied to the word to obtain a 64-bit signature f, and each bit of f is traversed; if the i-th bit of f is 1, the weight D[word] of the word is added to the i-th dimension of V, and if the i-th bit is 0, D[word] is subtracted from the i-th dimension of V; after all words in S have been processed, the text is mapped to a 64-dimensional vector g; if the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint is set to 1, otherwise it is set to 0.
3. The system for cleaning social media text data by multiple strategies according to claim 1, wherein the content features comprise:
digit ratio: the proportion of digits in the total length of the social media text;
symbol length: the length of the emoticons and punctuation marks in the text;
number of hyperlinks: the number of hyperlinks contained in the text;
and the external features comprise:
noun and verb length: the summed length of the nouns and verbs remaining after the text is segmented and stop words are removed;
text length: the total length of the original social media text;
number of forwards: the number of times the social media text is forwarded;
number of comments: the number of times the social media text is commented on;
number of likes: the number of times the social media text is liked.
4. A method for cleaning social media text data by multiple strategies, characterized by comprising the following steps:
step A: calculating the similarity of the social media texts based on the simhash algorithm, setting a threshold to delete social media texts with a high degree of repetition, and recording the publishing users of the repeated texts;
step B: identifying marketing texts based on the characteristics of online marketing texts and an SVM classifier, and recording the users who publish the marketing texts;
step C: based on the previous two steps, recording a blacklist of users who publish marketing texts and repeated texts, counting how frequently the users in the blacklist publish junk texts, judging users with a higher frequency to be junk users, and deleting the social media data published by those users.
5. The method for cleaning social media text data by multiple strategies according to claim 4, wherein step A comprises:
substep A1: segmenting the social media text into words, removing stop words, and constructing a word set S of the text;
substep A2: performing feature selection on the word set S to form a vector D of weighted words;
substep A3: initializing a 64-dimensional vector V with every element set to 0; processing each word in the word set S as follows: applying a hash function to the word to obtain a 64-bit signature f, traversing each bit of f, adding the weight D[word] of the word to the i-th dimension of V if the i-th bit of f is 1, and subtracting D[word] from the i-th dimension of V if the i-th bit is 0; after all words in S have been processed, the text is mapped to a 64-dimensional vector g;
substep A4: if the i-th dimension of g is greater than 0, setting the i-th bit of the 64-bit fingerprint to 1, otherwise setting it to 0, so that a piece of social media text is mapped to a 64-bit fingerprint code G;
substep A5: calculating the similarity between the fingerprint codes G of different texts using cosine distance; if the similarity is greater than a threshold, judging the texts to be repeated texts and recording the publishing users of the repeated texts in a blacklist.
6. The method for cleaning social media text data by multiple strategies according to claim 4, wherein in step B an SVM model is adopted for junk text recognition and classification, the selected features comprise content features and external features of marketing text, the recognized marketing texts are stored in a junk text corpus so that the training samples of the model are continuously expanded, and the publishing users of the marketing texts are recorded and added to the user blacklist.
7. The method for cleaning social media text data by multiple strategies according to claim 4, wherein step C comprises: counting how frequently the users in the user blacklist publish repeated texts and marketing texts, and judging users with a high frequency to be junk users; and, for texts that are neither repeated texts nor marketing texts, checking whether the publishing user is a junk user, and filtering all social media data published by junk users.
CN201710873539.1A 2017-09-25 2017-09-25 System and method for cleaning social media text data by multiple strategies Active CN107633077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710873539.1A CN107633077B (en) 2017-09-25 2017-09-25 System and method for cleaning social media text data by multiple strategies

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710873539.1A CN107633077B (en) 2017-09-25 2017-09-25 System and method for cleaning social media text data by multiple strategies

Publications (2)

Publication Number Publication Date
CN107633077A (en) 2018-01-26
CN107633077B (en) 2020-12-18

Family

ID=61103475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710873539.1A Active CN107633077B (en) 2017-09-25 2017-09-25 System and method for cleaning social media text data by multiple strategies

Country Status (1)

Country Link
CN (1) CN107633077B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108650546B (en) * 2018-05-11 2021-07-23 武汉斗鱼网络科技有限公司 Barrage processing method, computer-readable storage medium and electronic device
CN108959484B (en) * 2018-06-21 2020-07-28 中国人民解放军战略支援部队信息工程大学 Multi-strategy media data stream filtering method and device for event detection
CN110008282A (en) * 2019-03-12 2019-07-12 平安信托有限责任公司 Transaction data synchronization interconnection method, device, computer equipment and storage medium
CN110516066B (en) * 2019-07-23 2022-04-15 同盾控股有限公司 Text content safety protection method and device
CN111198992A (en) * 2020-01-07 2020-05-26 精硕科技(北京)股份有限公司 Identification method and identification device for mother and infant crowd, electronic equipment and storage medium
CN112699949B (en) * 2021-01-05 2023-05-26 百威投资(中国)有限公司 Potential user identification method and device based on social platform data
CN116932526B (en) * 2023-09-19 2023-11-24 天泽智慧科技(成都)有限公司 Text deduplication method for open source information

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160057159A1 (en) * 2014-08-22 2016-02-25 Syracuse University Semantics-aware android malware classification
US9928556B2 (en) * 2014-12-31 2018-03-27 Facebook, Inc. Content quality evaluation and classification
CN105956184B (en) * 2016-06-01 2017-05-31 西安交通大学 Collaborative and organized junk information issue the recognition methods of group in a kind of microblogging community network
KR101773911B1 (en) * 2016-09-27 2017-09-01 주식회사 케이앤컴퍼니 Apparatus for estimating market price of real estate using market price and officially assessed price and method thereof

Also Published As

Publication number Publication date
CN107633077A (en) 2018-01-26

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant