CN107633077B - System and method for cleaning social media text data by multiple strategies

Info

Publication number: CN107633077B
Application number: CN201710873539.1A
Authority: CN
Inventors: 薛涵凛; 王颖
Original assignee: Nanjing Andlinks Data Technology Co ltd
Current assignee: Nanjing Andlinks Data Technology Co ltd
Priority date: 2017-09-25
Filing date: 2017-09-25
Publication date: 2020-12-18
Anticipated expiration: 2037-09-25
Also published as: CN107633077A

Abstract

The invention discloses a system and a method for cleaning social media text data by multiple strategies. The method comprises the following steps: calculating the similarity of social media texts; identifying marketing texts based on the characteristics of network marketing texts and an SVM classifier, and recording the users who publish them; and, based on the first two steps, recording a blacklist of users who publish "marketing texts" and "repeated texts". The beneficial technical effects of the invention are as follows: social media data cleaning does not rely on a single technique; instead, different types of junk text are filtered step by step in a multi-strategy manner. Compared with using text spam filtering or spam user identification alone, the method has better applicability and wider application prospects.

Description

System and method for cleaning social media text data by multiple strategies

Technical Field

The invention relates to a system and a method for cleaning social media text data by multiple strategies, belonging to the field of data mining technology.

Background

At present, social media has become the most popular network communication platform. Anyone can post at any time from a computer, mobile phone or other device, and the post can spread across the entire internet. Social media is one of the important platforms for publishing public opinion information, and because it is updated quickly and offers a high degree of freedom, more and more marketing advertisements are spread through it. This not only seriously disrupts normal browsing by users, but also hinders public opinion analysis and monitoring by the relevant agencies. Relying only on the moderation functions of existing social media platforms is not enough to filter and block junk text.

Current spam filtering work usually targets a specific application scenario, such as email spam filtering or spam web page detection. On a social media platform, spam text includes advertising, pornography, violence and other unhealthy information. Existing social media data cleaning focuses on the analysis and monitoring of spam users (e.g., zombie fans, i.e. fake follower accounts), the identification of spam comments, and the filtering of spam content. These methods mainly classify junk content by selecting features of junk text and training a classifier with a machine learning model; commonly used models include naive Bayes, AdaBoost, decision trees and support vector machines. Besides marketing advertisements, the junk data on social platforms also includes repeated and similar texts, yet the prior art generally applies only one strategy to the cleaning task and therefore cannot meet practical needs.

Disclosure of Invention

The invention aims to provide a system and a method for cleaning social media text data by multiple strategies, which specifically analyze the characteristics of advertising and marketing junk text, accurately identify marketing users and marketing advertisement texts, filter them out, and thereby overcome the shortcomings of the prior art.

The invention is realized by adopting the following technical scheme:

A system for multi-strategy cleaning of social media text data, the system comprising:

a similar text recognition module: this module segments the network text into words, removes stop words, and constructs a word set S of the text; performs feature selection on the word set S to form a vector D of weighted words; maps each text into a 64-bit fingerprint code G; and calculates the similarity between the fingerprint codes G of different texts using cosine distance. If the similarity is greater than a threshold value, the texts are determined to be repeated texts, and the users who published the repeated texts are recorded and stored in a blacklist;

a marketing text recognition module: this module introduces a machine learning classifier, summarizes the marketing characteristics of common network texts, and recognizes marketing texts by means of an SVM classifier, wherein the features selected for the SVM classifier comprise content features and external features;

a junk user identification module: building on the similar text recognition module and the marketing text recognition module, this module records the users who publish similar texts and marketing texts to form a user blacklist, counts how frequently each blacklisted user publishes similar texts and marketing texts, judges users with a high publishing frequency to be junk users, and filters all social media data published by the junk users.

Further, after the vector D of weighted words is formed, a 64-dimensional vector V is initialized with every element set to 0. Each word in the word set S is then processed as follows: a hash function is applied to the word to obtain a 64-bit signature f, and each bit of f is traversed; if the i-th bit of the word's signature is 0, the weight D[word] of the word is subtracted from the i-th dimension of V (and, as in the standard simhash algorithm, the weight is added when the i-th bit is 1). After all words in S have been processed, the text is mapped into a 64-dimensional vector g. If the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint is set to 1; otherwise it is set to 0.
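The fingerprint computation described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the patent does not name the hash function, so a truncated MD5 digest is assumed here, and the function names `hash64` and `simhash_fingerprint` are hypothetical. The sketch also assumes the standard simhash rule of adding the weight when a signature bit is 1, and counts bit positions from the left.

```python
import hashlib
from typing import Dict


def hash64(word: str) -> int:
    """64-bit signature f of a word (assumption: first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")


def simhash_fingerprint(weights: Dict[str, float]) -> int:
    """Map a weighted word vector D (word -> weight) to a 64-bit fingerprint G."""
    v = [0.0] * 64  # the 64-dimensional vector V, initialised to 0
    for word, weight in weights.items():
        f = hash64(word)
        for i in range(64):
            # Bit i is counted from the left (most significant bit first).
            if (f >> (63 - i)) & 1:
                v[i] += weight  # assumed: standard simhash adds on a 1 bit
            else:
                v[i] -= weight  # stated in the text: subtract on a 0 bit
    fingerprint = 0
    for i in range(64):
        if v[i] > 0:  # positive dimension -> fingerprint bit 1, otherwise 0
            fingerprint |= 1 << (63 - i)
    return fingerprint
```

With a single word of positive weight, the fingerprint is simply that word's own signature, which gives a quick sanity check of the bit ordering.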

Further, the content features include:

digit ratio: the proportion of digits in the social media text relative to the total length of the text;

symbol length: the length of the emoticons and punctuation marks in the text;

number of hyperlinks: the number of hyperlinks contained in the text;

the external features include:

noun and verb length: the summed length of the nouns and verbs remaining after the text is segmented and stop words are removed;

text length: the total length of the original social media text;

number of forwards: the number of times the social media text is forwarded;

number of comments: the number of times the social media text is commented on;

number of likes: the number of times the social media text is liked.
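The features listed above could be assembled into a numeric vector along the following lines. This is a sketch under stated assumptions: the `Post` record and its field names are hypothetical, the nouns and verbs are assumed to come from an external word segmenter and part-of-speech tagger that is not shown, symbols are approximated by Unicode punctuation and symbol categories, and hyperlinks are matched with a simple `http(s)://` pattern.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import List

URL_RE = re.compile(r"https?://\S+")  # assumed hyperlink pattern


@dataclass
class Post:
    text: str
    noun_verb_tokens: List[str]  # assumed output of an external segmenter/tagger
    forwards: int
    comments: int
    likes: int


def symbol_length(text: str) -> int:
    """Count characters in Unicode punctuation (P*) and symbol (S*) categories."""
    return sum(1 for ch in text if unicodedata.category(ch)[0] in ("P", "S"))


def feature_vector(post: Post) -> List[float]:
    """Build the eight features in the order they are listed in the text."""
    n = len(post.text)
    digit_ratio = sum(ch.isdigit() for ch in post.text) / n if n else 0.0
    return [
        digit_ratio,                                         # digit ratio
        float(symbol_length(post.text)),                     # symbol length
        float(len(URL_RE.findall(post.text))),               # number of hyperlinks
        float(sum(len(t) for t in post.noun_verb_tokens)),   # noun and verb length
        float(n),                                            # text length
        float(post.forwards),                                # number of forwards
        float(post.comments),                                # number of comments
        float(post.likes),                                   # number of likes
    ]
```

In practice the features would probably need scaling before being passed to a classifier, since counts such as forwards can be orders of magnitude larger than the digit ratio.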

A method for multi-strategy cleaning of social media text data, characterized by comprising the following steps:

step A: calculating the similarity of the social media texts based on the simhash algorithm, setting a threshold value to delete social media texts with a high degree of repetition, and recording the users who published the repeated texts;

step B: identifying marketing texts based on the characteristics of network marketing texts and an SVM classifier, and recording the users who published the marketing texts;

step C: based on the previous two steps, recording a blacklist of users who publish marketing texts and repeated texts, counting how frequently each blacklisted user publishes junk texts, judging users with a higher frequency to be junk users, and deleting the social media data published by those users.

Further, step A comprises:

substep A1: segmenting the social media text, removing stop words, and constructing a word set S of the text;

substep A2: performing feature selection (tf-idf) on the word set S to form a vector D of weighted words;

substep A3: initializing a 64-dimensional vector V with every element set to 0, and processing each word in the word set S as follows: a hash function is applied to the word to obtain a 64-bit signature f, and each bit of f is traversed; if the i-th bit of the word's signature is 0, the weight D[word] is subtracted from the i-th dimension of V (and, as in the standard simhash algorithm, the weight is added when the i-th bit is 1). After all words in S have been processed, the text is mapped into a 64-dimensional vector g;

substep A4: if the i-th dimension of g is greater than 0, setting the i-th bit of the 64-bit fingerprint to 1, and otherwise setting it to 0, so that a piece of social media text is mapped into a 64-bit fingerprint code G;

substep A5: calculating the similarity between the fingerprint codes G of different texts using cosine distance; if the similarity is greater than a threshold value, the texts are determined to be repeated texts, and the users who published them are recorded in a blacklist.
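Substep A5 can be illustrated with the sketch below, which treats each 64-bit fingerprint as a vector of 0/1 components and compares fingerprints pairwise by cosine similarity. The function names are hypothetical, the threshold is left as a parameter because the patent does not give a value, and blacklisting both publishers of a matching pair is an assumption (the text only says that publishers of repeated texts are recorded).

```python
import math
from typing import List, Set, Tuple


def cosine_similarity(fp_a: int, fp_b: int) -> float:
    """Cosine similarity of two fingerprints viewed as 0/1 vectors."""
    ones_a = bin(fp_a).count("1")
    ones_b = bin(fp_b).count("1")
    if ones_a == 0 or ones_b == 0:
        return 0.0  # an all-zero fingerprint has no direction
    common = bin(fp_a & fp_b).count("1")  # dot product of the two bit vectors
    return common / math.sqrt(ones_a * ones_b)


def find_repeat_publishers(posts: List[Tuple[str, int]], threshold: float) -> Set[str]:
    """posts holds (user_id, fingerprint) pairs; returns the blacklisted users."""
    blacklist: Set[str] = set()
    for i in range(len(posts)):
        for j in range(i + 1, len(posts)):
            if cosine_similarity(posts[i][1], posts[j][1]) > threshold:
                # Assumption: both publishers of a repeated pair are recorded.
                blacklist.add(posts[i][0])
                blacklist.add(posts[j][0])
    return blacklist
```

The pairwise loop is quadratic in the number of posts, which is consistent with the higher computational cost the description attributes to cosine comparison; a production system would likely need some form of candidate bucketing first.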

Further, the marketing text recognition module uses an SVM model to recognize and classify spam text. The selected features comprise the content features and external features of marketing text, such as third-party contact information and character features. Recognized marketing texts are saved into a spam text corpus so that the training samples of the model are continuously expanded, and the users who published the marketing texts are recorded and added to the user blacklist.
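To make the classification and corpus-expansion loop concrete, the sketch below trains a small linear SVM by subgradient descent on the regularised hinge loss and then records recognised marketing texts. It is an illustration only: the patent does not specify the kernel, solver or hyperparameters, so the linear model, the values of `lam`, `lr` and `epochs`, and all function names are assumptions, and a real system would more likely rely on an established SVM library.

```python
from typing import List, Sequence, Set, Tuple


def train_linear_svm(xs: Sequence[Sequence[float]], ys: Sequence[int],
                     lam: float = 0.01, lr: float = 0.1,
                     epochs: int = 200) -> Tuple[List[float], float]:
    """Train on labels +1 (marketing) / -1 (normal); returns weights w and bias b."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - lr * lam) for wi in w]  # L2 regularisation step
            if margin < 1.0:  # hinge loss is active: move towards the sample
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


def record_marketing(posts: Sequence[Tuple[str, List[float]]],
                     w: Sequence[float], b: float,
                     corpus: List[Tuple[List[float], int]],
                     blacklist: Set[str]) -> None:
    """posts holds (user_id, feature_vector); updates corpus and blacklist in place."""
    for user, x in posts:
        if predict(w, b, x) == 1:
            corpus.append((x, 1))  # expand the training samples
            blacklist.add(user)    # record the publishing user
```

Retraining on the expanded corpus from time to time is what would let the classifier adapt, although feeding a model its own predictions can also reinforce its mistakes, so some manual review of the added samples may be advisable.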

Further, step C includes: counting how frequently each user in the user blacklist publishes repeated texts and marketing texts, and judging users with a high frequency to be junk users; for texts that are neither repeated texts nor marketing texts, identifying the publishing user and checking whether that user is a junk user; and filtering all social media data published by junk users.
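Step C can be sketched as a frequency count followed by a filter. The patent only speaks of a "high" frequency, so the cut-off `min_count` is a hypothetical parameter, as are the function name and the `(user, text, flagged)` tuple layout, where `flagged` marks a text already identified as repeated or marketing by the earlier steps.

```python
from collections import Counter
from typing import List, Set, Tuple


def clean_posts(posts: List[Tuple[str, str, bool]],
                min_count: int) -> Tuple[List[Tuple[str, str]], Set[str]]:
    """Return (kept posts, junk users) for posts given as (user, text, flagged)."""
    # Count how many flagged (repeated or marketing) texts each user published.
    counts = Counter(user for user, _, flagged in posts if flagged)
    # Assumption: a user with at least min_count flagged texts is a junk user.
    junk_users = {user for user, c in counts.items() if c >= min_count}
    # Drop every flagged text, plus all remaining texts from junk users.
    kept = [(user, text) for user, text, flagged in posts
            if not flagged and user not in junk_users]
    return kept, junk_users
```

For example, a user with two flagged posts and one ordinary post loses all three when `min_count` is 2, while a user with a single flagged post loses only that post.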

The beneficial technical effects of the invention are as follows: social media data cleaning does not rely on a single technique; instead, different types of junk text are filtered step by step in a multi-strategy manner. Compared with using text spam filtering or spam user identification alone, the method has better applicability and wider application prospects.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention.

FIG. 2 is a detailed flow chart of similar text recognition.

Detailed Description

The present invention mainly concerns the cleaning of junk social media text data. The following description of the embodiments is intended to help the public understand the invention, but the specific embodiments given by the applicant should not be construed as limiting the technical solution of the invention. Any change to the definition of components or technical features, or to the form of the overall structure, that is not a substantive change falls within the scope of protection defined by the technical solution of the invention.

As shown in FIG. 1, similarity comparison is first performed on the network texts to filter out repeated texts with high similarity. The similarity comparison is based on the simhash algorithm, with the Hamming distance replaced by cosine distance; although this increases the computational cost, it improves the efficiency of feature comparison.

Second, marketing text recognition is performed on the social media data. The marketing text recognition part analyzes the characteristics of common marketing network texts and uses an SVM classifier for training and testing. The identified marketing text data is also fed back iteratively as training data, which enhances the adaptability of the classifier.

Finally, building on the similar text recognition module and the marketing text recognition module, the junk user identification module establishes a user blacklist of users who publish repeated texts and marketing texts. The frequency with which each blacklisted user publishes junk texts is analyzed statistically, users with a high frequency are judged to be junk users, and all social media data published by them is filtered out to complete the cleaning.

Compared with the existing junk web text cleaning method, the invention designs a method for filtering junk texts by various strategies from multiple angles, and specifically comprises text similarity comparison, marketing text recognition and junk user recognition. Compared with single text spam filtering and spam user identification, the method has better applicability and wider application prospect.

The recognition of similar network texts is shown in FIG. 2:

First, the text is segmented into words and common stop words are removed to obtain the word set S of the text.

Second, feature selection (tf-idf) is performed on S to form a vector D of weighted words. If feature selection is not applied, a vector D in which every word has weight 1 is formed. A 64-dimensional vector V is initialized with every element set to 0.

Then, each word in the word set S is processed as follows: a hash function is applied to the word to obtain a 64-bit signature f, and each bit of f is traversed; if the i-th bit is 0, the weight D[word] of the word is subtracted from the i-th dimension of V (and, as in the standard simhash algorithm, the weight is added when the i-th bit is 1). After all words in S have been processed, the text is mapped into a 64-dimensional vector g.

If the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint, counted from the left, is set to 1; otherwise it is set to 0. In this way a piece of text is finally mapped into a 64-bit fingerprint code G.

Finally, the similarity between the fingerprint codes G of different texts is calculated using cosine distance, and texts whose similarity is greater than the threshold value are judged to be similar texts.
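The weighting in the second step of this walk-through can be sketched as follows. The patent names tf-idf but not a specific formula, so the smoothed inverse document frequency used here is one common variant chosen as an assumption, and the function names are hypothetical. The fallback with all weights equal to 1 corresponds to the case where feature selection is not applied.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_weights(doc_tokens: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Weighted word vector D for one segmented text, given a reference corpus."""
    n_docs = len(corpus)
    total = len(doc_tokens)
    weights: Dict[str, float] = {}
    for word, count in Counter(doc_tokens).items():
        doc_freq = sum(1 for doc in corpus if word in doc)
        # Assumed variant: smoothed idf, which stays positive for every word.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        weights[word] = (count / total) * idf
    return weights


def uniform_weights(doc_tokens: List[str]) -> Dict[str, float]:
    """Vector D when feature selection is skipped: every word has weight 1."""
    return {word: 1.0 for word in set(doc_tokens)}
```

Words that appear in fewer corpus documents receive larger weights and therefore pull more strongly on the dimensions of V when the fingerprint is accumulated.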

The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof, and it is therefore intended that all such changes and modifications as fall within the true spirit and scope of the invention be considered as within the following claims.

Claims

1. A system for multi-strategy cleaning of social media text data, the system comprising: a similar text recognition module, which segments the network text into words, removes stop words, constructs a word set S of the text, performs feature selection on the word set S to form a vector D of weighted words, maps each text into a 64-bit fingerprint code G, calculates the similarity between the fingerprint codes G of different texts using cosine distance, determines the texts to be repeated texts if the similarity is greater than a threshold value, and records the users who published the repeated network texts and stores them in a blacklist; a marketing text recognition module, which introduces a machine learning classifier, summarizes the marketing characteristics of common network texts, and recognizes marketing texts by means of an SVM classifier, wherein the features selected for the SVM classifier comprise content features and external features; and a junk user identification module, which, on the basis of the similar text recognition module and the marketing text recognition module, records the users who publish similar texts and marketing texts to form a user blacklist, counts how frequently the blacklisted users publish similar texts and marketing texts, judges users with a high publishing frequency to be junk users, and filters all social media data published by the junk users.

2. The system for multi-strategy cleaning of social media text data according to claim 1, wherein, after the vector D of weighted words is formed, a 64-dimensional vector V is initialized with every element set to 0; each word in the word set S is processed by applying a hash function to the word to obtain a 64-bit signature f and traversing each bit of f, and if the i-th bit of the word's signature is 0, the weight D[word] of the word is subtracted from the i-th dimension of V; after all words in S have been processed, the text is mapped into a 64-dimensional vector g; and if the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint is set to 1, and otherwise it is set to 0.

3. The system for multi-strategy cleaning of social media text data according to claim 1, wherein the content features comprise: digit ratio: the proportion of digits in the social media text relative to the total length of the text; symbol length: the length of the emoticons and punctuation marks in the text; number of hyperlinks: the number of hyperlinks contained in the text; and the external features comprise: noun and verb length: the summed length of the nouns and verbs remaining after the text is segmented and stop words are removed; text length: the total length of the original social media text; number of forwards: the number of times the social media text is forwarded; number of comments: the number of times the social media text is commented on; number of likes: the number of times the social media text is liked.

4. A method for multi-strategy cleaning of social media text data, characterized by comprising the following steps: step A: calculating the similarity of the social media texts based on the simhash algorithm, setting a threshold value to delete social media texts with a high degree of repetition, and recording the users who published the repeated texts; step B: identifying marketing texts based on the characteristics of network marketing texts and an SVM classifier, and recording the users who published the marketing texts; step C: based on the previous two steps, recording a blacklist of users who publish marketing texts and repeated texts, counting how frequently each blacklisted user publishes junk texts, judging users with a higher frequency to be junk users, and deleting the social media data published by those users.

5. The method for multi-strategy cleaning of social media text data according to claim 4, wherein step A comprises: substep A1: segmenting the social media text, removing stop words, and constructing a word set S of the text; substep A2: performing feature selection on the word set S to form a vector D of weighted words; substep A3: initializing a 64-dimensional vector V with every element set to 0, and processing each word in the word set S as follows: a hash function is applied to the word to obtain a 64-bit signature f, each bit of f is traversed, and if the i-th bit of the word's signature is 0, the weight D[word] of the word is subtracted from the i-th dimension of V; after all words in S have been processed, the text is mapped into a 64-dimensional vector g; substep A4: if the i-th dimension of g is greater than 0, setting the i-th bit of the 64-bit fingerprint to 1, and otherwise setting it to 0, so that a piece of social media text is mapped into a 64-bit fingerprint code G; substep A5: calculating the similarity between the fingerprint codes G of different texts using cosine distance, determining the texts to be repeated texts if the similarity is greater than a threshold value, and recording the users who published the repeated texts in a blacklist.

6. The method for multi-strategy cleaning of social media text data according to claim 4, wherein the marketing text recognition uses an SVM model to recognize and classify spam text, the selected features comprise the content features and external features of marketing text, the recognized marketing texts are saved into a spam text corpus so that the training samples of the model are continuously expanded, and the users who published the marketing texts are recorded and added to a user blacklist.

7. The method for multi-strategy cleaning of social media text data according to claim 4, wherein step C comprises: counting how frequently each user in the user blacklist publishes repeated texts and marketing texts, and judging users with a high frequency to be junk users; for texts that are neither repeated texts nor marketing texts, identifying the publishing user and checking whether that user is a junk user; and filtering all social media data published by junk users.