CN107633077A

CN107633077A - System and method for multi-strategy cleaning of social media text data

Info

Publication number: CN107633077A
Application number: CN201710873539.1A
Authority: CN
Inventors: 薛涵凛; 王颖
Original assignee: Nanjing Chain Data Technology Co Ltd
Current assignee: Nanjing Chain Data Technology Co Ltd
Priority date: 2017-09-25
Filing date: 2017-09-25
Publication date: 2018-01-26
Anticipated expiration: 2037-09-25
Also published as: CN107633077B

Abstract

The invention discloses a system for multi-strategy cleaning of social media text data. The system includes a similar text identification module, a marketing text identification module, and a spam user identification module. The corresponding method includes: Step A, computing the similarity of social media texts; Step B, identifying marketing text based on the features of network marketing text and an SVM classifier, and recording the users who publish network marketing text; and Step C, building on the first two steps, recording a blacklist of users who publish "marketing text" and "repeated text". The beneficial effect of the invention is that social media data cleaning is not limited to a single technique: different types of spam text are filtered step by step with multiple strategies. Compared with spam text filtering alone or spam user identification alone, the invention has better applicability and broader application prospects.

Description

A system and method for multi-strategy cleaning of social media text data

Technical field

The present invention relates to a system and method for multi-strategy cleaning of social media text data, and belongs to the technical field of data mining.

Background

At present, social media has become the most popular network communication platform. Anyone can post at any time from a computer, mobile phone, or other device, and these posts can spread across the entire internet. Social media is one of the important platforms for releasing public opinion information, and because it updates quickly and allows a high degree of freedom, more and more marketing advertisements rely on it for distribution. This seriously disrupts normal browsing for users and also hinders the relevant institutions in analyzing and managing public opinion. The monitoring functions of existing social media platforms alone cannot filter and block such spam text.

Current spam filtering is usually aimed at specific application scenarios, such as spam email filtering and spam web page detection. On social media platforms, spam text includes advertising, pornography, violence, and other harmful information. Existing social media data cleaning concentrates on the analysis and monitoring of spam users (such as zombie followers), the identification of spam comments, and the filtering of spam content. Spam content is identified mainly by classification: features of spam text are selected and a classifier is trained with a machine learning model. Commonly used models include naive Bayes, AdaBoost, decision trees, and support vector machines. Besides marketing advertisements, the spam data on social platforms also includes repeated and similar texts. The prior art generally uses only a single strategy for cleaning, which cannot meet practical requirements.

Summary of the invention

The object of the invention is to provide a system and method for multi-strategy cleaning of social media text data. In particular, the characteristics of advertising and marketing spam text are analyzed so that marketing users and marketing advertisement texts are accurately identified and filtered, overcoming the shortcomings of the prior art.

The invention is realized through the following technical solution:

A system for multi-strategy cleaning of social media text data, characterized in that the system includes:

Similar text identification module: this module segments the network text into words, removes stop words, and builds the word set S of the text. Feature selection is performed on S to form a vector D of weighted words, from which the text is mapped to a 64-bit fingerprint code G. For the fingerprint codes G of different texts, similarity is computed using cosine distance, and texts whose similarity exceeds a threshold are identified as repeated text. The users who publish repeated network text are recorded and saved to a blacklist;

Marketing text identification module：

A machine learning classifier is introduced. The marketing features of typical network texts are summarized, and marketing text is identified by an SVM classifier. The features selected for the SVM classifier include content features and surface features;

Spam user identification module: building on the similar text identification module and the marketing text identification module, the users who publish "similar text" and "marketing text" are recorded to form a user blacklist. The frequency with which each blacklisted user publishes "similar text" and "marketing text" is counted, users with a high publishing frequency are judged to be spam users, and all social media data they publish is filtered out.

Further, the vector D of weighted words is initialized, and a 64-dimensional vector V is initialized with every element set to 0. Each word in the word set S is processed as follows: a hash function is applied to the word to obtain a 64-bit signature f, and every bit of f is traversed; if the word's signature is 0 at the i-th bit, the weight D(word) of the word is subtracted from the i-th dimension of V. After all words in S have been processed, the text is mapped to a 64-dimensional vector g. If the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint is set to 1; otherwise it is set to 0.
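The fingerprint construction described above can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the function names `hash64` and `fingerprint` and the use of a truncated MD5 digest as the hash function are assumptions, and the step that adds the weight when a signature bit is 1 is taken from the standard simhash procedure, since the text above only spells out the subtraction for a 0 bit.

```python
import hashlib

def hash64(word):
    """64-bit signature f of a word. Truncated MD5 is an assumed choice of hash function."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")

def fingerprint(weights):
    """Map a dict of word -> weight (the vector D) to a 64-bit fingerprint."""
    v = [0.0] * 64                       # the 64-dimensional vector V, all zeros
    for word, weight in weights.items():
        f = hash64(word)
        for i in range(64):
            if (f >> (63 - i)) & 1:      # bit i, counted from the left
                v[i] += weight           # assumption: standard simhash adds on a 1 bit
            else:
                v[i] -= weight           # as described: subtract on a 0 bit
    g = 0
    for i in range(64):
        if v[i] > 0:                     # positive dimension -> fingerprint bit set to 1
            g |= 1 << (63 - i)
    return g
```

With a single word of positive weight, the fingerprint equals that word's signature; with several words, the heavier words dominate each bit.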

Further, the content features include:

Digit ratio: the proportion of the total text length of a social media text that is made up of digits;

Symbol length: the length of the emoticons and punctuation marks in the text;

Hyperlink count: the number of hyperlinks in the text;

The surface features include:

Noun and verb length: the combined length of the nouns and verbs after the text has been segmented and stop words removed;

Text length: the total length of the original social media text;

Repost count: the number of times the social media text has been reposted;

Comment count: the number of comments on the social media text;

Like count: the number of likes the social media text has received;
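A minimal sketch of how most of the features listed above might be computed for one post is given below. The function name `extract_features`, the dictionary keys, and the URL pattern are hypothetical. Symbols are approximated as characters that are neither alphanumeric nor whitespace, so punctuation inside a URL is counted as well. The noun and verb length is left out because it requires a word segmenter and part-of-speech tagger, which the text does not specify.

```python
import re

# Assumed pattern for hyperlinks; real platforms may use other link formats.
URL_RE = re.compile(r"https?://\S+")

def extract_features(text, reposts, comments, likes):
    """Return content and surface features of one social media post as a dict."""
    n = len(text)
    digits = sum(1 for ch in text if ch.isdigit())
    # Approximation: punctuation and emoji are neither alphanumeric nor whitespace.
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return {
        "digit_ratio": digits / n if n else 0.0,
        "symbol_length": symbols,
        "hyperlink_count": len(URL_RE.findall(text)),
        "text_length": n,
        "repost_count": reposts,
        "comment_count": comments,
        "like_count": likes,
    }
```

The resulting values can be arranged in a fixed order to form the feature vector passed to a classifier.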

A method for multi-strategy cleaning of social media text data, characterized in that the method comprises the following steps:

Step A: compute the similarity of social media texts based on an improved simhash algorithm, delete social media texts whose degree of repetition exceeds a set threshold, and record the users who published the repeated texts;

Step B: identify marketing text based on the features of network marketing text and an SVM classifier, and record the users who published the network marketing text;

Step C: based on the first two steps, record a blacklist of users who publish "marketing text" and "repeated text", count how often each blacklisted user publishes spam text, judge users with a high frequency to be spam users, and delete the social media data published by those users.
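Step C can be illustrated with the short sketch below. It assumes that Steps A and B have already produced a list of flagged posts as `(user_id, label)` pairs; the function names, the post dictionary layout, and the cut-off `min_count=3` are hypothetical, since the text does not state what counts as a high frequency.

```python
from collections import Counter

def find_spam_users(flagged_posts, min_count=3):
    """flagged_posts: iterable of (user_id, label), label being 'repeated' or 'marketing'.
    Returns the users whose number of flagged posts reaches min_count (assumed cut-off)."""
    counts = Counter(user for user, _label in flagged_posts)
    return {user for user, count in counts.items() if count >= min_count}

def remove_spam_posts(posts, spam_users):
    """Drop every post published by a spam user; each post is a dict with a 'user' key."""
    return [post for post in posts if post["user"] not in spam_users]
```

A user flagged only once or twice stays on the blacklist but is not treated as a spam user under this cut-off.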

Further, Step A includes:

Sub-step A1: segment the social media text into words, remove stop words, and build the word set S of the text;

Sub-step A2: perform feature selection (tf-idf) on the word set S to form a vector D of weighted words;

Sub-step A3: initialize a 64-dimensional vector V with every element set to 0. Process each word in the word set S as follows: apply a hash function to the word to obtain a 64-bit signature f, and traverse every bit of f; if the word's signature is 0 at the i-th bit, subtract the weight D(word) of the word from the i-th dimension of V. After all words in S have been processed, the text is mapped to a 64-dimensional vector g;

Sub-step A4: if the i-th dimension of g is greater than 0, set the i-th bit of the 64-bit fingerprint to 1, otherwise set it to 0, so that a social media text is mapped to a 64-bit fingerprint code G;

Sub-step A5: for the fingerprint codes G of different texts, compute similarity using cosine distance; texts whose similarity exceeds the threshold are identified as repeated text, and the users who published them are recorded in the blacklist.
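The comparison in sub-step A5 might look like the sketch below. The text does not say how the cosine measure is applied to a fingerprint, so this example assumes each 64-bit fingerprint is treated as a vector of 64 values that are 0 or 1; the function names and the threshold of 0.9 are also hypothetical.

```python
import math

def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")

def cosine_similarity(a, b):
    """Cosine similarity of two fingerprints viewed as 0/1 vectors (an assumed interpretation)."""
    ones_a, ones_b = popcount(a), popcount(b)
    if ones_a == 0 or ones_b == 0:
        # A zero vector has no direction; treat two all-zero fingerprints as identical.
        return 1.0 if ones_a == ones_b else 0.0
    # Dot product of 0/1 vectors = number of positions where both bits are 1.
    return popcount(a & b) / math.sqrt(ones_a * ones_b)

def is_repeated(a, b, threshold=0.9):
    """Flag two texts as repeated when their similarity exceeds the (assumed) threshold."""
    return cosine_similarity(a, b) > threshold
```

Under this interpretation, identical fingerprints score 1.0 and fingerprints with no 1 bits in common score 0.0.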

Further, the marketing text identification module uses an SVM model to identify and classify spam text. The selected features include the content features and surface features of marketing text, such as third-party contact information and character features. The identified marketing texts are saved to a spam text corpus to continually expand the model's training samples, and the users who published the marketing text are recorded and added to the user blacklist.
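To make the classification step concrete, the sketch below trains a small linear SVM by sub-gradient descent on the regularised hinge loss. It is a stand-in under stated assumptions rather than the classifier used by the applicant: the text names neither the kernel nor a library, and the function names, hyperparameters, and toy feature vectors (digit ratio, hyperlink count) are hypothetical.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Train a linear SVM; labels are +1 (marketing) or -1 (not marketing).
    Returns the weight vector w and bias b."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The L2 penalty shrinks the weights slightly on every step.
            w = [wi * (1 - lr * reg) for wi in w]
            if margin < 1:
                # The sample violates the margin: move the boundary toward classifying it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Return +1 for marketing text, -1 otherwise."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

Texts predicted as marketing could then be appended to the training set and the model retrained, which corresponds to the corpus expansion described above.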

Further, Step C includes: counting how often each user in the user blacklist publishes repeated text and marketing text, and judging users whose frequency is too high to be spam users; for text that is neither repeated text nor marketing text, identifying the user who published it and determining whether that user is a spam user; and filtering out all social media data published by spam users.

The beneficial effect of the invention is that social media data cleaning is not limited to a single technique: different types of spam text are filtered step by step with multiple strategies. Compared with spam text filtering alone or spam user identification alone, the invention has better applicability and broader application prospects.

Brief description of the drawings

Fig. 1 is a flow chart of a specific implementation of the invention.

Fig. 2 is the detailed flow of similar text identification.

Detailed description of embodiments

The invention is mainly directed at cleaning spam social media text. The following description of embodiments is intended to help the public understand the invention, but the specific embodiments given by the applicant should not be regarded as limiting the technical solution of the invention. Any change to the definition of a component or technical feature, and/or any formal rather than substantive change to the overall structure, is regarded as falling within the scope of protection defined by the technical solution of the invention.

As shown in Fig. 1, similarity comparison is first performed on the network texts to filter out highly repetitive and highly similar texts. The similarity comparison is based on an improved simhash algorithm in which the Hamming distance is replaced with cosine distance; this increases the computational cost but improves the efficiency of feature comparison.

Secondly, marketing text identification is performed on the social media data. The marketing text identification part analyzes the features of common marketing network texts and uses an SVM classifier for training and testing. The identified marketing text data is also fed back iteratively to strengthen the adaptability of the classifier.

Finally, the spam user identification module builds on the two preceding modules and establishes a user blacklist of users who publish "repeated text" and "marketing text". The frequency with which blacklisted users publish spam text is analyzed statistically, users with a high frequency are judged to be spam users, and all social media data they publish is filtered out to complete the cleaning.

Compared with existing methods for cleaning spam network text, the invention designs a multi-strategy method that filters spam text from several angles, specifically text similarity comparison, marketing text identification, and spam user identification. Compared with spam text filtering alone or spam user identification alone, the invention has better applicability and broader application prospects.

The identification of similar network texts is shown in Fig. 2:

First, the text is segmented into words and common stop words are removed to obtain the word set S of the text;

Secondly, feature selection (tf-idf) is performed on S to form a vector D of weighted words. If feature selection is not used, D consists of words that all have weight 1. A 64-dimensional vector V is initialized with every element set to 0.

Then, each word in the word set S is processed as follows: a hash function is applied to the word to obtain a 64-bit signature f, and every bit of f is traversed; if the i-th bit is 0, the weight D[word] of the word is subtracted from the i-th dimension of V. After all words in S have been processed, the text is mapped to a 64-dimensional vector g.

If the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint, counted from the left, is set to 1; otherwise it is set to 0. The text is thus mapped to a 64-bit fingerprint code G.

For the fingerprint codes G of different texts, similarity is computed using cosine distance. Texts whose similarity exceeds the threshold are determined to be similar texts.
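The weighting step of this procedure (tf-idf, or uniform weights when feature selection is skipped) can be sketched as follows. The function name `word_weights`, the reference corpus argument, and the smoothed idf formula are assumptions; the text names tf-idf but does not give the exact variant.

```python
import math
from collections import Counter

def word_weights(tokens, corpus=None):
    """tokens: words of one text after segmentation and stop-word removal.
    corpus: list of token lists used for document frequencies; if None,
    every word gets weight 1, as when feature selection is not used."""
    if corpus is None:
        return {word: 1.0 for word in set(tokens)}
    counts = Counter(tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        df = sum(1 for doc in corpus if word in doc)      # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1       # smoothed idf (assumed variant)
        weights[word] = (count / len(tokens)) * idf       # term frequency times idf
    return weights
```

The returned dictionary plays the role of the vector D: each word's weight is what is applied to the dimensions of V when that word's signature is traversed.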

Of course, the invention may also have various other embodiments. Without departing from the spirit and essence of the invention, those skilled in the art may make corresponding changes and modifications according to the invention, but all such changes and modifications shall fall within the scope of protection of the appended claims.

Claims

1. A system for multi-strategy cleaning of social media text data, characterized in that the system includes:

Marketing text identification module：

2. The system for multi-strategy cleaning of social media text data according to claim 1, characterized in that the vector D of weighted words is initialized, and a 64-dimensional vector V is initialized with every element set to 0; each word in the word set S is processed as follows: a hash function is applied to the word to obtain a 64-bit signature f, and every bit of f is traversed; if the word's signature is 0 at the i-th bit, the weight D(word) of the word is subtracted from the i-th dimension of V; after all words in S have been processed, the text is mapped to a 64-dimensional vector g; if the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint is set to 1, otherwise it is set to 0.

3. The system for multi-strategy cleaning of social media text data according to claim 1, characterized in that the content features include:

Symbol length: the length of the emoticons and punctuation marks in the text;

Hyperlink count: the number of hyperlinks in the text;

The surface features include:

Text length: the total length of the original social media text;

Repost count: the number of times the social media text has been reposted;

Comment count: the number of comments on the social media text;

Like count: the number of likes the social media text has received.

4. A method for multi-strategy cleaning of social media text data, characterized in that the method comprises the following steps:

5. The method for multi-strategy cleaning of social media text data according to claim 4, characterized in that:

Sub-step A2: perform feature selection on the word set S to form a vector D of weighted words;

Sub-step A3: initialize a 64-dimensional vector V with every element set to 0;

Process each word in the word set S as follows: apply a hash function to the word to obtain a 64-bit signature f, and traverse every bit of f; if the word's signature is 0 at the i-th bit, subtract the weight D(word) of the word from the i-th dimension of V;

After all words in S have been processed, the text is mapped to a 64-dimensional vector g;

6. The method for multi-strategy cleaning of social media text data according to claim 4, characterized in that the marketing text identification module uses an SVM model to identify and classify spam text; the selected features include the content features and surface features of marketing text; the identified marketing texts are saved to a spam text corpus to continually expand the model's training samples, and the users who published the marketing text are recorded and added to the user blacklist.

7. The method for multi-strategy cleaning of social media text data according to claim 4, characterized in that Step C includes: counting how often each user in the user blacklist publishes repeated text and marketing text, and judging users whose frequency is too high to be spam users; for text that is neither repeated text nor marketing text, identifying the user who published it and determining whether that user is a spam user; and filtering out all social media data published by spam users.