CN107633077A - System and method for cleaning social media text data by multiple strategies - Google Patents

System and method for cleaning social media text data by multiple strategies

Info

Publication number
CN107633077A
Authority
CN
China
Prior art keywords
text
social media
marketing
word
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710873539.1A
Other languages
Chinese (zh)
Other versions
CN107633077B (en)
Inventor
薛涵凛
王颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Chain Data Technology Co Ltd
Original Assignee
Nanjing Chain Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Chain Data Technology Co Ltd
Priority to CN201710873539.1A
Publication of CN107633077A
Application granted
Publication of CN107633077B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a system for cleaning social media text data with multiple strategies. The system comprises a similar text identification module, a marketing text identification module, and a spam user identification module. The method for cleaning social media text data with multiple strategies comprises: step A, computing the similarity of social media texts; step B, identifying marketing text based on the features of network marketing text and an SVM classifier, and recording the users who publish network marketing text; and step C, recording, based on the first two steps, a blacklist of users who publish "marketing text" and "repeated text". The beneficial effect of the invention is that social media data cleaning is not limited to a single technique: different types of spam text are filtered step by step with multiple strategies. Compared with standalone spam text filtering or spam user identification methods, the invention has better applicability and broader application prospects.

Description

System and method for cleaning social media text data by multiple strategies
Technical field
The present invention relates to a system and method for cleaning social media text data with multiple strategies, and belongs to the field of data mining technology.
Background
At present, social media has become the most popular network communication platform. Anyone can post at any time from a computer, mobile phone, or similar device, and these posts can spread across the whole internet. Social media is one of the important platforms for publishing public opinion information; because it updates quickly and allows a high degree of freedom, more and more marketing advertisements rely on it for dissemination. This not only seriously disrupts users' normal browsing but also hinders the relevant institutions in analyzing and managing public opinion. Relying solely on the monitoring functions of existing social media platforms, such spam text cannot be filtered and blocked.
Current spam filtering usually targets specific application scenarios, such as spam email filtering and spam web page detection. On social media platforms, spam text includes advertising, pornography, violence, and other harmful information. Existing social media data cleaning concentrates on the analysis and monitoring of spam users (such as zombie followers), the identification of spam comments, and the filtering of spam content. Spam content is mainly identified by classification: features of spam text are selected and a classifier is trained with a machine learning model; commonly used models include naive Bayes, AdaBoost, decision trees, and support vector machines. Besides marketing advertisements, spam data on social platforms also includes repeated text and similar text. The prior art generally uses only a single strategy for cleaning, which cannot meet users' needs.
Summary of the invention
The object of the present invention is to provide a system and method for cleaning social media text data with multiple strategies. In particular, it analyzes the characteristics of advertising and marketing spam text, accurately identifies and filters marketing users and marketing advertisement text, and thereby overcomes the shortcomings of the prior art.
The present invention is realized by the following technical solution:
A system for cleaning social media text data with multiple strategies, characterized in that the system comprises:
Similar text identification module: this module segments network text into words, removes stop words, and builds the word set S of the text; performs feature selection on S to form a vector D of weighted words; maps each text to a 64-bit fingerprint code G; and computes the similarity between the fingerprint codes G of different texts using cosine distance. Texts whose similarity exceeds the threshold are deemed repeated text, and the users who published them are recorded in a blacklist;
Marketing text identification module: a machine learning classifier is introduced; the marketing features of typical network text are summarized, and marketing text is identified by an SVM classifier whose selected features include content features and surface features;
Spam user identification module: building on the similar text identification module and the marketing text identification module, this module records the users who publish "similar text" and "marketing text" to form a user blacklist, counts how often each blacklisted user publishes "similar text" and "marketing text", deems users with a high publishing frequency to be spam users, and filters out all social media data they publish.
Further, the vector D of weighted words is initialized, and a 64-dimensional vector V is initialized with every element set to 0. Each word in the word set S is processed as follows: a Hash function is applied to the word to obtain a 64-bit signature f, and each bit of f is traversed; if the i-th bit is 0, the weight D of the word is subtracted from the i-th dimension of the initial vector V. After all words in S have been processed, the text is mapped to a 64-dimensional vector g; if the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint is set to 1, otherwise it is set to 0.
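The fingerprint construction described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the patent's implementation: the text does not name the Hash function, so MD5 truncated to 64 bits is used here, and the function names `hash64` and `simhash_fingerprint` are hypothetical. The text only states the subtraction for a 0 bit; adding the weight for a 1 bit follows the standard simhash construction and is an assumption.

```python
import hashlib


def hash64(word):
    """64-bit signature f of a word (assumption: MD5 truncated to 8 bytes)."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")


def simhash_fingerprint(weights):
    """Map a weighted word vector D (dict: word -> weight) to a 64-bit fingerprint."""
    v = [0.0] * 64                      # 64-dimensional vector V, all zeros
    for word, weight in weights.items():
        f = hash64(word)
        for i in range(64):
            if (f >> (63 - i)) & 1:     # bit i, counted from the left
                v[i] += weight          # assumption: standard simhash adds on a 1 bit
            else:
                v[i] -= weight          # as in the text: subtract on a 0 bit
    fingerprint = 0
    for i in range(64):
        if v[i] > 0:                    # positive dimension -> fingerprint bit 1
            fingerprint |= 1 << (63 - i)
    return fingerprint
```

With a single word of positive weight, the fingerprint equals that word's own 64-bit signature; with several words, heavier words dominate each bit.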
Further, the content features include:
Digit ratio: the proportion of the total text length occupied by digits in the social media text;
Symbol length: the length of emoticons and punctuation marks in the text;
Hyperlink count: the number of hyperlinks in the text;
The surface features include:
Noun and verb length: the total length of nouns and verbs after the text is segmented and stop words are removed;
Text length: the total length of the original social media text;
Repost count: the number of times the social media text has been reposted;
Comment count: the number of comments on the social media text;
Like count: the number of likes on the social media text;
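As a rough illustration of how the content and surface features listed above might be computed, consider the following Python sketch. The function name, the feature keys, and the URL pattern are hypothetical. Part-of-speech tagging is outside the sketch, so the nouns and verbs are assumed to be supplied by an external segmenter, and "symbols" are approximated by the Unicode punctuation and symbol categories.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")    # assumption: hyperlinks appear as http(s) URLs


def is_symbol(ch):
    # Unicode categories P* (punctuation) and S* (symbols, which cover most emoji)
    return unicodedata.category(ch)[0] in ("P", "S")


def extract_features(text, noun_verb_tokens, reposts, comments, likes):
    """Build the feature dictionary for one post.

    noun_verb_tokens: nouns and verbs left after segmentation and stop-word
    removal (assumed to come from an external part-of-speech tagger).
    """
    length = len(text)
    digits = sum(ch.isdigit() for ch in text)
    return {
        # content features
        "digit_ratio": digits / length if length else 0.0,
        "symbol_length": sum(is_symbol(ch) for ch in text),
        "hyperlink_count": len(URL_RE.findall(text)),
        # surface features
        "noun_verb_length": sum(len(t) for t in noun_verb_tokens),
        "text_length": length,
        "reposts": reposts,
        "comments": comments,
        "likes": likes,
    }
```

The resulting dictionary can be turned into a fixed-order numeric vector before being passed to a classifier.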
A method for cleaning social media text data with multiple strategies, characterized in that the method comprises the following steps:
Step A: compute the similarity of social media texts based on an improved simhash algorithm, delete social media texts whose degree of repetition exceeds a set threshold, and record the users who published the repeated text;
Step B: identify marketing text based on the features of network marketing text and an SVM classifier, and record the users who publish network marketing text;
Step C: based on the first two steps, record a blacklist of users who publish "marketing text" and "repeated text", count how often each blacklisted user publishes spam text, deem users with a high frequency to be spam users, and delete the social media data published by such users.
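Step C amounts to a frequency count over the flagged posts followed by a filter. A minimal Python sketch, assuming the outputs of steps A and B are available as (user, label) pairs; the function names and the `min_count` threshold are hypothetical, since the text does not say what counts as a "high" frequency.

```python
from collections import Counter


def find_spam_users(flagged_posts, min_count=3):
    """flagged_posts: iterable of (user_id, label) pairs for posts flagged in
    step A ('duplicate') or step B ('marketing').

    min_count is a hypothetical cut-off for a 'high' publishing frequency.
    """
    counts = Counter(user for user, _label in flagged_posts)
    return {user for user, n in counts.items() if n >= min_count}


def clean(posts, spam_users):
    """Drop every post whose author is a spam user.

    posts: list of dicts that each carry a 'user' key.
    """
    return [p for p in posts if p["user"] not in spam_users]
```

Note that `clean` removes all posts by a spam user, including posts that were not themselves flagged, which matches the step's instruction to delete the data published by such users.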
Further:
Sub-step A1: segment the social media text into words, remove stop words, and build the word set S of the text;
Sub-step A2: perform feature selection (tf-idf) on the word set S to form a vector D of weighted words;
Sub-step A3: initialize a 64-dimensional vector V with every element set to 0. Process each word in the word set S as follows: apply a Hash function to the word to obtain a 64-bit signature f and traverse each bit of f; if the i-th bit is 0, subtract the word's weight D(word) from the i-th dimension of V. After all words in S have been processed, the text is mapped to a 64-dimensional vector g;
Sub-step A4: if the i-th dimension of g is greater than 0, set the i-th bit of the 64-bit fingerprint to 1, otherwise set it to 0, so that each social media text is mapped to a 64-bit fingerprint code G;
Sub-step A5: compute the similarity between the fingerprint codes G of different texts using cosine distance; texts whose similarity exceeds the threshold are deemed repeated text, and the users who published them are recorded in the blacklist.
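Sub-steps A1 and A2 (segmentation, stop-word removal, tf-idf weighting) could look roughly like the following Python sketch. Several choices here are assumptions rather than details from the text: the stop-word list is a placeholder, whitespace splitting stands in for a real word segmenter (Chinese text would need one), and the smoothed idf formula is one common variant among several.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to"}    # hypothetical stop-word list


def tokenize(text):
    # Whitespace splitting stands in for a real word segmenter
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_weights(doc_tokens, corpus_tokens):
    """Return D (dict: word -> tf-idf weight) for one document.

    corpus_tokens: list of token lists, one per document in the corpus.
    Uses the smoothed idf log((1 + N) / (1 + df)) + 1, so weights stay positive.
    """
    total = len(doc_tokens)
    if total == 0:
        return {}
    n_docs = len(corpus_tokens)
    df = Counter(w for tokens in corpus_tokens for w in set(tokens))
    tf = Counter(doc_tokens)
    return {
        w: (count / total) * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
        for w, count in tf.items()
    }
```

Words that occur in fewer documents receive a larger weight, so they influence the fingerprint bits more strongly in sub-step A3.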
Further, the marketing text identification module uses an SVM model to identify and classify spam text. The selected features include the content features and surface features of marketing text, such as third-party contact details and character features. Identified marketing text is saved to a spam text corpus so that the model's training samples keep growing, and the users who published the marketing text are recorded and added to the user blacklist.
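The feedback loop in this paragraph (classify, save hits to the corpus, record the publisher) can be sketched independently of the classifier. In the hypothetical Python helper below, the SVM is abstracted as two callables, `train` and `predict`, supplied by the caller; nothing here is specific to any SVM library, and the names are illustrative only.

```python
def expand_corpus(train, predict, corpus, incoming, blacklist):
    """Run one round of the feedback loop.

    train(corpus) -> model          (e.g. fits an SVM; supplied by the caller)
    predict(model, text) -> bool    (True if the text is marketing text)
    corpus: list of (text, is_marketing) training samples, extended in place
    incoming: list of (user, text) posts to classify
    blacklist: set of user ids, extended in place
    """
    model = train(corpus)
    for user, text in incoming:
        if predict(model, text):
            corpus.append((text, True))   # grow the spam text corpus
            blacklist.add(user)           # record the publishing user
    return model
```

Calling the helper again on the next batch retrains on the enlarged corpus. Because newly added samples carry the model's own predictions as labels, misclassifications can reinforce themselves, so a review step for added samples may be worth considering.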
Further, step C includes: counting how often each user in the blacklist publishes repeated text and marketing text, and deeming users whose frequency is too high to be spam users; for text that is neither repeated text nor marketing text, identifying the user who published it and determining whether that user is a spam user; and filtering out all social media data published by spam users.
The beneficial effect of the invention is that social media data cleaning is not limited to a single technique: different types of spam text are filtered step by step with multiple strategies. Compared with standalone spam text filtering or spam user identification methods, the invention has better applicability and broader application prospects.
Brief description of the drawings
Fig. 1 is the overall implementation flow chart of the present invention.
Fig. 2 is the detailed flow of similar text identification.
Detailed description of the embodiments
The present invention mainly performs data cleaning on spam social media text. The following description of the embodiments is intended to help the public understand the invention, but the specific embodiments given by the applicant should not be regarded as limiting the technical solution of the invention; any change to the definition of a component or technical feature, and/or any formal rather than substantive change to the overall structure, should be regarded as falling within the scope of protection defined by the technical solution of the invention.
As shown in Fig. 1, similarity comparison is first performed on network text to filter out highly repetitive or similar text. The similarity comparison is based on an improved simhash algorithm in which the Hamming distance is replaced with cosine distance; although this increases the computational cost, it improves the efficiency of feature comparison.
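To make the two measures concrete, the relation below assumes that each 64-bit fingerprint is treated as a 0/1 vector; the text does not state how the fingerprint is turned into a vector, so this is an assumption. Here |a| denotes the number of 1 bits in fingerprint a, and |a ∧ b| the number of positions where both fingerprints have a 1.

```latex
\cos(a,b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
          = \frac{|a \wedge b|}{\sqrt{|a| \, |b|}},
\qquad
d_H(a,b) = |a| + |b| - 2\,|a \wedge b|
```

For example, if two fingerprints each have 32 one-bits and share 30 of them, the cosine similarity is 30/32 = 0.9375 and the Hamming distance is 32 + 32 - 60 = 4.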
Secondly, marketing text identification is performed on the social media data. This part analyzes the features of common network marketing text and trains and tests an SVM classifier. The marketing text already identified is also fed back iteratively to strengthen the adaptability of the classifier.
Finally, building on the two modules above, the spam user identification module establishes a user blacklist of users who publish "repeated text" and "marketing text". It statistically analyzes how often blacklisted users publish spam text, deems users with a high frequency to be spam users, and filters out all social media data they publish, completing the cleaning.
Compared with existing methods for cleaning spam network text, the present invention filters spam text with multiple strategies from multiple angles, specifically text similarity comparison, marketing text identification, and spam user identification. Compared with standalone spam text filtering or spam user identification, the invention has better applicability and broader application prospects.
Similar network text is identified as shown in Fig. 2:
First, segment the text into words and remove common stop words to obtain the word set S of the text;
Secondly, perform feature selection (tf-idf) on S to form a vector D of weighted words. If feature selection is not used, D is formed with every word's weight set to 1. Initialize a 64-dimensional vector V with every element set to 0.
Then, process each word in the word set S as follows: apply a Hash function to the word to obtain a 64-bit signature f and traverse each bit of f; if the i-th bit is 0, subtract the word's weight D[word] from the i-th dimension of V. After all words in S have been processed, the text is mapped to a 64-dimensional vector G.
If the i-th dimension of G is greater than 0, the i-th bit of the 64-bit fingerprint, counted from the left, is set to 1, otherwise it is set to 0; the text is thus mapped to a 64-bit fingerprint code.
The similarity between the fingerprint codes G of different texts is computed using cosine distance. Texts whose similarity exceeds the threshold are deemed similar text.
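The final comparison step might be sketched in Python as follows. The assumptions are: the fingerprint is expanded into a 0/1 vector before the cosine is taken, the threshold value 0.9 is illustrative only, and only the author of the later of two similar posts is flagged. The text does not settle whether the first publisher is also recorded.

```python
import math


def bits(fingerprint):
    """Expand a 64-bit fingerprint into a list of 0/1 values, leftmost bit first."""
    return [(fingerprint >> (63 - i)) & 1 for i in range(64)]


def cosine_similarity(fp_a, fp_b):
    a, b = bits(fp_a), bits(fp_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(a) * sum(b))     # for 0/1 vectors, sum(a) equals |a|^2
    return dot / norm if norm else 0.0


def record_duplicates(posts, threshold=0.9):
    """posts: list of (user, fingerprint) in publication order.

    Returns the set of users whose post is similar to an earlier post
    (assumption: only the later publisher is flagged).
    """
    blacklist = set()
    seen = []
    for user, fp in posts:
        if any(cosine_similarity(fp, old) > threshold for old in seen):
            blacklist.add(user)
        seen.append(fp)
    return blacklist
```

The pairwise scan is quadratic in the number of posts; a production system would likely need some form of indexing, which is outside the scope of this sketch.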
Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the invention, those skilled in the art may make corresponding changes and modifications according to the invention, but all such changes and modifications shall fall within the scope of protection of the appended claims.

Claims (7)

1. A system for cleaning social media text data with multiple strategies, characterized in that the system comprises:
Similar text identification module: this module segments network text into words, removes stop words, and builds the word set S of the text; performs feature selection on S to form a vector D of weighted words; maps each text to a 64-bit fingerprint code G; and computes the similarity between the fingerprint codes G of different texts using cosine distance. Texts whose similarity exceeds the threshold are deemed repeated text, and the users who published them are recorded in a blacklist;
Marketing text identification module: a machine learning classifier is introduced; the marketing features of typical network text are summarized, and marketing text is identified by an SVM classifier whose selected features include content features and surface features;
Spam user identification module: building on the similar text identification module and the marketing text identification module, this module records the users who publish "similar text" and "marketing text" to form a user blacklist, counts how often each blacklisted user publishes "similar text" and "marketing text", deems users with a high publishing frequency to be spam users, and filters out all social media data they publish.
2. The system for cleaning social media text data with multiple strategies according to claim 1, characterized in that the vector D of weighted words is initialized, and a 64-dimensional vector V is initialized with every element set to 0; each word in the word set S is processed as follows: a Hash function is applied to the word to obtain a 64-bit signature f, and each bit of f is traversed; if the i-th bit is 0, the weight D of the word is subtracted from the i-th dimension of the initial vector V; after all words in S have been processed, the text is mapped to a 64-dimensional vector g; if the i-th dimension of g is greater than 0, the i-th bit of the 64-bit fingerprint is set to 1, otherwise it is set to 0.
3. The system for cleaning social media text data with multiple strategies according to claim 1, characterized in that the content features include:
Digit ratio: the proportion of the total text length occupied by digits in the social media text;
Symbol length: the length of emoticons and punctuation marks in the text;
Hyperlink count: the number of hyperlinks in the text;
The surface features include:
Noun and verb length: the total length of nouns and verbs after the text is segmented and stop words are removed;
Text length: the total length of the original social media text;
Repost count: the number of times the social media text has been reposted;
Comment count: the number of comments on the social media text;
Like count: the number of likes on the social media text.
4. A method for cleaning social media text data with multiple strategies, characterized in that the method comprises the following steps:
Step A: compute the similarity of social media texts based on an improved simhash algorithm, delete social media texts whose degree of repetition exceeds a set threshold, and record the users who published the repeated text;
Step B: identify marketing text based on the features of network marketing text and an SVM classifier, and record the users who publish network marketing text;
Step C: based on the first two steps, record a blacklist of users who publish "marketing text" and "repeated text", count how often each blacklisted user publishes spam text, deem users with a high frequency to be spam users, and delete the social media data published by such users.
5. the method for more strategy cleaning social media text datas according to claim 4, it is characterised in that:
Sub-step A1:Social media text segments, and removes stop words, builds the word collection S of text;
Sub-step A2:Feature selecting is carried out to S words collection, forms one group of vectorial D being made up of weighting word;
Sub-step A3:A 64 dimensional vector V are initialized, each element initial value in vector is arranged to 0;
Each word in word collection S is calculated as below:By each word(word)One is obtained after being calculated using Hash functions Each of the signature f of individual 64,64 signature f of traversal, if the word is 0 in i-th bit, are subtracted from vectorial V i-th dimension The weight D of this word(word);
Complete in S after whole words calculating, an article is mapped to 64 dimensional vector g;
Sub-step A4:If g i-th dimension is more than 0, the i-th bit of 64 fingerprints is set to 1, is otherwise set to 0 so that Yi Tiaoshe The fingerprint code G for handing over media text to be mapped as 64;
Sub-step A5:To the fingerprint code G of different articles, similarity is calculated using cosine distance, more than then recognizing for threshold value It is set to repeated text, records the issue user of these repeated texts into blacklist.
6. the method for more strategy cleaning social media text datas according to claim 4, it is characterised in that the marketing Text identification module carries out rubbish text using SVM models and identified with classifying, and the feature of selection includes the content spy of marketing text To seek peace surface, the marketing text that will identify that is saved in rubbish text corpus, constantly expands the training sample of model, And the issue user of marketing text data is recorded, it is added in subscriber blacklist.
7. the method for more strategy cleaning social media text datas according to claim 4, it is characterised in that step C bags Include:Repeated text, marketing text progress frequency statistics are issued to the user in subscriber blacklist, judge that the too high user of the frequency is Junk user;To non-duplicate text, non-marketing text, confirm the issue user of the marketing text, determine whether junk user, Filter out all social media data of junk user issue.
CN201710873539.1A 2017-09-25 2017-09-25 System and method for cleaning social media text data by multiple strategies Active CN107633077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710873539.1A CN107633077B (en) 2017-09-25 2017-09-25 System and method for cleaning social media text data by multiple strategies

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710873539.1A CN107633077B (en) 2017-09-25 2017-09-25 System and method for cleaning social media text data by multiple strategies

Publications (2)

Publication Number Publication Date
CN107633077A true CN107633077A (en) 2018-01-26
CN107633077B CN107633077B (en) 2020-12-18

Family

ID=61103475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710873539.1A Active CN107633077B (en) 2017-09-25 2017-09-25 System and method for cleaning social media text data by multiple strategies

Country Status (1)

Country Link
CN (1) CN107633077B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108650546A (en) * 2018-05-11 2018-10-12 武汉斗鱼网络科技有限公司 Barrage processing method, computer readable storage medium and electronic equipment
CN108959484A (en) * 2018-06-21 2018-12-07 中国人民解放军战略支援部队信息工程大学 More tactful media data filtration methods and its device towards event detection
CN110008282A (en) * 2019-03-12 2019-07-12 平安信托有限责任公司 Transaction data synchronization interconnection method, device, computer equipment and storage medium
CN110516066A (en) * 2019-07-23 2019-11-29 同盾控股有限公司 A kind of content of text safety protecting method and device
CN111198992A (en) * 2020-01-07 2020-05-26 精硕科技(北京)股份有限公司 Identification method and identification device for mother and infant crowd, electronic equipment and storage medium
CN112699949A (en) * 2021-01-05 2021-04-23 百威投资(中国)有限公司 Potential user identification method and device based on social platform data
CN116932526A (en) * 2023-09-19 2023-10-24 天泽智慧科技(成都)有限公司 Text deduplication method for open source information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160057159A1 (en) * 2014-08-22 2016-02-25 Syracuse University Semantics-aware android malware classification
US20160188600A1 (en) * 2014-12-31 2016-06-30 Facebook, Inc Content Quality Evaluation and Classification
CN105956184A (en) * 2016-06-01 2016-09-21 西安交通大学 Method for identifying collaborative and organized junk information release team in micro-blog social network
KR101773911B1 (en) * 2016-09-27 2017-09-01 주식회사 케이앤컴퍼니 Apparatus for estimating market price of real estate using market price and officially assessed price and method thereof

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108650546A (en) * 2018-05-11 2018-10-12 武汉斗鱼网络科技有限公司 Barrage processing method, computer readable storage medium and electronic equipment
CN108650546B (en) * 2018-05-11 2021-07-23 武汉斗鱼网络科技有限公司 Barrage processing method, computer-readable storage medium and electronic device
CN108959484A (en) * 2018-06-21 2018-12-07 中国人民解放军战略支援部队信息工程大学 More tactful media data filtration methods and its device towards event detection
CN108959484B (en) * 2018-06-21 2020-07-28 中国人民解放军战略支援部队信息工程大学 Multi-strategy media data stream filtering method and device for event detection
CN110008282A (en) * 2019-03-12 2019-07-12 平安信托有限责任公司 Transaction data synchronization interconnection method, device, computer equipment and storage medium
CN110516066A (en) * 2019-07-23 2019-11-29 同盾控股有限公司 A kind of content of text safety protecting method and device
CN111198992A (en) * 2020-01-07 2020-05-26 精硕科技(北京)股份有限公司 Identification method and identification device for mother and infant crowd, electronic equipment and storage medium
CN112699949A (en) * 2021-01-05 2021-04-23 百威投资(中国)有限公司 Potential user identification method and device based on social platform data
CN112699949B (en) * 2021-01-05 2023-05-26 百威投资(中国)有限公司 Potential user identification method and device based on social platform data
CN116932526A (en) * 2023-09-19 2023-10-24 天泽智慧科技(成都)有限公司 Text deduplication method for open source information
CN116932526B (en) * 2023-09-19 2023-11-24 天泽智慧科技(成都)有限公司 Text deduplication method for open source information

Also Published As

Publication number Publication date
CN107633077B (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN107633077A (en) A kind of system and method for more strategy cleaning social media text datas
CN102591854B (en) For advertisement filtering system and the filter method thereof of text feature
CN107437038B (en) Webpage tampering detection method and device
Li et al. Twiner: named entity recognition in targeted twitter stream
CN107943941B (en) Junk text recognition method and system capable of being updated iteratively
CN105787025B (en) Network platform public account classification method and device
CN103246670B (en) Microblogging sequence, search, methods of exhibiting and system
CN103996130B (en) A kind of information on commodity comment filter method and system
CN107341716A (en) A kind of method, apparatus and electronic equipment of the identification of malice order
CN101784022A (en) Method and system for filtering and classifying short messages
CN102760153A (en) Incorporating lexicon knowledge to improve sentiment classification
CN109960763A (en) A kind of photography community personalization friend recommendation method based on user's fine granularity photography preference
CN107358075A (en) A kind of fictitious users detection method based on hierarchical clustering
Shen et al. Latent friend mining from blog data
CN104239539A (en) Microblog information filtering method based on multi-information fusion
CN108364199A (en) A kind of data analysing method and system based on Internet user's comment
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
CN104933191A (en) Spam comment recognition method and system based on Bayesian algorithm and terminal
CN108268554A (en) A kind of method and apparatus for generating filtering junk short messages strategy
CN110134792A (en) Text recognition method, device, electronic equipment and storage medium
CN107544961A (en) A kind of sentiment analysis method, equipment and its storage device of social media comment
CN106547875A (en) A kind of online incident detection method of the microblogging based on sentiment analysis and label
CN107741958A (en) A kind of data processing method and system
CN102945246A (en) Method and device for processing network information data
Liu et al. Location type classification using tweet content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant