CN108009249A - Spam comment filtering method for unbalanced data and fusing user behavior rules - Google Patents

Spam comment filtering method for unbalanced data and fusing user behavior rules

Info

Publication number
CN108009249A
CN108009249A
Authority
CN
China
Prior art keywords
data
sample
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711247021.3A
Other languages
Chinese (zh)
Other versions
CN108009249B (en)
Inventor
丁茜
武琼
孙剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Television Information Technology (Beijing) Co Ltd
Original Assignee
China Television Information Technology (Beijing) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Television Information Technology (Beijing) Co Ltd
Priority to CN201711247021.3A priority Critical patent/CN108009249B/en
Publication of CN108009249A publication Critical patent/CN108009249A/en
Application granted granted Critical
Publication of CN108009249B publication Critical patent/CN108009249B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The present invention provides a spam comment filtering method for unbalanced data fusing user behavior rules, including: reconstructing the minority-class sample data and undersampling the majority-class sample data in the majority-class sample data set, and recomposing a balanced training sample corpus; using the training sample corpus to extract classification feature words and train the constructed Bayesian model; invoking a user behavior rule model to detect whether the comment data to be classified is a spam comment and to preliminarily classify it; extracting features from the comment data to be classified and classifying it with the trained Bayesian classification model; and using the Adaboosting algorithm to perform ensemble learning on the user behavior rule model and the Bayesian classification model, training with labeled training sample data to obtain the final classification result of the comment data to be classified. Advantage: the efficiency of spam comment filtering is comprehensively improved.

Description

Spam comment filtering method for unbalanced data and fusing user behavior rules
Technical field
The invention belongs to the technical field of spam comment filtering, and in particular relates to a spam comment filtering method for unbalanced data fusing user behavior rules.
Background art
With the rapid development of the Internet, more and more people communicate and state their viewpoints and attitudes by posting comments of all kinds. At the same time, this also provides convenience for some malicious users, who post large numbers of advertising, promotional, and abusive comments on platforms intended for normal commenting, so that users cannot obtain useful information and the mining of information is hindered. Spam comment filtering is therefore extremely important.
From a technical point of view, the filtering of such spam comments belongs to the category of text classification. Existing spam comment filtering techniques generally fall into two kinds: supervised learning methods and unsupervised learning methods. Unsupervised learning requires no manually pre-labeled corpus to filter comments by category; however, for the various nonstandard, colloquial comments produced on the Internet, unsupervised learning methods cannot accurately identify and filter out spam comments. Compared with unsupervised learning, supervised learning can to a certain extent handle heavily colloquial text classification, but it requires a large manually labeled corpus in advance, which costs considerable manpower and material resources. Some experts and scholars have attempted to solve this problem with machine learning methods, such as the existing Bayesian classification algorithm for spam filtering. Although that method works well for filtering spam e-mail, its premise is that the training corpus is balanced. In real network media, however, users' comment forms vary and negative examples of spam comments are relatively difficult to collect, so the number of samples of one class in the training data set easily becomes far smaller than that of the other class, causing classification performance to decline sharply. Moreover, in the network media the expression forms of user comments are changeable, and advertisements in particular vary widely, which increases the difficulty of new word recognition and advertisement identification.
Summary of the invention
In view of the defects existing in the prior art, the present invention provides a spam comment filtering method for unbalanced data fusing user behavior rules, which can effectively solve the above problems.
The technical solution adopted by the present invention is as follows:
The present invention provides a spam comment filtering method for unbalanced data fusing user behavior rules, including the following steps:
Step 1, crawl comment sample data on network media channels to form a comment sample data set, and label the comment data type of each comment sample; wherein the comment data type includes a positive example sample type and a negative example sample type.
Step 2, count the numbers of positive example samples and negative example samples in the comment sample data set. If the number of positive example samples is greater than the number of negative example samples, take the positive example samples as the majority-class sample data and the negative example samples as the minority-class sample data; conversely, if the number of positive example samples is smaller than the number of negative example samples, take the positive example samples as the minority-class sample data and the negative example samples as the majority-class sample data. The comment sample data set is thereby divided into a minority-class sample data set and a majority-class sample data set.
Step 3, reconstruct the minority-class samples in the minority-class sample data set, undersample the majority-class samples in the majority-class sample data set, and recompose a balanced training sample corpus. Each training sample in the corpus carries a labeled comment data type, so the training samples are distinguished into positive example training samples and negative example training samples; the corpus includes a positive example training sample library and a negative example training sample library.
Step 4, using the training sample corpus, extract classification feature words and train the constructed Bayesian model. This step specifically includes:
Step 4.1, calculate the prior probability P(c1) of the positive example training samples and the prior probability P(c2) of the negative example training samples according to the formulas:
P(c1) = n1/N
P(c2) = n2/N
wherein: N denotes the total number of training samples in the training sample corpus; n1 denotes the number of positive example training samples in the corpus; n2 denotes the number of negative example training samples in the corpus.
Step 4.2, segment each training sample into words and, based on the following formula, use the information gain method to calculate the contribution value IG(w) of each word w to the classification model:
IG(w) = -[P(c1)·log P(c1) + P(c2)·log P(c2)] + P(w)·[P(c1|w)·log P(c1|w) + P(c2|w)·log P(c2|w)] + P(w̄)·[P(c1|w̄)·log P(c1|w̄) + P(c2|w̄)·log P(c2|w̄)]
wherein:
P(w) denotes the occurrence probability of training samples containing word w in the corpus, i.e.: P(w) = (number of training samples containing word w) / N;
P(c1|w) denotes the conditional probability that a training sample containing word w is a positive example, i.e.: P(c1|w) = (number of positive example training samples containing w) / (number of training samples containing w);
P(c2|w) denotes the conditional probability that a training sample containing word w is a negative example, i.e.: P(c2|w) = (number of negative example training samples containing w) / (number of training samples containing w);
P(w̄) denotes the occurrence probability of training samples not containing word w in the corpus, i.e.: P(w̄) = 1 - P(w);
P(c1|w̄) denotes the conditional probability that a training sample not containing word w is a positive example, i.e.: P(c1|w̄) = (number of positive example training samples not containing w) / (number of training samples not containing w);
P(c2|w̄) denotes the conditional probability that a training sample not containing word w is a negative example, i.e.: P(c2|w̄) = (number of negative example training samples not containing w) / (number of training samples not containing w).
Step 4.3, preset a contribution threshold and extract the words whose contribution value exceeds the threshold as feature words. Supposing m feature words are extracted in total, denoted w1, w2, ..., wm, this gives the feature word set F = {w1, w2, ..., wm}.
Step 4.4, for any feature word wi in the feature word set, i = 1, 2, ..., m, calculate with the following formulas the conditional probability P(wi|c1) that wi belongs to the positive example training sample class and the conditional probability P(wi|c2) that wi belongs to the negative example training sample class:
P(wi|c1) = n(c1,wi) / n(c1)
P(wi|c2) = n(c2,wi) / n(c2)
wherein:
n(c1,wi) denotes the number of positive example training samples in the corpus in which feature word wi appears;
n(c1) denotes the number of positive example training samples in the corpus;
n(c2,wi) denotes the number of negative example training samples in the corpus in which feature word wi appears;
n(c2) denotes the number of negative example training samples in the corpus.
Step 5, for the comment data to be classified, first perform sensitive word detection. If a sensitive word is present, the comment is directly regarded as a spam comment; if no sensitive word is present, proceed to step 6.
Step 6, perform new word discovery detection on the comment data that passed the detection of step 5. If it contains new words, label the new words with a comment data type and add them to the training sample corpus, extending and updating it; if it contains no new words, proceed to step 7.
Step 7, invoke the user behavior rule model to detect whether the comment data to be classified is a spam comment, and preliminarily classify the comment data; then proceed to step 8.
Step 8, extract features from the comment data to be classified and classify it with the trained Bayesian classification model. This step specifically includes:
Step 8.1, perform word segmentation preprocessing on the comment data to be classified, obtaining several segmented words. From these, extract as feature words the segments that belong to the feature word set F obtained in step 4.3. Supposing s feature words are extracted in total, denoted x1, x2, ..., xs, this gives the feature word vector X = {x1, x2, ..., xs} that reflects the class tendency of the comment.
Step 8.2, calculate with the following formulas the prior conditional probability P(X|c1) that the comment data to be classified occurs in the positive example training sample library and the prior conditional probability P(X|c2) that it occurs in the negative example training sample library:
P(X|c1) = ∏_{j=1..s} [uj·p(xj|c1) + (1 - uj)·(1 - p(xj|c1))]
P(X|c2) = ∏_{j=1..s} [uj·p(xj|c2) + (1 - uj)·(1 - p(xj|c2))]
wherein:
uj indicates whether the comment data to be classified contains feature word xj: if it contains xj, uj is 1; otherwise uj is 0;
p(xj|c1) denotes the conditional probability that feature word xj belongs to the positive example training sample class, equal to the value P(wi|c1) calculated in step 4.4 for the same feature word;
p(xj|c2) denotes the conditional probability that feature word xj belongs to the negative example training sample class, equal to the value P(wi|c2) calculated in step 4.4 for the same feature word.
Step 8.3, calculate with the following formula the probability P(X) that the comment data to be classified occurs in the training sample corpus:
P(X) = P(X|c1)·P(c1) + P(X|c2)·P(c2)
Step 8.4, calculate with the following formulas the probability P(c1|X) that the comment data to be classified belongs to the positive example comment class and the probability P(c2|X) that it belongs to the negative example comment class:
P(c1|X) = P(X|c1)·P(c1) / P(X)
P(c2|X) = P(X|c2)·P(c2) / P(X)
Step 9, using the Adaboosting algorithm, perform ensemble learning on the user behavior rule model and the Bayesian classification model, train with the labeled training sample data, and obtain the final classification result Qk(X) of the comment data to be classified with the following formula:
Qk(X) = Σ_{t=1..K} μt·ft(X)
wherein: K is the number of classification models, here the constant 2;
μt denotes the weight of the t-th classification model in the final combined classifier and is always greater than 0;
ft(X) denotes the classification judgment of the t-th classification model in the X-dimensional feature word vector space, where f1(X) denotes the classification judgment of the Bayesian classification model and f2(X) denotes the classification judgment of the user behavior rule model, i.e., the preliminary classification result obtained in step 7.
Preferably, in step 1, the positive example sample type is the normal comment sample type, and the negative example sample type is the spam comment sample type.
Preferably, in step 3, the reconstruction of the minority-class samples in the minority-class sample data set is specifically:
Step 3.1, suppose the minority-class sample data set contains r minority-class samples in total, denoted: minority-class sample L1, minority-class sample L2, ..., minority-class sample Lr.
For each minority-class sample Lh, h = 1, 2, ..., r, calculate the Euclidean distance from Lh to each of the other r - 1 minority-class samples in the set.
Step 3.2, preset a neighbor selection number d, and select the d minority-class samples nearest to Lh as the neighbors of Lh.
Step 3.3, set a sampling multiplier e according to the actual sample ratio and the number of balanced training samples required by the training model.
For each minority-class sample Lh, perform e samplings; in each sampling, randomly select one neighbor from its d neighbors. Supposing the neighbor selected in this sampling is Lq, q = 1, 2, ..., r, the new minority-class sample Lnew constructed by this sampling is:
Lnew = Lh + rand(0,1) × (Lq - Lh)
wherein rand(0,1) denotes a random number generated between 0 and 1, so that the new minority-class sample Lnew lies between the minority-class sample Lh and its neighbor.
New minority-class samples are thus constructed, extending the set.
Preferably, step 9 further includes: according to the misclassification rate error(t) of each classification model trained on the training data, automatically adjusting the weight of the corresponding model with the following formula:
μt = log[(1 - error(t)) / error(t)]
The spam comment filtering method for unbalanced data fusing user behavior rules provided by the invention has the following advantages:
The invention constructs new minority-class samples with a specific method and performs multiple undersamplings of the majority-class samples, recomposing a balanced training sample corpus so that the training sample data reach an equilibrium state; data features are extracted with the characteristics of unbalanced data taken into account, and comments are then classified with the Bayesian classification algorithm. For changeable data such as advertisements and pyramid-scheme promotions, the behavior rules of users are used to make the judgment, and finally the Adaboosting method is used to perform unified ensemble-learning classification of the comment data. The whole spam comment filtering system recognizes newly produced words on the network with the CRF algorithm and continuously updates the corpus, so that the method is better suited to the spam comment filtering scenario of the network media, thereby comprehensively improving the efficiency of spam comment filtering.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the spam comment filtering method for unbalanced data fusing user behavior rules provided by the invention.
Embodiment
In order to make the technical problems solved by the invention, the technical solutions, and the beneficial effects clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.
The present invention provides a spam comment filtering method for unbalanced data fusing user behavior rules, belonging to text classification technology in the natural language processing direction of the artificial intelligence field. The invention can be applied to filtering systems for spam comments on media television programs or comments posted by netizens, solving the filtering problem of unbalanced, changeable spam comment data with a method based on integrated sample updating and user behavior rules. The central idea of the invention is: in unbalanced classification based on supervised learning, in order to make full use of the information in the majority-class samples, a training corpus for the ensemble learning algorithm is built by performing multiple undersamplings of the majority-class samples together with resampling of the minority-class samples, and the positive and negative comment sample data fed back under the actual scenario are continuously updated, thereby solving the imbalance problem in spam comment filtering. For junk data such as the various advertisements and pyramid-scheme promotions produced online, new word discovery and corpus updating are performed with conditional random fields (CRF), and user behavior rules are combined to identify spam comments; at the same time, unified classification learning of the comment data is performed in an ensemble-learning manner based on the Bayesian classification algorithm, filtering out spam comment data.
With reference to Fig. 1, the spam comment filtering method for unbalanced data fusing user behavior rules provided by the invention addresses the unbalanced comment tendencies in real network media and the changeable, short, and nonstandard characteristics of user comment data, and designs an ensemble-learning spam comment filtering system fusing user behavior rules for unbalanced data. It specifically includes the following steps:
Step 1, crawl comment sample data on network media channels to form a comment sample data set, and label the comment data type of each comment sample; wherein the comment data type includes a positive example sample type and a negative example sample type. The positive example sample type is the normal comment sample type, and the negative example sample type is the spam comment sample type.
Step 2, count the numbers of positive example samples and negative example samples in the comment sample data set. If the number of positive example samples is far greater than the number of negative example samples, take the positive example samples as the majority-class sample data and the negative example samples as the minority-class sample data; conversely, if the number of positive example samples is far smaller than the number of negative example samples, take the positive example samples as the minority-class sample data and the negative example samples as the majority-class sample data. The comment sample data set is thereby divided into a minority-class sample data set and a majority-class sample data set.
Step 3, reconstruct the minority-class samples in the minority-class sample data set, undersample the majority-class samples in the majority-class sample data set, and recompose a balanced training sample corpus. Each training sample in the corpus carries a labeled comment data type, so the training samples are distinguished into positive example training samples and negative example training samples; the corpus includes a positive example training sample library and a negative example training sample library.
In this step, the reconstruction of the minority-class samples in the minority-class sample data set is specifically:
Step 3.1, suppose the minority-class sample data set contains r minority-class samples in total, denoted: minority-class sample L1, minority-class sample L2, ..., minority-class sample Lr.
For each minority-class sample Lh, h = 1, 2, ..., r, calculate the Euclidean distance from Lh to each of the other r - 1 minority-class samples in the set.
Step 3.2, preset a neighbor selection number d, and select the d minority-class samples nearest to Lh as the neighbors of Lh.
Step 3.3, set a sampling multiplier e according to the actual sample ratio and the number of balanced training samples required by the training model.
For each minority-class sample Lh, perform e samplings; in each sampling, randomly select one neighbor from its d neighbors. Supposing the neighbor selected in this sampling is Lq, q = 1, 2, ..., r, the new minority-class sample Lnew constructed by this sampling is:
Lnew = Lh + rand(0,1) × (Lq - Lh)
wherein rand(0,1) denotes a random number generated between 0 and 1, so that the new minority-class sample Lnew lies between the minority-class sample Lh and its neighbor.
New minority-class samples are thus constructed, extending the set.
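The following is a minimal Python sketch of this SMOTE-style reconstruction (steps 3.1 to 3.3, consistent with the smote algorithm named later in the description); the function name, the array layout (one feature vector per row), and the default parameter values are illustrative assumptions rather than part of the patent:

    import numpy as np

    def smote_like_oversample(minority, d=5, e=2, rng=None):
        """Construct new minority-class samples by interpolating toward
        randomly chosen nearest neighbors (steps 3.1-3.3)."""
        rng = rng or np.random.default_rng()
        minority = np.asarray(minority, dtype=float)  # r samples, one row each
        r = len(minority)
        synthetic = []
        for h in range(r):
            # Step 3.1: Euclidean distances from L_h to the other r-1 samples
            dists = np.linalg.norm(minority - minority[h], axis=1)
            dists[h] = np.inf                         # exclude L_h itself
            # Step 3.2: the d nearest neighbors of L_h
            neighbors = np.argsort(dists)[:d]
            # Step 3.3: e samplings, each interpolating toward a random neighbor
            for _ in range(e):
                q = rng.choice(neighbors)
                gap = rng.random()                    # rand(0,1)
                synthetic.append(minority[h] + gap * (minority[q] - minority[h]))
        return np.vstack([minority, np.array(synthetic)])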
Step 4, using the training sample corpus, extract classification feature words and train the constructed Bayesian model. This step specifically includes:
Step 4.1, calculate the prior probability P(c1) of the positive example training samples and the prior probability P(c2) of the negative example training samples according to the formulas:
P(c1) = n1/N
P(c2) = n2/N
wherein: N denotes the total number of training samples in the training sample corpus; n1 denotes the number of positive example training samples in the corpus; n2 denotes the number of negative example training samples in the corpus.
Step 4.2, segment each training sample into words and, based on the following formula, use the information gain method to calculate the contribution value IG(w) of each word w to the classification model:
IG(w) = -[P(c1)·log P(c1) + P(c2)·log P(c2)] + P(w)·[P(c1|w)·log P(c1|w) + P(c2|w)·log P(c2|w)] + P(w̄)·[P(c1|w̄)·log P(c1|w̄) + P(c2|w̄)·log P(c2|w̄)]
wherein:
P(w) denotes the occurrence probability of training samples containing word w in the corpus, i.e.: P(w) = (number of training samples containing word w) / N;
P(c1|w) denotes the conditional probability that a training sample containing word w is a positive example, i.e.: P(c1|w) = (number of positive example training samples containing w) / (number of training samples containing w);
P(c2|w) denotes the conditional probability that a training sample containing word w is a negative example, i.e.: P(c2|w) = (number of negative example training samples containing w) / (number of training samples containing w);
P(w̄) denotes the occurrence probability of training samples not containing word w in the corpus, i.e.: P(w̄) = 1 - P(w);
P(c1|w̄) denotes the conditional probability that a training sample not containing word w is a positive example, i.e.: P(c1|w̄) = (number of positive example training samples not containing w) / (number of training samples not containing w);
P(c2|w̄) denotes the conditional probability that a training sample not containing word w is a negative example, i.e.: P(c2|w̄) = (number of negative example training samples not containing w) / (number of training samples not containing w).
Step 4.3, preset a contribution threshold and extract the words whose contribution value exceeds the threshold as feature words. Supposing m feature words are extracted in total, denoted w1, w2, ..., wm, this gives the feature word set F = {w1, w2, ..., wm}.
Step 4.4, for any feature word wi in the feature word set, i = 1, 2, ..., m, calculate with the following formulas the conditional probability P(wi|c1) that wi belongs to the positive example training sample class and the conditional probability P(wi|c2) that wi belongs to the negative example training sample class:
P(wi|c1) = n(c1,wi) / n(c1)
P(wi|c2) = n(c2,wi) / n(c2)
wherein:
n(c1,wi) denotes the number of positive example training samples in the corpus in which feature word wi appears;
n(c1) denotes the number of positive example training samples in the corpus;
n(c2,wi) denotes the number of negative example training samples in the corpus in which feature word wi appears;
n(c2) denotes the number of negative example training samples in the corpus.
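A compact Python sketch of steps 4.1 to 4.4, assuming each training sample has already been segmented into a set of words; all function and variable names here are illustrative, and the 0·log 0 = 0 convention is an assumption made to keep the information gain well defined:

    import math
    from collections import Counter

    def train_bayes(pos_docs, neg_docs, threshold):
        """pos_docs/neg_docs: lists of word sets. Returns the priors, the
        feature word set, and per-class conditional probabilities."""
        n1, n2 = len(pos_docs), len(neg_docs)
        N = n1 + n2
        p_c1, p_c2 = n1 / N, n2 / N                  # step 4.1: priors

        docs_with = Counter()                        # word -> #samples containing it
        pos_with, neg_with = Counter(), Counter()
        for doc in pos_docs:
            for w in doc:
                docs_with[w] += 1
                pos_with[w] += 1
        for doc in neg_docs:
            for w in doc:
                docs_with[w] += 1
                neg_with[w] += 1

        def xlogx(p):                                # convention: 0 * log(0) = 0
            return p * math.log(p) if p > 0 else 0.0

        def info_gain(w):                            # step 4.2
            pw = docs_with[w] / N
            ig = -(xlogx(p_c1) + xlogx(p_c2))
            ig += pw * (xlogx(pos_with[w] / docs_with[w])
                        + xlogx(neg_with[w] / docs_with[w]))
            n_not = N - docs_with[w]                 # samples not containing w
            if n_not:
                ig += (1 - pw) * (xlogx((n1 - pos_with[w]) / n_not)
                                  + xlogx((n2 - neg_with[w]) / n_not))
            return ig

        features = [w for w in docs_with if info_gain(w) > threshold]       # step 4.3
        cond = {w: (pos_with[w] / n1, neg_with[w] / n2) for w in features}  # step 4.4
        return p_c1, p_c2, features, cond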
Step 5, for the comment data to be classified, first perform sensitive word detection. If a sensitive word is present, the comment is directly regarded as a spam comment; if no sensitive word is present, proceed to step 6.
The invention maintains a sensitive word dictionary locally (for example, gambling-related terms); if a sensitive word is present in a comment, the comment is regarded as a spam comment. If no sensitive word is contained, new word discovery detection is performed on the comment with the CRF++ algorithm; if the comment contains new words, they are added to the training corpus after manual labeling, extending and updating the corpus so that it better suits actual network media application scenarios. If there are no new words, the comment is passed to the user behavior rule model and the Bayesian classification model to be learned separately.
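A minimal sketch of this sensitive word gate follows; the dictionary contents are illustrative assumptions, and the CRF++-based new word discovery is not reproduced here since its feature templates are not specified in the patent:

    SENSITIVE_WORDS = {"gambling", "casino"}  # illustrative local dictionary

    def sensitive_word_gate(comment: str) -> bool:
        """Step 5: True if the comment contains any sensitive word and
        should be regarded as spam immediately."""
        return any(word in comment for word in SENSITIVE_WORDS)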
Step 6, perform new word discovery detection on the comment data that passed the detection of step 5. If it contains new words, label the new words with a comment data type and add them to the training sample corpus, extending and updating it; if it contains no new words, proceed to step 7.
Step 7, invoke the user behavior rule model to detect whether the comment data to be classified is a spam comment, and preliminarily classify the comment data; then proceed to step 8.
User behavior rule model: this model is based mainly on the behavioral habits of users who post spam comments; for example, advertisement comments generally contain "QQ" plus digits or "WeChat" plus digits. A number of comment classification rules are formulated accordingly, and a locally generated deformation dictionary (covering deformations of words such as QQ, WeChat, and official-account names) is then used to classify the comment data to be classified.
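Such a rule model can be sketched as regular expressions, as below; the specific patterns, the deformation dictionary entries, and the minimum digit count are illustrative assumptions:

    import re

    # Illustrative deformation dictionary: common disguises of "QQ"/"WeChat"
    DEFORMATIONS = {
        "qq": ["qq", "Q Q", "扣扣"],
        "wechat": ["wechat", "微信", "V信", "vx"],
    }

    # Rules of the form "contact word followed by a digit string" ("QQ"+digits)
    RULES = [
        re.compile(r"(?:%s)\s*[:：]?\s*\d{5,}" % "|".join(map(re.escape, forms)),
                   re.IGNORECASE)
        for forms in DEFORMATIONS.values()
    ]

    def rule_model(comment: str) -> int:
        """Step 7: preliminary classification by behavior rules.
        Returns 1 (spam) if any rule fires, otherwise 0 (normal)."""
        return int(any(rule.search(comment) for rule in RULES))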
Step 8, extract features from the comment data to be classified and classify it with the trained Bayesian classification model. This step specifically includes:
Step 8.1, perform word segmentation preprocessing on the comment data to be classified, obtaining several segmented words. From these, extract as feature words the segments that belong to the feature word set F obtained in step 4.3. Supposing s feature words are extracted in total, denoted x1, x2, ..., xs, this gives the feature word vector X = {x1, x2, ..., xs} that reflects the class tendency of the comment.
Step 8.2, calculate with the following formulas the prior conditional probability P(X|c1) that the comment data to be classified occurs in the positive example training sample library and the prior conditional probability P(X|c2) that it occurs in the negative example training sample library:
P(X|c1) = ∏_{j=1..s} [uj·p(xj|c1) + (1 - uj)·(1 - p(xj|c1))]
P(X|c2) = ∏_{j=1..s} [uj·p(xj|c2) + (1 - uj)·(1 - p(xj|c2))]
wherein:
uj indicates whether the comment data to be classified contains feature word xj: if it contains xj, uj is 1; otherwise uj is 0;
p(xj|c1) denotes the conditional probability that feature word xj belongs to the positive example training sample class, equal to the value P(wi|c1) calculated in step 4.4 for the same feature word;
p(xj|c2) denotes the conditional probability that feature word xj belongs to the negative example training sample class, equal to the value P(wi|c2) calculated in step 4.4 for the same feature word.
Step 8.3, calculate with the following formula the probability P(X) that the comment data to be classified occurs in the training sample corpus:
P(X) = P(X|c1)·P(c1) + P(X|c2)·P(c2)
Step 8.4, calculate with the following formulas the probability P(c1|X) that the comment data to be classified belongs to the positive example comment class and the probability P(c2|X) that it belongs to the negative example comment class:
P(c1|X) = P(X|c1)·P(c1) / P(X)
P(c2|X) = P(X|c2)·P(c2) / P(X)
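A sketch of this scoring in Python, reusing the output of the train_bayes sketch above and interpreting steps 8.2 to 8.4 as a Bernoulli-style naive Bayes over the feature word set; the handling of the degenerate P(X) = 0 case is an illustrative assumption:

    def bayes_classify(comment_words, p_c1, p_c2, features, cond):
        """Steps 8.1-8.4. comment_words: set of segmented words of the
        comment to classify. Returns (P(c1|X), P(c2|X))."""
        p_x_c1 = p_x_c2 = 1.0
        for w in features:                  # u_j = 1 iff the comment contains w
            p1, p2 = cond[w]
            if w in comment_words:
                p_x_c1 *= p1
                p_x_c2 *= p2
            else:
                p_x_c1 *= (1.0 - p1)
                p_x_c2 *= (1.0 - p2)
        p_x = p_x_c1 * p_c1 + p_x_c2 * p_c2          # step 8.3
        if p_x == 0.0:
            return 0.5, 0.5                          # no evidence either way
        return (p_x_c1 * p_c1 / p_x,                 # P(c1|X), step 8.4
                p_x_c2 * p_c2 / p_x)                 # P(c2|X)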
Step 9, using the Adaboosting algorithm, perform ensemble learning on the user behavior rule model and the Bayesian classification model, train with the labeled training sample data, and obtain the final classification result Qk(X) of the comment data to be classified with the following formula:
Qk(X) = Σ_{t=1..K} μt·ft(X)
wherein: K is the number of classification models, here the constant 2;
μt denotes the weight of the t-th classification model in the final combined classifier and is always greater than 0;
ft(X) denotes the classification judgment of the t-th classification model in the X-dimensional feature word vector space. Since in practice only two classification models are involved in the invention, f1(X) denotes the classification judgment of the Bayesian classification model in the X-dimensional feature word vector space, and f2(X) denotes the classification judgment of the user behavior rule model, i.e., the preliminary classification result obtained in step 7.
The misclassification rate error(t) of each classification model trained on the training data is used to automatically adjust the weight of the corresponding model with the following formula:
μt = log[(1 - error(t)) / error(t)]
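The weighted combination can be sketched as below, building on the bayes_classify and rule_model sketches above; the spam decision threshold (a weighted majority vote) is an illustrative assumption:

    import math

    def model_weight(error_rate: float) -> float:
        """mu_t = log((1 - error(t)) / error(t)); positive whenever the
        model does better than chance (0 < error(t) < 0.5)."""
        return math.log((1.0 - error_rate) / error_rate)

    def ensemble_classify(comment_words, comment_text, bayes_args, errors):
        """Step 9: combine the Bayesian model (f1) and the rule model (f2)."""
        f1 = bayes_classify(comment_words, *bayes_args)[1]  # P(c2|X), spam prob.
        f2 = rule_model(comment_text)                       # 1 if a rule fired
        mu1, mu2 = model_weight(errors[0]), model_weight(errors[1])
        score = mu1 * f1 + mu2 * f2                         # Q_k(X)
        return "spam" if score > 0.5 * (mu1 + mu2) else "normal"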
The main features of the invention include:
1. The invention reconstructs the minority-class sample data with the smote algorithm and undersamples the majority-class sample data, recomposing a balanced training sample corpus;
2. New word discovery detection is performed on comment data with the CRF++ algorithm, and the training corpus is continuously and automatically updated;
3. The invention constructs a user behavior rule model to detect whether comment data is a spam comment such as an advertisement or pyramid-scheme promotion;
4. The information gain method is used to extract features from the comment data to be classified, and the Bayesian model is used for classification;
5. The Adaboosting ensemble learning method is used to learn from the user behavior rule model and the Bayesian model and to classify the comment data.
The spam comment filtering method for unbalanced data fusing user behavior rules provided by the invention has the following advantages:
The invention constructs new minority-class samples with a specific method and performs multiple undersamplings of the majority-class samples, recomposing a balanced training sample corpus so that the training sample data reach an equilibrium state; data features are extracted with the characteristics of unbalanced data taken into account, and comments are then classified with the Bayesian classification algorithm. For changeable data such as advertisements and pyramid-scheme promotions, the behavior rules of users are used to make the judgment, and finally the Adaboosting method is used to perform unified ensemble-learning classification of the comment data. The whole spam comment filtering system recognizes newly produced words on the network with the CRF algorithm and continuously updates the corpus, so that the method is better suited to the spam comment filtering scenario of the network media, thereby comprehensively improving the efficiency of spam comment filtering.
The above is only the preferred embodiment of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (4)

  1. A spam comment filtering method for unbalanced data fusing user behavior rules, characterized by including the following steps:
    Step 1, crawling comment sample data on network media channels to form a comment sample data set, and labeling the comment data type of each comment sample; wherein the comment data type includes a positive example sample type and a negative example sample type;
    Step 2, counting the numbers of positive example samples and negative example samples in the comment sample data set; if the number of positive example samples is greater than the number of negative example samples, taking the positive example samples as the majority-class sample data and the negative example samples as the minority-class sample data; conversely, if the number of positive example samples is smaller than the number of negative example samples, taking the positive example samples as the minority-class sample data and the negative example samples as the majority-class sample data; the comment sample data set is thereby divided into a minority-class sample data set and a majority-class sample data set;
    Step 3, reconstructing the minority-class samples in the minority-class sample data set, undersampling the majority-class samples in the majority-class sample data set, and recomposing a balanced training sample corpus; wherein each training sample in the corpus carries a labeled comment data type, so the training samples are distinguished into positive example training samples and negative example training samples, and the corpus includes a positive example training sample library and a negative example training sample library;
    Step 4, using the training sample corpus, extracting classification feature words and training the constructed Bayesian model; this step specifically includes:
    Step 4.1, calculating the prior probability P(c1) of the positive example training samples and the prior probability P(c2) of the negative example training samples according to the formulas:
    P(c1) = n1/N
    P(c2) = n2/N
    wherein: N denotes the total number of training samples in the training sample corpus; n1 denotes the number of positive example training samples in the corpus; n2 denotes the number of negative example training samples in the corpus;
    Step 4.2, segmenting each training sample into words and, based on the following formula, calculating with the information gain method the contribution value IG(w) of each word w to the classification model:
    IG(w) = -[P(c1)·log P(c1) + P(c2)·log P(c2)] + P(w)·[P(c1|w)·log P(c1|w) + P(c2|w)·log P(c2|w)] + P(w̄)·[P(c1|w̄)·log P(c1|w̄) + P(c2|w̄)·log P(c2|w̄)]
    wherein:
    P(w) denotes the occurrence probability of training samples containing word w in the corpus, i.e.: P(w) = (number of training samples containing word w) / N;
    P(c1|w) denotes the conditional probability that a training sample containing word w is a positive example, i.e.: P(c1|w) = (number of positive example training samples containing w) / (number of training samples containing w);
    P(c2|w) denotes the conditional probability that a training sample containing word w is a negative example, i.e.: P(c2|w) = (number of negative example training samples containing w) / (number of training samples containing w);
    P(w̄) denotes the occurrence probability of training samples not containing word w in the corpus, i.e.: P(w̄) = 1 - P(w);
    P(c1|w̄) denotes the conditional probability that a training sample not containing word w is a positive example, i.e.: P(c1|w̄) = (number of positive example training samples not containing w) / (number of training samples not containing w);
    P(c2|w̄) denotes the conditional probability that a training sample not containing word w is a negative example, i.e.: P(c2|w̄) = (number of negative example training samples not containing w) / (number of training samples not containing w);
    Step 4.3, presetting a contribution threshold and extracting the words whose contribution value exceeds the threshold as feature words; supposing m feature words are extracted in total, denoted w1, w2, ..., wm, this gives the feature word set F = {w1, w2, ..., wm};
    Step 4.4, for any feature word wi in the feature word set, i = 1, 2, ..., m, calculating with the following formulas the conditional probability P(wi|c1) that wi belongs to the positive example training sample class and the conditional probability P(wi|c2) that wi belongs to the negative example training sample class:
    P(wi|c1) = n(c1,wi) / n(c1)
    P(wi|c2) = n(c2,wi) / n(c2)
    wherein:
    n(c1,wi) denotes the number of positive example training samples in the corpus in which feature word wi appears;
    n(c1) denotes the number of positive example training samples in the corpus;
    n(c2,wi) denotes the number of negative example training samples in the corpus in which feature word wi appears;
    n(c2) denotes the number of negative example training samples in the corpus;
    Step 5, for the comment data to be classified, first performing sensitive word detection; if a sensitive word is present, the comment is directly regarded as a spam comment; if no sensitive word is present, proceeding to step 6;
    Step 6, performing new word discovery detection on the comment data that passed the detection of step 5; if it contains new words, labeling the new words with a comment data type and adding them to the training sample corpus, extending and updating it; if it contains no new words, proceeding to step 7;
    Step 7, invoking the user behavior rule model to detect whether the comment data to be classified is a spam comment, and preliminarily classifying the comment data; then proceeding to step 8;
    Step 8, extracting features from the comment data to be classified and classifying it with the trained Bayesian classification model; this step specifically includes:
    Step 8.1, performing word segmentation preprocessing on the comment data to be classified, obtaining several segmented words; from these, extracting as feature words the segments that belong to the feature word set F obtained in step 4.3; supposing s feature words are extracted in total, denoted x1, x2, ..., xs, this gives the feature word vector X = {x1, x2, ..., xs} that reflects the class tendency of the comment;
    Step 8.2, calculating with the following formulas the prior conditional probability P(X|c1) that the comment data to be classified occurs in the positive example training sample library and the prior conditional probability P(X|c2) that it occurs in the negative example training sample library:
    P(X|c1) = ∏_{j=1..s} [uj·p(xj|c1) + (1 - uj)·(1 - p(xj|c1))]
    P(X|c2) = ∏_{j=1..s} [uj·p(xj|c2) + (1 - uj)·(1 - p(xj|c2))]
    wherein:
    uj indicates whether the comment data to be classified contains feature word xj: if it contains xj, uj is 1; otherwise uj is 0;
    p(xj|c1) denotes the conditional probability that feature word xj belongs to the positive example training sample class, equal to the value P(wi|c1) calculated in step 4.4 for the same feature word;
    p(xj|c2) denotes the conditional probability that feature word xj belongs to the negative example training sample class, equal to the value P(wi|c2) calculated in step 4.4 for the same feature word;
    Step 8.3, calculating with the following formula the probability P(X) that the comment data to be classified occurs in the training sample corpus:
    P(X) = P(X|c1)·P(c1) + P(X|c2)·P(c2)
    Step 8.4, calculating with the following formulas the probability P(c1|X) that the comment data to be classified belongs to the positive example comment class and the probability P(c2|X) that it belongs to the negative example comment class:
    P(c1|X) = P(X|c1)·P(c1) / P(X)
    P(c2|X) = P(X|c2)·P(c2) / P(X)
    Step 9, using the Adaboosting algorithm, performing ensemble learning on the user behavior rule model and the Bayesian classification model, training with the labeled training sample data, and obtaining the final classification result Qk(X) of the comment data to be classified with the following formula:
    Qk(X) = Σ_{t=1..K} μt·ft(X)
    wherein: K is the number of classification models, here the constant 2;
    μt denotes the weight of the t-th classification model in the final combined classifier and is always greater than 0;
    ft(X) denotes the classification judgment of the t-th classification model in the X-dimensional feature word vector space, where f1(X) denotes the classification judgment of the Bayesian classification model in the X-dimensional feature word vector space and f2(X) denotes the classification judgment of the user behavior rule model, i.e., the preliminary classification result obtained in step 7.
  2. The spam comment filtering method for unbalanced data fusing user behavior rules according to claim 1, characterized in that, in step 1, the positive example sample type is the normal comment sample type, and the negative example sample type is the spam comment sample type.
  3. The spam comment filtering method for unbalanced data fusing user behavior rules according to claim 1, characterized in that, in step 3, the reconstruction of the minority-class samples in the minority-class sample data set is specifically:
    Step 3.1, supposing the minority-class sample data set contains r minority-class samples in total, denoted: minority-class sample L1, minority-class sample L2, ..., minority-class sample Lr;
    for each minority-class sample Lh, h = 1, 2, ..., r, calculating the Euclidean distance from Lh to each of the other r - 1 minority-class samples in the set;
    Step 3.2, presetting a neighbor selection number d, and selecting the d minority-class samples nearest to Lh as the neighbors of Lh;
    Step 3.3, setting a sampling multiplier e according to the actual sample ratio and the number of balanced training samples required by the training model;
    for each minority-class sample Lh, performing e samplings; in each sampling, randomly selecting one neighbor from its d neighbors; supposing the neighbor selected in this sampling is Lq, q = 1, 2, ..., r, the new minority-class sample Lnew constructed by this sampling is:
    Lnew = Lh + rand(0,1) × (Lq - Lh)
    wherein rand(0,1) denotes a random number generated between 0 and 1, so that the new minority-class sample Lnew lies between the minority-class sample Lh and its neighbor;
    new minority-class samples are thus constructed, extending the set.
  4. The spam comment filtering method for unbalanced data fusing user behavior rules according to claim 1, characterized in that step 9 further includes: according to the misclassification rate error(t) of the classification model trained on the training data, automatically adjusting the weight of the corresponding model with the following formula:
    μt = log[(1 - error(t)) / error(t)].
CN201711247021.3A 2017-12-01 2017-12-01 Spam comment filtering method for unbalanced data and fusing user behavior rules Active CN108009249B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711247021.3A CN108009249B (en) 2017-12-01 2017-12-01 Spam comment filtering method for unbalanced data and fusing user behavior rules

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711247021.3A CN108009249B (en) 2017-12-01 2017-12-01 Spam comment filtering method for unbalanced data and fusing user behavior rules

Publications (2)

Publication Number Publication Date
CN108009249A true CN108009249A (en) 2018-05-08
CN108009249B CN108009249B (en) 2020-08-18

Family

ID=62055753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711247021.3A Active CN108009249B (en) 2017-12-01 2017-12-01 Spam comment filtering method for unbalanced data and fusing user behavior rules

Country Status (1)

Country Link
CN (1) CN108009249B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090055412A1 (en) * 2007-08-24 2009-02-26 Shaun Cooley Bayesian Surety Check to Reduce False Positives in Filtering of Content in Non-Trained Languages
CN105306296A (en) * 2015-10-21 2016-02-03 北京工业大学 Data filter processing method based on LTE (Long Term Evolution) signaling
CN106973057A (en) * 2017-03-31 2017-07-21 浙江大学 A kind of sorting technique suitable for intrusion detection

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717408A (en) * 2018-05-11 2018-10-30 杭州排列科技有限公司 A kind of sensitive word method for real-time monitoring, electronic equipment, storage medium and system
CN108717408B (en) * 2018-05-11 2023-08-22 杭州排列科技有限公司 Sensitive word real-time monitoring method, electronic equipment, storage medium and system
CN108710614A (en) * 2018-05-31 2018-10-26 校宝在线(杭州)科技股份有限公司 A kind of composition evaluating method based on user behavior
CN109783586B (en) * 2019-01-21 2022-10-21 福州大学 Water army comment detection method based on clustering resampling
CN109783586A (en) * 2019-01-21 2019-05-21 福州大学 Waterborne troops's comment detection system and method based on cluster resampling
CN109858805A (en) * 2019-01-29 2019-06-07 浙江力嘉电子科技有限公司 Peasant household's rubbish based on interval estimation harvests quantity computation method
CN109858805B (en) * 2019-01-29 2022-12-16 浙江力嘉电子科技有限公司 Farmer garbage collection quantity calculation method based on interval estimation
CN110162621A (en) * 2019-02-22 2019-08-23 腾讯科技(深圳)有限公司 Disaggregated model training method, abnormal comment detection method, device and equipment
CN110162621B (en) * 2019-02-22 2023-05-23 腾讯科技(深圳)有限公司 Classification model training method, abnormal comment detection method, device and equipment
CN110399490A (en) * 2019-07-17 2019-11-01 武汉斗鱼网络科技有限公司 A kind of barrage file classification method, device, equipment and storage medium
CN110516058A (en) * 2019-08-27 2019-11-29 出门问问(武汉)信息科技有限公司 The training method and training device of a kind of pair of garbage classification problem
CN110688484A (en) * 2019-09-24 2020-01-14 北京工商大学 Microblog sensitive event speech detection method based on unbalanced Bayesian classification
CN113127640A (en) * 2021-03-12 2021-07-16 嘉兴职业技术学院 Malicious spam comment attack identification method based on natural language processing
CN113630336A (en) * 2021-07-19 2021-11-09 上海德衡数据科技有限公司 Data distribution method and system based on optical interconnection
CN113569111B (en) * 2021-09-24 2021-12-21 腾讯科技(深圳)有限公司 Object attribute identification method and device, storage medium and computer equipment
CN113569111A (en) * 2021-09-24 2021-10-29 腾讯科技(深圳)有限公司 Object attribute identification method and device, storage medium and computer equipment
CN114629871A (en) * 2022-02-28 2022-06-14 杭州趣链科技有限公司 Junk mail filtering method and device based on unbalanced dynamic flow data classification and storage medium

Also Published As

Publication number Publication date
CN108009249B (en) 2020-08-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant