CN104462509A

CN104462509A - Review spam detection method and device

Info

Publication number: CN104462509A
Application number: CN201410806356.4A
Authority: CN
Inventors: 李纪峰; 吴明
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qianxin Technology Co Ltd
Priority date: 2014-12-22
Filing date: 2014-12-22
Publication date: 2015-03-25

Abstract

The invention discloses a review spam detection method and device. The review spam detection method comprises the steps that a detection device located on the web server side detects review messages received by the web server; a preset review strategy is used for judging whether the review messages belong to review spam; if yes, the review messages, belonging to review spam, in the review messages are intercepted. The review messages in the web server are detected, the preset review strategy is used for judging whether the review messages belong to the review spam, and the review messages belonging to the review spam are intercepted when belonging to the review spam according to judgment, so that the method improves the review spam recognition rate and the interception efficiency and reduces the cost through detection and interception of the review spam.

Description

Review spam detection method and device

Technical field

The present invention relates to network security technology, and in particular to a review spam detection method and device.

Background

The development and spread of the Internet has profoundly changed how people live and think, and the network has become the main tool through which people acquire knowledge, publish information, and communicate. Beneath the content that ordinary users publish on a website, some netizens, merchants, and malicious actors post large amounts of review spam, such as irrelevant advertising reviews, promotional reviews, and reviews containing political, violent, or pornographic content. Large amounts of review spam both hinder network users from obtaining useful information and have a negative effect on some users.

At present, web servers have no unified review spam filtering mechanism. Each web server must manually set up its own detection mechanism to filter review spam, so the review messages of major websites cannot be detected uniformly, accurately, and in real time. Moreover, detecting review spam manually is inefficient and time-consuming, and manually filtering spam out of a large number of reviews may lead to false detections or missed detections.

Summary of the invention

In view of the defects in the prior art, the present invention provides a review spam detection method and device, which solve the prior-art problems of a low review spam recognition rate, low interception efficiency, and high cost.

In a first aspect, the present invention provides a review spam detection device, comprising:

A detection module, configured to detect review messages received by a web server;

A judgment module, configured to use a preset review strategy to judge whether the review messages belong to review spam;

A first interception module, configured to intercept, when the judgment module judges that review messages in the current web server are review spam, those of the review messages that belong to review spam.

Optionally, the review messages comprise one or more of the following:

Text information, picture information, character string information;

And/or,

The review messages further comprise: the IP address of the client that sent the review.

Optionally, the device further comprises:

A receiving module, configured to receive the review strategy sent by a server before the detection device detects the review messages in the web server;

The review strategy in the server is a strategy that the server derives from the review spam reported by multiple detection devices;

The review strategy comprises one or more of the following that belong to review spam: feature words, feature characters, feature pictures, feature character strings.

Optionally, the device further comprises:

A negative probability determination module, configured to use a preset model, after the judgment module judges that a review message in the current web server does not belong to review spam, to determine the negative probability of that review message, the negative probability being the probability that the review message belongs to review spam;

A second interception module, configured to intercept the review message corresponding to the negative probability when the negative probability falls within a preset range.

Optionally, the device further comprises:

A sending module, configured to send intercepted review messages to the server in real time or at scheduled times, so that the server updates, in real time according to the received review messages, the review strategy sent to the detection device.

In a second aspect, the present invention further provides a review spam detection method, comprising:

A detection device located on the web server side detects review messages received by the web server;

A preset review strategy is used to judge whether the review messages belong to review spam;

If so, those of the review messages that belong to review spam are intercepted.

Optionally, the review messages comprise one or more of the following:

Text information, picture information, character string information;

And/or,

The review messages further comprise: the IP address of the client that sent the review.

Optionally, the review strategy is a review strategy that the detection device receives from a server before detecting the review messages in the web server.

Optionally, the method further comprises:

After the preset review strategy is used to judge that a review message in the current web server does not belong to review spam, a preset model is used to determine the negative probability of that review message, the negative probability being the probability that the review message belongs to review spam;

If the negative probability falls within a preset range, the review message corresponding to the negative probability is intercepted.

Optionally, the method further comprises:

Intercepted review messages are sent to the server in real time or at scheduled times, so that the server updates, in real time according to the received review messages, the review strategy sent to the detection device.

As can be seen from the above technical solutions, the review spam detection method and device provided by the present invention detect the review messages in the web server, judge by means of a preset review strategy whether each review message belongs to review spam, and intercept the review messages that do. By detecting and intercepting review spam in this way, the method improves the recognition rate and interception efficiency for review spam while also reducing cost.

Numerous specific details are described in the specification of the present invention. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in the embodiments may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

Furthermore, those skilled in the art will appreciate that although some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of a browser terminal device according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and they should all be covered by the scope of the claims and specification of the present invention.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a review spam detection method provided by one embodiment of the present invention;

Fig. 2 is a schematic flowchart of a review spam detection method provided by another embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a review spam detection device provided by one embodiment of the present invention.

Detailed description

Specific embodiments of the invention are further described below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solutions of the present invention more clearly and do not limit its scope of protection.

With the development of information technology, many websites support interaction between users. Once a person has registered with a website, and possibly passed the relevant authentication, that person is called a "user" of the website. On a website, a user can present user behavior in a news feed system; the operation of presenting such behavior on the website is commonly called "publishing", and the published content can be seen by other users. For example, social networking sites, blogs, microblogs, and BBS forums all allow users to perform operations such as "posting a blog entry", "posting a microblog", and "posting a thread" in the news feed system. These websites also allow a user to comment on content published by other users, which is commonly called "posting a review".

In response to such published content, some users may post review spam, for example irrelevant advertising reviews, promotional reviews, and reviews containing political, violent, or pornographic content. The following embodiments of the present invention address how to detect and intercept such review spam.

Fig. 1 shows a review spam detection method provided by an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:

101. A detection device located on the web server side detects review messages received by the web server.

Detection for the web server may be performed by a detection device in a server of a third-party software company.

The review messages comprise at least one of the following: text information, picture information, character string information; and/or the Internet Protocol (IP) address of the client that sent the review. This embodiment merely gives examples of review messages; a review message may also contain other information, which this embodiment does not limit.

102. A preset review strategy is used to judge whether the review messages belong to review spam.

The review strategy in this embodiment is one that the detection device receives from a server before detecting the review messages in the web server.

The server may be a cloud server. That is, the detection devices on all web server sides can connect to the cloud server and, while monitoring the review messages in the web server in real time, receive review strategies downloaded from or updated by the cloud server in real time, so that review spam among the web server's review messages is detected more accurately.

Whether a review message is review spam is judged by the preset review strategy.

103. If so, those of the review messages that belong to review spam are intercepted.

Of course, if the preset review strategy in step 102 judges that a review message in the current web server does not belong to review spam, the currently detected review message is not intercepted, so that the web server displays it.

The above method judges, by means of a preset review strategy, whether a review message is review spam and intercepts it when it is. By detecting and intercepting review spam in this way, the method improves the recognition rate and interception efficiency for review spam while also reducing cost.
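Steps 101 to 103 can be sketched as follows. This is a minimal illustration only: the names `ReviewStrategy` and `handle_review` are hypothetical, and the strategy is assumed here to be a plain set of feature words checked by substring containment, which the embodiment does not prescribe.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewStrategy:
    """Hypothetical container for a preset review strategy (assumption)."""
    feature_words: set = field(default_factory=set)


def is_review_spam(text: str, strategy: ReviewStrategy) -> bool:
    """Step 102: judge a review message against the preset strategy."""
    return any(word in text for word in strategy.feature_words)


def handle_review(text: str, strategy: ReviewStrategy) -> str:
    """Steps 101-103: intercept spam, otherwise let the web server display it."""
    return "intercept" if is_review_spam(text, strategy) else "display"


strategy = ReviewStrategy(feature_words={"invoice", "provident fund"})
print(handle_review("invoice issuing service, please contact 158XXXXX", strategy))
print(handle_review("great article, thanks for sharing", strategy))
```

In a deployment along the lines described above, the strategy object would be the one most recently received from the cloud server rather than a hard-coded set.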

Fig. 2 shows a review spam detection method provided by an embodiment of the present invention. As shown in Fig. 2, the method comprises the following steps:

201. A detection device located on the web server side detects review messages received by the web server.

202. A preset review strategy is used to judge whether the review messages belong to review spam.

Typically, the review strategy may include the review content posted from each IP address within a recent time period, or it may include an IP blacklist for review messages within the recent time period.

It should be noted that the preset review strategy is one received in advance from the cloud server. The review strategy in the cloud server is compiled from statistics on the review spam reported by multiple detection devices; it can be formulated according to the content of those review messages, and it can be used to detect whether a given review message is review spam.

Specifically, the review strategy may consist of feature words, feature characters, feature pictures, and feature character strings that belong to review spam. For example, feature words may be words that occur frequently in review spam, such as "invoice", "sale", and "provident fund", and may be verbs, nouns, and so on; feature pictures are pictures with violent or pornographic content; feature character strings may be clauses in which certain feature words are followed by contact details such as an advertised telephone number. This embodiment does not describe these in further detail.

In another possible implementation, the cloud server may also send the review strategy to the detection device in real time or at scheduled times, so that the detection device directly checks the review messages it obtains; this embodiment does not limit the manner used.

In step 202, the process of judging whether a review message is review spam comprises the following sub-steps 2021 to 2023, which are not shown in Fig. 2:

2021. Extract the features of the review message and obtain the keywords or key information in those features.

The features of a review message can be understood as its sentence features, semantic features, sentiment features, contextual features, and so on; this embodiment does not limit which specific features are extracted.

It can be understood that feature extraction may proceed as follows. First, the content of a review is preprocessed: the review is split into sentences at punctuation marks to obtain a set of sentences. A word segmentation tool then divides each sentence in the set into words to obtain a set of words. Finally, a part-of-speech tool tags each word in the set with its part of speech and the words are classified accordingly, yielding, for example, a noun set, a verb set, and an adjective set.
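The first two preprocessing stages described above (sentence splitting and word segmentation) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, a simple regular-expression tokenizer stands in for a real word segmentation tool, and part-of-speech tagging is omitted because it requires an external tagger.

```python
import re

# Sentence-ending punctuation, both ASCII and full-width forms (assumed set).
SENTENCE_DELIMITERS = r"[.!?;。！？；]+"


def split_sentences(review: str) -> list:
    """Split a review into sentences at punctuation marks."""
    parts = re.split(SENTENCE_DELIMITERS, review)
    return [part.strip() for part in parts if part.strip()]


def tokenize(sentence: str) -> list:
    """Stand-in for a word segmentation tool: runs of word characters."""
    return re.findall(r"\w+", sentence.lower())


def extract_word_set(review: str) -> set:
    """Build the set of words from every sentence of the review."""
    words = set()
    for sentence in split_sentences(review):
        words.update(tokenize(sentence))
    return words


print(split_sentences("Invoices issued. Call now!"))
print(sorted(extract_word_set("Invoices issued. Call now!")))
```

For Chinese review text, `tokenize` would have to be replaced by a dedicated segmentation tool, since words are not separated by spaces.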

In another possible case, some review spam users insert special characters into their reviews to avoid being intercepted directly. For example, the review content may be "invoice issuing service, please contact 158XXXXX" with special characters such as "@", "&", and "#" inserted between its characters. In that case the special characters must be removed when the features of the review message are extracted, so that the review text becomes "invoice issuing service, please contact 158XXXXX".
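Removing inserted special characters can be sketched as below. The set of characters treated as obfuscation is an assumption chosen for illustration, and `strip_special_characters` is a hypothetical name; a real deployment would tune the character set to the spam it actually sees.

```python
import re

# Characters assumed to be used for obfuscation (illustrative, not exhaustive).
OBFUSCATION_CHARS = re.compile(r"[@#&*$%^~|]")


def strip_special_characters(text: str) -> str:
    """Delete obfuscation characters so that split words are rejoined."""
    return OBFUSCATION_CHARS.sub("", text)


obfuscated = "in@vo&ice is#sued, please con#tact 158XXXXX"
print(strip_special_characters(obfuscated))
```

Ordinary punctuation such as commas is deliberately left in place here, because later steps split the review into sentences at punctuation marks.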

Word segmentation may then be performed on the review content from which special characters have been removed, after which a conditional random field model is used to obtain the keywords or key information of the segmented content. It will be understood that function words with no practical meaning in the review content (such as punctuation, auxiliary words, modal particles, interjections, and onomatopoeia) cannot serve as keywords of the review message.

In this embodiment, the features of a review message can be extracted in any existing manner, which this embodiment does not limit.

2022. Match the key information of the review message against the feature information in the review strategy, or match the keywords in the features of the review message against the feature words in the review strategy.

For example, if the sentence features of a review message are those of an advertising review, they may include: "XXX, the web address is http:XXXX", "invoice issuing service, please contact 158XXXXX", and "provident fund withdrawal, please contact 152XXXXX".

The keywords may include: "invoice", "contact", "the web address is", "provident fund", and "withdrawal".

2023. If the degree of matching between the keywords or key information of the review message and the feature words or feature information in the preset review strategy exceeds a preset threshold, the current review message is determined to be review spam.

For example, suppose the review strategy stores the review message "invoice issuing service, please contact 158XXXXX", and a newly extracted review message consists of the text "invoice issuing service, please contact 010-XXXXX" with special characters such as "@", "&", and "#" inserted. After the special characters are removed as described above, the message reads "invoice issuing service, please contact 010-XXXXX". The two messages differ in telephone number but share the same sentence features, which can be understood as a matching degree of 98% with the features stored in the review strategy, so the review message is determined to be review spam.
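The threshold comparison in sub-step 2023 can be sketched as follows. The embodiment does not state how the matching degree is computed; this sketch assumes a character-level similarity ratio from Python's standard `difflib.SequenceMatcher`, and the function names and the 0.9 threshold are hypothetical. It therefore will not reproduce the 98% figure quoted above.

```python
from difflib import SequenceMatcher


def matching_degree(review: str, stored_spam: str) -> float:
    """Similarity in [0, 1] between a review and a stored spam message."""
    return SequenceMatcher(None, review, stored_spam).ratio()


def exceeds_threshold(review: str, stored_spam_messages, threshold: float = 0.9) -> bool:
    """Sub-step 2023: spam if any stored message matches above the threshold."""
    return any(matching_degree(review, stored) > threshold
               for stored in stored_spam_messages)


stored = ["invoice issuing service, please contact 158XXXXX"]
candidate = "invoice issuing service, please contact 010-XXXXX"
print(exceeds_threshold(candidate, stored))
print(exceeds_threshold("great article, thanks for sharing", stored))
```

A character-level ratio is only one possible choice; matching on extracted keywords, as sub-step 2022 describes, would be less sensitive to differences such as the telephone number.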

In another possible implementation, keyword matching also covers keywords that are homophones of sensitive words, and such homophone matches are included in the calculation of the matching degree; the same applies to substitutions such as writing the digits "1, 2, 3" as the Chinese numerals "one, two, three".

For example, if the content of a review is parsed as "need an invoice issued, call one-five-eight XXXXX" (with the digits written as Chinese numerals), it contains the keywords "invoice", "call", and "158"; these keywords are then matched against the keywords in the review strategy and the matching degree is calculated.
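Normalizing numerals before matching can be sketched as follows. This assumes that the substituted numerals are the Chinese numeral characters for zero to nine; `normalize_numerals` is a hypothetical name, and homophone matching of sensitive words is not covered because it would need a pronunciation dictionary.

```python
# Map Chinese numeral characters to ASCII digits (both forms of zero included).
NUMERAL_TABLE = str.maketrans("〇零一二三四五六七八九", "00123456789")


def normalize_numerals(text: str) -> str:
    """Rewrite Chinese numerals as digits so phone numbers match again."""
    return text.translate(NUMERAL_TABLE)


print(normalize_numerals("一五八XXXXX"))
```

After this normalization, a telephone number written in numerals yields the same keyword ("158") as one written in digits, so the existing feature words in the review strategy still apply.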

It will be understood that review spam is judged by the preset review strategy. In another possible implementation, if the IP address of a review message matches the IP blacklist in the review strategy, the review message is determined to be review spam and is intercepted directly.
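The IP blacklist check can be sketched with Python's standard `ipaddress` module. The blacklist format is an assumption: entries are taken to be either single addresses or CIDR ranges, and `ip_is_blacklisted` is a hypothetical name.

```python
import ipaddress


def ip_is_blacklisted(client_ip: str, blacklist) -> bool:
    """Return True if client_ip equals or falls inside any blacklist entry."""
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(entry, strict=False)
        for entry in blacklist
    )


# Documentation-only address ranges are used for the example.
blacklist = ["203.0.113.7", "198.51.100.0/24"]
print(ip_is_blacklisted("198.51.100.42", blacklist))
print(ip_is_blacklisted("192.0.2.1", blacklist))
```

A review message whose client address passes this check would be intercepted directly, without computing a matching degree on its content.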

An IP address may be newly registered, or may have been blacklisted originally while its matching degree with review spam over a recent period has fallen below the preset threshold; under the original review strategy, review messages posted from such an IP address would be intercepted directly. The review strategy in the cloud server is therefore updated, so that an IP address that should now be whitelisted is not treated as a blacklisted IP address and its review messages are not intercepted outright.

The matching degree calculation takes several factors into account, for example the number of reviews an IP address has posted within a period of time, the proportion of those that are review spam, and the degree of matching between the keywords or key information in that IP address's review messages and the feature words or feature information of review spam.

For example, suppose the IP address of a review message frequently posted review spam on major websites more than a month ago, but in the most recent month has posted reviews only on a few websites with a spam ratio of almost 0. The matching degree can then be calculated comprehensively from the number of reviews posted from that IP address, the number of those that are review spam, and the time periods in which the reviews and the review spam were posted.

203. If the preset review strategy judges that a review message in the current web server is review spam, those of the review messages that belong to review spam are intercepted.

It can be understood that the preset review strategy judges a review message in the current web server to be review spam when the degree of matching between the keywords or key information of the review message and the feature words or feature information in the preset review strategy exceeds the preset threshold.

In a specific application, the degree of matching between the keywords or key information of a review message and the feature words or feature information in the preset review strategy may fail to exceed the preset threshold because some users deliberately vary their wording to evade interception. To detect whether such review messages are review spam, the method further comprises the following steps:

204. After the preset review strategy in step 202 judges that a review message in the current web server does not belong to review spam, a preset model is used to determine the negative probability of that review message, the negative probability being the probability that the review belongs to review spam.

The review spam samples for the above model can be built, for example, as follows:

A01. Obtain multiple review spam messages in advance, perform word segmentation on them, and extract the corresponding keywords or key information.

Specifically, the review spam messages may be review messages captured from web pages in a targeted manner by a spider or crawler algorithm. It will be understood that a web crawler, also known as a web spider, is a program that automatically extracts web pages and is an important component of search engines; the present invention does not describe it in detail.

A02. Combine the keywords with the feature words in a preset review spam feature lexicon, or combine the key information with the feature information in a review spam feature information library; build the model for judging review spam from the various combinations.

For example, the spam feature lexicon may be organized by part of speech and by the positive or negative connotation of words, and may specifically include vocabulary related to advertising and distribution or containing political, violent, or pornographic content; the feature information library may include picture content related to advertising and distribution or containing political, violent, or pornographic material. The feature lexicon and feature information library in this embodiment are only examples, and this embodiment does not limit their specific content.

By training the model on a large number of review spam messages as samples, the features and rules by which words in review messages combine into review spam can be learned.

The preset model may be obtained by training as follows: review messages, including both review spam and non-spam review messages, are obtained in advance, and the model for judging review spam is built from those review messages.

Thus, in step 204, this model is used to calculate the negative probability of the current review message.
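The embodiment does not name the type of the preset model. As one possible illustration only, the sketch below trains a multinomial naive Bayes classifier with Laplace smoothing on keyword lists labelled as spam or non-spam and returns the probability that a review is spam; the class name, method names, and training data are all hypothetical.

```python
import math
from collections import Counter


class NaiveBayesSpamModel:
    """Illustrative stand-in for the preset model (assumed technique)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, samples):
        """samples: iterable of (keywords, is_spam); both classes must occur."""
        for keywords, is_spam in samples:
            self.word_counts[is_spam].update(keywords)
            self.doc_counts[is_spam] += 1

    def negative_probability(self, keywords) -> float:
        """Probability that a review with these keywords is review spam."""
        vocabulary = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in keywords:
                # Laplace smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
            log_scores[label] = score
        # Normalise in log space to avoid underflow on long reviews.
        peak = max(log_scores.values())
        spam = math.exp(log_scores[True] - peak)
        ham = math.exp(log_scores[False] - peak)
        return spam / (spam + ham)


model = NaiveBayesSpamModel()
model.train([
    (["invoice", "contact"], True),
    (["fund", "withdrawal", "contact"], True),
    (["great", "article"], False),
    (["thanks", "sharing"], False),
])
print(round(model.negative_probability(["invoice", "contact"]), 3))
```

The keyword lists passed to `train` correspond to the keywords extracted in step A01; any other classifier that outputs a probability could be substituted without changing the later steps.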

In a particular application, according to the model of above-mentioned steps training, whether can be that comment spam information detects to certain review information.In another attainable mode, cloud server also can by this model in real time or timed sending to pick-up unit, this pick-up unit is directly detected the review information obtained, and the present embodiment does not limit aforesaid way.

Will be understood that, the IP judged in certain review information in above-mentioned steps 202 does not mate with the tactful IP blacklist of comment, and the matching degree of Feature Words in the keyword of the comment content of this review information or key message and comment strategy or characteristic information is not when exceeding pre-set threshold value, then adopt preset model to determine the negative sense probability not belonging to comment spam information, described negative sense probability is the probability that this comment belongs to comment spam information.

When said method is applicable to not belong to the review information of comment spam to the review information adopting the comment strategy preset to judge in current site server, then carry out calculating by above-mentioned preset model the negative sense probability that this review information belongs to the review information of comment spam.Therefore said method is further comprising the steps of:

205. Judge whether the negative probability falls within a preset range;

206. If the negative probability falls within the preset range, intercept the review message corresponding to the negative probability.

For example, if the preset range of the negative probability is 0.5 to 0.9 and the negative probability calculated for a review message is 0.8, the review message is intercepted.

207. If the negative probability does not fall within the preset range, release the review message corresponding to the negative probability.

As another example, if the preset range of the negative probability is 0.5 to 0.9 and the negative probability calculated for a review message is 0.45, the review message corresponding to this negative probability is displayed.
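Steps 205 to 207 can be illustrated with a minimal Python sketch. The names `ProbabilityRange` and `decide` are hypothetical, and treating both ends of the preset range as inclusive is an assumption, since the text does not say how boundary values are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbabilityRange:
    """Preset range for the negative probability (bounds assumed inclusive)."""
    low: float = 0.5
    high: float = 0.9

    def contains(self, p: float) -> bool:
        return self.low <= p <= self.high


def decide(negative_probability: float,
           preset_range: ProbabilityRange = ProbabilityRange()) -> str:
    """Step 205: check the range; step 206: intercept; step 207: release."""
    if preset_range.contains(negative_probability):
        return "intercept"
    return "release"
```

With the example values from the text, `decide(0.8)` yields an interception and `decide(0.45)` yields a release.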

To keep the review strategy used in step 202 up to date, the method further comprises step 208:

208. Send the review messages that belong to review spam, together with the review messages corresponding to the negative probability, to the cloud server.

In a specific application, the detection device sends the review messages that belong to review spam and the review messages corresponding to the negative probability to the server, so that the review strategy in the cloud server is updated. The review strategy may be updated in real time or periodically, for example once a day.
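The reporting in step 208 and the resulting strategy update might look like the following sketch. The text only says the server derives the strategy statistically from reported spam, so everything concrete here is an assumption made for illustration: the `CloudServer` class, the rule that an IP is blacklisted after `min_ip_reports` reports, and the choice of the `top_words` most frequent words as feature words.

```python
from collections import Counter


class CloudServer:
    """Hypothetical cloud-side aggregator of spam reports from detection devices."""

    def __init__(self, min_ip_reports: int = 2, top_words: int = 3):
        self.min_ip_reports = min_ip_reports  # assumed blacklist cutoff
        self.top_words = top_words            # assumed number of feature words
        self.ip_counts = Counter()
        self.word_counts = Counter()

    def report(self, client_ip: str, text: str) -> None:
        """Step 208: a detection device reports one intercepted review message."""
        self.ip_counts[client_ip] += 1
        self.word_counts.update(text.lower().split())

    def current_strategy(self) -> dict:
        """Rebuild the review strategy from all reports received so far."""
        blacklist = {ip for ip, n in self.ip_counts.items()
                     if n >= self.min_ip_reports}
        words = {w for w, _ in self.word_counts.most_common(self.top_words)}
        return {"ip_blacklist": blacklist, "feature_words": words}
```

Calling `current_strategy()` after every report would correspond to a real-time update, while calling it on a schedule (for example once a day) would correspond to a periodic one.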

Fig. 3 shows a schematic structural diagram of the review spam detection device provided by an embodiment of the present invention. As shown in Fig. 3, the device comprises a detection module 31, a judgment module 32 and a first interception module 33.

Detection module 31, configured to detect the review messages received by the web server;

Specifically, the review message comprises one or more of the following:

Text information, picture information, character string information; and/or the IP address of the client that sent the review. This embodiment gives these only as examples; the review message may also include other information, which this embodiment does not limit.

Judgment module 32, configured to judge, using a preset review strategy, whether the review message belongs to review spam;

First interception module 33, configured to intercept the review messages that belong to review spam when the judgment module judges that a review message in the current web server is review spam.
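The cooperation of judgment module 32 and first interception module 33 can be sketched as below. This is an illustrative sketch only: the classes `ReviewStrategy` and `ReviewMessage`, whitespace splitting as keyword extraction, and the definition of matching degree as the fraction of keywords that are feature words are all assumptions, since the text does not specify how the matching degree is computed.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewStrategy:
    """Hypothetical container for a preset review strategy."""
    ip_blacklist: set = field(default_factory=set)
    feature_words: set = field(default_factory=set)
    threshold: float = 0.5  # assumed matching-degree threshold


@dataclass
class ReviewMessage:
    client_ip: str
    text: str


def matching_degree(text: str, feature_words: set) -> float:
    """Assumed metric: share of the message's words that are feature words."""
    keywords = text.lower().split()
    if not keywords:
        return 0.0
    hits = sum(1 for word in keywords if word in feature_words)
    return hits / len(keywords)


def is_review_spam(msg: ReviewMessage, strategy: ReviewStrategy) -> bool:
    """Judgment module: IP blacklist match, or matching degree above threshold."""
    if msg.client_ip in strategy.ip_blacklist:
        return True
    return matching_degree(msg.text, strategy.feature_words) > strategy.threshold


def intercept(messages: list, strategy: ReviewStrategy) -> tuple:
    """Interception module: split messages into (passed, intercepted)."""
    passed, intercepted = [], []
    for msg in messages:
        (intercepted if is_review_spam(msg, strategy) else passed).append(msg)
    return passed, intercepted
```

Messages that end up in `passed` are the ones the strategy alone could not classify as spam.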

Specifically, the device further comprises a receiving module 34, not shown in Fig. 3:

Receiving module 34, configured to receive the review strategy sent by the server before the detection device detects the review messages in the web server;

The review strategy in the server is a strategy that the server derives statistically from the review spam messages reported by multiple detection devices.

When the above review strategy cannot directly determine whether a review message is review spam, the device further comprises a negative probability determination module 35 and a second interception module 36, not shown in the figure, so that review messages with a high negative probability of being review spam can be identified more accurately;

Negative probability determination module 35, configured to, after the judgment module judges that a review message in the current web server does not belong to review spam, use a preset model to determine the negative probability of that review message, the negative probability being the probability that the review message belongs to review spam;

Second interception module 36, configured to intercept the review message corresponding to the negative probability when the negative probability falls within the preset range.
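The two-stage flow formed by modules 32, 33, 35 and 36 can be summarized in a short sketch. The function `handle_review` and its callable parameters are hypothetical; the preset model is passed in as an opaque callable because this section does not describe its form, and the default range of 0.5 to 0.9 with inclusive bounds is taken from the earlier example as an assumption.

```python
from typing import Callable


def handle_review(message: str,
                  strategy_says_spam: Callable[[str], bool],
                  negative_probability: Callable[[str], float],
                  low: float = 0.5,
                  high: float = 0.9) -> str:
    """Run the strategy check first, then fall back to the preset model."""
    # Stage 1: judgment module 32 and first interception module 33.
    if strategy_says_spam(message):
        return "intercepted_by_strategy"
    # Stage 2: negative probability determination module 35.
    p = negative_probability(message)
    # Second interception module 36 acts only inside the preset range.
    if low <= p <= high:
        return "intercepted_by_model"
    return "released"
```

Keeping the model behind a callable means the cheaper strategy check runs first and the model is consulted only for messages the strategy lets through.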

To update the review strategy in the cloud server, the device further comprises a sending module 37, not shown in the figure:

Sending module 37, configured to send the intercepted review messages to the server in real time or periodically, so that the server updates, in real time and according to the review messages received, the review strategy that it sends to the detection device.

The above device corresponds one-to-one with the above method, and the specific examples given for the method apply equally to the device, so the implementation details of the device are not described further here.

Thus, in the wireless intrusion detection system of this embodiment, the server and the sensors interact with each other, so that hotspot information in the enterprise wireless network can be monitored in real time and the security of the enterprise wireless network is effectively ensured.

Claims

1. A review spam detection device, characterized by comprising:

2. The device according to claim 1, characterized in that the review message comprises one or more of the following:

Text information, picture information, character string information;

And/or,

The review message further comprises: the Internet Protocol (IP) address of the client that sent the review.

3. The device according to claim 1, characterized in that the device further comprises:

4. The device according to claim 1, characterized in that the device further comprises:

5. The device according to claim 4, characterized in that the device further comprises:

6. A review spam detection method, characterized by comprising:

7. The method according to claim 6, characterized in that the review message comprises one or more of the following:

Text information, picture information, character string information;

And/or,

8. The method according to claim 6, characterized in that the review strategy is a review strategy sent by a server and received by the detection device before it detects the review messages in the web server;

9. The method according to claim 6, characterized in that the method further comprises:

10. The method according to claim 9, characterized in that the method further comprises: