CN101510879A - Method and apparatus for filtering rubbish contents - Google Patents

Method and apparatus for filtering rubbish contents

Publication number
CN101510879A
Authority
CN
China
Prior art keywords
content
posting
semantic analysis
rubbish
analysis condition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2009100807325A
Other languages
Chinese (zh)
Inventor
李京晶
于章涛
张萌萌
祝锐
赵琳霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CNA2009100807325A
Publication of CN101510879A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for filtering spam content. Posted content is judged against preset semantic analysis conditions; any part of the posted content that meets those conditions is treated as spam content and blocked; and the posted content, after blocking, is reviewed and then published on the network. The invention also discloses a device for filtering spam content. The technical solution can effectively block spam content in communities, save investment in labor and material resources, and improve working efficiency.

Description

Method and device for filtering spam content
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for filtering spam content.
Background art
At present, Internet communities generally filter spam content in the traditional way. As shown in Fig. 1, before content posted by a user is published on the network, it first passes through level-1 dirty-word filtering, in which words in the post that match level-1 dirty words are blocked as spam vocabulary. Next, the content that has passed level-1 dirty-word filtering undergoes level-2 dirty-word filtering in the manual review stage, in which words in the post that match level-2 dirty words are again blocked as spam vocabulary. Content that has passed level-2 dirty-word filtering is then published on the network. Spam content that neither the level-1 nor the level-2 filter caught can only be removed later, by manual or machine inspection of posts already published on the network.
In the course of realizing the present invention, the inventors found that the above prior art has the following shortcomings:
(1) When level-1 dirty-word filtering is applied to the content of a user's post, the current level-1 stage can only filter by matching the post content exactly, one to one, against the dirty words already stored in the dirty-word database. If the post contains new spam vocabulary that is not stored in the database, level-1 dirty-word filtering will not catch it;
The level-2 dirty-word filtering stage of manual review still relies purely on dirty-word matching, so it has the same problem as level-1 filtering. New spam vocabulary that goes undetected during level-1 or level-2 matching is published on the network as part of a post that meets the publication requirements; it can only be deleted by later manual or machine inspection and then added to the level-1 or level-2 dirty-word database. Filtering spam content by dirty-word matching alone, as in the prior art, is therefore passive and limited in coverage;
(2) The later inspection of post content already published on the network is also passive to a degree: the management server must actively browse and patrol the published posts and delete the spam vocabulary found one by one, which increases the investment in labor and in machine maintenance.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and device for filtering spam content that can effectively block spam content in communities and save investment in manpower and material resources.
The technical solution of the present invention is as follows:
A method for filtering spam content, the method comprising:
judging posted content against predetermined semantic analysis conditions, and blocking, as spam content, the content in the posted content that satisfies the predetermined semantic analysis conditions;
publishing the posted content on the network after the blocking processing and after review.
Further, the predetermined semantic analysis conditions comprise:
(a) whether the posted content satisfies grammatical requirements; or
(b) whether it has a feature that characterizes spam vocabulary; or
(c) whether it contains vocabulary whose frequency of occurrence exceeds a frequency standard value; or
(d) whether the number of network link addresses it carries exceeds a set reference value;
or any combination of (a), (b), (c) and (d) above.
Further, the process of generating the predetermined semantic analysis conditions specifically comprises:
when the predetermined semantic analysis condition is (b), collecting a large amount of filtered spam content and classifying it by category; and, for the content in each category, obtaining the features that characterize spam vocabulary; or,
when the predetermined semantic analysis condition is (c), recording the number of occurrences of spam vocabulary in spam content to obtain a frequency standard value for judging whether the posted content contains spam vocabulary; or,
when the predetermined semantic analysis condition is (d), recording the number of network link addresses carried in spam content to obtain a reference value for determining whether the posted content contains spam content.
Further, before the posted content is judged against the predetermined semantic analysis conditions and the content satisfying those conditions is blocked as spam content, the method further comprises:
judging whether the posted content is a repeat: if the posted content repeats previously posted content, the repeated content is blocked automatically; otherwise, no processing is performed;
performing level-1 dirty-word matching on the posted content that has passed the repetition judgment: if a match is found, the posted content is blocked; otherwise, semantic analysis is performed on the posted content that has passed level-1 dirty-word filtering.
Further, publishing the posted content on the network after the blocking processing and after review specifically comprises:
filtering the posted content that has undergone blocking processing through level-2 dirty words and a web address blacklist respectively, and scoring each post according to the filter results; posts whose score is below a standard value are blocked and then sent for manual review; posts whose score is greater than or equal to the standard value are published on the network.
Further, the method also comprises:
performing background monitoring on the posts published on the network; automatically recording users whose posting frequency exceeds a set reference value, ordinary posts whose click count and reply count exceed set reference values, and poll posts whose option-selection count, click count and reply count exceed set reference values; and notifying the management server by e-mail alert for handling.
The present invention also provides a device for filtering spam content, the device comprising:
a semantic analysis execution module, configured to judge posted content against predetermined semantic analysis conditions and to block, as spam content, the content in the posted content that satisfies the predetermined semantic analysis conditions;
a review execution module, configured to publish on the network, after review, the posted content processed by the semantic analysis execution module.
Preferably, the predetermined semantic analysis conditions comprise:
(e) whether the posted content satisfies grammatical requirements; or
(f) whether it has a feature that characterizes spam vocabulary; or
(g) whether it contains vocabulary whose frequency of occurrence exceeds a frequency standard value; or
(h) whether the number of network link addresses it carries exceeds a set reference value;
or any combination of (e), (f), (g) and (h) above.
Preferably, the device further comprises:
a condition generation module, configured to: when the predetermined semantic analysis condition is (f), collect a large amount of filtered spam content and classify it by category, and, for the content in each category, obtain the features that characterize spam vocabulary; or,
when the predetermined semantic analysis condition is (g), record the number of occurrences of spam vocabulary in spam content to obtain a frequency standard value for judging whether the posted content contains spam vocabulary; or,
when the predetermined semantic analysis condition is (h), record the number of network link addresses carried in spam content to obtain a reference value for determining whether the posted content contains spam content.
Preferably, the semantic analysis execution module specifically comprises:
a judging unit, configured to judge whether the posted content contains content that satisfies the predetermined semantic analysis conditions;
a processing unit, configured to act on the judging unit's result for the posted content: content in the posted content that satisfies the predetermined semantic analysis conditions is blocked as spam content; otherwise, no processing is performed.
Preferably, the device further comprises:
a repetition judgment processing module, configured to judge whether the posted content is a repeat: if the posted content repeats previously posted content, the repeated content is blocked automatically; otherwise, no processing is performed;
a level-1 dirty-word filtering module, configured to match the posted content processed by the repetition judgment processing module against level-1 dirty words: if a match is found, the posted content is blocked; otherwise, the posted content that has passed level-1 dirty-word filtering is sent to the semantic analysis execution module for processing.
Preferably, the review execution module specifically comprises:
an evaluation unit, configured to filter the posted content that has been filtered by the semantic analysis execution module through level-2 dirty words and a web address blacklist respectively, and to score each post according to the filter results;
an execution unit, configured to act on the evaluation unit's result: posts whose score is below the standard value are blocked and then sent for manual review; posts whose score is greater than or equal to the standard value are published on the network.
Preferably, the device further comprises:
a background monitoring module, configured to perform background monitoring on the posts that the review execution module publishes on the network, and to automatically record users whose posting frequency exceeds a set reference value, ordinary posts whose click count and reply count exceed set reference values, and poll posts whose option-selection count, click count and reply count exceed set reference values;
a monitoring alarm module, configured to notify the management server, by e-mail alert, of the data recorded by the background monitoring module so that it can be handled.
Adopting the technical solution of the present invention has the following beneficial effects:
1. The present invention uses semantic analysis to filter spam content from posted content, overcoming the passivity of relying purely on dirty-word filtering in the prior art; because the predetermined semantic analysis conditions hold a large amount of feature information about spam content, the filtering coverage is broader;
2. The semantic analysis process reduces the workload of manual review and background monitoring, saving labor.
Description of drawings
Fig. 1 is a flow block diagram of the traditional community filtering approach in the prior art;
Fig. 2 is an outline flowchart of a method for filtering spam content according to an embodiment of the present invention;
Fig. 3 is a schematic block diagram of a device for filtering spam content according to an embodiment of the present invention;
Fig. 4 is a flow block diagram of a method for filtering spam content according to an embodiment of the present invention.
Embodiments
For a better understanding of the technical solution of the present invention, it is described below with reference to specific embodiments.
By applying semantic analysis to community spam filtering, the present invention overcomes the limited filtering coverage of the prior art, saves more labor and improves working efficiency.
As shown in Fig. 2, a method for filtering spam content according to an embodiment of the present invention comprises:
Step S103: judging posted content against predetermined semantic analysis conditions, and blocking, as spam content, the content in the posted content that satisfies the predetermined semantic analysis conditions;
Step S104: publishing the posted content on the network after the blocking processing and after review.
Specifically, the predetermined semantic analysis conditions comprise:
(a) whether the posted content satisfies grammatical requirements; or
(b) whether it has a feature that characterizes spam vocabulary; or
(c) whether it contains vocabulary whose frequency of occurrence exceeds a frequency standard value; or
(d) whether the number of network link addresses it carries exceeds a set reference value;
or any combination of (a), (b), (c) and (d) above.
Spam posts frequently contain Chinese characters or letters typed at random on the keyboard. Syntactic analysis shows that such strings do not meet grammatical requirements, so when such content is found in a post it is filtered out.
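Conditions (b), (c) and (d) can be illustrated with a minimal Python sketch. It is not the patented implementation: the feature phrases, the threshold values, the URL pattern and the function name matched_conditions are all hypothetical, and condition (a) is left out because the text does not say how grammar is checked. The word split is a simple Latin-letter pattern; Chinese text would need a word segmenter, and a real system would probably ignore common function words before counting.

```python
import re
from collections import Counter

# Hypothetical values; the patent derives them from collected spam content.
SPAM_FEATURES = {"cheap pills", "click here"}  # condition (b): feature phrases
FREQ_STANDARD = 3                              # condition (c): frequency standard value
LINK_REFERENCE = 2                             # condition (d): link-count reference value

URL_RE = re.compile(r"https?://\S+")


def matched_conditions(text):
    """Return the labels ('b', 'c', 'd') of the conditions the text satisfies."""
    hits = set()
    lowered = text.lower()

    # (b) the text carries a known spam feature
    if any(feature in lowered for feature in SPAM_FEATURES):
        hits.add("b")

    # (c) some word occurs more often than the frequency standard value
    links = URL_RE.findall(lowered)
    words = re.findall(r"[a-z']+", URL_RE.sub(" ", lowered))
    counts = Counter(words)
    if counts and max(counts.values()) > FREQ_STANDARD:
        hits.add("c")

    # (d) the text carries more links than the reference value
    if len(links) > LINK_REFERENCE:
        hits.add("d")

    return hits
```

A post would then be treated as spam content when the returned set is non-empty, or when it contains whichever combination of conditions the operator has chosen.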
Specifically, the process of generating the predetermined semantic analysis conditions comprises:
(1) if the predetermined semantic analysis condition is (b) above, collecting a large amount of filtered spam content and classifying it by category; and, for the content in each category, obtaining the features that characterize spam vocabulary;
In general, the collected spam posts can be classified into three categories, namely advertising, pornography and malicious friend-making; the features of these three types of posts are obtained separately and stored in a feature library.
(2) or, if the predetermined semantic analysis condition is (c) above, recording the number of occurrences of spam vocabulary in spam content to obtain a frequency standard value for judging whether the posted content contains spam vocabulary;
(3) or, if the predetermined semantic analysis condition is (d) above, recording the number of network link addresses carried in spam content to obtain a reference value for determining whether the posted content contains spam content.
Of course, in this embodiment the process of generating the predetermined semantic analysis conditions may comprise any combination of (1), (2) and (3) above, for example (1) and (2); (1) and (3); (2) and (3); or (1), (2) and (3).
In spam posts, spam vocabulary usually occurs very frequently and link addresses are very numerous. Through statistical learning on a large amount of spam content, a frequency standard value and a reference value can be obtained for distinguishing whether posted content contains spam vocabulary.
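The text does not specify which statistic is learned from the collected spam. As one possible reading, the sketch below records, for each known spam post, how often spam vocabulary occurs and how many links it carries, and takes the median of each as the threshold. The choice of the median, the function name derive_thresholds and the sample data in the usage note are assumptions made for illustration only.

```python
import re
from collections import Counter
from statistics import median

URL_RE = re.compile(r"https?://\S+")


def derive_thresholds(spam_posts, spam_words):
    """Return (frequency_standard, link_reference) from known spam posts.

    Assumption: the median of the per-post values is used as the threshold;
    the patent only says the values come from statistical learning.
    """
    per_post_freq = []
    per_post_links = []
    for post in spam_posts:
        lowered = post.lower()
        # number of link addresses carried by this spam post
        per_post_links.append(len(URL_RE.findall(lowered)))
        # number of occurrences of spam vocabulary in this spam post
        counts = Counter(re.findall(r"[a-z']+", URL_RE.sub(" ", lowered)))
        per_post_freq.append(sum(counts[w] for w in spam_words))
    return median(per_post_freq), median(per_post_links)
```

For example, with three hypothetical spam posts in which the word "buy" occurs 3, 1 and 2 times and which carry 1, 3 and 2 links, both thresholds come out as 2.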
Specifically, before step S103, the method further comprises:
Step S101: judging whether the posted content is a repeat: if the posted content repeats previously posted content, the repeated content is blocked automatically; otherwise, no processing is performed;
Step S102: performing level-1 dirty-word matching on the posted content that has passed the repetition judgment: if a match is found, the posted content is blocked; otherwise, semantic analysis is performed on the posted content that has passed level-1 dirty-word filtering.
Specifically, the embodiment of the invention can filter spam content from ordinary posts, created content and poll posts. Step S101 limits repeated flooding of posts (whether done manually or by machine), restricts meaningless content, and blocks malicious flooding in new content; specifically, the posted content from the same IP can be compared to identify whether content is repeated.
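The same-IP comparison in step S101 could be sketched as follows. The class name RepeatChecker, the whitespace and case normalization, and the use of a SHA-256 fingerprint are illustrative assumptions; the text only states that posted content from the same IP is compared.

```python
import hashlib
from collections import defaultdict


class RepeatChecker:
    """Remember what each IP has posted and flag exact repeats (hypothetical helper)."""

    def __init__(self):
        self._seen = defaultdict(set)  # ip -> fingerprints of earlier posts

    @staticmethod
    def _fingerprint(content):
        # Assumption: case and spacing differences still count as a repeat.
        normalized = " ".join(content.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def is_repeat(self, ip, content):
        """Return True if this IP has already posted the same content."""
        fp = self._fingerprint(content)
        if fp in self._seen[ip]:
            return True
        self._seen[ip].add(fp)
        return False
```

Storing fingerprints rather than full text keeps the memory per IP small; a production system would also need to expire old entries, which this sketch does not do.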
Step S102 filters out, by dirty-word matching, the dirty words contained in the post content;
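A minimal sketch of level-1 dirty-word matching, assuming a small in-memory word set (the entries and the function name level1_match are hypothetical). It also shows the limitation described in the background section: an obfuscated spelling that is not in the database is not matched.

```python
import re

# Hypothetical level-1 dirty-word database.
LEVEL1_DIRTY_WORDS = {"badword", "scamword"}


def level1_match(content, dirty_words=LEVEL1_DIRTY_WORDS):
    """Return the sorted list of level-1 dirty words found in the content."""
    # Simple word split; Chinese text would need a word segmenter instead.
    tokens = set(re.findall(r"\w+", content.lower()))
    return sorted(tokens & dirty_words)
```

Under these assumptions, level1_match("This has a BadWord inside") finds the word, while "b-a-d-w-o-r-d" passes unnoticed, which is the gap the semantic analysis of step S103 is meant to narrow.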
Step S103 blocks spam content such as pornography, advertising and malicious friend-making in the new content of a post;
The spam-content handling procedure is the same for ordinary posts and for creating poll posts, and is not repeated here.
Specifically, step S104 comprises:
filtering the posted content that has undergone blocking processing through level-2 dirty words and a web address blacklist respectively, and scoring each post according to the filter results; posts whose score is below a standard value are blocked and then sent for manual review; posts whose score is greater than or equal to the standard value are published on the network.
In practice, post content is filtered through level-2 dirty words and a web address blacklist respectively. Because posts differ in content, the spam content filtered out of them also differs, so each post is scored individually. A post in which more spam vocabulary is matched naturally scores lower, and a post in which less spam content is matched scores higher. To distinguish spam posts from normal posts, a standard value for identifying spam posts is obtained from the statistical regularities of past spam posts, and each score is judged against it. This avoids mistakenly blocking a whole post merely because it contains a small amount of spam vocabulary.
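The scoring rule itself is not given in the text. The sketch below assumes a simple scheme in which every post starts at 100 points and loses points for each level-2 dirty word and each blacklisted link; the word list, the blacklist, the penalties and the standard value of 60 are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical data and weights; the patent does not specify the scoring formula.
LEVEL2_DIRTY_WORDS = {"casino", "lottery"}
URL_BLACKLIST = {"spam.example"}   # blacklisted host names
STANDARD_VALUE = 60
WORD_PENALTY = 15
URL_PENALTY = 30

URL_RE = re.compile(r"https?://\S+")


def score_post(content):
    """Score a post from 0 to 100; more matched spam content gives a lower score."""
    lowered = content.lower()
    urls = URL_RE.findall(lowered)
    words = re.findall(r"\w+", URL_RE.sub(" ", lowered))
    word_hits = sum(1 for w in words if w in LEVEL2_DIRTY_WORDS)
    url_hits = sum(1 for u in urls if urlparse(u).hostname in URL_BLACKLIST)
    return max(0, 100 - WORD_PENALTY * word_hits - URL_PENALTY * url_hits)


def review_decision(content):
    """Publish when the score reaches the standard value, else send to manual review."""
    return "publish" if score_post(content) >= STANDARD_VALUE else "manual_review"
```

With these weights a post with a single matched word still scores 85 and is published, while a post with three matched words and one blacklisted link scores 25 and goes to manual review.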
In this embodiment, step S104 is built on the security center review platform, which reviews text content and all picture content by matching level-2 dirty words, the URL blacklist and so on. Content that the filtering stages before step S104 cannot handle is passed to manual review. For example, spam pictures (pornographic, advertising, reactionary pictures and the like) and spam content restricted by national regulations, such as reactionary or pornographic content, are reviewed.
The usual manual review stage in the prior art can only review pictures uploaded locally by the user and cannot review pictures quoted from other websites. The embodiment of the invention can review all pictures in a post, including externally linked pictures and locally uploaded pictures.
Specifically, the method further comprises step S105:
performing background monitoring on the posts published on the network; automatically recording users whose posting frequency exceeds a set reference value, ordinary posts whose click count and reply count exceed set reference values, and poll posts whose option-selection count, click count and reply count exceed set reference values; and notifying the management server by e-mail alert for handling.
The reference values for identifying abnormal post data can be determined from a large number of previously collected spam posts in combination with empirical values.
When background monitoring is performed on the posts published on the network, the abnormal situations to be identified may include:
(1) based on the recorded post data, automatically recording users who publish more than 10 posts within 5 minutes or more than 600 posts within 24 hours;
(2) automatically recording ordinary posts whose click count or reply count surges;
(3) automatically recording poll posts whose option-selection count, click count or reply count surges;
When any of the above three situations occurs, the management server can be notified by e-mail alert so that manual cleanup is carried out.
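Situation (1) can be checked with a short sketch that counts a user's posts inside the two time windows. The function name is_abnormal_poster and the representation of posting times as a sorted list of seconds are assumptions; the two limits (10 posts in 5 minutes, 600 posts in 24 hours) are the ones stated above.

```python
from bisect import bisect_left, bisect_right

FIVE_MINUTES = 5 * 60
ONE_DAY = 24 * 60 * 60


def is_abnormal_poster(timestamps, now):
    """Check one user's posting rate.

    timestamps: sorted posting times in seconds for a single user.
    Returns True if the user posted more than 10 times in the last 5 minutes
    or more than 600 times in the last 24 hours.
    """
    def count_since(window):
        # number of posts with now - window <= t <= now
        return bisect_right(timestamps, now) - bisect_left(timestamps, now - window)

    return count_since(FIVE_MINUTES) > 10 or count_since(ONE_DAY) > 600
```

Situations (2) and (3) would need a definition of "surge" that the text leaves to the empirically chosen reference values, so they are not sketched here.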
As shown in Fig. 3, the present invention also provides a device for filtering spam content, the device comprising:
a semantic analysis execution module S33, configured to judge posted content against predetermined semantic analysis conditions and to block, as spam content, the content in the posted content that satisfies the predetermined semantic analysis conditions;
a review execution module S44, configured to publish on the network, after review, the posted content processed by the semantic analysis execution module.
The semantic analysis execution module S33 and the review execution module S44 in this embodiment involve the same technical solutions as steps S103 and S104 of the method embodiment above, which are not repeated here.
Specifically, the predetermined semantic analysis conditions comprise:
(e) whether the posted content satisfies grammatical requirements; or
(f) whether it has a feature that characterizes spam vocabulary; or
(g) whether it contains vocabulary whose frequency of occurrence exceeds a frequency standard value; or
(h) whether the number of network link addresses it carries exceeds a set reference value;
or any combination of (e), (f), (g) and (h) above.
The predetermined semantic analysis conditions and their generation process are the same as in the method embodiment above and are not repeated here.
Specifically, the device further comprises:
a condition generation module S88, configured to: when the predetermined semantic analysis condition is (f) above, collect a large amount of filtered spam content and classify it by category, and, for the content in each category, obtain the features that characterize spam vocabulary; or,
when the predetermined semantic analysis condition is (g) above, record the number of occurrences of spam vocabulary in spam content to obtain a frequency standard value for judging whether the posted content contains spam vocabulary; or,
when the predetermined semantic analysis condition is (h) above, record the number of network link addresses carried in spam content to obtain a reference value for determining whether the posted content contains spam content.
The function of the condition generation module S88 is the same as the technical content described in the embodiment above; please refer to that description.
Specifically, the semantic analysis execution module may comprise:
a judging unit S331, configured to judge whether the posted content contains content that satisfies the predetermined semantic analysis conditions;
a processing unit S332, configured to act on the judging unit's result for the posted content: content in the posted content that satisfies the predetermined semantic analysis conditions is blocked as spam content; otherwise, no processing is performed.
Specifically, the device further comprises:
a repetition judgment processing module S11, configured to judge whether the posted content is a repeat: if the posted content repeats previously posted content, the repeated content is blocked automatically; otherwise, no processing is performed;
a level-1 dirty-word filtering module S22, configured to match the posted content processed by the repetition judgment processing module against level-1 dirty words: if a match is found, the posted content is blocked; otherwise, the posted content that has passed level-1 dirty-word filtering is sent to the semantic analysis execution module for processing.
The repetition judgment processing module S11 and the level-1 dirty-word filtering module S22 in this embodiment involve the same technical solutions as steps S101 and S102 of the method embodiment above, which are not repeated here.
Specifically, the review execution module comprises:
an evaluation unit S441, configured to filter the posted content that has been filtered by the semantic analysis execution module through level-2 dirty words and a web address blacklist respectively, and to score each post according to the filter results;
an execution unit S442, configured to act on the evaluation unit's result: posts whose score is below the standard value are blocked and then sent for manual review; posts whose score is greater than or equal to the standard value are published on the network.
Specifically, the device further comprises:
a background monitoring module S55, configured to perform background monitoring on the posts that the review execution module publishes on the network, and to automatically record users whose posting frequency exceeds a set reference value, ordinary posts whose click count and reply count exceed set reference values, and poll posts whose option-selection count, click count and reply count exceed set reference values;
a monitoring alarm module S66, configured to notify the management server, by e-mail alert, of the data recorded by the background monitoring module so that it can be handled.
The technical solutions involved in the background monitoring module S55 and the monitoring alarm module S66 are the same as those of step S105 in the method embodiment above and are not repeated here.
Described for a better understanding of the present invention method describes below in conjunction with embodiment 1.
Embodiment 1:
The embodiment of the invention 1 is that example describes the method for the invention with common card, in conjunction with shown in Figure 4.
Step S501: the online friend is searching common card of issue, and the model content at first will be passed through the three phases of information filtering:
(1) the repeated judgement stage; Judge whether the model content repeats (can realize by the content of posting of more identical IP) with the model content of issue before, if repeat, then shielding automatically; If do not repeat, then do not do any processing;
(2) the dirty speech matching stage of one-level; To carry out the dirty speech coupling of one-level through the model content that repeatability is judged, if match, the model content that then will have dirty speech shields automatically; If do not match, then enter semantic analysis process;
(3) semantic analysis process; Whether the semanteme by the analysis word content is rubbish contents by predetermined semantic analysis condition criterion, if, then shielding automatically; If not, then enter the audit stage;
As can be seen from Fig. 4, once the three stages of step S501 have filtered the post, non-compliant post content is deleted, and the poster can be informed by a text prompt that the post has been blocked.
Fig. 4 shows that the present invention can filter spam from the content of ordinary posts, creation posts and poll posts. This example deals only with ordinary posts; the other two types are processed in the same way and are not described again.
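The three stages of step S501 can be sketched as a short pipeline. The following Python is only an illustration under stated assumptions: the banned-word list, the thresholds, and the function names are hypothetical, and the link-count and word-frequency checks stand in for the predetermined semantic analysis conditions, whose exact form the embodiment does not specify.

```python
import re
from collections import Counter

# All word lists and thresholds are hypothetical placeholders.
LEVEL1_BANNED = {"badword"}   # level-1 banned words (lowercase)
MAX_LINKS = 3                 # assumed reference value for the number of links
MAX_WORD_FREQ = 0.5           # assumed frequency standard value (share of all words)

# Normalized texts already posted, keyed by IP address.
_seen_by_ip: dict[str, set[str]] = {}


def is_duplicate(ip: str, text: str) -> bool:
    """Stage 1: compare the post with earlier posts from the same IP address."""
    seen = _seen_by_ip.setdefault(ip, set())
    normalized = " ".join(text.lower().split())
    if normalized in seen:
        return True
    seen.add(normalized)
    return False


def hits_level1(text: str) -> bool:
    """Stage 2: substring match against the level-1 banned words."""
    lowered = text.lower()
    return any(word in lowered for word in LEVEL1_BANNED)


def fails_semantic_check(text: str) -> bool:
    """Stage 3: simplified stand-in for the semantic analysis conditions."""
    if len(re.findall(r"https?://\S+", text)) > MAX_LINKS:
        return True
    words = re.findall(r"\w+", text.lower())
    if len(words) >= 4:
        top_count = Counter(words).most_common(1)[0][1]
        if top_count / len(words) > MAX_WORD_FREQ:
            return True
    return False


def filter_post(ip: str, text: str) -> str:
    """Run the three stages in order and report the outcome."""
    if is_duplicate(ip, text):
        return "blocked: duplicate"
    if hits_level1(text):
        return "blocked: level-1 banned word"
    if fails_semantic_check(text):
        return "blocked: semantic analysis"
    return "pass to review"
```

The returned string could also serve as the text prompt shown to the poster when a post is blocked.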
Step S502: The review stage can be implemented by the security center's review platform and covers content review of text, pictures and creations. The post content is filtered by logic such as level-2 banned words and URL blacklists and whitelists, and each filtered post is scored. Posts scoring below the reference value are blocked automatically, and posts scoring at or above the reference value are published to the network;
Posts scoring below the reference value can be sent to the review platform, where manual review makes the final decision on whether the post contains spam content. If it does, the post is deleted outright; if not, the post is not deleted and is published directly to the network (the review result can also be reported to the poster by a text prompt).
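The scoring logic of step S502 can be illustrated as follows. This is a sketch under assumptions, not the scoring rule of the embodiment: the word list, the host lists, the penalty weights and the reference score are all hypothetical, since the embodiment states only that posts are scored after filtering and compared with a reference value.

```python
import re
from urllib.parse import urlparse

# Hypothetical lists and weights chosen for illustration only.
LEVEL2_BANNED = {"casino", "lottery"}   # level-2 banned words (lowercase)
URL_BLACKLIST = {"spam.example"}
URL_WHITELIST = {"trusted.example"}
REFERENCE_SCORE = 60                    # assumed reference value


def score_post(text: str) -> int:
    """Start from 100 and subtract a penalty for each filter hit."""
    score = 100
    lowered = text.lower()
    score -= 30 * sum(1 for word in LEVEL2_BANNED if word in lowered)
    for url in re.findall(r"https?://\S+", text):
        host = urlparse(url).hostname or ""
        if host in URL_WHITELIST:
            continue  # whitelisted hosts never cost points
        if host in URL_BLACKLIST:
            score -= 50
    return max(score, 0)


def review(text: str) -> str:
    """Publish at or above the reference score; otherwise block for manual review."""
    if score_post(text) >= REFERENCE_SCORE:
        return "publish"
    return "block and send to manual review"
```

Under these assumed weights, a single level-2 banned word leaves a post publishable, while a blacklisted link combined with a banned word pushes it below the reference score and into manual review.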
Step S503: After a post has been published successfully, background monitoring is performed on the posts published to the network. Abnormal data is recorded automatically, for example users whose posting frequency exceeds a set reference value, ordinary posts whose click count and reply count exceed set reference values, and poll posts whose option-selection count, click count and reply count exceed set reference values. The monitoring back end can automatically send the users and posts with abnormal data to the management server as an e-mail alert, and an administrator deletes them manually;
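The monitoring rule of step S503 can be sketched as follows. The thresholds, the `PostStats` record, the recipient address and the function names are hypothetical. The sketch also assumes that every listed counter of a post must exceed its reference value for the post to count as abnormal, which the text leaves open, and it only builds the alert message without sending it.

```python
from dataclasses import dataclass
from email.message import EmailMessage
from typing import Optional

# Hypothetical reference values.
MAX_POSTS_PER_HOUR = 20
MAX_CLICKS = 10_000
MAX_REPLIES = 500
MAX_OPTION_SELECTIONS = 5_000


@dataclass
class PostStats:
    post_id: str
    kind: str                 # "ordinary" or "poll"
    clicks: int
    replies: int
    option_selections: int = 0


def abnormal_users(posts_per_hour: dict[str, int]) -> list[str]:
    """Users whose posting frequency exceeds the reference value."""
    return sorted(u for u, n in posts_per_hour.items() if n > MAX_POSTS_PER_HOUR)


def is_abnormal_post(stats: PostStats) -> bool:
    """Assumption: all counters relevant to the post type must exceed their limits."""
    busy = stats.clicks > MAX_CLICKS and stats.replies > MAX_REPLIES
    if stats.kind == "poll":
        return busy and stats.option_selections > MAX_OPTION_SELECTIONS
    return busy


def build_alert(users: list[str], post_ids: list[str]) -> Optional[EmailMessage]:
    """Compose the e-mail alert for the management server, or None if nothing is abnormal."""
    if not users and not post_ids:
        return None
    msg = EmailMessage()
    msg["Subject"] = "Abnormal posting activity"
    msg["To"] = "admin@example.com"   # placeholder recipient
    msg.set_content("Users: " + ", ".join(users) + "\nPosts: " + ", ".join(post_ids))
    return msg
```

Delivery of the message (for example through an SMTP server) is left out because the embodiment does not describe it.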
Step S504: Posts that were not blocked in the content-filtering or review stages, and in which background monitoring found nothing abnormal, must have their spam content removed from the live site manually, either by keyword search or by direct inspection. Because manpower is limited, a search robot is needed to patrol and clean content newly added to the live site: robot banned words are added, and content matching a robot banned word is deleted.
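The patrol robot of step S504 amounts to a periodic scan of live content against a robot banned-word list. A minimal sketch, assuming live posts are held in a dictionary keyed by post ID and that banned words are stored in lowercase (both assumptions, as is the function name):

```python
def patrol(live_posts: dict[str, str], robot_banned: set[str]) -> list[str]:
    """Delete every live post matching a robot banned word; return the deleted IDs.

    `live_posts` maps post ID to text and is modified in place.
    `robot_banned` is expected to contain lowercase words or phrases.
    """
    to_delete = sorted(
        post_id
        for post_id, text in live_posts.items()
        if any(word in text.lower() for word in robot_banned)
    )
    for post_id in to_delete:
        del live_posts[post_id]
    return to_delete
```

Adding a new entry to `robot_banned` and running `patrol` again corresponds to the step of adding robot banned words described above.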
As can be seen, filtering post content with the method of the embodiment of the present invention gives broader coverage.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims (13)

1. A method for filtering spam content, characterized in that the method comprises:
judging posted content against predetermined semantic analysis conditions, and blocking, as spam content, the content in the posted content that satisfies the predetermined semantic analysis conditions;
publishing the posted content that has undergone the blocking processing to the network after review.
2. The method according to claim 1, characterized in that the predetermined semantic analysis conditions comprise:
(a) whether the posted content satisfies grammatical requirements; or
(b) whether the posted content has features that can describe spam vocabulary; or
(c) whether the posted content contains vocabulary whose frequency of occurrence is greater than a frequency standard value; or
(d) whether the number of network link addresses carried by the posted content exceeds a set reference value;
or any combination of (a), (b), (c) and (d) above.
3. The method according to claim 2, characterized in that the process of generating the predetermined semantic analysis conditions specifically comprises:
when the predetermined semantic analysis condition is (b), collecting a large amount of filtered spam content and classifying that content by category; and, for the content in each category, obtaining features that can describe spam vocabulary; or
when the predetermined semantic analysis condition is (c), recording the number of occurrences of spam vocabulary in spam content to obtain a frequency standard value with which it can be judged whether the posted content contains spam vocabulary; or
when the predetermined semantic analysis condition is (d), recording the number of network link addresses carried in spam content to obtain a reference value with which it can be determined whether the posted content contains spam content.
4. The method according to claim 1, characterized in that, before judging the posted content against the predetermined semantic analysis conditions and blocking, as spam content, the content in the posted content that satisfies the predetermined semantic analysis conditions, the method further comprises:
judging the repetitiveness of the posted content, and, when the posted content duplicates previously posted content, automatically blocking the duplicated content; otherwise, taking no action;
performing level-1 banned-word matching on the posted content that has passed the repetition judgment, and, if a match is found, blocking the posted content; otherwise, performing semantic analysis on the posted content that has passed the level-1 banned-word filtering.
5. The method according to claim 1, characterized in that publishing the posted content that has undergone the blocking processing to the network after review specifically comprises:
filtering the posted content that has undergone the blocking processing by level-2 banned words and a URL blacklist respectively, and scoring each post according to the filtering result; blocking posts whose score is less than a standard value and then sending them for manual review; and publishing posts whose score is greater than or equal to the standard value to the network.
6. The method according to claim 1, characterized in that the method further comprises:
performing background monitoring on the posts published to the network, automatically recording users whose posting frequency exceeds a set reference value, ordinary posts whose click count and reply count exceed set reference values, and poll posts whose option-selection count, click count and reply count exceed set reference values, and notifying the management server by means of an e-mail alert to handle them.
7. A device for filtering spam content, characterized in that the device comprises:
a semantic analysis execution module, configured to judge posted content against predetermined semantic analysis conditions and to block, as spam content, the content in the posted content that satisfies the predetermined semantic analysis conditions;
a review execution module, configured to publish to the network, after review, the posted content processed by the semantic analysis execution module.
8. The device according to claim 7, characterized in that the predetermined semantic analysis conditions comprise:
(e) whether the posted content satisfies grammatical requirements; or
(f) whether the posted content has features that can describe spam vocabulary; or
(g) whether the posted content contains vocabulary whose frequency of occurrence is greater than a frequency standard value; or
(h) whether the number of network link addresses carried by the posted content exceeds a set reference value;
or any combination of (e), (f), (g) and (h) above.
9. The device according to claim 8, characterized in that the device further comprises:
a condition generation module, configured to: when the predetermined semantic analysis condition is (f), collect a large amount of filtered spam content and classify that content by category, and, for the content in each category, obtain features that can describe spam vocabulary; or
when the predetermined semantic analysis condition is (g), record the number of occurrences of spam vocabulary in spam content to obtain a frequency standard value with which it can be judged whether the posted content contains spam vocabulary; or
when the predetermined semantic analysis condition is (h), record the number of network link addresses carried in spam content to obtain a reference value with which it can be determined whether the posted content contains spam content.
10. The device according to claim 7, characterized in that the semantic analysis execution module specifically comprises:
a judging unit, configured to judge whether the posted content contains content that satisfies the predetermined semantic analysis conditions;
a processing unit, configured to block, as spam content, the content in the posted content that satisfies the predetermined semantic analysis conditions, according to the judging unit's result for the posted content; otherwise, to take no action.
11. The device according to claim 7, characterized in that the device further comprises:
a repetition judgment processing module, configured to judge the repetitiveness of the posted content and, if the posted content duplicates previously posted content, to block the duplicated content automatically; otherwise, to take no action;
a level-1 banned-word filtering module, configured to match the posted content processed by the repetition judgment processing module against the level-1 banned words and, if a match is found, to block the posted content; otherwise, to send the posted content that has passed the level-1 banned-word filtering to the semantic analysis execution module for processing.
12. The device according to claim 7, characterized in that the review execution module specifically comprises:
a scoring unit, configured to filter the posted content that has been filtered by the semantic analysis execution module by level-2 banned words and a URL blacklist respectively, and to score each post according to the filtering result;
an execution unit, configured to act on the scoring result of the scoring unit by blocking posts whose score is less than a standard value and then sending them for manual review, and by publishing posts whose score is greater than or equal to the standard value to the network.
13. The device according to claim 7, characterized in that the device further comprises:
a background monitoring module, configured to perform background monitoring on the posts that the review execution module has published to the network, and to automatically record users whose posting frequency exceeds a set reference value, ordinary posts whose click count and reply count exceed set reference values, and poll posts whose option-selection count, click count and reply count exceed set reference values;
a monitoring alarm module, configured to notify the management server, by means of an e-mail alert, to handle the data recorded by the background monitoring module.
CNA2009100807325A 2009-03-26 2009-03-26 Method and apparatus for filtering rubbish contents Pending CN101510879A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2009100807325A CN101510879A (en) 2009-03-26 2009-03-26 Method and apparatus for filtering rubbish contents

Publications (1)

Publication Number Publication Date
CN101510879A true CN101510879A (en) 2009-08-19

Family

ID=41003143

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2009100807325A Pending CN101510879A (en) 2009-03-26 2009-03-26 Method and apparatus for filtering rubbish contents

Country Status (1)

Country Link
CN (1) CN101510879A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20090819