CN103793398A - Trash data detection method and device - Google Patents

Publication number
CN103793398A
Authority
CN
China
Prior art keywords
data
create contents
junk data
default
junk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210424029.3A
Other languages
Chinese (zh)
Other versions
CN103793398B (en)
Inventor
何小晨
杨娜
许春林
廖宇奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201210424029.3A
Publication of CN103793398A
Application granted
Publication of CN103793398B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and device for detecting trash (junk) data and belongs to the technical field of communication. The method comprises: obtaining user-created content; detecting whether the user-created content contains data meeting a preset trash data condition; and, if it does, determining that the user-created content is trash data. The device comprises an obtaining module, a detection module and a determination module. Because the method and device decide whether user-created content is trash data by checking it against preset trash data conditions, detection can be performed automatically, which reduces editors' workload, increases detection speed, allows large volumes of user-created content to be processed, and meets practical requirements.

Description

Method and apparatus for detecting junk data
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and apparatus for detecting junk data.
Background
With the development of Internet technology, the network has gradually become an important source of information. In the Web 2.0 era in particular, users participate in creating large amounts of content, and the volume of online information is growing geometrically. However, much user-generated content (UGC) is junk data, which seriously degrades the quality of online information. To improve that quality, user-generated content needs to be monitored: each item is checked to determine whether it is junk data, and junk data is then handled accordingly.
At present, junk data is detected as follows: user-generated content is obtained, and editorial staff check whether it conforms to preset document rules; if it does not, the content is determined to be junk data.
In the course of making the present invention, however, the inventors found that the prior art has at least the following problem:
The prior art relies on manual judgment by editorial staff, so junk data is detected slowly, large volumes of user-generated content are difficult to process, and practical requirements cannot be met.
Summary of the invention
To solve the problem in the prior art, embodiments of the present invention provide a method and apparatus for detecting junk data. The technical solution is as follows:
In one aspect, a method for detecting junk data is provided, the method comprising:
obtaining user-generated content;
detecting whether the user-generated content contains data meeting a preset junk data condition, the junk data condition comprising a junk data regular expression, a junk data repetition condition, a junk data database, or an image link;
if it contains data meeting the preset junk data condition, determining that the user-generated content is junk data.
In another aspect, an apparatus for detecting junk data is provided, the apparatus comprising:
an acquisition module, configured to obtain user-generated content;
a detection module, configured to detect whether the user-generated content contains data meeting a preset junk data condition, the junk data condition comprising a junk data regular expression, a junk data repetition condition, a junk data database, or an image link;
a determination module, configured to determine that the user-generated content is junk data when the detection result of the detection module is that it contains data meeting the preset junk data condition.
The technical solution provided by the embodiments of the present invention has the following beneficial effects:
Whether user-generated content is junk data is determined by detecting whether it contains data meeting a preset junk data condition. Detection is therefore automatic, which reduces the workload of editorial staff, speeds up junk data detection, allows large volumes of user-generated content to be processed, and meets practical requirements.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for detecting junk data provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a method for detecting junk data provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of an apparatus for detecting junk data provided by Embodiment 3 of the present invention;
Fig. 4 is a schematic structural diagram of another apparatus for detecting junk data provided by Embodiment 3 of the present invention.
Detailed description
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
Embodiment 1
This embodiment of the present invention provides a method for detecting junk data. Referring to Fig. 1, the method comprises:
101: Obtain user-generated content.
102: Detect whether the user-generated content contains data meeting a preset junk data condition, where the junk data condition comprises a junk data regular expression, a junk data repetition condition, a junk data database, or an image link.
103: If it contains data meeting the preset junk data condition, determine that the user-generated content is junk data.
Here, user-generated content refers to data that users publish in network applications such as online communities, blogs and microblogs.
Preferably, detecting whether the user-generated content contains data meeting a preset junk data condition comprises:
comparing the user-generated content with a preset junk data regular expression;
judging whether the user-generated content contains data matching the preset junk data regular expression;
if it does, determining that the user-generated content contains data meeting the preset junk data condition.
Preferably, detecting whether the user-generated content contains data meeting a preset junk data condition comprises:
using a preset statistical string frequency algorithm to detect repeated data in the user-generated content;
judging whether the repeated data in the user-generated content meets a preset junk data repetition condition;
if it does, determining that the user-generated content contains data meeting the preset junk data condition.
Preferably, detecting whether the user-generated content contains data meeting a preset junk data condition comprises:
comparing the user-generated content with a preset junk data database;
judging whether the user-generated content contains data corresponding to data in the preset junk data database;
if it does, determining that the user-generated content contains data meeting the preset junk data condition.
Preferably, detecting whether the user-generated content contains data meeting a preset junk data condition comprises:
detecting whether the user-generated content contains an image;
if it contains an image, judging whether the main domain of the image's source address is the same as the main domain of the hyperlink;
if they are not the same, determining that the user-generated content contains data meeting the preset junk data condition.
Preferably, after obtaining the user-generated content and before detecting whether it contains data meeting a preset junk data condition, the method further comprises:
normalizing the user-generated content.
With the method for detecting junk data described in this embodiment, whether user-generated content is junk data is determined by detecting whether it contains data meeting a preset junk data condition. Detection is therefore automatic, which reduces the workload of editorial staff, speeds up junk data detection, allows large volumes of user-generated content to be processed, and meets practical requirements. Moreover, detection can use a preset junk data regular expression, a preset statistical string frequency algorithm, a preset junk data database and other means; the detection methods are varied and cover multiple dimensions, so the junk data detected is more accurate and comprehensive.
It should be noted that the method described in this embodiment can be applied in environments that contain user-generated content, such as online communities and retrieval platforms; the application environment is not specifically limited.
Embodiment 2
This embodiment of the present invention provides a method for detecting junk data. Referring to Fig. 2, the method comprises:
201: Obtain user-generated content.
Here, user-generated content refers to data that users publish in network applications such as online communities, blogs and microblogs.
202: Normalize the user-generated content.
Specifically, normalizing the user-generated content includes typesetting it according to a specified format, converting traditional Chinese characters in it to simplified Chinese characters, and so on. Normalization reduces data redundancy in the user-generated content and improves data consistency.
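Step 202 can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not say which format is used for typesetting, so Unicode NFKC folding and whitespace collapsing stand in for it here, and `TRAD_TO_SIMP` is a hypothetical four-entry table rather than a real conversion dictionary.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny traditional-to-simplified table;
# a real system would need a complete conversion dictionary.
TRAD_TO_SIMP = {"臺": "台", "灣": "湾", "電": "电", "話": "话"}


def normalize(text: str) -> str:
    """Return a normalized copy of user-generated content (illustrative only)."""
    # Assumption: fold full-width letters and digits to their ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    # Map traditional characters to simplified ones where the table knows them.
    text = "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)
    # Assumption: collapse runs of whitespace so layout differences do not matter.
    return re.sub(r"\s+", " ", text).strip()
```

Normalizing first means that later checks, such as the regular expressions, see one canonical spelling of a QQ number or keyword instead of many visual variants.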
203: Detect whether the normalized user-generated content contains data meeting a preset junk data condition. If it does, perform 204; otherwise, perform 205.
Here, junk data refers to low-quality, meaningless data published by users and the like. The junk data condition comprises a junk data regular expression, a junk data repetition condition, a junk data database, an image link, and so on.
Preferably, detecting whether the normalized user-generated content contains data meeting a preset junk data condition comprises:
comparing the user-generated content with a preset junk data regular expression;
judging whether the user-generated content contains data matching the preset junk data regular expression;
if it does, determining that the user-generated content contains data meeting the preset junk data condition;
if it does not, determining that the user-generated content does not contain data meeting the preset junk data condition.
Junk data regular expressions can detect meaningless characters, meaningless numbers and the like. In this embodiment, meaningless characters and numbers mainly include QQ numbers, telephone numbers and website addresses. Specifically, the preset regular expressions in this embodiment are the following:
1) QQ number regular expression:
(([Qq] | (Q)) 1} (() | ()) *) { 1,2} ([:-] (:) | (-)) (() | ()) * (([) | (()) [1-9] [0-9] { 4, (()) | (])) | ([Qq] | (Q)) { 1,2} group (number) ([:-] | (:)) [1-9] [0-9] { 4, | (interchange) group (number) ([:-] | (:)) [1-9] [0-9] { 4, }
2) Telephone number regular expression:
0[1-9][0-9]{1,2}[-]?[1-9][0-9]{3}[]?[0-9]{3,4}|0?13[0-9]{1}[-]?[0-9]{4}[-]?[0-9]{4}|0?15[0-9]{1}[-]?[0-9]{4}[-]?[0-9]{4}|0?189[-]?[0-9]{4}[-]?[0-9]{4}|400-[0-9]{3}-[0-9]{4}
3) E-commerce website regular expression:
(taobao|paipai|tmall|okbuy){1}\\.com
4) Web address and mailbox regular expression:
(http://)(www\\.)?([0-9a-zA-Z][_0-9a-zA-Z-]+\\.){1,3}(cn|com|org|net){1}|[_0-9a-zA-Z-]+@([0-9a-zA-Z][_0-9a-zA-Z-]+\\.)+[a-zA-Z]{2,3}
It should be noted that the regular expressions are not limited to those above and can be set according to the actual application; no limitation is imposed here. Nor is regular-expression detection limited to meaningless characters and numbers; other things can be detected, such as forced reposting, for which the regular expression may be: turn | indispensable-| _ |---|-| _.
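As a minimal sketch of the regular-expression check, the snippet below transcribes expressions 3) and 4) into Python. The doubled backslashes of the original are written as single backslashes inside raw strings; the function name `matches_junk_regex` is an assumption made for illustration. The QQ and telephone expressions are left out because their published text appears to have lost characters.

```python
import re

# Expression 3): e-commerce websites.
ECOMMERCE = re.compile(r"(taobao|paipai|tmall|okbuy){1}\.com")

# Expression 4): web addresses and mailboxes.
URL_OR_EMAIL = re.compile(
    r"(http://)(www\.)?([0-9a-zA-Z][_0-9a-zA-Z-]+\.){1,3}(cn|com|org|net){1}"
    r"|[_0-9a-zA-Z-]+@([0-9a-zA-Z][_0-9a-zA-Z-]+\.)+[a-zA-Z]{2,3}"
)


def matches_junk_regex(text: str) -> bool:
    """True if any preset junk data regular expression matches somewhere in text."""
    return any(p.search(text) for p in (ECOMMERCE, URL_OR_EMAIL))
```

`search` is used rather than `match` because the condition is that the content contains matching data anywhere, not that the whole content matches.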
Preferably, detecting whether the user-generated content contains data meeting a preset junk data condition comprises:
using a preset statistical string frequency (nango) algorithm to detect repeated data in the user-generated content;
judging whether the repeated data in the user-generated content meets a preset junk data repetition condition;
if it does, determining that the user-generated content contains data meeting the preset junk data condition;
if it does not, determining that the user-generated content does not contain data meeting the preset junk data condition.
Specifically, among all substrings of the user-generated content, search for substrings whose maximum repeat length is greater than the preset repeated-substring threshold;
judge whether such substrings can be found;
if they can be found, count, among the substrings whose maximum repeat length is greater than the preset repeated-substring threshold, the number of substrings that have the same prefix;
judge whether the number of substrings having the same prefix is greater than the preset same-prefix count threshold;
if the number of substrings having the same prefix is greater than the preset same-prefix count threshold, determine that the user-generated content contains data meeting the preset junk data condition;
if the number of substrings having the same prefix is less than or equal to the preset same-prefix count threshold, count, among the substrings whose maximum repeat length is greater than the preset repeated-substring threshold, the number of substrings that have different prefixes;
judge whether the number of substrings having different prefixes is greater than the preset different-prefix count threshold;
if the number of substrings having different prefixes is greater than the preset different-prefix count threshold, determine that the user-generated content contains data meeting the preset junk data condition;
if the number of substrings having different prefixes is less than or equal to the preset different-prefix count threshold, judge whether the maximum repeat length is greater than the preset maximum repeat length threshold (REP_MZAX_LENGTH);
if it is greater than the preset maximum repeat length threshold, determine that the user-generated content contains data meeting the preset junk data condition;
if it is less than or equal to the preset maximum repeat length threshold, judge whether the position at which repetition starts in the substring corresponding to the maximum repeat length lies within the preset junk data position;
if it lies within the preset junk data position, determine that the user-generated content contains data meeting the preset junk data condition;
if no such substrings can be found, calculate the number of bigrams in the user-generated content and the length of the user-generated content;
judge whether the ratio of the number of bigrams in the user-generated content to the length of the user-generated content is less than the preset ratio threshold;
if it is less than the preset ratio threshold, determine that the user-generated content contains data meeting the preset junk data condition;
if it is greater than or equal to the preset ratio threshold, determine that the user-generated content does not contain data meeting the preset junk data condition.
Here, let the maximum repeat length (max_len) between two substrings, among all substrings of the user-generated content, be m. Specifically, in this embodiment, the preset repeated-substring threshold (max_len_setting) n is 100, the preset same-prefix count threshold is 2, the preset different-prefix count threshold is 2, the preset REP_MZAX_LENGTH is 300, the preset junk data position is within the first 200 bytes of the user-generated content, and the preset ratio threshold is 0.55. It should be noted that the thresholds are not limited to these values and can be set flexibly according to the actual application; no specific limitation is imposed here.
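The text does not spell out how the maximum repeat length m is computed. One common approach, shown here purely as an assumption, is to sort all suffixes of the content and take the longest common prefix of neighbouring suffixes. The function name `longest_repeat` is hypothetical, and lengths are counted in characters rather than GBK bytes.

```python
def longest_repeat(text: str) -> tuple:
    """Return (length, start) of the longest substring that occurs at least twice.

    start is the position of one occurrence; (0, -1) is returned if nothing repeats.
    Overlapping occurrences are allowed.
    """
    # Sort suffix start positions by the suffix text (simple, not optimized).
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    best_len, best_start = 0, -1
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        if n > best_len:
            best_len, best_start = n, min(a, b)
    return best_len, best_start
```

The returned length plays the role of m and can be compared with max_len_setting and REP_MZAX_LENGTH, while the returned start can be compared with the preset junk data position. This naive version is quadratic or worse in the content length, so a production system would want a proper suffix-array construction.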
Here, a bigram is a sequence of two adjacent elements. For example, in the sentence abcabc composed of the morphemes a, b and c, the bigrams are ab, bc and ca; repeated bigrams are counted only once. In actual processing the character encoding is GBK, and each character occupies 2 bytes.
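The bigram ratio can be illustrated with a short sketch. It counts distinct bigrams and divides by the content length, both measured in characters; whether the original measures length in characters or GBK bytes is not stated, so that choice is an assumption, as are the function names.

```python
RATIO_THRESHOLD = 0.55  # preset ratio threshold given in the text


def distinct_bigram_ratio(text: str) -> float:
    """Number of distinct bigrams divided by the length of text (in characters)."""
    if not text:
        return 0.0  # assumption: empty content has no bigrams
    bigrams = {text[i:i + 2] for i in range(len(text) - 1)}
    return len(bigrams) / len(text)


def is_repetitive(text: str) -> bool:
    """True when the bigram ratio falls below the preset ratio threshold."""
    return distinct_bigram_ratio(text) < RATIO_THRESHOLD
```

For abcabc the distinct bigrams are ab, bc and ca, so the ratio is 3 / 6 = 0.5, which is below 0.55 and would be treated as meeting the junk data condition.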
It should also be noted that, when searching for substrings whose maximum repeat length is greater than the preset repeated-substring threshold, if the repetition within a substring is caused by a repeated single-byte character (for example, a run such as "。。。。。"), that substring is not counted among the substrings whose maximum repeat length is greater than the preset repeated-substring threshold.
Here, the prefix mentioned above refers to the first character of a substring. It should also be noted that, when counting the substrings with the same (or different) prefix among the substrings whose maximum repeat length is greater than the preset repeated-substring threshold, if the distance between the positions of the prefixes (that is, the first characters) of two substrings is less than the preset position interval threshold, the two substrings are counted as one substring; only when the distance between the positions of the prefixes of two substrings is greater than the preset position interval threshold are they counted as two substrings. The preset position interval threshold can be set to 100, for example.
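The merging rule for nearby prefixes can be sketched as below. The sketch assumes that the start positions of the candidate substrings are already known, compares each position with the last one that was counted, and treats a distance exactly equal to the threshold as "not greater", since the text leaves that case open. The name `count_after_merging` is hypothetical.

```python
def count_after_merging(positions, min_gap=100):
    """Count substrings, merging those whose prefixes start close together.

    A substring is counted only if its start lies more than min_gap after
    the start of the previously counted substring.
    """
    count, last = 0, None
    for pos in sorted(positions):
        if last is None or pos - last > min_gap:
            count += 1
            last = pos
    return count
```

With starts at 0, 50 and 400, for example, the first two are merged and the result is 2.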
In addition, it should be noted that user-generated content such as lyrics or poems legitimately contains repetition. To avoid having the statistical string frequency (nango) algorithm classify such content as junk data, the following can also be performed before searching, among all substrings of the user-generated content, for substrings whose maximum repeat length is greater than the preset repeated-substring threshold:
count the number of punctuation marks and the number of preset characters in the user-generated content;
judge whether the number of punctuation marks is less than the preset punctuation count threshold, and whether the number of preset characters is less than the preset character count threshold;
if the number of punctuation marks is less than the preset punctuation count threshold and the number of preset characters is less than the preset character count threshold, determine that the user-generated content is not junk data; otherwise, perform the step of searching, among all substrings of the user-generated content, for substrings whose maximum repeat length is greater than the preset repeated-substring threshold.
Here, the punctuation marks can be the half-width ",", denoted c, and the half-width ".", denoted p. Correspondingly, let the preset punctuation count threshold for c be 1 and the preset punctuation count threshold for p be 3. The preset characters can be the characters corresponding to codes 33-47 and 55-63 in the ASCII (American Standard Code for Information Interchange) table; correspondingly, let the preset character count threshold be 15. In practice, the total number of all punctuation marks can be counted against a single punctuation count threshold, or each punctuation mark can be counted separately against its own threshold. Likewise, the total number of all preset characters can be counted against a single character count threshold, or each preset character can be counted separately against its own threshold.
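A rough sketch of this pre-check follows, using the per-mark variant for punctuation and a single total for the preset characters. The thresholds and ASCII ranges are the ones quoted in the text; the function name `skip_repetition_check` is an assumption.

```python
COMMA_THRESHOLD = 1    # threshold for the half-width "," (c)
PERIOD_THRESHOLD = 3   # threshold for the half-width "." (p)
CHAR_THRESHOLD = 15    # threshold for the preset characters
# ASCII codes 33-47 and 55-63, as stated in the text.
PRESET_CHARS = {chr(c) for c in (*range(33, 48), *range(55, 64))}


def skip_repetition_check(text: str) -> bool:
    """True if the content looks like lyrics or a poem and is treated as not junk."""
    commas = text.count(",")
    periods = text.count(".")
    preset = sum(ch in PRESET_CHARS for ch in text)
    return (commas < COMMA_THRESHOLD
            and periods < PERIOD_THRESHOLD
            and preset < CHAR_THRESHOLD)
```

When this returns True the repetition search is skipped; otherwise the search over substrings proceeds as described.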
Preferably, detecting whether the user-generated content contains data meeting a preset junk data condition comprises:
comparing the user-generated content with a preset junk data database;
judging whether the user-generated content contains data corresponding to data in the preset junk data database;
if it does, determining that the user-generated content contains data meeting the preset junk data condition;
if it does not, determining that the user-generated content does not contain data meeting the preset junk data condition.
Specifically, data such as forced-repost words, pornographic words, hot words (popular vocabulary: hot posts in the community, or hot-topic words artificially promoted by the product) and malicious website link addresses can be collected in advance and stored in the preset junk data database. The user-generated content is then compared with the preset junk data database to judge whether it contains junk data. Forced-repost words include, for example: "if you don't repost, your whole family will die", "whoever doesn't repost is not human", "you must 'repost' or 'share' before you can see the full text", and so on.
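The database comparison can be sketched as a plain containment test. The entries in `JUNK_DATABASE` are hypothetical placeholders, and a linear scan over an in-memory set stands in for whatever storage and matching the real database uses.

```python
# Hypothetical entries; a real database would be collected in advance.
JUNK_DATABASE = frozenset({
    "whoever doesn't repost is not human",
    "http://malicious.example/",
})


def contains_database_entry(text: str, database=JUNK_DATABASE) -> bool:
    """True if any entry of the junk data database occurs in text."""
    return any(entry in text for entry in database)
```

For a large database, a multi-pattern matching structure would likely be preferable to scanning every entry, but the decision rule stays the same.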
Preferably, detecting whether the user-generated content contains data meeting a preset junk data condition comprises:
detecting whether the user-generated content contains an image;
if it contains an image, judging whether the main domain of the image's source address is the same as the main domain of the hyperlink;
if they are not the same, determining that the user-generated content contains data meeting the preset junk data condition;
if they are the same, determining that the user-generated content does not contain data meeting the preset junk data condition.
Specifically, if the user-generated content contains content of the form <a href="***"><img src="***"></a>, the user-generated content contains an image, and it is judged whether the main domain of the image's src (source address) is the same as the main domain of the hyperlink a.
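The image-link check might look roughly like the sketch below, built on Python's standard `html.parser` and `urllib.parse`. The text does not define "main domain"; taking the last two labels of the host name is a simplifying assumption that is wrong for suffixes such as .com.cn. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def main_domain(url: str) -> str:
    """Crude 'main domain': the last two labels of the host name (assumption)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


class LinkedImages(HTMLParser):
    """Collect (href, src) pairs for every <img> nested inside an <a>."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.href = attrs.get("href")
        elif tag == "img" and self.href and attrs.get("src"):
            self.pairs.append((self.href, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "a":
            self.href = None


def has_mismatched_image_link(html: str) -> bool:
    """True if some linked image is hosted on a different main domain than its link."""
    parser = LinkedImages()
    parser.feed(html)
    return any(main_domain(h) != main_domain(s) for h, s in parser.pairs)
```

Relative image addresses have no host name and would therefore count as a mismatch in this sketch; a fuller version would resolve them against the page address first.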
It should be noted that, in practical applications, any one of the above methods can be used for detection, or several of them can be combined in any way; no specific limitation is imposed here, and the choice can be made flexibly according to the actual application.
It should also be noted that, if the user-generated content comprises a title and a body, the same method can be applied to both, or different methods can be applied to the title and the body respectively; no specific limitation is imposed here, and the choice can be made flexibly according to the actual application.
204: Determine that the user-generated content is junk data, then end.
205: Determine that the user-generated content is not junk data, then end.
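Steps 201 to 205 can be summarized in a small driver. It is a sketch under the assumption that each detection method is packaged as a function from text to a boolean; the names `is_junk` and `Detector` are hypothetical.

```python
from typing import Callable, Iterable

Detector = Callable[[str], bool]


def is_junk(content: str,
            detectors: Iterable[Detector],
            normalize: Callable[[str], str] = lambda s: s) -> bool:
    """Return True (step 204) if any detector fires, otherwise False (step 205)."""
    text = normalize(content)               # step 202
    return any(d(text) for d in detectors)  # step 203
```

Passing detectors in as a list reflects the note that the methods can be used alone or combined freely; for content with a title and a body, the function could be called once per part with different detector lists.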
With the method for detecting junk data described in this embodiment, whether user-generated content is junk data is determined by detecting whether it contains data meeting a preset junk data condition. Detection is therefore automatic, which reduces the workload of editorial staff, speeds up junk data detection, allows large volumes of user-generated content to be processed, and meets practical requirements. Moreover, detection can use a preset junk data regular expression, a preset statistical string frequency algorithm, a preset junk data database and other means; the detection methods are varied and cover multiple dimensions, so the junk data detected is more accurate and comprehensive.
Embodiment tri-
Referring to Fig. 3, the embodiment of the present invention provides a kind of device that detects junk data, and this device comprises:
Acquisition module 301, for obtaining user's create contents;
Whether detection module 302, in user's create contents, contain the data that meet default junk data condition, and junk data condition comprises junk data regular expression, junk data repeat condition, junk data storehouse or picture link;
Determination module 303, is to contain the data that meet default junk data condition for the testing result when detection module 302, determines that user's create contents is junk data.
Preferably, detection module 302 comprises:
The first comparing unit, for comparing user's create contents and default junk data regular expression;
The first judging unit, for judging user's create contents, whether contain the data that meet default junk data regular expression;
The first determining unit, is to contain for the judged result when the first judging unit, determines in user's create contents and contains the data that meet default junk data condition.
Preferably, detection module 302 comprises:
The first detecting unit, for utilizing default statistical string frequency algorithm, detects the repeating data in user's create contents;
Whether the second judging unit, for judging the repeating data of user's create contents, meet default junk data repeat condition;
The second determining unit, is to meet for the judged result when the second judging unit, determines in user's create contents and contains the data that meet default junk data condition.
Preferably, detection module 302 comprises:
The second comparing unit, for comparing user's create contents and default junk data storehouse;
The 3rd judging unit, for judging user's create contents, whether contain with default junk data storehouse in the corresponding data of data;
The 3rd determining unit, is to contain for the judged result when the 3rd judging unit, determines in user's create contents and contains the data that meet default junk data condition.
Preferably, detection module 302 comprises:
The second detecting unit, for detection of whether containing picture in user's create contents;
The 4th judging unit, for being to contain picture when the testing result of the second detecting unit, whether the source address that judges picture links main territory identical with the main territory of hyperlink;
The 4th determining unit, is not identical for the judged result when the 4th judging unit, determines in user's create contents and contains the data that meet default junk data condition.
Preferably, referring to Fig. 4, this device also comprises:
Processing module 304, for after acquisition module 301 obtains user's create contents, detection module 302 detects in user's create contents, before whether containing the data that meet default junk data condition, user's create contents is carried out to standardization processing.
The device of the detection junk data described in the embodiment of the present invention, by detecting in user's create contents whether contain the data that meet default junk data condition, determine whether user's create contents is junk data, can automatically detect, can reduce editorial staff's workload, improve the speed that detects junk data, can process a large number of users create contents, can practical requirement.And, can detect by default junk data regular expression, the default mode such as statistical string frequency algorithm, default junk data storehouse, detection mode is various, and can detect from multiple dimensions, makes the junk data that detects more accurately comprehensively.
It should be noted that: the device of the detection junk data that above-described embodiment provides, only be illustrated with the division of above-mentioned each functional module, in practical application, can above-mentioned functions be distributed and completed by different functional modules as required, be divided into different functional modules by the inner structure of equipment, to complete all or part of function described above.In addition, the device of the detection junk data that above-described embodiment provides belongs to same design with the embodiment of the method that detects junk data, and its specific implementation process refers to embodiment of the method, repeats no more here.
The invention described above embodiment sequence number, just to describing, does not represent the quality of embodiment.
One of ordinary skill in the art will appreciate that all or part of step that realizes above-described embodiment can complete by hardware, also can carry out the hardware that instruction is relevant by program completes, described program can be stored in a kind of computer-readable recording medium, the above-mentioned storage medium of mentioning can be ROM (read-only memory), disk or CD etc.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. A method for detecting junk data, characterized in that the method comprises:
obtaining user-generated content;
detecting whether the user-generated content contains data meeting a preset junk data condition, the junk data condition comprising a junk data regular expression, a junk data repeat condition, a junk data library, or a picture link;
if the user-generated content contains data meeting the preset junk data condition, determining that the user-generated content is junk data.
2. The method according to claim 1, characterized in that detecting whether the user-generated content contains data meeting a preset junk data condition comprises:
comparing the user-generated content with a preset junk data regular expression;
judging whether the user-generated content contains data matching the preset junk data regular expression;
if so, determining that the user-generated content contains data meeting the preset junk data condition.
3. The method according to claim 1, characterized in that detecting whether the user-generated content contains data meeting a preset junk data condition comprises:
detecting repeated data in the user-generated content by using a preset string frequency statistics algorithm;
judging whether the repeated data in the user-generated content meets a preset junk data repeat condition;
if so, determining that the user-generated content contains data meeting the preset junk data condition.
4. The method according to claim 1, characterized in that detecting whether the user-generated content contains data meeting a preset junk data condition comprises:
comparing the user-generated content with a preset junk data library;
judging whether the user-generated content contains data corresponding to data in the preset junk data library;
if so, determining that the user-generated content contains data meeting the preset junk data condition.
5. The method according to claim 1, characterized in that detecting whether the user-generated content contains data meeting a preset junk data condition comprises:
detecting whether the user-generated content contains a picture;
if a picture is contained, judging whether the main domain of the picture's source address is identical to the main domain of the hyperlink;
if they are not identical, determining that the user-generated content contains data meeting the preset junk data condition.
6. The method according to any one of claims 1 to 5, characterized in that, after obtaining the user-generated content and before detecting whether the user-generated content contains data meeting a preset junk data condition, the method further comprises:
normalizing the user-generated content.
7. A device for detecting junk data, characterized in that the device comprises:
an acquisition module, configured to obtain user-generated content;
a detection module, configured to detect whether the user-generated content contains data meeting a preset junk data condition, the junk data condition comprising a junk data regular expression, a junk data repeat condition, a junk data library, or a picture link;
a determination module, configured to determine that the user-generated content is junk data when the detection result of the detection module is that data meeting the preset junk data condition is contained.
8. The device according to claim 7, characterized in that the detection module comprises:
a first comparing unit, configured to compare the user-generated content with a preset junk data regular expression;
a first judging unit, configured to judge whether the user-generated content contains data matching the preset junk data regular expression;
a first determining unit, configured to determine, when the judgment result of the first judging unit is that such data is contained, that the user-generated content contains data meeting the preset junk data condition.
9. The device according to claim 7, characterized in that the detection module comprises:
a first detecting unit, configured to detect repeated data in the user-generated content by using a preset string frequency statistics algorithm;
a second judging unit, configured to judge whether the repeated data in the user-generated content meets a preset junk data repeat condition;
a second determining unit, configured to determine, when the judgment result of the second judging unit is that the condition is met, that the user-generated content contains data meeting the preset junk data condition.
10. The device according to claim 7, characterized in that the detection module comprises:
a second comparing unit, configured to compare the user-generated content with a preset junk data library;
a third judging unit, configured to judge whether the user-generated content contains data corresponding to data in the preset junk data library;
a third determining unit, configured to determine, when the judgment result of the third judging unit is that such data is contained, that the user-generated content contains data meeting the preset junk data condition.
11. The device according to claim 7, characterized in that the detection module comprises:
a second detecting unit, configured to detect whether the user-generated content contains a picture;
a fourth judging unit, configured to judge, when the detection result of the second detecting unit is that a picture is contained, whether the main domain of the picture's source address is identical to the main domain of the hyperlink;
a fourth determining unit, configured to determine, when the judgment result of the fourth judging unit is that they are not identical, that the user-generated content contains data meeting the preset junk data condition.
12. The device according to any one of claims 7 to 11, characterized in that the device further comprises:
a processing module, configured to normalize the user-generated content after acquisition module 301 obtains the user-generated content and before the detection module detects whether the user-generated content contains data meeting a preset junk data condition.
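The picture link check of claims 5 and 11 could be sketched as below. Several points are assumptions rather than statements from the claims: "the hyperlink" is taken to be the `<a href>` element that wraps the picture, the "main domain" is approximated naively as the last two labels of the host name (which is wrong for suffixes such as `.com.cn`), and relative URLs are skipped instead of being resolved. The names `main_domain` and `has_mismatched_picture_link` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def main_domain(url: str) -> str:
    """Naive main-domain guess: the last two labels of the host name."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


class _LinkedImageParser(HTMLParser):
    """Collects (href, src) pairs for pictures nested inside hyperlinks."""

    def __init__(self):
        super().__init__()
        self._href_stack = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self._href_stack.append(attributes.get("href") or "")
        elif tag == "img" and self._href_stack and attributes.get("src"):
            self.pairs.append((self._href_stack[-1], attributes["src"]))

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            self._href_stack.pop()


def has_mismatched_picture_link(html: str) -> bool:
    """True if a linked picture's main domain differs from its hyperlink's."""
    parser = _LinkedImageParser()
    parser.feed(html)
    parser.close()
    for href, src in parser.pairs:
        link_domain, picture_domain = main_domain(href), main_domain(src)
        # Only compare when both URLs carry a host name (assumption).
        if link_domain and picture_domain and link_domain != picture_domain:
            return True
    return False
```

A production implementation would more likely derive the main domain from a public suffix list than from the last two host labels.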
CN201210424029.3A 2012-10-30 2012-10-30 The method and apparatus for detecting junk data Active CN103793398B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210424029.3A CN103793398B (en) 2012-10-30 2012-10-30 The method and apparatus for detecting junk data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210424029.3A CN103793398B (en) 2012-10-30 2012-10-30 The method and apparatus for detecting junk data

Publications (2)

Publication Number Publication Date
CN103793398A true CN103793398A (en) 2014-05-14
CN103793398B CN103793398B (en) 2018-09-04

Family

ID=50669081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210424029.3A Active CN103793398B (en) 2012-10-30 2012-10-30 The method and apparatus for detecting junk data

Country Status (1)

Country Link
CN (1) CN103793398B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101415159A (en) * 2008-12-02 2009-04-22 腾讯科技(深圳)有限公司 Method and apparatus for intercepting junk mail
CN101534261A (en) * 2009-04-10 2009-09-16 阿里巴巴集团控股有限公司 A method, device and system of recognizing spam information
CN101667979A (en) * 2009-10-12 2010-03-10 哈尔滨工程大学 System and method for anti-phishing emails based on link domain name and user feedback
CN102402537A (en) * 2010-09-15 2012-04-04 盛乐信息技术(上海)有限公司 Chinese web page text deduplication system and method

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268220A (en) * 2014-09-25 2015-01-07 北京金山安全软件有限公司 Method and device for cleaning junk files of audio and video applications
CN104268220B (en) * 2014-09-25 2017-07-28 北京金山安全软件有限公司 Method and device for cleaning junk files of audio and video applications
CN106021231A (en) * 2016-05-24 2016-10-12 武汉斗鱼网络科技有限公司 Repeated chatting content detection method and device
CN106021231B (en) * 2016-05-24 2019-03-05 武汉斗鱼网络科技有限公司 A kind of detection repeats the method and device of chat content
CN109284467A (en) * 2018-09-14 2019-01-29 阿里巴巴集团控股有限公司 A kind of user generated content (UGC) number of repetition determines method and device

Also Published As

Publication number Publication date
CN103793398B (en) 2018-09-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant