CN106227808B - Method for removing interference information from mail and method for identifying spam - Google Patents

Info

Publication number
CN106227808B
Authority
CN
China
Prior art keywords
information
character
mail
text information
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610584290.8A
Other languages
Chinese (zh)
Other versions
CN106227808A (en
Inventor
徐慧灵
纪春来
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huayun Data Co ltd
Original Assignee
Xiamen Rong Neng Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Rong Neng Technology Co Ltd
Priority to CN201610584290.8A
Publication of CN106227808A
Application granted
Publication of CN106227808B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion
    • G06F16/86 Mapping to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a method for removing interference information from mail and a method for identifying spam. The method for removing interference information comprises: obtaining the html content contained in a mail; building a document object model (DOM) from the html content, applying at least one of the following interference identification processes to the document object model, and then converting the html content into text information, the interference identification processes comprising color block interference identification, font size interference identification, and table interference identification; and recombining the content of the processed text information. The invention separates the interference information contained in spam from the text information and accurately identifies the various kinds of interference information deliberately embedded in spam, which provides an accurate basis for subsequently deciding whether a mail is spam and improves the interception and filtering of spam.

Description

Method for removing interference information from mail and method for identifying spam
Technical field
The present invention relates to the field of anti-spam technology, and in particular to a method for removing interference information from mail and to a method for identifying spam that is based on the method for removing interference information.
Background art
With the development of the Internet, spam causes increasing harm to users. Spam typically includes promotional mail and mail carrying pornographic or other harmful information. Various methods for identifying and filtering spam, together with filtering mechanisms on background servers, have therefore appeared in the prior art.
The mainstream anti-spam methods at present mainly include: (1) Optical character recognition (OCR), which extracts the content of advertising pictures or plain text and judges from that content whether it is advertising, thereby identifying spam; this technique, however, imposes a large computational overhead. (2) Mail detection based on MD5 checksums, which hashes a character string of arbitrary length into a shorter fixed-length value. Because two different character strings do not, in practice, have the same MD5 value, comparing the MD5 values of two strings shows whether the strings are identical. However, this technique requires the mail contents to be strictly identical: any change in the content produces a different MD5 value, which seriously impairs the decision on whether the mail is spam and the subsequent filtering and interception.
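The weakness of the MD5 approach described above can be illustrated with a minimal sketch. The sample strings and the helper name `md5_hex` are hypothetical and chosen only for illustration; the sketch shows that swapping a single character yields an unrelated digest.

```python
import hashlib

def md5_hex(text: str) -> str:
    """Return the hexadecimal MD5 digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Hypothetical sample messages: the second replaces the digit 1 with the letter I.
original = "Invoices available, contact QQ 1980021"
obfuscated = "Invoices available, contact QQ I980021"

print(md5_hex(original) == md5_hex(original))    # same text, same digest
print(md5_hex(original) == md5_hex(obfuscated))  # one character changed, digests differ
```

A filter that only compares digests therefore misses the second message even though a reader sees the same content.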
Moreover, anti-spam techniques in the prior art scan the preset text or pictures contained in a mail directly. Legitimately sent mail must then undergo the same inspection and filtering, which increases the computational overhead of the background server or web search engine. A method for pre-processing mail that may be identified as spam is therefore highly desirable, so as to avoid blindly applying spam determination, interception, and deletion to all mail and to improve the efficiency of spam interception.
In addition, if a spammer adds interference characters to a spam message or rearranges the way its content is displayed, existing anti-spam systems have difficulty recognizing the message as spam, which greatly reduces the interception efficiency.
In view of this, the prior-art methods for pre-processing the interference information contained in spam need to be improved in order to solve the above problems.
Summary of the invention
An object of the present invention is to disclose a method for removing interference information from spam, so as to avoid blindly applying spam determination, interception, and deletion to all mail and to improve the efficiency of spam interception. Another object of the invention is to disclose a method for identifying spam that improves the efficiency with which mail containing interference information is determined to be spam, and thereby improves the interception and filtering of spam.
To achieve the first object, the present invention provides a method for removing interference information from mail, comprising:
S1, obtaining the html content contained in the mail;
S2, building a document object model from the html content, applying at least one of the following interference identification processes to the document object model, and then converting the html content into text information, the interference identification processes comprising: color block interference identification, font size interference identification, and table interference identification;
S3, recombining the content of the processed text information.
As a further improvement of the present invention, the interference identification processes in step S2 further comprise sensitive word interference identification.
As a further improvement of the present invention, converting the html content into text information in step S2 specifically comprises deleting the tags from the html content so as to extract the text information in the html content.
As a further improvement of the present invention, the sensitive word interference identification comprises: conversion between uppercase and lowercase letters, conversion between standard and non-standard characters, and conversion between letters and digits.
As a further improvement of the present invention, the conversion between uppercase and lowercase letters specifically comprises: checking the text information character by character and, when the ASCII code value of a character lies in [65,90], increasing the ASCII code value of that character by 32.
As a further improvement of the present invention, the conversion between standard and non-standard characters specifically comprises: checking the text information character by character, and replacing the data value of each non-standard character in the text information with the corresponding standard character according to the Unicode code table.
As a further improvement of the present invention, the conversion between letters and digits specifically comprises: checking the text information character by character, and substituting between letters and digits according to the ASCII code table.
As a further improvement of the present invention, building the document object model from the html content in step S2 comprises the following steps:
S21, taking the html content as input and parsing it into a plurality of tokens;
S22, building a DOM tree model from the tokens, the DOM tree model comprising several tag nodes, each of which contains the attribute information matching that node;
S23, traversing the DOM tree model and extracting the segments of the html content that match the attribute information of the tag nodes;
S24, extracting the text information from each segment and selecting the matching order according to the attribute information of the tag nodes, so as to form continuous text information.
As a further improvement of the present invention, step S21 specifically comprises: traversing and parsing the html content according to preset tokenization rules, after which a token generator recognizes the tokens and passes them to the DOM tree model builder.
As a further improvement of the present invention, the tokenization rules cover the start-position token of the html content, the end-position token of the html content, category attributes, attribute names, and attribute values.
As a further improvement of the present invention, the category attributes comprise a font size attribute, a font slant attribute, a horizontal font arrangement difference attribute, a vertical font arrangement difference attribute, an RGB difference attribute, a gray value difference attribute, a color saturation attribute, and a contrast difference attribute.
As a further improvement of the present invention, step S3 comprises the following steps:
S31, converting the encoding of the processed text information with an encoding converter;
S32, selecting, according to coding ranges of set length, the header information, middle information, and tail information of the recombined text information;
S33, arranging the header information, middle information, and tail information in sequence into continuous recombined text information.
As a further improvement of the present invention, after step S33 the method further comprises performing one or more of the following operations on the recombined text information:
removing space characters;
removing carriage return characters;
removing line feed characters; wherein
removing space characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 32;
removing carriage return characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 13;
removing line feed characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 10.
As a further improvement of the present invention, before step S1 is performed the method further comprises a step of converting the html content to Unicode encoding with an encoder.
To achieve the second object, the present invention also provides a method for identifying spam, comprising the method for removing interference information from mail according to any of the above; and
comparing the recombined text information with the keyword library stored in a database, and determining whether the mail is spam.
As a further improvement of the present invention, the database is an Oracle database, a DB2 database, a PostgreSQL database, a Microsoft SQL Server database, a Microsoft Access database, or a MySQL database.
Compared with the prior art, the beneficial effects of the present invention are as follows: the invention separates the interference information contained in spam from the text information and accurately identifies the various kinds of interference information deliberately embedded in spam, which provides an accurate basis for subsequently deciding whether a mail is spam and effectively improves the interception and filtering of spam.
Brief description of the drawings
Fig. 1 is a flow chart of the method for removing interference information from mail according to the present invention;
Fig. 2 is a schematic diagram of unprocessed html content containing interference information such as English letters and digits;
Fig. 3 is a schematic diagram of the html content of Fig. 2 after sensitive word interference identification has been applied to the interference information such as English letters and digits;
Fig. 4 is a schematic diagram of unprocessed html content containing interference information in fonts of different sizes;
Fig. 5 shows the DOM tree model used for font size interference identification after the document object model has been built from the html content in step S22;
Fig. 6 is a schematic diagram of unprocessed html content containing interference information in blocks of different colors;
Fig. 7 shows the DOM tree model used for color block interference identification after the document object model has been built from the html content in step S22;
Fig. 8 is a schematic diagram of unprocessed html content containing table interference information;
Fig. 9 shows the DOM tree model used for table interference identification after the document object model has been built from the html content in step S22;
Fig. 10 is a schematic diagram of the processed text information obtained in Fig. 5, Fig. 7, or Fig. 9 before content recombination;
Fig. 11 is a schematic diagram of the processed text information obtained in Fig. 5, Fig. 7, or Fig. 9 after content recombination;
Fig. 12 is a schematic diagram of text information containing meaningless-character interference information before content recombination;
Fig. 13 is a schematic diagram of the text information of Fig. 12 after content recombination.
Detailed description of the embodiments
The present invention is described in detail below with reference to the embodiments shown in the drawings. It should be noted that these embodiments do not limit the invention; equivalent transformations or substitutions of function, method, or structure made by a person of ordinary skill in the art on the basis of these embodiments all fall within the scope of protection of the present invention.
When prior-art spam interception systems or software decide whether a target mail is spam, they cannot remove the interference information that the spammer has added to the mail. This significantly degrades spam filtering, and some non-spam mail may even be identified as spam. The application scenarios and implementation processes shown in this detailed description merely illustrate the summary of the invention and therefore do not limit its scope of protection or its purpose.
Referring to Fig. 1, in the present embodiment the method for removing interference information from mail comprises the following steps: S1, obtaining the html content contained in the mail; S2, building a document object model from the html content, applying at least one of the following interference identification processes to the document object model, and then converting the html content into text information, the interference identification processes comprising: color block interference identification, font size interference identification, and table interference identification; S3, recombining the content of the processed text information.
If the mail contains HTML, the HTML content is obtained and converted into text information. The conversion deletes the tags from the HTML content, that is, it deletes html tag information such as "<tag name attribute name=attribute value></tag name>"; the remaining content is the mail content. If the mail contains no HTML, the plain-text information of the mail is used as the mail content.
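The tag deletion described above can be sketched with Python's standard `html.parser` module. This is a minimal illustration rather than the patented implementation; the names `TagStripper` and `mail_text` are hypothetical, and the sketch does not treat `script` or `style` content specially.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects character data and discards every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def mail_text(html_body, plain_body=""):
    """Return the text of the HTML part if present, else the plain-text part."""
    if not html_body:
        return plain_body
    stripper = TagStripper()
    stripper.feed(html_body)
    stripper.close()
    return "".join(stripper.parts)

print(mail_text('<p style="color:red">Hello <b>world</b></p>'))  # prints Hello world
```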
The interference commonly found in spam includes:
mutual substitution of the English letter "I" and the Arabic numeral "1", mutual substitution of the English letter "O" (uppercase or lowercase) and the Arabic numeral "0", background color interference, and interference in how the text information is arranged. If a mail deliberately swaps the English letter "I" and the Arabic numeral "1" within continuous Chinese or English text in order to evade anti-spam interception, the mail is likely to be spam. To improve how well existing anti-spam systems or software intercept and filter such spam, this interference must be removed and the text that the mail actually carries must be extracted; a prior-art anti-spam system or anti-spam software can then intercept and filter the mail and prevent the spam from being delivered to the server.
In the present embodiment, in addition to the three kinds of interference processing above, the interference identification in step S2 further comprises sensitive word interference identification. Converting the html content into text information in step S2 specifically comprises deleting the tags from the html content so as to extract the text information it contains.
Referring to Fig. 2 and Fig. 3, in the present embodiment the sensitive word interference identification comprises: conversion between uppercase and lowercase letters, conversion between standard and non-standard characters, and conversion between letters and digits.
The conversion between uppercase and lowercase letters specifically comprises: checking the text information character by character and, when the ASCII code value of a character lies in [65,90], increasing it by 32 so that it falls in [97,122], which completes the case conversion of English letters.
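The case conversion above amounts to adding 32 to every code value in [65,90]. A minimal sketch follows; the function name `to_lowercase_ascii` is hypothetical.

```python
def to_lowercase_ascii(text):
    """Add 32 to every code value in [65, 90]; leave all other characters unchanged."""
    out = []
    for ch in text:
        code = ord(ch)
        if 65 <= code <= 90:      # 'A'..'Z'
            code += 32            # maps to 'a'..'z' in [97, 122]
        out.append(chr(code))
    return "".join(out)

print(to_lowercase_ascii("Contact QQ: ABC-123"))  # prints contact qq: abc-123
```

Unlike `str.lower()`, this touches only the 26 ASCII uppercase letters, which matches the range stated in the text.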
The conversion between standard and non-standard characters specifically comprises: checking the text information character by character and replacing the data value of each non-standard character in the text information with the corresponding standard character according to the Unicode code table. For example, if the input text contains the code point 0x2776, i.e. "❶", it is converted to ASCII code 49, i.e. the digit 1.
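One way to sketch this conversion is a lookup table from non-standard code points to standard digits. The table below is an assumption for illustration: it covers only three Unicode blocks (negative circled digits, circled digits, and fullwidth digits), whereas the source does not list the exact set of characters it handles. The names `build_digit_table` and `normalize_digits` are hypothetical.

```python
def build_digit_table():
    """Map a few families of decorated digits to plain ASCII digits."""
    table = {}
    for i in range(9):
        table[0x2776 + i] = str(i + 1)   # U+2776..U+277E: negative circled digits 1-9
        table[0x2460 + i] = str(i + 1)   # U+2460..U+2468: circled digits 1-9
    for i in range(10):
        table[0xFF10 + i] = str(i)       # U+FF10..U+FF19: fullwidth digits 0-9
    return table

DIGIT_TABLE = build_digit_table()

def normalize_digits(text):
    """Replace every mapped non-standard digit with its ASCII counterpart."""
    return text.translate(DIGIT_TABLE)

print(ord(normalize_digits("\u2776")))  # prints 49
```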
In the present embodiment, what counts as a non-standard character is chosen according to the meaning actually intended by the person who added the interference information. For the Arabic numeral "1", for example, the non-standard characters include but are not limited to "(1)", "(1)", "1.", "Ⅰ", "ⅰ", "1".
The conversion between letters and digits specifically comprises: checking the text information character by character and substituting between letters and digits according to the ASCII code table.
For example, the ASCII values of the letters to be converted are 105 (i) and 111 (o), and the ASCII values of the corresponding target digits are 49 (1) and 48 (0). The correspondence can therefore be determined by looking up the values in the ASCII code table: the letter "i" is replaced with the Arabic numeral "1", and the letter "o" is replaced with the Arabic numeral "0".
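The substitution in this example can be sketched as a two-entry translation table. Lowercasing the input first, so that "I" and "O" are also covered, is an assumption; the function name `letters_to_digits` is hypothetical. Note that the sketch replaces every "i" and "o", so it is meant for matching against a keyword library rather than for display.

```python
# ASCII 105 ('i') -> 49 ('1'), ASCII 111 ('o') -> 48 ('0'), as in the example above.
LETTER_TO_DIGIT = {105: 49, 111: 48}

def letters_to_digits(text):
    """Lowercase the text, then replace 'i' with '1' and 'o' with '0'."""
    return text.lower().translate(LETTER_TO_DIGIT)

print(letters_to_digits("QQ I98OO2i"))  # prints qq 1980021
```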
In the present embodiment, building the document object model from the html content in step S2 comprises the following steps:
S21, taking the html content as input and parsing it into a plurality of tokens. Specifically, the html content is traversed and parsed according to preset tokenization rules, after which a token generator recognizes the tokens and passes them to the DOM tree model builder.
The tokenization rules cover the start-position token of the html content, the end-position token of the html content, category attributes, attribute names, and attribute values.
The category attributes comprise a font size attribute, a font slant attribute, a horizontal font arrangement difference attribute, a vertical font arrangement difference attribute, an RGB difference attribute, a gray value difference attribute, a color saturation attribute, and a contrast difference attribute.
S22, building a DOM tree model from the tokens, the DOM tree model comprising several tag nodes, each of which contains the attribute information matching that node;
S23, traversing the DOM tree model and extracting the segments of the html content that match the attribute information of the tag nodes;
S24, extracting the text information from each segment and selecting the matching order according to the attribute information of the tag nodes, so as to form continuous text information.
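Steps S21 to S24 can be sketched with the standard `html.parser` module acting as the tokenizer and a small hand-written tree builder. This is a simplified illustration under stated assumptions, not the patented implementation: the names `Node`, `TreeBuilder`, `build_dom`, `segments`, and `continuous_text` are hypothetical, CSS cascading is ignored, and segments are joined in plain document order.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # elements without end tags

class Node:
    """A tag node holding its attribute information and its children."""
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []          # child Nodes and text strings, in document order

class TreeBuilder(HTMLParser):
    """S21/S22: HTMLParser emits the tokens; the handlers assemble the tree."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current         # climb to the nearest open element with this name
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())

def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def segments(node):
    """S23: depth-first traversal yielding (text, attributes of the enclosing node)."""
    for child in node.children:
        if isinstance(child, str):
            yield child, node.attrs
        else:
            yield from segments(child)

def continuous_text(html):
    """S24 (simplified): join the segments in document order."""
    return "".join(text for text, _ in segments(build_dom(html)))
```

Each yielded pair keeps the attribute dictionary next to the text, so a later step can decide per segment whether the font size, color, or table position marks it as interference.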
Preferably, in the present embodiment, step S3 comprises the following steps:
S31, converting the encoding of the processed text information with an encoding converter;
S32, selecting, according to coding ranges of set length, the header information, middle information, and tail information of the recombined text information;
S33, arranging the header information, middle information, and tail information in sequence into continuous recombined text information.
After step S33, the method further comprises performing one or more of the following operations on the recombined text information:
removing space characters;
removing carriage return characters;
removing line feed characters; wherein
removing space characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 32;
removing carriage return characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 13;
removing line feed characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 10.
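The three removal operations above can be sketched as a single translation table that deletes code values 32, 13, and 10. The function name `strip_layout_chars` is hypothetical.

```python
# Code values to delete: 32 (space), 13 (carriage return), 10 (line feed).
REMOVE = {32: None, 13: None, 10: None}

def strip_layout_chars(text, table=REMOVE):
    """Delete every character whose code value is a key of the table."""
    return text.translate(table)

print(strip_layout_chars("a b\r\nc"))  # prints abc
```

Passing a smaller table, for example `{32: None}`, performs only one of the operations, which matches the "one or more" wording of the text.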
Before step S1 is performed, the method further comprises a step of converting the html content to Unicode encoding with an encoder.
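A minimal sketch of such a conversion is shown below, assuming the mail body arrives as raw bytes together with an optional declared charset. The function name `to_unicode` and the fallback list are hypothetical choices for illustration; the source does not specify how the encoder picks the source encoding.

```python
def to_unicode(raw, declared_charset=None, fallbacks=("utf-8", "gb18030")):
    """Decode raw bytes to a Unicode string, trying the declared charset first."""
    candidates = ([declared_charset] if declared_charset else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue                      # wrong or unknown charset, try the next one
    return raw.decode("utf-8", errors="replace")

print(to_unicode("发票".encode("gbk"), "gbk"))  # prints 发票
```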
Next, with reference to Fig. 4 to Fig. 9, the specific implementation of interference removal is described in detail for three typical kinds of interference found in html content.
Referring to Fig. 4 and Fig. 5, this case shows how the text information in the html content is extracted when the tokenization rule is based on the font size attribute.
As shown in Fig. 4, two lines of text appear. To simplify the presentation, this description uses "A, B, C, D, E, F, G" to stand for Chinese characters and "XX" to stand for Arabic numerals. The first line reads "ABCDF hair" and the second line reads "ticket Q1980021XX". The font of "ABCD" and "EFG" is smaller, with attribute font-size:6px; the font of "hair" and "ticket Q1980021XX" is larger, with attribute font-size:20px.
As shown in Fig. 5, in the present embodiment parsing the HTML comprises three steps: tokenization, tree construction, and extraction of the required information. The code for this process is as follows: [code listing omitted in the source text]
Next, the DOM tree model is traversed and the segments of the html content that match the attribute information of the tag nodes are extracted. This yields four segments, "ABCD", "hair", "EFG", and "ticket Q1980021XX", as shown in Fig. 10.
Finally, step S24 is performed: the text information in each segment is extracted and the matching order is selected according to the attribute information of the tag nodes, so as to form continuous text information, as shown in Fig. 11.
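The font size case can be sketched as follows. The source does not state how the font size attribute decides which segments are interference, so treating text below a hypothetical minimum size (`min_px`) as suspected interference is an assumption made only for illustration; the names `font_size_px` and `split_by_font_size` are likewise hypothetical.

```python
import re

FONT_SIZE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)px", re.IGNORECASE)

def font_size_px(style, default=16.0):
    """Read the pixel font size from an inline style string, or return the default."""
    match = FONT_SIZE.search(style or "")
    return float(match.group(1)) if match else default

def split_by_font_size(segments, min_px=10.0):
    """Partition (text, style) pairs into (kept, suspected_interference)."""
    kept, suspect = [], []
    for text, style in segments:
        (kept if font_size_px(style) >= min_px else suspect).append(text)
    return kept, suspect

# The four segments and font sizes from the example above.
segs = [("ABCD", "font-size:6px"), ("hair", "font-size:20px"),
        ("EFG", "font-size:6px"), ("ticket Q1980021XX", "font-size:20px")]
kept, suspect = split_by_font_size(segs)
print(kept)     # ['hair', 'ticket Q1980021XX']
print(suspect)  # ['ABCD', 'EFG']
```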
Referring to Fig. 6 and Fig. 7, this case shows how the text information in the html content is extracted when the tokenization rule is based on the RGB difference attribute.
For this kind of interference, step S23 specifically comprises traversing the DOM tree model and extracting the text information. The final attributes of each tag node are generated according to the rules of Cascading Style Sheets (CSS), and the text information of the tag nodes is extracted from the html content in order, according to the value of the color attribute of the tag node (default RGB: 000000) and the value of the background-color attribute (default RGB: FFFFFF). A decision rule determines whether the text information is extracted. The Euclidean distance d12 between the foreground color and the background color is calculated, where (x1, y1, z1) is the RGB value of the foreground color and (x2, y2, z2) is the RGB value of the background color, and d12 is compared with a preset color threshold V. The distance is given by formula (1):
$$d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2} \qquad (1)$$
If d12 < V, the text information in the html content is extracted; otherwise it is not extracted.
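The distance calculation and the threshold test can be sketched as below. The comparison direction (extract when d12 < V) is taken from the text as written; the function names `parse_rgb`, `color_distance`, and `should_extract` and the six-digit hexadecimal input format are assumptions for illustration.

```python
import math

def parse_rgb(hex_color):
    """Convert a six-digit hexadecimal color such as 'FFFFFF' to an (r, g, b) tuple."""
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))

def color_distance(foreground, background):
    """Euclidean distance d12 between two RGB colors."""
    return math.dist(parse_rgb(foreground), parse_rgb(background))

def should_extract(foreground, background, threshold):
    """Decision rule as stated in the text: extract when d12 < V."""
    return color_distance(foreground, background) < threshold

print(round(color_distance("000000", "FFFFFF"), 2))  # prints 441.67
```

`math.dist` requires Python 3.8 or later; on older versions the square root of the summed squared differences can be computed by hand.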
Of course, a person skilled in the art can reasonably foresee that the text information may also be extracted on the basis of one or more other tokenization rules, such as the gray value difference attribute, the color saturation attribute, or the contrast difference attribute; these are not described further here.
The code for this process is as follows: [code listing omitted in the source text]
With the distance threshold set to V=100, the Euclidean distances are:
text "ABCD": d12=147;
text "hair": d12=0;
text "EFG": d12=147;
text "Q1980021XX": d12=0.
Next, the DOM tree model is traversed and the segments of the html content that match the attribute information of the tag nodes are extracted. This yields four segments, "ABCD", "hair", "EFG", and "ticket Q1980021XX", as shown in Fig. 10.
Finally, step S24 is performed: the text information in each segment is extracted and the matching order is selected according to the attribute information of the tag nodes, so as to form continuous text information, as shown in Fig. 11.
Referring to Fig. 8 and Fig. 9, this case shows how table interference identification is applied to the html content to obtain the text information.
As shown in Fig. 8, "ABCD", "hair", "EFG", and "ticket Q1980021XX" are each enclosed in a table. Within a table element, each "tr" element represents a row, and the rows of the table are numbered 1 to M. Each "td" element represents a column within a row, and the columns of each row are numbered 1 to N.
The code for this process is as follows: [code listing omitted in the source text]
Next, the DOM tree model is traversed and the segments of the html content that match the attribute information of the tag nodes are extracted. This yields four segments, "ABCD", "hair", "EFG", and "ticket Q1980021XX", as shown in Fig. 10.
Finally, step S24 is performed: the text information in each segment is extracted and the matching order is selected according to the attribute information of the tag nodes, so as to form continuous text information, as shown in Fig. 11.
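The table case described above can be sketched by recording a row number and a column number for every cell and then reading the cells back in a chosen order. Reading in row-major order is an assumption, as are the names `TableCells` and `table_text`; the sketch does not handle nested tables and numbers rows continuously across consecutive tables.

```python
from html.parser import HTMLParser

class TableCells(HTMLParser):
    """Records (row, column, text) for every td cell, numbering from 1."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.row = 0
        self.col = 0
        self.in_cell = False
        self.buffer = []
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":             # a new row: advance the row counter, reset columns
            self.row += 1
            self.col = 0
        elif tag == "td":           # a new cell in the current row
            self.col += 1
            self.in_cell = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.cells.append((self.row, self.col, "".join(self.buffer).strip()))
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.buffer.append(data)

def table_text(html):
    """Join the cell texts in row-major order (an assumed reading order)."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return "".join(text for _, _, text in sorted(parser.cells))
```

Keeping the (row, column) pair with each cell makes it possible to try other orders, for example column-major, when the sender has arranged the visible text vertically.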
Join shown in Figure 12 and Figure 13, there is disclosed to the text information progress comprising the meaningless character interference information of content The detailed process of recombining contents.
For the text information of the meaningless character interference of content, needs to recombinate html content, be conducive to analysis and place Reason, specific steps are as follows:
By Mail Contents code conversion, target character integrates as Unicode coding;
Extraction Unicode encodes section and is used as text for the character (II code value of ASC) of [48,57], [65,90], [97,122] The header information of text information after this recombination.
Extract text after the character (Chinese character) that Unicode coding section is [13312,40895] is recombinated as text The middle part information of information.
Extraction Unicode coding section is after the character (other language characters) of [40960,55215] is recombinated as text The trailer information of text information.
It is a space symbol that text information after recombination, which is replaced continuous space,;
Remove after recombination two Chinese characters (either English alphabet or Hiragana or Japanese pieces in text information Assumed name or German) between space;
Text information after recombination is deleted into carriage return and line feed symbol.
Wherein, the operation of removing space marks is specifically: the recombined text information is checked character by character, and every character whose ASCII code value is 32 is deleted;
The operation of removing carriage return marks is specifically: the recombined text information is checked character by character, and every character whose ASCII code value is 10 is deleted;
The operation of removing line feed marks is specifically: the recombined text information is checked character by character, and every character whose ASCII code value is 13 is deleted.
The header information, middle information, and tail information are then arranged in sequence into continuous recombined text.
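The recombination steps above can be sketched as follows, using the code intervals exactly as stated. The function names `recombine` and `clean_whitespace` are hypothetical. Two assumptions are made: characters outside the three intervals (spaces and punctuation, for example) are simply dropped by `recombine`, although the sample result in the description suggests the actual implementation keeps some of them; and `clean_whitespace` removes spaces only between Chinese characters, not between letters of other scripts. The description associates code 10 with carriage return and 13 with line feed; in standard ASCII, 10 is line feed and 13 is carriage return, but since both are deleted the result is the same.

```python
import re

# Code intervals as stated in the description (inclusive).
HEAD_RANGES = [(48, 57), (65, 90), (97, 122)]  # digits, A-Z, a-z
MIDDLE_RANGES = [(13312, 40895)]               # U+3400..U+9FBF, Chinese characters
TAIL_RANGES = [(40960, 55215)]                 # U+A000..U+D7AF, other scripts


def _in_ranges(ch, ranges):
    """Return True if the code point of ch lies in any of the inclusive ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def recombine(text):
    """Arrange the text as header + middle + tail, keeping the order within each part."""
    head = "".join(c for c in text if _in_ranges(c, HEAD_RANGES))
    middle = "".join(c for c in text if _in_ranges(c, MIDDLE_RANGES))
    tail = "".join(c for c in text if _in_ranges(c, TAIL_RANGES))
    # Assumption: characters outside all three intervals are dropped.
    return head + middle + tail


def clean_whitespace(text):
    """Collapse space runs, drop spaces between Chinese characters, delete codes 10 and 13."""
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r"(?<=[\u3400-\u9fbf]) (?=[\u3400-\u9fbf])", "", text)
    return text.replace("\n", "").replace("\r", "")
```

For example, `recombine("a1中b文가2")` yields `"a1b2中文가"`: the ASCII letters and digits come first, then the Chinese characters, then the character from the third interval.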
The final processing result of the html content in Figure 12 is:
"d!) ss as (v, v, v, v) b b z,.C c s c v c m c x Q Q:a e c n z n!2011 07 13 your the good existing invoices of our company can externally act on behalf of building advertisement and consult The Liu Sheng in need that please contact that pays the bill after sale etc. can be tested is ask if any bothering see forgiving date Shao ".
The present invention effectively separates the interference information contained in spam from the text information, accurately identifies the various kinds of interference information deliberately embedded in spam, and provides an accurate basis for subsequently determining whether a mail is spam.
This specification also discloses a method for judging spam, which includes the above method for removing mail interference information; the recombined text information is compared with a keyword library set in a database to determine whether the mail is spam.
Preferably, the database includes an Oracle database, a DB2 database, a PostgreSQL database, a Microsoft SQL Server database, a Microsoft Access database, or a MySQL database, and is further preferably a MySQL database. The above method can effectively improve the interception and filtering of spam.
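The keyword comparison can be sketched as follows. This is an illustration under stated assumptions: an in-memory SQLite database from the Python standard library stands in for the MySQL database preferred above, the table name `spam_keywords` and the functions `load_keywords` and `is_spam` are hypothetical, the two keywords are invented sample data, and a mail is assumed to be spam as soon as any keyword occurs in the recombined text.

```python
import sqlite3


def load_keywords(conn):
    """Read the keyword library from the (hypothetical) spam_keywords table."""
    rows = conn.execute("SELECT word FROM spam_keywords").fetchall()
    return [row[0] for row in rows]


def is_spam(text, keywords):
    """Assumption: one keyword hit in the recombined text is enough to flag the mail."""
    return any(word in text for word in keywords)


# In-memory SQLite database used as a stand-in for the keyword database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spam_keywords (word TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO spam_keywords (word) VALUES (?)",
    [("invoice",), ("发票",)],  # invented sample keywords
)
keywords = load_keywords(conn)
```

A production filter would more likely weight several hits or combine them with other signals; the single-hit rule is used here only to keep the comparison step visible.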
The detailed descriptions listed above are only specific illustrations of feasible embodiments of the invention and are not intended to limit its scope of protection; all equivalent embodiments or changes that do not depart from the technical spirit of the invention shall fall within its scope of protection.
It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from whichever point of view, the embodiments are to be considered illustrative and not restrictive, and the scope of the invention is defined by the appended claims rather than by the above description; all changes that fall within the meaning and scope of equivalents of the claims are therefore intended to be embraced by the invention. Any reference signs in the claims should not be construed as limiting the claims concerned.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted merely for clarity; those skilled in the art should consider the specification as a whole, and the technical solutions in the various embodiments may also be suitably combined to form other embodiments that those skilled in the art can understand.

Claims (15)

1. A method for removing mail interference information, characterized by comprising the following steps:
S1, obtaining the html content contained in a mail;
S2, building a document object model from the html content, and converting the html content into text information after performing at least one of the following interference identification processes on the document object model, the interference identification processes including: color block interference identification processing, font size interference identification processing, and table interference identification processing;
S3, recombining the content of the processed text information;
wherein building the document object model from the html content in step S2 comprises the following steps:
S21, taking the html content as input and parsing it into a plurality of tags;
S22, constructing a DOM tree model from the tags, the DOM tree model comprising several tag nodes, wherein each tag node includes attribute information matching that tag node;
S23, traversing the DOM tree model and extracting the segments of the html content that match the attribute information of the tag nodes;
S24, extracting the text information in each segment, and selecting a matching order of arrangement according to the attribute information of the tag nodes, so as to form continuous text information.
2. The method for removing mail interference information according to claim 1, characterized in that the interference identification processing in step S2 further includes: sensitive word interference identification processing.
3. The method for removing mail interference information according to claim 2, characterized in that converting the html content into text information in step S2 is specifically: deleting the tags from the html content so as to extract the text information in the html content.
4. The method for removing mail interference information according to claim 3, characterized in that the sensitive word interference identification processing includes: conversion between uppercase and lowercase letters, conversion between standard and non-standard characters, and conversion between letters and digits.
5. The method for removing mail interference information according to claim 4, characterized in that the conversion between uppercase and lowercase letters is specifically: checking the text information character by character, and when the ASCII code value of a character is in [65,90], increasing the ASCII code value of that character by 32.
6. The method for removing mail interference information according to claim 4, characterized in that the conversion between standard and non-standard characters is specifically: checking the text information character by character, and changing the data values of the non-standard characters contained in the text information to standard characters according to the Unicode code table.
7. The method for removing mail interference information according to claim 4, characterized in that the conversion between letters and digits is specifically: checking the text information character by character, and replacing digits with letters according to the ASCII code table.
8. The method for removing mail interference information according to claim 1, characterized in that step S21 is specifically: performing a traversal parsing operation on the html content according to preset tag rules, then recognizing the tags with a tag generator and passing them to a DOM tree model builder.
9. The method for removing mail interference information according to claim 8, characterized in that the tag rules include an html content start position tag, an html content end position tag, a category attribute, an attribute name, and an attribute value.
10. The method for removing mail interference information according to claim 9, characterized in that the category attribute includes a font size attribute, a font slant attribute, a font horizontal arrangement difference attribute, a font vertical arrangement difference attribute, an RGB difference attribute, a gray value difference attribute, a color saturation attribute, and a contrast difference attribute.
11. The method for removing mail interference information according to claim 1, characterized in that step S3 comprises the following steps:
S31, performing code conversion on the processed text information using a code converter;
S32, using set code intervals to determine the header information, middle information, and tail information of the recombined text information;
S33, arranging the header information, middle information, and tail information in sequence into continuous recombined text information.
12. The method for removing mail interference information according to claim 11, characterized in that after step S33 the method further includes performing one or more of the following operations on the recombined text information:
an operation of removing space marks;
an operation of removing carriage return marks;
an operation of removing line feed marks; wherein,
the operation of removing space marks is specifically: checking the recombined text information character by character, and deleting every character whose ASCII code value is 32;
the operation of removing carriage return marks is specifically: checking the recombined text information character by character, and deleting every character whose ASCII code value is 10;
the operation of removing line feed marks is specifically: checking the recombined text information character by character, and deleting every character whose ASCII code value is 13.
13. The method for removing mail interference information according to claim 1, characterized in that before step S1 is executed the method further includes a step of using an encoder to perform code conversion on the html content so as to convert it into Unicode encoding.
14. A method for judging spam, characterized by including the method for removing mail interference information according to any one of the preceding claims; and
comparing the recombined text information with a keyword library set in a database, and determining whether the mail is spam.
15. The method for judging spam according to claim 14, characterized in that the database includes an Oracle database, a DB2 database, a PostgreSQL database, a Microsoft SQL Server database, a Microsoft Access database, or a MySQL database.
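
The character conversions of claims 5 and 6 can be sketched as follows. The function names are hypothetical. The uppercase-to-lowercase step follows claim 5 literally. For claim 6, Unicode NFKC normalization from Python's standard `unicodedata` module is used as an assumed stand-in for the mapping from non-standard to standard characters (it maps, for example, full-width letters and digits to their ordinary forms); the claims do not name a specific normalization.

```python
import unicodedata


def to_lowercase_ascii(text):
    """Add 32 to every character whose code value is in [65, 90], as in claim 5."""
    return "".join(chr(ord(c) + 32) if 65 <= ord(c) <= 90 else c for c in text)


def normalize_characters(text):
    """Assumption: NFKC normalization stands in for the Unicode-table mapping of claim 6."""
    return unicodedata.normalize("NFKC", text)
```

Applying `normalize_characters` before `to_lowercase_ascii` lets full-width capitals be folded to ordinary lowercase letters in two passes.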
CN201610584290.8A 2016-07-22 2016-07-22 A kind of method and method for judging rubbish mail removing mail interference information Active CN106227808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610584290.8A CN106227808B (en) 2016-07-22 2016-07-22 A kind of method and method for judging rubbish mail removing mail interference information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610584290.8A CN106227808B (en) 2016-07-22 2016-07-22 A kind of method and method for judging rubbish mail removing mail interference information

Publications (2)

Publication Number Publication Date
CN106227808A CN106227808A (en) 2016-12-14
CN106227808B true CN106227808B (en) 2019-04-05

Family

ID=57532592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610584290.8A Active CN106227808B (en) 2016-07-22 2016-07-22 A kind of method and method for judging rubbish mail removing mail interference information

Country Status (1)

Country Link
CN (1) CN106227808B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106817297B (en) * 2017-01-19 2019-11-26 华云数据(厦门)网络有限公司 A method of spam is identified by html tag
CN107171937A (en) * 2017-05-11 2017-09-15 翼果(深圳)科技有限公司 The method and system of anti-rubbish mail
CN110287147B (en) * 2019-06-27 2022-08-19 北京奇艺世纪科技有限公司 Character string sorting method and device
CN110717028B (en) * 2019-10-18 2022-02-15 支付宝(杭州)信息技术有限公司 Method and system for eliminating interference problem pairs
CN111275051A (en) * 2020-02-28 2020-06-12 上海眼控科技股份有限公司 Character recognition method, character recognition device, computer equipment and computer-readable storage medium
CN111859871A (en) * 2020-07-22 2020-10-30 中国联合网络通信集团有限公司 Data processing method, device, equipment and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1852268A (en) * 2005-10-19 2006-10-25 华为技术有限公司 Junk-mail preventing method and system
CN101304589A (en) * 2008-04-14 2008-11-12 中国联合通信有限公司 Method and system for monitoring and filtering garbage short message transmitted by short message gateway
CN202121603U (en) * 2011-03-18 2012-01-18 蓝盾信息安全技术股份有限公司 Anti-spam system
CN102663291A (en) * 2012-03-23 2012-09-12 奇智软件(北京)有限公司 Information prompting method and information prompting device for e-mails
CN103684991A (en) * 2013-12-12 2014-03-26 深圳市彩讯科技有限公司 Junk mail filtering method based on mail features and content
JP2015026355A (en) * 2013-06-17 2015-02-05 富士ゼロックス株式会社 Information processing program and information processing device


Also Published As

Publication number Publication date
CN106227808A (en) 2016-12-14

Similar Documents

Publication Publication Date Title
CN106227808B (en) A kind of method and method for judging rubbish mail removing mail interference information
CN103336766B (en) Short text garbage identification and modeling method and device
CN104850574B (en) A kind of filtering sensitive words method of text-oriented information
JP2005526314A (en) Document structure identifier
CN104598577B (en) A kind of extracting method of Web page text
CN104679825B (en) Macroscopic abnormity of earthquake acquisition of information based on network text and screening technique
CN105677764A (en) Information extraction method and device
CN106446072B (en) The treating method and apparatus of web page contents
CN108737423A (en) Fishing website based on webpage key content similarity analysis finds method and system
CN110609998A (en) Data extraction method of electronic document information, electronic equipment and storage medium
CN104899219B (en) Pseudo- static state URL's screens out method, system and web page crawl method, system
CN109508458A (en) The recognition methods of legal entity and device
CN110191096A (en) A kind of term vector homepage invasion detection method based on semantic analysis
CN109657114B (en) Method for extracting webpage semi-structured data
CN107463571A (en) Web color method
CN108733813A (en) Information extracting method, system towards BBS forum Web pages contents and medium
CN109255117A (en) Chinese word cutting method and device
CN107145591A (en) A kind of effective content metadata extracting method of webpage based on title
CN108874870A (en) A kind of data pick-up method, equipment and computer can storage mediums
CN109446299A (en) The method and system of searching email content based on event recognition
CN104036190A (en) Method and device for detecting page tampering
CN106372232B (en) Information mining method and device based on artificial intelligence
CN104573097B (en) A method of extraction Web page text
CN103218420A (en) Method and device for extracting page titles
CN104036189A (en) Page distortion detecting method and black link database generating method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20170815

Address after: 361006 Chinese (Fujian) free trade zone of Xiamen area (Free Trade Zone) Xiangyu Road No. 97 Xiamen international shipping center D 8 storey building 05 unit X (the residence only as legal instruments of commercial subject address for service)

Applicant after: XIAMEN RONGNENG TECHNOLOGY Co.,Ltd.

Address before: 214000 North -705 room (Development Zone), 5 wisdom road, Huishan Economic Development Zone, Jiangsu, Wuxi

Applicant before: WUXI CLOUDSTONE TECHNOLOGY CO.,LTD.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190911

Address after: 361000 Xiamen China (Fujian) Free Trade Pilot Zone Xiamen (Free Trade Zone) Xiangyu Road 97, Xiamen International Shipping Center D, 8 floors, 05 units X (the residence is only served as the address of legal documents of commercial subjects)

Patentee after: Huayun data (Xiamen) network Co.,Ltd.

Address before: China (Fujian) Free Trade Pilot Zone Xiamen Section (Bonded Zone) No. 97 Xiangyu Road, Xiamen International Shipping Center, Building D, 8 floors, Unit 05 X (The residence is only served as the address of legal documents of commercial subjects)

Patentee before: XIAMEN RONGNENG TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20201211

Address after: 230000 6 / F, Zone C, G4 building, phase II, innovation industrial park, 2800 innovation Avenue, hi tech Zone, Hefei City, Anhui Province

Patentee after: Anhui AI Office Information Technology Co.,Ltd.

Address before: 361000 unit X, unit 05, 8 / F, building D, Xiamen international shipping center, No.97 Xiangyu Road, Xiamen area (Free Trade Zone), China (Fujian) pilot Free Trade Zone, Xiamen City, Fujian Province

Patentee before: Huayun data (Xiamen) network Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20230506

Address after: 230000, floor 6, Zone C, building G4, phase II, innovation industrial park, No. 2800, innovation Avenue, high tech Zone, Hefei, Anhui Province

Patentee after: Huayun Data Co.,Ltd.

Address before: 230000 6 / F, Zone C, G4 building, phase II, innovation industrial park, 2800 innovation Avenue, hi tech Zone, Hefei City, Anhui Province

Patentee before: Anhui AI Office Information Technology Co.,Ltd.