WO2011153894A1 - Method and system for distinguishing image spam mail - Google Patents

Method and system for distinguishing image spam mail

Publication number
WO2011153894A1
Authority
WO
WIPO (PCT)
Prior art keywords
picture
probability
spam
value
feature
Prior art date
Application number
PCT/CN2011/074146
Other languages
French (fr)
Chinese (zh)
Inventor
林延中
潘庆峰
陈磊华
Original Assignee
盈世信息科技(北京)有限公司
Priority date
Filing date
Publication date
Application filed by 盈世信息科技(北京)有限公司
Publication of WO2011153894A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/107 Computer-aided management of electronic mailing [e-mailing]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/07 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
    • H04L51/10 Multimedia information

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Computer Hardware Design (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a method and system for identifying image spam. The method includes: extracting feature values of an image in a mail according to the compression ratio distribution of the image; calculating, with a probability statistics formula, the probability that the image is spam from the probability with which each feature value appears in spam images; looking up a preset weight table according to the probability that the image is spam, the number of times the image has been resent, and the reputation value of the sender's IP address, and calculating a weight sum for the image; and judging whether the image is spam according to the weight sum. The invention identifies image spam efficiently and can handle images that are distorted or contain background noise.

Description

Method and System for Identifying Image Spam

Technical Field
The present invention relates to the field of communications technology, and in particular to a method and system for identifying image spam.

Background
With the rapid development of networks, communicating by e-mail has become commonplace. Images, documents, audio, video, and other computer files can all be sent to a recipient by e-mail, which is a great convenience in daily life. At the same time, spam has spread and seriously threatens the stability and security of users' mailboxes.
At present there are two main classes of methods for identifying image spam. The first uses an OCR (Optical Character Recognition) system to extract text from the image and segment the extracted text into words; for each word, the probability that a mail containing it is spam is then looked up in a sample library. Finally, these per-word probabilities are substituted into Bayes' formula to obtain the probability that the mail is spam. If that probability exceeds a predetermined threshold, the mail is marked as spam.
However, OCR must first decompose the image into pixels before it can process it, so it is very inefficient, especially for high-resolution images. Moreover, OCR can only extract printed typefaces; if the text in the image is slightly deformed or the background contains noise, the recognition rate drops sharply or recognition fails entirely. Existing spam image filtering that extracts text with OCR is therefore inefficient and cannot handle images that are distorted or whose background contains noise.

Summary of the Invention
Embodiments of the present invention provide a method and system for identifying image spam that identify image spam efficiently and can recognize images that are distorted or whose background contains noise.
An embodiment of the present invention provides a method for identifying image spam, including:
extracting feature values of an image in a mail according to the compression ratio distribution characteristics of the image;
calculating, with a probability statistics formula, the probability that the image is spam, based on the probability with which each feature value of the image appears in spam images;
computing a hash value of the image with a hash algorithm, and comparing it with the hash values of images in previously received mail to obtain the number of times the image has been sent repeatedly;
querying a reputation value database with the sending IP of the image to obtain the reputation value of the sending IP;
querying a preset weight table according to the probability that the image is spam, the number of repeated sends, and the reputation value of the sending IP, calculating the weight sum of the image, and determining from the weight sum whether the image is spam.
The reputation value database stores the reputation values of sending IPs; the reputation value of a sending IP is the proportion of normal mail among all the mail it has sent.
Correspondingly, an embodiment of the present invention further provides a mail system, including:
an image feature extraction module, configured to extract feature values of an image in a mail according to the compression ratio distribution characteristics of the image;
a spam probability acquisition module, configured to calculate, with a probability statistics formula, the probability that the image is spam, based on the probability with which each feature value of the image appears in spam images;
an image send count acquisition module, configured to compute a hash value of the image with a hash algorithm and compare it with the hash values of images in previously received mail to obtain the number of times the image has been sent repeatedly;
a reputation value acquisition module, configured to query a reputation value database with the sending IP of the mail to obtain the reputation value of the sending IP;
a spam determination module, configured to query a preset weight table according to the probability that the image is spam, the number of repeated sends, and the reputation value of the sending IP, calculate the weight sum of the image, and determine from the weight sum whether the image is spam.
The mail system further includes:
a sample database, configured to store all feature values of the spam image samples and the normal image samples, together with the probability with which each feature value appears in spam images;
a reputation value database, configured to store the reputation values of sending IPs, where the reputation value of a sending IP is the proportion of normal mail among all the mail it has sent;
a reputation value update module, configured to recalculate the reputation value of the image's sending IP after the spam determination module has determined that the image is spam, and to update the corresponding entry in the reputation value database.
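The reputation value and the weight-table decision described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the bucket boundaries, the weights, the threshold, and the neutral reputation assigned to an unseen IP are all hypothetical placeholders, since the text only states that a preset weight table is queried and the weights are summed.

```python
import bisect

# Hypothetical weight table: ascending cut points and one weight per bucket.
PROB_BOUNDS, PROB_WEIGHTS = [0.5, 0.9], [0, 3, 6]        # spam probability
RESEND_BOUNDS, RESEND_WEIGHTS = [10, 100], [0, 2, 4]     # repeated sends
REP_BOUNDS, REP_WEIGHTS = [0.3, 0.7], [4, 2, 0]          # sender reputation
SPAM_THRESHOLD = 8                                       # hypothetical cutoff


def reputation(normal_sent, total_sent):
    """Proportion of normal mail among all mail sent by an IP."""
    if total_sent == 0:
        return 0.5  # assumption: an unseen IP gets a neutral reputation
    return normal_sent / total_sent


def bucket_weight(value, bounds, weights):
    """Look up the weight of the bucket that `value` falls into."""
    return weights[bisect.bisect_right(bounds, value)]


def weight_sum(spam_prob, resend_count, rep):
    """Sum the three table weights for one image."""
    return (bucket_weight(spam_prob, PROB_BOUNDS, PROB_WEIGHTS)
            + bucket_weight(resend_count, RESEND_BOUNDS, RESEND_WEIGHTS)
            + bucket_weight(rep, REP_BOUNDS, REP_WEIGHTS))


def is_spam(spam_prob, resend_count, rep):
    """Compare the weight sum with the (hypothetical) threshold."""
    return weight_sum(spam_prob, resend_count, rep) >= SPAM_THRESHOLD
```

Keeping the table as plain lists makes it easy to retune the weights without touching the decision logic.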
Implementing the embodiments of the present invention has the following beneficial effects:
The method and system for identifying image spam provided by the embodiments extract feature values of an image in a mail based on the compression ratio distribution characteristics of the image and calculate, with a probability statistics formula, the probability that the image is spam. The weight sum of the image is then calculated from the weights of three quantities: the probability that the image is spam, the number of repeated sends, and the reputation value of the sending IP. Whether the image is spam is determined from this weight sum. Because the invention identifies image spam from the compression ratio distribution of the image, it is efficient and can recognize images that are distorted or whose background contains noise. In addition, the invention uses a hash algorithm to judge the similarity of images and counts how many times similar images have been sent repeatedly. This count indicates whether the sender's behavior resembles that of a spam sender, which improves the accuracy of image spam identification.

Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a first embodiment of the method for identifying image spam provided by the present invention;
FIG. 2 is a schematic diagram of the support vector machine algorithm provided by the present invention;
FIG. 3 is a schematic flowchart of a second embodiment of the method for identifying image spam provided by the present invention;
FIG. 4 is a schematic flowchart of a third embodiment of the method for identifying image spam provided by the present invention;
FIG. 5 is a schematic structural diagram of the mail system provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of the image feature extraction module provided by an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of the spam probability acquisition module provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of the image send count acquisition module provided by an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of the spam determination module provided by an embodiment of the present invention.

Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
The method and system provided by the embodiments collect normal image samples and spam image samples in advance, extract image features based on the compression ratio distribution characteristics of the images, and obtain feature sets for normal images and spam images. A Bayes classifier then learns from these feature sets to produce a result set giving, for the most representative features, the probability that they indicate a spam image rather than a normal image. The details are as follows:
I. Collecting normal image samples and spam image samples:
Image crawling software is used to fetch images in JPG or GIF format at random from the Internet, and these are added to the normal mail sample library.
A reporting system is deployed in the mail system to collect spam containing images that users report. Once manual review confirms that an image is spam, the image is added to the spam sample library.
II. Extracting all features contained in normal images and spam images:
The embodiments extract image features based on the compression ratio distribution characteristics of the image. The extraction method is described in detail below using images in JPG, GIF, and PNG format as examples.
(1) Calculating the compression ratio of a JPG image;
JPG compression divides the image into sub-blocks of 8*8 pixels, compresses each sub-block independently, and saves the compressed block data to the file. Therefore, when analyzing the features of a JPG image, it suffices to obtain the compressed size of each sub-block and divide it by (8*8); the integer result is the compression ratio of that sub-block, and the sub-block does not need to be decompressed.
Scanning the whole JPG file yields a compression ratio sequence C1, C2, C3, C4, ..., where C1 is the compression ratio of the 8*8-pixel sub-block in the top-left corner of the image, C2 is that of the next adjacent sub-block, and so on for C3, C4.
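As a rough sketch of the division step for JPG, the function below turns a list of compressed sub-block sizes into the sequence C1, C2, .... It assumes the sizes have already been read from the file and are measured in bits (the text does not state the unit), and it does not attempt to parse a real JPG stream; the function name is hypothetical.

```python
def jpg_block_ratios(block_sizes_bits, block_pixels=8 * 8):
    """Divide each compressed sub-block size by the 64 pixels it covers.

    `block_sizes_bits` is assumed to list the compressed size of each
    8*8 sub-block in scan order; integer division gives the rounded value.
    """
    return [size // block_pixels for size in block_sizes_bits]
```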
(2) Calculating the compression ratio of a GIF image;
GIF images are compressed with the well-known LZW algorithm. The main idea of LZW is to maintain a code table with 256 entries; if a sequence of pixels in a row of the image has already appeared in the code table, the index of the code table entry replaces that pixel sequence, which achieves compression.
When analyzing the features of a GIF image, it is only necessary to read these code table indices (each index has a fixed length of one byte) and look up the pixels that each index stands for in the code table. The compression ratio of that small piece of the image is then 1 / (number of pixels represented by the code table entry).
Scanning the whole GIF file yields a compression ratio sequence C1, C2, C3, C4, ..., where C1 is the compression ratio of a variable-length run of pixels in the top-left row of the image, and so on for C2, C3, C4.
(3) Calculating the compression ratio of a PNG image;
PNG images use the LZ77 compression algorithm, which resembles the LZW algorithm used by GIF. The only difference is that LZ77 has no fixed code table; instead it represents a pixel sequence by the relative position and length of a sequence that has already been encountered. For example, when compressing the pixel sequence abcdeabcde, the first abcde is not compressed, because no sequence repeating a, b, c, d or e has appeared before it; that is, the input abcde and its compressed form are identical. When the scan reaches the second a, however, a has appeared before; continuing the comparison shows that the whole sequence abcde has appeared before, so the second abcde can be represented by an offset and a length. In other words, the LZ77 algorithm used by PNG has no fixed code table; its code table is implicit in the data that precedes the current position. Note that LZ77 is well known in the art and the above is only a simplified explanation of the principle; in practice the offset, length, and related information in a PNG image are stored at bit granularity to save more space.
Therefore, when analyzing the compression ratio of a PNG image, the following can be read from the compressed PNG data stream: for data sequences that were not compressed, the compression ratio is 1; for compressed data sequences, an (offset, length) pair indicates that the data can be found at a specific position in the output already decompressed. If storing the (offset, length) pair takes N bytes and the "length" value in the pair is M, the compression ratio is N/M (N bytes are used to store M bytes of information).
Analyzing the compressed PNG data stream yields a compression ratio sequence C1, C2, C3, C4, ..., where C1 is the compression ratio of a variable-length run of pixels in the top-left row of the image, and so on for C2, C3, C4.
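The N/M rule for PNG can be illustrated on an already tokenized LZ77 stream. The token representation below (a single literal value, or an `(offset, length)` tuple) and the default of 2 bytes per pair are assumptions made for the sketch; a real PNG stream packs these fields at bit level inside DEFLATE and would need a proper parser.

```python
def lz77_ratios(tokens, pair_bytes=2):
    """Map LZ77 tokens to compression ratios.

    A literal (uncompressed) token has ratio 1. A back-reference
    (offset, length) stored in `pair_bytes` bytes has ratio N/M,
    where N = pair_bytes and M = length.
    """
    ratios = []
    for token in tokens:
        if isinstance(token, tuple):
            _offset, length = token
            ratios.append(pair_bytes / length)
        else:
            ratios.append(1.0)
    return ratios


# The example from the text: abcdeabcde becomes five literals plus one
# back-reference that copies 5 bytes from 5 positions earlier.
example_tokens = list("abcde") + [(5, 5)]
example_ratios = lz77_ratios(example_tokens)
```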
The embodiments do not need to decompress the image, which saves a large amount of computation and memory.
(4) Calculating the feature values of the image;
After the compression ratio sequence of a JPG, GIF, or PNG image has been obtained as in (1), (2), and (3) above, every 4 consecutive compression ratios are merged into one new compression ratio change element D (4 is an empirical value obtained by experiment; the invention is not limited to 4). D represents how the compression ratio changes across 4 adjacent sub-blocks of the image. For example, the compression ratio sequence C1, C2, C3, C4, C5, C6, C7, C8 is converted into the sequence D1, D2, where D1 = C1C2C3C4 and D2 = C5C6C7C8.
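Merging every 4 consecutive ratios into one change element D might look like the sketch below. Representing D as a tuple and dropping a trailing group with fewer than n ratios are both assumptions; the text does not say how an incomplete final group is handled.

```python
def group_ratios(ratios, n=4):
    """Merge every n consecutive compression ratios into one element D.

    Each D is returned as a tuple (C1, C2, ..., Cn). A trailing group
    shorter than n is dropped (assumption).
    """
    usable = len(ratios) - len(ratios) % n
    return [tuple(ratios[i:i + n]) for i in range(0, usable, n)]
```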
After the sequence of compression ratio change elements has been obtained, each element is combined with the relative position of that element in the image to form a feature value.
For example, the image is divided into 6 regions, each with a fixed position code, as follows:
top-left region: position code 1;
top region: position code 2;
top-right region: position code 3;
bottom-left region: position code 4;
bottom region: position code 5;
bottom-right region: position code 6.
If a pixel block lies in the top-left corner of the image and its compression ratio change element is D1, the feature value F1 including position information is 1D1; if a pixel block lies in the top-right corner and its compression ratio change element is D2, the feature value F2 including position information is 3D2. In the same way, each compression ratio change element is combined with the position code of its pixel block in the image (position code + compression ratio change element D) to obtain the feature sequence of the image: F1, F2, F3, F4, ....
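One way to attach the position code is sketched below, assuming the six regions form a grid of 3 columns by 2 rows of equal size and that a pixel block is located by its top-left pixel coordinate. Both the grid geometry and the tuple form of the feature are assumptions; the text only lists the six region names and codes.

```python
def region_code(x, y, width, height):
    """Return the position code 1..6 for pixel (x, y) in a width*height image.

    Codes 1, 2, 3 run left to right across the top half,
    codes 4, 5, 6 across the bottom half.
    """
    col = min(3 * x // width, 2)
    row = min(2 * y // height, 1)
    return row * 3 + col + 1


def make_feature(x, y, width, height, element):
    """Combine the position code with a change element D into a feature."""
    return (region_code(x, y, width, height), element)
```

With this layout, a block at the image origin carrying D1 yields the feature (1, D1), matching the "1D1" example in the text.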
Note that JPG, GIF, and PNG images are used above only as examples of extracting image features from compression ratio characteristics; the embodiments can also be applied to other image formats with similar compression ratio characteristics.
III. Building the sample database:
(1) Building the feature sets of normal images and spam images;
After all feature values contained in the normal images and spam images have been computed by the method of step II, all feature values of the normal images are saved in the normal image feature set HAM, and all feature values of the spam images are saved in the spam image feature set SPAM.
The normal image feature set HAM also records how many times each feature value appears across all normal image samples. For example, feature value F1 appears 10000 times across all normal image samples, feature value F2 appears 20000 times, and so on.
Likewise, the spam image feature set SPAM records how many times each feature value appears across all spam image samples. For example, feature value F1 appears 30000 times across all spam image samples, feature value F2 appears 40000 times, and so on.
A particular feature value Fn may appear in both the spam image samples and the normal image samples, and the two counts are generally not equal.
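The HAM and SPAM sets with their occurrence counts map naturally onto counters. The sketch below assumes each sample image has already been reduced to a list of hashable feature values; the feature strings in the example are made up.

```python
from collections import Counter


def count_features(samples):
    """Count how often each feature value occurs across all sample images.

    `samples` is an iterable of per-image feature lists.
    """
    counts = Counter()
    for features in samples:
        counts.update(features)
    return counts


# Hypothetical toy samples: two normal images and one spam image.
HAM = count_features([["1A", "2B"], ["1A"]])
SPAM = count_features([["1A", "3C", "3C"]])
```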
(2) Calculating the probability with which each feature value appears in spam images, and building the sample database;
For each feature value F, its occurrence counts in the normal image samples and in the spam image samples are read from the normal image feature set HAM and the spam image feature set SPAM, and a Bayes classifier is used to compute the probability Q with which F appears in spam images. For example, feature value F1 appears in spam images with probability Q1, F2 with probability Q2, and F3 with probability Q3. The correspondence between F and Q is saved as F1:Q1, F2:Q2, F3:Q3, ..., which forms the sample database.
The sample database built in this way stores all feature values of the spam image samples and normal image samples, together with the probability with which each feature value appears in spam images.
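The text does not give the exact estimator used to obtain Q from the two counts. The sketch below assumes a common choice from Bayesian spam filtering: normalize each count by the size of its sample set and take the spam share. The function names, the sample-size parameters, and the neutral value 0.5 for an unseen feature are all assumptions.

```python
def spam_probability(spam_count, ham_count, spam_total=1, ham_total=1):
    """Estimate Q, the probability that a feature indicates a spam image.

    `spam_total` and `ham_total` are the sizes of the two sample sets;
    with the defaults the raw counts are compared directly.
    """
    s = spam_count / spam_total
    h = ham_count / ham_total
    if s + h == 0:
        return 0.5  # assumption: a feature never seen is treated as neutral
    return s / (s + h)


def build_sample_db(spam_counts, ham_counts, spam_total=1, ham_total=1):
    """Build the F:Q mapping for every feature seen in either set."""
    features = set(spam_counts) | set(ham_counts)
    return {
        f: spam_probability(spam_counts.get(f, 0), ham_counts.get(f, 0),
                            spam_total, ham_total)
        for f in features
    }


# Counts taken from the example in the text (F1: 30000 vs 10000, F2: 40000 vs 20000).
sample_db = build_sample_db({"F1": 30000, "F2": 40000},
                            {"F1": 10000, "F2": 20000})
```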
Optionally, the embodiments may sort the sequence "F1:Q1, F2:Q2, F3:Q3, ..." by Q value from high to low and keep only the pairs F:Q whose Q value is greater than 80% (features that appear with high probability in spam samples) and the pairs whose Q value is less than 20% (features that appear with high probability in normal mail samples). These pairs are saved to the sample database as the basis for the final Bayesian evaluation. Experience shows that for pairs with Q in (20%, 80%), the feature F appears about equally often in normal images and spam images, so F contributes little to judging whether an image is spam. Such neutral pairs make up about 80% of all F:Q pairs, so removing them speeds up the evaluation of whether an image is spam.
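The pruning step can be sketched as a simple filter over an F:Q mapping. The dictionary representation and the function name are assumptions; the 20% and 80% cutoffs come from the text.

```python
def keep_decisive(sample_db, low=0.2, high=0.8):
    """Keep only features whose Q is above `high` or below `low`."""
    return {f: q for f, q in sample_db.items() if q > high or q < low}


pruned = keep_decisive({"a": 0.95, "b": 0.5, "c": 0.1, "d": 0.8})
```

Pairs whose Q lies exactly on a cutoff, like "d" above, are treated as neutral and dropped.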
The method and system for identifying image spam provided by the embodiments are described in detail below with reference to FIG. 1 to FIG. 9. The probability statistics formula of the embodiments includes Bayes' formula and/or a support vector machine (SVM) formula. The probability that an image is spam obtained with Bayes' formula is called the "first probability"; the probability obtained with the support vector machine formula is called the "second probability".
FIG. 1 is a schematic flowchart of the first embodiment of the method for identifying image spam provided by the present invention.
In the first embodiment, Bayes' formula is used to calculate the probability that the image is spam. The method includes the following steps:
S101: extracting feature values of an image in a mail according to the compression ratio distribution characteristics of the image.
In a specific implementation, after the mail is received, this step includes: scanning each image contained in the mail to obtain the compression ratio of every sub-block of the image; merging the compression ratios of every N consecutive sub-blocks into one new compression ratio change element; and combining each compression ratio change element with the position code of its location in the image to obtain the feature values of the image. N is a natural number greater than 1; preferably, N is 4.
Note that the embodiments can process images in JPG, GIF, PNG, or other formats. The method of extracting features from JPG, GIF, or PNG images based on the compression ratio distribution is the same as described above and is not repeated here.
S102: calculating, with a probability statistics formula, the probability that the image is spam, based on the probability with which each feature value of the image appears in spam images.
所述概率统计公式为贝叶斯公式, 贝叶斯分类器的分类原理是通过某对象 的先验概率, 利用贝叶斯公式计算出其后验概率, 即该对象属于某一类的概率, 选择具有最大后验概率的类作为该对象所属的类。  The probabilistic statistical formula is a Bayesian formula. The classification principle of the Bayesian classifier is to calculate the posterior probability by using the Bayesian formula, that is, the probability that the object belongs to a certain class. Select the class with the largest a posteriori probability as the class to which the object belongs.
The mathematical basis of the Bayes classifier is Bayes' formula, as follows:
Let $B_1, B_2, \ldots$ be a series of mutually exclusive events, let $P(B_i)$ denote the probability that event $B_i$ occurs, and suppose
$$\bigcup_{i} B_i = \Omega, \qquad P(B_i) > 0, \qquad i = 1, 2, \ldots$$
Then for any event $A$ with $P(A) > 0$,
$$P(B_i \mid A) = \frac{P(B_i)\,P(A \mid B_i)}{\sum_{k} P(B_k)\,P(A \mid B_k)}, \qquad i = 1, 2, \ldots$$
After all feature values of the picture have been obtained in step S101, step S102 queries the sample database with each feature value of the picture to obtain the probability that each feature value appears in spam pictures. These probabilities are then substituted into the Bayes formula above to obtain a first probability. The first probability is the probability that the picture is spam.
For example, after a picture mail is received whose status as spam is unknown, the method of step S101 is applied to obtain all feature values of the picture: F1, F2, F3, .... The sample database is then queried to obtain the probability that each feature value appears in spam pictures: F1:Q1, F2:Q2, F3:Q3, .... Supplying the feature value sequence "F1, F2, F3, ..." and the probability statistics "F1:Q1, F2:Q2, F3:Q3, ..." to Bayes' formula yields the probability that the unknown picture mail is spam.
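The text does not spell out how the per-feature probabilities Q1, Q2, ... are substituted into Bayes' formula. One common reading, assumed here, treats the feature values as conditionally independent and gives spam and normal mail equal priors, which reduces the formula to P = ∏Q / (∏Q + ∏(1 − Q)). The function name `combine_bayes` and the clamping constant `eps` are hypothetical.

```python
import math
from typing import Iterable


def combine_bayes(probs: Iterable[float], eps: float = 1e-6) -> float:
    """Combine per-feature spam probabilities Q1, Q2, ... into one probability.

    Assumes independent features and equal priors (an assumption of this
    sketch, not stated in the text): P = prod(Q) / (prod(Q) + prod(1 - Q)).
    """
    log_spam = 0.0    # log of prod(Q_i)
    log_normal = 0.0  # log of prod(1 - Q_i)
    for q in probs:
        q = min(max(q, eps), 1.0 - eps)  # clamp to avoid log(0)
        log_spam += math.log(q)
        log_normal += math.log(1.0 - q)
    # Numerically stable logistic of the log-odds, safe for many features.
    diff = log_spam - log_normal
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)
```

Working with sums of logarithms avoids the underflow that a direct product would hit once a picture contributes hundreds of feature values.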
S103: Apply a hash algorithm to calculate the hash value of the picture, and compare this hash value with the hash values of previously received mail pictures to obtain the number of times the picture has been sent repeatedly.
The Nilsimsa algorithm is a well-known hash algorithm with the following property: if the input changes only slightly, the output hash value changes only slightly or not at all. Because the output sequence has a fixed length regardless of the length of the input sequence, input sequences can be hashed with Nilsimsa and their similarity determined by comparing the similarity of the output sequences, which greatly speeds up the clustering of similar information.
Specifically, step S103 includes: applying the Nilsimsa algorithm to the feature values of the picture to obtain the hash value of the picture; comparing the hash value of the picture with the hash values of previously received mail pictures to obtain the similarity between the picture and those pictures; and deriving, from this similarity, the number of times the picture has been sent repeatedly. An example follows.
Suppose all feature values F1, F2, F3, ... of the picture were obtained in step S101. In step S103 these feature values are processed: the input sequence is "F1, F2, F3, ..." and the output sequence is a fixed-length binary sequence "O1, O2, O3, ...". The output sequence is generally 64 bytes long, and each O takes the value 0 or 1. The binary sequence "O1, O2, O3, ..." is the hash value of the picture. The hash value of the picture is then compared with the hash values of previously received mail pictures, and the number of times similar pictures have been sent repeatedly is determined from the similarity between pictures.
The Nilsimsa algorithm has the following advantage: if the input sequence "F1, F2, F3, ..." is modified only slightly (for example, by inserting several short sequences or changing the content of a short segment), the output binary sequence is very stable and changes little or not at all. Comparing the similarity of two output sequences therefore reveals the similarity of the two input sequences, from which the number of times similar pictures have been sent repeatedly can be determined.
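The comparison half of step S103 can be sketched as follows. The sketch assumes the Nilsimsa digests have already been computed by some implementation (none ships with the Python standard library) and are available as equal-length `bytes` objects. Similarity is taken to be the fraction of matching bits, and the `threshold` of 0.9 is a hypothetical cut-off; the text gives neither a similarity measure nor a threshold.

```python
from typing import Iterable


def bit_similarity(a: bytes, b: bytes) -> float:
    """Fraction of bit positions at which two equal-length digests agree."""
    if len(a) != len(b) or not a:
        raise ValueError("digests must be non-empty and of equal length")
    # XOR leaves a 1 bit wherever the two digests differ.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (8 * len(a))


def count_repeats(digest: bytes, seen: Iterable[bytes], threshold: float = 0.9) -> int:
    """Count previously received pictures whose digest is similar enough.

    threshold is a hypothetical cut-off; the text does not specify one.
    """
    return sum(1 for old in seen if bit_similarity(digest, old) >= threshold)
```

Because a locality-sensitive digest changes little when the input changes little, a bitwise comparison of this kind can stand in for a direct comparison of the pictures.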
S104: Query the reputation value database with the sending IP of the picture to obtain the reputation value of the sending IP.
This embodiment of the invention provides a reputation value database for storing the reputation values of sending IPs. A reputation value is computed as follows: the sending behavior of the IP over a past period is recorded, and the proportion of normal mail among the mail it sent is taken as the reputation value of that IP. For example, if a sending IP sent 100 mails over the past period and 10 of them were determined to be spam, the calculation (100 - 10)/100 = 90% gives a reputation value of 90 for that IP.
Therefore, in step S104, querying the reputation value database with the sending IP of the picture mail yields the reputation value of that sending IP.
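The reputation calculation can be written as a short function. This is a sketch only: the function name is hypothetical, and raising an error for an IP with no sending history is an assumption, since the text does not say how such an IP is treated.

```python
def reputation(total_sent: int, spam_count: int) -> float:
    """Share of normal mail sent by an IP, expressed on a 0-100 scale."""
    if total_sent <= 0:
        # Assumption: an IP without history has no defined reputation value.
        raise ValueError("no sending history for this IP")
    if not 0 <= spam_count <= total_sent:
        raise ValueError("spam_count must be between 0 and total_sent")
    return 100.0 * (total_sent - spam_count) / total_sent
```

With the figures from the example, `reputation(100, 10)` evaluates to 90.0.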
S105: Query the preset weight lists with the probability that the picture is spam, the number of times it has been sent repeatedly and the reputation value of the sending IP; calculate the weight sum of the picture; and determine from the weight sum whether the picture is spam.
This embodiment of the invention preconfigures three weight lists, which record the weights corresponding to the probability that the picture is spam, the number of repeated transmissions and the reputation value of the sending IP, respectively.
(1) This embodiment divides "probability that the picture is spam" into 10 segments according to the range in which the probability falls, and assigns a weight to each segment. The weight list for "probability that the picture is spam" is as follows:
Figure imgf000012_0001
(2) This embodiment divides "number of repeated picture transmissions" into 6 segments according to the range in which the number of repeated transmissions of the picture mail falls, and assigns a weight to each segment. The weight list for "number of repeated picture transmissions" is as follows:
Figure imgf000012_0002
(3) This embodiment divides "sending IP reputation value" into 10 segments according to the range of the reputation value of the sending IP, and assigns a weight to each segment. The weight list for "sending IP reputation value" is as follows:
Sending IP reputation value    Reputation value range    Weight (real number)
REPUTATION_0_10                [0, 10]                   REPUTATION_0_10_W
REPUTATION_10_20               [10, 20]                  REPUTATION_10_20_W
REPUTATION_20_30               [20, 30]                  REPUTATION_20_30_W
REPUTATION_30_40               [30, 40]                  REPUTATION_30_40_W
REPUTATION_40_50               [40, 50]                  REPUTATION_40_50_W
REPUTATION_50_60               [50, 60]                  REPUTATION_50_60_W
REPUTATION_60_70               [60, 70]                  REPUTATION_60_70_W
REPUTATION_70_80               [70, 80]                  REPUTATION_70_80_W
REPUTATION_80_90               [80, 90]                  REPUTATION_80_90_W
REPUTATION_90_100              [90, 100]                 REPUTATION_90_100_W
Preferably, the weights in the three lists above are obtained by using a genetic algorithm to learn from known samples.
It should be noted that this embodiment segments the probability that the picture is spam, the number of repeated transmissions and the reputation value of the sending IP in order to reduce the amount of computation in later processing. The numbers of segments (10 for "probability that the picture is spam", 6 for "number of repeated picture transmissions" and 10 for "sending IP reputation value") are only empirical figures, and the invention is not limited to them.
Specifically, after steps S102, S103 and S104 have produced the probability that the picture is spam, the number of repeated transmissions of the picture and the reputation value of the sending IP, step S105 proceeds as follows: query the preset weight lists with these three values to obtain the weight of each; add the three weights to obtain the weight sum of the picture; and determine whether the weight sum is greater than a predetermined threshold. If it is, the picture is determined to be spam; if not, the picture is determined to be normal mail. An example follows.
Suppose that, for a mail containing a picture, steps S101 to S104 give a spam probability of 95% for the picture, a repeat count of 2 and a sending-IP reputation value of 78. The weight lists are queried for BAYES_90 (assumed weight 0.5), for the entry of the repeat-count list that covers a count of 2 (assumed weight 0.1) and for REPUTATION_70_80 (assumed weight 0.3). The weight sum of the mail picture is 0.5 + 0.1 + 0.3 = 0.9, which is less than the threshold of 1.0, so the mail is determined to be normal mail.
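The lookup and threshold test of step S105 can be sketched as below. The segment tables in the usage lines are hypothetical one-row excerpts that reuse the weights assumed in the worked example; in particular the bounds of the repeat-count segment are invented for illustration, because that list is only available as a figure.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float, float]  # (lower bound, upper bound, weight)


def lookup_weight(value: float, table: Sequence[Segment]) -> float:
    """Return the weight of the first segment whose range contains value."""
    for low, high, weight in table:
        if low <= value <= high:
            return weight
    raise ValueError(f"value {value} is outside every segment")


def is_spam(signals: Sequence[Tuple[float, Sequence[Segment]]], threshold: float = 1.0) -> bool:
    """Sum the weight of every signal and compare the sum with the threshold."""
    total = sum(lookup_weight(value, table) for value, table in signals)
    return total > threshold


# Hypothetical excerpts of the three weight lists (weights from the example).
bayes_table = [(90, 100, 0.5)]      # spam probability, in percent
repeat_table = [(0, 5, 0.1)]        # repeat count; bounds are an assumption
reputation_table = [(70, 80, 0.3)]  # sending IP reputation value

verdict = is_spam([(95, bayes_table), (2, repeat_table), (78, reputation_table)])
print("spam" if verdict else "normal mail")
```

Because `is_spam` accepts any number of (value, table) pairs, the same function also covers a variant in which more than three weights are summed.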
Further, the method for identifying picture spam provided by this embodiment also includes: after the picture in a mail is determined to be spam, recalculating the reputation value of the sending IP of the picture and updating the corresponding reputation value in the reputation value database.
In addition, this embodiment may use the SVM (Support Vector Machine) algorithm to calculate the probability that a picture is a spam picture. The SVM algorithm can be explained intuitively with Figure 2, as follows:
Define a function f(x, y) = a1*x + a2*y + b, where x is one intrinsic feature of the mail, y is another intrinsic feature of the mail that is independent of x, and a1, a2 and b are constants; a1 and a2 control the slope of the plane in Figure 2 that separates the two classes of points. If the crosses in Figure 2 represent spam and the dots represent normal mail, then whether a mail is spam depends only on x and y: as long as f(x, y) is greater than a certain value, the mail can be considered spam.
In practice, classifying samples usually requires extracting several hundred to a thousand features to achieve good results. A model with that many dimensions cannot be shown in a three-dimensional plot. It can be derived, however, that the final SVM formula is a linear polynomial: f(x, y, z, ...) = a1*x + a2*y + a3*z + ... + b. Substituting the values of the features x, y, z, ... of an unknown sample into the SVM formula and checking whether the result is greater than 0 determines whether the sample is spam.
One key to the SVM model is learning the parameters a1, a2, a3, ..., b of the above formula from known samples. In practice, once enough samples are provided (about one thousand each of normal mail and spam is sufficient), the parameters can be obtained by a suitable mathematical method, which gives the SVM formula. It should be noted that the prior art already offers many mature mathematical methods for obtaining these parameters, for example fitting based on key points found along the boundary, so the details are not repeated here.
Another key to the SVM model is whether the extracted "features" describe the problem well, that is, whether the "feature values" represented by x, y, z and so on separate the two classes of samples well. The solution in this embodiment is to use the probability that each picture feature item appears in spam as an input feature of the SVM. During learning, after the probability of each feature value appearing in spam has been counted, a sequence of feature value probabilities is constructed in the order in which the feature values appear, and the learning program produces the SVM formula (that is, the parameters a1, a2, a3, ..., b).
For example, suppose a picture has 4 feature values T1, T2, T3, T4 (in practice there may be many more), arranged in the order in which they were decomposed from the picture file, and statistics show that their probabilities of appearing in spam are G1, G2, G3, G4 respectively. G1, G2, G3, G4 are fed to the SVM learning program as a vector, and learning from a batch of normal mail and spam yields an SVM formula that fits the learning samples.
When an unknown sample is evaluated, the probabilities G1, G2, G3, G4 of the feature values T1, T2, T3, T4 are likewise arranged in the order in which the feature values were decomposed from the picture file and substituted into the SVM formula, which gives the probability that this sequence is spam.
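Evaluating the learned linear formula on a probability vector can be sketched as follows. The coefficients in the usage lines are hypothetical stand-ins for parameters that a training procedure would produce. How the score is converted into a probability is not described in the text, so the sketch stops at the score and the sign test.

```python
from typing import Sequence


def svm_score(g: Sequence[float], a: Sequence[float], b: float) -> float:
    """Evaluate f(g1, g2, ...) = a1*g1 + a2*g2 + ... + b."""
    if len(g) != len(a):
        raise ValueError("feature vector and coefficient vector differ in length")
    return sum(ai * gi for ai, gi in zip(a, g)) + b


def svm_is_spam(g: Sequence[float], a: Sequence[float], b: float) -> bool:
    """A sample is classified as spam when the score is greater than 0."""
    return svm_score(g, a, b) > 0


# Hypothetical learned parameters for a 4-feature model.
a = [1.0, 1.0, 1.0, 1.0]
b = -2.0

likely_spam = svm_is_spam([0.9, 0.8, 0.7, 0.9], a, b)    # G1..G4 mostly high
likely_normal = svm_is_spam([0.1, 0.2, 0.1, 0.3], a, b)  # G1..G4 mostly low
```

The order of the entries in `g` has to match the order used during learning, which is why the text fixes the order in which feature values are decomposed from the picture file.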
In short, the Bayes and SVM algorithms compare as follows. When learning from known normal and spam samples, the Bayes method produces the probability that each feature item indicates spam, while the SVM method produces those probabilities together with the parameters of the SVM formula. When judging an unknown sample, the Bayes method takes the feature items of the unknown sample as input, looks up the probability that each feature item indicates spam, and then uses Bayes' formula to calculate the probability that the mail is spam. The SVM method takes the same feature items as input and looks up the same probabilities, but then uses the SVM formula produced by the learning process to calculate the probability that the mail is spam.
Figure 3 is a flowchart of the second embodiment of the method for identifying picture spam provided by the invention. In the second embodiment, the support vector machine (SVM) formula is applied to calculate the probability that the picture is spam. The method includes the following steps:
S201: Extract the feature values of the picture according to the compression-rate distribution characteristics of the picture in the mail.
Step S201 is identical to step S101 of the first embodiment and is not described again here.
S202: According to the probability that each feature value of the picture appears in spam pictures, apply the support vector machine formula to calculate the probability that the picture is spam.
Step S202 specifically includes: querying the sample database with the feature values of the picture to obtain the probability that each feature value appears in spam pictures; constructing a feature vector from these probabilities and substituting it into the support vector machine formula to obtain a second probability. The second probability is the probability that the picture is spam.
The sample database stores all feature values of the spam picture samples and normal picture samples, together with the probability that each feature value appears in spam pictures.
S203: Apply a hash algorithm to calculate the hash value of the picture, and compare this hash value with the hash values of previously received mail pictures to obtain the number of times the picture has been sent repeatedly.
S204: Query the reputation value database with the sending IP of the picture to obtain the reputation value of the sending IP.
S205: Query the preset weight lists with the probability that the picture is spam, the number of times it has been sent repeatedly and the reputation value of the sending IP; calculate the weight sum of the picture; and determine from the weight sum whether the picture is spam.
Steps S203 to S205 are identical to steps S103 to S105 of the first embodiment and are not described again here.
Figure 4 is a flowchart of the third embodiment of the method for identifying picture spam provided by the invention. In the third embodiment, the Bayes formula and the SVM formula are both applied to calculate the probability that the picture is spam. The method includes the following steps:
S301: Extract the feature values of the picture according to the compression-rate distribution characteristics of the picture in the mail.
Step S301 is identical to step S101 of the first embodiment and is not described again here.
S302: Query the sample database with the feature values of the picture to obtain the probability that each feature value appears in spam pictures.
The sample database stores all feature values of the spam picture samples and normal picture samples, together with the probability that each feature value appears in spam pictures.
S303: Substitute the probability that each feature value of the picture appears in spam pictures into Bayes' formula to obtain a first probability.
Step S303 is identical to step S102 of the first embodiment and is not described again here.
S304: Construct a feature vector from the probabilities that the feature values of the picture appear in spam pictures, and substitute it into the support vector machine formula to obtain a second probability.
The probability that the picture is spam includes the first probability and the second probability.
S305: Apply a hash algorithm to calculate the hash value of the picture, and compare this hash value with the hash values of previously received mail pictures to obtain the number of times the picture has been sent repeatedly.
Step S305 is identical to step S103 of the first embodiment and is not described again here.
S306: Query the reputation value database with the sending IP of the picture to obtain the reputation value of the sending IP.
Step S306 is identical to step S104 of the first embodiment and is not described again here.
S307: Query the preset weight lists with the probability that the picture is spam, the number of times it has been sent repeatedly and the reputation value of the sending IP; calculate the weight sum of the picture; and determine from the weight sum whether the picture is spam.
Step S307 is essentially the same as step S105 of the first embodiment. The difference is that the probability that the picture is spam includes the first probability and the second probability, each of which has its own weight list. Querying the preset weight lists therefore yields four weights: one for the first probability, one for the second probability, one for the number of repeated transmissions and one for the reputation value of the sending IP. The four weights are added to obtain the weight sum of the picture, and the weight sum is used to determine whether the picture is spam.
The method for identifying picture spam provided by the embodiments of the invention extracts the feature values of a picture in a mail based on the compression-rate distribution characteristics of the picture and applies a probability-statistics formula to calculate the probability that the picture is spam. It then calculates the weight sum of the picture from the weights of three values (the probability that the picture is spam, the number of repeated transmissions and the reputation value of the sending IP) and determines from the weight sum whether the picture is spam. Identifying picture spam from the compression-rate distribution of the picture is efficient and can recognize pictures that are distorted or whose background contains noise. In addition, the invention applies a hash algorithm to judge picture similarity and counts how many times similar pictures have been sent repeatedly. This feature indicates well whether the sender's behavior resembles that of a spam sender, which improves the accuracy of identifying picture spam.
Correspondingly, an embodiment of the invention further provides a mail system that can implement all steps of the method for identifying picture spam described in the above embodiments.
Figure 5 is a schematic structural diagram of the mail system provided by an embodiment of the invention. The mail system includes:
a picture feature extraction module 1, configured to extract the feature values of the picture according to the compression-rate distribution characteristics of the picture in the mail;
a spam probability acquisition module 2, configured to apply a probability-statistics formula to calculate the probability that the picture is spam according to the probability that each feature value of the picture appears in spam pictures;
a picture transmission count acquisition module 3, configured to apply a hash algorithm to calculate the hash value of the picture and compare this hash value with the hash values of previously received mail pictures to obtain the number of times the picture has been sent repeatedly;
a reputation value acquisition module 4, configured to query the reputation value database with the sending IP of the mail to obtain the reputation value of the sending IP;
a spam determination module 5, configured to query the preset weight lists with the probability that the picture is spam, the number of repeated transmissions and the reputation value of the sending IP, calculate the weight sum of the picture, and determine from the weight sum whether the picture is spam.
As shown in Figure 6, the picture feature extraction module 1 specifically includes:
a picture scanning unit 11, configured to scan the picture in the mail to obtain the compression rate of each sub-block of the picture;
a picture feature generation unit 12, configured to merge the compression rates of every N consecutive sub-blocks into one new compression-rate change element and combine each compression-rate change element with the code of its position in the picture to obtain the feature values of the picture, where N is a natural number greater than 1.
As shown in Figure 7, the spam probability acquisition module 2 specifically includes:
a probability query unit 21, configured to query the sample database with the feature values of the picture to obtain the probability that each feature value appears in spam pictures;
a Bayes calculation unit 22, configured to substitute the probability that each feature value of the picture appears in spam pictures into Bayes' formula to obtain a first probability;
a support vector machine calculation unit 23, configured to construct a feature vector from the probabilities that the feature values of the picture appear in spam pictures and substitute it into the support vector machine formula to obtain a second probability.
The probability that the picture is spam is the first probability and/or the second probability.
As shown in Figure 8, the picture transmission count acquisition module 3 specifically includes:
a hash value calculation unit 31, configured to apply a hash algorithm to the feature values of the picture to obtain the hash value of the picture;
a similarity determination unit 32, configured to compare the hash value of the picture with the hash values of previously received mail pictures to obtain the similarity between the picture and those pictures;
a repeat count determination unit 33, configured to derive the number of times the picture has been sent repeatedly from the similarity between the picture and previously received mail pictures.
As shown in FIG. 9, the spam determination module 5 specifically includes:
a weight query unit 51, configured to look up a preset weight value list according to the probability that the picture is spam, the number of times it has been repeatedly sent, and the reputation value of the sending IP, and to obtain a weight value for each of the three;
a mail identification unit 52, configured to add the three weight values to obtain the weight sum of the picture, and to determine whether the weight sum of the picture is greater than a predetermined threshold; if so, the picture is determined to be spam; if not, the picture is determined to be normal mail.
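A minimal sketch of the weight lookup and threshold test follows. The preset weight value list is not reproduced in this passage, so every bucket boundary, weight, and the threshold of 6 below are hypothetical values chosen only to make the example run.

```python
from bisect import bisect_right

# Hypothetical weight tables: (bounds, weights). weights[i] applies when
# bounds[i-1] <= value < bounds[i]; the last weight covers everything above.
SPAM_PROB_TABLE = ([0.5, 0.8, 0.95], [0, 1, 3, 5])
REPEAT_TABLE = ([5, 50, 500], [0, 1, 2, 4])
REPUTATION_TABLE = ([0.2, 0.5, 0.9], [4, 2, 1, 0])  # low reputation, high weight

def lookup(table, value):
    """Return the preset weight for the bucket that contains `value`."""
    bounds, weights = table
    return weights[bisect_right(bounds, value)]

def is_spam(spam_prob, repeats, reputation, threshold=6):
    """Sum the three weights and compare the sum against the threshold."""
    total = (lookup(SPAM_PROB_TABLE, spam_prob)
             + lookup(REPEAT_TABLE, repeats)
             + lookup(REPUTATION_TABLE, reputation))
    return total > threshold, total
```

Because the comparison is strictly "greater than", a weight sum equal to the threshold is treated as normal mail, matching the wording of the mail identification unit.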
Further, as shown in FIG. 5, the mail system also includes:
a sample database 6, configured to store all feature values of the spam picture samples and normal picture samples, together with the probability of each feature value appearing in spam pictures;
a reputation value database 7, configured to store the reputation values of sending IPs, where the reputation value is the proportion of normal mail among all mail sent by the sending IP;
a reputation value update module 8, configured to recalculate the reputation value of the sending IP of the picture after the spam determination module determines that the picture is spam, and to update the corresponding reputation value in the reputation value database.
It should be noted that the mail system provided by this embodiment of the present invention identifies picture spam by the same process as the embodiments described above, so the details are not repeated here.
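The reputation value defined above (normal mail divided by all mail sent from an IP) can be sketched as a small in-memory store. The class names, the neutral default of 0.5 for unknown senders, and the choice to record every verdict rather than only spam verdicts are assumptions; the text only requires recalculation after a spam determination.

```python
from dataclasses import dataclass

@dataclass
class IpStats:
    normal: int = 0  # mails from this IP judged normal
    total: int = 0   # all mails seen from this IP

    @property
    def reputation(self):
        # Unknown senders get a neutral value; this default is an assumption.
        return self.normal / self.total if self.total else 0.5

class ReputationDb:
    """Hypothetical stand-in for the reputation value database."""

    def __init__(self):
        self._stats = {}

    def get(self, ip):
        return self._stats.get(ip, IpStats()).reputation

    def record(self, ip, is_spam):
        """Count one verdict for `ip` and return the recalculated reputation."""
        stats = self._stats.setdefault(ip, IpStats())
        stats.total += 1
        if not is_spam:
            stats.normal += 1
        return stats.reputation
```

For example, an address that has sent three normal mails and one spam mail ends up with a reputation value of 0.75.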
The mail system provided by this embodiment of the present invention extracts the feature values of a picture in a mail based on the compression-rate distribution characteristics of the picture, and applies a probability statistical formula to calculate the probability that the picture is spam. It then calculates a weight sum for the picture from the weight values of three quantities (the probability that the picture is spam, the number of times it has been repeatedly sent, and the reputation value of the sending IP) and determines from the weight sum whether the picture is spam. Because the invention identifies picture spam from the compression-rate distribution of the picture, it is efficient and can recognize pictures that are distorted or whose background contains noise. In addition, the invention applies a hash algorithm to judge picture similarity and counts how many times similar pictures have been repeatedly sent; this count is a good indicator of whether the sender's behavior resembles that of a spam sender, which improves the accuracy of identifying picture spam.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above are preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make various improvements and refinements without departing from the principles of the invention, and such improvements and refinements are also regarded as falling within the scope of protection of the invention.

Claims

1. A method for identifying picture spam, characterized by comprising:
extracting feature values of a picture in a mail according to the compression-rate distribution characteristics of the picture;
applying a probability statistical formula to calculate the probability that the picture is spam, according to the probability of each feature value of the picture appearing in spam pictures;
applying a hash algorithm to calculate a hash value of the picture, and comparing the hash value with the hash values of pictures in previously received mail to obtain the number of times the picture has been repeatedly sent;
querying a reputation value database according to the sending IP of the picture to obtain the reputation value of the sending IP;
looking up a preset weight value list according to the probability that the picture is spam, the number of times it has been repeatedly sent, and the reputation value of the sending IP, calculating a weight sum of the picture, and determining from the weight sum whether the picture is spam.
2. The method for identifying picture spam according to claim 1, characterized in that extracting the feature values of the picture according to the compression-rate distribution characteristics of the picture in the mail specifically comprises:
scanning the picture in the mail to obtain the compression rate of each sub-block of the picture;
merging the compression rates of every N consecutive sub-blocks into one new compression-rate change element, where N is a natural number greater than 1;
combining each compression-rate change element with the code of its position in the picture to obtain the feature values of the picture.
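The three steps of claim 2 can be sketched as below. The claim does not say how sub-blocks are delimited, how the compression rate is measured, or how position is encoded, so the fixed-size byte blocks, the use of `zlib`, the quantisation into 8 levels, the non-overlapping groups, and the `position:element` string format are all assumptions made for illustration.

```python
import zlib

def block_ratios(data, block_size=256, levels=8):
    """Quantised compression ratio of each fixed-size sub-block of `data`."""
    ratios = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        ratio = len(zlib.compress(block)) / len(block)
        # Quantise to a small integer; ratios near or above 1.0 share the top bucket.
        ratios.append(min(int(ratio * levels), levels - 1))
    return ratios

def extract_features(data, n=3, block_size=256):
    """Merge every n consecutive ratios and tag each group with its position."""
    ratios = block_ratios(data, block_size)
    features = []
    for pos in range(0, len(ratios) - n + 1, n):
        element = "-".join(str(r) for r in ratios[pos:pos + n])
        features.append(f"{pos // n}:{element}")
    return features
```

Each resulting string is one feature value that could then be looked up in a sample database; trailing sub-blocks that do not fill a complete group of `n` are dropped in this sketch.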
3. The method for identifying picture spam according to claim 2, characterized in that the probability statistical formula is the Bayesian formula;
and applying the probability statistical formula to calculate the probability that the picture is spam, according to the probability of each feature value of the picture appearing in spam pictures, specifically comprises:
querying a sample database according to the feature values of the picture to obtain the probability of each feature value of the picture appearing in spam pictures, wherein the sample database stores all feature values of the spam picture samples and normal picture samples, together with the probability of each feature value appearing in spam pictures;
substituting the probability of each feature value of the picture appearing in spam pictures into the Bayesian formula to obtain a first probability;
wherein the probability that the picture is spam is the first probability.
4. The method for identifying picture spam according to claim 2, characterized in that the probability statistical formula is the support vector machine formula;
and applying the probability statistical formula to calculate the probability that the picture is spam, according to the probability of each feature value of the picture appearing in spam pictures, specifically comprises:
querying a sample database according to the feature values of the picture to obtain the probability of each feature value of the picture appearing in spam pictures, wherein the sample database stores all feature values of the spam picture samples and normal picture samples, together with the probability of each feature value appearing in spam pictures;
constructing a feature vector from the probabilities of each feature value of the picture appearing in spam pictures, and substituting the feature vector into the support vector machine formula to obtain a second probability;
wherein the probability that the picture is spam is the second probability.
5. The method for identifying picture spam according to claim 2, characterized in that the probability statistical formula includes the Bayesian formula and the support vector machine formula;
and applying the probability statistical formula to calculate the probability that the picture is spam, according to the probability of each feature value of the picture appearing in spam pictures, specifically comprises:
querying a sample database according to the feature values of the picture to obtain the probability of each feature value of the picture appearing in spam pictures, wherein the sample database stores all feature values of the spam picture samples and normal picture samples, together with the probability of each feature value appearing in spam pictures;
substituting the probability of each feature value of the picture appearing in spam pictures into the Bayesian formula to obtain a first probability;
constructing a feature vector from the probabilities of each feature value of the picture appearing in spam pictures, and substituting the feature vector into the support vector machine formula to obtain a second probability;
wherein the probability that the picture is spam includes the first probability and the second probability.
6. The method for identifying picture spam according to any one of claims 3 to 5, characterized in that applying a hash algorithm to calculate the hash value of the picture, and comparing the hash value with the hash values of pictures in previously received mail to obtain the number of times the picture has been repeatedly sent, specifically comprises:
processing the feature values of the picture with a hash algorithm to obtain the hash value of the picture;
comparing the hash value of the picture with the hash values of pictures in previously received mail to obtain the similarity between the picture and the previously received mail pictures;
obtaining, from the similarity between the picture and the previously received mail pictures, the number of times the picture has been repeatedly sent.
7. The method for identifying picture spam according to claim 6, characterized in that looking up the preset weight value list according to the probability that the picture is spam, the number of times it has been repeatedly sent, and the reputation value of the sending IP, calculating the weight sum of the picture, and determining from the weight sum whether the picture is spam, specifically comprises:
looking up the preset weight value list according to the probability that the picture is spam, the number of times it has been repeatedly sent, and the reputation value of the sending IP, and obtaining a weight value for each of the three;
adding the three weight values to obtain the weight sum of the picture;
determining whether the weight sum of the picture is greater than a predetermined threshold; if so, determining that the picture is spam; if not, determining that the picture is normal mail.
8. The method for identifying picture spam according to claim 7, characterized in that the reputation value database stores the reputation values of sending IPs, where the reputation value is the proportion of normal mail among all mail sent by the sending IP;
and, after the picture is determined to be spam, the method further comprises:
recalculating the reputation value of the sending IP of the picture, and updating the corresponding reputation value in the reputation value database.
9. A mail system, characterized by comprising:
a picture feature extraction module, configured to extract feature values of a picture in a mail according to the compression-rate distribution characteristics of the picture;
a spam probability acquisition module, configured to apply a probability statistical formula to calculate the probability that the picture is spam, according to the probability of each feature value of the picture appearing in spam pictures;
a picture sending count acquisition module, configured to apply a hash algorithm to calculate a hash value of the picture, and to compare the hash value with the hash values of pictures in previously received mail to obtain the number of times the picture has been repeatedly sent;
a reputation value acquisition module, configured to query a reputation value database according to the sending IP of the mail to obtain the reputation value of the sending IP;
a spam determination module, configured to look up a preset weight value list according to the probability that the picture is spam, the number of times it has been repeatedly sent, and the reputation value of the sending IP, to calculate a weight sum of the picture, and to determine from the weight sum whether the picture is spam.
10. The mail system according to claim 9, characterized in that the picture feature extraction module specifically includes:
a picture scanning unit, configured to scan the picture in the mail to obtain the compression rate of each sub-block of the picture;
a picture feature generation unit, configured to merge the compression rates of every N consecutive sub-blocks into one new compression-rate change element, and to combine each compression-rate change element with the code of its position in the picture to obtain the feature values of the picture, where N is a natural number greater than 1.
11. The mail system according to claim 10, characterized in that the spam probability acquisition module specifically includes:
a probability query unit, configured to query a sample database according to the feature values of the picture to obtain the probability of each feature value of the picture appearing in spam pictures;
a Bayesian calculation unit, configured to substitute the probability of each feature value of the picture appearing in spam pictures into the Bayesian formula to obtain a first probability;
a support vector machine calculation unit, configured to construct a feature vector from the probabilities of each feature value of the picture appearing in spam pictures, and to substitute the feature vector into the support vector machine formula to obtain a second probability;
wherein the probability that the picture is spam is the first probability and/or the second probability.
12. The mail system according to claim 11, characterized in that the picture sending count acquisition module specifically includes:
a hash value calculation unit, configured to process the feature values of the picture with a hash algorithm to obtain the hash value of the picture;
a similarity determination unit, configured to compare the hash value of the picture with the hash values of pictures in previously received mail to obtain the similarity between the picture and the previously received mail pictures;
a repeated sending count determination unit, configured to obtain, from the similarity between the picture and the previously received mail pictures, the number of times the picture has been repeatedly sent.
13. The mail system according to claim 12, characterized in that the spam determination module specifically includes:
a weight query unit, configured to look up a preset weight value list according to the probability that the picture is spam, the number of times it has been repeatedly sent, and the reputation value of the sending IP, and to obtain a weight value for each of the three;
a mail identification unit, configured to add the three weight values to obtain the weight sum of the picture, and to determine whether the weight sum of the picture is greater than a predetermined threshold; if so, the picture is determined to be spam; if not, the picture is determined to be normal mail.
14. The mail system according to claim 13, characterized in that the mail system further includes:
a sample database, configured to store all feature values of the spam picture samples and normal picture samples, together with the probability of each feature value appearing in spam pictures;
a reputation value database, configured to store the reputation values of sending IPs, where the reputation value is the proportion of normal mail among all mail sent by the sending IP;
a reputation value update module, configured to recalculate the reputation value of the sending IP of the picture after the spam determination module determines that the picture is spam, and to update the corresponding reputation value in the reputation value database.
PCT/CN2011/074146 2010-06-12 2011-05-17 Method and system for distinguishing image spam mail WO2011153894A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201010201732.9 2010-06-12
CN2010102017329A CN101917352B (en) 2010-06-12 2010-06-12 Method for recognizing picture spam mails and system thereof

Publications (1)

Publication Number Publication Date
WO2011153894A1 (en) 2011-12-15

Family

ID=43324746

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/074146 WO2011153894A1 (en) 2010-06-12 2011-05-17 Method and system for distinguishing image spam mail

Country Status (2)

Country Link
CN (1) CN101917352B (en)
WO (1) WO2011153894A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990172A (en) * 2019-12-02 2021-06-18 阿里巴巴集团控股有限公司 Text recognition method, character recognition method and device

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101917352B (en) * 2010-06-12 2012-07-25 盈世信息科技(北京)有限公司 Method for recognizing picture spam mails and system thereof
CN102929897A (en) * 2011-08-12 2013-02-13 北京千橡网景科技发展有限公司 Method and equipment for detecting bad information from text
CN103684971B (en) * 2012-09-07 2017-02-08 盈世信息科技(北京)有限公司 Method and system for processing mails
CN103020645A (en) * 2013-01-06 2013-04-03 深圳市彩讯科技有限公司 System and method for junk picture recognition
CN106407872A (en) * 2015-07-30 2017-02-15 中兴通讯股份有限公司 Picture processing apparatus and method thereof
CN110048936B (en) * 2019-04-18 2021-09-10 宁波青年优品信息科技有限公司 Method for judging junk mail by semantic associated words
CN114860972B (en) * 2022-07-07 2022-09-20 南通追光者信息技术有限公司 Data transmission optimization storage method for small program development

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080130998A1 (en) * 2006-07-21 2008-06-05 Clearswift Limited Identification of similar images
US20080159632A1 (en) * 2006-12-28 2008-07-03 Jonathan James Oliver Image detection methods and apparatus
CN101540741A (en) * 2009-05-06 2009-09-23 北京邮电大学 Image junk mail filtering method based on threshold
CN101540017A (en) * 2009-04-28 2009-09-23 黑龙江工程学院 Feature extraction method based on byte level n-gram and junk mail filter
CN101573956A (en) * 2006-11-03 2009-11-04 信息实验室有限公司 Detection of image spam
CN101730903A (en) * 2007-01-24 2010-06-09 迈可菲公司 Multi-dimensional reputation scoring
CN101917352A (en) * 2010-06-12 2010-12-15 盈世信息科技(北京)有限公司 Method for recognizing picture spam mails and system thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7320020B2 (en) * 2003-04-17 2008-01-15 The Go Daddy Group, Inc. Mail server probability spam filter
US7664812B2 (en) * 2003-10-14 2010-02-16 At&T Intellectual Property I, L.P. Phonetic filtering of undesired email messages
CN101119341B (en) * 2007-09-20 2011-02-16 腾讯科技(深圳)有限公司 Mail identifying method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WAN MINGCHENG ET AL.: "Servey on Image-based Spam Filtering", APPLICATION RESEARCH OF COMPUTERS, vol. 25, no. 9, September 2008 (2008-09-01), pages 2579 - 2582 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990172A (en) * 2019-12-02 2021-06-18 阿里巴巴集团控股有限公司 Text recognition method, character recognition method and device
CN112990172B (en) * 2019-12-02 2023-12-22 阿里巴巴集团控股有限公司 Text recognition method, character recognition method and device

Also Published As

Publication number Publication date
CN101917352B (en) 2012-07-25
CN101917352A (en) 2010-12-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11791885

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11791885

Country of ref document: EP

Kind code of ref document: A1