WO2015032123A1 - Method and device for extracting number from e-mail - Google Patents

Method and device for extracting number from e-mail

Info

Publication number
WO2015032123A1
Authority
WO
WIPO (PCT)
Prior art keywords
byte
symbol
double
pure
email
Prior art date
Application number
PCT/CN2013/086174
Other languages
French (fr)
Chinese (zh)
Inventor
陈颖棠
叶远鹏
Original Assignee
盈世信息科技(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 盈世信息科技(北京)有限公司
Publication of WO2015032123A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/561 Virus type analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking


Abstract

Disclosed are a method and device for extracting a number from an e-mail. The method comprises: identifying a single symbol in an e-mail to obtain an identification result; classifying the identification result to obtain a determination result; and converting the determination result to obtain a digit-only number string. By implementing the embodiments of the present invention, numbers containing separators and numbers written with symbol digits in the subject or content of an e-mail can be recognized, and a mixed number can be converted into a digit-only number string. This reduces the difficulty of number extraction and the consumption of resources, and makes it easier for an anti-spam module to analyze the e-mail and apply rules, so that spam e-mail can be recognized quickly, which is convenient for the user.

Description

Method for extracting numbers in an email and device therefor
Technical Field
[0001] The present invention relates to the field of email technology, and in particular to a method for extracting numbers in an email and a device therefor.
Background
[0002] With the continuous development of mobile terminal technology, mobile devices such as mobile phones, PDAs, tablets, and notebooks have become an indispensable part of people's work and life, and email is one of the functions most commonly used for office work and communication. Among the applications used by Internet users, email is a common basic application: users can conveniently send information to one another by email, but this has also given rise to the problem of spam email.
[0003] Spam email is any email forcibly sent to a user's mailbox without the permission of the user (the recipient). Its content includes promotional advertisements, adult advertisements, and money-making offers, or it may carry computer viruses that compromise the recipient's computer system. Spam troubles mailbox users and degrades their experience, so the major mail providers treat improving the effectiveness of their anti-spam systems as a key part of improving the mailbox user experience.
[0004] In the prior art, one way of identifying whether an email is spam is to extract numbers from it. Numbers are extracted mainly from the email subject and the email content, and serve as an additional feature of the email in anti-spam processing. For example, for spam that leaves contact details, the extracted number can be compared against a database of known spam numbers to identify whether the email is spam. Existing number extraction takes two forms: most implementations search directly for strings consisting entirely of digits; the other approach uses regular expressions.
[0005] Searching directly for all-digit strings has narrow applicability: it works only for consecutive digit strings and cannot recognize numbers containing separators. Regular-expression extraction only recognizes and extracts strings that match the rules; because regular expressions are so powerful, they are hard to write, test, and verify, and they are comparatively resource-intensive. In both methods the extracted number is the original string and is not converted into a plain digit-only string, which is inconvenient for anti-spam analysis and for applying rules.
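The limitation of the direct digit-run search described in [0005] can be illustrated with a small sketch. This is not code from the patent; the function name `find_digit_runs` and the sample text are illustrative assumptions.

```python
import re


def find_digit_runs(text: str) -> list[str]:
    """Prior-art style search: return every run of consecutive ASCII digits."""
    return re.findall(r"[0-9]+", text)


# The same phone number written with different separators yields different,
# fragmented results, so the two spellings cannot be matched against each other.
print(find_digit_runs("400-235-335"))
print(find_digit_runs("400-235335"))
```

Because each separator splits the number into pieces, neither spelling produces the single string "400235335" that a spam-number database lookup would need.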
Summary of the Invention
[0006] The object of the present invention is to overcome the deficiencies of the prior art. The invention provides a method for extracting numbers in an email and a device therefor, which can reduce the difficulty of number extraction and reduce resource consumption.
[0007] To solve the above problem, the present invention proposes a method for extracting numbers in an email, the method comprising:
identifying a single symbol in the email to obtain an identification result;
classifying the identification result to obtain a determination result;
converting the determination result to obtain a digit-only number string.
[0008] Preferably, the step of identifying a single symbol in the email to obtain an identification result comprises: identifying, according to the character encoding, whether the symbol is a single-byte symbol or a double-byte symbol.
[0009] Preferably, the step of classifying the identification result to obtain a determination result comprises:
when the symbol is determined to be a single-byte symbol, determining according to the character encoding whether it is a single-byte plain digit or a single-byte separator;
when the symbol is determined to be a double-byte symbol, determining according to the character encoding whether it is a double-byte symbol digit or a double-byte separator.
[0010] Preferably, the step of converting the determination result to obtain a digit-only number string comprises:
if the symbol is determined to be a single-byte plain digit, recording the digit directly;
if the symbol is determined to be a double-byte character, converting it to the corresponding single-byte character, and thus to a plain digit.
[0011] Preferably, the method further comprises: checking and recording the digit-only number string.
[0012] Correspondingly, the present invention further provides a device for extracting numbers in an email, the device comprising:
an identification module, configured to identify a single symbol in the email and obtain an identification result;
a determination module, configured to classify the identification result obtained by the identification module and obtain a determination result;
a conversion module, configured to convert the determination result obtained by the determination module and obtain a digit-only number string.
[0013] Preferably, the identification module is configured to identify, according to the character encoding, whether the symbol is a single-byte symbol or a double-byte symbol.
[0014] Preferably, the determination module is further configured to: when the symbol is determined to be a single-byte symbol, determine according to the character encoding whether it is a single-byte plain digit or a single-byte separator; and when the symbol is determined to be a double-byte symbol, determine according to the character encoding whether it is a double-byte symbol digit or a double-byte separator.
[0015] Preferably, the conversion module is configured to record the digit directly if the determination result is a single-byte plain digit, and, if the determination result is a double-byte character, to convert it to the corresponding single-byte character and thus to a plain digit.
[0016] Preferably, the device further comprises: a check-and-record module, configured to check and record the digit-only number string.
[0017] By implementing the embodiments of the present invention, numbers containing separators and numbers written with symbol digits can be recognized in the subject or content of an email, and a mixed number can be converted into a digit-only number string. This reduces the difficulty of number extraction and reduces resource consumption, and it makes it easier for the anti-spam module to analyze the email and apply rules, so that spam email can be identified quickly, which is convenient for users.
Brief Description of the Drawings
[0018] To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing them are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
[0019] FIG. 1 is a schematic flowchart of a method for extracting numbers in an email according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a device for extracting numbers in an email according to an embodiment of the present invention.
Detailed Description
[0020] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
[0021] The main role of the anti-spam module in an email system is to analyze emails, record and count their features, and determine whether they are spam. A traditional anti-spam module, however, cannot recognize that "400-235-335" and "400-235335" mean the same thing, namely "400235335"; the system can only conclude that the two numbers are different. A unified representation of numbers is therefore needed so that the email system can recognize them without interference from differences in symbols.
[0022] FIG. 1 is a schematic flowchart of a method for extracting numbers in an email according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
S101, identifying a single symbol in the email to obtain an identification result;
S102, classifying the identification result to obtain a determination result;
S103, converting the determination result to obtain a digit-only number string.
[0023] In S101, the symbol is identified as a single-byte symbol or a double-byte symbol according to the character encoding. Specifically, a property of the character encoding (whether the highest bit is 1) is used to tell whether the extracted symbol is a single-byte or a double-byte symbol. If the symbol is single-byte, one byte of content is taken; if it is double-byte, two bytes of content are taken.
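A minimal sketch of step S101 follows. The patent only says that the highest bit distinguishes the two cases; the sketch assumes a GB2312-style encoding in which a byte with the highest bit set starts a double-byte symbol. The function name `split_symbols` is hypothetical.

```python
def split_symbols(data: bytes) -> list[bytes]:
    """Split encoded text into single-byte and double-byte symbols.

    Assumption: a byte whose highest bit is 1 is the first byte of a
    double-byte symbol (as in GB2312); any other byte stands alone.
    """
    symbols = []
    i = 0
    while i < len(data):
        width = 2 if data[i] & 0x80 else 1
        # A truncated trailing lead byte simply yields a one-byte slice.
        symbols.append(data[i:i + width])
        i += width
    return symbols


# b"\xa2\xe1" is the two-byte code the description gives for the symbol ⑨.
print(split_symbols(b"4\xa2\xe1-"))
```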
[0024] In S102, when the symbol is determined to be a single-byte symbol, it is determined according to the character encoding whether it is a single-byte plain digit or a single-byte separator; when the symbol is determined to be a double-byte symbol, it is determined according to the character encoding whether it is a double-byte symbol digit or a double-byte separator.
[0025] In a specific implementation, if the symbol is a single-byte symbol, the content of its character code is used to determine whether it is a single-byte plain digit "0-9" or a single-byte separator; if the symbol is a double-byte symbol, the content of its character code is used to determine whether it is a symbol digit (such as "⑨", whose code is 0xA2, 0xE1) or a double-byte separator.
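The classification in S102 might be sketched as below. The description only gives the code of ⑨ (0xA2, 0xE1); the range 0xD9 to 0xE1 for ① to ⑨ is derived here from the 0xA8 offset mentioned later in the description, and the separator sets (ASCII hyphen, and 0xA3 0xAD assumed to be the fullwidth hyphen in GB2312) are illustrative assumptions, since the patent does not list its separators. All names are hypothetical.

```python
from enum import Enum


class Kind(Enum):
    DIGIT = "single-byte plain digit"
    SEPARATOR = "single-byte separator"
    SYMBOL_DIGIT = "double-byte symbol digit"
    WIDE_SEPARATOR = "double-byte separator"
    OTHER = "other"


# Assumed separator sets; the patent does not enumerate them.
SINGLE_BYTE_SEPARATORS = {b"-"}
DOUBLE_BYTE_SEPARATORS = {b"\xa3\xad"}


def classify(symbol: bytes) -> Kind:
    """Classify one symbol by inspecting the content of its character code."""
    if len(symbol) == 1:
        if 0x30 <= symbol[0] <= 0x39:          # ASCII "0".."9"
            return Kind.DIGIT
        if symbol in SINGLE_BYTE_SEPARATORS:
            return Kind.SEPARATOR
    elif len(symbol) == 2:
        # Circled digits, assumed range 0xA2 0xD9 .. 0xA2 0xE1 (① .. ⑨).
        if symbol[0] == 0xA2 and 0xD9 <= symbol[1] <= 0xE1:
            return Kind.SYMBOL_DIGIT
        if symbol in DOUBLE_BYTE_SEPARATORS:
            return Kind.WIDE_SEPARATOR
    return Kind.OTHER
```

Keeping the separator sets as data rather than hard-coded branches makes it easy to adjust them once the actual set of separators seen in spam is known.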
[0026] In S103, if the symbol is determined to be a single-byte plain digit, the digit is recorded directly; if it is determined to be a double-byte character, it is converted to the corresponding single-byte character and thus to a plain digit.
[0027] In a specific implementation: a single-byte plain digit is recorded directly; a connector (separator) is consumed and processing continues with the next symbol; a double-byte character is converted to the corresponding single-byte character (because the codes of such symbols are consecutive, subtracting the starting code gives the digit to convert to; for example, for ⑨, 0xE1 - 0xA8 = 0x39, and 0x39 is the code of the digit "9"); for any other symbol, extraction of the current number ends, and the number is checked for whether it needs to be recorded, whether its length is acceptable, and so on.
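Putting the rules of [0027] together, a single pass over the bytes could look like the following sketch. It assumes a GB2312-style encoding, uses the 0xA8 offset from the description for the circled digits, and assumes a hypothetical separator set (ASCII hyphen and 0xA3 0xAD); the function name `extract_numbers` is likewise illustrative.

```python
CIRCLED_LEAD = 0xA2                 # first byte of ①..⑨ (from the ⑨ example)
CIRCLED_FIRST, CIRCLED_LAST = 0xD9, 0xE1
CIRCLED_OFFSET = 0xA8               # 0xE1 - 0xA8 = 0x39, the code of "9"
SEPARATORS = {b"-", b"\xa3\xad"}    # assumed separator set


def extract_numbers(data: bytes) -> list[str]:
    """Return the digit-only number strings found in encoded text."""
    numbers, current = [], []
    i = 0
    while i < len(data):
        width = 2 if data[i] & 0x80 else 1
        sym = data[i:i + width]
        i += width
        if len(sym) == 1 and 0x30 <= sym[0] <= 0x39:
            current.append(chr(sym[0]))                   # record digit directly
        elif (len(sym) == 2 and sym[0] == CIRCLED_LEAD
              and CIRCLED_FIRST <= sym[1] <= CIRCLED_LAST):
            current.append(chr(sym[1] - CIRCLED_OFFSET))  # symbol digit -> ASCII
        elif sym in SEPARATORS:
            continue                                      # skip, keep the number open
        elif current:
            numbers.append("".join(current))              # any other symbol ends it
            current = []
    if current:
        numbers.append("".join(current))
    return numbers


print(extract_numbers(b"400-235-335"))
print(extract_numbers(b"400-235335"))
```

With these assumptions both spellings from [0021] reduce to the same string, which is the unified representation the description asks for. Length and record checks are left to a separate step.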
[0028] Further, after the digit-only number string is obtained, it may also be checked and recorded; the check includes whether the string consists purely of digits, whether its length meets the requirements, and whether it needs to be recorded.
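The check in [0028] could be sketched as a simple predicate. The patent does not state the length requirements, so `min_len` and `max_len` below are hypothetical placeholders, as is the function name.

```python
def should_record(number: str, min_len: int = 7, max_len: int = 15) -> bool:
    """Check that a candidate is digit-only and within the length limits.

    The default bounds are illustrative assumptions, not values from the patent.
    """
    return (number.isascii()
            and number.isdigit()
            and min_len <= len(number) <= max_len)
```

For example, `should_record("400235335")` passes both checks, while a short fragment such as `"400"` or an unconverted string such as `"400-235"` is rejected.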
[0029] By implementing the method embodiment of the present invention, numbers containing separators and numbers written with symbol digits can be recognized in the subject or content of an email, and a mixed number can be converted into a digit-only number string. This reduces the difficulty of number extraction and reduces resource consumption, and it makes it easier for the anti-spam module to analyze the email and apply rules, so that spam email can be identified quickly, which is convenient for users.
[0030] 本发明实施例还提供了一种电子邮件中号码的提取装置, 如图 2所示, 该装置包括: 识别模块 1, 用于对电子邮件中的单个符号进行识别, 并获得识别结果;  An embodiment of the present invention further provides an apparatus for extracting numbers in an email. As shown in FIG. 2, the apparatus includes: an identification module 1 configured to identify a single symbol in an email, and obtain a recognition result. ;
判定模块 2, 用于对识别模块 1所获得的识别结果进行分类判定, 获得判定结果; 转换模块 3, 用于对判定模块 2所获得的判定结果进行转换, 获得纯数字号码串。 The determining module 2 is configured to perform classification determination on the recognition result obtained by the identification module 1 to obtain a determination result, and the conversion module 3 is configured to convert the determination result obtained by the determination module 2 to obtain a pure digital number string.
[0031] 其中, 该识别模块 1用于根据字符编码识别符号为单字节符号或者为双字节符号。 具 体方式是: 根据字符编码的特性 (最高位是否为 1 ) 识别出所提取符号为单字节符号还是双 字节符号。 若该符号为单字节符号, 则取一个字节内容; 若该符号为双字节符号, 则取两个 字节内容。  [0031] The identification module 1 is configured to identify the symbol as a single-byte symbol or a double-byte symbol according to the character encoding. The specific way is: According to the characteristics of the character encoding (whether the highest bit is 1 or not), the extracted symbol is identified as a single-byte symbol or a double-byte symbol. If the symbol is a single-byte symbol, take one byte of content; if the symbol is a double-byte symbol, take two bytes of content.
[0032] 判定模块 2还用于当判定符号为单字节符号时, 根据字符编码判定是否为单字节纯数 字, 或者是否为单字节分隔符; 以及用于当判定符号为双字节符号时, 根据字符编码判定是 否为双字节符号号码, 或者是否为双字节分隔符。  [0032] the determining module 2 is further configured to: when determining that the symbol is a single-byte symbol, determine whether it is a single-byte pure number according to the character encoding, or whether it is a single-byte separator; and when determining that the symbol is double-byte When the symbol is used, it is determined whether it is a double-byte symbol number based on the character encoding, or whether it is a double-byte separator.
[0033] In a specific implementation, if the symbol is a single-byte symbol, the determination module 2 determines from the content of the character encoding whether it is a single-byte pure digit "0-9" or a single-byte separator. If the symbol is a double-byte symbol, the determination module 2 determines from the content of the character encoding whether it is a symbol number (such as "⑨", which is encoded as 0xA2, 0xE1) or a double-byte separator.
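A sketch of this classification step is given below. It assumes GB2312, where the circled digits ① to ⑨ occupy 0xA2D9 to 0xA2E1 (consistent with the 0xA2, 0xE1 code given for "⑨"). The separator sets are hypothetical, since the text does not enumerate them, and the names `Kind` and `classify` are illustrative only.

```python
from enum import Enum


class Kind(Enum):
    DIGIT = "single-byte pure digit"
    SYMBOL_NUMBER = "double-byte symbol number"
    SEPARATOR = "separator"
    OTHER = "other"


# Hypothetical separator sets; the description does not list them.
SINGLE_BYTE_SEPARATORS = {b"-", b" "}
# GB2312 fullwidth hyphen-minus and ideographic space.
DOUBLE_BYTE_SEPARATORS = {b"\xa3\xad", b"\xa1\xa1"}


def classify(symbol: bytes) -> Kind:
    """Classify one symbol already split into 1 or 2 bytes."""
    if len(symbol) == 1:
        if symbol.isdigit():  # ASCII "0".."9"
            return Kind.DIGIT
        if symbol in SINGLE_BYTE_SEPARATORS:
            return Kind.SEPARATOR
        return Kind.OTHER
    if len(symbol) == 2:
        # Circled digits 1..9 in GB2312 (assumed symbol table).
        if symbol[0] == 0xA2 and 0xD9 <= symbol[1] <= 0xE1:
            return Kind.SYMBOL_NUMBER
        if symbol in DOUBLE_BYTE_SEPARATORS:
            return Kind.SEPARATOR
    return Kind.OTHER
```

With these assumptions, `classify(b"\xa2\xe1")` returns `Kind.SYMBOL_NUMBER` and `classify(b"-")` returns `Kind.SEPARATOR`. A real filter would likely cover more symbol families (for example fullwidth digits), which is outside what the text specifies.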
[0034] In addition, the conversion module 3 is further configured to: record the digit directly if the determination result is a single-byte pure digit; and, if the determination result is a double-byte character, convert it into a single-byte character and then into a pure numeric number. In a specific implementation: if the symbol is a single-byte pure digit, it is recorded directly; if it is a connector (separator), it is skipped and processing continues with the next symbol; if it is a double-byte character, it is converted into the corresponding single-byte character (because the codes of such symbols are contiguous, subtracting the starting code yields the number to convert to; for "⑨", 0xE1 - 0xA8 = 0x39, and 0x39 is the code of the digit "9"); if it is anything else, extraction of the current number ends, and the number is checked, for example whether it needs to be recorded and whether its length is acceptable.
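The conversion rule and the extraction loop can be sketched together as follows. This is a self-contained illustration under stated assumptions: GB2312 encoding, circled digits ① to ⑨ at 0xA2D9 to 0xA2E1 (derived from the 0xA8 offset in the text), and a hypothetical separator set. The names `to_digit` and `extract_numbers` are not from the source.

```python
from typing import List

CIRCLED_LEAD = 0xA2    # first byte of the circled digits (assumed GB2312)
CIRCLED_OFFSET = 0xA8  # 0xE1 - 0xA8 == 0x39, the ASCII code of "9"
# Hypothetical separators: hyphen, space, GB2312 fullwidth hyphen-minus.
SEPARATORS = {b"-", b" ", b"\xa3\xad"}


def to_digit(symbol: bytes) -> str:
    """Return the ASCII digit for a symbol, or "" if it is not a digit."""
    if len(symbol) == 1 and symbol.isdigit():
        return symbol.decode("ascii")  # single-byte digit: record as is
    if (len(symbol) == 2 and symbol[0] == CIRCLED_LEAD
            and 0xD9 <= symbol[1] <= 0xE1):
        # Contiguous codes: subtract the offset to get the ASCII digit.
        return chr(symbol[1] - CIRCLED_OFFSET)
    return ""


def extract_numbers(data: bytes) -> List[str]:
    """Scan encoded text and return the pure numeric strings found."""
    numbers, current = [], []
    i = 0
    while i < len(data):
        width = 2 if data[i] & 0x80 else 1  # high bit -> double byte
        symbol = data[i:i + width]
        i += width
        digit = to_digit(symbol)
        if digit:
            current.append(digit)
        elif symbol in SEPARATORS and current:
            continue  # separator inside a number: skip, keep collecting
        elif current:
            numbers.append("".join(current))  # any other symbol ends it
            current = []
    if current:
        numbers.append("".join(current))
    return numbers
```

Under these assumptions, `extract_numbers(b"tel \xa2\xd9\xa2\xdb\xa2\xe0-0013-8000 ok")` returns `["13800138000"]`: the three circled digits become "138", and the hyphens are skipped. Treating a space as a separator is a design choice of this sketch; it merges digit groups split by spaces but can also join unrelated adjacent numbers.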
[0035] Further, the apparatus may also include a check-and-record module (not shown in the figure), configured to check and record the pure numeric number string, including whether it is a pure numeric number, whether the length of the number meets the requirements, and whether it needs to be recorded.
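A possible form of this check is sketched below. The length bounds are hypothetical defaults chosen for illustration (the text does not state concrete limits), and `should_record` is an assumed name.

```python
def should_record(number: str, min_len: int = 7, max_len: int = 15) -> bool:
    """Decide whether an extracted string is worth recording.

    min_len and max_len are hypothetical thresholds, not values from
    the source.
    """
    # isascii() excludes symbol numbers such as circled digits, which
    # str.isdigit() alone would accept.
    is_pure_numeric = number.isascii() and number.isdigit()
    return is_pure_numeric and min_len <= len(number) <= max_len
```

For instance, `should_record("13800138000")` is true, while `should_record("123")` is false because the string is shorter than the assumed minimum.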
[0036] By implementing the apparatus embodiment of the present invention, numbers containing separators and symbol numbers can be identified in the subject or body of an email, and such mixed numbers can be converted into pure numeric number strings. This reduces the difficulty of number extraction and lowers resource consumption. It also simplifies analysis by the anti-spam module of the email system and the application of its rules, so that spam email can be identified quickly, which is convenient for users.
[0037] Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.
[0038] The method and apparatus for extracting numbers from an email provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims

1. A method for extracting numbers from an email, characterized in that the method comprises:
identifying individual symbols in the email and obtaining an identification result;
classifying the identification result to obtain a determination result;
converting the determination result to obtain a pure numeric number string.
2. The method for extracting numbers from an email according to claim 1, characterized in that the step of identifying individual symbols in the email and obtaining an identification result comprises:
identifying, according to the character encoding, whether the symbol is a single-byte symbol or a double-byte symbol.
3. The method for extracting numbers from an email according to claim 2, characterized in that the step of classifying the identification result to obtain a determination result comprises:
when the symbol is determined to be a single-byte symbol, determining according to the character encoding whether it is a single-byte pure digit or a single-byte separator;
when the symbol is determined to be a double-byte symbol, determining according to the character encoding whether it is a double-byte symbol number or a double-byte separator.
4. The method for extracting numbers from an email according to claim 3, characterized in that the step of converting the determination result to obtain a pure numeric number string comprises:
if the symbol is determined to be a single-byte pure digit, recording the digit directly;
if the symbol is determined to be a double-byte character, converting it into a single-byte character and then into a pure numeric number.
5. The method for extracting numbers from an email according to any one of claims 1 to 4, characterized in that the method further comprises: checking and recording the pure numeric number string.
6. An apparatus for extracting numbers from an email, characterized in that the apparatus comprises:
an identification module, configured to identify individual symbols in the email and obtain an identification result;
a determination module, configured to classify the identification result obtained by the identification module and obtain a determination result;
a conversion module, configured to convert the determination result obtained by the determination module and obtain a pure numeric number string.
7. The apparatus for extracting numbers from an email according to claim 6, characterized in that the identification module is configured to identify, according to the character encoding, whether the symbol is a single-byte symbol or a double-byte symbol.
8. The apparatus for extracting numbers from an email according to claim 7, characterized in that the determination module is further configured to: when the symbol is determined to be a single-byte symbol, determine according to the character encoding whether it is a single-byte pure digit or a single-byte separator; and, when the symbol is determined to be a double-byte symbol, determine according to the character encoding whether it is a double-byte symbol number or a double-byte separator.
9. The apparatus for extracting numbers from an email according to claim 8, characterized in that the conversion module is configured to: record the digit directly if the determination result is a single-byte pure digit; and, if the determination result is a double-byte character, convert it into a single-byte character and then into a pure numeric number.
10. The apparatus for extracting numbers from an email according to any one of claims 6 to 9, characterized in that the apparatus further comprises: a check-and-record module, configured to check and record the pure numeric number string.
PCT/CN2013/086174 2013-09-04 2013-10-29 Method and device for extracting number from e-mail WO2015032123A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310397191.5 2013-09-04
CN201310397191.5A CN103490980B (en) 2013-09-04 2013-09-04 The extracting method and its device of number in a kind of Email

Publications (1)

Publication Number Publication Date
WO2015032123A1 true WO2015032123A1 (en) 2015-03-12

Family

ID=49830951

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/086174 WO2015032123A1 (en) 2013-09-04 2013-10-29 Method and device for extracting number from e-mail

Country Status (2)

Country Link
CN (1) CN103490980B (en)
WO (1) WO2015032123A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020366B (en) * 2017-12-07 2021-06-15 北大方正集团有限公司 Mailbox information extraction method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101304589A (en) * 2008-04-14 2008-11-12 中国联合通信有限公司 Method and system for monitoring and filtering garbage short message transmitted by short message gateway
CN101784022A (en) * 2009-01-16 2010-07-21 北京炎黄新星网络科技有限公司 Method and system for filtering and classifying short messages
CN102088697A (en) * 2010-12-17 2011-06-08 北京华中融合科技有限公司 Method and system for processing spam
US20120005589A1 (en) * 2010-07-05 2012-01-05 Seohyun Han Mobile terminal and method for controlling the operation of the mobile terminal

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101087259A (en) * 2006-06-07 2007-12-12 深圳市都护网络科技有限公司 A system for filtering spam in Internet and its implementation method
CN102078984A (en) * 2010-11-26 2011-06-01 西南铝业(集团)有限责任公司 Method and system for processing core head working tapes of divergent die upper die

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101304589A (en) * 2008-04-14 2008-11-12 中国联合通信有限公司 Method and system for monitoring and filtering garbage short message transmitted by short message gateway
CN101784022A (en) * 2009-01-16 2010-07-21 北京炎黄新星网络科技有限公司 Method and system for filtering and classifying short messages
US20120005589A1 (en) * 2010-07-05 2012-01-05 Seohyun Han Mobile terminal and method for controlling the operation of the mobile terminal
CN102088697A (en) * 2010-12-17 2011-06-08 北京华中融合科技有限公司 Method and system for processing spam

Also Published As

Publication number Publication date
CN103490980A (en) 2014-01-01
CN103490980B (en) 2017-07-28

Similar Documents

Publication Publication Date Title
TWI526825B (en) Web page link detection method, device and system
CN104509041B (en) The detection method and device of the annex passed into silence
CN103546446B (en) Phishing website detection method, device and terminal
US20170289082A1 (en) Method and device for identifying spam mail
GB2483358A (en) Markov parsing of email message using annotations
CN1691631A (en) Method for management of vcards
CN112487149B (en) Text auditing method, model, equipment and storage medium
CN114157502B (en) Terminal identification method and device, electronic equipment and storage medium
WO2016000545A1 (en) Junk picture file identification method, apparatus, and electronic device
WO2015101353A1 (en) Method and apparatus for processing text information
CN113114707B (en) Rule filtering method for power chip Ethernet controller
CN112307369A (en) Short link processing method, device, terminal and storage medium
US8955127B1 (en) Systems and methods for detecting illegitimate messages on social networking platforms
US20170229118A1 (en) Linguistic model database for linguistic recognition, linguistic recognition device and linguistic recognition method, and linguistic recognition system
US11010687B2 (en) Detecting abusive language using character N-gram features
US20200304448A1 (en) System and Method for Detecting and Predicting Level of Importance of Electronic Mail Messages
CN103365934A (en) Extracting method and device of complex named entity
WO2021114634A1 (en) Text annotation method, device, and storage medium
WO2015032123A1 (en) Method and device for extracting number from e-mail
CN104376304A (en) Identification method and device for text advertisement image
CN116055067A (en) Weak password detection method, device, electronic equipment and medium
CN115774762A (en) Instant messaging information processing method, device, equipment and storage medium
CN106294292B (en) Chapter catalog screening method and device
CN113220949A (en) Construction method and device of private data identification system
CN103853784B (en) A kind of webpage matching process of mobile terminal, device and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13892956

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13892956

Country of ref document: EP

Kind code of ref document: A1