WO2014036788A1 - Email collection and classification method - Google Patents

Email collection and classification method

Info

Publication number
WO2014036788A1
Authority
WO
WIPO (PCT)
Prior art keywords
confidence
email
mail
spam
reported
Prior art date
Application number
PCT/CN2012/085097
Other languages
English (en)
Chinese (zh)
Inventor
林延中
潘庆峰
Original Assignee
盈世信息科技(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 盈世信息科技(北京)有限公司
Publication of WO2014036788A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/10: Office automation; Time management
    • G06Q10/107: Computer-aided management of electronic mailing [e-mailing]

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to an email collection and classification method.
  • text classification is performed using artificial intelligence classification algorithms. These algorithms must first be trained on learning samples to construct the corresponding discriminant models before they can classify text; the learning samples therefore need to be acquired first.
  • the existing way to obtain learning samples is to manually label a batch of emails, marking each one as spam or non-spam.
  • the technical problem to be solved by the embodiments of the present invention is to provide an email collection and classification method that does not require dedicated staff to classify a large number of emails, but instead uses a computer to collect user feedback directly. This reduces the manual workload, ensures the accuracy of the classification, and, because no one has to read the mail manually, protects the privacy of the user.
  • an embodiment of the present invention provides an email collection and classification method, including: scanning all reported emails in a server and extracting target emails that have been reported at least n times, where n is a default value and the reported emails include emails reported as normal mail and emails reported as spam; calculating a confidence level for the target email to obtain a calculation result; and determining, according to the calculation result, whether the target email is spam or normal mail, and storing it in the database.
  • the step of calculating the confidence of the target email comprises: adding the confidence levels of all the reporters who reported the target email as normal mail to obtain the total normal mail confidence X; adding the confidence levels of all the reporters who reported the target email as spam to obtain the total spam confidence Y; and calculating the absolute value |X - Y| of the difference between X and Y to obtain the calculation result.
  • the step of determining, according to the calculation result, whether the target email is spam or normal mail comprises: comparing the absolute value |X - Y| of the difference between the total normal mail confidence X and the total spam confidence Y with a threshold T to determine whether |X - Y| is less than T. If it is, the email is not judged for the time being. If it is not, X and Y are compared: when X is greater than Y, the email is determined to be normal mail; when X is less than Y, the email is determined to be spam.
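  • the determination rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `classify`, the dictionary-based inputs, and the use of `None` for an email that is not judged yet are all introduced here for illustration. The text does not say what happens when X equals Y and the threshold is not positive, so the sketch defers that case as well.

```python
from typing import Dict, Optional


def classify(
    reports: Dict[str, str],
    confidence: Dict[str, float],
    threshold: float,
) -> Optional[str]:
    """Classify one target email from its reports.

    reports maps a reporter id to the label that reporter chose,
    either "normal" or "spam" (hypothetical encoding).
    confidence maps a reporter id to that reporter's confidence.
    Returns "normal", "spam", or None when the email is deferred.
    """
    # Total normal mail confidence X and total spam confidence Y.
    x = sum(confidence[r] for r, label in reports.items() if label == "normal")
    y = sum(confidence[r] for r, label in reports.items() if label == "spam")

    # |X - Y| below the threshold T: do not judge for the time being.
    if abs(x - y) < threshold:
        return None
    if x > y:
        return "normal"
    if x < y:
        return "spam"
    # X == Y with a non-positive threshold is not covered by the text;
    # deferring here is an assumption of this sketch.
    return None
```

  • with reporter confidences of 5, 10, 3 and 8 and a threshold of 3, two "normal" reports (5 and 10) outweigh two "spam" reports (3 and 8), so the sketch returns "normal".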
  • before the step of calculating the confidence of the target email, the method further includes: presetting the initial confidence of a first-time reporter to 1.
  • the email collection and classification method further includes: updating the confidence of the reporters, increasing the confidence of each reporter whose report is consistent with the final determination result and reducing the confidence of each reporter whose report is inconsistent with the final determination result.
  • the confidence increases more slowly than it decreases.
  • the confidence level is provided with a maximum value and a minimum value, and the confidence level does not increase after rising to a maximum value, and does not decrease after dropping to a minimum value.
  • the beneficial effects of the embodiments of the present invention are as follows: a computer scans all the reported emails in the server, extracts the target emails that have been reported at least the system default number of times, calculates a result for each target email from the confidence of its reporters, determines from that result whether the reported email is spam or normal mail, and collects it into the corresponding database. Because the computer processes user feedback directly on the basis of confidence, the manual work intensity and workload are reduced, the accuracy of the classification is ensured, and the privacy of the user is protected because no one has to read the mail manually.
  • FIG. 1 is a schematic structural diagram of a first embodiment of an email collection and classification method according to the present invention
  • FIG. 2 is a schematic structural diagram of a second embodiment of an email collection and classification method according to the present invention.
  • FIG. 3 is a schematic structural diagram of a third embodiment of an email collection and classification method according to the present invention.
  • FIG. 4 is a schematic structural diagram of a fourth embodiment of an email collection and classification method according to the present invention.
  • FIG. 1 is a schematic structural diagram of a first embodiment of an email collection and classification method according to the present invention.
  • the method includes the following steps: S100: Scan all the reported emails in the server and extract the target emails that have been reported at least n times.
  • n is a default value, and the reported emails include emails reported as normal mail and emails reported as spam.
  • n can be set according to specific conditions; preferably, n is 3.
  • S102: Determine, according to the calculation result, whether the target email is spam or normal mail, and store it in a database.
  • an email determined to be spam is stored in the spam database, and an email determined to be normal mail is stored in the normal mail database.
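  • the extraction in step S100 can be sketched as below. The record layout, a (mail_id, reporter_id, label) tuple per report, and the name `extract_targets` are hypothetical; the source does not describe how reports are stored on the server.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One report: (mail_id, reporter_id, label), label being "normal" or "spam".
Report = Tuple[str, str, str]


def extract_targets(reports: Sequence[Report], n: int = 3) -> List[str]:
    """Return the ids of emails reported at least n times.

    Reports of both kinds (normal and spam) count toward the total.
    Ids are returned in the order in which they were first reported.
    """
    counts = Counter(mail_id for mail_id, _reporter, _label in reports)
    return [mail_id for mail_id, count in counts.items() if count >= n]
```

  • the default of 3 follows the preferred value of n given in the text; emails below the count are simply left out and can be picked up by a later scan once more reports arrive.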
  • FIG. 2 is a schematic structural diagram of a second embodiment of an email collection and classification method according to the present invention.
  • the method includes the following steps: S200: Scan all the reported emails in the server and extract the target emails that have been reported at least n times.
  • n is a default value, and the reported emails include emails reported as normal mail and emails reported as spam.
  • n can be set according to specific conditions; preferably, n is 3.
  • S201: Add the confidence levels of all the reporters who reported the target email as normal mail to obtain the total normal mail confidence X.
  • S202: Add the confidence levels of all the reporters who reported the target email as spam to obtain the total spam confidence Y.
  • steps S201 and S202 have no fixed order and can be performed simultaneously.
  • S204: Determine, according to the calculation result, whether the target email is spam or normal mail, and store it in a database.
  • an email determined to be spam is stored in the spam database
  • an email determined to be normal mail is stored in the normal mail database.
  • reporters A and B report the M email as normal mail
  • reporters C and D report the M email as spam
  • reporter A's confidence is 5
  • reporter B's confidence is 10
  • reporter C's confidence is 3
  • reporter D's confidence is 8
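  • for the four reporters listed above, the sums defined in steps S201 and S202 and their absolute difference work out as follows. The subscripted symbols c_A to c_D for the individual confidences are notation introduced here, not taken from the source.

```latex
\begin{align*}
X &= c_A + c_B = 5 + 10 = 15 \\
Y &= c_C + c_D = 3 + 8 = 11 \\
|X - Y| &= |15 - 11| = 4
\end{align*}
```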
  • FIG. 3 is a schematic structural diagram of a third embodiment of an email collection and classification method according to the present invention.
  • the method includes: S300: Scan all the reported emails in the server and extract the target emails that have been reported at least n times.
  • n is a default value, and the reported emails include emails reported as normal mail and emails reported as spam.
  • n can be set according to specific conditions; preferably, n is 3.
  • S301: Add the confidence levels of all the reporters who reported the target email as normal mail to obtain the total normal mail confidence X.
  • S302: Add the confidence levels of all the reporters who reported the target email as spam to obtain the total spam confidence Y.
  • steps S301 and S302 have no fixed order and can be performed simultaneously.
  • the threshold T may be preset according to the specific situation; generally the threshold is higher than the initial confidence, and preferably the threshold T is 3.
  • a target email that is not determined for the time being is kept in the temporary storage server and left for a subsequent scan to determine.
  • the values of X and Y are compared.
  • when X is greater than Y, the email is determined to be normal mail; when X is smaller than Y, the email is determined to be spam.
  • an email determined to be spam is stored in the spam database
  • an email determined to be normal mail is stored in the normal mail database.
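  • how target emails end up in the spam database, the normal mail database, or temporary storage can be sketched as below. Plain Python lists stand in for the two databases and the temporary storage server, and the name `route` and the (X, Y) input layout are assumptions made for this sketch only.

```python
from typing import Dict, List, Tuple


def route(
    scored: Dict[str, Tuple[float, float]],
    threshold: float = 3.0,
) -> Tuple[List[str], List[str], List[str]]:
    """Sort target emails into spam, normal, and pending lists.

    scored maps a mail id to its (X, Y) pair: total normal mail
    confidence and total spam confidence.
    Returns (spam_db, normal_db, pending).
    """
    spam_db: List[str] = []
    normal_db: List[str] = []
    pending: List[str] = []  # stands in for the temporary storage server
    for mail_id, (x, y) in scored.items():
        if abs(x - y) < threshold:
            pending.append(mail_id)  # left for a subsequent scan
        elif x > y:
            normal_db.append(mail_id)
        elif x < y:
            spam_db.append(mail_id)
        else:
            # X == Y with a non-positive threshold: not covered by the
            # text, so this sketch keeps the email pending.
            pending.append(mail_id)
    return spam_db, normal_db, pending
```

  • an email left in the pending list is not lost: new reports change its X and Y, so a later scan can move it into one of the two databases.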
  • the threshold T is preset to 3; then |X - Y| < T, so the M email is not determined for the time being; it remains in the temporary storage server and is left for a subsequent scan to determine.
  • the M email is found to have been reported 4 times, which is greater than the default value of 3, so it is extracted as a target email; reporters A and B report the M email as normal mail, and reporters C and D report the M email as spam.
  • reporter B's confidence is 10
  • reporter C's confidence is 3
  • reporter D's confidence is 8
  • the total normal mail confidence X = 5 + 10 = 15
  • the confidence of reporter A is 3, the confidence of reporter B is 8, the confidence of reporter C is 5, the reporter
  • S400: Scan all the reported emails in the server and extract the target emails that have been reported at least n times.
  • n is a default value, and the reported emails include emails reported as normal mail and emails reported as spam.
  • the default value n can be set according to the specific situation; preferably, n is 3.
  • S402: Add the confidence levels of all the reporters who reported the target email as normal mail to obtain the total normal mail confidence X.
  • steps S401 and S402 have no fixed order and can be performed simultaneously.
  • S403: Add the confidence levels of all the reporters who reported the target email as spam to obtain the total spam confidence Y.
  • the threshold T may be preset according to a specific situation, and generally the threshold T is higher than the initial confidence, and preferably the threshold T is 3.
  • a target email that is not determined for the time being is kept in the temporary storage server and left for a subsequent scan to determine.
  • the values of X and Y are compared.
  • when X is greater than Y, the email is determined to be normal mail; when X is smaller than Y, the email is determined to be spam.
  • an email determined to be spam is stored in the spam database
  • an email determined to be normal mail is stored in the normal mail database.
  • the increase and decrease of the confidence level may be preset as needed.
  • preferably, the confidence is increased by 1, and the confidence is decreased by 10% or by 1, whichever is larger.
  • the increase in the confidence is slower than the decrease.
  • the confidence level is provided with a maximum value and a minimum value, and the confidence level does not increase after rising to a maximum value, and does not decrease after dropping to a minimum value.
  • the maximum value or the minimum value may be preset as needed, and preferably, the maximum value is 50 and the minimum value is 0.
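  • the update rule with its bounds can be sketched as follows, using the preferred values from the text (increase by 1, decrease by the larger of 10% and 1, maximum 50, minimum 0). The function name `update_confidence` and its parameters are illustrative assumptions; the increment and decrement are described in the text as presettable, so a real system might expose them as configuration.

```python
def update_confidence(
    conf: float,
    agreed: bool,
    max_conf: float = 50.0,
    min_conf: float = 0.0,
) -> float:
    """Return a reporter's confidence after one determination.

    agreed is True when the reporter's report matched the final
    determination result for the email.
    """
    if agreed:
        new_conf = conf + 1.0
    else:
        # Decrease by 10% of the current value or by 1, whichever is larger.
        new_conf = conf - max(0.10 * conf, 1.0)
    # The confidence stops rising at the maximum and stops falling
    # at the minimum.
    return min(max_conf, max(min_conf, new_conf))
```

  • because the penalty scales with the current value while the reward is a fixed step, a high-confidence reporter loses standing faster than it was gained, which matches the statement that the increase is slower than the decrease.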
  • the threshold T is preset to 3; then |X - Y| < T, so the M email is not determined for the time being; it remains in the temporary storage server and is left for a subsequent scan to determine.
  • the M email is found to have been reported 4 times, which is greater than the default value of 3, so it is extracted as a target email; reporters A and B report the M email as normal mail, and reporters C and D report the M email as spam. If this is reporter A's first report, the initial confidence of reporter A is 1; the confidence of reporter B is 14, the confidence of reporter C is 3, and the confidence of reporter D is 8.
  • here X = 1 + 14 = 15 and Y = 3 + 8 = 11, so |X - Y| = 4 is not less than T and X is greater than Y; the M email is determined to be normal mail, which is consistent with the reports of A and B. The confidence of reporters A and B is therefore increased by 1: the confidence of reporter A becomes 2
  • the confidence of reporter B becomes 15
  • the reports of C and D are inconsistent with the determination result, so the confidence of reporters C and D is reduced by 10% or by 1, whichever is larger. The original confidence of reporter C is 3; 10% of that (0.3) is less than 1, so the confidence of reporter C decreases by 1 to 2.
  • the original confidence of reporter D is 8; 10% of that (0.8) is less than 1, so the confidence of reporter D decreases by 1 to 7.
  • when the confidence of the reporters is updated, the reports of C and D are consistent with the determination result, so the confidence of reporters C and D is increased by 1: the confidence of reporter C becomes 6, and the confidence of reporter D becomes 21;
  • the reports of A and B are inconsistent with the determination result, so the confidence of reporters A and B is reduced by 10% or by 1, whichever is larger. The original confidence of reporter A is 3; 10% of that (0.3) is less than 1, so the confidence of reporter A decreases by 1 to 2.
  • the original confidence of reporter B is 15; 10% of that (1.5) is greater than 1, so the confidence of reporter B decreases by 1.5 to 13.5.
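  • the two decreases in this example can be written out with the "10% or 1, whichever is larger" rule. The symbol Δ(c) for the size of the decrease and the primed symbols for the updated confidences are notation introduced here for illustration.

```latex
\begin{align*}
\Delta(c) &= \max(0.1\,c,\ 1) \\
\Delta(3) &= \max(0.3,\ 1) = 1, & c_A' &= 3 - 1 = 2 \\
\Delta(15) &= \max(1.5,\ 1) = 1.5, & c_B' &= 15 - 1.5 = 13.5
\end{align*}
```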
  • in summary, the computer scans all the reported emails in the server, extracts the target emails that have been reported at least the system default number of times, calculates a result for each target email from the confidence of its reporters, determines from that result whether the reported email is spam or normal mail, and collects it into the corresponding database. Because user feedback is processed directly on the basis of confidence, the manual work intensity and workload are reduced, the accuracy of the classification is ensured, and the privacy of the user is protected because no one has to read the mail manually.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • Computer Hardware Design (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to an email collection and classification method. The method comprises: scanning all reported emails in the server; extracting target emails that have been reported at least n times, where n is a default value and the reported emails include emails reported as normal mail and emails reported as spam; calculating the confidence of the target email to obtain a calculation result; determining, according to the calculation result, whether the target email is spam or normal mail; and storing the result in the database. The invention uses a computer to collect user feedback directly, without assigning dedicated staff to classify and annotate large volumes of email. It reduces the manual workload, ensures classification accuracy, and protects user privacy because the emails are not read manually.
PCT/CN2012/085097 2012-09-07 2012-11-23 Email collection and classification method WO2014036788A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210327624.5 2012-09-07
CN2012103276245A CN102880952A (zh) 2012-09-07 2012-09-07 一种电子邮件收集分类方法

Publications (1)

Publication Number Publication Date
WO2014036788A1 (fr) 2014-03-13

Family

ID=47482268

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/085097 WO2014036788A1 (fr) 2012-09-07 2012-11-23 Email collection and classification method

Country Status (2)

Country Link
CN (1) CN102880952A (fr)
WO (1) WO2014036788A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984703A (zh) * 2014-04-22 2014-08-13 新浪网技术(中国)有限公司 邮件分类方法和装置

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424280B (zh) * 2013-08-30 2018-10-23 格博信息技术(苏州)有限公司 推送跟进方法及其系统
CN103970832A (zh) * 2014-04-01 2014-08-06 百度在线网络技术(北京)有限公司 一种识别垃圾信息的方法与装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999067731A1 (fr) * 1998-06-23 1999-12-29 Microsoft Corporation Technique using a probabilistic classifier to detect junk e-mail messages
WO2005001733A1 (fr) * 2003-06-30 2005-01-06 Dong-June Seen E-mail management system and associated method
CN1719812A (zh) * 2005-08-08 2006-01-11 北京中星微电子有限公司 垃圾电子邮件过滤方法和系统

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7171450B2 (en) * 2003-01-09 2007-01-30 Microsoft Corporation Framework to enable integration of anti-spam technologies
CN101674264B (zh) * 2009-10-20 2011-09-14 哈尔滨工程大学 基于用户关系挖掘及信誉评价的垃圾邮件检测装置及方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984703A (zh) * 2014-04-22 2014-08-13 新浪网技术(中国)有限公司 邮件分类方法和装置
CN103984703B (zh) * 2014-04-22 2017-04-12 新浪网技术(中国)有限公司 邮件分类方法和装置

Also Published As

Publication number Publication date
CN102880952A (zh) 2013-01-16

Similar Documents

Publication Publication Date Title
CN109981625B (zh) 一种基于在线层次聚类的日志模板抽取方法
CN109218223B (zh) 一种基于主动学习的鲁棒性网络流量分类方法及系统
CN112686775A (zh) 基于孤立森林算法的电力网络攻击检测方法及系统
GB2483358A (en) Markov parsing of email message using annotations
BR112013012553B1 (pt) método para triagem de comunicações eletrônicas em um ambiente de sistema de computação, dispositivo computacional e meio de armazenamento legível por computador
CN109359137B (zh) 基于特征筛选与半监督学习的用户成长性画像构建方法
US8352409B1 (en) Systems and methods for improving the effectiveness of decision trees
WO2016197814A1 (fr) Junk file identification and management method, identification device, management device and terminal
WO2014036788A1 (fr) Email collection and classification method
CN105843889A (zh) 基于可信度面向大数据及普通数据的数据采集方法和系统
CN117033366B (zh) 基于知识图谱的泛在时空数据交叉验证方法及装置
CN108683658B (zh) 基于多rbm网络构建基准模型的工控网络流量异常识别方法
WO2023241385A1 (fr) Model transfer method and apparatus, and electronic device
CN110807546A (zh) 社区网格人口变化预警方法及系统
CN116226103A (zh) 一种基于FPGrowth算法进行政务数据质量检测的方法
CN113052353B (zh) 空气质量预测与预测模型训练方法、装置及存储介质
CN117786656A (zh) 一种api识别方法、装置、电子设备及存储介质
Zeng et al. PyroHMMvar: a sensitive and accurate method to call short indels and SNPs for Ion Torrent and 454 data
CN116150632A (zh) 智能家居中基于局部敏感哈希的物联网设备识别方法
Daouia et al. Extreme value modelling of SARS-CoV-2 community transmission using discrete generalized Pareto distributions
CN113590691B (zh) 目标对象处理方法以及装置
CN111860441B (zh) 基于无偏深度迁移学习的视频目标识别方法
CN112000955B (zh) 确定日志特征序列的方法、漏洞分析方法及系统、设备
CN115099875A (zh) 基于决策树模型的数据分类方法及相关设备
CN103336865A (zh) 一种动态通信网络构建方法及装置

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 12884103; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 12884103; Country of ref document: EP; Kind code of ref document: A1)