WO2014036788A1 - Email collection and classification method - Google Patents
Email collection and classification method
- Publication number
- WO2014036788A1 (PCT/CN2012/085097)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- confidence
- spam
- reported
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/107—Computer-aided management of electronic mailing [e-mailing]
Definitions
- the present invention relates to the field of communications technologies, and in particular, to an email collection and classification method.
- text classification is performed using artificial intelligence classification algorithms. These algorithms must first be trained on learning samples and construct the corresponding discriminant models before they can classify text; the learning samples therefore have to be acquired first.
- the conventional way of obtaining learning samples is to manually label a batch of mail as spam or non-spam.
- the technical problem to be solved by the embodiments of the present invention is to provide an email collection and classification method that does not require assigning dedicated staff to classify large numbers of emails, but instead collects user feedback directly by computer, thereby reducing manual workload and ensuring classification accuracy; because the mail does not have to be read manually, the privacy of the user is protected.
- an embodiment of the present invention provides an email collection and classification method, including: scanning all reported emails in a server and extracting target emails that have been reported at least n times, where n is a default value and the reported emails include mail reported as normal mail and mail reported as spam; calculating the confidence of the target mail to obtain a calculation result; and determining, according to the calculation result, whether the target mail is spam or normal mail, and storing it in a database.
- the step of calculating the confidence of the target mail comprises: adding the confidence levels of all reporters who report the target mail as normal mail to obtain the total normal mail confidence X; adding the confidence levels of all reporters who report the target mail as spam to obtain the total spam confidence Y; and calculating the absolute value |X - Y| of the difference between X and Y to obtain the calculation result.
- the step of determining, according to the calculation result, whether the target mail is spam or normal mail comprises: comparing the absolute value |X - Y| of the difference between the total normal mail confidence X and the total spam confidence Y with a threshold T to determine whether |X - Y| is less than T. If it is, the mail is not judged for the time being; if it is not, X and Y are compared. When X is greater than Y, the mail is determined to be normal mail; when X is less than Y, the mail is determined to be spam.
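The decision rule described above can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the function name `classify_mail`, its parameters, and the string labels are hypothetical, and the default threshold of 3 is the preferred value given later in the description. Treating an exact tie between X and Y as undecided is also an assumption, since the text only covers X greater than or less than Y.

```python
from typing import Iterable, Optional


def classify_mail(normal_confidences: Iterable[float],
                  spam_confidences: Iterable[float],
                  threshold: float = 3.0) -> Optional[str]:
    """Return "normal", "spam", or None when the mail is left undecided."""
    # X: total confidence of reporters who marked the mail as normal.
    x = sum(normal_confidences)
    # Y: total confidence of reporters who marked the mail as spam.
    y = sum(spam_confidences)
    # |X - Y| below the threshold T: do not judge the mail for now.
    # An exact tie is also left undecided (assumption, see lead-in).
    if abs(x - y) < threshold or x == y:
        return None
    return "normal" if x > y else "spam"


print(classify_mail([5, 10], [3, 8]))
```

A `None` result corresponds to mail that stays in temporary storage until a later scan.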
- before the step of calculating the confidence of the target mail, the method further includes: presetting the initial confidence of a reporter who reports mail for the first time to 1.
- the email collection and classification method further includes: updating the confidence of the reporters, increasing the confidence of reporters whose reports are consistent with the final determination result and reducing the confidence of reporters whose reports are inconsistent with it.
- the confidence increases more slowly than it decreases.
- the confidence level is provided with a maximum value and a minimum value, and the confidence level does not increase after rising to a maximum value, and does not decrease after dropping to a minimum value.
- the beneficial effects of the invention are as follows: the computer scans all reported mail in the server, extracts the target mail whose number of reports is greater than or equal to the system default value, calculates the confidence of the target mail based on the reporters' confidence, determines from the calculation result whether the reported mail is spam or normal mail, and collects it into the corresponding database. Because the computer processes the user feedback directly on the basis of confidence, manual work intensity and workload are reduced while classification accuracy is ensured, and because the mail does not need to be read manually, the privacy of the user is protected.
- FIG. 1 is a schematic structural diagram of a first embodiment of an email collection and classification method according to the present invention
- FIG. 2 is a schematic structural diagram of a second embodiment of an email collection and classification method according to the present invention.
- FIG. 3 is a schematic structural diagram of a third embodiment of an email collection and classification method according to the present invention.
- FIG. 4 is a schematic structural diagram of a fourth embodiment of an email collection and classification method according to the present invention.
- FIG. 1 is a schematic structural diagram of a first embodiment of an email collection and classification method according to the present invention.
- the method includes the following steps: S100: scan all reported emails in the server and extract the target emails that have been reported at least n times.
- n is a default value, and the reported mail includes mail reported as normal mail and mail reported as spam.
- n can be set according to specific conditions; preferably, the default value n is 3.
- S102: determine, according to the calculation result, whether the target mail is spam or normal mail, and store it in a database.
- mail determined to be spam is stored in the spam database, and mail determined to be normal mail is stored in the normal mail database.
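Step S100 can be illustrated with a short Python sketch. The report record layout (mail ID, reporter ID, label) and the name `extract_target_mails` are assumptions made for illustration; the description only specifies that mail reported at least n times is extracted, with n preferably 3.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical report record: (mail_id, reporter_id, label),
# where label is "normal" or "spam".
Report = Tuple[str, str, str]


def extract_target_mails(reports: Iterable[Report], n: int = 3) -> List[str]:
    """Return the IDs of mails that were reported at least n times."""
    counts = Counter(mail_id for mail_id, _reporter, _label in reports)
    # Both "normal" and "spam" reports count towards the total.
    return [mail_id for mail_id, count in counts.items() if count >= n]


reports = [
    ("M", "A", "normal"), ("M", "B", "normal"),
    ("M", "C", "spam"), ("M", "D", "spam"),
    ("K", "A", "spam"), ("K", "C", "spam"),
]
print(extract_target_mails(reports))
```

In this sample, mail M has four reports and is extracted, while mail K has only two and is left for a later scan.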
- FIG. 2 is a schematic structural diagram of a second embodiment of an email collection and classification method according to the present invention.
- the method includes the following steps: S200: scan all reported emails in the server and extract the target emails that have been reported at least n times.
- n is a default value, and the reported mail includes mail reported as normal mail and mail reported as spam.
- n can be set according to specific conditions; preferably, the default value n is 3.
- S201: add the confidence levels of all reporters who report the target mail as normal mail to obtain the total normal mail confidence X.
- S202: add the confidence levels of all reporters who report the target mail as spam to obtain the total spam confidence Y.
- steps S201 and S202 have no fixed order and can be performed simultaneously.
- S204: determine, according to the calculation result, whether the target mail is spam or normal mail, and store it in a database.
- mail determined to be spam is stored in the spam database, and mail determined to be normal mail is stored in the normal mail database.
- reporters A and B report the M mail as normal mail
- reporters C and D report the M mail as spam
- reporter A's confidence is 5
- reporter B's confidence is 10
- reporter C's confidence is 3
- reporter D's confidence is 8
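For the reporter confidences listed above, the sums defined in steps S201 and S202 work out as follows (a worked calculation added for illustration):

```latex
\begin{aligned}
X &= 5 + 10 = 15 \\
Y &= 3 + 8 = 11 \\
|X - Y| &= |15 - 11| = 4
\end{aligned}
```

Here X exceeds Y, so the reporters who marked the mail as normal carry the larger total confidence.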
- FIG. 3 is a schematic structural diagram of a third embodiment of an email collection and classification method according to the present invention.
- the method includes: S300: scan all reported emails in the server and extract the target emails that have been reported at least n times.
- n is a default value, and the reported mail includes mail reported as normal mail and mail reported as spam.
- n can be set according to specific conditions; preferably, the default value n is 3.
- S301: add the confidence levels of all reporters who report the target mail as normal mail to obtain the total normal mail confidence X.
- S302: add the confidence levels of all reporters who report the target mail as spam to obtain the total spam confidence Y.
- steps S301 and S302 have no fixed order and can be performed simultaneously.
- the threshold T may be preset according to the specific situation; generally the threshold is higher than the initial confidence, and preferably the threshold T is 3.
- target mail that is not determined for the time being is kept in the temporary storage server and left for a subsequent scan.
- X and Y are compared.
- when X is greater than Y, the mail is determined to be normal mail, and when X is smaller than Y, the mail is determined to be spam.
- mail determined to be spam is stored in the spam database, and mail determined to be normal mail is stored in the normal mail database.
- with the threshold T preset to 3, |X - Y| < T, so the M mail is not determined for the time being; it remains in the temporary storage server and is left for a subsequent scan.
- the M mail is found to have been reported 4 times, which is greater than the default value of 3, and is therefore extracted as the target mail; reporters A and B report the M mail as normal mail, and reporters C and D report the M mail as spam.
- reporter B's confidence is 10
- reporter C's confidence is 3
- reporter D's confidence is 8
- the total normal mail confidence X = 5 + 10 = 15
- the confidence of reporter A is 3, the confidence of reporter B is 8, and the confidence of reporter C is 5; the reporter
- S400: scan all reported emails in the server and extract the target emails that have been reported at least n times.
- n is a default value, and the reported mail includes mail reported as normal mail and mail reported as spam.
- the default value n can be set according to the specific situation; preferably, the default value n is 3.
- S402: add the confidence levels of all reporters who report the target mail as normal mail to obtain the total normal mail confidence X.
- steps S401 and S402 have no fixed order and can be performed simultaneously.
- S403: add the confidence levels of all reporters who report the target mail as spam to obtain the total spam confidence Y.
- the threshold T may be preset according to the specific situation; generally the threshold T is higher than the initial confidence, and preferably the threshold T is 3.
- target mail that is not determined for the time being is kept in the temporary storage server and left for a subsequent scan.
- X and Y are compared.
- when X is greater than Y, the mail is determined to be normal mail, and when X is smaller than Y, the mail is determined to be spam.
- mail determined to be spam is stored in the spam database, and mail determined to be normal mail is stored in the normal mail database.
- the amounts by which the confidence increases and decreases may be preset as needed.
- the confidence increases by 1; it decreases by 10% or by 1, whichever is larger.
- the confidence increases more slowly than it decreases.
- the confidence level is provided with a maximum value and a minimum value; it does not increase after rising to the maximum value and does not decrease after dropping to the minimum value.
- the maximum and minimum values may be preset as needed; preferably, the maximum value is 50 and the minimum value is 0.
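A minimal Python sketch of the update rule, using the values stated above (an increase of 1, a decrease of 10% or 1, whichever is larger, and bounds of 0 and 50). The function name `update_confidence` and its parameter names are hypothetical and chosen only for illustration.

```python
def update_confidence(confidence: float,
                      agreed: bool,
                      min_conf: float = 0.0,
                      max_conf: float = 50.0) -> float:
    """Return a reporter's new confidence after a mail has been judged.

    agreed is True when the reporter's label matches the final result.
    """
    if agreed:
        # Reward: a fixed increase of 1.
        new_value = confidence + 1.0
    else:
        # Penalty: 10% of the current confidence or 1, whichever is larger.
        new_value = confidence - max(0.10 * confidence, 1.0)
    # Keep the value inside [min_conf, max_conf].
    return min(max(new_value, min_conf), max_conf)


print(update_confidence(8.0, agreed=False))
```

Because the penalty is never smaller than the reward and grows with the reporter's confidence, a highly trusted reporter loses standing faster than it was gained, which matches the statement that confidence rises more slowly than it falls.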
- with the threshold T preset to 3, |X - Y| < T, so the M mail is not determined for the time being; it remains in the temporary storage server and is left for a subsequent scan.
- the M mail is found to have been reported 4 times, which is greater than the default value of 3, and is therefore extracted as the target mail; reporters A and B report the M mail as normal mail, and reporters C and D report the M mail as spam. If reporter A is reporting for the first time, the initial confidence of reporter A is 1; the confidence of reporter B is 14, the confidence of reporter C is 3, and the confidence of reporter D is 8.
- reporters A and B are consistent with the judgment result, so their confidence is increased by 1: the confidence of reporter A becomes 2
- the confidence of reporter B becomes 15
- reporters C and D are inconsistent with the judgment result, so their confidence is reduced by 10% or by 1, whichever is larger. The original confidence of reporter C is 3; 10% of it is less than 1, so the confidence of reporter C decreases by 1 to 2,
- the original confidence of reporter D is 8; 10% of it is less than 1, so the confidence of reporter D decreases by 1 to 7.
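The worked example above can be reproduced end to end with the sketch below. It assumes the preferred values from the description (threshold T = 3, an increase of 1, a decrease of the larger of 10% or 1, bounds of 0 and 50); the data layout and the helper names `judge` and `updated` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# reporter -> (label the reporter gave, reporter's current confidence)
Reports = Dict[str, Tuple[str, float]]


def judge(reports: Reports, threshold: float = 3.0) -> Optional[str]:
    """Return "normal", "spam", or None if the mail stays undecided."""
    x = sum(conf for label, conf in reports.values() if label == "normal")
    y = sum(conf for label, conf in reports.values() if label == "spam")
    if abs(x - y) < threshold or x == y:
        return None
    return "normal" if x > y else "spam"


def updated(reports: Reports, verdict: str,
            min_conf: float = 0.0, max_conf: float = 50.0) -> Dict[str, float]:
    """Return each reporter's confidence after the verdict is known."""
    result = {}
    for reporter, (label, conf) in reports.items():
        if label == verdict:
            conf = conf + 1.0                      # agreed: +1
        else:
            conf = conf - max(0.10 * conf, 1.0)    # disagreed: larger of 10% or 1
        result[reporter] = min(max(conf, min_conf), max_conf)
    return result


reports = {
    "A": ("normal", 1.0),   # first-time reporter, initial confidence 1
    "B": ("normal", 14.0),
    "C": ("spam", 3.0),
    "D": ("spam", 8.0),
}
verdict = judge(reports)            # X = 15, Y = 11, |X - Y| = 4 >= 3
new_confidences = updated(reports, verdict)
print(verdict, new_confidences)
```

With these inputs the mail is judged normal, and the new confidences match the figures in the text: A becomes 2, B becomes 15, C becomes 2, and D becomes 7.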
- when the confidence of the reporters is updated, reporters C and D are consistent with the judgment result, so their confidence is increased by 1: the confidence of reporter C becomes 6 and the confidence of reporter D becomes 21;
- reporters A and B are inconsistent with the judgment result, so their confidence is reduced by 10% or by 1, whichever is larger. The original confidence of reporter A is 3; 10% of it is less than 1, so the confidence of reporter A decreases by 1 to 2
- the original confidence of reporter B is 15; 10% of it is greater than 1, so the confidence of reporter B decreases by 1.5 to 13.5.
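The size of the decrease in these two cases follows from taking the larger of 10% of the current confidence and 1. Written out, with the symbols c (current confidence) and Δ (decrease) introduced here only for illustration:

```latex
\begin{aligned}
\Delta(c) &= \max(0.1\,c,\ 1) \\
\Delta(3) &= \max(0.3,\ 1) = 1 &&\Rightarrow\quad 3 - 1 = 2 \\
\Delta(15) &= \max(1.5,\ 1) = 1.5 &&\Rightarrow\quad 15 - 1.5 = 13.5
\end{aligned}
```

The percentage term only takes over once the confidence exceeds 10, so reporters with low confidence always lose a full point.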
- the computer scans all reported mail in the server, extracts the target mail whose number of reports is greater than or equal to the system default value, calculates the confidence of the target mail based on the reporters' confidence, determines from the calculation result whether the reported mail is spam or normal mail, and collects it into the corresponding database. Because the user feedback is processed directly on the basis of confidence, manual work intensity and workload are reduced and classification accuracy is ensured; the mail does not need to be read manually, which protects the privacy of the user.
Abstract
The invention relates to an email collection and classification method. The method comprises: scanning all reported emails in a server; extracting target emails that have been reported at least n times, n being a default value, the reported emails including mail reported as normal mail and mail reported as spam; calculating the confidence of the target mail to obtain a calculation result; determining, according to the calculation result, whether the target mail is spam or normal mail; and storing the result in a database. The invention uses a computer to collect user feedback directly, without assigning dedicated staff to classify and label large volumes of email. It reduces manual workload, ensures classification accuracy, and protects user privacy because the mail does not have to be read manually.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210327624.5 | 2012-09-07 | ||
CN2012103276245A CN102880952A (zh) | 2012-09-07 | 2012-09-07 | 一种电子邮件收集分类方法 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014036788A1 (fr) | 2014-03-13 |
Family
ID=47482268
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2012/085097 WO2014036788A1 (fr) | 2012-09-07 | 2012-11-23 | Procédé de collecte et de classification de courrier électronique |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN102880952A (fr) |
WO (1) | WO2014036788A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103984703A (zh) * | 2014-04-22 | 2014-08-13 | 新浪网技术(中国)有限公司 | 邮件分类方法和装置 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104424280B (zh) * | 2013-08-30 | 2018-10-23 | 格博信息技术(苏州)有限公司 | 推送跟进方法及其系统 |
CN103970832A (zh) * | 2014-04-01 | 2014-08-06 | 百度在线网络技术(北京)有限公司 | 一种识别垃圾信息的方法与装置 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1999067731A1 (fr) * | 1998-06-23 | 1999-12-29 | Microsoft Corporation | Technique utilisant un classificateur probabiliste pour detecter les messages electroniques poubelle |
WO2005001733A1 (fr) * | 2003-06-30 | 2005-01-06 | Dong-June Seen | Systeme de gestion de messages electroniques et procede associe |
CN1719812A (zh) * | 2005-08-08 | 2006-01-11 | 北京中星微电子有限公司 | 垃圾电子邮件过滤方法和系统 |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7171450B2 (en) * | 2003-01-09 | 2007-01-30 | Microsoft Corporation | Framework to enable integration of anti-spam technologies |
CN101674264B (zh) * | 2009-10-20 | 2011-09-14 | 哈尔滨工程大学 | 基于用户关系挖掘及信誉评价的垃圾邮件检测装置及方法 |
-
2012
- 2012-09-07 CN CN2012103276245A patent/CN102880952A/zh active Pending
- 2012-11-23 WO PCT/CN2012/085097 patent/WO2014036788A1/fr active Application Filing
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103984703A (zh) * | 2014-04-22 | 2014-08-13 | 新浪网技术(中国)有限公司 | 邮件分类方法和装置 |
CN103984703B (zh) * | 2014-04-22 | 2017-04-12 | 新浪网技术(中国)有限公司 | 邮件分类方法和装置 |
Also Published As
Publication number | Publication date |
---|---|
CN102880952A (zh) | 2013-01-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 12884103 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 12884103 Country of ref document: EP Kind code of ref document: A1 |