WO2021136315A1 - Mail classification method and apparatus based on conjoint analysis of behavior structures and semantic content - Google Patents

Info

Publication number
WO2021136315A1
Authority
WO
WIPO (PCT)
Prior art keywords
email
feature
text
semantic
mail
Prior art date
Application number
PCT/CN2020/141120
Other languages
French (fr)
Chinese (zh)
Inventor
陈磊华
张琦
Original Assignee
论客科技(广州)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 论客科技(广州)有限公司
Publication of WO2021136315A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/42 Mailbox-related aspects, e.g. synchronisation of mailboxes

Definitions

  • the present invention relates to the field of mail classification, in particular to a mail classification method, device, terminal device and readable storage medium based on joint analysis of behavior structure and semantic content.
  • Mail classification technology based on the mail source filters spam by examining where a message was sent from. It mainly includes blacklist/whitelist filtering and reverse DNS lookup. Blacklist/whitelist filtering is fast, simple, and uses little memory: during the SMTP connection stage, the sender is checked against the blacklist and whitelist to keep spam from entering.
  • Reverse DNS lookup maps an IP address to its domain name, and can intercept spam sent from dynamically allocated IP addresses or from IP addresses with no registered domain name.
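The list check performed at the SMTP connection stage can be sketched as follows. This is a minimal illustration, not the method of any particular mail server: the list contents, the function name `check_sender_ip`, and the three return values are assumptions made for the example, and the addresses come from documentation-only ranges.

```python
import ipaddress

# Hypothetical lists; a real deployment would load and refresh these continuously.
WHITELIST = [ipaddress.ip_network("192.0.2.0/24")]
BLACKLIST = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.7/32"),
]


def check_sender_ip(ip: str) -> str:
    """Return 'accept', 'reject', or 'unknown' for a connecting sender IP."""
    addr = ipaddress.ip_address(ip)
    # The whitelist is consulted first so that trusted senders are never blocked.
    if any(addr in net for net in WHITELIST):
        return "accept"
    if any(addr in net for net in BLACKLIST):
        return "reject"
    # Senders on neither list are left to later filtering stages.
    return "unknown"
```

Each check is a simple membership test, which is why this kind of filtering is fast and light on memory; the cost lies in keeping the lists current.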
  • Rule-based mail classification technology extracts certain features of a mail and applies predefined filtering rules to determine its type. Each rule corresponds to a score; when a mail matches a rule, it is judged to be spam.
  • Because the rule-relevant features of mail keep changing, rule-based mail classification requires the rule base to be updated constantly, and the labor cost is relatively high.
  • the technical problem to be solved by the embodiments of the present invention is to provide a mail classification method, device, terminal device, and readable storage medium based on the joint analysis of behavior structure and semantic content, which can use the behavior structure features and text semantic features of emails to classify mail with high precision.
  • embodiments of the present invention provide a mail classification method based on joint analysis of behavior structure and semantic content, including:
  • the behavior structure information includes one or more of: the size of the email, the size of the email attachments, the number of attached pictures, the size of the attached pictures, the number of emails sent from the sender's IP per unit time, and the reputation of the email domain name;
  • the trained classifier is used to classify the email to be tested to obtain the category of the email to be tested.
  • the fasttext model is used to calculate the feature vector of each segmented word in the text content information, and all the calculated feature vectors are averaged to obtain the text semantic feature.
  • the classifier is an SVM classifier.
  • the present invention also provides a mail classification device based on joint analysis of behavior structure and semantic content, including:
  • the information extraction module is used to extract the behavior structure information and text content information of the email, wherein the behavior structure information includes one or more of: the size of the email, the size of the email attachments, the number of attached pictures, the size of the attached pictures, the number of emails sent from the sender's IP per unit time, and the reputation of the email domain name;
  • the feature calculation module is used to encode the behavior structure information by the feature vector calculation method to obtain the behavior structure features of the email, and at the same time to encode the text content information with the pre-trained fasttext model to obtain the text semantic features of the email;
  • the feature fusion module is used for normalizing the behavior structure feature and the text semantic feature respectively, and fusing the normalized behavior structure feature and the text semantic feature to obtain the email fusion feature;
  • the classifier training module is used to train the classifier by using the email fusion feature;
  • the mail classification module is used to classify the email to be tested by using the trained classifier to obtain the category of the email to be tested.
  • the fasttext model is used to calculate the feature vector of each segmented word in the text content information, and all the calculated feature vectors are averaged to obtain the text semantic feature.
  • the classifier is an SVM classifier.
  • the present invention also provides a mail classification terminal device based on joint analysis of behavior structure and semantic content, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor; the memory is coupled to the processor, and when the processor executes the computer program, any one of the mail classification methods based on joint analysis of behavior structure and semantic content is implemented.
  • the present invention also provides a computer-readable storage medium that stores a computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute any one of the mail classification methods based on joint analysis of behavior structure and semantic content.
  • the present invention has the following beneficial effects:
  • the embodiment of the present invention provides a mail classification method, device, terminal device, and readable storage medium based on joint analysis of behavior structure and semantic content.
  • the method includes: extracting the behavior structure information and text content information of an email;
  • encoding the behavior structure information by a feature vector calculation method to obtain the behavior structure features of the email;
  • encoding the text content information with a pre-trained fasttext model to obtain the text semantic features of the email;
  • normalizing the behavior structure features and the text semantic features separately, and fusing the normalized features to obtain the email fusion feature; training a classifier with the email fusion feature; and classifying the email to be tested with the trained classifier to obtain its category.
  • the present invention uses both the behavior structure information and the text content information of an email for classification. This overcomes the poor classification accuracy that existing email classification suffers from insufficient use of discriminative information, and thereby effectively improves the accuracy of email category determination.
  • FIG. 1 is a schematic flowchart of a mail classification method based on joint analysis of behavior structure and semantic content according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a calculation process of text semantic features provided by an embodiment of the present invention.
  • Fig. 3 is a schematic structural diagram of a mail classification device based on joint analysis of behavior structure and semantic content according to an embodiment of the present invention.
  • an embodiment of the present invention provides a mail classification method based on joint analysis of behavior structure and semantic content, including steps:
  • the behavior structure information includes one or more of: the size of the email, the size of the email attachments, the number of attached pictures, the size of the attached pictures, the number of emails sent from the sender's IP per unit time, and the reputation of the email domain name;
  • the pre-trained fasttext model is used to encode the text content information to obtain the text semantic feature of the email, specifically:
  • the fasttext model is used to calculate the feature vector of each segmented word in the text content information, and all the calculated feature vectors are averaged to obtain the text semantic feature.
  • the classifier is an SVM classifier.
  • S5. Use the trained classifier to classify the email to be tested to obtain the category of the email to be tested.
  • the present invention proposes an email classification method that uses both the behavior structure features and the text semantic features of emails to make the email features more discriminative, so that email classification accuracy is higher.
  • the embodiment of the present invention provides a mail classification method based on joint analysis of behavior structure and semantic content, which mainly includes five steps:
  • behavior structure information refers to the structural information of the email itself and some operational behavior information of the email sender, such as the email size, the email attachment size, the number of attached images, the size of the attached images, the number of emails sent from the sender's IP within a period of time, and the reputation of the email domain name.
  • RuleVector represents the behavior structure feature of the email, in which each dimension represents one feature: size is the email size, fngref the number of email fingerprints, attref the number of attachments, gifx the image length, gify the image width, and gift the number of images.
  • Sender_size_diff represents the difference between the size of the sender's letter and the average letter size
  • url_size_diff represents the difference between the letter's URL size and the average URL size
  • domail_today_cnt represents the number of letters sent from the domain name that day.
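One way to picture RuleVector is as a fixed-order list built from the extracted fields. The sketch below is illustrative only: the field names are the ones listed above, but their order, the helper name `build_rule_vector`, and the choice to default a missing field to 0.0 are assumptions, since the text does not specify them.

```python
from typing import Dict, List

# Field names as given in the text; the ordering is an assumption for illustration.
RULE_FIELDS = [
    "size", "fngref", "attref", "gifx", "gify", "gift",
    "Sender_size_diff", "url_size_diff", "domail_today_cnt",
]


def build_rule_vector(info: Dict[str, float]) -> List[float]:
    """Encode behavior structure information as a fixed-order numeric vector.

    Fields missing from `info` default to 0.0 (an assumption of this sketch).
    """
    return [float(info.get(name, 0.0)) for name in RULE_FIELDS]


# Example: a 20 KiB mail with two attachments, one of which is an image.
rule_vector = build_rule_vector({"size": 20480, "attref": 2, "gift": 1})
```

Keeping the field order fixed matters because every dimension of the vector must mean the same thing for every email seen by the classifier.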
  • the expression is as follows:
  • Text represents the text content of the email
  • ft represents the pre-trained fasttext model
  • WordVector represents the word vector of the email text
  • n is the number of word vectors
  • TextVector represents the final feature of the email text.
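The expression that these symbols belong to is not present in the text above. Read together with the averaging step described earlier, the definitions suggest the form below. It is a reconstruction under that assumption (in particular, that ft returns one vector per segmented word), not the patent's verbatim formula.

```latex
\mathrm{WordVector} = ft(\mathrm{Text}) = \{\mathrm{WordVector}_1, \dots, \mathrm{WordVector}_n\},
\qquad
\mathrm{TextVector} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{WordVector}_i
```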
  • RuleVector_N = Normalize(RuleVector);
  • TextVector_N = Normalize(TextVector);
  • Normalize represents the normalization operation
  • RuleVector_N represents the structural feature of the email behavior after normalization
  • TextVector_N represents the semantic feature of the email text after the normalization.
  • MailVector = Con(RuleVector_N, TextVector_N);
  • Con stands for the concatenation operation
  • MailVector stands for the fused feature representation of the email.
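The normalization and fusion steps can be sketched in a few lines. The text does not say which normalization is meant, so this example assumes per-vector L2 normalization; the function names `normalize` and `fuse` are likewise hypothetical.

```python
import math
from typing import List


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm (assumed normalization; zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse(rule_vector: List[float], text_vector: List[float]) -> List[float]:
    """Normalize each feature vector separately, then concatenate them."""
    return normalize(rule_vector) + normalize(text_vector)


# A 2-dimensional rule vector and a 2-dimensional text vector give a 4-dimensional fused vector.
mail_vector = fuse([3.0, 4.0], [0.0, 2.0])
```

Normalizing the two parts separately keeps either one from dominating the fused vector purely because of its scale. If the behavior features differ widely in magnitude (bytes versus counts), a per-feature scaling over the training set may be a better fit than the per-vector scaling assumed here.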
  • the classifier is a support vector machine (SVM) classifier.
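To make the training and classification steps concrete, here is a toy linear SVM trained by subgradient descent on the hinge loss. It is a stand-in written for illustration, not the classifier configuration used in the patent, which specifies neither kernel nor hyperparameters; the learning rate, regularization strength, and sample data are all assumptions. In practice a library implementation such as scikit-learn's `sklearn.svm.SVC` would normally be used instead.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lam: float = 0.01,
    lr: float = 0.1,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit weights w and bias b for labels in {+1, -1} (hinge loss with L2 penalty)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 penalty shrinks the weights on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Samples inside the margin also pull the boundary toward their side.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 (for example spam) or -1 (for example normal mail)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy fused feature vectors with made-up labels.
X_train = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y_train = [1, 1, -1, -1]
w, b = train_linear_svm(X_train, y_train)
```

Once trained, `predict(w, b, mail_vector)` plays the role of the final step: the fused feature of an email to be tested goes in, and its category comes out.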
  • the email category acquisition method integrates email behavior structure and email content semantic information, making full use of both the behavior structure features and the text semantic features to represent emails better. It overcomes the poor classification accuracy that existing methods suffer from insufficient discriminative information, and improves the accuracy of email category determination.
  • the present invention also provides a mail classification device based on joint analysis of behavior structure and semantic content, including:
  • the information extraction module 1 is used to extract the behavior structure information and text content information of the email, wherein the behavior structure information includes one or more of: the size of the email, the size of the email attachments, the number of attached images, the size of the attached images, the number of emails sent from the sender's IP per unit time, and the reputation of the email domain name;
  • the feature calculation module 2 is used to encode the behavior structure information by the feature vector calculation method to obtain the behavior structure features of the email, and at the same time to encode the text content information with the pre-trained fasttext model to obtain the text semantic features of the email;
  • the feature fusion module 3 is used for normalizing the behavior structure feature and the text semantic feature respectively, and fusing the normalized behavior structure feature and the text semantic feature to obtain the email fusion feature;
  • the classifier training module 4 is used to train the classifier by using the email fusion feature;
  • the mail classification module 5 is used to classify the email to be tested by using a trained classifier to obtain the category of the email to be tested.
  • the fasttext model is used to calculate the feature vector of each segmented word in the text content information, and all the calculated feature vectors are averaged to obtain the text semantic feature.
  • the classifier is an SVM classifier.
  • the embodiment of the present invention provides a mail classification device based on joint analysis of behavior structure and semantic content, which can implement the mail classification method based on joint analysis of behavior structure and semantic content provided by any method embodiment of the present invention.
  • the present invention also provides a mail classification terminal device based on joint analysis of behavior structure and semantic content, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor; the memory is coupled to the processor, and when the processor executes the computer program, any one of the mail classification methods based on joint analysis of behavior structure and semantic content is implemented.
  • the mail classification terminal device based on the joint analysis of behavior structure and semantic content may be computing devices such as desktop computers, notebooks, palmtop computers, and cloud servers.
  • the processor may be a central processing unit (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (ASIC), off-the-shelf Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc.
  • the processor is the control center of the mail classification terminal device based on the joint analysis of behavior structure and semantic content, and uses various interfaces and lines to connect the parts of the entire terminal device.
  • the memory may mainly include a program storage area and a data storage area.
  • the program storage area may store an operating system, an application program required by at least one function, and the like; the data storage area may store data created according to the use of a mobile phone.
  • the memory can include high-speed random access memory, and can also include non-volatile memory, such as hard disks, memory, plug-in hard disks, smart media cards (SMC), secure digital (SD) cards, flash cards, at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
  • the present invention also provides a computer-readable storage medium that stores a computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute any one of the mail classification methods based on joint analysis of behavior structure and semantic content.
  • the computer program can be stored in a computer-readable storage medium, and when the computer program is executed by a processor, it can implement the steps of the foregoing method embodiments.
  • the computer program includes computer program code
  • the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (Read-Only Memory, ROM) , Random Access Memory (RAM), electrical carrier signal, telecommunications signal, and software distribution media, etc.
  • the content contained in the computer-readable medium can be appropriately added or deleted according to the requirements of the legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to the legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
  • the device embodiments described above are only illustrative, and the units described as separate parts may or may not be physically separated, and the parts displayed as units may or may not be physically separate. Units can be located in one place or distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the connection relationship between the modules indicates that they have a communication connection between them, which can be specifically implemented as one or more communication buses or signal lines.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A mail classification method and apparatus based on conjoint analysis of behavior structures and semantic content, and a device and a readable storage medium. The method comprises: extracting behavior structure information and text content information of an e-mail (S1); by means of an eigenvector calculation mode, performing calculation to obtain a behavior structure feature of the e-mail, and by using a pre-trained fastText model, performing calculation to obtain a text semantic feature of the e-mail (S2); respectively performing normalization processing on the behavior structure feature and the text semantic feature, and performing feature fusion to obtain an e-mail fusion feature (S3); training a classifier by using the e-mail fusion feature (S4); and classifying, by using the trained classifier, an e-mail to be tested so as to acquire the category of the e-mail to be tested (S5). In the method, an e-mail is classified by simultaneously using behavior structure information and text content information of the e-mail, thereby effectively improving the category determination accuracy of the e-mail.

Description

Mail classification method and device based on joint analysis of behavior structure and semantic content
Technical field
The present invention relates to the field of mail classification, and in particular to a mail classification method, device, terminal device and readable storage medium based on joint analysis of behavior structure and semantic content.
Background
With the rapid development of Internet technology, e-mail has become one of the main channels of modern interpersonal communication because it transmits information quickly and conveniently, is easy to store, and is not easily lost. With its widespread use, however, e-mail has also become a carrier for commercial advertisements, malicious software and illegal files, which seriously affects people's lives and network security. How to filter out spam accurately has become an urgent problem.
There are three main existing email classification methods:
(1) Mail classification based on the mail source, which filters spam by examining where a message was sent from. It mainly includes blacklist/whitelist filtering and reverse DNS lookup. Blacklist/whitelist filtering is fast, simple, and uses little memory: during the SMTP connection stage, the sender is checked against the blacklist and whitelist to keep spam from entering. Reverse DNS lookup maps an IP address to its domain name, and can intercept spam sent from dynamically allocated IP addresses or from IP addresses with no registered domain name.
(2) Rule-based mail classification, which extracts certain features of a mail and applies predefined filtering rules to determine its type. Each rule corresponds to a score; when a mail matches a rule, it is judged to be spam.
(3) Classification based on statistics of mail content. The method learns from already-classified training and test samples, extracts the feature vectors and feature values of non-spam and spam, and then uses the learned model to compute the category of the samples in the test set.
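The rule-based approach in item (2) can be sketched as a small scoring table. Everything specific here is hypothetical: the three rules, their scores, and the names `score_mail` and `is_spam` are invented for illustration. The use of a summed score compared against `SPAM_THRESHOLD` is also an assumption, since the text only says that each rule carries a score.

```python
import re
from typing import Callable, Dict, List, Tuple

Mail = Dict[str, str]
Rule = Tuple[str, Callable[[Mail], bool], float]

# Hypothetical rules: (name, predicate, score).
RULES: List[Rule] = [
    ("subject_all_caps", lambda m: m["subject"].isupper(), 1.5),
    ("many_links", lambda m: len(re.findall(r"https?://", m["body"])) >= 3, 2.0),
    ("bait_words", lambda m: bool(re.search(r"\b(free|winner|prize)\b", m["body"], re.I)), 2.5),
]
SPAM_THRESHOLD = 3.0  # assumed cut-off, not taken from the text


def score_mail(mail: Mail) -> Tuple[float, List[str]]:
    """Return the total score and the names of the rules that matched."""
    hits = [(name, score) for name, pred, score in RULES if pred(mail)]
    return sum(score for _, score in hits), [name for name, _ in hits]


def is_spam(mail: Mail) -> bool:
    total, _ = score_mail(mail)
    return total >= SPAM_THRESHOLD
```

The maintenance burden is visible in the sketch: every new spam pattern means adding or retuning an entry in `RULES` by hand.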
The existing email classification techniques have the following disadvantages:
1. Source-based mail classification must look up the sending source of every mail and keep the blacklist and whitelist constantly updated, which is inefficient. It can also produce large-scale misjudgments.
2. Rule-based mail classification requires the rule base to be updated constantly because the rule-relevant features of mail keep changing, so the labor cost is relatively high.
3. Methods based on mail content statistics take the statistical information of the text into account but ignore its semantic information and other features, so the resulting mail features are weakly discriminative and classification accuracy is poor.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a mail classification method, device, terminal device, and readable storage medium based on the joint analysis of behavior structure and semantic content, which can use the behavior structure features and text semantic features of emails to classify mail with high precision.
In order to solve the above technical problem, an embodiment of the present invention provides a mail classification method based on joint analysis of behavior structure and semantic content, including:
extracting the behavior structure information and text content information of an email, wherein the behavior structure information includes one or more of: the size of the email, the size of the email attachments, the number of attached pictures, the size of the attached pictures, the number of emails sent from the sender's IP per unit time, and the reputation of the email domain name;
encoding the behavior structure information by a feature vector calculation method to obtain the behavior structure features of the email, and at the same time encoding the text content information with a pre-trained fasttext model to obtain the text semantic features of the email;
normalizing the behavior structure features and the text semantic features separately, and fusing the normalized behavior structure features and text semantic features to obtain the email fusion feature;
training a classifier with the email fusion feature;
classifying the email to be tested with the trained classifier to obtain the category of the email to be tested.
Further, encoding the text content information with the pre-trained fasttext model to obtain the text semantic features of the email specifically includes:
preprocessing the extracted text content information to convert it into the input format expected by the fasttext model;
using the fasttext model to calculate the feature vector of each segmented word in the text content information, and averaging all the calculated feature vectors to obtain the text semantic feature.
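The averaging step can be sketched without committing to a particular model. In the example below the pre-trained model is replaced by a toy lookup table, and the names `text_vector`, `TOY_VECTORS` and `toy_lookup` are hypothetical. With the fasttext Python package, the lookup would plausibly be the `get_word_vector` method of a loaded model, but that wiring is an assumption and is not shown.

```python
from typing import Callable, Dict, List, Sequence


def text_vector(
    tokens: Sequence[str],
    word_vec: Callable[[str], List[float]],
    dim: int,
) -> List[float]:
    """Average the word vectors of all tokens; an empty text yields a zero vector."""
    if not tokens:
        return [0.0] * dim
    total = [0.0] * dim
    for tok in tokens:
        vec = word_vec(tok)
        for i in range(dim):
            total[i] += vec[i]
    return [x / len(tokens) for x in total]


# Toy 2-dimensional stand-in for a pre-trained model's word vectors.
TOY_VECTORS: Dict[str, List[float]] = {"free": [1.0, 0.0], "prize": [0.0, 1.0]}


def toy_lookup(word: str) -> List[float]:
    # Unknown words map to a zero vector in this sketch.
    return TOY_VECTORS.get(word, [0.0, 0.0])


feature = text_vector(["free", "prize"], toy_lookup, dim=2)
```

Averaging gives every email a feature of the same dimension regardless of its length, which is what allows it to be fused with the fixed-length behavior structure feature.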
Further, the classifier is an SVM classifier.
In order to solve the same technical problem, the present invention also provides a mail classification device based on joint analysis of behavior structure and semantic content, including:
an information extraction module, used to extract the behavior structure information and text content information of an email, wherein the behavior structure information includes one or more of: the size of the email, the size of the email attachments, the number of attached pictures, the size of the attached pictures, the number of emails sent from the sender's IP per unit time, and the reputation of the email domain name;
a feature calculation module, used to encode the behavior structure information by a feature vector calculation method to obtain the behavior structure features of the email, and at the same time to encode the text content information with a pre-trained fasttext model to obtain the text semantic features of the email;
a feature fusion module, used to normalize the behavior structure features and the text semantic features separately, and to fuse the normalized behavior structure features and text semantic features to obtain the email fusion feature;
a classifier training module, used to train a classifier with the email fusion feature;
a mail classification module, used to classify the email to be tested with the trained classifier to obtain the category of the email to be tested.
Further, encoding the text content information with the pre-trained fasttext model to obtain the text semantic features of the email specifically includes:
preprocessing the extracted text content information to convert it into the input format expected by the fasttext model;
using the fasttext model to calculate the feature vector of each segmented word in the text content information, and averaging all the calculated feature vectors to obtain the text semantic feature.
Further, the classifier is an SVM classifier.
In order to solve the same technical problem, the present invention also provides a mail classification terminal device based on joint analysis of behavior structure and semantic content, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor; the memory is coupled to the processor, and when the processor executes the computer program, any one of the mail classification methods based on joint analysis of behavior structure and semantic content is implemented.
In order to solve the same technical problem, the present invention also provides a computer-readable storage medium that stores a computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute any one of the mail classification methods based on joint analysis of behavior structure and semantic content.
Compared with the prior art, the present invention has the following beneficial effects:
The embodiments of the present invention provide a mail classification method, device, terminal device, and readable storage medium based on joint analysis of behavior structure and semantic content. The method includes: extracting the behavior structure information and text content information of an email; encoding the behavior structure information by a feature vector calculation method to obtain the behavior structure features of the email, and at the same time encoding the text content information with a pre-trained fasttext model to obtain the text semantic features of the email; normalizing the behavior structure features and the text semantic features separately, and fusing the normalized features to obtain the email fusion feature; training a classifier with the email fusion feature; and classifying the email to be tested with the trained classifier to obtain its category. The present invention uses both the behavior structure information and the text content information of an email for classification. This overcomes the poor classification accuracy that existing email classification suffers from insufficient use of discriminative information, and thereby effectively improves the accuracy of email category determination.
Description of the drawings
FIG. 1 is a schematic flowchart of a mail classification method based on joint analysis of behavior structure and semantic content according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the calculation process of the text semantic feature according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a mail classification apparatus based on joint analysis of behavior structure and semantic content according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative work fall within the protection scope of the present invention.
Referring to FIG. 1, an embodiment of the present invention provides a mail classification method based on joint analysis of behavior structure and semantic content, including the following steps:
S1. Extract behavior structure information and text content information of an email, wherein the behavior structure information includes one or more of: email size, email attachment size, number of images in email attachments, size of images in email attachments, number of emails sent from the sender's IP per unit time, and reputation of the email domain name;
S2. Encode the behavior structure information by feature vector calculation to obtain the behavior structure feature of the email and, at the same time, encode the text content information with a pre-trained fasttext model to obtain the text semantic feature of the email;
In this embodiment of the present invention, further, encoding the text content information with the pre-trained fasttext model to obtain the text semantic feature of the email specifically includes:
Preprocessing the extracted text content information to convert it into the input format accepted by the fasttext model;
Using the fasttext model to calculate the feature vector of each segmented word in the text content information, and averaging all the calculated feature vectors to obtain the text semantic feature.
S3. Normalize the behavior structure feature and the text semantic feature separately, and fuse the normalized behavior structure feature and text semantic feature to obtain an email fusion feature;
S4. Train a classifier with the email fusion feature; in this embodiment of the present invention, further, the classifier is an SVM classifier.
S5. Classify an email to be tested with the trained classifier to obtain the category of the email to be tested.
It should be noted that, in view of the shortcomings of the existing email classification techniques described above, the present invention proposes an email classification method that uses both the behavior structure features and the text semantic features of emails to make the email features more discriminative, thereby achieving higher email classification accuracy.
The solution of the present invention is described in detail below with a specific example:
An embodiment of the present invention provides a mail classification method based on joint analysis of behavior structure and semantic content, which mainly includes five steps:
1. Extract the behavior structure information and text content information of the email. Behavior structure information refers to the structural information of the email itself and certain operational behavior information of the email sender, such as email size, attachment size, number of attached images, attached image size, number of emails sent from the sender's IP within a period of time, and reputation of the email domain name.
2. Encode the behavior structure information by calculating a feature vector, and encode the text content information of the email with a pre-trained fasttext model, to obtain the behavior structure feature and the text semantic feature of the email;
The behavior structure feature of the email is calculated as follows:
RuleVector[size]=m_nSize/1024;
RuleVector[fngref]=m_nFngRef;
RuleVector[attref]=m_nAttRef;
RuleVector[gifx]=m_nGifX/128;
RuleVector[gify]=m_nGifY/128;
RuleVector[gifcnt]=m_nGifCnt;
RuleVector[Sender_size_diff]=m_nSenderSizeDiff;
RuleVector[url_size_diff]=m_nURLSizeDiff;
RuleVector[domail_today_cnt]=m_nDomainTodayCnt;
Here, RuleVector denotes the behavior structure feature of the email, and each dimension represents one feature: size is the email size, fngref is the number of occurrences of the email fingerprint, attref is the number of attachments, gifx is the image length, gify is the image width, gifcnt is the number of images, Sender_size_diff is the difference between the size of the sender's email and the average email size, url_size_diff is the difference between the email's URL size and the average URL size, and domail_today_cnt is the number of emails sent from the domain name that day.
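The assignments above can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation: the `MailStats` container, its field names, and `build_rule_vector` are hypothetical, the units of the raw counters are not stated in the source, and floating-point division is assumed for the `/1024` and `/128` scalings.

```python
from dataclasses import dataclass
from typing import List

# Dimension order, following the listing in the description.
FEATURE_ORDER = [
    "size", "fngref", "attref", "gifx", "gify", "gifcnt",
    "Sender_size_diff", "url_size_diff", "domail_today_cnt",
]


@dataclass
class MailStats:
    """Hypothetical holder for the raw m_n* counters named in the description."""
    size: float                # m_nSize
    fingerprint_refs: int      # m_nFngRef
    attachment_refs: int       # m_nAttRef
    gif_x: float               # m_nGifX
    gif_y: float               # m_nGifY
    gif_count: int             # m_nGifCnt
    sender_size_diff: float    # m_nSenderSizeDiff
    url_size_diff: float       # m_nURLSizeDiff
    domain_today_count: int    # m_nDomainTodayCnt


def build_rule_vector(m: MailStats) -> List[float]:
    """Build RuleVector with the same scalings as the listed assignments."""
    features = {
        "size": m.size / 1024,        # assumption: true division, not integer division
        "fngref": m.fingerprint_refs,
        "attref": m.attachment_refs,
        "gifx": m.gif_x / 128,
        "gify": m.gif_y / 128,
        "gifcnt": m.gif_count,
        "Sender_size_diff": m.sender_size_diff,
        "url_size_diff": m.url_size_diff,
        "domail_today_cnt": m.domain_today_count,
    }
    return [float(features[name]) for name in FEATURE_ORDER]


example = MailStats(2048, 3, 1, 256, 128, 2, 10.0, -4.0, 57)
print(build_rule_vector(example))
```

Keeping the dimension order in one list makes it harder for training-time and classification-time vectors to drift out of alignment.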
Referring to FIG. 2, the text semantic feature of the email is calculated as follows:
Preprocess the extracted text content information to obtain a file in the fasttext model input format, then calculate the feature vector of each word in the email text content information, and average all the word feature vectors to obtain the final email text semantic feature TextVector. The expressions are as follows:
WordVector=ft(Text);
$$\mathrm{TextVector}=\frac{1}{n}\sum_{i=1}^{n}\mathrm{WordVector}_i$$
Here, Text is the text content of the email, ft is the pre-trained fasttext model, WordVector denotes the word vectors of the segmented email text (WordVector_i being the i-th of them), n is the number of word vectors, and TextVector is the final feature of the email text.
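As a minimal sketch of the averaging step, the following Python code computes TextVector as the element-wise mean of per-word vectors. The lookup table `TOY_VECTORS`, the dimension `DIM`, and the zero vector returned for unknown words are assumptions made for illustration; they stand in for a pre-trained fasttext model, whose real interface is not shown here.

```python
from typing import Dict, List

DIM = 3  # toy dimensionality; real word-vector models use far more dimensions

# Hypothetical word vectors standing in for a pre-trained model.
TOY_VECTORS: Dict[str, List[float]] = {
    "free": [1.0, 0.0, 2.0],
    "invoice": [0.0, 2.0, 0.0],
    "meeting": [2.0, 4.0, 1.0],
}


def ft(tokens: List[str]) -> List[List[float]]:
    """Stand-in for WordVector = ft(Text): one vector per segmented word."""
    # Simplification: unknown words map to the zero vector.
    return [TOY_VECTORS.get(token, [0.0] * DIM) for token in tokens]


def text_vector(tokens: List[str]) -> List[float]:
    """TextVector = (1/n) * sum of the n word vectors."""
    word_vectors = ft(tokens)
    n = len(word_vectors)
    if n == 0:
        return [0.0] * DIM  # empty text: fall back to a zero vector
    return [sum(vec[d] for vec in word_vectors) / n for d in range(DIM)]


print(text_vector(["free", "invoice", "meeting"]))
```

Averaging yields a fixed-length vector regardless of how many words the email contains, which is what allows it to be concatenated with the behavior structure feature later.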
3. Normalize the email behavior structure feature and the text semantic feature;
Normalize the email behavior structure feature:
RuleVector_N=Normalize(RuleVector);
Normalize the email text semantic feature:
TextVector_N=Normalize(TextVector);
Here, Normalize denotes the normalization operation, RuleVector_N denotes the normalized email behavior structure feature, and TextVector_N denotes the normalized email text semantic feature.
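The source does not say which operation Normalize denotes. The sketch below assumes L2 (unit-length) normalization, one common choice, and `l2_normalize` is a hypothetical helper name. Normalizing the two vectors separately helps keep large raw counts in RuleVector from dominating the text components once the two are fused.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale vec to unit Euclidean length (assumed meaning of Normalize)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # an all-zero vector is returned unchanged
    return [x / norm for x in vec]


rule_vector_n = l2_normalize([2.0, 3.0, 1.0, 2.0, 1.0, 2.0, 10.0, -4.0, 57.0])
text_vector_n = l2_normalize([1.0, 2.0, 1.0])
print(rule_vector_n)
print(text_vector_n)
```

Other schemes, such as per-dimension min-max scaling over the training set, would fit the description equally well; the choice is left open by the source.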
4. Jointly express the email behavior structure feature and the text semantic feature as the final feature representation of the email, and train the classifier;
MailVector=Con(RuleVector_N,TextVector_N);
Here, Con denotes the concatenation operation, and MailVector denotes the fused feature representation of the email.
5. Use the trained classifier to classify the emails in the test set and obtain their categories. Optionally, the classifier is a support vector machine (SVM) classifier.
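Steps 4 and 5 can be illustrated with a small, self-contained Python sketch. The source states only that the classifier may be an SVM; the linear kernel, the hinge-loss sub-gradient training loop, the hyperparameters, the +1/-1 labels, and the toy feature values below are all assumptions chosen to keep the example dependency-free. A production system would more likely rely on an established SVM library.

```python
from typing import List, Sequence, Tuple


def con(rule_vec_n: Sequence[float], text_vec_n: Sequence[float]) -> List[float]:
    """MailVector = Con(RuleVector_N, TextVector_N): plain concatenation."""
    return list(rule_vec_n) + list(text_vec_n)


def decision(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score of x under the linear model (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(xs: List[List[float]], ys: List[int], epochs: int = 200,
                     lr: float = 0.1, lam: float = 0.01) -> Tuple[List[float], float]:
    """Sub-gradient descent on the L2-regularized hinge loss; labels are +1 or -1."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            violated = y * decision(w, b, x) < 1.0      # inside the margin?
            w = [wi * (1.0 - lr * lam) for wi in w]     # regularization shrink
            if violated:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 for a fused email feature vector."""
    return 1 if decision(w, b, x) >= 0.0 else -1


# Made-up normalized features: label +1 for one category, -1 for the other.
train_x = [con([0.9, 0.1], [0.8, 0.2]), con([0.1, 0.9], [0.2, 0.8]),
           con([0.8, 0.2], [0.9, 0.1]), con([0.2, 0.8], [0.1, 0.9])]
train_y = [1, -1, 1, -1]
w, b = train_linear_svm(train_x, train_y)
print(classify(w, b, con([0.85, 0.15], [0.85, 0.15])))
```

Because fusion is a simple concatenation, the classifier sees behavior and text dimensions side by side and can weight them independently.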
It should be noted that the email category acquisition method provided by the embodiment of the present invention fuses the email behavior structure with the semantic information of the email content, making full use of the behavior structure features and text semantic features of the email to represent it better. This overcomes the poor classification accuracy of existing email classification caused by insufficient discriminative information and improves the accuracy of the email category acquisition method.
It should be noted that, for simplicity of description, the above method or process embodiments are expressed as a series of combined actions; however, those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in another order or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to FIG. 3, to solve the same technical problem, the present invention further provides a mail classification apparatus based on joint analysis of behavior structure and semantic content, including:
An information extraction module 1, configured to extract behavior structure information and text content information of an email, wherein the behavior structure information includes one or more of: email size, email attachment size, number of images in email attachments, size of images in email attachments, number of emails sent from the sender's IP per unit time, and reputation of the email domain name;
A feature calculation module 2, configured to encode the behavior structure information by feature vector calculation to obtain the behavior structure feature of the email and, at the same time, encode the text content information with a pre-trained fasttext model to obtain the text semantic feature of the email;
A feature fusion module 3, configured to normalize the behavior structure feature and the text semantic feature separately, and fuse the normalized behavior structure feature and text semantic feature to obtain an email fusion feature;
A classifier training module 4, configured to train a classifier with the email fusion feature;
A mail classification module 5, configured to classify an email to be tested with the trained classifier to obtain the category of the email to be tested.
Further, encoding the text content information with the pre-trained fasttext model to obtain the text semantic feature of the email specifically includes:
Preprocessing the extracted text content information to convert it into the input format accepted by the fasttext model;
Using the fasttext model to calculate the feature vector of each segmented word in the text content information, and averaging all the calculated feature vectors to obtain the text semantic feature.
Further, the classifier is an SVM classifier.
It can be understood that the above apparatus embodiment corresponds to the method embodiments of the present invention. The mail classification apparatus based on joint analysis of behavior structure and semantic content provided by the embodiment of the present invention can implement the mail classification method based on joint analysis of behavior structure and semantic content provided by any one of the method embodiments of the present invention.
To solve the same technical problem, the present invention further provides a mail classification terminal device based on joint analysis of behavior structure and semantic content, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. The memory is coupled to the processor, and when the processor executes the computer program, any one of the described mail classification methods based on joint analysis of behavior structure and semantic content is implemented.
The mail classification terminal device based on joint analysis of behavior structure and semantic content may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the mail classification terminal device based on joint analysis of behavior structure and semantic content, and connects the various parts of the entire terminal device through various interfaces and lines.
The memory may mainly include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function, and the like; the data storage area may store data created according to the use of the mobile phone, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
To solve the same technical problem, the present invention further provides a computer-readable storage medium storing a computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute any one of the described mail classification methods based on joint analysis of behavior structure and semantic content.
The computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content covered by the computer-readable medium may be expanded or narrowed as required by the legislation and patent practice of a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
It should be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative work.
The above are preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements are also regarded as falling within the protection scope of the present invention.

Claims (8)

  1. A mail classification method based on joint analysis of behavior structure and semantic content, characterized by comprising:
    extracting behavior structure information and text content information of an email, wherein the behavior structure information includes one or more of: email size, email attachment size, number of images in email attachments, size of images in email attachments, number of emails sent from the sender's IP per unit time, and reputation of the email domain name;
    encoding the behavior structure information by feature vector calculation to obtain the behavior structure feature of the email and, at the same time, encoding the text content information with a pre-trained fasttext model to obtain the text semantic feature of the email;
    normalizing the behavior structure feature and the text semantic feature separately, and fusing the normalized behavior structure feature and text semantic feature to obtain an email fusion feature;
    training a classifier with the email fusion feature;
    classifying an email to be tested with the trained classifier to obtain the category of the email to be tested.
  2. The mail classification method based on joint analysis of behavior structure and semantic content according to claim 1, characterized in that encoding the text content information with the pre-trained fasttext model to obtain the text semantic feature of the email specifically comprises:
    preprocessing the extracted text content information to convert it into the input format accepted by the fasttext model;
    using the fasttext model to calculate the feature vector of each segmented word in the text content information, and averaging all the calculated feature vectors to obtain the text semantic feature.
  3. The mail classification method based on joint analysis of behavior structure and semantic content according to claim 1, characterized in that the classifier is an SVM classifier.
  4. A mail classification apparatus based on joint analysis of behavior structure and semantic content, characterized by comprising:
    an information extraction module, configured to extract behavior structure information and text content information of an email, wherein the behavior structure information includes one or more of: email size, email attachment size, number of images in email attachments, size of images in email attachments, number of emails sent from the sender's IP per unit time, and reputation of the email domain name;
    a feature calculation module, configured to encode the behavior structure information by feature vector calculation to obtain the behavior structure feature of the email and, at the same time, encode the text content information with a pre-trained fasttext model to obtain the text semantic feature of the email;
    a feature fusion module, configured to normalize the behavior structure feature and the text semantic feature separately, and fuse the normalized behavior structure feature and text semantic feature to obtain an email fusion feature;
    a classifier training module, configured to train a classifier with the email fusion feature;
    a mail classification module, configured to classify an email to be tested with the trained classifier to obtain the category of the email to be tested.
  5. The mail classification apparatus based on joint analysis of behavior structure and semantic content according to claim 4, characterized in that encoding the text content information with the pre-trained fasttext model to obtain the text semantic feature of the email specifically comprises:
    preprocessing the extracted text content information to convert it into the input format accepted by the fasttext model;
    using the fasttext model to calculate the feature vector of each segmented word in the text content information, and averaging all the calculated feature vectors to obtain the text semantic feature.
  6. The mail classification apparatus based on joint analysis of behavior structure and semantic content according to claim 4, characterized in that the classifier is an SVM classifier.
  7. A mail classification terminal device based on joint analysis of behavior structure and semantic content, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the memory is coupled to the processor, and when the processor executes the computer program, the mail classification method based on joint analysis of behavior structure and semantic content according to any one of claims 1 to 3 is implemented.
  8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute the mail classification method based on joint analysis of behavior structure and semantic content according to any one of claims 1 to 3.
PCT/CN2020/141120 2019-12-31 2020-12-29 Mail classification method and apparatus based on conjoint analysis of behavior structures and semantic content WO2021136315A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911425936.8 2019-12-31
CN201911425936.8A CN111221970B (en) 2019-12-31 2019-12-31 Mail classification method and device based on behavior structure and semantic content joint analysis

Publications (1)

Publication Number Publication Date
WO2021136315A1 true WO2021136315A1 (en) 2021-07-08

Family

ID=70832800

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141120 WO2021136315A1 (en) 2019-12-31 2020-12-29 Mail classification method and apparatus based on conjoint analysis of behavior structures and semantic content

Country Status (2)

Country Link
CN (1) CN111221970B (en)
WO (1) WO2021136315A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114189390A (en) * 2021-12-31 2022-03-15 深信服科技股份有限公司 Domain name detection method, system, equipment and computer readable storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111221970B (en) * 2019-12-31 2022-06-07 论客科技(广州)有限公司 Mail classification method and device based on behavior structure and semantic content joint analysis
CN112733549B (en) * 2020-12-31 2024-03-01 厦门智融合科技有限公司 Patent value information analysis method and device based on multiple semantic fusion
CN113343682A (en) * 2021-06-07 2021-09-03 中国工商银行股份有限公司 Mail processing method, mail processing device, electronic device, and storage medium
CN117978499B (en) * 2024-02-01 2024-08-20 陕西瑞欣科技发展有限公司 System and method for identifying converged communication malicious data based on AI intelligence

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1742266A (en) * 2003-02-25 2006-03-01 微软公司 Adaptive junk message filtering system
CN107294834A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of method and apparatus for recognizing spam
CN108259415A (en) * 2016-12-28 2018-07-06 北京奇虎科技有限公司 A kind of method and device of mail-detection
CN110300054A (en) * 2019-07-03 2019-10-01 论客科技(广州)有限公司 The recognition methods of malice fishing mail and device
CN111221970A (en) * 2019-12-31 2020-06-02 论客科技(广州)有限公司 Mail classification method and device based on behavior structure and semantic content joint analysis

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102404249B (en) * 2011-11-18 2014-04-09 北京语言大学 Method and device for filtering junk emails based on coordinated training
CN109598517B (en) * 2017-09-29 2023-09-12 阿里巴巴集团控股有限公司 Commodity clearance processing, object processing and category prediction method and device thereof
CN108199951A (en) * 2018-01-04 2018-06-22 焦点科技股份有限公司 A kind of rubbish mail filtering method based on more algorithm fusion models
CN110569357A (en) * 2019-08-19 2019-12-13 论客科技(广州)有限公司 method and device for constructing mail classification model, terminal equipment and medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AERY M., CHAKRAVARTHY S.: "eMailSift: Email Classification Based on Structure and Content", DATA MINING, FIFTH IEEE INTERNATIONAL CONFERENCE ON HOUSTON, TX, USA 27-30 NOV. 2005, PISCATAWAY, NJ, USA,IEEE, 27 November 2005 (2005-11-27) - 30 November 2005 (2005-11-30), pages 18 - 25, XP010870424, ISBN: 978-0-7695-2278-4, DOI: 10.1109/ICDM.2005.58 *
KHONJI MAHMOUD, JONES ANDREW, IRAQI YOUSSEF: "A study of feature subset evaluators and feature subset searching methods for phishing classification", PROCEEDINGS OF THE 8TH ANNUAL COLLABORATION, ELECTRONIC MESSAGING, ANTI-ABUSE AND SPAM CONFERENCE ON, CEAS '11, ACM PRESS, NEW YORK, NEW YORK, USA, 1 January 2011 (2011-01-01) - 2 September 2011 (2011-09-02), New York, New York, USA, pages 135 - 144, XP055825593, ISBN: 978-1-4503-0788-8, DOI: 10.1145/2030376.2030392 *
TAN HAOWEN: "Research and Implementation on Detecting Phishing Emails", CHINA MASTER’S THESES FULL-TEXT DATABASE, 1 March 2018 (2018-03-01), XP055825783 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114189390A (en) * 2021-12-31 2022-03-15 深信服科技股份有限公司 Domain name detection method, system, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN111221970B (en) 2022-06-07
CN111221970A (en) 2020-06-02

Similar Documents

Publication Publication Date Title
WO2021136315A1 (en) Mail classification method and apparatus based on conjoint analysis of behavior structures and semantic content
US11023823B2 (en) Evaluating content for compliance with a content policy enforced by an online system using a machine learning model determining compliance with another content policy
US10432562B2 (en) Reducing photo-tagging spam
CN108874777B (en) Text anti-spam method and device
CN108874776B (en) Junk text recognition method and device
US8225413B1 (en) Detecting impersonation on a social network
US10594640B2 (en) Message classification
WO2019037195A1 (en) Method and device for identifying interest of user, and computer-readable storage medium
WO2019179010A1 (en) Data set acquisition method, classification method and device, apparatus, and storage medium
US20170289082A1 (en) Method and device for identifying spam mail
CN109460551A (en) Signing messages extracting method and device
CN108829661B (en) News subject name extraction method based on fuzzy matching
CN106815588B (en) Junk picture filtering method and device
CN108595704A (en) A kind of the emotion of news and classifying importance method based on soft disaggregated model
CN109858570A (en) Image classification method and system, computer equipment and medium
CN112686257A (en) Storefront character recognition method and system based on OCR
WO2022267454A1 (en) Method and apparatus for analyzing text, device and storage medium
CN117421640A (en) API asset identification method, device, equipment and storage medium
CN114840477B (en) File sensitivity index determining method based on cloud conference and related product
CN114579876A (en) False information detection method, device, equipment and medium
CN112905797A (en) Scenic spot multi-dimensional vulnerability assessment method based on MNLP
CN112632284A (en) Information extraction method and system for unlabeled text data set
CN106713108B (en) A kind of process for sorting mailings of combination customer relationship and bayesian theory
Yefimenko et al. Suhoniak II
Ma et al. Using SIFT for the filtering of Chinese text in image of multimedia message service

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20908886; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20908886; Country of ref document: EP; Kind code of ref document: A1)