CN114036264A

CN114036264A - E-mail author identity attribution identification method based on small sample learning

Info

Publication number: CN114036264A
Application number: CN202111383946.7A
Authority: CN
Inventors: 许益家; 方勇; 刘中临; 杨悦; 郭文博
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2021-11-19
Filing date: 2021-11-19
Publication date: 2022-02-11
Anticipated expiration: 2041-11-19
Also published as: CN114036264B

Abstract

The invention relates to a method for identifying the author of an e-mail, where the detected object is an e-mail. The method is applied to the field of e-mail owner identification. For the e-mail header, valuable header fields are screened out and the features of those fields are computed with statistical algorithms. For the e-mail body, a word-level text representation is constructed with the Word2Vec algorithm, a character-level text representation is constructed with a CNN, and the writing-habit features of the mail author are captured with a BiLSTM and a self-attention mechanism. The three kinds of features are fused into a new representation, a class vector for each author identity is constructed with a dynamic routing algorithm, the similarity between an anonymous mail and each author class vector is computed with a neural tensor network, and a label is assigned to the anonymous mail sample according to the similarity score, which identifies the author.

Description

E-mail author identity attribution identification method based on small sample learning

Technical Field

The invention relates to the field of mail identity recognition. A large number of e-mail data sets are collected, three groups of extracted features are fused using natural language processing methods and the BiLSTM algorithm, a detection model based on the Induction network is trained, and mail attribution recognition is achieved even when samples are insufficient.

Background

E-mail is a common means of communication in work and daily life, and it is often exploited by attackers. E-mail forensics faces many difficulties, one of which is determining the real author of an e-mail. An attacker can forge another person's identity by stealing the user's credentials or by directly deceiving the e-mail server. Security mechanisms that rely only on the mail transfer protocol cannot fully resist these attacks.

Currently, e-mail is an important carrier of advanced persistent threat (APT) attacks and phishing attacks. To make victims easier to deceive, attackers steal other people's accounts or masquerade as people the victims trust, such as colleagues and friends. Attackers typically use the following two means of attack: 1) the attacker steals the victim's login credentials through phishing mails or through mail-system vulnerabilities such as cross-site scripting (XSS), and then attacks again using the stolen credentials; 2) the attacker directly deceives the mail server through a sender spoofing attack and forges the sender of the mail as another person's e-mail address.

E-mail forensics makes it easier to resolve and adjudicate many kinds of cases, but the forensic process still faces several difficulties: 1) although domestic e-mail service providers all require users to perform real-name authentication, e-mail is a communication mode built on open protocols, so users can choose foreign e-mail service providers or self-built e-mail servers to send anonymous e-mails; 2) criminals may steal other people's mailboxes, making it difficult to determine the true sender during forensics; 3) the protocols used by e-mail still have security problems, and research on sender spoofing attacks, in which the attacker impersonates other people by attacking the e-mail server, has been presented at international conferences over the past three years. These difficulties interfere with e-mail forensics.

In existing research on e-mail authorship attribution, researchers generally extract features of the mail body, either manually or with deep learning algorithms, to represent the identity of the mail author; these features generally reflect the author's writing habits. After capturing the different features, a model is constructed using different algorithms. However, current research has some limitations:

First, researchers usually keep only the information in the body of the e-mail and ignore the features of the e-mail header.

Second, researchers generally construct their models under the assumption that sufficient data is available, ignoring the fact that in practice e-mail data is difficult to collect and the data sets available for building a model are small.

Disclosure of Invention

The invention discloses an e-mail author identity attribution identification method based on small sample learning. It aims to accurately identify the owner of an e-mail when samples are insufficient, and to attribute anonymous attack mails.

The invention proposes a small-sample e-mail attribution identification method that fuses the feature information of the mail header and the mail body into a comprehensive semantic representation and then identifies the mail owner with an Induction network. The main content of the invention is divided into three parts: (1) Mail encoding module: in feature selection, existing mail owner identification methods either do not consider header information or consider only word-level text features and so lose character-level features, which makes them perform poorly in actual detection. (2) Author identity representation module: the dynamic routing algorithm maps the samples into a space in which samples of the same class are represented by the same class vector, which addresses the sample projection problem in metric-based meta-learning. Because a capsule network can robustly learn invariances in part-whole relationships, the class vectors obtained by the dynamic routing algorithm are more effective than an ordinary weighted average of the samples. (3) Relation query module: to better measure the spatial distance between different samples, the invention adopts a neural tensor network that scores the correlation between a sample and each class vector, which enables accurate judgment of mail attribution. Compared with traditional e-mail identity verification methods, the invention introduces meta-learning methods from the small-sample field to address the low identification accuracy caused by the small size of the e-mail data sets obtainable in practical scenarios, and it supports the tracing of anonymous e-mail attacks.

Drawings

The objects, implementations, advantages and features of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings.

FIG. 1 is a method hierarchy framework of the invention;

FIG. 2 is a block diagram of a data preparation module;

FIG. 3 is a mail encoding module overall framework;

FIG. 4 is a CNN-based character level characterization flow diagram;

FIG. 5 is an author identity representation module flow diagram;

FIG. 6 is a relational query module flow diagram.

Detailed Description

The method identifies the author of an e-mail by small sample learning when samples are insufficient. First, a multi-feature fusion technique produces a fused representation of the mail header features, the word-level features and the character-level features. The three weighted groups of features jointly form the vector encoding of an e-mail, and each vector encoding represents one e-mail sample. Next, a dynamic routing algorithm constructs class vectors that represent the identity of each mail author. Finally, a neural tensor network model computes the similarity between the anonymous mail and each author vector, a label is assigned to the anonymous mail sample according to the similarity score, and the owner of the mail is determined.

The overall framework of the invention comprises four modules: a data preparation module, a mail encoding module, an author identity representation module and a relation query module, as shown in FIG. 1. The framework is hierarchical: data flows from bottom to top, and the output of each lower layer serves as the input of the layer above it.

Data preparation module. The data preparation module acquires and processes the original data set, organizes it, and passes it to the mail encoding module; the overall framework is shown in FIG. 2. Its functions include data deduplication, data cleaning and data segmentation. Data deduplication removes duplicate data. Data cleaning removes forwarded and quoted content present in the e-mail. When an e-mail is forwarded or quotes an earlier message, its body may contain text written by other people or text the author wrote earlier. To keep this content from affecting the author's current writing style features, it must be deleted. After the e-mail is cleaned, the body text is segmented both by character and by word.
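The three data preparation steps can be sketched as follows. This is a minimal illustration, not the procedure used in the invention: the quote and forward markers in `QUOTE_MARKERS`, the function names, and the word-splitting rule are all assumptions, since the description does not specify how quoted content is detected or how words are tokenized.

```python
import re

# Hypothetical markers of quoted or forwarded content; real mail clients vary.
QUOTE_MARKERS = (
    re.compile(r"^-+\s*Original Message\s*-+$", re.IGNORECASE),
    re.compile(r"^-+\s*Forwarded by .*$", re.IGNORECASE),
    re.compile(r"^On .+ wrote:$"),
)


def clean_body(body: str) -> str:
    """Keep only text written for this mail: skip '>' quoted lines and
    drop everything from the first forward/reply marker onward."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if any(marker.match(stripped) for marker in QUOTE_MARKERS):
            break
        if stripped.startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


def deduplicate(bodies):
    """Remove exact duplicate bodies while preserving the original order."""
    seen = set()
    unique = []
    for body in bodies:
        if body not in seen:
            seen.add(body)
            unique.append(body)
    return unique


def segment(body: str):
    """Split a cleaned body into a character sequence and a word sequence.
    Case is preserved here; lower-casing is left to the word-level encoder."""
    chars = list(body)
    words = re.findall(r"\w+|[^\w\s]", body)
    return chars, words
```

In this sketch punctuation marks are kept as separate tokens, on the assumption that punctuation habits carry authorship signal; a different tokenizer could be substituted without changing the overall flow.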

Mail encoding module. For an e-mail to be classified, the header features and the body features are extracted first. To obtain the header features, the valuable header fields are screened and the features of each field are mined. To obtain the body features, the model captures the author's writing-habit features at the word level and the character level of the body with a BiLSTM and a self-attention mechanism; the overall flow is shown in FIG. 3. After the body features are captured, the mapping between the body features and the e-mail author is strengthened through a nonlinear activation function. Finally, the header features and the body features are fused, and the multi-feature fused sample encoding is output through Softmax. Header feature selection and body feature embedding are described in detail below.

Header feature selection. When sending an e-mail, the sender fills in required and optional information in a WebMail page or a mail client, including the recipient, the subject and the e-mail content. Of this information, the recipient and the subject belong to the e-mail header, and the e-mail content belongs to the e-mail body. Mail headers differ between service providers, so we selected five common header fields:

1. Date: the sending time of the e-mail, normalized to the UTC time zone and used as the Date feature;

2. From: the sender address; each e-mail has only one sender;

3. To: the recipient addresses, of which there may be one or more; multiple recipients are typically separated by semicolons;

4. Subject: the subject of the e-mail, which may be long or short, follows no strict writing rules, and is influenced by personal writing habits. The Subject features are: the number of words in the subject, the number of characters in the subject, the number of capital letters in the subject, the number of digits in the subject, and the number of punctuation marks in the subject;

5. Cc: carbon copy, meaning that when the e-mail is sent, the same content is also sent to the mailboxes of the Cc recipients; multiple Cc recipients are separated by semicolons. The Cc features are: the number of mailboxes in the recipient field, the number of mailboxes in the Cc field, the number of recipients whose domain name is the same as that of the sender's mailbox, and the number of Cc recipients whose domain name is the same as that of the sender's mailbox.
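The header statistics listed above can be computed with the Python standard library. The sketch below is illustrative only: the function names and dictionary keys are hypothetical, addresses are split naively on semicolons and commas (so display names containing commas are not handled), and the UTC timestamp is returned as an ISO string because the description does not say how the Date feature is encoded.

```python
import re
import string
from datetime import timezone
from email.utils import parseaddr, parsedate_to_datetime


def date_feature(date_header: str) -> str:
    """Normalize the Date header to UTC."""
    return parsedate_to_datetime(date_header).astimezone(timezone.utc).isoformat()


def subject_features(subject: str) -> dict:
    """Counts of words, characters, capitals, digits and punctuation."""
    return {
        "words": len(subject.split()),
        "chars": len(subject),
        "uppercase": sum(c.isupper() for c in subject),
        "digits": sum(c.isdigit() for c in subject),
        "punctuation": sum(c in string.punctuation for c in subject),
    }


def _addresses(field: str) -> list:
    """Split a To/Cc field on ';' or ',' and keep the bare addresses."""
    parts = (parseaddr(p.strip())[1] for p in re.split(r"[;,]", field))
    return [addr for addr in parts if addr]


def _domain(address: str) -> str:
    return address.rsplit("@", 1)[-1].lower()


def recipient_features(sender: str, to: str, cc: str) -> dict:
    """Recipient and Cc counts, plus how many share the sender's domain."""
    sender_domain = _domain(parseaddr(sender)[1])
    to_addrs = _addresses(to)
    cc_addrs = _addresses(cc)
    return {
        "to_count": len(to_addrs),
        "cc_count": len(cc_addrs),
        "to_same_domain": sum(_domain(a) == sender_domain for a in to_addrs),
        "cc_same_domain": sum(_domain(a) == sender_domain for a in cc_addrs),
    }
```

For example, `date_feature("Fri, 19 Nov 2021 10:30:00 +0800")` yields `"2021-11-19T02:30:00+00:00"`, so mails sent from different time zones become directly comparable.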

Body feature embedding. The e-mail body contains rich semantic information. To extract this feature information more fully, we perform character-level embedding and word-level embedding on the body separately. In the character-level embedding process, the text is divided into individual characters and then encoded using Word2Vec. In the word-level embedding process, the letters of each word are first converted to lower case and the words are then mapped into the vector space; lower-casing has little influence on semantics, so the conversion is acceptable from a semantic point of view. In authorship attribution, however, converting words to lower case loses features of the author's writing habits, while preserving upper-case letters makes the word vector space explode: for the word "email", for example, 2^5 = 32 word vector representations could exist. Therefore, when the word-level representation of the e-mail body is constructed with Word2Vec, the letters in the words are unified to lower case, and to compensate for the resulting information loss, a character-level representation of the e-mail body is constructed with a CNN; the representation flow is shown in FIG. 4.
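The vocabulary explosion mentioned above, and the case-preserving character encoding that compensates for lower-casing, can be illustrated with a short sketch. The function names and the alphabet are hypothetical; one-hot character vectors are used here because claim 2 names One-hot vectorization as the input to the convolutional network, and the CNN itself is not shown.

```python
from itertools import product


def case_variants(word: str) -> list:
    """Enumerate every upper/lower-case spelling of a word."""
    options = [(c.lower(), c.upper()) if c.isalpha() else (c,) for c in word]
    return ["".join(choice) for choice in product(*options)]


def one_hot_chars(text: str, alphabet: str) -> list:
    """Encode each character as a one-hot row over a case-sensitive alphabet.
    Characters outside the alphabet become all-zero rows."""
    index = {c: i for i, c in enumerate(alphabet)}
    rows = []
    for c in text:
        row = [0] * len(alphabet)
        if c in index:
            row[index[c]] = 1
        rows.append(row)
    return rows


variants = case_variants("email")
print(len(variants))  # 32 spellings for a five-letter word
```

Because the character alphabet keeps upper-case and lower-case letters as separate symbols, capitalization habits survive in the character-level input even though the word-level vocabulary is lower-cased.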

Author identity representation module. The Induction network is a metric-based meta-learning algorithm. Its core idea is to map sample vectors into a space and then select a suitable metric to compute the spatial distance between samples: the smaller the distance, the higher the similarity and the more likely the samples belong to the same class. How to map samples into the space and construct a single vector from the samples of one class is therefore critical to the representation. To accomplish this, the Induction network builds class vector representations of the samples with a dynamic routing algorithm; the construction process is shown in FIG. 5. In the author identity representation module, the e-mail sample encodings are mapped into the space by the dynamic routing algorithm, and samples of the same class are represented by the same class vector.
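The description does not give the routing equations, so the following is only a sketch of capsule-style dynamic routing under stated assumptions: each sample encoding is squashed into a "vote", coupling coefficients are a softmax over routing logits, and the logits are increased by the agreement (dot product) between each vote and the current class vector. The learned transformation that would normally precede the squash is omitted, and the function names and the default of three iterations are hypothetical.

```python
import math


def squash(v):
    """Capsule squash: keep the direction, shrink the norm into [0, 1)."""
    norm_sq = sum(x * x for x in v)
    if norm_sq == 0.0:
        return [0.0] * len(v)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in v]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def class_vector(samples, iterations=3):
    """Aggregate the sample encodings of one author into one class vector."""
    votes = [squash(s) for s in samples]
    logits = [0.0] * len(votes)  # routing logits, one per sample
    dim = len(votes[0])
    c = [0.0] * dim
    for _ in range(iterations):
        coupling = softmax(logits)
        weighted = [
            sum(d * v[k] for d, v in zip(coupling, votes)) for k in range(dim)
        ]
        c = squash(weighted)
        # Samples that agree with the current class vector gain weight.
        logits = [
            b + sum(vk * ck for vk, ck in zip(v, c))
            for b, v in zip(logits, votes)
        ]
    return c
```

Unlike a plain average, this update down-weights samples that disagree with the emerging class vector, which is one way to read the claim that routed class vectors are more effective than an ordinary weighted average.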

Relation query module. After the class representation of each author has been constructed, the class of an unknown sample must be determined, as shown in FIG. 6. The relation query module uses a neural tensor network to compute a "spatial distance" between the sample to be queried and each author class vector representation, and uses it as the basis for judging similarity: if the sample to be queried matches the class, the similarity is 1; otherwise it is 0.
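One common form of a neural tensor layer scores a class vector c against a query encoding q with several bilinear slices, v_k = relu(c^T M_k q), followed by a sigmoid over a weighted sum of the slice outputs. The sketch below assumes that form; the activation functions, the parameter names (`slices`, `w`, `b`) and the example values are illustrative and not taken from the description. In practice the parameters would be learned so that matching pairs score near 1 and non-matching pairs near 0.

```python
import math


def bilinear(c, M, q):
    """Compute c^T M q for one tensor slice M."""
    return sum(
        c[i] * M[i][j] * q[j] for i in range(len(c)) for j in range(len(q))
    )


def relation_score(c, q, slices, w, b):
    """Score in (0, 1) relating class vector c to query encoding q."""
    v = [max(0.0, bilinear(c, M, q)) for M in slices]  # ReLU per slice
    z = sum(wk * vk for wk, vk in zip(w, v)) + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid


def attribute(q, class_vectors, slices, w, b):
    """Return the best-scoring author and the score for every author."""
    scores = {
        author: relation_score(c, q, slices, w, b)
        for author, c in class_vectors.items()
    }
    return max(scores, key=scores.get), scores


# Toy parameters: a single identity slice, chosen by hand for illustration.
slices = [[[1.0, 0.0], [0.0, 1.0]]]
authors = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
best, scores = attribute([0.9, 0.1], authors, slices, w=[4.0], b=-2.0)
print(best)  # alice
```

The anonymous mail is assigned the label of the author whose class vector receives the highest score.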

As described above, the invention successfully realizes the identification of the identity attribution of the e-mail under the condition of a small sample, and has higher accuracy and practicability. Compared with the prior detection method, the invention has the following innovations:

First, the writing-habit features of the author are extracted from the e-mail body: a word-level body representation and a character-level body representation are constructed, the author's writing-habit features are captured with a BiLSTM and a self-attention mechanism, and these body features are then fused with the e-mail header features to construct the comprehensive semantic information of the e-mail.

Second, e-mail identity attribution identification based on small sample learning is adopted, so that the mail owner can be accurately identified even when samples are insufficient.

Although the preferred embodiments of the present invention have been described for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims.

Claims

1. An e-mail identity attribution identification method based on small sample learning is characterized by comprising the following steps:

A. in the mail coding module, in order to more comprehensively extract the characteristics representing the identity of the author of the mail from the mail, the invention extracts the characteristics and information of the head and the text of the mail and fuses the characteristics and the information to finally generate a new representation of the mail;

B. in an author identity representation module, aggregating samples of the same category by using a dynamic routing algorithm, and generating a category vector representation;

C. in the relation query module, calculating the similarity between the sample to be detected and the different class vectors through a neural tensor network model, so as to judge the class of the sample to be detected and finally determine the identity of the author of the mail.

2. The method for identifying the identity attribution of the e-mail based on small sample learning as claimed in claim 1, wherein in the mail encoding process, the header features of the mail are extracted first, comprising five sender-controllable header fields, namely Date, From, To, Subject and Cc, and the statistical features of each field; then feature embedding is carried out at the word level of the mail body: the body of the e-mail is segmented into words, a vocabulary is constructed from the segmented words, and a word-level vector representation of the body is generated through the Word2Vec algorithm; at the same time, character-level features of the mail body are embedded: the e-mail is vectorized by One-hot encoding, and a character-level vector representation of the mail body is output by a convolutional neural network; then the writing style features of the author are extracted from the character-level and word-level body features by a BiLSTM algorithm and a self-attention mechanism; and finally the header and body features of the mail are concatenated, a fused representation is produced by a weight network, and a new representation of the mail is output, completing the mail feature fusion.

3. The method as claimed in claim 1, wherein in the relation query process, the encoding of the mail to be queried is input to the detection model, then the "spatial distance" between the sample to be queried and each author class vector representation is calculated as the similarity through the neural tensor network; if the similarity is 1, the query sample matches the class, otherwise it does not match; and finally the attribution class of the mail is obtained, thereby completing the author identification.