CN112380271A

CN112380271A - Data discrimination and analysis method

Info

Publication number: CN112380271A
Application number: CN202011187149.7A
Authority: CN
Inventors: 邬玉良
Original assignee: Zhongke Hot Standby Beijing Cloud Computing Technology Co ltd
Current assignee: Zhongke Hot Standby Beijing Cloud Computing Technology Co ltd
Priority date: 2020-10-29
Filing date: 2020-10-29
Publication date: 2021-02-19

Abstract

The invention provides a data discrimination and analysis method for junk mail, based on comparing the communication behavior of junk mail with that of normal mail. Mail data is first collected; whether sufficiently representative data is collected bears on the final recognition performance. The data is then preprocessed: mail header information is extracted from the original mail, and behavior characteristics that distinguish junk mail are selected and processed. Finally, a data mining method is applied to the extracted set of behavior characteristic vectors for pattern mining and prediction.

Description

Data discrimination and analysis method

Technical Field

The invention relates to a data discrimination and analysis method, and belongs to the field of internet data security processing.

Background

With the continuing expansion of the internet and the growth in the number of internet users, electronic mail has become an increasingly important means of communication in social life because it is convenient, fast, and cheap. However, mailboxes often receive mail from unknown people or addresses, and such mail may even account for most of the mail received. It can also carry viruses that infect a computer or even paralyze it. Spam has become a security problem that must be faced.

Normal mail is sent using the standard SMTP protocol, in the manner the protocol specifies. The SMTP workflow is simple and easy to imitate, and it has security weaknesses: a mail server can be deceived by forging a legitimate server identity, a legitimate sender address, and the like. The key to correctly identifying junk mail is therefore to correctly recognize the communication information produced while the mail is generated and sent, by comparing the communication characteristics of junk mail with those of normal mail.

Anti-spam processing consumes a relatively large share of system resources. A mail security product that copes with ordinary load may therefore exhaust its system resources during a mail virus outbreak or a spam flood, and a product that does not protect itself well may even crash.

Data discrimination is concerned with the regularities exhibited during program execution or user operation, which generally reflect the identity and habits of a user. Numerous experiments have shown that both program execution and user behavior depend closely on system characteristics. Junk mail exhibits characteristic behavior during transmission, such as a high sending frequency, continuous sending within a short time, and dynamic IP addresses. Data discrimination analysis can judge and handle mail showing these typical behaviors in real time, during the mail transfer agent communication stage and before the mail is placed in the mail queue. The mail therefore does not need to be scanned in full, which speeds up spam filtering at the gateway, reduces the load on network resources and the network traffic, and increases spam-processing capacity, while avoiding the legal risk of infringing privacy rights.
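The "frequent sending" and "continuous sending in a short time" behaviors mentioned above can be illustrated with a sliding-window count per sending IP. This is only a minimal sketch: the function name `burst_counts`, the 60-second window, the threshold of 4, and the sample log are all assumptions made for illustration and are not taken from the patent.

```python
from collections import defaultdict, deque

def burst_counts(events, window_seconds=60):
    """For each (timestamp, sender_ip) event, count how many messages the
    same IP sent inside the trailing window, including the current one."""
    recent = defaultdict(deque)  # sender_ip -> timestamps still inside the window
    counts = []
    for ts, ip in sorted(events):
        q = recent[ip]
        q.append(ts)
        # Drop timestamps that have fallen out of the trailing window.
        while q and ts - q[0] >= window_seconds:
            q.popleft()
        counts.append((ts, ip, len(q)))
    return counts

# Hypothetical connection log: (seconds since start, sender IP).
log = [(0, "203.0.113.5"), (10, "203.0.113.5"), (20, "203.0.113.5"),
       (25, "198.51.100.7"), (30, "203.0.113.5"), (400, "198.51.100.7")]
result = burst_counts(log)

# Assumed rule: 4 or more messages within one window marks an IP as suspicious.
suspicious = sorted({ip for _, ip, n in result if n >= 4})
print(suspicious)
```

Because the count is available as soon as a connection arrives, a value like this could in principle be checked before the message is queued, which is the point the paragraph makes about judging mail in the communication stage.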

In the prior art, the anti-spam field has proposed filtering spam by data discrimination analysis. The main advantages of data discrimination are as follows.

(1) High processing efficiency. The text content is not considered and only the mail header information is examined, so processing is fast and efficient.

(2) A durable recognition effect. The header information has a fixed format and cannot be changed frequently, which keeps the recognition effect durable.

(3) Bandwidth savings. The discrimination analysis method can identify and intercept mail during the session connection stage, which effectively reduces resource consumption.

(4) High security and confidentiality. Content filtering analyzes the mail text and offers no guarantee of the security or confidentiality of that text. The data discrimination and analysis method focuses on the mail header information, so user privacy is protected.

In addition, the data discrimination analysis model can also be used for offline statistics, analysis, and calculation based on large volumes of junk mail logs and archived data. Data discrimination analysis has great development potential and is one of the development directions of anti-spam methods.

Disclosure of Invention

The invention aims to provide a data discrimination and analysis method that can judge, during real-time communication, whether a mail is junk mail.

The invention relates to a data discrimination analysis method, which comprises the following steps:

First, data is collected. Pattern recognition and classification usually comprise the following steps: data acquisition, data preprocessing, and data mining. Behavior collection for mail is the process of collecting the relevant data from normal mail and junk mail; whether enough representative sample data is collected bears on the final recognition performance.

Second, the behavior characteristic data is preprocessed. The data is first cleaned, that is, vacant values are filled in and isolated points (outliers) are identified and deleted, because garbage data can confuse the behavior patterns and make the output unreliable. Data integration is then performed: data from multiple data sources is combined and stored in a consistent data store. Finally, the data is converted into a form suitable for mining. This involves extracting mail header information from the original mail data, selecting behavior characteristics that distinguish junk mail, and vectorizing the characteristic data.
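The cleaning step can be sketched as follows. The patent does not say how vacant values are filled or how isolated points are detected, so the choices here (column median for filling, a median-absolute-deviation rule for outliers, a threshold of 5) are assumptions, as are the function names and the sample feature columns.

```python
from statistics import median

def fill_missing(rows):
    """Replace None in each column with the median of that column's known values."""
    cols = list(zip(*rows))
    medians = [median(v for v in col if v is not None) for col in cols]
    return [[m if v is None else v for v, m in zip(row, medians)] for row in rows]

def drop_isolated(rows, threshold=5.0):
    """Drop rows holding any value farther than threshold * MAD from its column median."""
    cols = list(zip(*rows))
    meds = [median(col) for col in cols]
    # Fall back to 1.0 when the MAD is zero so the rule does not reject everything.
    mads = [median(abs(v - m) for v in col) or 1.0 for col, m in zip(cols, meds)]
    return [row for row in rows
            if all(abs(v - m) <= threshold * d for v, m, d in zip(row, meds, mads))]

# Hypothetical feature rows: [received_hops, recipient_count]; None marks a vacant value.
# The value 500 stands in for a logging glitch in this toy data.
raw = [[2, 1], [3, None], [1, 2], [2, 1], [500, 1], [None, 3]]
filled = fill_missing(raw)
cleaned = drop_isolated(filled)
print(cleaned)
```

A median-based rule is used here only because it stays stable on very small samples; any other outlier test could be substituted without changing the overall flow.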

Finally, pattern mining is carried out on the extracted set of behavior characteristic vectors using a data mining method.
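The patent does not name a specific data mining method. Purely as an illustration of learning a pattern from labeled behavior vectors and then predicting a class, the sketch below uses a nearest-centroid classifier; the algorithm, the function names, the three feature columns, and the training data are all assumptions.

```python
from math import dist

def fit_centroids(vectors, labels):
    """Return {label: mean vector} for each class in the training data."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, vec):
    """Assign vec to the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))

# Hypothetical vectors: [mails_per_minute, distinct_recipients, dynamic_ip_flag]
train = [[1, 1, 0], [2, 1, 0], [1, 2, 0], [40, 30, 1], [55, 45, 1], [60, 50, 1]]
labels = ["normal", "normal", "normal", "junk", "junk", "junk"]
model = fit_centroids(train, labels)

print(predict(model, [50, 40, 1]))
print(predict(model, [2, 2, 0]))
```

In practice the features would likely need scaling so that one large-valued column does not dominate the distance, and a real system might use a different classifier altogether.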

Drawings

FIG. 1 is a flow chart of the basic principle of a data discrimination analysis method

Detailed Description

In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawing.

The invention provides a data discrimination and analysis method that analyzes mail behavior, extracts mail behavior characteristics, and finally judges, during real-time communication, whether a mail is junk mail.

As shown in FIG. 1, data is first collected from the mail to build a mail data set. The data is then preprocessed: mail header information is extracted from the original mail data, behavior characteristics that distinguish junk mail are selected, and the characteristic data is vectorized. A data mining method is then applied to the extracted set of behavior characteristic vectors for prediction.
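The header extraction and vectorization step might look like the sketch below, which uses Python's standard `email` package and never reads the message body. The four features chosen (number of `Received` hops, number of recipients, a `Reply-To` domain mismatch flag, and a missing `Message-ID` flag) are hypothetical examples; the patent does not list which header characteristics it selects. The sample message and the name `header_vector` are likewise illustrative.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr

RAW = """\
Received: from mail.example.org (203.0.113.5) by mx.example.net; Mon, 02 Nov 2020 10:00:00 +0000
Received: from client.example.org (198.51.100.7) by mail.example.org; Mon, 02 Nov 2020 09:59:58 +0000
From: Alice <alice@example.org>
Reply-To: bob@example.com
To: carol@example.net, dave@example.net
Subject: hello

The body is never inspected.
"""

def domain_of(header_value):
    """Return the lower-cased domain of the first address in a header value."""
    return parseaddr(header_value)[1].rpartition("@")[2].lower()

def header_vector(raw_mail):
    """Turn the headers of one raw message into a numeric behavior vector."""
    msg = message_from_string(raw_mail)
    hops = len(msg.get_all("Received", []))
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", [])) if addr]
    from_domain = domain_of(msg.get("From", ""))
    reply_domain = domain_of(msg.get("Reply-To", ""))
    reply_mismatch = int(bool(reply_domain) and reply_domain != from_domain)
    missing_id = int(msg.get("Message-ID") is None)
    return [hops, len(recipients), reply_mismatch, missing_id]

print(header_vector(RAW))
```

Vectors produced this way would then be cleaned and passed to whatever data mining method is used for prediction.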

Claims

1. A data discrimination analysis method, characterized by comprising:

collecting mail behavior, namely collecting the relevant data from normal mail and junk mail;

preprocessing the behavior characteristic data, including first cleaning the data, namely filling in vacant values and identifying and deleting isolated points;

and carrying out pattern mining on the extracted set of behavior characteristic vectors using a data mining method.

2. The data discrimination analysis method according to claim 1, wherein the data is collected and discrimination analysis is performed on the collected data.

3. The data discrimination analysis method according to claim 1, wherein information is extracted from the raw data, behavior characteristics with the relevant specificity are selected, and the characteristic data is vectorized.