CN112380271A - Data discrimination and analysis method - Google Patents

Info

Publication number
CN112380271A
Authority
CN
China
Prior art keywords
data
mails
mail
junk
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011187149.7A
Other languages
Chinese (zh)
Inventor
邬玉良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Hot Standby Beijing Cloud Computing Technology Co ltd
Original Assignee
Zhongke Hot Standby Beijing Cloud Computing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Hot Standby Beijing Cloud Computing Technology Co ltd
Priority to CN202011187149.7A
Publication of CN112380271A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/107 Computer-aided management of electronic mailing [e-mailing]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

The invention provides a data discrimination and analysis method for junk mail (spam). Based on a comparison of the communication behaviors of junk mail and normal mail, the method collects mail data; whether enough representative data is collected bears on the final recognition performance. The data is then preprocessed: mail header information is extracted from the original mail, and behavior features that distinguish junk mail are selected and processed. Finally, a data mining method is applied to the extracted set of behavior feature vectors for pattern mining and prediction.

Description

Data discrimination and analysis method
Technical Field
The invention relates to a data discrimination and analysis method and belongs to the field of internet data security processing.
Background
With the continuous expansion of the internet and the growth in the number of internet users, e-mail has become an increasingly important means of communication in social life because it is convenient, fast, and cheap. However, mailboxes often receive mail from unknown people or addresses. Such mail may even make up most of the mail received, and it may carry viruses that infect or even paralyze the computer. Junk mail (spam) has become a security problem that we face.
Normal mail sending follows the standard SMTP protocol and sends mail in the manner the protocol specifies. The SMTP workflow is simple and easy to imitate, and it has security flaws: a mail server can be deceived by forging a legitimate server identity, a legitimate sender address, and the like. The key to judging junk mail correctly is to correctly identify the communication information produced while the mail is generated, by comparing the communication behavior of junk mail with that of normal mail.
Anti-spam processing consumes a relatively large share of system resources. A mail security product that is lightly loaded and fully adequate in ordinary use may exhaust system resources when a mail virus breaks out or spam floods in, and a product without adequate self-protection may even crash.
Data discrimination draws on the regularities exhibited during program execution or user operation, which generally reflect the identity and habits of a user. Numerous experiments have shown that both program execution and user behavior exhibit close dependencies on system characteristics. During transmission, junk mail displays obvious characteristic behaviors such as a high sending frequency, continuous sending within a short time, and dynamic IP addresses. Data discrimination analysis can judge and handle mail with these typical behaviors in real time during the mail transfer agent communication stage, before the mail is placed in the mail queue. The mail therefore does not need to be scanned in full, which speeds up spam filtering at the gateway, reduces the load on network resources and network traffic, improves the capacity to compute on and process junk mail, and avoids the legal risk of infringing privacy rights.
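The behaviors named above (high sending frequency, continuous sending within a short time) can be measured from a connection log without reading any mail body. The following is a minimal sketch, assuming a log of (sender IP, timestamp) pairs; the function names, the 60-second window, and the burst threshold are illustrative assumptions and do not come from the patent.

```python
from collections import defaultdict


def sender_behavior(events, window_seconds=60):
    """Summarize sending behavior per IP.

    events: iterable of (sender_ip, unix_timestamp) pairs, one per delivery
    attempt (a hypothetical log format). Returns
    {ip: {"total": int, "max_in_window": int}}, where max_in_window is the
    largest number of attempts inside any window of window_seconds.
    """
    by_ip = defaultdict(list)
    for ip, ts in events:
        by_ip[ip].append(ts)

    features = {}
    for ip, stamps in by_ip.items():
        stamps.sort()
        best = 0
        start = 0
        for end, ts in enumerate(stamps):
            # Advance the left edge until the window spans at most window_seconds.
            while ts - stamps[start] > window_seconds:
                start += 1
            best = max(best, end - start + 1)
        features[ip] = {"total": len(stamps), "max_in_window": best}
    return features


def looks_bursty(feature, burst_threshold=20):
    # burst_threshold is an illustrative value, not taken from the source.
    return feature["max_in_window"] >= burst_threshold


# Example log: one IP sends 30 mails in 30 seconds, another sends 2 in an hour.
events = [("10.0.0.1", t) for t in range(30)]
events += [("10.0.0.2", 0), ("10.0.0.2", 3600)]
features = sender_behavior(events)
```

Because the counts depend only on connection metadata, a check of this kind could run before a message is queued, which is the stage the description refers to.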
In the prior art, the anti-spam field has proposed filtering junk mail by data discrimination analysis. The main advantages of data discrimination are as follows. (1) High processing efficiency: the text content is not considered and the analysis mainly targets the mail header information, so processing is fast. (2) A durable recognition effect: the header information has a fixed format and does not change frequently, which keeps the recognition effect durable. (3) Bandwidth savings: the method can identify and intercept mail in the session connection stage, which effectively reduces resource consumption. (4) High security and confidentiality: content filtering analyzes the mail text and offers no guarantee of its security and confidentiality, whereas data discrimination analysis focuses on the mail header information, so user privacy is protected. In addition, the data discrimination analysis model can perform offline statistics, analysis, and calculation on large volumes of junk mail logs and archived data. Data discrimination analysis has great development potential and is one of the development directions of anti-spam methods.
Disclosure of Invention
The invention aims to provide a data discrimination and analysis method that can judge, during real-time communication, whether a mail is junk mail.
The invention relates to a data discrimination analysis method, which comprises the following steps:
First, pattern recognition and classification usually comprise the following steps: data acquisition, data preprocessing, and data mining. Behavior collection for mail is the process of collecting relevant data from normal mail and junk mail; whether enough representative sample data is collected bears on the final performance of the pattern recognition.
Second, the behavior feature data is preprocessed. The data is first cleaned: vacant values are filled in, and isolated points (outliers) are identified and deleted, because garbage data can confuse the behavior patterns and make the output unreliable. Next, data integration is performed: data from multiple sources is combined and stored in a consistent data store. Finally, the data is transformed into a form suitable for mining. This comprises extracting mail header information from the original mail data, selecting behavior features that distinguish junk mail, and vectorizing the feature data.
Finally, a data mining method is used to perform pattern mining on the extracted set of behavior feature vectors.
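The cleaning step (filling vacant values, then deleting isolated points) can be sketched as follows. The patent does not say how values are filled or how isolated points are detected; this sketch assumes median filling and a median-absolute-deviation rule, and the function names and the cutoff `k` are hypothetical.

```python
from statistics import median


def fill_vacant(rows):
    """Replace None in each column with the median of that column's known values."""
    fills = []
    for col in zip(*rows):
        known = [v for v in col if v is not None]
        fills.append(median(known) if known else 0.0)
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def drop_isolated(rows, k=3.0):
    """Drop rows holding a value more than k scaled MADs from its column median.

    Columns whose MAD is zero are skipped, since no spread can be estimated.
    """
    columns = list(zip(*rows))
    meds = [median(col) for col in columns]
    mads = [median([abs(v - m) for v in col]) for col, m in zip(columns, meds)]

    def is_typical(row):
        for v, m, mad in zip(row, meds, mads):
            # 1.4826 scales the MAD to match a standard deviation for normal data.
            if mad > 0 and abs(v - m) > k * 1.4826 * mad:
                return False
        return True

    return [row for row in rows if is_typical(row)]


# Example rows: [mails per minute, recipients per mail]; None marks a vacant value.
raw_rows = [[1.0, 10.0], [2.0, None], [3.0, 12.0], [2.0, 11.0], [500.0, 11.0]]
cleaned = drop_isolated(fill_vacant(raw_rows))
```

Filling before outlier removal keeps every row comparable; a median-based rule is used here only because a single extreme value barely shifts the median, whereas it would shift a mean.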
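The description does not name a particular data mining algorithm. As one possible stand-in, the sketch below trains a small Gaussian naive Bayes classifier on labeled behavior feature vectors; the choice of algorithm, the function names, and the toy data are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def fit(vectors, labels):
    """Return {label: (prior, means, variances)} for Gaussian naive Bayes."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)

    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[label] = (n / len(vectors), means, variances)
    return model


def predict(model, vec):
    """Return the label with the highest log posterior for vec."""
    def log_score(prior, means, variances):
        score = math.log(prior)
        for v, m, var in zip(vec, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score

    return max(model, key=lambda label: log_score(*model[label]))


# Toy vectors: [mails per minute, uses dynamic IP (1.0 or 0.0)].
train_vectors = [[30.0, 1.0], [25.0, 1.0], [40.0, 1.0],
                 [1.0, 0.0], [2.0, 0.0], [1.0, 0.0]]
train_labels = ["junk", "junk", "junk", "normal", "normal", "normal"]
model = fit(train_vectors, train_labels)
```

Any classifier that accepts fixed-length numeric vectors could take the place of this one; the point is only that prediction operates on behavior features rather than on mail text.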
Drawings
FIG. 1 is a flow chart of the basic principle of the data discrimination analysis method.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawing.
The invention provides a data discrimination and analysis method that analyzes mail behavior, extracts mail behavior features, and finally judges during real-time communication whether a mail is junk mail.
As shown in FIG. 1, data is first collected from the mails to form a mail data set. Second, the data is preprocessed: mail header information is extracted from the original mail data, behavior features that distinguish junk mail are selected, and the feature data is vectorized. A data mining method is then applied to the extracted set of behavior feature vectors to make the prediction.
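Header extraction and vectorization can be sketched with Python's standard `email` package. The patent does not list which header fields are used, so the four features below (number of Received hops, number of recipients, whether the From and Return-Path domains match, whether a Message-ID is present) are hypothetical choices, as is the function name.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr


def header_features(raw_mail):
    """Turn the headers of one raw mail into a fixed-length numeric vector."""
    msg = message_from_string(raw_mail)

    received = msg.get_all("Received", [])
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))

    def domain_of(header_name):
        # parseaddr returns (display name, address); keep the part after "@".
        address = parseaddr(msg.get(header_name, ""))[1]
        return address.rpartition("@")[2].lower()

    from_domain = domain_of("From")
    return_domain = domain_of("Return-Path")

    return [
        float(len(received)),                       # relay hops recorded
        float(len(recipients)),                     # To + Cc recipients
        1.0 if from_domain and from_domain == return_domain else 0.0,
        1.0 if msg.get("Message-ID") else 0.0,      # header present or not
    ]


sample_mail = (
    "Received: from a.example by b.example; Thu, 29 Oct 2020 10:00:00 +0800\n"
    "Received: from c.example by a.example; Thu, 29 Oct 2020 09:59:58 +0800\n"
    "From: Alice <alice@example.com>\n"
    "Return-Path: <bounce@other.example>\n"
    "To: bob@example.net, carol@example.net\n"
    "Subject: hi\n"
    "\n"
    "body text is never inspected\n"
)
vector = header_features(sample_mail)
```

Only header fields are read, which matches the description's emphasis on leaving the mail text untouched.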

Claims (3)

1. A data discrimination analysis method is characterized in that:
a process of behavior collection for mails, namely collecting related data information from normal mails and junk mails;
preprocessing the behavior characteristic data, namely, firstly, cleaning the data, namely filling in vacant values, and identifying and deleting isolated points;
and carrying out pattern mining on the extracted behavior characteristic vector set by adopting a data mining method.
2. The data discrimination analysis method according to claim 1, wherein the data is collected for discrimination analysis.
3. The data discrimination analysis method according to claim 1, wherein information is extracted from the raw data, the associated specific behavior characteristics are selected, and the characteristic data is vectorized.
CN202011187149.7A 2020-10-29 2020-10-29 Data discrimination and analysis method Pending CN112380271A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011187149.7A CN112380271A (en) 2020-10-29 2020-10-29 Data discrimination and analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011187149.7A CN112380271A (en) 2020-10-29 2020-10-29 Data discrimination and analysis method

Publications (1)

Publication Number Publication Date
CN112380271A (en) 2021-02-19

Family

ID=74576344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011187149.7A Pending CN112380271A (en) 2020-10-29 2020-10-29 Data discrimination and analysis method

Country Status (1)

Country Link
CN (1) CN112380271A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101674264A (en) * 2009-10-20 2010-03-17 哈尔滨工程大学 Spam detection device and method based on user relationship mining and credit evaluation
CN104796318A (en) * 2014-07-30 2015-07-22 北京中科同向信息技术有限公司 Behavior pattern identification technology

Similar Documents

Publication Publication Date Title
CN109714322B (en) Method and system for detecting network abnormal flow
CN101674264B (en) Spam detection device and method based on user relationship mining and credit evaluation
Toolan et al. Feature selection for spam and phishing detection
JP4387205B2 (en) A framework that enables integration of anti-spam technologies
US8214438B2 (en) (More) advanced spam detection features
CN109639481A (en) A kind of net flow assorted method, system and electronic equipment based on deep learning
CN102842078B (en) Email forensic analyzing method based on community characteristics analysis
CN111277587A (en) Malicious encrypted traffic detection method and system based on behavior analysis
CN110519150B (en) Mail detection method, device, equipment, system and computer readable storage medium
CN102420723A (en) Anomaly detection method for various kinds of intrusion
CN108183888A (en) A kind of social engineering Network Intrusion path detection method based on random forests algorithm
CN110417643B (en) Mail processing method and device
CN112884121A (en) Traffic identification method based on generation of confrontation deep convolutional network
CN111147489A (en) Link camouflage-oriented fishfork attack mail discovery method and device
CN106341303B (en) Sender reputation's generation method based on mail user behavior
WO2011153894A1 (en) Method and system for distinguishing image spam mail
CN102377690A (en) Anti-spam gateway system and method
CN103490979A (en) Electronic mail identification method and system
CN114650229A (en) Network encryption traffic classification method and system based on three-layer model SFTF-L
Mishra et al. Analysis of random forest and Naive Bayes for spam mail using feature selection categorization
CN104065617B (en) A kind of harassing and wrecking email processing method, device and system
CN102905236A (en) Method, device and system for monitoring spam short messages
CN112380271A (en) Data discrimination and analysis method
CN100499599C (en) Rubbish mail filtration system and method based on email server
Gomes et al. Improving Spam Detection Based on Structural Similarity.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210219
