CN112380271A - Data discrimination and analysis method - Google Patents
Data discrimination and analysis method
- Publication number
- CN112380271A
- Authority
- CN
- China
- Prior art keywords
- data
- mails
- junk
- behavior
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/107—Computer-aided management of electronic mailing [e-mailing]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Abstract
The invention provides a data discrimination and analysis method for junk mail, based on comparing the communication behavior of junk mail and normal mail. Mail data is collected first; whether enough representative data has been collected bears directly on the final recognition performance. The data is then preprocessed: mail header information is extracted from the original mail, and behavior features that distinguish junk mail are selected and processed. Finally, a data mining method is applied to the extracted set of behavior feature vectors for pattern mining and prediction.
Description
Technical Field
The invention relates to a data discrimination and analysis method and belongs to the field of Internet data security processing.
Background
With the continued expansion of the Internet and the growth in the number of Internet users, e-mail has become an increasingly important means of communication in social life because it is convenient, fast, and cheap. However, mailboxes often receive mail from unknown people or addresses, and such mail can even make up the majority of what is received. It may also carry viruses that infect or even paralyze the computer. Spam has become a security problem that we all face.
Normal mail delivery follows the standard SMTP protocol and sends mail in the manner the protocol specifies. SMTP is simple, easy to imitate, and has security weaknesses, so a mail server can be deceived by forging a legitimate server identity, a legitimate sender address, and the like. The key to judging junk mail correctly is therefore to correctly identify the communication information produced while the mail is generated, by comparing the communication characteristics of junk mail and normal mail.
Anti-spam processing consumes a relatively large share of system resources. A mail security product that is only just adequate under ordinary load may therefore exhaust system resources during a mail virus outbreak or a flood of spam, and a product that does not protect itself well may even crash.
Data discrimination refers to identifying regularities exhibited during program execution or user operation, which generally reflect the identity and habits of a user. Numerous experiments have shown that both the execution of programs and the behavior of users depend closely on system characteristics. Data discrimination analysis acts on mail that shows typical junk mail behavior during transmission, such as a high sending frequency, continuous sending within a short time, or a dynamic IP address. Such mail is judged and handled in real time during the mail transfer agent communication stage, before it is placed in the mail queue. The mail therefore does not need to be scanned in full, which speeds up junk mail filtering at the gateway, reduces the load on network resources and network traffic, improves the capacity to compute and process junk mail, and avoids the legal risk of infringing privacy rights.
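The transmission-stage behaviors named above (high sending frequency, continuous sending within a short time) can be illustrated with a minimal sketch. It is not taken from the patent: the `Delivery` record, the `behavior_features` function, and the 60-second window are hypothetical choices made for illustration, and dynamic IP detection is left out because it would need an external data source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Delivery:
    """One observed delivery attempt (hypothetical record layout)."""
    sender_ip: str
    timestamp: float  # seconds since an arbitrary epoch


def behavior_features(deliveries, window=60.0):
    """Return per-IP sending counts and the largest burst inside `window` seconds."""
    times_by_ip = defaultdict(list)
    for d in deliveries:
        times_by_ip[d.sender_ip].append(d.timestamp)

    features = {}
    for ip, times in times_by_ip.items():
        times.sort()
        max_burst = 0
        start = 0
        # Sliding window: keep only timestamps within `window` of the current one.
        for end in range(len(times)):
            while times[end] - times[start] > window:
                start += 1
            max_burst = max(max_burst, end - start + 1)
        features[ip] = {"total": len(times), "max_burst": max_burst}
    return features
```

A gateway could compare `max_burst` against a locally tuned threshold before queuing a message; any such threshold is an operational assumption, not something the source specifies.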
In the prior art, the anti-spam field has proposed filtering junk mail by data discrimination analysis. Its main advantages are: (1) High processing efficiency. The text content is not considered and processing mainly targets the mail header information, so processing is fast. (2) A durable recognition effect. The header information has a fixed format and does not change frequently, which keeps the recognition effect durable. (3) Bandwidth savings. The method can identify and intercept mail during the session connection stage, which effectively reduces resource consumption. (4) High security and confidentiality. Content filtering analyzes the mail body and so cannot guarantee its security and confidentiality, whereas the data discrimination analysis method focuses on the mail header information, which protects user privacy. In addition, the data discrimination analysis model can also perform offline statistics, analysis, and computation based on large volumes of junk mail logs and archived data. Data discrimination analysis has great development potential and is one of the development directions of anti-spam methods.
Disclosure of Invention
The invention aims to provide a data discrimination analysis method that can judge, during real-time communication, whether a mail is junk mail.
The invention relates to a data discrimination analysis method, which comprises the following steps:
First, data is collected. Pattern recognition and classification usually comprises the following steps: data collection, data preprocessing, and data mining. Behavior collection for mail is the process of collecting relevant data from normal mail and junk mail; whether enough representative sample data is collected bears on the final performance of the mined pattern.
Second, the behavior feature data is preprocessed. The data is first cleaned: missing values are filled in and outliers (isolated points) are identified and deleted, because dirty data can confuse the behavior patterns and make the output unreliable. Next, data integration combines data from multiple sources into a consistent data store. Finally, the data is transformed into a form suitable for mining: mail header information is extracted from the original mail data, behavior features that distinguish junk mail are selected, and the feature data is vectorized.
Finally, a data mining method is used to mine patterns from the extracted set of behavior feature vectors.
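The preprocessing steps above (header extraction, filling missing values, deleting outliers, vectorization) can be sketched with the Python standard library. This is only an illustration under stated assumptions: the patent names no concrete features, so the three entries in `FEATURES`, the median fill, and the `max_ratio` outlier rule are all hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses
from statistics import median

# Hypothetical feature set; the source does not list concrete features.
FEATURES = ("received_hops", "recipient_count", "has_message_id")


def header_features(raw_mail):
    """Extract header-only behavior features from one raw message."""
    msg = message_from_string(raw_mail)
    to_values = msg.get_all("To")
    return {
        "received_hops": len(msg.get_all("Received") or []),
        # None marks a vacant value that cleaning fills in later.
        "recipient_count": len(getaddresses(to_values)) if to_values else None,
        "has_message_id": 1.0 if msg.get("Message-ID") else 0.0,
    }


def fill_missing(rows):
    """Replace None with the column median (assumed fill strategy)."""
    filled = [dict(r) for r in rows]
    for name in FEATURES:
        known = [r[name] for r in rows if r[name] is not None]
        default = median(known) if known else 0.0
        for r in filled:
            if r[name] is None:
                r[name] = default
    return filled


def drop_outliers(rows, name, max_ratio=10.0):
    """Drop rows whose `name` value exceeds max_ratio times the column median."""
    m = median(r[name] for r in rows)
    return [r for r in rows if m == 0 or r[name] <= max_ratio * m]


def vectorize(rows):
    """Turn feature dicts into fixed-order numeric vectors."""
    return [[float(r[name]) for name in FEATURES] for r in rows]
```

The functions are meant to be chained in the order the text gives: extract, clean, then vectorize. Data integration from several sources is omitted here because the source does not describe the sources involved.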
Drawings
FIG. 1 is a flow chart of the basic principle of the data discrimination analysis method.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawing.
The invention provides a data discrimination analysis method that analyzes mail behavior, extracts mail behavior features, and finally judges, during real-time communication, whether a mail is junk mail.
As shown in the figure, mail data is first collected to build a mail data set. The data is then preprocessed: mail header information is extracted from the original mail data, behavior features that distinguish junk mail are selected, and the feature data is vectorized. Finally, a data mining method is applied to the extracted set of behavior feature vectors to make predictions.
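The source does not say which data mining method makes the prediction. As one possible illustration, the sketch below swaps in a nearest-centroid classifier over behavior feature vectors; the function names, the labels, and the choice of classifier are all assumptions, not the patent's method.

```python
from math import dist


def train_centroids(vectors, labels):
    """Average the feature vectors of each class into one centroid per label."""
    sums, counts = {}, {}
    for vector, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))
```

In practice the features would likely need scaling to comparable ranges before distances are computed, and a real deployment would be evaluated on held-out mail; both steps are outside what the source describes.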
Claims (3)
1. A data discrimination analysis method, characterized by comprising:
collecting mail behavior, namely collecting relevant data from normal mail and junk mail;
preprocessing the behavior feature data, beginning with data cleaning, namely filling in missing values and identifying and deleting outliers; and
performing pattern mining on the extracted set of behavior feature vectors using a data mining method.
2. The data discrimination analysis method according to claim 1, wherein data is collected for the discrimination analysis of the data.
3. The data discrimination analysis method according to claim 1, wherein information is extracted from the raw data, specific related behavior features are selected, and the feature data is vectorized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011187149.7A CN112380271A (en) | 2020-10-29 | 2020-10-29 | Data discrimination and analysis method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011187149.7A CN112380271A (en) | 2020-10-29 | 2020-10-29 | Data discrimination and analysis method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112380271A true CN112380271A (en) | 2021-02-19 |
Family
ID=74576344
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011187149.7A Pending CN112380271A (en) | 2020-10-29 | 2020-10-29 | Data discrimination and analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112380271A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101674264A (en) * | 2009-10-20 | 2010-03-17 | 哈尔滨工程大学 | Spam detection device and method based on user relationship mining and credit evaluation |
CN104796318A (en) * | 2014-07-30 | 2015-07-22 | 北京中科同向信息技术有限公司 | Behavior pattern identification technology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210219 |