CN110932961A - Identification method of internet mailbox system - Google Patents
Identification method of internet mailbox system
- Publication number
- CN110932961A
- Authority
- CN
- China
- Prior art keywords
- internet
- website
- data
- mailbox system
- mailbox
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/42—Mailbox-related aspects, e.g. synchronisation of mailboxes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0227—Filtering policies
- H04L63/0236—Filtering by address, protocol, port number or service, e.g. IP-address or URL
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0227—Filtering policies
- H04L63/0263—Rule management
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention relates to an identification method of an internet mailbox system. The method comprises: collecting website home pages of the internet and crawling the website home page information; acquiring the IP and corresponding port data of the websites' mailbox-type services; cleaning the acquired data and storing them separately in a database; and retrieving the data from the database, performing rule matching, labeling the internet websites, and identifying the internet mailbox systems. Working over a large number of internet websites, the invention identifies internet mailbox systems by rules based on website fingerprints, website titles, and the IPs and open-port services obtained with a scanning tool. It can identify and label the mailbox systems among many internet websites in a short time, greatly reduces manual effort, and makes the work of the responsible supervisors easier.
Description
Technical Field
The invention relates to the technical field of electric digital data processing, that is, digital computing or data processing equipment and data processing methods adapted to specific functions, and in particular to an identification method of an internet mailbox system.
Background
With the rapid development of the internet, people use mailbox systems more and more, and many mailbox system websites are open on the internet. These websites have become targets for hackers, who use them to steal large numbers of important files and information and to spread computer virus files.
Against this background, quickly identifying the mailbox systems open on the internet is very important, and it is an effective way to strengthen the security supervision of mailbox systems.
Although the internet hosts many websites, the prior art lacks good ways of identifying website types. Identification generally relies on manual judgment, which involves a huge workload because each website must be examined and matched by hand; efficiency is low and omissions occur easily.
Disclosure of Invention
The invention addresses the huge workload, low efficiency, and frequent omissions that arise in the prior art because internet mailbox systems are identified mainly by manual judgment. It provides an optimized identification method that identifies internet mailbox systems by means of defined rules.
The technical scheme adopted by the invention is an identification method of an internet mailbox system, comprising the following steps:
step 1: collecting website home pages of the internet, and crawling the website home page information;
step 2: acquiring the IP and corresponding port data of the mailbox-type services of the websites;
step 3: cleaning the data obtained in step 1 and step 2, and storing them separately in a database;
step 4: retrieving the data from the database, performing rule matching, labeling the internet websites, and identifying the internet mailbox systems.
Preferably, in step 1, a web crawler is used to acquire website home pages of the internet in a targeted manner.
Preferably, in step 1, the website home page information includes the body, header, title, URL, IP, and port of the website home page.
Preferably, in step 2, the IP and corresponding port data of the mailbox-type services are obtained by scanning the open ports of each IP and identifying the mailbox service type open on each port.
Preferably, step 4 comprises the following steps:
step 4.1: retrieving the data obtained in step 1 from the database, and performing rule matching;
step 4.2: for successfully matched data, labeling the corresponding internet website and identifying it as an internet mailbox system; passing unsuccessfully matched data to the next step;
step 4.3: retrieving the data obtained in step 2 and combining them with the unsuccessfully matched data; data that combine successfully are identified as a mailbox system, and the corresponding internet website is labeled;
step 4.4: outputting all identified internet mailbox systems.
Preferably, the rule matching comprises:
acquiring the title of the website home page, and identifying the website as an internet mailbox system if the title contains a mailbox system keyword;
acquiring the URL of the website, and identifying the website as an internet mailbox system if the URL contains a mailbox keyword;
acquiring the header of the website home page, and identifying the website as an internet mailbox system if the header contains characteristic information of a mailbox system;
acquiring the body of the website home page, preprocessing the body to obtain a character string, and identifying the website as an internet mailbox system if the length of the character string is smaller than a preset value and the character string contains identification information.
Preferably, the character string is the body with its HTML tags removed.
Preferably, the identification information is a keyword, and the keywords include mail, user name, password, mailbox system, and mail system.
Preferably, in step 4.3, the conditions for combination are equal IP and equal port.
The invention provides an optimized identification method of an internet mailbox system: website home pages of the internet are collected and their information is crawled; the IP and corresponding port data of the websites' mailbox-type services are acquired; the data are cleaned and stored separately in a database; and rule matching is performed on the data in the database to label the internet websites and identify the internet mailbox systems.
Working over a large number of internet websites, the invention identifies internet mailbox systems by rules based on website fingerprints, website titles, and the IPs and open-port services obtained with a scanning tool. It can identify and label the mailbox systems among many internet websites in a short time, greatly reduces manual effort, and makes the work of the responsible supervisors easier.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention is described in further detail with reference to the following examples, but the scope of the present invention is not limited thereto.
The invention relates to an identification method of an internet mailbox system, which comprises the following steps.
Step 1: collect website home pages of the internet and crawl the website home page information.
In step 1, a web crawler is used to acquire website home pages of the internet in a targeted manner.
In step 1, the website home page information includes the body, header, title, URL, IP, and port of the website home page.
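As an illustration only, the home page information listed above could be held in a record such as the following. This is a minimal sketch: the `HomePage` class, its field names, and the `build_record` helper are hypothetical and do not come from the invention, and fetching the page over the network is left out.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Dict


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


@dataclass
class HomePage:
    """One crawled home page (hypothetical field names)."""
    url: str
    ip: str
    port: int
    header: Dict[str, str]
    title: str
    body: str


def build_record(url: str, ip: str, port: int,
                 header: Dict[str, str], body: str) -> HomePage:
    """Builds a record from an already fetched response."""
    parser = _TitleParser()
    parser.feed(body)
    return HomePage(url, ip, port, header, parser.title.strip(), body)
```

The title is extracted once at crawl time so that the later title rule does not need to parse the body again.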
Step 2: acquire the IP and corresponding port data of the mailbox-type services of the websites.
In step 2, the IP and corresponding port data of the mailbox-type services are obtained by scanning the open ports of each IP and identifying the mailbox service type open on each port.
In the invention, the scanning can be performed with tools such as nmap; this is conventional in the art and can be configured by a person skilled in the art.
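One possible way to turn scan results into (IP, port, service) rows is sketched below. It assumes the scan was saved in nmap's grepable output format (`-oG`), in which each port entry is a slash-separated list beginning with port, state, and protocol, with the service name in the fifth field. The `MAIL_SERVICES` set and the `parse_grepable` function are illustrative assumptions, not part of the invention.

```python
import re
from typing import List, Tuple

# Service names treated as mailbox services (an assumed list).
MAIL_SERVICES = {"smtp", "smtps", "submission", "pop3", "pop3s", "imap", "imaps"}


def parse_grepable(text: str) -> List[Tuple[str, int, str]]:
    """Returns (ip, port, service) for every open mailbox service port."""
    rows = []
    for line in text.splitlines():
        match = re.match(r"Host: (\S+) .*?Ports: ([^\t]*)", line)
        if not match:
            continue  # status lines and comments carry no port list
        ip, ports = match.group(1), match.group(2)
        for entry in ports.split(","):
            fields = entry.strip().split("/")
            if len(fields) < 5:
                continue
            port, state, service = fields[0], fields[1], fields[4]
            if state == "open" and service in MAIL_SERVICES:
                rows.append((ip, int(port), service))
    return rows
```

Only ports reported as open are kept, since step 2 is concerned with mailbox services that are actually reachable.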
Step 3: clean the data obtained in step 1 and step 2, and store them separately in a database.
In the invention, the data can be cleaned by means such as Hive and stored in the corresponding database; this is conventional in the art and can be configured by a person skilled in the art.
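The separate storage of the two data sets can be sketched as follows. The text names Hive; this sketch swaps in Python's built-in `sqlite3` module purely so that the example is self-contained. The table names, the columns, and the cleaning policy (dropping rows that lack a URL, IP, or port, and ignoring duplicates) are assumptions.

```python
import sqlite3
from typing import Iterable, Tuple

PageRow = Tuple[str, str, int, str, str, str]   # url, ip, port, title, header, body
ServiceRow = Tuple[str, int, str]               # ip, port, service


def store(pages: Iterable[PageRow],
          services: Iterable[ServiceRow]) -> sqlite3.Connection:
    """Cleans both data sets and stores them in separate tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (url TEXT PRIMARY KEY, ip TEXT NOT NULL, "
        "port INTEGER NOT NULL, title TEXT, header TEXT, body TEXT)"
    )
    conn.execute(
        "CREATE TABLE mail_services (ip TEXT NOT NULL, port INTEGER NOT NULL, "
        "service TEXT NOT NULL, PRIMARY KEY (ip, port))"
    )
    # Cleaning: skip incomplete rows; INSERT OR IGNORE drops duplicate keys.
    clean_pages = [p for p in pages if p[0] and p[1] and p[2]]
    conn.executemany(
        "INSERT OR IGNORE INTO pages VALUES (?, ?, ?, ?, ?, ?)", clean_pages
    )
    clean_services = [s for s in services if s[0] and s[1]]
    conn.executemany(
        "INSERT OR IGNORE INTO mail_services VALUES (?, ?, ?)", clean_services
    )
    conn.commit()
    return conn
```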
Step 4: retrieve the data from the database, perform rule matching, label the internet websites, and identify the internet mailbox systems.
Step 4 comprises the following steps:
Step 4.1: retrieve the data obtained in step 1 from the database, and perform rule matching.
The rule matching includes:
acquiring the title of the website home page, and identifying the website as an internet mailbox system if the title contains a mailbox system keyword;
acquiring the URL of the website, and identifying the website as an internet mailbox system if the URL contains a mailbox keyword;
acquiring the header of the website home page, and identifying the website as an internet mailbox system if the header contains characteristic information of a mailbox system;
acquiring the body of the website home page, preprocessing the body to obtain a character string, and identifying the website as an internet mailbox system if the length of the character string is smaller than a preset value and the character string contains identification information.
The character string is the body with its HTML tags removed.
The identification information is a keyword, and the keywords include mail, user name, password, mailbox system, and mail system.
Step 4.2: for successfully matched data, label the corresponding internet website and identify it as an internet mailbox system; pass unsuccessfully matched data to the next step.
Step 4.3: retrieve the data obtained in step 2 and combine them with the unsuccessfully matched data; data that combine successfully are identified as a mailbox system, and the corresponding internet website is labeled.
In step 4.3, the conditions for combination are equal IP and equal port.
Step 4.4: output all identified internet mailbox systems.
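The control flow of steps 4.1 to 4.4 can be sketched as a two-pass procedure. This is a hedged illustration: the function name `identify_mailboxes`, the dictionary keys `ip` and `port`, and the idea of passing the rule matcher in as a callable are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Page = Dict[str, object]
Endpoint = Tuple[str, int]


def identify_mailboxes(pages: Iterable[Page],
                       mail_endpoints: Set[Endpoint],
                       rule_match: Callable[[Page], bool]) -> List[Page]:
    """Returns the pages identified as mailbox systems."""
    identified: List[Page] = []
    unmatched: List[Page] = []
    # Steps 4.1 and 4.2: rule matching on the crawled home page data.
    for page in pages:
        if rule_match(page):
            identified.append(page)
        else:
            unmatched.append(page)
    # Step 4.3: combine the leftovers with the scan data on equal IP and port.
    for page in unmatched:
        if (page["ip"], page["port"]) in mail_endpoints:
            identified.append(page)
    # Step 4.4: output everything that was identified.
    return identified
```

Running the cheaper rule matching first means the scan data are consulted only for pages that the rules could not classify.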
In the invention, fingerprint information refers to the fingerprint information of mailbox systems commonly used on the internet, such as Coremail, EcMall, EyouMail and WinMail.
In the invention, if the title of the acquired website home page does not contain a mailbox system keyword, the title rule does not identify the home page as a mailbox system, and the unidentified data are passed to the next identification stage for subsequent matching; each subsequent rule is handled in the same way.
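The hand-over from one rule to the next can be illustrated with an ordered rule chain that stops at the first rule that matches. The keyword tuple `TITLE_KEYWORDS`, the `title_rule` predicate, and the `first_matching_rule` helper are hypothetical; the text does not list the exact title keywords.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Dict[str, str]
Rule = Tuple[str, Callable[[Page], bool]]

# Assumed title keywords; the invention does not enumerate them.
TITLE_KEYWORDS = ("mailbox system", "mail system")


def title_rule(page: Page) -> bool:
    """True when the page title contains a mailbox system keyword."""
    title = page.get("title", "").lower()
    return any(keyword in title for keyword in TITLE_KEYWORDS)


def first_matching_rule(page: Page, rules: List[Rule]) -> Optional[str]:
    """Tries each rule in order; returns the name of the first match."""
    for name, rule in rules:
        if rule(page):
            return name
    return None  # unidentified: passed on to the next identification stage
```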
In the invention, when the URL contains a mailbox keyword such as "mail", "pop3", or "smtp", the website is identified as a mailbox system.
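A minimal sketch of the URL rule follows, using the three keywords named above. Restricting the search to the host and path of the URL, and the name `url_rule`, are assumptions for illustration.

```python
from urllib.parse import urlsplit

URL_KEYWORDS = ("mail", "pop3", "smtp")


def url_rule(url: str) -> bool:
    """True when the host or path of the URL contains a mailbox keyword."""
    parts = urlsplit(url.lower())
    haystack = parts.netloc + parts.path
    return any(keyword in haystack for keyword in URL_KEYWORDS)
```

Plain substring matching is simple but can also match unrelated words that happen to contain "mail", so a stricter implementation might match whole host labels instead.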
In the invention, the body rule relies on the length of the character string that remains after the HTML tags are removed from the body, exploiting the fact that the login page of a typical mailbox system has very little content. The preset value is typically 100: when the character string is shorter than 100 characters and the text contains keywords such as mail, user name, password, mailbox system, or mail system, the website can be judged to be an internet mailbox system.
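The body rule can be sketched as below, with the preset value of 100 and the keywords taken from the text. The regular-expression tag stripping, the removal of script and style content, and the whitespace collapsing are simplifying assumptions; the invention states only that the HTML tags are removed.

```python
import re

BODY_KEYWORDS = ("mail", "user name", "password", "mailbox system", "mail system")
MAX_LENGTH = 100  # preset value given in the text


def strip_tags(html: str) -> str:
    """Removes script/style blocks and tags, then collapses whitespace."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    text = re.sub(r"(?s)<[^>]+>", "", text)
    return re.sub(r"\s+", " ", text).strip()


def body_rule(html: str) -> bool:
    """True for a short page whose visible text contains a keyword."""
    text = strip_tags(html).lower()
    return len(text) < MAX_LENGTH and any(k in text for k in BODY_KEYWORDS)
```

The length limit is what separates a sparse login page from, for example, a long news article that merely mentions mail.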
In the invention, the characteristic information may also be called fingerprint information; it is not publicly documented and must be extracted from the characteristics of each mailbox system. For example, the fingerprint of the EyouMail mailbox system is that the Set-Cookie field of the header contains "EMPHPSID="; as another example, the fingerprint of the MAILD mailbox system is that the Set-Cookie field of the header contains "IDHTTPSESSIONID=". Every mailbox system has such characteristics to a greater or lesser degree, and this characteristic information is used as a fingerprint for identification.
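Header fingerprint matching can be sketched with a small lookup table. The two cookie markers are the ones given in the text; the `FINGERPRINTS` table layout and the `header_rule` function are hypothetical, and a real fingerprint library would cover many more systems.

```python
from typing import Dict, Optional

# System name -> marker expected inside the Set-Cookie header value.
FINGERPRINTS = {
    "EyouMail": "EMPHPSID=",
    "MAILD": "IDHTTPSESSIONID=",
}


def header_rule(header: Dict[str, str]) -> Optional[str]:
    """Returns the name of the matched mailbox system, or None."""
    cookie = ""
    for name, value in header.items():
        if name.lower() == "set-cookie":  # header names are case-insensitive
            cookie += value + ";"
    for system, marker in FINGERPRINTS.items():
        if marker in cookie:
            return system
    return None
```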
In the invention, data not identified in the first pass are matched a second time against the data obtained in step 2. When the IP and the port of the two records are both equal, the internet website is served on the same IP and port on which a mailbox service is open, and the website is therefore judged to be a mailbox system.
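Because both data sets are stored in a database, the secondary matching can be expressed as a join on IP and port. The sketch below uses an in-memory `sqlite3` database with made-up table names and sample rows; it only illustrates the join condition described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE unmatched_pages (url TEXT, ip TEXT, port INTEGER);
CREATE TABLE mail_services (ip TEXT, port INTEGER, service TEXT);
INSERT INTO unmatched_pages VALUES
  ('http://192.0.2.10:8080/', '192.0.2.10', 8080),
  ('http://192.0.2.20/', '192.0.2.20', 80);
INSERT INTO mail_services VALUES
  ('192.0.2.10', 8080, 'smtp'),
  ('192.0.2.20', 25, 'smtp');
""")

# A page is matched only when both the IP and the port are equal.
query = """
SELECT p.url, s.service
FROM unmatched_pages AS p
JOIN mail_services AS s ON p.ip = s.ip AND p.port = s.port
"""
matched = conn.execute(query).fetchall()
```

In this sample the second page shares an IP with a mail service but not a port, so it stays unidentified.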
In the invention, after the identification steps are finished, each website is labeled: internet websites that satisfy the rules and are identified as mailbox systems are labeled as mailbox systems, and internet websites not identified as mailbox systems are labeled as unidentified.
The invention collects website home pages of the internet and crawls their information, acquires the IP and corresponding port data of the websites' mailbox-type services, cleans the data and stores them separately in a database, and performs rule matching on the data in the database to label the internet websites and identify the internet mailbox systems.
Claims (9)
1. An identification method of an internet mailbox system, characterized in that the method comprises the following steps:
step 1: collecting website home pages of the internet, and crawling the website home page information;
step 2: acquiring the IP and corresponding port data of the mailbox-type services of the websites;
step 3: cleaning the data obtained in step 1 and step 2, and storing them separately in a database;
step 4: retrieving the data from the database, performing rule matching, labeling the internet websites, and identifying the internet mailbox systems.
2. The identification method of an internet mailbox system as claimed in claim 1, wherein: in step 1, a web crawler is used to acquire website home pages of the internet in a targeted manner.
3. The identification method of an internet mailbox system as claimed in claim 1, wherein: in step 1, the website home page information includes the body, header, title, URL, IP, and port of the website home page.
4. The identification method of an internet mailbox system as claimed in claim 1, wherein: in step 2, the IP and corresponding port data of the mailbox-type services are obtained by scanning the open ports of each IP and identifying the mailbox service type open on each port.
5. The identification method of an internet mailbox system as claimed in claim 3, wherein step 4 comprises the following steps:
step 4.1: retrieving the data obtained in step 1 from the database, and performing rule matching;
step 4.2: for successfully matched data, labeling the corresponding internet website and identifying it as an internet mailbox system; passing unsuccessfully matched data to the next step;
step 4.3: retrieving the data obtained in step 2 and combining them with the unsuccessfully matched data; data that combine successfully are identified as a mailbox system, and the corresponding internet website is labeled;
step 4.4: outputting all identified internet mailbox systems.
6. The identification method of an internet mailbox system as claimed in claim 5, wherein the rule matching includes:
acquiring the title of the website home page, and identifying the website as an internet mailbox system if the title contains a mailbox system keyword;
acquiring the URL of the website, and identifying the website as an internet mailbox system if the URL contains a mailbox keyword;
acquiring the header of the website home page, and identifying the website as an internet mailbox system if the header contains characteristic information of a mailbox system;
acquiring the body of the website home page, preprocessing the body to obtain a character string, and identifying the website as an internet mailbox system if the length of the character string is smaller than a preset value and the character string contains identification information.
7. The identification method of an internet mailbox system as claimed in claim 6, wherein: the character string is the body with its HTML tags removed.
8. The identification method of an internet mailbox system as claimed in claim 6, wherein: the identification information is a keyword, and the keywords include mail, user name, password, mailbox system, and mail system.
9. The identification method of an internet mailbox system as claimed in claim 5, wherein: in step 4.3, the conditions for combination are equal IP and equal port.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911138332.5A CN110932961A (en) | 2019-11-20 | 2019-11-20 | Identification method of internet mailbox system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110932961A (en) | 2020-03-27 |
Family
ID=69851289
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911138332.5A Pending CN110932961A (en) | 2019-11-20 | 2019-11-20 | Identification method of internet mailbox system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110932961A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101872347A (en) * | 2009-04-22 | 2010-10-27 | 富士通株式会社 | Method and device for judging type of webpage |
CN101937466A (en) * | 2010-09-15 | 2011-01-05 | 深圳市任子行网络技术股份有限公司 | Webpage mailbox identification classifying method and system |
US7996406B1 (en) * | 2008-09-30 | 2011-08-09 | Symantec Corporation | Method and apparatus for detecting web-based electronic mail in network traffic |
CN102819591A (en) * | 2012-08-07 | 2012-12-12 | 北京网康科技有限公司 | Content-based web page classification method and system |
CN105574047A (en) * | 2014-10-17 | 2016-05-11 | 任子行网络技术股份有限公司 | Website main page feature analysis based Chinese website sorting method and system |
CN107741960A (en) * | 2017-09-25 | 2018-02-27 | 厦门集微科技有限公司 | URL sorting technique and device |
CN108256104A (en) * | 2018-02-05 | 2018-07-06 | 恒安嘉新(北京)科技股份公司 | Internet site compressive classification method based on multidimensional characteristic |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200327 |