CN110932961A - Identification method of internet mailbox system - Google Patents

Identification method of internet mailbox system

Info

Publication number
CN110932961A
CN110932961A CN201911138332.5A CN201911138332A CN110932961A CN 110932961 A CN110932961 A CN 110932961A CN 201911138332 A CN201911138332 A CN 201911138332A CN 110932961 A CN110932961 A CN 110932961A
Authority
CN
China
Prior art keywords
internet
website
data
mailbox system
mailbox
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911138332.5A
Other languages
Chinese (zh)
Inventor
温延龙
范渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dbappsecurity Technology Co Ltd
Original Assignee
Hangzhou Dbappsecurity Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dbappsecurity Technology Co Ltd
Priority to CN201911138332.5A
Publication of CN110932961A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/42: Mailbox-related aspects, e.g. synchronisation of mailboxes
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/02: Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227: Filtering policies
    • H04L63/0236: Filtering by address, protocol, port number or service, e.g. IP-address or URL
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/02: Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227: Filtering policies
    • H04L63/0263: Rule management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to an identification method of an internet mailbox system. The method comprises: collecting website home pages on the internet and crawling the website home page information; acquiring the IP and the corresponding port data of the mailbox service type data of the websites; cleaning the obtained data and storing each data set in a database; and retrieving the data from the database, performing rule matching, labeling the internet websites, and identifying the internet mailbox systems. The invention acquires a large number of internet websites and identifies internet mailbox systems using rules based on website fingerprints, website titles, and the IPs and open-port services obtained with a scanning tool. It can quickly identify and label the mailbox systems among many internet websites in a short time, greatly reduces manual effort, and provides convenience for the corresponding supervisors.

Description

Identification method of internet mailbox system
Technical Field
The invention relates to the technical field of electric digital data processing, and in particular to an identification method of an internet mailbox system that is particularly suitable for digital computing equipment, data processing equipment, or data processing methods with specific functions.
Background
With the rapid development of the internet, people use mailbox systems more and more, and many mailbox system websites are exposed on the internet. These websites have become targets for hackers, who use them to steal large amounts of important files and information and to spread computer virus files.
Under these circumstances, quickly identifying the mailbox systems exposed on the internet is very important, and doing so is an effective way to strengthen the security supervision of mailbox systems.
In the prior art, although there are many websites on the internet, methods for identifying website types are lacking. Identification is generally performed manually, but the manual workload is huge: each website must first be examined and then matched, which is inefficient and prone to omissions.
Disclosure of Invention
The invention addresses the huge workload, low efficiency, and frequent omissions that arise in the prior art because internet mailbox systems are identified mainly by manual judgment. It provides an optimized identification method of an internet mailbox system that identifies such systems by applying defined rules.
The technical scheme adopted by the invention is an identification method of an internet mailbox system, comprising the following steps:
step 1: collecting website home pages on the internet, and crawling the website home page information;
step 2: acquiring the IP and the corresponding port data of the mailbox service type data of the websites;
step 3: cleaning the data obtained in step 1 and step 2, and storing each data set in a database;
step 4: retrieving the data from the database, performing rule matching, labeling the internet websites, and identifying the internet mailbox systems.
Preferably, in step 1, a web crawler is used to acquire the website home pages of the internet in a targeted manner.
Preferably, in step 1, the website home page information includes the body, header, title, URL, IP, and port of the website home page.
Preferably, in step 2, the IP and the corresponding port data of the mailbox service type data are obtained by scanning the open ports of the IP and identifying the type of mailbox service open on each port.
Preferably, step 4 comprises the following steps:
step 4.1: retrieving from the database the data obtained in step 1, and performing rule matching;
step 4.2: for successfully matched data, labeling the corresponding internet website and identifying it as an internet mailbox system; passing unmatched data to the next step;
step 4.3: retrieving the data obtained in step 2 and joining it with the unmatched data; identifying successfully joined data as a mailbox system and labeling the corresponding internet website;
step 4.4: outputting all the identified internet mailbox systems.
Preferably, the rule matching comprises:
acquiring the title of the website home page, and identifying the website as an internet mailbox system if the title contains a mailbox system keyword;
acquiring the URL of the website, and identifying the website as an internet mailbox system if the URL contains a mailbox keyword;
acquiring the header of the website home page, and identifying the website as an internet mailbox system if the header contains characteristic information of a mailbox system;
acquiring the body of the website home page, preprocessing the body to obtain a character string, and identifying the website as an internet mailbox system if the length of the character string is smaller than a preset value and the character string contains identification information.
Preferably, the character string is the body with its html tags removed.
Preferably, the identification information is a keyword, and the keywords include mail, user name, password, mailbox system, and mail system.
Preferably, in step 4.3, the join conditions are that the IPs are equal and the ports are equal.
The invention provides an optimized identification method of an internet mailbox system: website home pages on the internet are collected and their information is crawled; the IP and the corresponding port data of the mailbox service type data of the websites are acquired; the data is cleaned and each data set is stored in a database; and rule matching is performed on the data in the database to label the internet websites and identify the internet mailbox systems.
The invention acquires a large number of internet websites and identifies internet mailbox systems using rules based on website fingerprints, website titles, and the IPs and open-port services obtained with a scanning tool. It can quickly identify and label the mailbox systems among many internet websites in a short time, greatly reduces manual effort, and provides convenience for the corresponding supervisors.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention is described in further detail with reference to the following examples, but the scope of the present invention is not limited thereto.
The invention relates to an identification method of an internet mailbox system, which comprises the following steps.
Step 1: Collect website home pages on the internet and crawl the website home page information.
In step 1, a web crawler is used to acquire the website home pages of the internet in a targeted manner.
In step 1, the website home page information includes the body, header, title, URL, IP, and port of the website home page.
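The fields listed for step 1 can be pictured as one record per crawled home page. The following is a minimal sketch only: the `HomePage` class, the `build_record` helper, and the use of Python's standard `html.parser` to pull out the title are illustrative assumptions, not part of the described method, and fetching the page itself is left out.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


@dataclass
class HomePage:
    """One crawled home page: the six fields named in step 1."""
    url: str
    ip: str
    port: int
    header: dict
    title: str
    body: str


def build_record(url, ip, port, header, body):
    """Assemble a HomePage record from an already-fetched response (hypothetical helper)."""
    parser = _TitleParser()
    parser.feed(body)
    return HomePage(url, ip, port, dict(header), parser.title.strip(), body)
```

A record built this way carries everything the later rule matching reads: title, URL, header, and body for the first pass, and IP and port for the second pass.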
Step 2: Acquire the IP and the corresponding port data of the mailbox service type data of the websites.
In step 2, the IP and the corresponding port data of the mailbox service type data are obtained by scanning the open ports of the IP and identifying the type of mailbox service open on each port.
In the invention, the scanning can be performed with a tool such as nmap; this is conventional in the art and can be configured by a person skilled in the art.
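As an illustration of how scan results could be turned into (IP, port, service) rows, the sketch below parses one line of nmap's grepable output, as produced for example by `nmap -sV -oG - <target>`. It does not run nmap. The sample line, the `MAIL_SERVICES` set, and the function name are assumptions made for the example; real output may contain service names or version strings that this simple split does not handle.

```python
# Service names treated as "mailbox service types" (an assumed list).
MAIL_SERVICES = {"smtp", "smtps", "submission", "pop3", "pop3s", "imap", "imaps"}


def parse_grepable_line(line):
    """Return (ip, port, service) tuples for open mail services in one '-oG' line."""
    if not line.startswith("Host:") or "Ports:" not in line:
        return []
    host_part, ports_part = line.split("Ports:", 1)
    ip = host_part.split()[1]
    # Anything after a tab belongs to other grepable fields, not to the port list.
    ports_part = ports_part.split("\t", 1)[0]
    results = []
    for entry in ports_part.split(","):
        # Each entry looks like: port/state/protocol/owner/service/rpcinfo/version/
        fields = entry.strip().split("/")
        if len(fields) < 5:
            continue
        port, state, _proto, _owner, service = fields[:5]
        if state == "open" and service in MAIL_SERVICES:
            results.append((ip, int(port), service))
    return results


# Hand-written sample line for illustration, not captured scanner output.
SAMPLE = (
    "Host: 192.0.2.10 (mail.example.com)\t"
    "Ports: 25/open/tcp//smtp//Postfix smtpd/, 80/open/tcp//http//nginx/, "
    "110/closed/tcp//pop3///"
)
mail_rows = parse_grepable_line(SAMPLE)
```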
Step 3: Store the data obtained in step 1 and the data obtained in step 2 separately in a database.
In the invention, the data can be cleaned by means such as hive and stored in the corresponding database; this is conventional in the art and can be configured by a person skilled in the art.
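The text names hive as one possible means; purely to keep the example self-contained, the sketch below swaps in Python's built-in `sqlite3` module. The cleaning rules (dropping rows without an IP or port and removing exact duplicates), the table names, and the column layout are all assumptions chosen for illustration.

```python
import sqlite3


def clean(rows):
    """Drop rows lacking an IP or port and remove exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row)
        ip, port = key[0], key[1]
        if not ip or port is None or key in seen:
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned


def store(conn, pages, services):
    """Store the step 1 and step 2 data sets in two separate tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS homepage (ip TEXT, port INTEGER, url TEXT, title TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mail_service (ip TEXT, port INTEGER, service TEXT)"
    )
    conn.executemany("INSERT INTO homepage VALUES (?, ?, ?, ?)", clean(pages))
    conn.executemany("INSERT INTO mail_service VALUES (?, ?, ?)", clean(services))
    conn.commit()
```

Keeping the two data sets in separate tables mirrors the "stored separately" wording of step 3 and leaves them ready to be joined on IP and port later.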
Step 4: Retrieve the data from the database, perform rule matching, label the internet websites, and identify the internet mailbox systems.
Step 4 comprises the following steps:
Step 4.1: Retrieve from the database the data obtained in step 1, and perform rule matching.
The rule matching includes:
acquiring the title of the website home page, and identifying the website as an internet mailbox system if the title contains a mailbox system keyword;
acquiring the URL of the website, and identifying the website as an internet mailbox system if the URL contains a mailbox keyword;
acquiring the header of the website home page, and identifying the website as an internet mailbox system if the header contains characteristic information of a mailbox system;
acquiring the body of the website home page, preprocessing the body to obtain a character string, and identifying the website as an internet mailbox system if the length of the character string is smaller than a preset value and the character string contains identification information.
The character string is the body with its html tags removed.
The identification information is a keyword, and the keywords include mail, user name, password, mailbox system, and mail system.
Step 4.2: For successfully matched data, label the corresponding internet website and identify it as an internet mailbox system; pass unmatched data to the next step.
Step 4.3: Retrieve the data obtained in step 2 and join it with the unmatched data; identify successfully joined data as a mailbox system and label the corresponding internet website.
In step 4.3, the join conditions are that the IPs are equal and the ports are equal.
Step 4.4: Output all the identified internet mailbox systems.
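Steps 4.1 to 4.4 amount to a two-pass classification: rule matching first, then a fallback on scan data for whatever the rules missed. The sketch below shows only that control flow, under assumed data shapes: pages are dictionaries with `url`, `ip`, and `port` keys, rules are arbitrary callables, and the label strings are placeholders.

```python
def identify(pages, mail_endpoints, rules):
    """Label each page as a mailbox system or as unidentified.

    pages:          list of dicts with 'url', 'ip' and 'port' keys (assumed shape)
    mail_endpoints: set of (ip, port) pairs where a mail service was found in step 2
    rules:          callables taking a page dict and returning True on a match
    """
    labels = {}
    unmatched = []
    # Steps 4.1 and 4.2: first pass, rule matching on the crawled data.
    for page in pages:
        if any(rule(page) for rule in rules):
            labels[page["url"]] = "mailbox system"
        else:
            unmatched.append(page)
    # Step 4.3: second pass, join the unmatched pages with the scan data.
    for page in unmatched:
        if (page["ip"], page["port"]) in mail_endpoints:
            labels[page["url"]] = "mailbox system"
        else:
            labels[page["url"]] = "unidentified"
    return labels


def mailbox_systems(labels):
    """Step 4.4: output every website identified as a mailbox system."""
    return sorted(url for url, label in labels.items() if label == "mailbox system")
```

Running the cheaper page-content rules first means the IP and port lookup is only needed for pages those rules could not classify.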
In the invention, the fingerprint information refers to the fingerprint information of mailbox systems commonly used on the internet, such as Coremail, EcMall, EyouMail and WinMail.
In the invention, if the title of the acquired website home page does not contain a mailbox system keyword, the title rule does not identify the page as a mailbox system, and the unidentified data is passed to the next identification stage for subsequent matching; the subsequent rules are applied in the same way.
In the invention, when the URL contains a mailbox keyword such as "mail", "pop3", or "smtp", the website is identified as a mailbox system.
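The title and URL rules are both plain keyword checks. A minimal sketch follows; the URL keywords come from the text, while the title keyword list, the case-insensitive comparison, and the choice to search only the host and path of the URL are assumptions.

```python
from urllib.parse import urlparse

# Title keywords are an assumed list; the text only says "mailbox system keyword".
TITLE_KEYWORDS = ("mailbox system", "mail system", "webmail", "邮箱系统", "邮件系统")
# URL keywords named in the text.
URL_KEYWORDS = ("mail", "pop3", "smtp")


def title_matches(title):
    """True if the home page title contains a mailbox system keyword."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in TITLE_KEYWORDS)


def url_matches(url):
    """True if the host or path of the URL contains a mailbox keyword."""
    parsed = urlparse(url.lower())
    haystack = parsed.netloc + parsed.path
    return any(keyword in haystack for keyword in URL_KEYWORDS)
```

A substring check is deliberately loose: a host such as `email.example.com` also matches "mail", which may or may not be desirable depending on how many false positives are acceptable.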
In the invention, the body rule relies on the length of the character string that remains after the html tags are removed from the body, because the login page of a typical mailbox system has very little content. The usual preset value is 100: when the length of the body character string is less than 100 and the text contains keywords such as mail, user name, password, mailbox system, or mail system, the website can be judged to be an internet mailbox system.
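The body rule can be sketched as a tag-stripping step followed by a length and keyword check. The threshold of 100 and the keywords are taken from the text; the regular expressions, the removal of script and style content, the whitespace collapsing, and the English keyword spellings are assumptions made for this example.

```python
import re

BODY_KEYWORDS = ("mail", "user name", "username", "password", "mailbox system", "mail system")
MAX_BODY_LENGTH = 100  # the usual preset value given in the text


def strip_tags(html):
    """Remove html tags and collapse whitespace (a simple regex approach)."""
    # Dropping script/style content is an assumption; the text only mentions tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]*>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def body_matches(html, max_length=MAX_BODY_LENGTH):
    """True if the stripped body is short and contains an identification keyword."""
    text = strip_tags(html).lower()
    return len(text) < max_length and any(keyword in text for keyword in BODY_KEYWORDS)
```

The length check is what separates a sparse login form from a content-heavy page that merely mentions a password somewhere.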
In the invention, the characteristic information may also be called fingerprint information; it is not publicly documented and must be extracted from the characteristics of each mailbox system. For example, the fingerprint of the EyouMail mailbox system is that the Set-Cookie field of the header contains "EMPHPSID="; as another example, the fingerprint of the MAILD mailbox system is that the Set-Cookie field of the header contains "IDHTTPSESSIONID=". Every mailbox system has more or less of such characteristics, and this characteristic information is used as the fingerprint for identification.
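A header fingerprint check of this kind can be sketched as a lookup table from mailbox system to a token expected in `Set-Cookie`. The EyouMail token is quoted from the text; the MAILD entry follows one reading of a garbled passage and should be treated as an assumption, as should the function name and the dictionary-shaped header.

```python
# Fingerprint table: system name -> token expected in the Set-Cookie header.
# The MAILD entry is an assumed reading of the source text.
FINGERPRINTS = {
    "EyouMail": "EMPHPSID=",
    "MAILD": "IDHTTPSESSIONID=",
}


def fingerprint(header):
    """Return the name of the matching mailbox system, or None if nothing matches."""
    cookies = ""
    for name, value in header.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "set-cookie":
            cookies += value + ";"
    for system, token in FINGERPRINTS.items():
        if token in cookies:
            return system
    return None
```

New systems can be supported by adding rows to the table once their distinguishing header values have been extracted.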
In the invention, data that was not successfully identified in the first pass is matched a second time against the data obtained in step 2. When the IPs of the two data sets are equal and the ports are equal, the website and the mailbox service are considered to be exposed on the same open port of the same IP, and the website is therefore identified as a mailbox system.
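Expressed in a database, this second pass is an inner join on IP and port. The sketch below uses an in-memory SQLite database with made-up rows; the table names, column names, and sample addresses are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE unmatched (url TEXT, ip TEXT, port INTEGER);
    CREATE TABLE mail_service (ip TEXT, port INTEGER, service TEXT);
    INSERT INTO unmatched VALUES
        ('http://a.example/', '192.0.2.10', 80),
        ('http://b.example/', '192.0.2.11', 8080);
    INSERT INTO mail_service VALUES
        ('192.0.2.10', 80, 'smtp'),
        ('192.0.2.11', 25, 'smtp');
    """
)

# A row survives the join only when both the IP and the port are equal.
rows = conn.execute(
    """
    SELECT DISTINCT u.url
    FROM unmatched AS u
    JOIN mail_service AS m ON u.ip = m.ip AND u.port = m.port
    ORDER BY u.url
    """
).fetchall()
second_pass = [row[0] for row in rows]
```

In this made-up data only the first website is identified: the second shares an IP with a mail service but on a different port, so the join condition fails.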
In the invention, after the identification steps are finished, every processed website is labeled: an internet website that satisfies the rules and is identified as a mailbox system is given the mailbox system label, and an internet website that is not identified as a mailbox system is labeled as unidentified.
The invention collects website home pages on the internet and crawls their information, acquires the IP and the corresponding port data of the mailbox service type data of the websites, cleans the data, stores each data set in the database, performs rule matching on the data in the database, and labels the internet websites to identify the internet mailbox systems.
The invention acquires a large number of internet websites and identifies internet mailbox systems using rules based on website fingerprints, website titles, and the IPs and open-port services obtained with a scanning tool. It can quickly identify and label the mailbox systems among many internet websites in a short time, greatly reduces manual effort, and provides convenience for the corresponding supervisors.

Claims (9)

1. An identification method of an internet mailbox system, characterized in that the method comprises the following steps:
step 1: collecting website home pages on the internet, and crawling the website home page information;
step 2: acquiring the IP and the corresponding port data of the mailbox service type data of the websites;
step 3: cleaning the data obtained in step 1 and step 2, and storing each data set in a database;
step 4: retrieving the data from the database, performing rule matching, labeling the internet websites, and identifying the internet mailbox systems.
2. The identification method of an internet mailbox system as claimed in claim 1, wherein: in step 1, a web crawler is used to acquire the website home pages of the internet in a targeted manner.
3. The identification method of an internet mailbox system as claimed in claim 1, wherein: in step 1, the website home page information further includes the body, header, title, URL, IP, and port of the website home page.
4. The identification method of an internet mailbox system as claimed in claim 1, wherein: in step 2, the IP and the corresponding port data of the mailbox service type data are obtained by scanning the open ports of the IP and identifying the type of mailbox service open on each port.
5. The identification method of an internet mailbox system as claimed in claim 3, wherein step 4 comprises the following steps:
step 4.1: retrieving from the database the data obtained in step 1, and performing rule matching;
step 4.2: for successfully matched data, labeling the corresponding internet website and identifying it as an internet mailbox system; passing unmatched data to the next step;
step 4.3: retrieving the data obtained in step 2 and joining it with the unmatched data; identifying successfully joined data as a mailbox system and labeling the corresponding internet website;
step 4.4: outputting all the identified internet mailbox systems.
6. The identification method of an internet mailbox system as claimed in claim 5, wherein the rule matching includes:
acquiring the title of the website home page, and identifying the website as an internet mailbox system if the title contains a mailbox system keyword;
acquiring the URL of the website, and identifying the website as an internet mailbox system if the URL contains a mailbox keyword;
acquiring the header of the website home page, and identifying the website as an internet mailbox system if the header contains characteristic information of a mailbox system;
acquiring the body of the website home page, preprocessing the body to obtain a character string, and identifying the website as an internet mailbox system if the length of the character string is smaller than a preset value and the character string contains identification information.
7. The identification method of an internet mailbox system as claimed in claim 6, wherein: the character string is the body with its html tags removed.
8. The identification method of an internet mailbox system as claimed in claim 6, wherein: the identification information is a keyword, and the keywords include mail, user name, password, mailbox system, and mail system.
9. The identification method of an internet mailbox system as claimed in claim 5, wherein: in step 4.3, the join conditions are that the IPs are equal and the ports are equal.
CN201911138332.5A 2019-11-20 2019-11-20 Identification method of internet mailbox system Pending CN110932961A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911138332.5A CN110932961A (en) 2019-11-20 2019-11-20 Identification method of internet mailbox system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911138332.5A CN110932961A (en) 2019-11-20 2019-11-20 Identification method of internet mailbox system

Publications (1)

Publication Number Publication Date
CN110932961A (en) 2020-03-27

Family

ID=69851289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911138332.5A Pending CN110932961A (en) 2019-11-20 2019-11-20 Identification method of internet mailbox system

Country Status (1)

Country Link
CN (1) CN110932961A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101872347A (en) * 2009-04-22 2010-10-27 富士通株式会社 Method and device for judging type of webpage
CN101937466A (en) * 2010-09-15 2011-01-05 深圳市任子行网络技术股份有限公司 Webpage mailbox identification classifying method and system
US7996406B1 (en) * 2008-09-30 2011-08-09 Symantec Corporation Method and apparatus for detecting web-based electronic mail in network traffic
CN102819591A (en) * 2012-08-07 2012-12-12 北京网康科技有限公司 Content-based web page classification method and system
CN105574047A (en) * 2014-10-17 2016-05-11 任子行网络技术股份有限公司 Website main page feature analysis based Chinese website sorting method and system
CN107741960A (en) * 2017-09-25 2018-02-27 厦门集微科技有限公司 URL sorting technique and device
CN108256104A (en) * 2018-02-05 2018-07-06 恒安嘉新(北京)科技股份公司 Internet site compressive classification method based on multidimensional characteristic

Similar Documents

Publication Publication Date Title
CN104504150B (en) News public sentiment monitoring system
CN110247930B (en) Encrypted network flow identification method based on deep neural network
CN108768986B (en) Encrypted traffic classification method, server and computer readable storage medium
US8161059B2 (en) Method and apparatus for collecting entity aliases
CN102664935B (en) Method and system for associated output of WEB class user behavior and user information
CN105812417B (en) Remote server, router and bad webpage information filtering method
CN107798080B (en) Similar sample set construction method for fishing URL detection
EP2863592A1 (en) Spammer group extraction apparatus and method
CN105005600A (en) Preprocessing method of URL (Uniform Resource Locator) in access log
US11431749B2 (en) Method and computing device for generating indication of malicious web resources
CN102045268B (en) A kind of e-mail data restoration methods and device
Park et al. Toward fine-grained traffic classification
CN113706100B (en) Real-time detection and identification method and system for Internet of things terminal equipment of power distribution network
CN107679227A (en) Video index label setting method, device and server
CN110020161B (en) Data processing method, log processing method and terminal
CN112235230A (en) Malicious traffic identification method and system
CN105701224A (en) Security information customized service system based on big data
CN113794687A (en) Malicious encrypted flow detection method and device based on deep learning
CN110602059B (en) Method for accurately restoring clear text length fingerprint of TLS protocol encrypted transmission data
CN110932961A (en) Identification method of internet mailbox system
CN104346337B (en) Method and device for intercepting junk information
CN111209959B (en) Encrypted webpage flow division point identification method based on data packet time sequence
CN111062199B (en) Bad information identification method and device
CN107784588A (en) Insurance user information merging method and device
CN108595453B (en) URL (Uniform resource locator) identifier mapping obtaining method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200327)