CN108092963A - Web page identification method, device, computer equipment and storage medium - Google Patents

Web page identification method, device, computer equipment and storage medium

Info

Publication number
CN108092963A
Authority
CN
China
Prior art keywords
domain name
identified
data
webpage
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711297266.7A
Other languages
Chinese (zh)
Other versions
CN108092963B (en)
Inventor
王元铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201711297266.7A priority Critical patent/CN108092963B/en
Priority to PCT/CN2018/077064 priority patent/WO2019109529A1/en
Publication of CN108092963A publication Critical patent/CN108092963A/en
Application granted granted Critical
Publication of CN108092963B publication Critical patent/CN108092963B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/16 Implementing security features at a particular protocol layer
    • H04L63/168 Implementing security features at a particular protocol layer above the transport layer
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Abstract

The present invention relates to a web page identification method, apparatus, computer device, and storage medium. The method includes: obtaining a web page whose identified risk level exceeds a preset level, and extracting the website domain name corresponding to the web page; obtaining the network address of the website according to the website domain name; searching for domain names associated with the network address and, when such a domain name is found, taking the associated domain name as a domain name to be identified; obtaining the web page data in the website corresponding to the domain name to be identified; and obtaining, according to the acquired web page data, the web pages corresponding to the domain name to be identified whose risk level exceeds the preset level. With the above method, apparatus, computer device, and storage medium, a single web page whose risk level exceeds the preset level can be used to find multiple associated web pages whose risk level also exceeds the preset level, so the search is efficient.

Description

Web page identification method, device, computer equipment and storage medium
Technical field
The present invention relates to the field of network security, and in particular to a web page identification method, apparatus, computer device, and storage medium.
Background
With the development of Internet technology, more and more everyday activities take place online, such as making transactions or handling banking business. As a result, websites that disguise themselves as banks have appeared. When a user visits and uses such a website, the private information the user submits, such as bank account numbers and passwords, can be stolen. If such threatening websites are not discovered in time, they endanger the user's property and interests.
Traditionally, because a large number of web pages are generated every day, potentially threatening target web pages must first be selected from the large volume of pages on the Internet, and each selected target web page must then undergo cumbersome analysis. Identifying whether the risk level of a target web page exceeds a preset level is therefore inefficient.
Summary of the invention
In view of this, to address the inefficiency of identifying whether the risk level of a target web page exceeds a preset level, it is necessary to provide a web page identification method, apparatus, computer device, and storage medium.
A web page identification method, including:
Obtaining a web page whose identified risk level exceeds a preset level, and extracting the website domain name corresponding to the web page;
Obtaining the network address of the website according to the website domain name;
Searching for domain names associated with the network address and, when a domain name associated with the network address is found, taking the associated domain name as a domain name to be identified;
Obtaining the web page data in the website corresponding to the domain name to be identified;
Obtaining, according to the acquired web page data, the web pages corresponding to the domain name to be identified whose risk level exceeds the preset level.
In one of the embodiments, the step of searching for domain names associated with the network address includes:
Matching the network address against the network addresses pre-stored in an address association database;
When the network address matches a network address pre-stored in the address association database, obtaining the candidate associated domain name associated with that pre-stored network address;
Obtaining the validity deadline of the candidate associated domain name;
If the current time is earlier than or equal to the validity deadline, taking the candidate associated domain name as a domain name to be identified.
In one of the embodiments, the method further includes:
When no domain name associated with the network address is found, obtaining the registration data corresponding to the domain name of the website, and querying the domain names corresponding to that registration data as domain names to be identified.
In one of the embodiments, the step of obtaining the registration data corresponding to the domain name of the website and querying the domain names corresponding to that registration data as domain names to be identified includes:
Obtaining the registration data corresponding to the domain name of the website, and selecting from a conversion logic library the conversion logic corresponding to the registration data;
Converting the registration data according to the conversion logic to obtain converted registration data;
Matching the converted registration data against the information data stored in an information database;
When the converted registration data matches information data stored in the information database, obtaining the domain name associated with the matched information data as a domain name to be identified.
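The conversion-then-match steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the patent does not say what the conversion logic is, so the normalizers in `CONVERTERS`, the function name `domains_by_registration`, and the in-memory `info_db` dictionary are all hypothetical stand-ins for the conversion logic library and the information database.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical "conversion logic library": one normalizer per registration field.
CONVERTERS: Dict[str, Callable[[str], str]] = {
    "email": lambda v: v.strip().lower(),
    "phone": lambda v: "".join(ch for ch in v if ch.isdigit()),
}


def domains_by_registration(
    field: str,
    value: str,
    info_db: Dict[Tuple[str, str], Set[str]],
) -> List[str]:
    """Convert one registration field and look up the domains stored under it."""
    convert = CONVERTERS.get(field)
    if convert is None:
        # No conversion logic is known for this kind of registration data.
        return []
    key = (field, convert(value))
    return sorted(info_db.get(key, set()))
```

Keying the lookup on the converted value means that cosmetic differences in the raw registration data (case, spacing, punctuation in a phone number) do not prevent a match.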
In one of the embodiments, the step of obtaining, according to the acquired web page data, the web pages corresponding to the domain name to be identified whose risk level exceeds the preset level includes:
Matching the web page data against the first filter data stored in a preset blacklist and, when the web page data matches the first filter data, adding a suspicious label to the domain name to be identified;
Matching the web page data in the website corresponding to the domain name to be identified that carries the suspicious label against the second filter data stored in a preset whitelist;
When the web page data does not match the second filter data, extracting the domain name to be identified that carries the suspicious label, and taking the web pages in the website corresponding to that domain name as web pages whose risk level exceeds the preset level.
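A minimal sketch of the two-stage filter described above, assuming that "matching" means a simple substring test on page text. The patent does not define the matching rule, and the function name `is_high_risk` and its parameters are hypothetical.

```python
from typing import Iterable


def is_high_risk(page_text: str, blacklist: Iterable[str], whitelist: Iterable[str]) -> bool:
    """Blacklist hit marks the page suspicious; a whitelist hit then clears it."""
    # Stage 1: any blacklist term in the page data adds the "suspicious" label.
    suspicious = any(term in page_text for term in blacklist)
    if not suspicious:
        return False
    # Stage 2: a suspicious page stays high-risk only if no whitelist term matches.
    return not any(term in page_text for term in whitelist)
```

The whitelist is consulted only after a blacklist hit, so it acts as an exemption list for suspicious pages and is never evaluated for pages that were not flagged.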
In one of the embodiments, the method further includes:
When, after data identification with the preset blacklist and the preset whitelist, there is no domain name to be identified that carries the suspicious label, obtaining the identifier corresponding to the domain name to be identified;
Matching the identifier against the security identifiers stored in advance in a security identifier database;
When a security identifier matches the identifier corresponding to the domain name to be identified, obtaining the secure domain name associated with the matched security identifier stored in the security identifier database, and matching the secure domain name against the domain name to be identified;
When the secure domain name does not match the domain name to be identified, taking the web pages in the website corresponding to the domain name to be identified as web pages whose risk level exceeds the preset level.
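The identifier check above amounts to asking whether a site presents an identifier that is registered to a different, trusted domain. A rough sketch follows, assuming the security identifier database can be modeled as a dictionary from identifier to secure domain name; the function name and the example identifier values are hypothetical, and the patent does not say what kind of identifier is used.

```python
from typing import Dict


def uses_foreign_identifier(domain: str, identifier: str, secure_ids: Dict[str, str]) -> bool:
    """True when the identifier belongs to a known secure domain other than `domain`."""
    secure_domain = secure_ids.get(identifier)
    if secure_domain is None:
        # The identifier is not in the security identifier database: no verdict here.
        return False
    # A matched identifier served from a different domain is treated as high-risk.
    return secure_domain != domain
```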
In one of the embodiments, after the step of obtaining, according to the acquired web page data, the web pages corresponding to the domain name to be identified whose risk level exceeds the preset level, the method further includes:
Extracting keywords from the web page data of the web page whose risk level exceeds the preset level, and adding a corresponding class label, according to the keywords, to the domain name to be identified that corresponds to that web page;
Matching the class label of the domain name to be identified whose risk level exceeds the preset level against the stored class labels;
When there is no match, adding the class label of the domain name to be identified whose risk level exceeds the preset level, and storing the web page whose risk level exceeds the preset level under that class label.
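The labeling step can be illustrated with a small sketch. Two points are assumptions rather than statements from the patent: the keyword-to-label mapping in `LABEL_BY_KEYWORD` is invented for illustration, and a page whose label already exists is simply stored under the existing label (the text only describes the case where no stored label matches).

```python
from typing import Dict, List, Optional

# Hypothetical mapping from page keywords to class labels.
LABEL_BY_KEYWORD: Dict[str, str] = {"bank": "finance", "prize": "lottery"}


def file_page(label_store: Dict[str, List[str]], page_url: str, page_text: str) -> Optional[str]:
    """Derive a class label from page keywords and store the page under it."""
    text = page_text.lower()
    for keyword, label in LABEL_BY_KEYWORD.items():
        if keyword in text:
            # setdefault creates the label on first use, then the page is appended.
            label_store.setdefault(label, []).append(page_url)
            return label
    return None
```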
A web page identification apparatus, the apparatus including:
A first acquisition module, for obtaining a web page whose identified risk level exceeds a preset level and extracting the website domain name corresponding to the web page;
A second acquisition module, for obtaining the network address of the website according to the website domain name;
A search module, for searching for domain names associated with the network address and, when a domain name associated with the network address is found, taking the associated domain name as a domain name to be identified;
A third acquisition module, for obtaining the web page data in the website corresponding to the domain name to be identified;
An identification module, for obtaining, according to the acquired web page data, the web pages corresponding to the domain name to be identified whose risk level exceeds the preset level.
A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above method when executing the computer program.
A storage medium storing a computer program, where the program implements the steps of the above method when executed by a processor.
The above web page identification method, apparatus, computer device, and storage medium obtain a web page whose identified risk level exceeds a preset level, obtain from the web page the domain name of the corresponding website, and obtain the network address of the website from that domain name. Domain names associated with the network address are then searched for as domain names to be identified. When a domain name to be identified is found, the web page data in the corresponding website is obtained and examined to find the web pages whose risk level exceeds the preset level. A single web page whose risk level exceeds the preset level can thus be used to find multiple associated web pages whose risk level also exceeds the preset level, so the search is efficient.
Brief description of the drawings
Fig. 1 is an application scenario diagram of the web page identification method in an embodiment;
Fig. 2 is a flow chart of the web page identification method in an embodiment;
Fig. 3 is a structural diagram of the web page identification apparatus in an embodiment;
Fig. 4 is a structural diagram of the computer device in an embodiment.
Detailed description
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Before the embodiments of the present invention are described in detail, it should be noted that the described embodiments reside primarily in combinations of steps and apparatus components related to the web page identification method, apparatus, computer device, and storage medium. Accordingly, the apparatus components and method steps are represented where appropriate by conventional symbols in the drawings, which show only the details relevant to understanding the embodiments of the present invention, so that the disclosure is not obscured by details that are readily apparent to those of ordinary skill in the art having the benefit of this description.
Herein, relational terms such as left and right, upper and lower, front and rear, and first and second are used only to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between those entities or actions. The terms "comprise", "include", and any other variants are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device.
Referring to Fig. 1, Fig. 1 is an application scenario diagram of the web page identification method in an embodiment, which includes a web page identification platform and a server. The web page identification platform obtains from the server a stored web page whose identified risk level exceeds a preset level, obtains the web page address of that web page, and extracts from the web page address the website domain name corresponding to the web page. The platform then obtains the network address of the website according to the website domain name and, using that network address, searches the address association database stored on the platform for domain names associated with the network address. When a domain name associated with the network address is found, the associated domain name is taken as a domain name to be identified. The platform obtains the web page data of the web pages contained in the website corresponding to the domain name to be identified and, according to that web page data, obtains the web pages corresponding to the domain name to be identified whose risk level exceeds the preset level.
Referring to Fig. 2, one of the embodiments provides a flow chart of a web page identification method. This embodiment is described using the example of the method applied to the web page identification platform in Fig. 1 above. A web page identification program runs on the platform, and web page identification is carried out by that program. The method includes the following steps:
S202: Obtain a web page whose identified risk level exceeds a preset level, and extract the website domain name corresponding to the web page.
Specifically, the risk level is a safety index for evaluating whether a web page is safe. Risk levels may be preset as different grades ordered from low to high: the higher the risk level, the higher the risk of the corresponding web page. For example, risk levels may be set from level 1 to level 5, indicating progressively higher risk. The website domain name is the identifier of a website, and multiple web pages may exist under the same website domain name. For example, the website domain name of the website "Baidu" is "baidu.com", and there are multiple web pages under that domain name, such as the "Baidupedia" page. A risk database is provided in the server. The risk database is a database that stores web pages whose risk level exceeds the preset level, and such web pages are regarded as high-risk. The web page identification platform obtains from the server a web page whose identified risk level exceeds the preset level, obtains the web page address of that web page, and extracts the website domain name from the web page address. It should be noted that the web page address is the unique identifier that each web page has on the network, and may be a URL (Uniform Resource Locator) address.
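As a rough illustration of the domain-extraction part of S202, the sketch below uses Python's standard `urllib.parse`. The function name `extract_domain` is hypothetical and the patent does not prescribe an implementation. Note also that this returns the full host name (for example `baike.baidu.com`); reducing it to a registrable domain such as `baidu.com` would need a public-suffix list, which is omitted here.

```python
from urllib.parse import urlsplit


def extract_domain(page_url: str) -> str:
    """Return the lower-cased host name of a page URL, without any port."""
    host = urlsplit(page_url).hostname
    if host is None:
        # URLs without a scheme and "//" have no network location to parse.
        raise ValueError(f"no host in URL: {page_url!r}")
    return host
```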
S204: Obtain the network address of the website according to the website domain name.
Specifically, a network address is a communication identifier used when computers on a network connect to or communicate with one another. It may be the address of a computer within a network, uniquely identifying that computer device, which uses the address as its communication identifier when communicating with other computers. For example, the network address may be an IP (Internet Protocol) address, and each website domain name corresponds to a network address. Further, the web page identification platform looks up the network address of the website according to the website domain name. This may be done as follows: the platform sends test data to the web server of the website according to the obtained website domain name and, when the web server returns response data, extracts the corresponding network address from the received response data.
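The text obtains the address from a server response. A simpler substitute, shown here as a sketch and not as the patent's mechanism, is an ordinary DNS lookup through Python's `socket.getaddrinfo`. The function names are hypothetical, and the resolver is passed in as a parameter so the lookup can be replaced in tests or by another data source.

```python
import socket
from typing import Callable, List


def system_resolver(domain: str) -> List[str]:
    """Resolve a domain to its IPv4 addresses using the operating system resolver."""
    infos = socket.getaddrinfo(domain, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def resolve_addresses(domain: str, resolver: Callable[[str], List[str]] = system_resolver) -> List[str]:
    """Return the addresses for a domain, or an empty list if resolution fails."""
    try:
        return resolver(domain)
    except OSError:
        # socket.gaierror is a subclass of OSError and is raised for unknown hosts.
        return []
```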
S206: Search for domain names associated with the network address. When a domain name associated with the network address is found, take the associated domain name as a domain name to be identified.
Specifically, associated domain names are domain names that share the same network address. Websites with different domain names can share one network address when they are hosted on the same web server. In that case, each website corresponds to a different access port on the web server, and the websites are distinguished by those ports. Further, different network addresses and their corresponding website domain names are pre-stored in the web page identification platform. Using the obtained network address, the platform looks up the domain names associated with it. An associated domain name is different from the website domain name whose identified risk level exceeds the preset level. When a domain name associated with the network address of the website whose risk level exceeds the preset level is found, that associated domain name is taken as a domain name to be identified.
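The lookup in S206 can be sketched as a reverse mapping from network address to domain names. In the sketch below the pre-stored data is modeled as an in-memory dictionary; that model and the function name are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def find_associated_domains(
    address: str,
    known_domain: str,
    address_db: Dict[str, Set[str]],
) -> List[str]:
    """Return the other domains stored for `address`, excluding the known risky one."""
    candidates = address_db.get(address, set())
    # The already identified domain is not itself a "domain to be identified".
    return sorted(d for d in candidates if d != known_domain)
```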
S208: Obtain the web page data in the website corresponding to the domain name to be identified.
Specifically, web page data is the content displayed on a web page, and may be text data, image data, numeric data, and so on. A website may contain different web pages. Once the web page identification platform has found an associated domain name through the network address of the website whose identified risk level exceeds the preset level, and has taken it as the domain name to be identified, the platform locates the website corresponding to that domain name and obtains the web page data of the different web pages it contains, for example the text data displayed on those pages.
S210: According to the acquired web page data, obtain the web pages corresponding to the domain name to be identified whose risk level exceeds the preset level.
Specifically, the web page identification platform examines the acquired web page data. When suspicious data is present in the web page data, the web page containing that data is taken as a web page whose risk level exceeds the preset level. For example, the platform may scan the text data of the acquired web page data character by character and, when suspicious text data is recognized, treat the web page containing it as a web page, corresponding to the domain name to be identified, whose risk level exceeds the preset level. It should be noted that suspicious data may be preset data: when a web page contains the preset data, the web page is one whose risk level exceeds the preset level. Suspicious data may be text data, image data, numeric data, and so on. For example, suspicious data may be set to words such as "bank", "points", or "prize".
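A minimal sketch of the text check in S210, assuming that page text has already been fetched and that "suspicious data" is a preset list of keywords. The keyword tuple below borrows two of the example words from the text; the function name and the case-insensitive substring matching are assumptions.

```python
from typing import Dict, Iterable, List

# Example preset suspicious words; a real deployment would configure its own list.
SUSPICIOUS_KEYWORDS = ("bank", "prize")


def find_risky_pages(pages: Dict[str, str], keywords: Iterable[str] = SUSPICIOUS_KEYWORDS) -> List[str]:
    """Return the URLs whose page text contains any suspicious keyword."""
    lowered = [k.lower() for k in keywords]
    return [
        url
        for url, text in pages.items()
        if any(k in text.lower() for k in lowered)
    ]
```

Plain substring matching will also flag legitimate pages that mention these words, which is why other embodiments in this document add blacklist, whitelist, and identifier checks.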
In this embodiment, the web page identification platform uses one web page whose identified risk level exceeds the preset level to find other associated domain names, and then examines the web page data in the websites corresponding to those associated domain names to obtain further web pages whose risk level exceeds the preset level. A single web page whose risk level exceeds the preset level thus leads to other, related web pages whose risk level exceeds the preset level, which improves search efficiency.
In one of the embodiments, step S206, the step of searching for domain names associated with the network address, may include the following flow:
The network address is matched against the network addresses pre-stored in the address association database. Specifically, the address association database is a database that stores different network addresses and the domain names corresponding to them. The web page identification platform obtains a web page whose risk level exceeds the preset level, obtains its web page address, extracts the corresponding website domain name from the web page address, and obtains from the website domain name the network address of the website whose risk level exceeds the preset level. The platform then matches that network address one by one against all the network addresses pre-stored in the address association database, traversing every network address stored there.
When the network address matches a network address pre-stored in the address association database, the candidate associated domain name associated with the pre-stored network address is obtained. Specifically, a candidate associated domain name is a domain name associated with a network address pre-stored in the address association database. The domain name may be the identifier of a website, and once a network address is found in the address association database, the candidate associated domain name corresponding to it can be obtained through the association. The platform matches the network address of the website whose identified risk level exceeds the preset level one by one against all the network addresses stored in the address association database, selects the stored network address that matches, and obtains from the address association database the candidate associated domain name associated with the matched network address.
The validity deadline of the candidate associated domain name is obtained from the address association database. Specifically, the validity deadline is the last time at which the candidate associated domain name is valid. It may be a year, a specific month of a year, or a specific date, for example 2017, December 2017, or December 31, 2017. When the platform has matched the network address of the web page whose identified risk level exceeds the preset level against a network address stored in the address association database, and has obtained the candidate associated domain name associated with the matched network address, the platform obtains from the address association database the validity deadline corresponding to that candidate associated domain name.
If the current time is earlier than or equal to the validity deadline, the candidate associated domain name is taken as a domain name to be identified. Specifically, the current time is the time at which the candidate associated domain name is obtained, and may be the system time. Like the validity deadline, it may be expressed as a year, a specific month of a year, or a specific date. The platform obtains the candidate associated domain name and the current time, and compares the current time with the validity deadline corresponding to the candidate associated domain name. If the current time is earlier than or equal to the validity deadline, the candidate associated domain name has not expired and is valid, so the platform takes it as an associated domain name and then as a domain name to be identified.
It should be noted that, in this embodiment, the address association database may be a passive DNS (passive Domain Name System) database. The web page identification platform matches the network address of the website whose identified risk level exceeds the preset level against the network addresses stored in the passive DNS database. When the match succeeds, the platform obtains the candidate associated domain name corresponding to the matched network address in the passive DNS database. When the current time at which the candidate associated domain name is obtained is earlier than or equal to its validity deadline, the candidate associated domain name is taken as an associated domain name.
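The validity check described above can be sketched as a simple date filter over candidate records. The `AssociationRecord` type and its field names are hypothetical, since the patent does not define a record layout; dates are used at day granularity, which is one of the granularities the text mentions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class AssociationRecord:
    """One candidate domain stored for a network address, with its validity deadline."""
    domain: str
    valid_until: date


def still_valid(records: Iterable[AssociationRecord], today: date) -> List[str]:
    """Keep candidates whose deadline has not passed (today <= valid_until)."""
    return [r.domain for r in records if today <= r.valid_until]
```

Passing `today` in explicitly, instead of reading the system clock inside the function, keeps the comparison easy to test; a caller would typically supply `date.today()`.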
It should be noted that a web page whose risk level exceeds the preset level may be a high-risk web page that disguises itself as a normal web page and, when a user visits it, steals the user's bank card information and the like, thereby threatening the user's property, for example a phishing page. It may also be another kind of web page to which access is restricted for risk management purposes. For example, where an enterprise sets access permissions for certain web pages, the web pages to which access is restricted may be regarded as web pages whose risk level exceeds the preset level. In this embodiment, the web page identification platform obtains the candidate associated domain name from the pre-stored network address that matched in the address association database, and compares the current time with the validity deadline corresponding to the candidate associated domain name. When the current time is earlier than or equal to the validity deadline, the candidate associated domain name is valid and can be taken as an associated domain name and then as a domain name to be identified. Filtering out invalid candidate associated domain names directly by comparing the current time with the validity deadline is simple to carry out, improves efficiency, and improves the accuracy with which associated domain names are selected.
In one of the embodiments, the web page identification method may further include the following step, which may be performed after step S206, that is, after searching for a domain name associated with the network address. The step may include:
When no domain name associated with the network address is found, the registration data corresponding to the domain name of the website is obtained, and a corresponding domain name is queried according to the registration data and used as a domain name to be identified. Specifically, registration data refers to data showing the details of the user who registered the domain name of the website. Registration data may be text data, image data, numerical data and the like; for example, it may be a personal name, a personal mailbox, a personal telephone number or a personal photo. When the web page identification platform fails to match any pre-stored network address in the address information repository, it cannot obtain a to-be-matched associated domain name associated with a pre-stored network address. The platform then obtains the registration data corresponding to the domain name of the website whose identified risk level exceeds the preset level, and queries the domain names corresponding to that registration data. A domain name found in this way that differs from the domain name of the website whose risk level exceeds the preset level is used as a domain name to be identified.
In this embodiment, when no domain name associated with the network address of the website whose identified risk level exceeds the preset level is found in the address information repository, different domain names are queried according to the registration data of that website and used as domain names to be identified. In other words, associated domain names can still be queried through the registration information, and using them as domain names to be identified improves the accuracy of finding websites whose risk level exceeds the preset level.
In one of the embodiments, the above step of obtaining the registration data corresponding to the domain name of the website and querying a corresponding domain name according to the registration data as a domain name to be identified may include the following flow:
The registration data corresponding to the domain name of the website is obtained, and the conversion logic corresponding to the registration data is selected from a conversion logic library. Specifically, the conversion logic library is a database that stores conversion logic for converting registration data into registration data in a fixed format. Conversion logic refers to a rule for converting registration data; for example, it may replace characters in the registration data with preset characters, or delete invalid characters. Further, when the web page identification platform obtains a web page whose identified risk level exceeds the preset level, it extracts the corresponding website domain name from the web page address and then obtains, according to the website domain name, the registration data corresponding to that web page. When the registration data obtained is not in the prescribed format, the platform selects the conversion logic corresponding to the registration data from the conversion logic library according to the type of the registration data, so that the registration data can be brought into the prescribed display format. For example, the platform extracts the domain name of the website whose identified risk level exceeds the preset level and obtains the corresponding registration data, such as the registered name, registered mailbox and registered telephone number. If the registered name contains spaces and the registered telephone number contains hyphens, then, according to the type of registration data, the platform selects from the conversion logic library the conversion logic that brings the registered name into line with the display rule, namely deleting the spaces in the registered name, and the conversion logic that brings the registered telephone number into line with the display rule, namely deleting the hyphens in the registered telephone number.
The registration data is converted according to the conversion logic to obtain converted registration data. Specifically, once the web page identification platform has selected the conversion logic, that is, the rule for converting the registration data (such as replacing characters in the registration data with preset characters or deleting invalid characters), it converts the registration data according to that logic. The converted registration data is then in the prescribed display format. For example, where the registration data includes a registered name, a registered mailbox and a registered telephone number, and the platform has selected the conversion logic for the registered name and the registered telephone number, it deletes the invalid space characters in the registered name and deletes the hyphens in the registered telephone number according to the respective conversion logic.
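The selection of conversion logic by data type, followed by the conversion itself, can be sketched as below. The dictionary `CONVERSION_LOGIC`, the field names and the helper functions are hypothetical; they merely model a conversion logic library holding the two rules used in the example above (deleting spaces from names and hyphens from telephone numbers).

```python
def strip_spaces(value):
    """Delete space characters, e.g. inside a registered name."""
    return value.replace(" ", "")


def strip_hyphens(value):
    """Delete hyphens, e.g. inside a registered telephone number."""
    return value.replace("-", "")


# Hypothetical conversion logic library: registration data type -> conversion rule.
CONVERSION_LOGIC = {"name": strip_spaces, "phone": strip_hyphens}


def convert_registration(data, logic=CONVERSION_LOGIC):
    """Apply the rule selected by each field's type; fields without a rule stay unchanged."""
    return {field: logic.get(field, lambda v: v)(value)
            for field, value in data.items()}
```

For instance, a record with the name `Zhang San` and the telephone number `0755-1234-5678` would be converted to `ZhangSan` and `075512345678`, while a mailbox field, for which this sketch stores no rule, passes through unchanged.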
The converted registration data is matched against the information data stored in an information repository. Specifically, the information repository is a database that stores different registration information and the domain names associated with that registration information. It may store registered names, registered mailboxes, registered telephone numbers and the like, which may correspond to one another, together with the website domain names associated with the registration information. Information data refers to data showing the details of the registrant of the relevant domain name; it may be text data, numerical data, image data and the like, for example a name, telephone number, mailbox or photo. Specifically, the web page identification platform matches the obtained registration data one by one against the information data stored in the information repository. For example, where the registration data obtained is a registered name, a registered mailbox and a registered telephone number, the platform converts each of them according to the conversion rules, and then matches the converted registered name against the names stored in the information repository, the converted registered telephone number against the stored telephone numbers, and the converted registered mailbox against the stored mailboxes.
When the converted registration data successfully matches information data stored in the information repository, the domain name associated with the matched information data is obtained as a domain name to be identified. Specifically, the web page identification platform matches the converted registration data one by one against the information data stored in the information repository; when corresponding information data is found, the platform obtains the domain name associated with it and uses that domain name as a domain name to be identified. One option is for the platform to match every item of the registration data against the stored information data and to obtain the associated domain name only when every item matches. In that case the platform first matches the converted registered name against the names stored in the information repository; when that match succeeds, it matches the registered mailbox against the mailbox stored for that name; when the mailbox also matches, it matches the registered telephone number against the telephone number stored for that name and mailbox; and when the telephone number matches as well, it extracts the domain name associated with the matched name, telephone number and mailbox and uses it as a domain name to be identified. It should be noted that the platform may instead match only an arbitrary item of the registration data against the stored information data and, when the match succeeds, use the domain name associated with the matched information data as a domain name to be identified. For example, the converted registered name is matched against the names stored in the information repository, and the domain name associated with a matched name is extracted directly as a domain name to be identified.
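The two matching modes described above (every item must match, or any single item suffices) can be sketched as follows. The list `INFO_REPOSITORY`, its field names and the function `domains_by_registration` are hypothetical stand-ins for the information repository; the exclusion of the already known domain reflects the statement that only domain names differing from that of the known website are used.

```python
# Hypothetical information repository: registrant details and the associated domain.
INFO_REPOSITORY = [
    {"name": "ZhangSan", "email": "zs@example.test",
     "phone": "075512345678", "domain": "shop-example.test"},
    {"name": "ZhangSan", "email": "other@example.test",
     "phone": "01000000000", "domain": "bank-example.test"},
]


def domains_by_registration(reg, known_domain, require_all=True, repo=INFO_REPOSITORY):
    """Return domains whose stored registrant data matches `reg`.

    require_all=True  -> name, mailbox and telephone number must all match.
    require_all=False -> a match on any single item is enough.
    The already known high-risk domain is excluded from the result.
    """
    fields = ("name", "email", "phone")
    combine = all if require_all else any
    return [record["domain"] for record in repo
            if record["domain"] != known_domain
            and combine(reg.get(f) == record[f] for f in fields)]
```

With the sample data, strict matching on all three items returns only the first record's domain, while matching on any single item also returns the second record, because the registered name alone coincides.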
It should be noted that, in this embodiment, the information repository may be a WHOIS database. When the web page identification platform has obtained the domain name of the website whose identified risk level exceeds the preset level, and has obtained the registration data of that website according to the domain name, it may match the registration data against the information data stored in the WHOIS database. When the match succeeds, the domain name associated with the matched information data is obtained as a domain name to be identified.
In this embodiment, the web page identification platform first converts the obtained registration data according to the conversion logic to obtain converted registration data that conforms to the display rule, which improves the accuracy of identifying associated domain names to be identified. The converted registration data is then matched against the information data stored in the information repository, and when the match succeeds, the domain name associated with the matched information data is obtained as a domain name to be identified. Different domain names to be identified can thus be obtained from the registration information, which improves identification efficiency.
In one example, the step of obtaining, according to the acquired web page data, the web pages whose risk level exceeds the preset level and that correspond to the domain name to be identified may include:
The web page data is matched against the first filter data stored in a preset blacklist; when the web page data successfully matches the first filter data, a suspicious tag is added to the domain name to be identified. Specifically, the blacklist stores data whose risk level exceeds the preset level. Such data may be text data, image data, numerical data and the like; for example, the blacklist may store strings such as "bank" and "integration". The first filter data refers to data whose risk level exceeds the preset level; when a web page contains first filter data, the page may be a web page whose risk level exceeds the preset level. The suspicious tag is a mark indicating that the risk level of the domain name to be identified may exceed the preset level. Specifically, after the web page identification platform has extracted the web page data from all web pages contained in the website corresponding to the domain name to be identified, it matches the extracted web page data one by one against the first filter data stored in the preset blacklist. When any of the web page data successfully matches any first filter data stored in the blacklist, the platform adds a suspicious tag to the domain name to be identified that corresponds to the website of the web page from which the data came. It should be noted that a match count threshold may also be set: the platform matches all obtained web page data one by one against the first filter data stored in the blacklist, and adds the suspicious tag to the corresponding domain name to be identified only when a preset number of first filter data items stored in the blacklist are matched. The match count threshold may be preset to 1, 3, 4 and so on. Alternatively, the suspicious tag may be added to the domain name to be identified when the web page data of a preset number of web pages contained in the corresponding website successfully matches the first filter data in the blacklist.
The web page data in the website corresponding to the domain name to be identified that carries the suspicious tag is matched against the second filter data stored in a preset whitelist. Specifically, the whitelist is a database that stores trusted data, that is, data whose risk level is less than or equal to the preset level. Trusted data may be text data, image data, numerical data and the like; for example, the whitelist may store strings such as "lottery industry". The second filter data refers to data whose risk level is less than or equal to the preset level, in other words trusted data; when a web page contains second filter data, the website may be a trustworthy website. Specifically, the web page identification platform extracts the domain names to be identified that carry the suspicious tag, and matches the web page data on all web pages contained in the websites of those domain names one by one against the second filter data stored in the preset whitelist. When the web page data on the web pages contained in the website of a tagged domain name successfully matches second filter data pre-stored in the whitelist, the suspicious tag carried by that domain name to be identified is deleted. It should be noted that the suspicious tag may instead be deleted only when the web page data of a preset number of web pages contained in the website of the tagged domain name successfully matches the second filter data stored in the preset whitelist.
When the web page data does not match the second filter data, the domain name to be identified that carries the suspicious tag is extracted, and the web pages in the website corresponding to that domain name are obtained as web pages whose risk level exceeds the preset level. Specifically, when the web page data contained in the website of a tagged domain name to be identified does not match the second filter data, the domain name still carries the suspicious tag. The web page identification platform then extracts the domain names to be identified that still carry the suspicious tag, obtains the corresponding websites, and takes the web pages contained in those websites as web pages whose risk level exceeds the preset level.
In this embodiment, the web page data is filtered using the first filter data stored in the blacklist and the second filter data stored in the whitelist to obtain the required web pages whose risk level exceeds the preset level. This prevents a web page that carries data whose risk level exceeds the preset level, but is in fact trustworthy, from being misidentified. The two-layer filtering improves the accuracy of identifying web pages whose risk level exceeds the preset level.
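The two-layer filtering can be sketched as below. This is an illustration only: plain substring matching on text, the sample sets `BLACKLIST` and `WHITELIST`, and the function name `is_high_risk` are assumptions, since the source does not fix a matching algorithm and also allows image or numerical filter data.

```python
# Hypothetical filter data; real lists may also hold image or numerical data.
BLACKLIST = {"bank", "integration"}   # first filter data
WHITELIST = {"lottery"}               # second filter data


def is_high_risk(pages, blacklist=BLACKLIST, whitelist=WHITELIST, threshold=1):
    """Two-layer filter over the page texts of one domain to be identified.

    Layer 1: tag the domain as suspicious if at least `threshold` distinct
             blacklist entries occur in its page data.
    Layer 2: delete the tag again if any whitelist entry occurs.
    Returns True when the suspicious tag survives both layers.
    """
    text = " ".join(pages).lower()
    black_hits = sum(1 for term in blacklist if term in text)
    if black_hits < threshold:
        return False  # no suspicious tag was added
    white_hit = any(term in text for term in whitelist)
    return not white_hit  # tag is kept only without a whitelist match
```

The `threshold` parameter models the match count threshold mentioned above; with the default of 1, a single blacklist hit is enough to add the suspicious tag.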
In one of the embodiments, the web page identification method may further include:
When, after data identification using the preset blacklist and the preset whitelist, no domain name to be identified carries a suspicious tag, the identifier corresponding to the domain name to be identified is obtained. Specifically, the identifier is a distinctive mark representing the website corresponding to the domain name to be identified. It may be an enterprise mark, for example an enterprise logo. Specifically, after the web page identification platform has performed data identification, using the preset blacklist and the preset whitelist, on the web page data of the web pages contained in the websites of all obtained domain names to be identified, and none of the domain names to be identified carries a suspicious tag, no web page whose risk level exceeds the preset level has been identified through the web page data. The platform then obtains the identifier corresponding to each domain name to be identified.
The identifier is matched against the security identifiers pre-stored in a security identifier repository. Specifically, the security identifier repository is a database that stores the identifiers of trustworthy websites and the website domain names corresponding to those identifiers. A security identifier is the mark of a trusted website and may be the mark of the enterprise operating a safe web page; for example, the security identifier may be the logo of the Industrial and Commercial Bank web page or the logo of the Ping An Group web page. Specifically, the web page identification platform matches the obtained identifier one by one against the security identifiers pre-stored in the security identifier repository. For example, where the identifier obtained for the domain name to be identified is the Ping An Group logo, that logo is matched against the security identifiers stored in the security identifier repository.
When a security identifier successfully matches the identifier corresponding to the domain name to be identified, the secure domain name associated with the matched security identifier in the security identifier repository is obtained, and the secure domain name is matched against the domain name to be identified. Specifically, when the identifier corresponding to the domain name to be identified matches a security identifier stored in the security identifier repository, the domain name to be identified may be a secure domain name, so further matching and identification are needed. The web page identification platform therefore obtains the secure domain name associated with the matched security identifier in the security identifier repository and matches it against the domain name to be identified. For example, when the identifier obtained for the domain name to be identified is the Ping An Group logo and it matches the Ping An Group logo stored in the security identifier repository, the platform obtains the domain name "pingan.com" associated with that logo in the repository and matches the domain name to be identified against "pingan.com".
When the secure domain name does not match the domain name to be identified, the web pages in the website corresponding to the domain name to be identified are taken as web pages whose risk level exceeds the preset level. Specifically, when the web page identification platform fails to match the domain name to be identified with the secure domain name, the identifier corresponding to the domain name to be identified is a forged security identifier, and the web pages contained in the website corresponding to that domain name are taken as web pages whose risk level exceeds the preset level. For example, where the identifier obtained for the domain name to be identified is the Ping An Group logo and it matches a security identifier stored in the security identifier repository, the platform obtains the associated domain name "pingan.com" from the repository. When the domain name to be identified is not "pingan.com", the domain name to be identified has forged the Ping An Group logo, and the web pages in its website are taken as web pages whose risk level exceeds the preset level.
In this embodiment, when identification based on the web page data yields no suspicious domain name to be identified, further identification is performed according to the identifier carried by the domain name to be identified, so as to determine whether the web pages contained in the corresponding website are web pages whose risk level exceeds the preset level. Using multiple identification methods improves the accuracy of identifying web pages whose risk level exceeds the preset level.
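The identifier check can be sketched as follows. Comparing logos in practice would require some form of image matching, which the source does not specify; here a string key stands in for the logo, and `SECURE_IDENTIFIERS` and `forges_identifier` are hypothetical names. Only the pairing of the Ping An Group logo with "pingan.com" comes from the example above.

```python
# Hypothetical security identifier repository: identifier key -> secure domain name.
# A string key stands in for the logo image; real matching would compare images.
SECURE_IDENTIFIERS = {"pingan-group-logo": "pingan.com"}


def forges_identifier(domain, identifier, secure=SECURE_IDENTIFIERS):
    """Return True if `domain` shows a known security identifier it does not own."""
    secure_domain = secure.get(identifier)
    if secure_domain is None:
        return False  # identifier is not in the repository; this check draws no conclusion
    # The identifier is trusted, so any other domain carrying it has forged it.
    return domain != secure_domain
```

A domain other than "pingan.com" that presents the stored logo would be flagged, whereas "pingan.com" itself would not.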
In one of the embodiments, the following steps may further be included after step S210, that is, after the step of obtaining, according to the acquired website data, the web pages whose risk level exceeds the preset level and that correspond to the domain name to be identified:
The keywords of the web page data of the web pages whose risk level exceeds the preset level are extracted, and a corresponding category label is added, according to the keywords, to the domain name to be identified whose risk level exceeds the preset level. Specifically, a category label is a mark of the type of the web page data. Category labels may denote different kinds of risk; for example, a category label may be a bank category label or a shopping category label. Specifically, once the web page identification platform has identified a web page whose risk level exceeds the preset level, it extracts the keywords of the web page data and, according to the extracted keywords, adds the corresponding category label to the domain name to be identified that is associated with the website of the web page containing the web page data. For example, if the platform extracts the keywords "integration" and "bank" from different web pages whose risk level exceeds the preset level, it adds the corresponding category label, namely the "bank" label or the "integration" label, to the domain name to be identified associated with the website of the respective web page.
The category label of the domain name to be identified whose risk level exceeds the preset level is matched against the stored category labels. Specifically, the web page identification platform matches each category label added to the domain name to be identified one by one against the category labels it has stored, until all stored category labels have been traversed. For example, where the labels added to the domain name to be identified are "bank" and "integration", the label "bank" is matched one by one against the stored category labels, and then the label "integration" is matched one by one against the stored category labels.
When the match fails, the category label of the domain name to be identified whose risk level exceeds the preset level is added to the stored category labels, and the web pages whose risk level exceeds the preset level are stored under that category label. Specifically, when an added category label does not match any stored category label, it is a new category label. The unmatched category label is then added to the stored category labels, and the web pages whose risk level exceeds the preset level that are contained in the website of the domain name to be identified carrying that label are added under it. For example, where the category labels added to the domain name to be identified are "bank" and "integration", each is matched one by one against the stored category labels. When the label "bank" fails to match, it is added to the stored category labels, and the web pages whose risk level exceeds the preset level that are contained in the website of the domain name to be identified carrying the "bank" label are added under that label.
It should be noted that the web page identification platform may, at a preset interval, send the updated category labels and the corresponding web pages whose risk level exceeds the preset level to a server for storage. For example, at a preset interval of one hour, the updated category labels and the corresponding web pages whose risk level exceeds the preset level are sent to the server for storage.
In this embodiment, the keywords of the web page data in the web pages whose risk level exceeds the preset level are extracted, and a corresponding category label is added, according to the keywords, to the domain name to be identified whose risk level exceeds the preset level. If an added category label does not match any stored category label, it is added to the stored category labels and the web pages whose risk level exceeds the preset level are stored under it. The set of stored category labels is thus expanded step by step, which broadens applicability.
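The labelling and storage steps can be sketched as follows. The mapping `KEYWORD_TO_LABEL`, the substring-based keyword extraction and both function names are assumptions made for illustration. The sketch also assumes that a page is filed under an existing label when the label already matches, a case the text above does not spell out.

```python
# Hypothetical mapping from extracted keywords to category labels.
KEYWORD_TO_LABEL = {"bank": "bank", "integration": "integration"}


def labels_for(page_text, mapping=KEYWORD_TO_LABEL):
    """Return the sorted category labels whose keyword occurs in the page text."""
    text = page_text.lower()
    return sorted({label for keyword, label in mapping.items() if keyword in text})


def store_page(stored, labels, page_url):
    """File `page_url` under each label, creating labels that are not stored yet."""
    for label in labels:
        # setdefault adds a new category label when no stored label matches.
        stored.setdefault(label, []).append(page_url)
    return stored
```

Starting from a store that only knows a "shopping" label, filing a page labelled "bank" creates the new label and stores the page under it, which mirrors the step-by-step growth of the stored labels described above.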
In one of the embodiments, the web page whose risk level exceeds the preset level is a phishing web page. By way of example, when the web page identification platform obtains an identified phishing web page, it extracts the web page domain name of the phishing web page and obtains, according to that domain name, the network address of the website corresponding to the phishing web page. The platform then searches for domain names associated with the network address. To do so, the platform may match the network address of the website corresponding to the phishing web page against the network addresses pre-stored in the address information repository. When the match succeeds, the platform obtains the to-be-matched associated domain name associated with the pre-stored network address and judges, according to its validity period, whether it is valid: when the current time is less than or equal to the validity deadline, the to-be-matched associated domain name is extracted as an associated domain name. When the platform thus finds a domain name associated with the network address, it uses that associated domain name as a domain name to be identified. When no domain name associated with the network address is found in this way, the registration data corresponding to the domain name of the website is obtained and a corresponding domain name is queried according to the registration data as a domain name to be identified. This may be done as follows: the platform obtains the registration data corresponding to the domain name of the website of the phishing web page, selects the conversion logic corresponding to the registration data from the conversion logic library, converts the registration data according to the conversion logic to obtain converted registration data, and matches the converted registration data against the information data stored in the information repository; when the match succeeds, the domain name associated with the matched information data is obtained as a domain name to be identified. Domain names to be identified are thus first queried through the domain names associated with the network address of the website of the identified phishing web page and, when none is found, through the registration data corresponding to the domain name of that website. Querying in these two ways helps ensure that nothing is omitted.
When the web page identification platform has obtained a domain name to be identified, it obtains the web page data of the web pages contained in the corresponding website and matches the web page data against the first filter data stored in the preset blacklist. When the match succeeds, a suspicious tag is added to the domain name to be identified that corresponds to the website of the web page from which the data came. The web page data in the website of the domain name to be identified that carries the suspicious tag is then matched against the second filter data stored in the preset whitelist. When the web page data does not match the second filter data, the domain name to be identified that carries the suspicious tag is extracted, and the web pages in its website are taken as phishing web pages. Further, when data matching against both the preset blacklist and the preset whitelist has been performed and no domain name to be identified carries a suspicious tag, the identifier corresponding to the domain name to be identified, such as an enterprise logo, is obtained and matched against the security identifiers pre-stored in the security identifier repository. When the match succeeds, the secure domain name associated with the matched security identifier in the repository is obtained and matched against the domain name to be identified. When that match fails, the domain name to be identified is posing as a secure domain name, and the web pages in its website are taken as phishing web pages. Whether the web pages contained in the website of a domain name to be identified are phishing web pages is thus determined by querying both the web page data of those pages and the web page identifier. This two-stage detection using web page data and web page identifiers improves the accuracy of phishing web page detection.
Then, once a phishing webpage has been identified, keywords are extracted from the web data of the phishing webpage, and a category label is added, according to the keywords, to the domain name to be identified corresponding to the phishing webpage. If this category label does not match any stored category label, the category label of the domain name to be identified corresponding to the phishing webpage is added, and the phishing webpage is then stored under that category label.
In this embodiment, multiple domain names to be identified can be found by association from a single phishing webpage, which improves efficiency and broadens applicability. Whether the webpages under a domain name to be identified are phishing webpages is determined by querying both the web data of the webpages in the corresponding website and the webpage identifier, so the result is accurate. The phishing webpages found are classified by category, which facilitates subsequent query and push.
In one embodiment, referring to Fig. 3, a schematic structural diagram of a webpage identification device is provided. The webpage identification device 300 may include:
A first acquisition module 310, configured to obtain a webpage whose identified risk level exceeds a predetermined level and to extract the website domain name corresponding to the webpage.
A second acquisition module 320, configured to obtain the network address corresponding to the website according to the website domain name.
A searching module 330, configured to look up domain names associated with the network address and, when a domain name associated with the network address is found, to take the associated domain name as a domain name to be identified.
A third acquisition module 340, configured to obtain the web data in the website corresponding to the domain name to be identified.
An identification module 350, configured to obtain, according to the acquired web data, the webpages corresponding to the domain name to be identified whose risk level exceeds the predetermined level.
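The cooperation of the five modules can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `find_risky_pages`, the injected callables `resolve`, `fetch_pages` and `is_risky`, and the in-memory `domains_by_address` index are all hypothetical stand-ins for the modules described above, and excluding the seed domain from the candidates is an assumption of this sketch.

```python
from typing import Callable, Dict, Iterable, List, Set


def find_risky_pages(
    seed_domain: str,
    resolve: Callable[[str], str],
    domains_by_address: Dict[str, Set[str]],
    fetch_pages: Callable[[str], Iterable[str]],
    is_risky: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Return {domain: [risky page data, ...]} for domains sharing the seed's address.

    seed_domain is the website domain name extracted from a webpage that was
    already identified as risky (the role of the first acquisition module).
    """
    # Second acquisition module: website domain name -> network address.
    address = resolve(seed_domain)
    # Searching module: domain names associated with that network address.
    # Dropping the seed itself is an assumption made for this sketch.
    candidates = domains_by_address.get(address, set()) - {seed_domain}

    result: Dict[str, List[str]] = {}
    for domain in sorted(candidates):
        # Third acquisition module: web data of the candidate's website.
        # Identification module: keep only pages judged risky.
        risky = [page for page in fetch_pages(domain) if is_risky(page)]
        if risky:
            result[domain] = risky
    return result
```

Passing the resolver and the page fetcher as parameters keeps the sketch free of network access, so the association logic can be exercised with plain dictionaries.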
In one embodiment, the searching module 330 may include:
A first matching unit, configured to match the network address against the network addresses pre-stored in the address association library.
A domain name acquisition unit, configured to obtain, when the network address matches a network address pre-stored in the address association library, the candidate associated domain name associated with the pre-stored network address.
A time acquisition unit, configured to obtain the validity deadline of the candidate associated domain name.
An extraction unit, configured to extract the candidate associated domain name as the domain name to be identified if the current time is earlier than or equal to the validity deadline.
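The lookup with a validity deadline can be sketched as follows. This is only an illustration under stated assumptions: the `AssociatedDomain` record, the dictionary-based `address_db` and the function name `domains_to_identify` are hypothetical, since the text does not specify how the address association library is stored.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class AssociatedDomain:
    """Hypothetical record in the address association library."""
    name: str
    valid_until: datetime  # validity deadline of the association


def domains_to_identify(
    address: str,
    address_db: Dict[str, List[AssociatedDomain]],
    now: datetime,
) -> List[str]:
    """Return associated domain names whose validity deadline has not passed."""
    # First matching unit: match the address against the pre-stored addresses.
    records = address_db.get(address)
    if records is None:
        return []  # no match; the text falls back to registration data here
    # Time acquisition and extraction units: keep entries with now <= deadline.
    return [record.name for record in records if now <= record.valid_until]
```

The comparison is inclusive, matching the "earlier than or equal to" condition of the extraction unit.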
In one embodiment, the webpage identification device may also include:
A query module, configured to obtain, when no domain name associated with the network address is found, the registration data corresponding to the domain name of the website, and to query the corresponding domain name according to the registration data as the domain name to be identified.
In one embodiment, the query module may include:
A selection unit, configured to obtain the registration data corresponding to the domain name of the website and to select the conversion logic corresponding to the registration data from the conversion logic library.
A conversion unit, configured to convert the registration data according to the conversion logic to obtain converted registration data.
A second matching unit, configured to match the converted registration data against the information data stored in the information repository.
A to-be-identified domain name acquisition unit, configured to obtain, when the converted registration data matches information data stored in the information repository, the domain name associated with the matched information data as the domain name to be identified.
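One possible reading of the query module is sketched below in Python. The text does not say what the conversion logic does, so this sketch assumes it normalises each registration field (lower-casing an e-mail address, reducing a phone number to digits) before matching. The names `CONVERSIONS`, `domains_from_registration` and the `(field, value)` keys of `info_store` are hypothetical.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical conversion logic library: one normaliser per registration field.
CONVERSIONS: Dict[str, Callable[[str], str]] = {
    "email": lambda value: value.strip().lower(),
    "phone": lambda value: re.sub(r"\D", "", value),  # keep digits only
}


def domains_from_registration(
    registration: Dict[str, str],
    info_store: Dict[Tuple[str, str], Set[str]],
) -> List[str]:
    """Return domain names whose stored information data matches the registration data."""
    found: Set[str] = set()
    for field, raw in registration.items():
        # Selection unit: choose the conversion logic for this kind of data.
        convert = CONVERSIONS.get(field)
        if convert is None:
            continue  # no conversion logic known for this field
        # Conversion unit, then second matching unit.
        key = (field, convert(raw))
        found |= info_store.get(key, set())
    return sorted(found)
```

Normalising before matching means two registrations that differ only in case or punctuation lead to the same stored entry, which is the assumed purpose of the conversion step.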
In one embodiment, the identification module 350 may also include:
A first filter unit, configured to match the web data against the first filter data stored in the preset blacklist and, when the web data matches the first filter data, to add a suspicious label to the domain name to be identified.
A second filter unit, configured to match the web data in the website corresponding to the domain name to be identified carrying the suspicious label against the second filter data stored in the preset whitelist.
A labelled domain name acquisition unit, configured to extract, when the web data does not match the second filter data, the domain name to be identified carrying the suspicious label, and to take the webpages in the website corresponding to that domain name as webpages whose risk level exceeds the predetermined level.
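The two-stage filtering can be sketched in Python as follows. This is a simplified illustration: it assumes that the filter data are plain text fragments and that "matching" means substring containment, neither of which is specified in the text. The function name `screen_domain` and its three result strings are hypothetical.

```python
from typing import Iterable


def screen_domain(
    pages: Iterable[str],
    blacklist_terms: Iterable[str],
    whitelist_terms: Iterable[str],
) -> str:
    """Classify a domain from the web data of its pages.

    Returns "not_flagged", "cleared_by_whitelist" or "risky".
    """
    pages = list(pages)
    blacklist_terms = tuple(blacklist_terms)
    whitelist_terms = tuple(whitelist_terms)

    # First filter unit: a blacklist hit adds the suspicious label.
    if not any(term in page for page in pages for term in blacklist_terms):
        return "not_flagged"

    # Second filter unit: a whitelist hit means the flagged data is acceptable.
    if any(term in page for page in pages for term in whitelist_terms):
        return "cleared_by_whitelist"

    # Labelled domain name acquisition unit: no whitelist match, so the
    # pages of this domain count as exceeding the predetermined risk level.
    return "risky"
```

Checking the whitelist only after a blacklist hit follows the order given in the text, so the whitelist acts as a correction for false positives rather than as a first gate.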
In one embodiment, the webpage identification device 300 may also include:
An identifier acquisition module, configured to obtain the identifier corresponding to the domain name to be identified when, after data identification against the preset blacklist and the preset whitelist, no domain name to be identified carrying a suspicious label exists.
An identifier matching module, configured to match the identifier against the secure identifiers stored in advance in the security identifier repository.
A secure domain name matching module, configured to obtain, when the identifier corresponding to the domain name to be identified matches a secure identifier, the secure domain name associated with the matched secure identifier stored in the security identifier repository, and to match the secure domain name against the domain name to be identified.
A suspicious domain name extraction module, configured to take, when the secure domain name does not match the domain name to be identified, the webpages in the website corresponding to the domain name to be identified as webpages whose risk level exceeds the predetermined level.
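The identifier check can be sketched as a short Python function. It assumes the identifier (for example an enterprise logo) has already been reduced to a comparable string such as a hash; how that is done is not described in the text. The function name `impersonates_secure_domain` and the dictionary-based repository are hypothetical.

```python
from typing import Dict, Optional


def impersonates_secure_domain(
    domain: str,
    identifier: Optional[str],
    secure_identifiers: Dict[str, str],
) -> bool:
    """Return True when the domain shows a known secure identifier it does not own.

    secure_identifiers maps a secure identifier to its associated secure domain name.
    """
    if identifier is None:
        return False  # nothing to compare
    # Identifier matching module: look the identifier up in the repository.
    secure_domain = secure_identifiers.get(identifier)
    if secure_domain is None:
        return False  # identifier is not a known secure identifier
    # Secure domain name matching module: compare case-insensitively,
    # ignoring a trailing dot (an assumption of this sketch).
    normalise = lambda name: name.lower().rstrip(".")
    # Suspicious domain name extraction module: a mismatch means the
    # domain is masquerading as the secure domain name.
    return normalise(domain) != normalise(secure_domain)
```

A domain that displays a bank's logo while being served from a different domain name than the one registered for that logo is reported as an impersonation.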
In one embodiment, the webpage identification device 300 may also include:
A keyword extraction module, configured to extract keywords from the web data of a webpage whose risk level exceeds the predetermined level and, according to the keywords, to add a corresponding category label to the domain name to be identified corresponding to that webpage.
A label matching module, configured to match the category label of the domain name to be identified whose risk level exceeds the predetermined level against the stored category labels.
An adding module, configured to add, when the match fails, the category label of the domain name to be identified whose risk level exceeds the predetermined level, and to store the webpage whose risk level exceeds the predetermined level under that category label.
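The category labelling step can be sketched in Python as below. The keyword table `CATEGORY_KEYWORDS`, the function name `label_and_store` and the dictionary used as the label store are hypothetical; the text does not specify how keywords are extracted or mapped to labels, so simple substring lookup is assumed.

```python
from typing import Dict, List

# Hypothetical mapping from category label to indicative keywords.
CATEGORY_KEYWORDS = {
    "banking": ("bank", "account"),
    "shopping": ("order", "coupon"),
}


def label_and_store(
    page_url: str,
    page_text: str,
    categories: Dict[str, List[str]],
) -> List[str]:
    """Label a risky page by keyword and store it under each matching label.

    categories is the store of existing labels; it is updated in place.
    """
    text = page_text.lower()
    # Keyword extraction module: derive labels from keywords in the web data.
    labels = [
        label
        for label, words in CATEGORY_KEYWORDS.items()
        if any(word in text for word in words)
    ]
    for label in labels:
        # Label matching and adding modules: create the label if it is
        # not stored yet, then file the page under it.
        pages = categories.setdefault(label, [])
        if page_url not in pages:
            pages.append(page_url)
    return labels
```

Grouping risky pages under labels is what makes the later query and push by category straightforward.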
For specific limitations on the webpage identification device, reference may be made to the limitations on the webpage identification method above, which are not repeated here.
In one embodiment, a computer device is provided. The computer device may be a conventional terminal or any other suitable computer device, and its internal structure may be as shown in Fig. 4. The computer device includes a processor, a memory and a network interface connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a webpage identification method. The processor implements the following steps when executing the computer program: obtaining a webpage whose identified risk level exceeds a predetermined level, and extracting the website domain name corresponding to the webpage; obtaining the network address corresponding to the website according to the website domain name; looking up domain names associated with the network address and, when a domain name associated with the network address is found, taking the associated domain name as a domain name to be identified; obtaining the web data in the website corresponding to the domain name to be identified; and obtaining, according to the acquired web data, the webpages corresponding to the domain name to be identified whose risk level exceeds the predetermined level.
In one embodiment, the step of looking up domain names associated with the network address, implemented when the processor executes the computer program, may include: matching the network address against the network addresses pre-stored in the address association library; when the network address matches a network address pre-stored in the address association library, obtaining the candidate associated domain name associated with the pre-stored network address; obtaining the validity deadline of the candidate associated domain name; and, if the current time is earlier than or equal to the validity deadline, extracting the candidate associated domain name as the domain name to be identified.
In one embodiment, the processor also implements the following step when executing the computer program: when no domain name associated with the network address is found, obtaining the registration data corresponding to the domain name of the website, and querying the corresponding domain name according to the registration data as the domain name to be identified.
In one embodiment, the step of obtaining the registration data corresponding to the domain name of the website and querying the corresponding domain name according to the registration data as the domain name to be identified, implemented when the processor executes the computer program, may include:
Obtaining the registration data corresponding to the domain name of the website, and selecting the conversion logic corresponding to the registration data from the conversion logic library. Converting the registration data according to the conversion logic to obtain converted registration data. Matching the converted registration data against the information data stored in the information repository. When the converted registration data matches information data stored in the information repository, obtaining the domain name associated with the matched information data as the domain name to be identified.
In one embodiment, the step of obtaining, according to the acquired web data, the webpages corresponding to the domain name to be identified whose risk level exceeds the predetermined level, implemented when the processor executes the computer program, may include:
Matching the web data against the first filter data stored in the preset blacklist and, when the web data matches the first filter data, adding a suspicious label to the domain name to be identified. Matching the web data in the website corresponding to the domain name to be identified carrying the suspicious label against the second filter data stored in the preset whitelist. When the web data does not match the second filter data, extracting the domain name to be identified carrying the suspicious label, and taking the webpages in the website corresponding to that domain name as webpages whose risk level exceeds the predetermined level.
In one embodiment, the steps implemented when the processor executes the computer program may also include: when, after data identification against the preset blacklist and the preset whitelist, no domain name to be identified carrying a suspicious label exists, obtaining the identifier corresponding to the domain name to be identified; matching the identifier against the secure identifiers stored in advance in the security identifier repository; when the identifier corresponding to the domain name to be identified matches a secure identifier, obtaining the secure domain name associated with the matched secure identifier stored in the security identifier repository, and matching the secure domain name against the domain name to be identified; and, when the secure domain name does not match the domain name to be identified, taking the webpages in the website corresponding to the domain name to be identified as webpages whose risk level exceeds the predetermined level.
In one embodiment, after the step of obtaining, according to the acquired web data, the webpages corresponding to the domain name to be identified whose risk level exceeds the predetermined level, the steps implemented when the processor executes the computer program may also include: extracting keywords from the web data of a webpage whose risk level exceeds the predetermined level and, according to the keywords, adding a corresponding category label to the domain name to be identified corresponding to that webpage; matching the category label of the domain name to be identified whose risk level exceeds the predetermined level against the stored category labels; and, when the match fails, adding the category label of that domain name to be identified and storing the webpage whose risk level exceeds the predetermined level under that category label.
For specific limitations on the computer device, reference may be made to the limitations on the webpage identification method above, which are not repeated here.
In one embodiment, still referring to Fig. 4, a storage medium is provided, on which a computer program is stored. The computer program implements the following steps when executed by a processor: obtaining a webpage whose identified risk level exceeds a predetermined level, and extracting the website domain name corresponding to the webpage; obtaining the network address corresponding to the website according to the website domain name; looking up domain names associated with the network address and, when a domain name associated with the network address is found, taking the associated domain name as a domain name to be identified; obtaining the web data in the website corresponding to the domain name to be identified; and obtaining, according to the acquired web data, the webpages corresponding to the domain name to be identified whose risk level exceeds the predetermined level.
In one embodiment, the step of looking up domain names associated with the network address, implemented when the computer program is executed by the processor, may include: matching the network address against the network addresses pre-stored in the address association library; when the network address matches a network address pre-stored in the address association library, obtaining the candidate associated domain name associated with the pre-stored network address; obtaining the validity deadline of the candidate associated domain name; and, if the current time is earlier than or equal to the validity deadline, extracting the candidate associated domain name as the domain name to be identified.
In one embodiment, the computer program also implements the following step when executed by the processor: when no domain name associated with the network address is found, obtaining the registration data corresponding to the domain name of the website, and querying the corresponding domain name according to the registration data as the domain name to be identified.
In one embodiment, the step of obtaining the registration data corresponding to the domain name of the website and querying the corresponding domain name according to the registration data as the domain name to be identified, implemented when the computer program is executed by the processor, may include: obtaining the registration data corresponding to the domain name of the website, and selecting the conversion logic corresponding to the registration data from the conversion logic library; converting the registration data according to the conversion logic to obtain converted registration data; matching the converted registration data against the information data stored in the information repository; and, when the converted registration data matches information data stored in the information repository, obtaining the domain name associated with the matched information data as the domain name to be identified.
In one embodiment, the step of obtaining, according to the acquired web data, the webpages corresponding to the domain name to be identified whose risk level exceeds the predetermined level, implemented when the computer program is executed by the processor, may include: matching the web data against the first filter data stored in the preset blacklist and, when the web data matches the first filter data, adding a suspicious label to the domain name to be identified; matching the web data in the website corresponding to the domain name to be identified carrying the suspicious label against the second filter data stored in the preset whitelist; and, when the web data does not match the second filter data, extracting the domain name to be identified carrying the suspicious label, and taking the webpages in the website corresponding to that domain name as webpages whose risk level exceeds the predetermined level.
In one embodiment, the steps implemented when the computer program is executed by the processor may also include: when, after data identification against the preset blacklist and the preset whitelist, no domain name to be identified carrying a suspicious label exists, obtaining the identifier corresponding to the domain name to be identified; matching the identifier against the secure identifiers stored in advance in the security identifier repository; when the identifier corresponding to the domain name to be identified matches a secure identifier, obtaining the secure domain name associated with the matched secure identifier stored in the security identifier repository, and matching the secure domain name against the domain name to be identified; and, when the secure domain name does not match the domain name to be identified, taking the webpages in the website corresponding to the domain name to be identified as webpages whose risk level exceeds the predetermined level.
In one embodiment, after the step of obtaining, according to the acquired web data, the webpages corresponding to the domain name to be identified whose risk level exceeds the predetermined level, the steps implemented when the computer program is executed by the processor may also include: extracting keywords from the web data of a webpage whose risk level exceeds the predetermined level and, according to the keywords, adding a corresponding category label to the domain name to be identified corresponding to that webpage; matching the category label of the domain name to be identified whose risk level exceeds the predetermined level against the stored category labels; and, when the match fails, adding the category label of that domain name to be identified and storing the webpage whose risk level exceeds the predetermined level under that category label.
For specific limitations on the storage medium, reference may be made to the limitations on the webpage identification method above, which are not repeated here.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The computer-readable storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present invention. Although their description is relatively specific and detailed, they should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (10)

1. A webpage identification method, characterized by comprising:
Obtaining a webpage whose identified risk level exceeds a predetermined level, and extracting the website domain name corresponding to the webpage;
Obtaining the network address corresponding to the website according to the website domain name;
Looking up domain names associated with the network address and, when a domain name associated with the network address is found, taking the associated domain name as a domain name to be identified;
Obtaining the web data in the website corresponding to the domain name to be identified;
Obtaining, according to the acquired web data, the webpages corresponding to the domain name to be identified whose risk level exceeds the predetermined level.
2. The method according to claim 1, characterized in that the step of looking up domain names associated with the network address comprises:
Matching the network address against the network addresses pre-stored in the address association library;
When the network address matches a network address pre-stored in the address association library, obtaining the candidate associated domain name associated with the pre-stored network address;
Obtaining the validity deadline of the candidate associated domain name;
If the current time is earlier than or equal to the validity deadline, extracting the candidate associated domain name as the domain name to be identified.
3. The method according to claim 1, characterized in that the method further comprises:
When no domain name associated with the network address is found, obtaining the registration data corresponding to the domain name of the website, and querying the corresponding domain name according to the registration data as the domain name to be identified.
4. The method according to claim 3, characterized in that the step of obtaining the registration data corresponding to the domain name of the website and querying the corresponding domain name according to the registration data as the domain name to be identified comprises:
Obtaining the registration data corresponding to the domain name of the website, and selecting the conversion logic corresponding to the registration data from the conversion logic library;
Converting the registration data according to the conversion logic to obtain converted registration data;
Matching the converted registration data against the information data stored in the information repository;
When the converted registration data matches information data stored in the information repository, obtaining the domain name associated with the matched information data as the domain name to be identified.
5. The method according to claim 1, characterized in that the step of obtaining, according to the acquired web data, the webpages corresponding to the domain name to be identified whose risk level exceeds the predetermined level comprises:
Matching the web data against the first filter data stored in the preset blacklist and, when the web data matches the first filter data, adding a suspicious label to the domain name to be identified;
Matching the web data in the website corresponding to the domain name to be identified carrying the suspicious label against the second filter data stored in the preset whitelist;
When the web data does not match the second filter data, extracting the domain name to be identified carrying the suspicious label, and taking the webpages in the website corresponding to that domain name as webpages whose risk level exceeds the predetermined level.
6. The method according to claim 5, characterized in that the method further comprises:
When, after data identification against the preset blacklist and the preset whitelist, no domain name to be identified carrying a suspicious label exists, obtaining the identifier corresponding to the domain name to be identified;
Matching the identifier against the secure identifiers stored in advance in the security identifier repository;
When the identifier corresponding to the domain name to be identified matches a secure identifier, obtaining the secure domain name associated with the matched secure identifier stored in the security identifier repository, and matching the secure domain name against the domain name to be identified;
When the secure domain name does not match the domain name to be identified, taking the webpages in the website corresponding to the domain name to be identified as webpages whose risk level exceeds the predetermined level.
7. The method according to claim 1, characterized in that, after the step of obtaining, according to the acquired web data, the webpages corresponding to the domain name to be identified whose risk level exceeds the predetermined level, the method further comprises:
Extracting keywords from the web data of the webpage whose risk level exceeds the predetermined level and, according to the keywords, adding a corresponding category label to the domain name to be identified corresponding to that webpage;
Matching the category label of the domain name to be identified whose risk level exceeds the predetermined level against the stored category labels;
When the match fails, adding the category label of the domain name to be identified whose risk level exceeds the predetermined level, and storing the webpage whose risk level exceeds the predetermined level under that category label.
8. A webpage identification device, characterized in that the device comprises:
A first acquisition module, configured to obtain a webpage whose identified risk level exceeds a predetermined level and to extract the website domain name corresponding to the webpage;
A second acquisition module, configured to obtain the network address corresponding to the website according to the website domain name;
A searching module, configured to look up domain names associated with the network address and, when a domain name associated with the network address is found, to take the associated domain name as a domain name to be identified;
A third acquisition module, configured to obtain the web data in the website corresponding to the domain name to be identified;
An identification module, configured to obtain, according to the acquired web data, the webpages corresponding to the domain name to be identified whose risk level exceeds the predetermined level.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN201711297266.7A 2017-12-08 2017-12-08 Webpage identification method and device, computer equipment and storage medium Active CN108092963B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201711297266.7A CN108092963B (en) 2017-12-08 2017-12-08 Webpage identification method and device, computer equipment and storage medium
PCT/CN2018/077064 WO2019109529A1 (en) 2017-12-08 2018-02-23 Webpage identification method, device, computer apparatus, and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711297266.7A CN108092963B (en) 2017-12-08 2017-12-08 Webpage identification method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108092963A true CN108092963A (en) 2018-05-29
CN108092963B CN108092963B (en) 2020-05-08

Family

ID=62174944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711297266.7A Active CN108092963B (en) 2017-12-08 2017-12-08 Webpage identification method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN108092963B (en)
WO (1) WO2019109529A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110012030A (en) * 2019-04-23 2019-07-12 北京微步在线科技有限公司 A kind of method and device of association detection hacker
CN110033092A (en) * 2019-01-31 2019-07-19 阿里巴巴集团控股有限公司 Data label generation, model training, event recognition method and device
CN110266661A (en) * 2019-06-04 2019-09-20 东软集团股份有限公司 A kind of authorization method, device and equipment
CN110865818A (en) * 2018-08-28 2020-03-06 优视科技有限公司 Application associated domain name detection method and device and electronic equipment
CN110958244A (en) * 2019-11-29 2020-04-03 北京邮电大学 Method and device for detecting counterfeit domain name based on deep learning
CN111814643A (en) * 2020-06-30 2020-10-23 杭州科度科技有限公司 Black and gray URL (Uniform resource locator) identification method and device, electronic equipment and medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113098859B (en) * 2021-03-30 2023-03-31 深圳市欢太科技有限公司 Webpage page rollback method, device, terminal and storage medium
CN113923193B (en) * 2021-10-27 2023-11-28 北京知道创宇信息技术股份有限公司 Network domain name association method and device, storage medium and electronic equipment
CN114900363A (en) * 2022-05-18 2022-08-12 杭州安恒信息技术股份有限公司 Malicious website identification method and device, electronic equipment and storage medium
CN116708356B (en) * 2023-08-02 2023-11-14 苏州迈科网络安全技术股份有限公司 IP feature library generation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102096781A (en) * 2011-01-18 2011-06-15 南京邮电大学 Fishing detection method based on webpage relevance
CN102724187A (en) * 2012-06-06 2012-10-10 奇智软件(北京)有限公司 Method and device for safety detection of universal resource locators
CN102739653A (en) * 2012-06-06 2012-10-17 奇智软件(北京)有限公司 Detection method and device aiming at webpage address
CN105338001A (en) * 2015-12-04 2016-02-17 北京奇虎科技有限公司 Method and device for recognizing phishing website
CN106302438A (en) * 2016-08-11 2017-01-04 国家计算机网络与信息安全管理中心 A kind of method of actively monitoring fishing website of Behavior-based control feature by all kinds of means

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8869269B1 (en) * 2008-05-28 2014-10-21 Symantec Corporation Method and apparatus for identifying domain name abuse
CN102523210B (en) * 2011-12-06 2014-11-05 中国科学院计算机网络信息中心 Phishing website detection method and device
CN102663000B (en) * 2012-03-15 2016-08-03 北京百度网讯科技有限公司 The maliciously recognition methods of the method for building up of network address database, maliciously network address and device
CN105718577B (en) * 2016-01-22 2020-01-21 中国互联网络信息中心 Method and system for automatically detecting phishing aiming at newly added domain name

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110865818A (en) * 2018-08-28 2020-03-06 优视科技有限公司 Application associated domain name detection method and device and electronic equipment
CN110865818B (en) * 2018-08-28 2023-07-28 阿里巴巴(中国)有限公司 Detection method and device for application associated domain name and electronic equipment
CN110033092A (en) * 2019-01-31 2019-07-19 阿里巴巴集团控股有限公司 Data label generation, model training, event recognition method and device
CN110033092B (en) * 2019-01-31 2020-06-02 阿里巴巴集团控股有限公司 Data label generation method, data label training device, event recognition method and event recognition device
CN110012030A (en) * 2019-04-23 2019-07-12 北京微步在线科技有限公司 Method and device for detecting hackers by association
CN110266661A (en) * 2019-06-04 2019-09-20 东软集团股份有限公司 Authorization method, device and equipment
CN110266661B (en) * 2019-06-04 2021-09-14 东软集团股份有限公司 Authorization method, device and equipment
CN110958244A (en) * 2019-11-29 2020-04-03 北京邮电大学 Method and device for detecting counterfeit domain name based on deep learning
CN111814643A (en) * 2020-06-30 2020-10-23 杭州科度科技有限公司 Black and gray URL (Uniform resource locator) identification method and device, electronic equipment and medium

Also Published As

Publication number Publication date
WO2019109529A1 (en) 2019-06-13
CN108092963B (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN108092963A (en) Web page identification method, device, computer equipment and storage medium
US9276956B2 (en) Method for detecting phishing website without depending on samples
CN104899508B (en) Multistage phishing site detection method and system
EP2803031B1 (en) Machine-learning based classification of user accounts based on email addresses and other account information
CN104954372B (en) Phishing website evidence collection and verification method and system
CN109690547A (en) System and method for detecting online fraud
CN103530367B (en) Phishing website identification system and method
CN105119909B (en) Counterfeit website detection method and system based on page visual similarity
CN106302440B (en) Method for acquiring suspicious phishing websites through multiple channels
CN109522504A (en) Method for identifying counterfeit websites based on threat intelligence
US20180131708A1 (en) Identifying Fraudulent and Malicious Websites, Domain and Sub-domain Names
CN103634317A (en) Method and system of performing safety appraisal on malicious web site information on basis of cloud safety
CN112804210B (en) Data association method and device, electronic equipment and computer-readable storage medium
CN103209177B (en) Method and device for detecting phishing attacks
CN109274632A (en) Website identification method and device
CN112464666B (en) Unknown network threat automatic discovery method based on hidden network data
CN104239582A (en) Method and device for identifying phishing webpage based on feature vector model
CN103067387A (en) Monitoring system and monitoring method for anti phishing
CN102902722B (en) Information security processing method and system
CN112751804B (en) Method, device and equipment for identifying counterfeit domain name
CN108270754B (en) Detection method and device for phishing website
CN105262730A (en) Monitoring method and device based on enterprise domain name safety
CN107590265A (en) Web-crawler-based method for identifying the administrative ownership of websites
US20040267895A1 (en) Search system using real name and method thereof
KR20100120966A (en) System for sorting phishing sites based on searching web sites, and method therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant