CN108092963B - Webpage identification method and device, computer equipment and storage medium - Google Patents

Publication number
CN108092963B
Authority
CN
China
Prior art keywords
domain name
identified
webpage
data
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711297266.7A
Other languages
Chinese (zh)
Other versions
CN108092963A (en)
Inventor
王元铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201711297266.7A
Priority to PCT/CN2018/077064 (WO2019109529A1)
Publication of CN108092963A
Application granted
Publication of CN108092963B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/16 Implementing security features at a particular protocol layer
    • H04L63/168 Implementing security features at a particular protocol layer above the transport layer
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a webpage identification method, a webpage identification device, computer equipment and a storage medium. The method comprises: acquiring a webpage whose identified risk level is greater than a preset level, and extracting the website domain name corresponding to the webpage; acquiring the network address corresponding to the website according to the website domain name; searching for a domain name associated with the network address, and, when such a domain name is found, taking the associated domain name as a domain name to be identified; acquiring webpage data in the website corresponding to the domain name to be identified; and obtaining, according to the acquired webpage data, the webpages corresponding to the domain name to be identified whose risk level is greater than the preset level. With the method, device, computer equipment and storage medium, a plurality of associated webpages whose risk level is greater than the preset level can be found starting from a single webpage whose risk level is greater than the preset level, so query efficiency is high.

Description

Webpage identification method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of network security, and in particular, to a method and an apparatus for identifying a web page, a computer device, and a storage medium.
Background
With the development of internet technology, more and more activities are carried out online, such as making transactions and handling banking business. As a result, websites disguised as banks have appeared. When users visit and use such websites, private information that they submit, such as bank account numbers and passwords, can be stolen. If these threatening websites are not discovered in time, users' property is put at risk and their interests are harmed.
Conventionally, because a large number of webpages are generated every day, target webpages that may be threatening must first be selected from the many webpages generated on the internet, and each selected target webpage must then undergo tedious analysis. Identifying whether the risk level of a target webpage is greater than a preset level is therefore inefficient.
Disclosure of Invention
Accordingly, it is necessary to provide a web page identification method, an apparatus, a computer device, and a storage medium for solving the problem that it is inefficient to identify whether the risk level of the target web page is greater than the preset level.
A web page identification method, comprising:
acquiring a webpage with the identified risk level being greater than a preset level, and extracting a website domain name corresponding to the webpage;
acquiring a network address corresponding to the website according to the website domain name;
searching a domain name associated with the network address, and when the domain name associated with the network address is found, taking the associated domain name as a domain name to be identified;
acquiring webpage data in a website corresponding to the domain name to be identified;
and obtaining the webpage with the risk level corresponding to the domain name to be identified greater than the preset level according to the acquired webpage data.
In one embodiment, the step of searching the domain name associated with the network address includes:
matching the network address with a network address prestored in an address association library;
when the network address is successfully matched with a network address prestored in the address association library, acquiring a domain name to be matched and associated with the prestored network address;
obtaining the effective deadline of the associated domain name to be matched;
and if the current time is less than or equal to the effective deadline time, extracting the associated domain name to be matched as the domain name to be identified.
In one embodiment, the method further comprises:
and when the domain name associated with the network address is not found, acquiring registration data corresponding to the domain name of the website, and inquiring the corresponding domain name as the domain name to be identified according to the registration data.
In one embodiment, the step of acquiring registration data corresponding to the domain name of the website and querying the corresponding domain name as the domain name to be identified according to the registration data includes:
acquiring registration data corresponding to the domain name of the website, and selecting conversion logic corresponding to the registration data from a conversion logic library;
converting the registration data according to the conversion logic to obtain converted registration data;
matching the converted registration data with information data stored in an information repository;
and when the converted registration data is successfully matched with the information data stored in the information storage library, acquiring the domain name associated with the successfully matched information data as the domain name to be identified.
In one embodiment, the step of obtaining, according to the obtained webpage data, a webpage with a risk level corresponding to the domain name to be identified being greater than a preset level includes:
matching the webpage data with first filtering data stored in a preset blacklist, and adding a suspicious label to the domain name to be identified when the webpage data is successfully matched with the first filtering data;
matching the webpage data in the website corresponding to the domain name to be identified added with the suspicious label with second filtering data stored in a preset white list;
and when the webpage data is not successfully matched with the second filtering data, extracting the domain name to be identified carrying the suspicious label, and acquiring the webpage in the website corresponding to the domain name to be identified as the webpage with the risk level greater than the preset level.
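The two-stage filtering described in this embodiment can be sketched as follows. This is only an illustrative sketch: the source does not say how "matching" is performed, so case-insensitive substring matching is assumed here, and the function name `is_high_risk` and the sample filtering data are hypothetical.

```python
def is_high_risk(page_text, blacklist, whitelist):
    """Return True when the page matches the blacklist and is not cleared by the whitelist.

    Assumption: matching means a case-insensitive substring test.
    """
    text = page_text.lower()
    # Stage 1: first filtering data (blacklist) decides whether the suspicious label is added.
    suspicious = any(term.lower() in text for term in blacklist)
    if not suspicious:
        return False
    # Stage 2: second filtering data (whitelist) can clear a suspicious page.
    cleared = any(term.lower() in text for term in whitelist)
    return not cleared


# Hypothetical filtering data, for illustration only.
BLACKLIST = {"verify your account"}
WHITELIST = {"official site of example bank"}

print(is_high_risk("Please VERIFY your account now", BLACKLIST, WHITELIST))
```

Only pages that pass the first stage reach the second, which mirrors the order of the steps above.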
In one embodiment, the method further comprises:
when the domain name to be identified carrying the suspicious label does not exist after the data identification is carried out on the preset blacklist and the preset white list, acquiring an identifier corresponding to the domain name to be identified;
matching the identifier with a security identifier pre-stored in a security identifier store;
when the identifier corresponding to the domain name to be identified is successfully matched with a security identifier, acquiring the security domain name that is stored in the security identifier store in association with the successfully matched security identifier, and matching the security domain name with the domain name to be identified;
and when the matching between the security domain name and the domain name to be identified is unsuccessful, taking the webpage in the website corresponding to the domain name to be identified as the webpage with the risk level higher than the preset level.
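The identifier check in this embodiment might look like the sketch below. The source does not say what the identifier is or how the store is organized, so a plain mapping from security identifier to security domain name is assumed; `fails_identifier_check`, `SAFE_STORE` and the sample values are hypothetical. The behaviour for an identifier that is absent from the store is also an assumption, since the text does not cover that case.

```python
def fails_identifier_check(domain, identifier, safe_store):
    """Return True when a known security identifier is used by a domain other than its registered one."""
    safe_domain = safe_store.get(identifier)
    if safe_domain is None:
        # Assumption: an identifier that is not in the store yields no verdict.
        return False
    # The identifier matched; the page is high risk if the domains differ.
    return safe_domain != domain


# Hypothetical store: security identifier -> security domain name.
SAFE_STORE = {"EXAMPLE-BANK-ID": "examplebank.test"}

print(fails_identifier_check("examp1ebank.test", "EXAMPLE-BANK-ID", SAFE_STORE))
```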
In one embodiment, after the step of obtaining, according to the obtained webpage data, a webpage with a risk level corresponding to the domain name to be identified being greater than a preset level, the method further includes:
extracting keywords of the webpage data of the webpage with the risk level larger than the preset level, and adding a corresponding category label to the domain name to be identified corresponding to the webpage with the risk level larger than the preset level according to the keywords;
matching the category label of the domain name to be identified with the risk level larger than the preset level with the stored category labels;
and when the matching is not successful, adding the category label of the domain name to be identified with the risk level greater than the preset level, and storing the webpage with the risk level greater than the preset level under the category label.
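As a rough illustration of the category labelling above, the sketch below derives a label from keywords and files the page under it. The keyword-to-label table, the `"uncategorized"` fallback and the function names are hypothetical, and filing a page under a label that already exists is an assumption; the steps above only describe the case where the label is not yet stored.

```python
# Hypothetical mapping from keyword to category label.
KEYWORD_LABELS = {"bank": "finance", "prize": "lottery"}


def category_for(page_text, keyword_labels=KEYWORD_LABELS):
    """Pick the label of the first keyword found in the page text."""
    text = page_text.lower()
    for keyword, label in keyword_labels.items():
        if keyword in text:
            return label
    return "uncategorized"  # assumed fallback, not from the source


def file_page(store, domain, page_url, page_text):
    """Store the page under its category label, adding the label if it is new."""
    label = category_for(page_text)
    if label not in store:
        store[label] = []  # label not matched among stored labels: add it
    store[label].append((domain, page_url))
    return label


store = {}
file_page(store, "a.test", "http://a.test/1", "Your BANK account")
print(store)
```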
An apparatus for web page identification, the apparatus comprising:
the first acquisition module is used for acquiring the identified webpage with the risk level greater than the preset level and extracting the website domain name corresponding to the webpage;
the second acquisition module is used for acquiring the network address corresponding to the website according to the website domain name;
the searching module is used for searching the domain name associated with the network address, and when the domain name associated with the network address is searched, the associated domain name is used as the domain name to be identified;
the third acquisition module is used for acquiring webpage data in a website corresponding to the domain name to be identified;
and the identification module is used for obtaining the webpage of which the risk level corresponding to the domain name to be identified is greater than the preset level according to the acquired webpage data.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
A storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the above-mentioned method.
According to the above webpage identification method, device, computer equipment and storage medium, a webpage whose identified risk level is greater than a preset level is acquired, the website domain name corresponding to the webpage is extracted, the network address corresponding to the website is acquired according to the website domain name, and a domain name associated with the network address is searched for as the domain name to be identified. When a domain name to be identified is found, webpage data in the corresponding website are acquired, and the webpages whose risk level is greater than the preset level are obtained according to the webpage data. In this way, a plurality of associated webpages whose risk level is greater than the preset level can be found starting from one webpage whose risk level is greater than the preset level, so query efficiency is high.
Drawings
FIG. 1 is a diagram illustrating an exemplary scenario for implementing a web page recognition method;
FIG. 2 is a flow diagram of a method for web page identification in one embodiment;
FIG. 3 is a schematic diagram of a web page recognition apparatus according to an embodiment;
FIG. 4 is a diagram illustrating an embodiment of a computer device.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Before describing in detail embodiments that are in accordance with the present invention, it should be observed that the embodiments reside primarily in combinations of steps and apparatus components related to web page identification methods, apparatus, computer devices, and storage media. Accordingly, the apparatus components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those details that are pertinent to understanding the embodiments of the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
In this document, relational terms such as left and right, top and bottom, front and back, first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is an application scenario diagram of a web page identification method in an embodiment. The application scenario includes a web page identification platform and a server. The web page identification platform acquires from the server a stored web page whose identified risk level is greater than a preset level, obtains the web page address of that web page, and extracts from the web page address the website domain name corresponding to the web page. The platform then acquires the network address corresponding to the website according to the website domain name and searches an address association library stored in the platform for a domain name associated with that network address. When an associated domain name is found, it is used as the domain name to be identified. The platform acquires web page data from the web pages of the website corresponding to the domain name to be identified, and obtains, according to the acquired web page data, the web pages corresponding to the domain name to be identified whose risk level is greater than the preset level.
Referring to fig. 2, in one embodiment, a flowchart of a web page identification method is provided, and the method in this embodiment is applied to the web page identification platform in fig. 1, where a web page identification program runs on the platform, and the web page identification process is implemented by the web page identification program. The method comprises the following steps:
S202: acquiring a webpage whose identified risk level is greater than a preset level, and extracting the website domain name corresponding to the webpage.
Specifically, the risk level is a security index for evaluating whether a webpage is safe. Risk levels may be graded from low to high, with a higher level indicating a higher risk; for example, the risk level may be set to level 1 through level 5, each level indicating a higher risk than the one before. The website domain name is an identifier of a website, and a plurality of webpages may exist under the same website domain name; for example, the website domain name of the website "Baidu" is "baidu. The server is provided with a risk database, which stores webpages whose risk level is greater than the preset level, that is, high-risk webpages. The webpage identification platform acquires such a webpage from the server, obtains the webpage address corresponding to the acquired webpage, and extracts the website domain name from the webpage address. It should be noted that the webpage address is the unique identifier of each webpage in the network, and may be a Uniform Resource Locator (URL).
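A minimal sketch of step S202 is given below, assuming the risk database can be read as rows of (webpage address, risk level). The row layout, `RISK_DB`, `PRESET_LEVEL` and the function names are hypothetical; the domain is taken from the URL with Python's standard `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit


def extract_site_domain(page_url):
    """Return the host part of a webpage address (URL), lower-cased and without the port."""
    host = urlsplit(page_url).hostname
    if host is None:
        raise ValueError(f"no host in URL: {page_url!r}")
    return host


# Hypothetical risk database rows: (webpage address, risk level from 1 to 5).
RISK_DB = [
    ("http://login.example-bank.test/account/verify", 5),
    ("http://news.example.org/today", 1),
]
PRESET_LEVEL = 3


def risky_domains(rows, preset_level):
    """Collect the website domain names of pages whose risk level exceeds the preset level."""
    return sorted({extract_site_domain(url) for url, level in rows if level > preset_level})


print(risky_domains(RISK_DB, PRESET_LEVEL))
```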
S204: acquiring the network address corresponding to the website according to the website domain name.
Specifically, the network address is an identifier used when computers connect to or communicate with each other over a network. It may uniquely identify a computer device in a network, and the computer uses it as its communication identifier when communicating with other computers. For example, the network address may be an IP (Internet Protocol) address. Each website domain name corresponds to a network address. Further, the web page identification platform queries the network address corresponding to the website according to the website domain name: the platform sends test data to the website server corresponding to the acquired website domain name, and when the website server returns response data, the platform extracts the network address from the received response data.
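One way to obtain the network address in step S204 is sketched below. Note that it swaps in an ordinary DNS lookup through the operating system resolver (`socket.getaddrinfo`) rather than the test-data exchange with the website server described above; the function names are hypothetical, and the resolver is passed as a parameter so that a different lookup mechanism can be substituted.

```python
import socket


def system_resolver(domain):
    """Resolve a domain name to its IPv4 addresses using the operating system resolver."""
    infos = socket.getaddrinfo(domain, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})


def network_addresses(domain, resolver=system_resolver):
    """Return the network addresses for a website domain name, or an empty list if lookup fails."""
    try:
        return resolver(domain)
    except socket.gaierror:
        return []


# Example with a stand-in resolver, so no network access is needed.
print(network_addresses("login.example-bank.test", lambda d: ["203.0.113.7"]))
```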
S206: searching for a domain name associated with the network address, and, when a domain name associated with the network address is found, taking the associated domain name as the domain name to be identified.
Specifically, associated domain names are domain names that share the same network address. When the websites corresponding to different website domain names are hosted on the same website server, they can share the same network address; in that case the websites correspond to different access ports on the website server and are distinguished by those ports. Further, different network addresses and their corresponding website domain names are prestored in the webpage identification platform. The platform queries the domain names associated with the acquired network address; an associated domain name is different from the website domain name whose risk level has already been identified as greater than the preset level. When a domain name associated with the network address of the website whose risk level is greater than the preset level is found, the associated domain name is used as the domain name to be identified.
S208: acquiring webpage data in the website corresponding to the domain name to be identified.
Specifically, webpage data is the content displayed on a webpage, and may be text data, picture data, numerical data, and the like. A website may include different webpages. After the webpage identification platform takes the associated domain name, found according to the network address of the website whose identified risk level is greater than the preset level, as the domain name to be identified, the platform locates the website corresponding to the domain name to be identified and acquires the webpage data of the different webpages included in that website, for example the text data displayed on those webpages.
S210: obtaining, according to the acquired webpage data, the webpages corresponding to the domain name to be identified whose risk level is greater than the preset level.
Specifically, the webpage identification platform examines the acquired webpage data, and when suspicious data exists in the acquired webpage data, the webpage containing that data is taken as a webpage whose risk level is greater than the preset level. For example, the platform may examine the text data of the acquired webpage data character by character, and when suspicious text data is identified, the webpage containing it is a webpage, corresponding to the domain name to be identified, whose risk level is greater than the preset level. It should be noted that the suspicious data may be preset data: a webpage that includes the preset data is a webpage whose risk level is greater than the preset level. The suspicious data may be text data, picture data, numerical data, and the like; for example, the suspicious data may be set as the text "bank", "score", or "prize".
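Step S210 can be illustrated with a simple scan of page text for the preset suspicious terms. The sketch assumes the webpage data has already been fetched into a mapping from page URL to page text, and that a case-insensitive substring test is an acceptable stand-in for the character-by-character identification mentioned above; `flag_pages` is a hypothetical name.

```python
# Preset suspicious text data, taken from the example in the description.
SUSPICIOUS_TERMS = ("bank", "score", "prize")


def flag_pages(pages, terms=SUSPICIOUS_TERMS):
    """Map each page URL to the suspicious terms found in its text; clean pages are omitted."""
    flagged = {}
    for url, text in pages.items():
        lowered = text.lower()
        hits = [term for term in terms if term in lowered]
        if hits:
            flagged[url] = hits
    return flagged


# Hypothetical pages of the website for one domain name to be identified.
pages = {
    "http://a.test/1": "Win a PRIZE from your bank",
    "http://a.test/2": "Contact us",
}
print(flag_pages(pages))
```

Every URL in the result is treated as a webpage whose risk level is greater than the preset level.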
In this embodiment, the web page identification platform starts from one web page whose risk level is greater than the preset level, queries the other domain names associated with it, and then finds further web pages whose risk level is greater than the preset level according to the web page data in the websites corresponding to the associated domain names. Because several such web pages can be found from a single known one, query efficiency is improved.
In one embodiment, step S206, that is, the step of searching for the domain name associated with the network address, may include the following process:
Matching the network address with the network addresses prestored in an address association library. Specifically, the address association library is a database that stores different network addresses and the domain names corresponding to them. The webpage identification platform acquires a webpage whose identified risk level is greater than the preset level, obtains its webpage address, extracts the corresponding website domain name from the webpage address, and acquires the network address of the website according to the website domain name. The platform then matches the acquired network address against the network addresses prestored in the address association library one by one, until all network addresses stored in the address association library have been traversed.
When the network address is successfully matched with a network address prestored in the address association library, acquiring the associated domain name to be matched that is associated with the prestored network address. Specifically, the associated domain name to be matched is a domain name associated with a network address prestored in the address association library; the domain name may be an identifier of a website. When a network address is found in the address association library, the associated domain name to be matched corresponding to that network address can be obtained through the association. After matching the network address of the website whose identified risk level is greater than the preset level against the network addresses stored in the address association library one by one, the webpage identification platform selects the prestored network address that was successfully matched and obtains from the address association library the associated domain name to be matched that is associated with it.
Obtaining the effective deadline of the associated domain name to be matched from the address association library. Specifically, the effective deadline is the last time at which the associated domain name to be matched is valid. It may be a year, a specific month of a year, or a specific date; for example, the effective deadline may be the year 2017, the month of December 2017, or the date December 31, 2017. When the webpage identification platform has successfully matched the network address corresponding to the webpage whose risk level is greater than the preset level with a network address stored in the address association library, and has acquired the associated domain name to be matched that is associated with the matched network address, the platform looks up in the address association library the effective deadline corresponding to that associated domain name to be matched.
If the current time is less than or equal to the effective deadline, extracting the associated domain name to be matched as the domain name to be identified. Specifically, the current time is the time at which the associated domain name to be matched is acquired, and may be the system time; like the effective deadline, it may be expressed as a year, a specific month of a year, or a specific date. The webpage identification platform acquires the associated domain name to be matched together with the current time, and compares the current time with the effective deadline corresponding to the associated domain name to be matched. If the current time is less than or equal to the effective deadline, the associated domain name to be matched has not expired, that is, it is valid; the platform then takes it as the associated domain name, and in turn as the domain name to be identified.
It should be noted that, in this embodiment, the address association library may be a passive Domain Name System (DNS) database. The web page identification platform matches the acquired network address of the website whose identified risk level is greater than the preset level against the network addresses stored in the passive DNS database. When the matching succeeds, the platform acquires the associated domain name to be matched corresponding to the matched network address in the passive DNS database, and when the current time at which it is acquired is less than or equal to its effective deadline, the associated domain name to be matched is used as the associated domain name.
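The lookup with an effective deadline can be sketched as below. The record layout is an assumption: each entry of the address association library is modelled as a network address, a domain name and a `valid_until` date, and the names `AssociationRecord` and `domains_to_identify` are hypothetical. A real passive DNS database would be queried through its own interface.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AssociationRecord:
    """One assumed entry of the address association library."""
    address: str
    domain: str
    valid_until: date  # effective deadline of the associated domain name


def domains_to_identify(records, address, known_domain, today):
    """Return the still-valid domains sharing `address`, excluding the already identified domain."""
    return sorted({
        record.domain
        for record in records
        if record.address == address
        and record.domain != known_domain
        and today <= record.valid_until  # current time <= effective deadline
    })


records = [
    AssociationRecord("203.0.113.7", "known.test", date(2030, 1, 1)),
    AssociationRecord("203.0.113.7", "fresh.test", date(2017, 12, 31)),
    AssociationRecord("203.0.113.7", "stale.test", date(2016, 6, 30)),
    AssociationRecord("198.51.100.2", "other.test", date(2030, 1, 1)),
]
print(domains_to_identify(records, "203.0.113.7", "known.test", date(2017, 12, 31)))
```

Passing the current date as a parameter, rather than reading the system clock inside the function, keeps the comparison easy to check.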
It should be noted that a web page whose risk level is greater than the preset level may be a high-risk web page disguised as a normal web page, such as a phishing web page, which steals the user's bank card information when the user visits it and thereby threatens the user's property. As another example, where an enterprise restricts access to certain web pages, a web page subject to such an access restriction may also be regarded as a web page whose risk level is greater than the preset level. In this embodiment, the web page identification platform obtains the associated domain name to be matched that is associated with the successfully matched prestored network address in the address association library, and compares the current time with the corresponding effective deadline. When the current time is less than or equal to the effective deadline, the associated domain name to be matched is valid, that is, it can be used as the associated domain name and in turn as the domain name to be identified.
In one embodiment, the webpage identification method may further include the following step, which may be performed after step S206, that is, after the step of searching for the domain name associated with the network address:
When the domain name associated with the network address is not found, registration data corresponding to the domain name of the website is acquired, and the corresponding domain name is queried according to the registration data and taken as the domain name to be identified. Specifically, the registration data refers to data describing the user who registered the domain name of the website, and may be text data, picture data, or numerical data; for example, the registration data may be a personal name, a personal mailbox, a personal telephone, or a personal photograph. When the webpage identification platform fails to match the network address against the pre-stored network addresses in the address association library, no associated domain name to be matched is obtained. In that case the webpage identification platform acquires the registration data corresponding to the domain name of the website whose identified risk level is greater than the preset level, and queries the domain names corresponding to the acquired registration data. A queried domain name that differs from the domain name of the website whose risk level is greater than the preset level is taken as the domain name to be identified.
In this embodiment, when no domain name associated with the network address of the website whose risk level is greater than the preset level is found in the address association library, different domain names are queried according to the registration data corresponding to that website and taken as the domain names to be identified. That is, associated domain names can be queried a second time through the registration information, and the queried associated domain names are used as the domain names to be identified, which improves the accuracy of querying websites whose risk level is greater than the preset level.
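The fallback from the address lookup to the registration-data lookup can be sketched as below. The three dictionaries and the function name `find_domains_to_identify` are hypothetical stand-ins for the address association library and the registration records; the embodiment does not prescribe these structures.

```python
def find_domains_to_identify(site_domain, network_address,
                             address_library, registrant_of,
                             domains_by_registrant):
    """Query by network address first; fall back to registration data.

    address_library:       network address -> list of associated domain names
    registrant_of:         domain name -> registration data (here a single string)
    domains_by_registrant: registration data -> list of registered domain names
    All three mappings are illustrative assumptions.
    """
    associated = address_library.get(network_address, [])
    if associated:
        return list(associated)

    # No associated domain name found: query again through the registration data.
    registrant = registrant_of.get(site_domain)
    if registrant is None:
        return []
    # Only domain names different from the known high-risk domain are kept.
    return [d for d in domains_by_registrant.get(registrant, [])
            if d != site_domain]
```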
In one embodiment, the step of acquiring the registration data corresponding to the domain name of the website and querying the corresponding domain name as the domain name to be identified according to the registration data may include the following steps:
Registration data corresponding to the domain name of the website is acquired, and conversion logic corresponding to the registration data is selected from the conversion logic library. Specifically, the conversion logic library refers to a database storing conversion logic for converting registration data into a fixed format. The conversion logic refers to a rule for converting the registration data, for example replacing characters in the registration data with preset characters or deleting invalid characters. Further, when the webpage identification platform acquires a webpage whose identified risk level is greater than the preset level, it extracts the website domain name corresponding to that webpage from the webpage address, and then acquires the registration data corresponding to the webpage according to the website domain name. If the acquired registration data is not in the specified display format, the platform selects the conversion logic corresponding to the registration data from the conversion logic library according to the type of the registration data, so that the acquired registration data can be displayed in the specified display format.
For example, according to the extracted domain name of the website whose identified risk level is greater than the preset level, the webpage identification platform extracts the registration data corresponding to that domain name, such as a registration name, a registration mailbox, and a registration telephone. Suppose the registration name contains a space in the middle and the registration telephone contains a connector. According to the type of the registration data, the webpage identification platform selects from the conversion logic library the conversion logic that makes the registration name conform to the display rule, namely deleting the space in the registration name, and likewise selects the conversion logic that makes the registration telephone conform to the display rule, namely deleting the connector in the registration telephone.
The registration data is converted according to the conversion logic to obtain converted registration data. Specifically, once the webpage identification platform has selected the conversion logic, that is, the rule for converting the registration data (for example, replacing characters in the registration data with preset characters or deleting invalid characters), it converts the registration data according to that logic, and the converted registration data can be displayed in the specified display format. For example, the registration data includes a registration name, a registration mailbox, a registration telephone, and so on; the webpage identification platform selects the conversion logic for the registration name and for the registration telephone, deletes the invalid space character in the registration name according to the former, and deletes the connector in the registration telephone according to the latter.
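As a rough illustration of selecting and applying conversion logic by data type, the sketch below keeps one rule per type of registration data. The library layout, the field names `name` and `phone`, and the choice of a hyphen as the connector character are assumptions made for this example.

```python
# Hypothetical conversion logic library: type of registration data -> rule.
CONVERSION_LOGIC_LIBRARY = {
    "name": lambda value: value.replace(" ", ""),   # delete spaces in the name
    "phone": lambda value: value.replace("-", ""),  # delete connectors (assumed to be hyphens)
}


def convert_registration_data(registration_data):
    """Apply the conversion logic selected by type to each registration field.

    Fields with no rule in the library (for example the mailbox here) are
    left unchanged.
    """
    converted = {}
    for field, value in registration_data.items():
        rule = CONVERSION_LOGIC_LIBRARY.get(field)
        converted[field] = rule(value) if rule else value
    return converted
```

For instance, a record with the name `"Zhang San"` and the telephone `"0755-1234-5678"` would be converted to `"ZhangSan"` and `"075512345678"`, so that later matching is not defeated by formatting differences.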
The converted registration data is matched with the information data stored in the information repository. Specifically, the information repository refers to a database storing different registration information and the domain names associated with that registration information. The information repository may store registration names, registration mailboxes, registration telephones, and the like; the registration names, registration mailboxes, and registration telephones stored in the information repository may correspond to one another, and the information repository may also store the website domain names associated with the registration information. The information data is data describing the registrant of the associated domain name, and may be text data, digital data, picture data, and so on, for example a name, a telephone, a mailbox, or a photograph. Specifically, the webpage identification platform matches the acquired registration data with the information data stored in the information repository item by item. The registration data acquired by the webpage identification platform may be a registration name, a registration mailbox, and a registration telephone; the platform converts them according to the conversion logic to obtain a converted registration name, a converted registration mailbox, and a converted registration telephone, and then matches the converted registration name with the names stored in the information repository, the converted registration telephone with the telephones stored in the information repository, and the converted registration mailbox with the mailboxes stored in the information repository.
When the converted registration data is successfully matched with the information data stored in the information repository, the domain name associated with the successfully matched information data is acquired as the domain name to be identified. Specifically, the webpage identification platform matches the converted registration data with the information data stored in the information repository item by item; when corresponding information data is matched in the information repository, the domain name associated with the successfully matched information data is obtained and used as the domain name to be identified. The webpage identification platform may match every item of the registration data with the information data stored in the information repository, and acquire the domain name associated with the information data only when every item is successfully matched. In that case, the platform first matches the converted registration name with the names stored in the information repository; when that succeeds, it matches the registration mailbox with the mailbox stored for that name; when the mailbox also matches, it matches the registration telephone with the telephone stored for that name and mailbox; and when the telephone also matches, it extracts the domain name that the information repository associates with the matched name, telephone, and mailbox, and uses the extracted domain name as the domain name to be identified.
It should be noted that the webpage identification platform may also match only one item of the registration data with the information data stored in the information repository, and when the matching is successful, use the domain name associated with the successfully matched information data as the domain name to be identified. For example, the converted registration name is matched with the names stored in the information repository, and the domain name associated with a successfully matched name is directly extracted as the domain name to be identified.
It should be noted that, in this embodiment, the information repository may be a whois database, the web page identification platform obtains a domain name of a website whose identified risk level is greater than a preset level, and when obtaining registration data corresponding to the website according to the domain name, may match the registration data with information data stored in the whois database, and when matching is successful, obtain a domain name associated with the information data as a domain name to be identified.
In this embodiment, the web page identification platform converts the acquired registration data according to the conversion logic to obtain converted registration data that can be displayed according to the display rule, so as to improve the accuracy of identifying the associated domain name to be identified, and then performs matching according to the converted registration data and the information data stored in the information storage library, and when matching is successful, the domain name associated with the information data that is successfully matched is acquired as the domain name to be identified, and different domain names to be identified can be obtained according to the registration information, so as to improve the identification efficiency.
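The two matching modes described in this embodiment (every item must match, or any single item suffices) can be sketched as follows. The in-memory list standing in for the information repository, the field names, and the function name `match_domains` are hypothetical; a real WHOIS-backed repository would be queried differently.

```python
# Hypothetical information repository: each record links registration
# information to an associated domain name.
INFORMATION_REPOSITORY = [
    {"name": "ZhangSan", "email": "zs@example.com",
     "phone": "075512345678", "domain": "pay-a.example"},
    {"name": "ZhangSan", "email": "other@example.com",
     "phone": "075500000000", "domain": "pay-b.example"},
]


def match_domains(converted, repository=INFORMATION_REPOSITORY, require_all=True):
    """Return the domain names whose stored information data matches.

    require_all=True  -> name, mailbox and telephone must all match;
    require_all=False -> a match on any single provided item is enough.
    """
    fields = [f for f in ("name", "email", "phone") if f in converted]
    if not fields:
        return []
    combine = all if require_all else any
    return [record["domain"] for record in repository
            if combine(record.get(f) == converted[f] for f in fields)]
```

The strict mode returns fewer, more reliable candidates, while the single-item mode casts a wider net; which trade-off is appropriate is left open by the embodiment.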
In one example, the step of obtaining, according to the obtained webpage data, a webpage with a risk level corresponding to the domain name to be identified being greater than a preset level may include:
The webpage data is matched with first filtering data stored in a preset blacklist, and when the webpage data is successfully matched with the first filtering data, a suspicious label is added to the domain name to be identified. Specifically, the blacklist stores data with a risk level greater than the preset level, which may be text data, picture data, digital data, and the like. The first filtering data is data with a risk level greater than the preset level; when a webpage contains the first filtering data, the webpage may be a webpage with a risk level greater than the preset level. The first filtering data may be character data, picture data, digital data, and the like. The suspicious label is a mark indicating that the risk level of the domain name to be identified may be greater than the preset level. Specifically, after the webpage identification platform extracts webpage data from all webpages included in the website corresponding to the domain name to be identified, it matches the extracted webpage data with the first filtering data stored in the preset blacklist item by item. When any of the webpage data is successfully matched with any first filtering data stored in the blacklist, the webpage identification platform adds a suspicious label to the domain name to be identified corresponding to the website that contains the webpage from which the webpage data is derived.
It should be noted that a matching number threshold may also be set. That is, the webpage identification platform matches all acquired webpage data with the first filtering data stored in the blacklist item by item, and when the number of successful matches reaches the preset number, a suspicious label is added to the domain name to be identified corresponding to the website that contains the webpage from which the webpage data is derived. The matching number threshold may be preset to 1, 3, 4, and so on. If the webpage data of the webpages contained in the website corresponding to the domain name to be identified is successfully matched with the first filtering data in the blacklist the preset number of times, a suspicious label is added to the domain name to be identified.
The webpage data in the website corresponding to the domain name to be identified to which the suspicious label has been added is matched with second filtering data stored in a preset white list. Specifically, the white list refers to a database in which trusted data is stored. The trusted data refers to data with a risk level less than or equal to the preset level, and may be text data, picture data, digital data, and the like; for example, characters such as "lottery" may be stored. The second filtering data is data with a risk level less than or equal to the preset level, that is, trusted data; when a webpage contains the second filtering data, the website may be a trusted website. The second filtering data may be character data, picture data, digital data, and the like. Specifically, the webpage identification platform extracts the domain name to be identified to which the suspicious label has been added, and matches the webpage data on all webpages contained in the corresponding website with the second filtering data stored in the preset white list item by item. When that webpage data is successfully matched with the second filtering data pre-stored in the white list, the suspicious label carried by the domain name to be identified is deleted. It should be noted that the suspicious label may also be deleted only when a preset number of items of webpage data on the webpages of the website corresponding to the labelled domain name are successfully matched with the second filtering data stored in the preset white list.
When the webpage data is not successfully matched with the second filtering data, the domain name to be identified carrying the suspicious label is extracted, and the webpages in the website corresponding to the domain name to be identified are acquired as webpages with a risk level greater than the preset level. Specifically, when the webpage identification platform fails to match the webpage data contained in the website corresponding to the labelled domain name with the second filtering data, the domain name to be identified still carries the suspicious label. The webpage identification platform extracts the domain name to be identified that still carries the suspicious label, obtains the website corresponding to it, and extracts the webpages contained in that website as webpages with a risk level greater than the preset level.
In this embodiment, the webpage data is filtered first against the first filtering data stored in the blacklist and then against the second filtering data stored in the white list to obtain the webpages whose risk level is greater than the preset level. The second stage prevents webpages that are actually trusted from being identified as webpages with a risk level greater than the preset level, so the two-stage filtering improves the accuracy of identifying such webpages.
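The two-stage filtering can be illustrated with the sketch below, which assumes for simplicity that webpage data and filtering data are plain text and that "matching" means substring containment; the embodiment also allows picture and digital data, which are not covered here. The function name `is_high_risk` and the parameter `match_threshold` are hypothetical.

```python
def is_high_risk(page_texts, blacklist, whitelist, match_threshold=1):
    """Two-stage filter over the text of all webpages of one website.

    Stage 1: count blacklist entries (first filtering data) that occur in
             the webpage data; reaching the threshold adds the suspicious label.
    Stage 2: a white list hit (second filtering data) deletes the label.
    Returns True when the suspicious label survives both stages.
    """
    hits = sum(1 for term in blacklist
               if any(term in text for text in page_texts))
    suspicious = hits >= match_threshold
    if not suspicious:
        return False

    whitelisted = any(term in text
                      for term in whitelist for text in page_texts)
    return not whitelisted
```

A website whose pages match the blacklist but none of the white list entries is reported as high risk; raising `match_threshold` corresponds to the matching number threshold mentioned above.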
In one embodiment, the web page identification method may further include:
When no domain name to be identified carries the suspicious label after data identification against the preset blacklist and the preset white list, the identifier corresponding to the domain name to be identified is acquired. Specifically, the identifier refers to a mark specific to the website corresponding to the domain name to be identified; the identifier may be an enterprise identifier, for example an enterprise logo. Specifically, when the webpage identification platform has performed data identification, against the preset blacklist and the preset white list, on the webpage data of the webpages included in the websites corresponding to all obtained domain names to be identified, and none of the domain names to be identified carries a suspicious label, no webpage with a risk level greater than the preset level has been identified through the webpage data. The webpage identification platform then acquires the identifiers corresponding to the domain names to be identified.
The identifier is matched with the security identifiers pre-stored in a security identifier repository. Specifically, the security identifier repository refers to a database storing the identifiers of trusted websites and the website domain names corresponding to those identifiers. A security identifier is the mark of a trusted website and may be the enterprise mark of a secure webpage, for example the logo of a business bank webpage or the logo of the Ping An Group webpage. Specifically, the webpage identification platform matches the acquired identifier with the security identifiers pre-stored in the security identifier repository one by one. For example, the identifier corresponding to the domain name to be identified may be the Ping An Group logo, which is then matched with the security identifiers stored in the security identifier repository.
When the security identifier is successfully matched with the identifier corresponding to the domain name to be identified, the security domain name that the security identifier repository associates with the successfully matched security identifier is acquired and matched with the domain name to be identified. Specifically, when the webpage identification platform successfully matches the identifier corresponding to the domain name to be identified with a security identifier stored in the security identifier repository, the domain name to be identified may be a security domain name, and further matching is required. The webpage identification platform obtains the security domain name that the security identifier repository associates with the successfully matched security identifier, and matches that security domain name with the domain name to be identified. For example, when the identifier obtained for the domain name to be identified, the Ping An Group logo, is successfully matched with the Ping An Group logo stored in the security identifier repository, the domain name "pingan.com" associated with that logo in the security identifier repository is obtained, and the domain name to be identified is matched with the associated domain name "pingan.com".
When the security domain name and the domain name to be identified are not successfully matched, the webpages in the website corresponding to the domain name to be identified are taken as webpages with a risk level greater than the preset level. Specifically, when the webpage identification platform fails to match the domain name to be identified with the security domain name, the identifier corresponding to the domain name to be identified is a forged security identifier, and the webpages included in the website corresponding to the domain name to be identified are taken as webpages with a risk level greater than the preset level. For example, the identifier obtained for the domain name to be identified is the Ping An Group logo; since this logo is successfully matched with a security identifier stored in the security identifier repository, the associated domain name "pingan.com" is obtained from the repository, and if the domain name to be identified is not "pingan.com", the webpages in the website corresponding to the domain name to be identified are taken as webpages with a risk level greater than the preset level.
In this embodiment, when no suspicious domain name to be identified is obtained by identifying the webpage data, the domain name to be identified is further checked according to the identifier it carries, so that the webpages with a risk level greater than the preset level included in the website corresponding to the domain name to be identified are obtained. Adopting multiple identification methods in this way improves the accuracy of identifying webpages with a risk level greater than the preset level.
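The identifier check can be sketched as a lookup followed by a domain comparison. How two logos are compared is not specified in the embodiment; the sketch assumes each identifier has already been reduced to a comparable string key (the value `"pingan-logo"` is a made-up placeholder), and it compares domain names by exact equality.

```python
# Hypothetical security identifier repository: identifier key -> security domain name.
SECURITY_IDENTIFIER_REPOSITORY = {
    "pingan-logo": "pingan.com",
}


def uses_forged_identifier(domain_to_identify, identifier,
                           repository=SECURITY_IDENTIFIER_REPOSITORY):
    """Return True when a trusted identifier appears on a non-matching domain.

    If the identifier is not in the repository, this check yields no
    finding and returns False.
    """
    security_domain = repository.get(identifier)
    if security_domain is None:
        return False
    # Identifier matched: the domain must equal the associated security domain.
    return domain_to_identify != security_domain
```

Exact equality is the simplest possible comparison; handling subdomains of the security domain would need an additional rule that the embodiment does not describe.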
In one embodiment, the following steps may further be included after step S210, that is, after the step of obtaining, according to the acquired webpage data, the webpage corresponding to the domain name to be identified whose risk level is greater than the preset level:
Keywords are extracted from the webpage data of the webpages with a risk level greater than the preset level, and corresponding category labels are added, according to the keywords, to the domain names to be identified whose risk level is greater than the preset level. Specifically, the category label identifies the type of the webpage data and may be a label of a particular risk category; for example, the category label may be a bank category label, a shopping category label, and the like. Specifically, after the webpage identification platform identifies a webpage with a risk level greater than the preset level, it extracts the keywords of the webpage data and, according to the extracted keywords, adds the corresponding category label to the domain name to be identified associated with the website that contains the webpage. For example, for identified webpages with a risk level greater than the preset level, the webpage identification platform extracts the keywords "points" and "bank" from different webpages and, according to these keywords, adds the corresponding category label, namely a "bank" label or a "points" label, to the domain name to be identified associated with the website that contains the webpage.
The category label of the domain name to be identified whose risk level is greater than the preset level is matched with the stored category labels. Specifically, the webpage identification platform matches the category label added to the domain name to be identified with the category labels it has stored, one by one, until all stored category labels have been traversed. For example, if the labels added to the domain name to be identified are "bank" and "points", each of these labels is matched with the stored category labels one by one.
When the matching is not successful, the category label of the domain name to be identified whose risk level is greater than the preset level is added to the stored category labels, and the webpages with a risk level greater than the preset level are stored under that category label. Specifically, when the added category label is not successfully matched with any stored category label, it is a new category label. The unmatched category label is added to the stored category labels, and the webpages with a risk level greater than the preset level that are included in the website corresponding to the domain name to be identified carrying that label are stored under it. For example, the category labels added to the domain name to be identified are "bank" and "points"; each is matched with the stored category labels one by one. When the category label "bank" is not successfully matched, it is added to the stored category labels, and the webpages with a risk level greater than the preset level contained in the website corresponding to the domain name to be identified carrying the "bank" label are stored under that label.
It should be noted that the webpage identification platform may preset a time interval and, at that interval, send the updated category labels and the webpages with a risk level greater than the preset level corresponding to those category labels to the server for storage. For example, the updated category labels and the corresponding webpages with a risk level greater than the preset level are sent to the server for storage at a preset interval of one hour.
In this embodiment, the keywords of the web page data in the web page with the risk level greater than the preset level are extracted, the corresponding category label is added to the domain name to be identified with the risk level greater than the preset level according to the keywords, and then, if the added category label is not successfully matched with the stored category label, the added category label is added to the stored category label, and the web page with the risk level greater than the preset level is stored in the added category label, so that the stored category label is gradually expanded, and the applicability is enhanced.
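The category labelling and storage step can be sketched as below. The keyword-to-label table, the function name `store_by_category`, and the use of substring search for keyword extraction are assumptions for illustration. The embodiment only describes the case where a label is not yet stored; appending pages under an already stored label is an additional assumption of this sketch.

```python
# Hypothetical mapping from extracted keyword to category label.
KEYWORD_TO_LABEL = {"bank": "bank", "points": "points"}


def store_by_category(risky_pages, page_text, stored_labels):
    """Label high-risk pages by keyword and file them under category labels.

    stored_labels: dict mapping category label -> list of stored webpages.
    A label that is not yet stored is added as a new category; the pages
    are then stored under it.
    """
    labels = [label for keyword, label in KEYWORD_TO_LABEL.items()
              if keyword in page_text]
    for label in labels:
        # setdefault creates the category only when no stored label matches.
        stored_labels.setdefault(label, []).extend(risky_pages)
    return labels
```

Because new labels are added whenever no stored label matches, the set of stored categories grows over time, which is the gradual expansion this embodiment describes.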
In one embodiment, the webpage with the risk level greater than the preset level is a phishing webpage. When the webpage identification platform acquires an identified phishing webpage, it extracts the website domain name corresponding to the phishing webpage and then acquires, according to that domain name, the network address of the website corresponding to the phishing webpage. The webpage identification platform then searches for the domain name associated with the network address. To do so, it may match the network address of the website corresponding to the phishing webpage against the network addresses pre-stored in the address association library. When the matching is successful, the associated domain name to be matched that is associated with the pre-stored network address is acquired, and the effective deadline of the associated domain name to be matched is then acquired to judge whether the associated domain name to be matched is valid; that is, when the current time is less than or equal to the effective deadline, the associated domain name to be matched is extracted as the domain name to be identified. When the webpage identification platform finds the domain name associated with the network address, the associated domain name is taken as the domain name to be identified.
When the domain name associated with the network address is not found by the above method, registration data corresponding to the domain name of the website is acquired, and the corresponding domain name is queried according to the registration data and taken as the domain name to be identified. To do so, the webpage identification platform acquires the registration data corresponding to the domain name of the website corresponding to the phishing webpage, selects the conversion logic corresponding to the registration data from the conversion logic library, and converts the registration data according to the conversion logic to obtain converted registration data. It then matches the converted registration data with the information data stored in the information repository, and when the matching is successful, acquires the domain name associated with the successfully matched information data as the domain name to be identified. In other words, the domain name to be identified is first queried through the domain names associated with the network address of the website corresponding to the identified phishing webpage; when no domain name to be identified is found in this way, it is queried through the registration data corresponding to the domain name of that website. Querying twice in this manner ensures that associated domain names are not missed.
When the webpage identification platform obtains the domain name to be identified, it acquires the webpage data of the webpages contained in the website corresponding to the domain name to be identified and matches the webpage data with the first filtering data stored in the preset blacklist. When the matching is successful, a suspicious label is added to the domain name to be identified corresponding to the website that contains the webpage from which the webpage data is derived. The webpage data in the website corresponding to the labelled domain name is then matched with the second filtering data stored in the preset white list. When the matching with the second filtering data is not successful, the domain name to be identified carrying the suspicious label is extracted, and the webpages in the website corresponding to that domain name are taken as phishing webpages. Further, when no domain name to be identified carries a suspicious label after data matching against the preset blacklist and the preset white list, the identifier corresponding to the domain name to be identified, such as an enterprise logo, is obtained and matched with the security identifiers pre-stored in the security identifier repository. When the matching is successful, the security domain name that the security identifier repository associates with the successfully matched security identifier is obtained and matched with the domain name to be identified. When this matching is unsuccessful, the domain name to be identified is disguised as a security domain name, and the webpages in the website corresponding to the domain name to be identified are taken as phishing webpages. Whether the webpages contained in the website corresponding to the domain name to be identified are phishing webpages is thus determined by examining both the webpage data and the webpage identifier contained in that website, and this two-step detection based on webpage data and webpage identifier improves the accuracy of detecting phishing webpages.
Then, when a phishing webpage is identified, keywords are extracted from the webpage data of the phishing webpage, and a category label is added to the domain name to be identified corresponding to the phishing webpage according to the keywords. The category label is matched against the stored category labels; if the match fails, the category label of the domain name to be identified is added as a new category, and the phishing webpage is then stored under that category label.
In this embodiment, a single phishing webpage can be used to find multiple associated domain names to be identified, which improves efficiency and broadens applicability. Whether the webpages corresponding to those domain names are phishing webpages is judged by examining both the webpage data and the webpage identifiers of the corresponding websites, which makes the identification more accurate. The phishing webpages found are classified by category, which facilitates subsequent querying and pushing.
In one embodiment, referring to fig. 3, a schematic structural diagram of a web page recognition apparatus is provided, and the web page recognition apparatus 300 may include:
The first obtaining module 310 is configured to obtain an identified webpage whose risk level is greater than a preset level, and to extract the website domain name corresponding to the webpage.
The second obtaining module 320 is configured to obtain a network address corresponding to a website according to the website domain name.
The searching module 330 is configured to search for a domain name associated with the network address, and when the domain name associated with the network address is found, take the associated domain name as a domain name to be identified.
The third obtaining module 340 is configured to obtain web page data in a website corresponding to the domain name to be identified.
The identifying module 350 is configured to obtain, according to the acquired webpage data, the webpages corresponding to the domain name to be identified whose risk level is greater than the preset level.
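As a rough illustration of what the first and second obtaining modules do, the sketch below extracts the website domain name from a webpage URL and resolves it to a network address. The function names are assumptions made for this example, and the use of a plain DNS lookup through `socket.gethostbyname` is likewise an assumption; the original text does not say how the network address is obtained.

```python
import socket
from typing import Callable
from urllib.parse import urlsplit


def extract_site_domain(page_url: str) -> str:
    """Return the host part of a webpage URL (hypothetical helper)."""
    host = urlsplit(page_url).hostname
    if host is None:
        raise ValueError(f"no host found in URL: {page_url!r}")
    return host


def resolve_network_address(
    domain: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Resolve a domain name to an IPv4 address.

    The resolver is injectable so the step can be tested offline;
    the default performs a real DNS lookup.
    """
    return resolver(domain)
```

For instance, `extract_site_domain("http://Login.Example.com:8080/index.html")` yields the lower-cased host without the port, which can then be passed to `resolve_network_address`.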
In one embodiment, the lookup module 330 may include:
The first matching unit is configured to match the network address against the network addresses prestored in the address association library.
The domain name acquisition unit is configured to acquire, when the network address successfully matches a network address prestored in the address association library, the associated domain name to be matched that is associated with the prestored network address.
The time acquisition unit is configured to acquire the effective deadline of the associated domain name to be matched.
The extraction unit is configured to extract the associated domain name to be matched as the domain name to be identified if the current time is less than or equal to the effective deadline.
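The four units of the lookup module can be sketched as a single function. The in-memory dictionary standing in for the address association library, the `AssociatedDomain` record, and the function name are all assumptions made for illustration; the original text does not specify how the library is stored.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class AssociatedDomain:
    """One hypothetical entry in the address association library."""
    name: str
    valid_until: datetime  # the "effective deadline" of the association


def lookup_domains_to_identify(
    network_address: str,
    association_library: Dict[str, List[AssociatedDomain]],
    now: datetime,
) -> List[str]:
    """Return associated domain names whose effective deadline has not passed."""
    # Matching step: find entries prestored for this network address.
    entries = association_library.get(network_address, [])
    # Keep an entry only if the current time is <= its effective deadline.
    return [entry.name for entry in entries if now <= entry.valid_until]
```

Passing `now` explicitly, rather than reading the clock inside the function, makes the deadline comparison easy to check with fixed dates.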
In one embodiment, the web page identification apparatus may further include:
The query module is configured to acquire, when no domain name associated with the network address is found, the registration data corresponding to the website domain name, and to query for the corresponding domain name according to the registration data as the domain name to be identified.
In one embodiment, the query module may include:
The selection unit is configured to acquire the registration data corresponding to the website domain name and to select the conversion logic corresponding to the registration data from the conversion logic library.
The conversion unit is configured to convert the registration data according to the conversion logic to obtain converted registration data.
The second matching unit is configured to match the converted registration data against the information data stored in the information repository.
The to-be-identified domain name acquisition unit is configured to acquire, when the converted registration data successfully matches information data stored in the information repository, the domain name associated with the matched information data as the domain name to be identified.
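A minimal sketch of the query module follows. The original text does not say what the conversion logic does, so the example assumes it is a per-field normalisation (lower-casing an e-mail address, keeping only the digits of a phone number). The field names, the `CONVERSION_LOGIC` table, and the dictionary used as the information repository are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical conversion logic library: one normaliser per registration field.
CONVERSION_LOGIC: Dict[str, Callable[[str], str]] = {
    "email": lambda value: value.strip().lower(),
    "phone": lambda value: re.sub(r"\D", "", value),
}


def convert_registration(registration: Dict[str, str]) -> Dict[str, str]:
    """Apply the selected conversion logic to each known registration field."""
    return {
        field: CONVERSION_LOGIC[field](value)
        for field, value in registration.items()
        if field in CONVERSION_LOGIC
    }


def query_by_registration(
    registration: Dict[str, str],
    repository: Dict[Tuple[str, str], List[str]],
) -> List[str]:
    """Return domain names whose stored information data matches the converted data."""
    found: List[str] = []
    for field, value in convert_registration(registration).items():
        for domain in repository.get((field, value), []):
            if domain not in found:  # keep first-seen order, no duplicates
                found.append(domain)
    return found
```

Normalising before matching means that the same registrant is still found when the registration record and the repository format the same value differently.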
In one embodiment, the identification module 350 may further include:
The first filtering unit is configured to match the webpage data against the first filtering data stored in the preset blacklist, and to add a suspicious label to the domain name to be identified when the webpage data successfully matches the first filtering data.
The second filtering unit is configured to match the webpage data in the website corresponding to the domain name to be identified to which the suspicious label has been added against the second filtering data stored in the preset white list.
The label domain name acquisition unit is configured to extract, when the webpage data does not successfully match the second filtering data, the domain name to be identified carrying the suspicious label, and to take the webpages in the website corresponding to that domain name as webpages with a risk level greater than the preset level.
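The blacklist-then-white-list filtering can be sketched as below. The example assumes that "matching" means a plain substring test of filter terms against the page text, which is an assumption made for illustration only; the original text does not define the matching rule. The function name and the shape of `pages` are likewise hypothetical.

```python
from typing import Dict, Iterable, List


def flag_risky_domains(
    pages: Dict[str, str],
    blacklist: Iterable[str],
    whitelist: Iterable[str],
) -> List[str]:
    """Return domains whose page text hits the blacklist but not the white list.

    `pages` maps each domain name to be identified to its webpage text.
    """
    black_terms = list(blacklist)
    white_terms = list(whitelist)
    # First filter: a blacklist hit gives the domain the suspicious label.
    suspicious = [
        domain
        for domain, text in pages.items()
        if any(term in text for term in black_terms)
    ]
    # Second filter: a labelled domain stays flagged only if no white-list term matches.
    return [
        domain
        for domain in suspicious
        if not any(term in pages[domain] for term in white_terms)
    ]
```

Running the white list only over already labelled domains mirrors the order given in the text: the white list clears suspects, it does not vouch for pages that never hit the blacklist.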
In one embodiment, the web page recognition apparatus 300 may further include:
The identifier obtaining module is configured to obtain an identifier corresponding to the domain name to be identified when, after data matching against the preset blacklist and the preset white list, no domain name to be identified carrying the suspicious label exists.
The identifier matching module is configured to match the identifier against the security identifiers stored in advance in the security identifier repository.
The security domain name matching module is configured to acquire, when a security identifier successfully matches the identifier corresponding to the domain name to be identified, the security domain name associated with the matched security identifier in the security identifier repository, and to match the security domain name against the domain name to be identified.
The suspicious domain name extraction module is configured to take the webpages in the website corresponding to the domain name to be identified as webpages with a risk level greater than the preset level when the security domain name does not match the domain name to be identified.
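The identifier check can be sketched as follows. The original text does not specify how an identifier such as a logo is compared, so this example assumes an exact comparison of a SHA-256 digest of the logo bytes; a real system would more plausibly need an image-similarity measure. The function names and the dictionary used as the security identifier repository are hypothetical.

```python
import hashlib
from typing import Dict


def logo_fingerprint(logo_bytes: bytes) -> str:
    """Exact-match fingerprint of a logo image (an assumption for this sketch)."""
    return hashlib.sha256(logo_bytes).hexdigest()


def is_impersonating(
    domain: str,
    logo_bytes: bytes,
    security_identifiers: Dict[str, str],
) -> bool:
    """Return True when a known logo appears on a domain other than its owner's.

    `security_identifiers` maps a logo fingerprint to the security domain
    name associated with it.
    """
    safe_domain = security_identifiers.get(logo_fingerprint(logo_bytes))
    if safe_domain is None:
        # Unknown identifier: this check draws no conclusion.
        return False
    return domain.lower() != safe_domain.lower()
```

A page is flagged only when its identifier is recognised and its domain name differs from the one registered for that identifier, which matches the masquerading case described in the text.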
In one embodiment, the web page recognition apparatus 300 may further include:
The keyword extraction module is configured to extract keywords from the webpage data of a webpage with a risk level greater than the preset level, and to add a corresponding category label, according to the keywords, to the domain name to be identified corresponding to that webpage.
The label matching module is configured to match the category label of the domain name to be identified against the stored category labels.
The adding module is configured to add, when the match fails, the category label of the domain name to be identified as a new category label, and to store the webpage with a risk level greater than the preset level under that category label.
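The categorisation step can be sketched as below. The keyword-to-category table, the word-counting rule, and the function names are assumptions made for illustration; the original text only says that a category label is derived from extracted keywords and that a missing label is added before the webpage is stored under it.

```python
import re
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical keyword table; the original text does not list any keywords.
KEYWORD_CATEGORIES: Dict[str, str] = {
    "bank": "finance",
    "loan": "finance",
    "lottery": "gambling",
}


def categorize(page_text: str) -> Optional[str]:
    """Pick the category whose keywords occur most often in the page text."""
    words = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(KEYWORD_CATEGORIES[w] for w in words if w in KEYWORD_CATEGORIES)
    return counts.most_common(1)[0][0] if counts else None


def store_under_label(store: Dict[str, List[str]], label: str, page_url: str) -> None:
    """Add the label if it is not stored yet, then file the page under it."""
    store.setdefault(label, []).append(page_url)
```

Here `setdefault` covers both branches of the label matching: an existing label is reused, and an unmatched label is created on first use.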
For the specific limitations of the web page identification apparatus, reference may be made to the limitations of the web page identification method above, which are not repeated here.
In one embodiment, a computer device is provided, which may be a conventional terminal or any other suitable computer device, and whose internal structure may be as shown in fig. 4. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. When the processor executes the computer program, a web page identification method is implemented, comprising the following steps: acquiring an identified webpage whose risk level is greater than a preset level, and extracting the website domain name corresponding to the webpage; acquiring the network address corresponding to the website according to the website domain name; searching for a domain name associated with the network address and, when one is found, taking the associated domain name as the domain name to be identified; acquiring the webpage data in the website corresponding to the domain name to be identified; and obtaining, according to the acquired webpage data, the webpages corresponding to the domain name to be identified whose risk level is greater than the preset level.
In one embodiment, the step of searching for the domain name associated with the network address, implemented when the processor executes the computer program, may include: matching the network address against the network addresses prestored in the address association library; when the network address successfully matches a prestored network address, acquiring the associated domain name to be matched that is associated with the prestored network address; acquiring the effective deadline of the associated domain name to be matched; and, if the current time is less than or equal to the effective deadline, extracting the associated domain name to be matched as the domain name to be identified.
In one embodiment, the processor, when executing the computer program, further implements the following step: when no domain name associated with the network address is found, acquiring the registration data corresponding to the website domain name, and querying for the corresponding domain name according to the registration data as the domain name to be identified.
In one embodiment, the step of acquiring the registration data corresponding to the website domain name and querying for the corresponding domain name according to the registration data as the domain name to be identified, implemented when the processor executes the computer program, may include: acquiring the registration data corresponding to the website domain name, and selecting the conversion logic corresponding to the registration data from the conversion logic library; converting the registration data according to the conversion logic to obtain converted registration data; matching the converted registration data against the information data stored in the information repository; and, when the converted registration data successfully matches information data stored in the information repository, acquiring the domain name associated with the matched information data as the domain name to be identified.
In one embodiment, the step of obtaining, according to the acquired webpage data, the webpages corresponding to the domain name to be identified whose risk level is greater than the preset level, implemented when the processor executes the computer program, may include: matching the webpage data against the first filtering data stored in the preset blacklist, and adding a suspicious label to the domain name to be identified when the webpage data successfully matches the first filtering data; matching the webpage data in the website corresponding to the domain name to be identified to which the suspicious label has been added against the second filtering data stored in the preset white list; and, when the webpage data does not successfully match the second filtering data, extracting the domain name to be identified carrying the suspicious label, and taking the webpages in the website corresponding to that domain name as webpages with a risk level greater than the preset level.
In one embodiment, the steps implemented when the processor executes the computer program may further include: when, after data matching against the preset blacklist and the preset white list, no domain name to be identified carrying the suspicious label exists, acquiring the identifier corresponding to the domain name to be identified; matching the identifier against the security identifiers stored in advance in the security identifier repository; when a security identifier successfully matches the identifier corresponding to the domain name to be identified, acquiring the security domain name associated with the matched security identifier in the security identifier repository, and matching the security domain name against the domain name to be identified; and, when the security domain name does not match the domain name to be identified, taking the webpages in the website corresponding to the domain name to be identified as webpages with a risk level greater than the preset level.
In one embodiment, after the step of obtaining, according to the acquired webpage data, the webpages corresponding to the domain name to be identified whose risk level is greater than the preset level, the processor, when executing the computer program, may further implement the following steps: extracting keywords from the webpage data of a webpage with a risk level greater than the preset level, and adding a corresponding category label, according to the keywords, to the domain name to be identified corresponding to that webpage; matching the category label of the domain name to be identified against the stored category labels; and, when the match fails, adding the category label of the domain name to be identified as a new category label, and storing the webpage with a risk level greater than the preset level under that category label.
For the specific limitations on the computer device, reference may be made to the limitations on the web page identification method above, which are not repeated here.
In one embodiment, with continuing reference to fig. 4, a storage medium is provided on which a computer program is stored. When executed by a processor, the computer program implements the following steps: acquiring an identified webpage whose risk level is greater than a preset level, and extracting the website domain name corresponding to the webpage; acquiring the network address corresponding to the website according to the website domain name; searching for a domain name associated with the network address and, when one is found, taking the associated domain name as the domain name to be identified; acquiring the webpage data in the website corresponding to the domain name to be identified; and obtaining, according to the acquired webpage data, the webpages corresponding to the domain name to be identified whose risk level is greater than the preset level.
In one embodiment, the step of searching for the domain name associated with the network address, implemented when the computer program is executed by the processor, may include: matching the network address against the network addresses prestored in the address association library; when the network address successfully matches a prestored network address, acquiring the associated domain name to be matched that is associated with the prestored network address; acquiring the effective deadline of the associated domain name to be matched; and, if the current time is less than or equal to the effective deadline, extracting the associated domain name to be matched as the domain name to be identified.
In one embodiment, the computer program, when executed by the processor, further implements the following step: when no domain name associated with the network address is found, acquiring the registration data corresponding to the website domain name, and querying for the corresponding domain name according to the registration data as the domain name to be identified.
In one embodiment, the step of acquiring the registration data corresponding to the website domain name and querying for the corresponding domain name according to the registration data as the domain name to be identified, implemented when the computer program is executed by the processor, may include: acquiring the registration data corresponding to the website domain name, and selecting the conversion logic corresponding to the registration data from the conversion logic library; converting the registration data according to the conversion logic to obtain converted registration data; matching the converted registration data against the information data stored in the information repository; and, when the converted registration data successfully matches information data stored in the information repository, acquiring the domain name associated with the matched information data as the domain name to be identified.
In one embodiment, the step of obtaining, according to the acquired webpage data, the webpages corresponding to the domain name to be identified whose risk level is greater than the preset level, implemented when the computer program is executed by the processor, may include: matching the webpage data against the first filtering data stored in the preset blacklist, and adding a suspicious label to the domain name to be identified when the webpage data successfully matches the first filtering data; matching the webpage data in the website corresponding to the domain name to be identified to which the suspicious label has been added against the second filtering data stored in the preset white list; and, when the webpage data does not successfully match the second filtering data, extracting the domain name to be identified carrying the suspicious label, and taking the webpages in the website corresponding to that domain name as webpages with a risk level greater than the preset level.
In one embodiment, the steps implemented when the computer program is executed by the processor may further include: when, after data matching against the preset blacklist and the preset white list, no domain name to be identified carrying the suspicious label exists, acquiring the identifier corresponding to the domain name to be identified; matching the identifier against the security identifiers stored in advance in the security identifier repository; when a security identifier successfully matches the identifier corresponding to the domain name to be identified, acquiring the security domain name associated with the matched security identifier in the security identifier repository, and matching the security domain name against the domain name to be identified; and, when the security domain name does not match the domain name to be identified, taking the webpages in the website corresponding to the domain name to be identified as webpages with a risk level greater than the preset level.
In one embodiment, after the step of obtaining, according to the acquired webpage data, the webpages corresponding to the domain name to be identified whose risk level is greater than the preset level, the computer program, when executed by the processor, may further implement the following steps: extracting keywords from the webpage data of a webpage with a risk level greater than the preset level, and adding a corresponding category label, according to the keywords, to the domain name to be identified corresponding to that webpage; matching the category label of the domain name to be identified against the stored category labels; and, when the match fails, adding the category label of the domain name to be identified as a new category label, and storing the webpage with a risk level greater than the preset level under that category label.
For the specific limitations on the storage medium, reference may be made to the limitations on the web page identification method above, which are not repeated here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the related hardware, and that the computer program can be stored in a non-volatile computer-readable storage medium. The computer-readable storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or the like.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present invention. Their description is relatively specific and detailed, but it is not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (8)

1. A method for identifying a web page, comprising:
acquiring a webpage with the identified risk level being greater than a preset level, and extracting a website domain name corresponding to the webpage;
acquiring a network address corresponding to the website according to the website domain name;
searching a domain name associated with the network address, and when the domain name associated with the network address is found, taking the associated domain name as a domain name to be identified;
acquiring webpage data in a website corresponding to the domain name to be identified;
obtaining a webpage with a risk level corresponding to the domain name to be identified greater than a preset level according to the acquired webpage data; the method comprises the following steps: matching the webpage data with first filtering data stored in a preset blacklist, and adding a suspicious label to the domain name to be identified when the webpage data is successfully matched with the first filtering data; matching the webpage data in the website corresponding to the domain name to be identified added with the suspicious label with second filtering data stored in a preset white list; when the webpage data and the second filtering data are not matched successfully, extracting a domain name to be identified carrying a suspicious label, and acquiring a webpage in a website corresponding to the domain name to be identified as a webpage with a risk level greater than a preset level;
when the domain name to be identified carrying the suspicious label does not exist after the data identification is carried out on the preset blacklist and the preset white list, acquiring an identifier corresponding to the domain name to be identified; matching the identifier with a secure identifier pre-stored in a secure identity store; when the matching of the security identifier and the identifier corresponding to the domain name to be recognized is successful, acquiring the security domain name which is successfully matched and is stored in the security identifier storage bank and associated with the security identifier, and matching the security domain name with the domain name to be recognized; and when the matching between the safety domain name and the domain name to be identified is unsuccessful, taking the webpage in the website corresponding to the domain name to be identified as the webpage with the risk level higher than the preset level.
2. The method of claim 1, wherein the step of looking up the domain name associated with the network address comprises:
matching the network address with a network address prestored in an address association library;
when the network address is successfully matched with a network address prestored in the address association library, acquiring a domain name to be matched and associated with the prestored network address;
obtaining the effective deadline of the associated domain name to be matched;
and if the current time is less than or equal to the effective deadline time, extracting the associated domain name to be matched as the domain name to be identified.
3. The method of claim 1, further comprising:
and when the domain name associated with the network address is not found, acquiring registration data corresponding to the domain name of the website, and inquiring the corresponding domain name as the domain name to be identified according to the registration data.
4. The method according to claim 3, wherein the step of acquiring the registration data corresponding to the domain name of the website and querying the corresponding domain name as the domain name to be identified according to the registration data comprises:
acquiring registration data corresponding to the domain name of the website, and selecting conversion logic corresponding to the registration data from a conversion logic library;
converting the registration data according to the conversion logic to obtain converted registration data;
matching the converted registration data with information data stored in an information repository;
and when the converted registration data is successfully matched with the information data stored in the information storage library, acquiring the domain name associated with the successfully matched information data as the domain name to be identified.
5. The method according to claim 1, wherein after the step of obtaining the web page with the risk level corresponding to the domain name to be identified being greater than the preset level according to the obtained web page data, the method further comprises:
extracting keywords of the webpage data of the webpage with the risk level larger than the preset level, and adding a corresponding category label to the domain name to be identified corresponding to the webpage with the risk level larger than the preset level according to the keywords;
matching the class label of the domain name to be identified with the risk level larger than the preset level with the stored class label;
and when the matching is not successful, adding the category label of the domain name to be identified with the risk level greater than the preset level, and storing the webpage with the risk level greater than the preset level under the category label.
6. An apparatus for identifying a web page, the apparatus comprising:
the first acquisition module is used for acquiring the identified webpage with the risk level greater than the preset level and extracting the website domain name corresponding to the webpage;
the second acquisition module is used for acquiring the network address corresponding to the website according to the website domain name;
the searching module is used for searching the domain name associated with the network address, and when the domain name associated with the network address is searched, the associated domain name is used as the domain name to be identified;
the third acquisition module is used for acquiring webpage data in a website corresponding to the domain name to be identified;
the identification module is used for obtaining a webpage, corresponding to the domain name to be identified, of which the risk level is greater than a preset level according to the acquired webpage data;
the identification module comprises:
the first filtering unit is used for matching the webpage data with first filtering data stored in a preset blacklist, and when the webpage data is successfully matched with the first filtering data, adding a suspicious label to the domain name to be identified;
the second filtering unit is used for matching the webpage data in the website corresponding to the domain name to be identified added with the suspicious label with second filtering data stored in a preset white list;
a label domain name obtaining unit, configured to, when the webpage data is not successfully matched with the second filtered data, extract a domain name to be identified that carries a suspicious label, and obtain a webpage in a website corresponding to the domain name to be identified as a webpage with a risk level greater than a preset level;
the identifier acquisition module is used for acquiring an identifier corresponding to the domain name to be identified when the domain name to be identified carrying the suspicious label does not exist after data identification is carried out on a preset blacklist and a preset white list;
the identifier matching module is used for matching the identifier with a security identifier stored in a security identifier storage library in advance;
a security domain matching module, configured to, when the security identifier is successfully matched with the identifier corresponding to the domain name to be recognized, obtain a security domain name associated with the security identifier, which is stored in a security identifier repository and is successfully matched with the security identifier, and match the security domain name with the domain name to be recognized;
and the suspicious domain name extraction module is used for taking a webpage in a website corresponding to the domain name to be identified as a webpage with a risk level higher than a preset level when the matching between the safety domain name and the domain name to be identified is unsuccessful.
7. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.
8. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, performing the steps of the method of any one of claims 1 to 5.
CN201711297266.7A 2017-12-08 2017-12-08 Webpage identification method and device, computer equipment and storage medium Active CN108092963B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201711297266.7A CN108092963B (en) 2017-12-08 2017-12-08 Webpage identification method and device, computer equipment and storage medium
PCT/CN2018/077064 WO2019109529A1 (en) 2017-12-08 2018-02-23 Webpage identification method, device, computer apparatus, and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711297266.7A CN108092963B (en) 2017-12-08 2017-12-08 Webpage identification method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108092963A CN108092963A (en) 2018-05-29
CN108092963B true CN108092963B (en) 2020-05-08

Family

ID=62174944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711297266.7A Active CN108092963B (en) 2017-12-08 2017-12-08 Webpage identification method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN108092963B (en)
WO (1) WO2019109529A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110865818B (en) * 2018-08-28 2023-07-28 阿里巴巴(中国)有限公司 Detection method and device for application associated domain name and electronic equipment
CN110033092B (en) * 2019-01-31 2020-06-02 阿里巴巴集团控股有限公司 Data label generation method, data label training device, event recognition method and event recognition device
CN110012030A (en) * 2019-04-23 2019-07-12 北京微步在线科技有限公司 Method and device for detecting hackers by association
CN110266661B (en) * 2019-06-04 2021-09-14 东软集团股份有限公司 Authorization method, device and equipment
CN110958244A (en) * 2019-11-29 2020-04-03 北京邮电大学 Method and device for detecting counterfeit domain name based on deep learning
CN113098859B (en) * 2021-03-30 2023-03-31 深圳市欢太科技有限公司 Webpage page rollback method, device, terminal and storage medium
CN113923193B (en) * 2021-10-27 2023-11-28 北京知道创宇信息技术股份有限公司 Network domain name association method and device, storage medium and electronic equipment
CN114900363B (en) * 2022-05-18 2024-05-14 杭州安恒信息技术股份有限公司 Malicious website identification method and device, electronic equipment and storage medium
CN116708356B (en) * 2023-08-02 2023-11-14 苏州迈科网络安全技术股份有限公司 IP feature library generation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102096781A (en) * 2011-01-18 2011-06-15 南京邮电大学 Phishing detection method based on webpage relevance
CN102724187A (en) * 2012-06-06 2012-10-10 奇智软件(北京)有限公司 Method and device for safety detection of universal resource locators
CN102739653A (en) * 2012-06-06 2012-10-17 奇智软件(北京)有限公司 Detection method and device aiming at webpage address
CN105338001A (en) * 2015-12-04 2016-02-17 北京奇虎科技有限公司 Method and device for recognizing phishing website
CN106302438A (en) * 2016-08-11 2017-01-04 国家计算机网络与信息安全管理中心 Method for actively monitoring phishing websites through multiple channels based on behavioral features

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8869269B1 (en) * 2008-05-28 2014-10-21 Symantec Corporation Method and apparatus for identifying domain name abuse
CN102523210B (en) * 2011-12-06 2014-11-05 中国科学院计算机网络信息中心 Phishing website detection method and device
CN102663000B (en) * 2012-03-15 2016-08-03 北京百度网讯科技有限公司 Method for building a malicious URL database, and method and device for identifying malicious URLs
CN105718577B (en) * 2016-01-22 2020-01-21 中国互联网络信息中心 Method and system for automatically detecting phishing aiming at newly added domain name

Also Published As

Publication number Publication date
WO2019109529A1 (en) 2019-06-13
CN108092963A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
CN108092963B (en) Webpage identification method and device, computer equipment and storage medium
Rao et al. Jail-Phish: An improved search engine based phishing detection system
CN110099059B (en) Domain name identification method and device and storage medium
CN103973651B (en) Method and device for setting and querying account password identifiers based on a salted password library
US11165793B2 (en) Method and system for detecting credential stealing attacks
CN112804210B (en) Data association method and device, electronic equipment and computer-readable storage medium
US20180131708A1 (en) Identifying Fraudulent and Malicious Websites, Domain and Sub-domain Names
CN110035075A (en) Phishing website detection method and device, computer equipment and storage medium
CN108718341B (en) Method for sharing and searching data
CN103067347B (en) Method for detecting phishing website and network device thereof
CN116366338B (en) Risk website identification method and device, computer equipment and storage medium
CN110572359A (en) Phishing webpage detection method based on machine learning
CN112333185B (en) Domain name shadow detection method and device based on DNS (Domain name Server) resolution
CN102882889A (en) Method and system for concentrated IP (Internet Protocol) collection and identification of phishing websites
US8910281B1 (en) Identifying malware sources using phishing kit templates
CN112751804B (en) Method, device and equipment for identifying counterfeit domain name
CN106682146B (en) Method and system for retrieving scenic spot evaluation according to keywords
KR101099537B1 (en) System for sorting phising site base on searching web site and method therefor
CN105530251A (en) Method and device for identifying phishing website
CN105320691A (en) Account information recognition method and device
US9160807B2 (en) System and method for deriving a name for association with a device
CN108418809A (en) Chat data processing method, device, computer equipment and storage medium
CN115794780A (en) Method and device for collecting network space assets, electronic equipment and storage medium
CN103716419B (en) Cross-terminal domain name processing method and system
CN107332856B (en) Address information detection method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant