CN115757991A

CN115757991A - Webpage identification method and device, electronic equipment and storage medium

Info

Publication number: CN115757991A
Application number: CN202111025311.XA
Authority: CN
Inventors: 黄晨晖; 林初仁; 李晶
Original assignee: Guangzhou Tencent Technology Co Ltd
Current assignee: Guangzhou Tencent Technology Co Ltd
Priority date: 2021-09-02
Filing date: 2021-09-02
Publication date: 2023-03-07

Abstract

The present application relates to the field of computer technologies, and in particular to a webpage identification method and apparatus, an electronic device, and a storage medium, which are used to improve the accuracy of webpage identification. The method comprises: acquiring a target URL of a webpage to be detected and a corresponding target HTML file; performing feature matching between the webpage to be detected and a specified category sample library based on the target URL and the target HTML file; if the matching fails, extracting a URL feature of the target URL and an HTML feature of the target HTML file, and fusing the URL feature and the HTML feature to obtain a webpage fusion feature corresponding to the webpage to be detected; and performing classification prediction on the webpage to be detected based on the webpage fusion feature to obtain a classification recognition result of the webpage to be detected. By combining the URL feature and the HTML feature of the webpage to be detected, the present application balances a sufficiently rich amount of information against low complexity, and effectively improves the accuracy of webpage identification.

Description

Webpage identification method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for identifying a web page, an electronic device, and a storage medium.

Background

With the continuous development of the internet industry, networks have become an indispensable part of people's lives. At the same time, however, the number of malicious web pages is growing rapidly year by year. How to identify malicious web pages quickly and effectively has become one of the cyberspace security problems to be solved.

In the related art, schemes that use only Uniform Resource Locator (URL) features have a limited amount of information. Schemes that use only static webpage features do not effectively mine the other information in those static features, and easily cause misjudgment. Malicious webpage identification schemes that use only dynamic webpage features need to simulate the rendering behavior of a browser, and are not suitable for large-scale background application.

In conclusion, the solutions in the related art suffer from either an insufficient amount of feature information or high complexity, and their webpage identification accuracy is not high.

Disclosure of Invention

The embodiment of the application provides a webpage identification method, a webpage identification device, electronic equipment and a storage medium, and aims to improve the identification accuracy of a webpage.

The webpage identification method provided by the embodiment of the application comprises the following steps:

acquiring a target URL of a webpage to be detected and a corresponding target HyperText Markup Language (HTML) file;

performing feature matching on the webpage to be detected and a specified category sample library based on the target URL and the target HTML file;

if the matching fails, extracting the URL feature of the target URL and the HTML feature of the target HTML file, and performing feature fusion on the URL feature and the HTML feature to obtain a webpage fusion feature corresponding to the webpage to be detected;

and performing classification prediction on the web pages to be detected based on the web page fusion characteristics to obtain a classification recognition result of the web pages to be detected.
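The four steps above can be sketched as a small control flow. This is only an illustrative sketch: the function name `identify_webpage` and all four callables are hypothetical placeholders, and feature fusion is assumed here to be plain concatenation, which the text does not specify.

```python
from typing import Callable, Optional, Sequence


def identify_webpage(
    url: str,
    html: str,
    match_sample_library: Callable[[str, str], Optional[str]],
    extract_url_features: Callable[[str], Sequence[float]],
    extract_html_features: Callable[[str], Sequence[float]],
    classify: Callable[[Sequence[float]], str],
) -> str:
    """Return a class label for the page (all callables are hypothetical)."""
    # Step 1: feature matching against the specified category sample library.
    matched_label = match_sample_library(url, html)
    if matched_label is not None:
        # Matching succeeded: reuse the label of the matched sample.
        return matched_label
    # Step 2: matching failed, so extract both feature groups.
    url_features = list(extract_url_features(url))
    html_features = list(extract_html_features(html))
    # Step 3: feature fusion, assumed here to be simple concatenation.
    fused = url_features + html_features
    # Step 4: classification prediction on the fused feature vector.
    return classify(fused)
```

The matcher is expected to return `None` when matching fails, so the more expensive feature extraction only runs for pages that the sample library does not already cover.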

The embodiment of the application provides a webpage identification device, which comprises:

the acquisition unit is used for acquiring a target URL of the webpage to be detected and a corresponding target HTML file;

the matching unit is used for performing feature matching on the webpage to be detected and a specified category sample library based on the target URL and the target HTML file;

the feature fusion unit is used for extracting the URL feature of the target URL and the HTML feature of the target HTML file if matching fails, and performing feature fusion on the URL feature and the HTML feature to obtain a webpage fusion feature corresponding to the webpage to be detected;

and the identification unit is used for carrying out classification prediction on the web pages to be detected based on the web page fusion characteristics to obtain a classification identification result of the web pages to be detected.

Optionally, the specified category sample library includes a URL sample library and an HTML sample library; the matching unit is specifically configured to:

comparing the prefix information of the target URL with the prefix information of each candidate URL in the URL sample library;

and if the candidate URL with the same prefix information as the target URL does not exist in the URL sample library, comparing the similarity of the target HTML file of the webpage to be detected with each candidate HTML file in the HTML sample library to obtain the file similarity corresponding to each candidate HTML file.

Optionally, the matching unit is further configured to:

if the candidate URL with the same prefix information as the target URL exists in the URL sample library, determining that the matching is successful; or if the HTML sample library has candidate HTML files with the file similarity higher than the similarity threshold value with the target HTML file, determining that the matching is successful;

and if the candidate HTML file with the file similarity higher than the similarity threshold value with the target HTML file does not exist in the HTML sample library, determining that the matching fails.
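The two-stage matching described above can be sketched as follows. The text does not define "prefix information", so `url_prefix` assumes it is the host plus path with the query and fragment removed; the `threshold` default and the `file_similarity` callable are likewise hypothetical.

```python
from typing import Callable, Iterable, Set
from urllib.parse import urlsplit


def url_prefix(url: str) -> str:
    # Assumption: "prefix information" = lowercased host + path,
    # with the query string and fragment dropped.
    parts = urlsplit(url)
    return (parts.hostname or "") + parts.path


def match_page(
    url: str,
    html: str,
    url_prefixes: Set[str],
    html_samples: Iterable[str],
    file_similarity: Callable[[str, str], float],
    threshold: float = 0.9,  # hypothetical similarity threshold
) -> bool:
    """Return True if matching succeeds, False if it fails."""
    # Stage 1: exact comparison of URL prefix information.
    if url_prefix(url) in url_prefixes:
        return True
    # Stage 2: only reached when no URL prefix matched; compare the HTML
    # file against every candidate and test against the threshold.
    return any(file_similarity(html, sample) > threshold for sample in html_samples)
```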

Optionally, the matching unit is specifically configured to:

respectively acquiring the tag similarity between the tags of the target HTML file and the tags of each candidate HTML file, the text similarity between the text of the target HTML file and the text of each candidate HTML file, and the style similarity between the cascading style (CSS) attributes of the target HTML file and those of each candidate HTML file;

and, for each candidate HTML file, performing a weighted summation of its tag similarity, text similarity and style similarity with the corresponding similarity weights to determine the file similarity corresponding to that candidate HTML file.
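A minimal sketch of the weighted summation follows. The text does not say how the three individual similarities are computed or what the weights are, so the Jaccard helper and the default weights below are assumptions made purely for illustration.

```python
from typing import Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    # Assumed per-aspect similarity: set overlap of tags, words, or CSS attributes.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def file_similarity(
    tag_sim: float,
    text_sim: float,
    style_sim: float,
    weights: Tuple[float, float, float] = (0.4, 0.4, 0.2),  # hypothetical weights
) -> float:
    """Weighted sum of the tag, text and style similarities."""
    w_tag, w_text, w_style = weights
    return w_tag * tag_sim + w_text * text_sim + w_style * style_sim
```

If the weights sum to 1 and each similarity lies in [0, 1], the file similarity also lies in [0, 1], which makes a single similarity threshold easy to interpret.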

Optionally, the feature fusion unit is specifically configured to:

extracting a first key information feature, a domain name feature and a first statistical distribution feature of the target URL, and combining the first key information feature, the domain name feature and the first statistical distribution feature as the URL feature;

and extracting a second key information feature, a character feature, a webpage structure feature and a second statistical distribution feature of the target HTML file, and combining the second key information feature, the character feature, the webpage structure feature and the second statistical distribution feature as the HTML feature.

Optionally, the first statistical distribution characteristic at least includes information entropy; the feature fusion unit is specifically configured to determine the information entropy of the target URL by:

determining URL character frequency corresponding to the target URL based on the occurrence frequency of each character in the target URL;

and determining the information entropy of the target URL based on the URL character frequency.
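The two steps above correspond to the usual Shannon entropy over character frequencies. A minimal sketch, assuming a base-2 logarithm (the text does not state the base) and a hypothetical function name:

```python
import math
from collections import Counter


def url_entropy(url: str) -> float:
    """Shannon entropy (in bits) of the character distribution of a URL."""
    if not url:
        return 0.0
    counts = Counter(url)   # occurrence count of each character
    total = len(url)
    # URL character frequency p = count / total; entropy = -sum(p * log2(p)).
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A URL made of one repeated character has entropy 0, while a URL whose characters are all distinct reaches the maximum for its length, which is the intuition behind using entropy as a randomness signal.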

Optionally, the first statistical distribution characteristic comprises at least a relative entropy; the feature fusion unit is specifically configured to determine the relative entropy of the target URL by:

determining URL character frequency corresponding to the target URL based on the occurrence frequency of each character in the target URL; determining the frequency of standard English characters corresponding to the target URL based on the occurrence frequency of the standard English characters in the target URL;

and determining the relative entropy of the target URL based on the URL character frequency and the standard English character frequency.
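One way to read the steps above is as a KL divergence between the URL's character distribution and a reference English character distribution. The sketch below takes the reference frequencies as a parameter rather than hard-coding any table; the function name, the restriction to characters present in the reference, and the `eps` floor are assumptions.

```python
import math
from collections import Counter
from typing import Dict


def url_relative_entropy(url: str, reference_freq: Dict[str, float], eps: float = 1e-6) -> float:
    """KL divergence D(P || Q) in bits, P = URL characters, Q = reference."""
    # Assumption: only characters covered by the reference table are counted.
    chars = [ch for ch in url.lower() if ch in reference_freq]
    if not chars:
        return 0.0
    counts = Counter(chars)
    total = len(chars)
    divergence = 0.0
    for ch, count in counts.items():
        p = count / total                   # URL character frequency
        q = max(reference_freq[ch], eps)    # reference frequency, floored to avoid log(0)
        divergence += p * math.log2(p / q)
    return divergence
```

With a toy reference `{"a": 0.5, "b": 0.5}`, the string `"abab"` matches the reference exactly and yields 0, while `"aaaa"` yields 1 bit.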

Optionally, the first statistical distribution characteristic at least includes a spatial distribution characteristic value, where URLs of the specified category and URLs of the non-specified category differ in spatial distribution and therefore have different spatial distribution characteristic values;

the feature fusion unit is specifically configured to determine a spatial distribution feature value of the target URL by:

determining the frequency of standard English characters corresponding to the target URL based on the occurrence frequency of the standard English characters in the target URL;

and determining the space distribution characteristic value of the target URL by taking the frequency of the standard English characters as a reference.
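The text does not give the formula for the spatial distribution characteristic value. Since the description lists the Kolmogorov-Smirnov test as the tool for checking whether a URL conforms to a distribution, one plausible sketch is a KS-style statistic: the largest gap between the cumulative character frequencies of the URL and those of the reference. Everything below (the function name, the alphabet ordering, and the use of this statistic at all) is an assumption.

```python
from collections import Counter
from typing import Dict


def ks_statistic(url: str, reference_freq: Dict[str, float]) -> float:
    """Max absolute gap between cumulative URL and reference frequencies."""
    # Assumption: characters outside the reference table are ignored.
    chars = [ch for ch in url.lower() if ch in reference_freq]
    if not chars:
        return 0.0
    counts = Counter(chars)
    total = len(chars)
    cum_url = 0.0
    cum_ref = 0.0
    largest_gap = 0.0
    # Walk the alphabet in a fixed order and compare the running sums.
    for ch in sorted(reference_freq):
        cum_url += counts.get(ch, 0) / total
        cum_ref += reference_freq[ch]
        largest_gap = max(largest_gap, abs(cum_url - cum_ref))
    return largest_gap
```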

Optionally, the first statistical distribution characteristic includes at least an identification probability; the feature fusion unit is specifically configured to determine the recognition probability of the target URL by:

preliminarily dividing the target URL according to the designated symbols to obtain at least two target URL texts;

performing secondary segmentation on the at least two target URL texts based on a sample word list to obtain a preprocessed URL text, wherein the sample word list comprises a plurality of designated words, and each designated word comprises at least one character;

and based on the trained deep learning model, carrying out classification prediction on the preprocessed URL text to obtain the recognition probability.

Optionally, the feature fusion unit is specifically configured to: when the at least two target URL texts are subjected to secondary segmentation based on the sample word list to obtain the preprocessed URL texts, the following operations are executed for each target URL text:

taking the target URL text as a text to be segmented;

determining the longest specified word in the text to be segmented that begins with the character at the specified position;

segmenting the text to be segmented into: a first sub-text including the longest specified word, a second sub-text including remaining characters other than the longest specified word;

and taking the second sub-text as the text to be segmented, and returning to the step of determining the longest specified word beginning with the character at the specified position in the text to be segmented until the target URL text is completely segmented, so as to obtain the preprocessed URL text corresponding to the target URL text.
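The loop above is essentially forward maximum matching against a word list. The sketch below assumes the "specified position" is the first character of the remaining text, that an unmatched character is emitted on its own, and that the "designated symbols" of the preliminary split are common URL punctuation; these choices and all names are hypothetical.

```python
import re
from typing import List, Set


def preliminary_split(url: str) -> List[str]:
    # Assumption: these punctuation characters are the "designated symbols".
    return [part for part in re.split(r"[/.?=&:_\-]+", url) if part]


def secondary_split(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-prefix segmentation of one URL text."""
    longest = max([len(word) for word in vocab] + [1])
    pieces: List[str] = []
    remaining = text
    while remaining:
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(longest, len(remaining)), 0, -1):
            head = remaining[:size]
            if head in vocab or size == 1:
                pieces.append(head)            # first sub-text
                remaining = remaining[size:]   # second sub-text, segmented next
                break
    return pieces
```

For example, with the word list `{"pay", "login", "secure", "log", "in"}`, the text `"securelogin"` is segmented into `["secure", "login"]` rather than `["secure", "log", "in"]`, because the longest word is always preferred.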

Optionally, the second statistical distribution feature at least includes a JS script entropy; the JS script of an HTML file of the specified category contains a target action, and the entropy of the JS script of an HTML file of the specified category differs from that of the JS script of an HTML file of the non-specified category.

Optionally, the second statistical distribution feature at least includes an HTML tag feature; the feature fusion unit is specifically configured to determine the HTML tag feature of the target HTML file in the following manner:

performing depth-first traversal on the DOM tree of the target HTML file to extract a tag vector;

and performing label classification prediction on the label vector based on the trained decision tree model, and taking the obtained prediction probability as the HTML label feature of the target HTML file.
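The traversal step can be sketched with the standard-library HTML parser. The sketch builds a small tree and walks it depth-first, returning the tag names in visiting order; how that sequence becomes a numeric tag vector, and the decision tree model itself, are not specified in the text and are left out. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List

# Elements that never have a closing tag or children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag: str) -> None:
        self.tag = tag
        self.children: List["Node"] = []


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree from start and end tags."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def dfs_tags(html: str) -> List[str]:
    """Return tag names in depth-first (pre-order) traversal order."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    tags: List[str] = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        if node.tag != "#document":
            tags.append(node.tag)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return tags
```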

Optionally, the second statistical distribution feature at least includes a web page text feature; the feature fusion unit is specifically configured to determine the web page text feature of the target HTML file in the following manner:

performing depth-first traversal on the DOM tree of the target HTML file, and taking the text in the target HTML file inquired through traversal as a target object;

and performing text classification on the target object based on the trained text classification model, and taking the obtained classification probability as the webpage text feature.
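Collecting the page text can be sketched with the standard-library HTML parser, whose callbacks fire in document order, which is the order a depth-first traversal of the DOM visits text nodes. Skipping the contents of `script` and `style` elements is an assumption, and the text classification model is outside the scope of the sketch.

```python
from html.parser import HTMLParser
from typing import List


class TextCollector(HTMLParser):
    """Collects visible text chunks in document (depth-first) order."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0          # > 0 while inside <script> or <style>
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_text(html: str) -> List[str]:
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.chunks
```

The resulting list of text chunks would then be joined or tokenized and passed to whatever text classification model is used.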

Optionally, the identification unit is specifically configured to:

inputting the webpage fusion characteristics into a trained webpage recognition model, and performing classification prediction on the webpage to be detected based on the trained webpage recognition model to obtain a classification recognition result;

the webpage recognition model is trained by extreme gradient boosting (XGBoost) on a training sample data set; the training sample data set comprises positive samples of the non-specified category and negative samples of the specified category, and the weight corresponding to the negative samples is higher than the weight corresponding to the positive samples.
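The effect of weighting negative samples more heavily can be illustrated without any boosting library. The sketch below follows the labeling in the text (label 1 = positive, non-specified category; label 0 = negative, specified category); the weight values and function names are hypothetical, and the loss shown is an ordinary weighted log loss rather than the exact training objective.

```python
import math
from typing import List, Sequence


def sample_weights(labels: Sequence[int], negative_weight: float = 3.0,
                   positive_weight: float = 1.0) -> List[float]:
    # Negative samples (label 0, specified category) get the larger weight.
    return [negative_weight if y == 0 else positive_weight for y in labels]


def weighted_log_loss(labels: Sequence[int], probs: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Weighted binary log loss; probs are predicted P(label == 1)."""
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)
```

With these weights, misclassifying a specified-category page costs more than misclassifying a normal page, so training is pushed toward fewer missed detections. In a gradient boosting library, such a list would typically be supplied as the per-sample weight argument at training time.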

Optionally, the apparatus further comprises:

and the determining unit is used for taking the class label of the specified class sample obtained by matching as the classification and identification result of the webpage to be detected if the matching is successful.

An electronic device provided by an embodiment of the present application includes a processor and a memory, where the memory stores a program code, and when the program code is executed by the processor, the processor is caused to execute any one of the steps of the above-mentioned web page identification method.

Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the steps of any of the above-mentioned web page identification methods.

An embodiment of the present application provides a computer-readable storage medium, which includes program code for causing an electronic device to perform the steps of any of the above-mentioned web page identification methods when the program code runs on the electronic device.

The beneficial effect of this application is as follows:

The embodiments of the present application provide a webpage identification method and apparatus, an electronic device, and a storage medium. The method combines the URL and the HTML of the webpage to be detected. Feature matching is first performed between the webpage to be detected and a specified category sample library based on its URL and HTML, to determine whether a URL or HTML file of the specified category that matches the webpage exists. If no match is found, the extracted URL feature and HTML feature are fused, classification is performed on the resulting webpage fusion feature, and the final classification recognition result is determined. The present application thus balances a sufficiently rich amount of information against low complexity, and effectively improves the accuracy of webpage identification.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is an alternative schematic diagram of an application scenario in an embodiment of the present application;

fig. 2 is a schematic application diagram of an enterprise mailbox product side in an embodiment of the present application;

fig. 3 is a schematic flowchart of a web page identification method in an embodiment of the present application;

fig. 4 is a schematic flowchart of a feature matching method in an embodiment of the present application;

fig. 5 is a schematic diagram of an overall framework in an embodiment of the present application;

fig. 6 is a schematic flow chart of a feature extraction method in an embodiment of the present application;

FIG. 7 is a schematic diagram illustrating a flow chart of a secondary segmentation method in an embodiment of the present application;

FIG. 8 is a diagram illustrating an online query process in an embodiment of the present application;

fig. 9 is a schematic flowchart of a malicious web page identification method in an embodiment of the present application;

fig. 10 is a schematic structural diagram illustrating a composition of a web page recognition apparatus according to an embodiment of the present application;

fig. 11 is a schematic diagram of a hardware component structure of an electronic device to which an embodiment of the present application is applied;

fig. 12 is a schematic diagram of a hardware component structure of another electronic device to which the embodiment of the present application is applied.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.

Some concepts related to the embodiments of the present application are described below.

Kolmogorov-Smirnov test (KS test): a non-parametric hypothesis test, mainly used to test whether a set of samples comes from a given statistical distribution, or to compare whether two sets of samples have the same distribution. In the embodiments of the present application, it is mainly used to check whether the URL of a web page conforms to a certain distribution.

URL: is a compact representation of the location and access method of resources available from the internet, and is the address of a standard resource on the internet. Each file on the internet has a unique URL that contains information indicating the location of the file and how the browser should handle it. In an embodiment of the present application, the URL feature at least includes: key information features, domain name features, URL statistical distribution features.

HTML: an application of the Standard Generalized Markup Language (SGML). "Hypertext" means that a page may contain non-text elements such as pictures, links, and even music and programs. The structure of an HTML document includes a "head" part, which provides information about the web page, and a "body" part, which provides the specific content of the web page. In the embodiments of the present application, the HTML feature includes at least: key information features, character features, web page structure features, and HTML statistical distribution features.

Information entropy: in information theory, entropy is the average amount of information contained in each message received. Information entropy is a rather abstract mathematical concept; it can be understood as follows: the more ordered a system is, the lower its information entropy, and conversely, the more chaotic a system is, the higher its information entropy. Entropy can therefore be regarded as a measure of the degree of order of a system. In the present application, information entropy is one of the URL statistical distribution features; malicious URLs generally have more random characters and therefore higher information entropy.
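For reference, the standard Shannon definition can be written as below, where p_i is the relative frequency of the i-th distinct symbol (for a URL, the i-th distinct character). The base-2 logarithm is an assumption; the text does not state the base.

```latex
H(X) = -\sum_{i} p_i \log_2 p_i
% Worked example: the string "aabb" has p_a = p_b = 1/2, so
% H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit}.
```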

KL divergence (Kullback-Leibler divergence, KLD): also called relative entropy in probability theory and information theory, it is an asymmetric measure of the difference between two probability distributions P and Q. The KL divergence measures the average number of additional bits required to encode samples drawn from P using a code based on Q. Typically, P represents the true distribution of the data, and Q represents a theoretical distribution, an estimated model distribution, or an approximation of P. In the present application, the KL divergence is also one of the URL statistical distribution features.
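For discrete distributions, the standard definition is given below; the base-2 logarithm (which gives a result in bits) is an assumption, since the text does not state the base.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)}
% D_KL(P || Q) >= 0, with equality only when P = Q.
% In general D_KL(P || Q) \neq D_KL(Q || P), which is the asymmetry noted above.
```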

Document Object Model (DOM): a standard programming interface for processing extensible markup languages. On a web page, the objects that make up the page (or document) are organized in a tree structure, and the standard model used to represent the objects in the document is called the DOM. The DOM represents an HTML document (also called an HTML file) as a tree structure.

The embodiments of the present application relate to Artificial Intelligence (AI) and Machine Learning technologies, and are designed based on a computer vision technology and Machine Learning (ML) in the AI.

Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence.

Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology mainly comprises a computer vision technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and the like. With the research and progress of artificial intelligence technology, artificial intelligence is researched and applied in a plurality of fields, such as common smart homes, smart customer service, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, robots, smart medical treatment and the like.

Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like.

Machine learning is an interdisciplinary subject that involves probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Whereas data mining looks for relationships within large volumes of data, machine learning focuses on algorithm design, so that a computer can automatically learn rules from data and use them to predict unknown data.

Machine learning is the core of artificial intelligence, is the fundamental approach to make computers have intelligence, and is applied in various fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like. The webpage recognition model in the embodiment of the application is obtained by training through a machine learning or deep learning technology. According to the training method of the webpage recognition model in the embodiment of the application, the malicious webpage can be recognized.

The method for training the webpage recognition model provided by the embodiment of the application can be divided into two parts, including a training part and an application part; the training part relates to the technical field of machine learning, and in the training part, the webpage recognition model is trained through the technology of machine learning. Specifically, the webpage recognition model is trained by using training samples in a training sample data set given in the embodiment of the application, after the training samples pass through the webpage recognition model, the output result of the webpage recognition model is obtained, the model parameters are continuously adjusted by combining the output result, and the trained webpage recognition model is output; the application part is used for identifying the malicious webpage by using the webpage identification model trained in the training part.

The following briefly introduces the design concept of the embodiments of the present application:

with the continuous development of the internet industry, networks have become an indispensable part of people's lives. But at the same time, the number of malicious web pages is rapidly increasing year by year. How to quickly and effectively identify malicious web pages has become one of the network space security problems to be solved.

In the related art, malicious webpage identification schemes are mainly based on URL features, static webpage features, and dynamic webpage features. A scheme that uses only URL features has low complexity and raises no user privacy concerns, but the amount of information in URL features is limited, and such a scheme relies on the statistical difference in character distribution between malicious URLs and normal URLs. As the adversarial tactics of the underground (black-market) industry escalate, malicious URLs increasingly resemble normal URLs and are hard to distinguish even by eye. Schemes that use static webpage features mostly rely on commonly used keyword statistics; they do not effectively mine the other information in the static features and easily cause misjudgment. Schemes that use dynamic webpage features have the richest information, but they are costly, need to simulate the rendering behavior of a browser, and are not suitable for large-scale background application.

In summary, the solutions in the related art have the problems of insufficient amount of feature information or high complexity.

In view of this, embodiments of the present application provide a method and an apparatus for identifying a web page, an electronic device, and a storage medium. The webpage identification method combines URL (uniform resource locator) features and HTML (hypertext markup language) features. First, feature matching is performed between the webpage to be detected and a specified category sample library based on the URL and the HTML of the webpage, to determine whether a URL or HTML file of the specified category that matches the webpage exists. If no match is found, the extracted URL features and HTML features are fused, classification is performed on the resulting webpage fusion features, and the final classification recognition result is determined. The present application thus balances a sufficiently rich amount of information against low complexity, and effectively improves the accuracy of webpage identification.

The preferred embodiments of the present application will be described in conjunction with the drawings of the specification, it should be understood that the preferred embodiments described herein are only for illustrating and explaining the present application, and are not intended to limit the present application, and the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

Fig. 1 is a schematic view of an application scenario in the embodiment of the present application. The application scenario diagram includes two terminal devices 110 and a server 120. Each terminal device 110 in the embodiment of the present application may have a client related to web page identification installed thereon, and the client may be used to identify a malicious web page. The client related to the web page identification in the embodiment of the present application may be a software client, or a client such as a web page or an applet, and the server is an application server corresponding to the software client, or the web page or the applet, and the specific type of the client is not limited.

It should be noted that, the web page identification method in the embodiment of the present application may be executed by the server or the terminal device alone, or may be executed by both the server and the terminal device. For example, the terminal device acquires the URL of the webpage to be detected and a corresponding target HTML file; performing feature matching on the webpage to be detected and a specified category sample library based on the target URL and the target HTML file; if the matching fails, informing the server to extract the URL feature of the target URL and the HTML feature of the target HTML file, and performing feature fusion on the URL feature and the HTML feature to obtain a webpage fusion feature corresponding to the webpage to be detected; the server carries out classification prediction on the web pages to be detected based on the web page fusion characteristics, obtains the classification recognition results of the web pages to be detected, and informs the terminal equipment of the final classification recognition results.

It should be noted that the above listed interaction manners of the terminal device and the server are only examples, and actually, there are many interaction manners executed by the server and the terminal device together, and the interaction manners are not limited in detail herein.

For example, the webpage identification method based on the URL and the HTML provided in the embodiment of the present application may be applied to an enterprise mailbox product to identify malicious webpages. Specifically, by identifying malicious URLs in the mail a user receives, the capability of the enterprise mailbox to intercept junk mail is improved, and the risk that the user clicks a malicious link is reduced.

Fig. 2 is an application diagram of the enterprise mailbox product side in an embodiment of the present application: the URL in a newly received email is extracted, a URL check service performs malicious identification on it, and the email is marked according to the identification result. In the related art, most mailbox anti-spam products cannot identify and classify the URLs in mails, and a large number of malicious mails mount attacks by spreading URLs with malicious characteristics.

In an alternative embodiment, terminal device 110 and server 120 may communicate via a communication network.

In an alternative embodiment, the communication network is a wired network or a wireless network.

In this embodiment, the terminal device 110 is a computer device used by a user. It may be any computer device with a certain computing capability that runs instant messaging software and websites or social software and websites, such as a personal computer, a mobile phone, a tablet computer, a notebook, an e-book reader, or a vehicle-mounted terminal. Each terminal device 110 is connected to the server 120 through a wireless network, and the server 120 is a single server, a server cluster or cloud computing center formed by a plurality of servers, or a virtualization platform.

It should be noted that fig. 1 is only an example, and the number of the terminal devices and the servers is not limited in practice, and is not specifically limited in the embodiment of the present application.

The web page identification method provided by the exemplary embodiment of the present application is described below with reference to the accompanying drawings in conjunction with the application scenarios described above. It should be noted that the application scenarios described above are shown only for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect.

Referring to fig. 3, an implementation flowchart of a web page identification method provided in an embodiment of the present application is illustrated here by taking a server as an execution subject, and a specific implementation flow of the method is as follows:

S31: The server acquires a target URL of a webpage to be detected and a corresponding target HTML file;

S32: The server performs feature matching on the webpage to be detected and the specified category sample library based on the target URL and the target HTML file;

The specified category mainly refers to a specified webpage category, such as malicious webpage or normal webpage. Malicious webpages can be further divided into phishing attack webpages, spam advertisement promotion webpages, webpages that guide the download of malicious software, and the like. The specified category may be any one or more of these, and is not specifically limited herein.

In the embodiment of the present application, the specified category sample library refers to a sample library containing information related to webpages of a specified category, and is mainly divided into a URL sample library and an HTML sample library. The URL sample library includes URL information related to webpages of a specified category. The URLs in the sample library may be referred to as candidate URLs, and each candidate URL carries a corresponding category tag, for example: URL1 (http://qq.com): phishing attack malicious webpage; URL2 (http://qxq.com): phishing attack malicious webpage; URL3 (http://qxxq.com): malicious webpage guiding the download of malicious software; URL4 (http://xq.com): spam advertisement promotion malicious webpage; URL5 (http://qx.com): malicious webpage guiding the download of malicious software; and the like.

Similarly, the HTML sample library includes HTML files related to webpages of a specified category. The HTML files in the sample library may be referred to as candidate HTML files, and each candidate HTML file carries a corresponding category tag, for example: HTML file 1: malicious webpage guiding the download of malicious software; HTML file 2: phishing attack malicious webpage; HTML file 3: spam advertisement promotion malicious webpage; HTML file 4: malicious webpage guiding the download of malicious software; HTML file 5: phishing attack malicious webpage; and the like.

In the embodiment of the present application, step S32 specifically refers to: the target URL is matched with the URL sample library, and the target HTML file is further matched with the HTML sample library under certain conditions, which will be described in detail below.

Optionally, if the matching is successful, the server uses the category tag of the matched specified category sample as the classification recognition result of the webpage to be detected. That is, the category tag of the matched candidate URL or candidate HTML file is used as the classification recognition result. For example, if the category tag of the candidate URL3 that matches the target URL is "malicious webpage guiding the download of malicious software", the classification recognition result of the webpage to be detected is determined to be: malicious webpage guiding the download of malicious software.

If the matching fails, step S33 and step S34 are executed:

S33: If the matching fails, the server extracts the URL feature of the target URL and the HTML feature of the target HTML file, and performs feature fusion on the URL feature and the HTML feature to obtain a webpage fusion feature corresponding to the webpage to be detected;

S34: The server performs classification prediction on the webpage to be detected based on the webpage fusion feature to obtain a classification recognition result of the webpage to be detected.

In this method, the URL and the HTML of the webpage to be detected are combined. Feature matching is first performed between the URL and HTML of the webpage to be detected and the specified category sample library, to judge whether a URL or HTML file of a specified category matching the webpage to be detected exists. If no such URL or HTML file exists, feature fusion is performed on the URL feature and the HTML feature of the webpage to be detected, classification recognition is performed based on the resulting webpage fusion feature, and the final classification recognition result is determined. The method takes both a sufficiently rich amount of information and low complexity into account, and effectively improves the accuracy of webpage identification.

An optional implementation manner is that S32 may be implemented according to a flowchart as shown in fig. 4, which is a flowchart of a feature matching method in the embodiment of the present application, and specifically includes the following steps:

S401: The server compares the prefix information of the target URL with the prefix information of each candidate URL in the URL sample library;

In the related art, URL matching means exact matching, i.e. the target URL is matched against the entire content of the candidate URL. The embodiment of the present application takes into account that a large number of URLs share the same prefix while the suffix parameters received by each user differ, such as http:// qq. xx1h and http:// qq.com/? xx2. A partial matching mode is therefore provided, in which the target URL of the webpage to be detected is matched against the prefix information of the candidate URLs in the URL sample library.

Based on the above embodiment, only the prefix information of each candidate URL may be saved in the URL sample library, for example, the prefix information of the candidate URL1 is http:// qq. In this way, such URLs can be intercepted by partial matching. When the target URL of the webpage to be detected is matched against the URL sample library, only the prefix information is compared. If a candidate URL with the same prefix information as the target URL exists in the URL sample library, the match is a hit; otherwise, it is a miss.

In case of a hit, the matching is determined to be successful; in case of a miss, HTML matching is further performed, as described in step S402.
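The prefix matching of step S401 can be sketched as follows. This is a minimal illustration, not the implementation of the application: the text does not define exactly what the "prefix information" contains, so the sketch assumes it is the scheme, host, and path with the query string and fragment removed. The function names, the sample library contents, and the example domains are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical URL sample library: stored prefix -> category tag.
URL_SAMPLE_LIBRARY = {
    "http://phish.example": "phishing attack",
    "http://ads.example/promo": "spam advertisement",
}


def url_prefix(url):
    """Assumed prefix definition: scheme + host + path, without query or fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}{parts.path.rstrip('/')}"


def match_url(target_url, library=URL_SAMPLE_LIBRARY):
    """Return the category tag on a hit, or None on a miss."""
    return library.get(url_prefix(target_url))


# Two URLs that differ only in their per-user suffix hit the same stored prefix.
print(match_url("http://phish.example/?xx1"))
print(match_url("http://phish.example/?xx2"))
print(match_url("http://benign.example/"))
```

Storing only prefixes makes the lookup a dictionary access, so a miss is cheap and the more expensive HTML comparison of step S402 runs only when needed.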

S402: If no candidate URL with the same prefix information as the target URL exists in the URL sample library, the server compares the target HTML file of the webpage to be detected with each candidate HTML file in the HTML sample library for similarity, to obtain the file similarity corresponding to each candidate HTML file.

Specifically, by comparing the file similarity corresponding to each candidate HTML file with a preset similarity threshold, the HTML comparison result can be further determined:

if the candidate HTML file with the file similarity higher than the similarity threshold value exists in the HTML sample library, the matching is determined to be successful, namely, the matching is hit; and if the candidate HTML file with the file similarity higher than the similarity threshold value with the target HTML file does not exist in the HTML sample library, determining that the matching is failed, namely, the matching is not hit.

In the embodiment of the application, if the matching is successful, the category tag of the matched specified category sample is used as the classification recognition result of the webpage to be detected. When the HTML sample library contains multiple candidate HTML files whose file similarity to the target HTML file is higher than the similarity threshold, the category tag of the candidate HTML file with the highest file similarity is used as the classification recognition result of the webpage to be detected.

For example, the file similarities of the candidate HTML file 3 and the candidate HTML file 4 to the target HTML file are both higher than the similarity threshold, but the file similarity corresponding to the candidate HTML file 3 is the higher of the two. The category tag of the candidate HTML file 3, namely spam advertisement promotion malicious webpage, is then used as the classification recognition result of the webpage to be detected.
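The selection rule above (keep the candidates above the threshold, then take the tag of the most similar one) can be sketched as follows. The threshold value 0.8, the similarity scores, and the function name are hypothetical; the application only states that a preset similarity threshold exists.

```python
def best_html_match(similarities, threshold=0.8):
    """Pick the category tag of the most similar candidate above the threshold.

    `similarities` maps a candidate name to (file_similarity, category_tag).
    Returns None when no candidate exceeds the threshold (a miss).
    """
    hits = [(sim, tag) for sim, tag in similarities.values() if sim > threshold]
    if not hits:
        return None
    # max() on the similarity value selects the closest candidate.
    return max(hits, key=lambda pair: pair[0])[1]


# Hypothetical scores mirroring the example: files 3 and 4 both exceed the threshold.
scores = {
    "HTML file 1": (0.40, "malware download"),
    "HTML file 3": (0.92, "spam advertisement"),
    "HTML file 4": (0.85, "malware download"),
}
print(best_html_match(scores))
```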

In the embodiment of the application, when calculating the file similarity between HTML files, the file similarity mainly comprises (1) tag similarity, (2) text similarity, and (3) style similarity, and the specific process is as follows:

First, the tag similarity between the tags of the target HTML file and the tags of each candidate HTML file, the text similarity between the text of the target HTML file and the text of each candidate HTML file, and the style similarity between the cascading style (CSS) attributes of the target HTML file and those of each candidate HTML file are acquired. Once these three parts are determined, each type of similarity is multiplied by its preset similarity weight and the results are summed to obtain the final file similarity corresponding to each candidate HTML file. For example, the similarity weights for the tag similarity, the text similarity, and the style similarity may be 0.3, 0.4, and 0.3, respectively.

The tag similarity specifically refers to the similarity between the tags extracted by a depth-first traversal of the HTML; the text similarity specifically refers to the similarity between the texts extracted by a depth-first traversal of the HTML; the style similarity specifically refers to the similarity between the Cascading Style Sheets (CSS) parameters extracted by a depth-first traversal of the HTML.

Optionally, the three similarity calculations may be implemented with the SequenceMatcher class provided by difflib. SequenceMatcher is a class that can compare sequences of any type as long as the sequence elements are hashable, so it is flexible to use and can be used to compare the distance between texts. Of course, other ways of calculating the text distance are also applicable to the embodiment of the present application, and are not limited in detail herein.

For example, to calculate the file similarity between the target HTML file and the candidate HTML file 1, the tag similarity, the text similarity, and the style similarity between the two files are first calculated and denoted s1, s2, and s3, respectively. The file similarity between the target HTML file and the candidate HTML file 1 is then s = 0.3 × s1 + 0.4 × s2 + 0.3 × s3.
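The weighted file similarity can be sketched with Python's standard `difflib.SequenceMatcher` and `html.parser.HTMLParser`. This is an illustrative sketch under assumptions: the parser visits elements in document order, which corresponds to a depth-first traversal; "style" is taken to mean inline `style` attributes plus the content of `style` elements; and the helper names are hypothetical. Only the weights 0.3, 0.4, 0.3 come from the text.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _Extractor(HTMLParser):
    """Collects tags, visible text and CSS in document (depth-first) order."""

    def __init__(self):
        super().__init__()
        self.tags, self.texts, self.styles = [], [], []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "style":
            self._in_style = True
        for name, value in attrs:
            if name == "style" and value:
                self.styles.append(value.strip())

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        data = data.strip()
        if data:
            (self.styles if self._in_style else self.texts).append(data)


def extract(html):
    parser = _Extractor()
    parser.feed(html)
    parser.close()
    # Tags stay a list (compared element-wise); text and style become strings.
    return parser.tags, " ".join(parser.texts), " ".join(parser.styles)


def file_similarity(html_a, html_b, weights=(0.3, 0.4, 0.3)):
    """s = 0.3 * s1 (tags) + 0.4 * s2 (text) + 0.3 * s3 (style)."""
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in zip(extract(html_a), extract(html_b))]
    return sum(w * s for w, s in zip(weights, sims))


page = '<div style="color:red"><p>Sign in to your account</p></div>'
variant = '<div style="color:red"><p>Sign in to your mailbox</p><a>help</a></div>'
print(file_similarity(page, variant))
```

Because the score is driven by structure, text, and style together, a page that only changes a few words or adds a link still scores close to the stored sample.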

In the embodiment, the HTML file similarity matching is used, so that frequent adjustment of malicious HTML can be effectively resisted, and the method is more adaptive.

The following description mainly takes phishing attack webpages as the example of malicious webpages. The webpage identification process is mainly used for screening and intercepting phishing mails received by a mailbox in a mailbox anti-spam scenario.

The overall framework of the application is shown in fig. 5 and is divided into two parts: offline model training and online query. The offline model training comprises three parts: sample library establishment, URL and HTML feature extraction, and model training.

The off-line model training process is described in detail below.

1. Establishing the sample library.

Relying on the rich historical data of the enterprise mailbox, malicious URLs and HTML are preliminarily extracted from black samples (also called negative samples), i.e. mails that users have historically reported as phishing mails, and are labeled at an early stage by manual review; normal URLs and HTML are extracted from mails sent and received between users with a friend relationship.

In fig. 5, extracting the URL and the HTML specifically means parsing the URL from the mail and crawling the corresponding HTML file.

Further, positive samples can be constructed from the URLs and HTML extracted from normal mails, negative samples can be constructed from the URLs and HTML extracted from mails that users reported as phishing, and the sample library is constructed from the positive and negative samples, specifically the tagged URL sample library and the tagged HTML sample library described above.

2. Extracting URL and HTML features.

This process extracts the URL features and HTML features of the positive and negative samples. Specifically, URL feature extraction is performed on the candidate URLs in the URL sample library and HTML feature extraction is performed on the candidate HTML files in the HTML sample library, after which offline model training is performed to obtain a trained webpage identification model.

It should be noted that, in the embodiment of the present application, the URL feature at least includes: a first key information feature, a domain name feature, and a first statistical distribution feature; the HTML features include at least: a second key information feature, a character feature, a web page structure feature, and a second statistical distribution feature.

The terms "first" and "second" above distinguish the URL features from the HTML features: the key information feature corresponding to the URL is called the first key information feature, and the key information feature corresponding to the HTML is called the second key information feature. The same applies to the first statistical distribution feature and the second statistical distribution feature.

In the online query process, the feature extraction and classification can be performed on the web pages to be detected based on the trained web page recognition model. In an alternative implementation manner, S33 may be implemented according to a flowchart as shown in fig. 6, which is a flowchart of a feature extraction method in an embodiment of the present application, and includes the following steps:

S601: The server respectively extracts a first key information feature, a domain name feature and a first statistical distribution feature of the target URL;

S602: The server combines the first key information feature, the domain name feature and the first statistical distribution feature as the URL feature;

S603: The server respectively extracts a second key information feature, a character feature, a webpage structure feature and a second statistical distribution feature of the target HTML file;

S604: The server combines the second key information feature, the character feature, the webpage structure feature and the second statistical distribution feature as the HTML feature;

S605: The server performs feature fusion on the URL feature and the HTML feature to obtain a webpage fusion feature corresponding to the webpage to be detected.

The key information in the embodiment of the present application specifically refers to a keyword, and accordingly, the key information feature may also be referred to as a keyword feature.

The URL features and their extraction process are introduced first:

In the URL features, the keyword feature may indicate whether the URL contains keywords such as @, account, login, secure, websrc, -, sign, etc.; the domain name feature may indicate whether the URL uses the Hypertext Transfer Protocol over Secure Socket Layer (https), the domain name length, and the like; the URL statistical distribution features may indicate the special character ratio, the alphanumeric switching frequency, whether the quantity equation is satisfied, etc.

Special characters are characters other than digits, English letters, specific (reserved) characters, and unsafe characters; the special character ratio is the ratio of the number of special characters in the URL to the total number of characters in the URL. Specific characters include, but are not limited to, ';', '/', '?', '@', '=', '&', etc.; the specific character ratio is the ratio of the number of specific characters in the URL to the total number of characters in the URL. The alphanumeric switching frequency is the number of times, reading the URL from left to right, that the character following a digit is a letter; for example, the alphanumeric switching frequency of the string 'a1b23c' is 2. Malicious URLs are typically generated from random strings, so their alphanumeric switching frequency tends to be relatively high.
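The alphanumeric switching frequency can be sketched as a scan over adjacent character pairs. The sketch below assumes that only ASCII digits and letters are counted and reports both directions (digit to letter and letter to digit), since both appear later as separate feature dimensions; the function names are hypothetical.

```python
import string

DIGITS = set(string.digits)
LETTERS = set(string.ascii_letters)


def digit_to_letter_switches(url):
    """Count positions where a digit is immediately followed by a letter."""
    return sum(1 for a, b in zip(url, url[1:]) if a in DIGITS and b in LETTERS)


def letter_to_digit_switches(url):
    """Count positions where a letter is immediately followed by a digit."""
    return sum(1 for a, b in zip(url, url[1:]) if a in LETTERS and b in DIGITS)


# 'a1b23c': the digit-to-letter switches are 1->b and 3->c.
print(digit_to_letter_switches("a1b23c"))  # 2
print(letter_to_digit_switches("a1b23c"))  # 2
```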

To determine whether the URL satisfies the quantity equation, assume that the numbers of '?', '=', and '&' in the URL string are x, y, and z, respectively. There are three conditions: (1) if x == 0, then y == 0 && z == 0; (2) if x > 0, then 0 <= z <= y - 1; (3) if x > 0, then the URL has parameters. If these conditions are satisfied, the quantity equation is said to be satisfied.
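A minimal sketch of the quantity equation check follows. It assumes that the three counted symbols are '?', '=', and '&', and it treats condition (3) as implied by condition (2), because 0 <= z <= y - 1 already requires at least one '=' (one parameter). The function name is hypothetical.

```python
def satisfies_quantity_equation(url):
    """Check the count relation between '?', '=' and '&' in a URL."""
    x, y, z = url.count("?"), url.count("="), url.count("&")
    if x == 0:
        # Condition (1): no query marker, so no '=' or '&' is expected.
        return y == 0 and z == 0
    # Conditions (2) and (3): with a query, n parameters give n '=' and at most n-1 '&'.
    return 0 <= z <= y - 1


print(satisfies_quantity_equation("http://a.example/path"))         # True
print(satisfies_quantity_equation("http://a.example/?k=v&k2=v2"))   # True
print(satisfies_quantity_equation("http://a.example/?&&"))          # False
```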

In particular, the present application also introduces the following URL statistical distribution features for classification and identification, which specifically include:

(1) the information entropy of the URL text, (2) the KL divergence, (3) the KS-test value, and (4) the deep learning model recognition probability of the URL.

The four URL statistical distribution features listed above are mathematical probability distribution values and supplement the other URL features. URLs mass-produced by the black industry differ from normal URLs in these four features; for example, malicious URLs generally have more random characters and therefore higher entropy.

The four statistical distribution characteristics are described in detail below:

in an alternative embodiment, the first statistical distribution characteristic comprises at least an entropy of information; the information entropy of the target URL may be determined by:

firstly, determining URL character frequency corresponding to a target URL based on the occurrence frequency of each character in the target URL; further, the information entropy of the target URL is determined based on the URL character frequency.

In the embodiment of the application, URLs mass-produced by the black industry generally differ from normal URLs in information entropy; for example, the characters of malicious URLs are generally more random, which gives such URLs a higher information entropy. For this reason, the application determines the information entropy of the target URL based on the URL character frequency.
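The text does not spell out the entropy formula; the sketch below assumes the usual Shannon entropy H = -Σ p(c) · log2 p(c), where p(c) is the frequency of character c in the URL. The function name is hypothetical.

```python
import math
from collections import Counter


def url_entropy(url):
    """Shannon entropy (in bits) of the character frequency distribution of a URL."""
    if not url:
        return 0.0
    total = len(url)
    return -sum((n / total) * math.log2(n / total) for n in Counter(url).values())


# A repetitive string has low entropy, a string of distinct characters has high entropy.
print(url_entropy("aaaa"))
print(url_entropy("abcd"))
```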

Optionally, the first statistical distribution characteristic includes at least a relative entropy (also called KL divergence); determining the relative entropy of the target URL by:

firstly, determining URL character frequency corresponding to a target URL based on the occurrence frequency of each character in the target URL; determining the frequency of standard English characters corresponding to the target URL based on the occurrence frequency of the standard English characters in the target URL; further, the relative entropy of the target URL is determined based on the URL character frequency and the standard English character frequency.

For example, the target URL is https:// qq.com, the occurrences of the letters c, h, m, o, p, q, s, t are 1,1,1,1,1,2,1,2, respectively, and the corresponding character frequencies are 1/10,1/10,1/10,1/10,1/10,2/10,1/10,2/10, respectively.

In the embodiment of the application, the URL character frequency is used for representing the occurrence times of each character in the URL; and the standard English character frequency is used for representing the occurrence probability of the standard English character in the target URL. For KL divergence, firstly calculating URL character frequency, and then calculating relative entropy by combining standard English character frequency to obtain KL divergence.

A standard English character frequency distribution is as follows. The probabilities of occurrence of the letters a-z (upper case converted to lower case) are: 8.167/100, 1.492/100, 2.782/100, 4.253/100, 12.702/100, 2.228/100, 2.015/100, 6.094/100, 6.966/100, 0.153/100, 0.772/100, 4.025/100, 2.406/100, 6.749/100, 7.507/100, 1.929/100, 0.095/100, 5.987/100, 6.327/100, 9.056/100, 2.758/100, 0.978/100, 2.360/100, 0.150/100, 1.974/100, 0.074/100.

In the embodiment of the application, the KL divergence and the information entropy are calculated based on the URL character frequency, and a general malicious URL has more randomness on characters, so that compared with a normal URL, the determined information entropy and the KL divergence are obviously different, and webpage identification based on the characteristics is favorable for improving classification accuracy.
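The relative entropy calculation can be sketched as follows, using the standard English letter frequencies listed above as the reference distribution. Two points are assumptions rather than statements of the application: the divergence direction KL(P || Q) = Σ p · log2(p / q), with P the URL letter frequency and Q the English frequency, and the restriction to the letters a-z after lower-casing. The names are hypothetical.

```python
import math
import string
from collections import Counter

# Standard English letter frequencies for a-z, as listed in the text (percent).
_PERCENT = [8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
            0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
            6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074]
ENGLISH_FREQ = {c: p / 100 for c, p in zip(string.ascii_lowercase, _PERCENT)}


def letter_frequency(url):
    """Frequency of each letter a-z among the letters of the URL."""
    letters = [c for c in url.lower() if c in ENGLISH_FREQ]
    total = len(letters)
    return {c: n / total for c, n in Counter(letters).items()} if total else {}


def kl_divergence(url):
    """KL(P || Q) in bits; letters absent from the URL contribute nothing."""
    return sum(p * math.log2(p / ENGLISH_FREQ[c])
               for c, p in letter_frequency(url).items())


freq = letter_frequency("https://qq.com")
print(freq["q"], freq["t"], freq["h"])  # 0.2 0.2 0.1
print(kl_divergence("https://qq.com"))
```

A string dominated by rare letters such as q, z, x, or j diverges far more from the English reference than ordinary words do, which is the signal this feature is meant to capture.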

Optionally, the first statistical distribution characteristic at least includes a spatial distribution characteristic value, where the spatial distribution of the URLs in the designated category is different from that of the URLs in the non-designated category, and the corresponding spatial distribution characteristic value is also different; specifically, the spatial distribution characteristic value of the target URL is determined by the following method:

firstly, determining the frequency of standard English characters corresponding to a target URL based on the occurrence frequency of the standard English characters in the target URL; and then, based on the standard English character frequency, determining the space distribution characteristic value of the target URL.

In the embodiment of the present application, the spatial distribution characteristic value can be characterized by the KS-test value. The KS test is a test method for comparing a frequency distribution f(x) with a theoretical distribution g(x), or for comparing two observed distributions. In the present application, the calculation can be performed using the standard English character frequency distribution as the reference; the specific calculation program uses the library code R-4.0.5/src/library/stats/src/ks.c, which is not limited herein.

In the above embodiment, by assuming that the malicious URLs and the normal URLs have different distributions in space, the KS-test value of the malicious URL is different from that of the normal URL, and therefore, the classification accuracy can be effectively improved by performing the web page identification based on the characteristics.

Optionally, the first statistical distribution characteristic includes at least an identification probability; determining the recognition probability of the target URL by:

The recognition probability is obtained from a trained deep learning model that performs URL recognition. Before the recognition probability is obtained, the URL needs to be preprocessed, i.e. the object processed by the deep learning model is the preprocessed URL text. Preprocessing consists of basic word segmentation and wordpiece word segmentation: the former initially segments the URL at its special symbols; the latter further cuts the segmented URL using a vocabulary-based sliding window.

The specific process is as follows: in the basic word segmentation process, the target URL is initially segmented at the designated symbols to obtain at least two target URL texts; in the wordpiece word segmentation process, the at least two target URL texts are segmented a second time based on a sample vocabulary to obtain the preprocessed URL text, where the sample vocabulary includes a plurality of specified words and each specified word includes at least one character. After the URL is preprocessed, classification prediction is performed on the preprocessed URL text based on the trained deep learning model to obtain the recognition probability of the target URL.

The URL preprocessing process is described in detail below with the target URL "https://www.qq.com/" as an example:

Step 1: Segment the URL at substrings (i.e. designated symbols) such as '/' or '://'.

For example, https://www.qq.com/ is segmented into four target URL texts: https, www, qq, com.

Step 2: Further segment the texts using the wordpiece algorithm, whose rules are as follows:

(1) Firstly, inquiring in a sample word list by using the longest character string;

(2) If the query is successful, returning a word segmentation result, otherwise, continuing;

(3) Backtracking a character from the tail of the character string, and inquiring whether the character string exists in the sample word list;

(4) If the query is successful, segmenting the character string, taking the next character of the segmented character string as the head of a new character string, and returning to the step (1); otherwise, return to (3).
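Step 1 (basic word segmentation) can be sketched with a regular expression split. The exact set of designated symbols is not fully recoverable from the text, so this sketch assumes that every run of non-alphanumeric characters acts as a separator; the function name is hypothetical.

```python
import re


def basic_tokenize(url):
    """Split a URL at runs of non-alphanumeric characters and drop empty pieces."""
    return [piece for piece in re.split(r"[^0-9A-Za-z]+", url) if piece]


print(basic_tokenize("https://www.qq.com/"))  # ['https', 'www', 'qq', 'com']
```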

In an optional implementation manner, performing secondary segmentation on at least two target URL texts based on a sample vocabulary, and when obtaining a preprocessed URL text, for each target URL text, performing the following operations, specifically referring to fig. 7, which is a flowchart of a secondary segmentation method in an embodiment of the present application, including the following steps:

step S701: the server takes the target URL text as a text to be segmented;

step S702: the server determines the longest appointed word which takes the character at the appointed position as the head in the text to be segmented;

The designated position may be the first position counted from left to right in the text to be segmented, or may be another position, which is not specifically limited herein.

Step S703: the server divides the text to be divided into the following texts based on the longest specified word: a first sub-text including the longest specified word, a second sub-text including remaining characters except for the longest specified word;

step S704: and the server takes the second sub-text as a text to be segmented, and returns to the step of determining the longest specified word beginning with the character at the specified position in the text to be segmented until the target URL text is completely segmented, so as to obtain the preprocessed URL text corresponding to the target URL text.

Assuming that the sample vocabulary includes 7 specified words, the sample vocabulary is: [ aa, aaa, bb, cc,1,2,3], the target URL text to be segmented is aaa1cc2bb, and the process of word segmentation is:

starting with the first character 'a', the longest specified word found in the sample vocabulary is 'aaa', and the target URL text "aaa1cc2bb" is tokenized as: a first sub-text 'aaa' and a second sub-text '1cc2bb'; then, the word segmentation is carried out on the '1cc2bb', and similarly, the word segmentation can be divided into '1' and 'cc2bb'; and so on until the text is completely split, and 'aaa' + '1' + 'cc' + '2' + 'bb' is obtained, i.e. the preprocessed URL text.
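The secondary segmentation of steps S701 to S704 amounts to a greedy longest-match-first split, which can be sketched as follows. The handling of a position where no specified word matches (emit the single character and move on) is an assumption, since the text does not cover that case; the function name is hypothetical.

```python
def wordpiece_split(text, vocab):
    """Greedy longest-match-first segmentation of `text` against `vocab`."""
    pieces, start = [], 0
    while start < len(text):
        end = len(text)
        # Try the longest remaining string first, then drop one character
        # from the tail until the candidate is found in the vocabulary.
        while end > start and text[start:end] not in vocab:
            end -= 1
        if end == start:
            # Assumption: no specified word starts here, emit one character.
            end = start + 1
        pieces.append(text[start:end])
        start = end
    return pieces


sample_vocab = {"aa", "aaa", "bb", "cc", "1", "2", "3"}
print(wordpiece_split("aaa1cc2bb", sample_vocab))  # ['aaa', '1', 'cc', '2', 'bb']
```

Preferring 'aaa' over 'aa' at the first position is what "longest specified word" means in step S702; a shortest-match rule would leave a stray 'a' behind.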

Unlike word segmentation performed directly at the character level, the preprocessing used in the present application yields tokens that carry semantic information rather than individual characters, so that more information can be extracted.

Furthermore, a deep learning model for identifying malicious URLs can be trained with fastText. Feature extraction is performed on the preprocessed URL text based on this model, and the probability that the target URL is a malicious URL is calculated. This is the recognition probability referred to herein, which represents the probability that the target URL corresponds to a malicious webpage.

In the embodiment of the application, the deep learning model recognition probability of the URL is used as a strong feature to be fused, and the method can be used for improving the overall effect of webpage recognition.

Based on the foregoing implementation, the URL features in the embodiment of the present application may be specifically subdivided into the following 27 URL feature dimensions:

whether the domain name is an IP address; whether or not '@'; min (URL length/23,1); whether the length of the URL domain name does not exceed 7 characters; whether 'account' is included; whether or not 'login' is included; whether or not 'secure' is included; whether or not it contains 'websrc'; whether or not 'ebayisiapi' is contained; whether or not 'sign' is included; whether or not 'banking' is included; whether or not 'confirm' is included; whether or not '-'; whether it is not https; the number of sensitive words; a numerical ratio; specific character (/: ratio); number/ratio of special characters (except numeric english specific characters); the number of reserved characters (same as the special characters); the number of unsafe characters; the number of. Exe and. Php; the number of times of converting numbers into letters; the number of times the alphabet turns to a number; fastext recognition probability; information entropy; KL divergence; KS-test.

Among these, unsafe characters include, but are not limited to: '<', '>', '"', '#', '%', '{', '}', '|', '\', '^', '~', '[', ']', '`'.

For example, the length of the character string 'abc123@ = < > $' is 12, and the numbers of letters, digits, special characters, unsafe characters and special characters it contains are 3, 3, 2, 2 and 2, respectively. The corresponding ratios are 3/12, 3/12, 2/12, 2/12 and 2/12, respectively.

The features of the 27 URL feature dimensions are combined (may be combined in ways of splicing, weighted summation, and the like), so as to obtain a final URL feature.
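A few of the lexical dimensions listed above can be sketched as follows. This is only an illustrative sketch, not the implementation of the application: the function name `url_lexical_features`, the feature keys, and the use of the listed keywords as the sensitive-word list are assumptions made for the example.

```python
import re
from typing import Dict
from urllib.parse import urlparse

# Keywords taken from the feature list; treating them as the
# "sensitive words" is an assumption of this sketch.
KEYWORDS = ("account", "login", "secure", "websrc", "sign", "banking", "confirm")
IPV4_PATTERN = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")


def url_lexical_features(url: str) -> Dict[str, float]:
    """Compute a subset of the URL feature dimensions (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    lowered = url.lower()
    pairs = list(zip(url, url[1:]))  # adjacent character pairs
    features = {
        "domain_is_ip": float(bool(IPV4_PATTERN.match(host))),
        "has_at": float("@" in url),
        "length_score": min(len(url) / 23, 1.0),
        "short_domain": float(len(host) <= 7),
        "has_hyphen": float("-" in url),
        "not_https": float(parsed.scheme != "https"),
        "digit_ratio": sum(ch.isdigit() for ch in url) / len(url) if url else 0.0,
        # Number of positions where a digit is followed by a letter, and vice versa.
        "digit_to_letter": float(sum(a.isdigit() and b.isalpha() for a, b in pairs)),
        "letter_to_digit": float(sum(a.isalpha() and b.isdigit() for a, b in pairs)),
    }
    for word in KEYWORDS:
        features["has_" + word] = float(word in lowered)
    features["sensitive_word_count"] = float(sum(word in lowered for word in KEYWORDS))
    return features
```

The resulting values can then be spliced with the remaining dimensions (recognition probability, information entropy, KL divergence, KS-test) to form the final URL feature.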

The following describes details of HTML features and the process of feature extraction:

In the HTML features, the keyword features may represent the number of occurrences of type=password, the hidden attribute, and the like in iframe tags, div tags and form tags in the HTML structure, the number of occurrences of document.write(), and the like; the character features include, for example, the ratio of whitespace characters in the JS code and the ratio of the JS code in the HTML; and the web page structure features include, for example, the number of occurrences of width=0 or height=0, the total number of tags, and the like.

Wherein the height and width attributes set the size of the image. If these properties are set, space can be reserved for the image at page load. Without these attributes, the browser cannot know the size of the image, and cannot reserve the appropriate space for the image, so when the image is loaded, the layout of the page changes.

Particularly, the present application introduces the following HTML statistical distribution features for improving the classification accuracy, which specifically include:

(1) the entropy of the JS code (also called JS script entropy), (2) HTML tag features, and (3) web page text features.

The three statistical distribution characteristics are described in detail below:

In an alternative embodiment, the second statistical distribution feature includes at least the JS script entropy. The JS script of an HTML file of the specified category contains a target action, and the JS script entropy of an HTML file of the specified category differs from that of an HTML file of a non-specified category.

In the embodiment of the application, it is considered that malicious web pages often hide malicious actions (such as malicious jumps and the like) in the JS scripts, and therefore the malicious JS scripts are assumed to have different entropy characteristics.
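As a minimal sketch of how such an entropy value might be computed, the following treats the JS script as a sequence of characters and computes the Shannon entropy of the character frequencies. Computing the entropy at character level is an assumption of this example, and the name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy (in bits per character) of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Under this reading, a script dominated by a few repeated characters yields a low value, while an obfuscated or encoded script whose characters are spread more evenly yields a higher one.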

Optionally, the second statistically distributed feature at least includes an HTML tag feature; determining HTML tag characteristics of a target HTML file by:

First, a depth-first traversal is performed on the DOM tree of the target HTML file to extract a tag vector; then tag classification prediction is performed on the tag vector based on a trained decision tree model, and the prediction probability output by the model for the tag vector is taken as the HTML tag feature of the target HTML file.

HTML is mainly expressed through HTML tags. The terms HTML element and HTML tag are generally used interchangeably, but strictly speaking an HTML element comprises a start tag and an end tag. An HTML file is a tree structure composed of HTML elements, and each HTML element is an HTML node. HTML elements and DOM nodes are in a one-to-one relationship. The DOM tree of the target HTML file is traversed depth-first to extract the tag vector, and prediction is then performed based on the decision tree model to obtain the HTML tag feature. In the present application, the accuracy of classification using HTML tag vectors alone is around 85% on the training set.
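The tag extraction step can be illustrated with the following sketch. It relies on the fact that start tags appear in an HTML document in the same order as a pre-order depth-first traversal of the DOM tree, so a streaming parser visits them in that order. The encoding of the tag vector is not specified here; representing it as tag counts over a fixed vocabulary is an assumption of the example, as are the names `TagCollector`, `tag_sequence` and `tag_vector`.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import List, Sequence


class TagCollector(HTMLParser):
    """Records start tags in document order (pre-order depth-first order)."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> List[str]:
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def tag_vector(html: str, vocabulary: Sequence[str]) -> List[int]:
    """Count how often each vocabulary tag occurs (assumed encoding)."""
    counts = Counter(tag_sequence(html))
    return [counts[tag] for tag in vocabulary]
```

A vector produced this way could be passed to a trained decision tree model, whose predicted probability would then serve as the HTML tag feature.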

Optionally, the second statistical distribution characteristic at least includes a web page text characteristic; determining the webpage text characteristics of the target HTML file by:

performing depth-first traversal on a DOM tree of a target HTML file, and taking a text in the target HTML file inquired through traversal as a target object; and performing text classification on the target object based on the trained text classification model, and taking the obtained classification probability as the text feature of the webpage.

In the above embodiment, the text features of the HTML are obtained by extracting and scoring the HTML text, which takes into account the most direct text features in a malicious web page. Specifically, the DOM tree of the HTML is traversed by depth-first search, which matches the way people read a web page, and the text in the web page is collected as the target object (also called the scoring object). The text classification model for the HTML text is trained using a lightweight large-scale text classification model (TextCNN) and a fast text classification model (fastText).

The input of the text classification model is the text extracted by traversing the HTML (in practice, the text is first preprocessed with standard steps such as word segmentation and stop-word removal); the output of the text classification model is the probability that the text is judged to be malicious text, that is, the classification probability obtained by the text classification model. In the present application, the accuracy of classification using HTML text alone is about 97%.
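The text collection step that precedes classification might look like the following sketch. Text nodes are gathered in document order, which corresponds to a depth-first traversal of the DOM tree. Skipping the contents of script and style elements is an assumption of this example, and the names `TextCollector` and `extract_text` are hypothetical; the classifier itself is not shown.

```python
from html.parser import HTMLParser
from typing import List


class TextCollector(HTMLParser):
    """Collects visible text nodes in document (depth-first) order."""

    SKIPPED = {"script", "style"}  # assumption: not treated as page text

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)
```

The returned string would then go through word segmentation and stop-word removal before being scored by the trained text classification model.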

Similarly, in the embodiment of the present application, the following 20 HTML feature dimensions can be specifically subdivided:

the number of occurrences of the iframe tag; the number of occurrences of the script tag (i.e., the number of JS scripts); the number of div tags; the number of embed tags; the number of occurrences of the href attribute in link tags; the number of occurrences of the download attribute in a tags; the number of occurrences of type="password" in form tags; the number of occurrences of the hidden attribute; the number of matches of window.location / window.open in the full text (JS); the number of occurrences of width=0 and height=0 in the full text; the HTML length; the number of tags; whether a meta tag is present (the number of meta tags); the number of occurrences of login and register; the probability, predicted using TextCNN, that the text in the web page is malicious; the number of occurrences of the document.write() function in JS; whether the JS code contains an iframe (essentially, whether the HTML code generated by JS has an iframe tag); the proportion of whitespace characters in the JS script; the proportion of the HTML occupied by the JS script; the entropy of the JS script.

The final HTML feature can be obtained by combining the features of the 20 HTML feature dimensions.
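Several of the counting dimensions above can be sketched as follows. This is an illustrative sketch covering only a subset of the 20 dimensions; the names `HtmlStats` and `html_count_features`, the feature keys, the regular expressions, and the choice to count both a bare `hidden` attribute and `type="hidden"` as hidden attributes are assumptions of the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from typing import Dict


class HtmlStats(HTMLParser):
    """Accumulates tag and attribute counts while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.tag_counts: Counter = Counter()
        self.password_inputs = 0
        self.hidden_attrs = 0
        self.script_chars = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        attr_map = dict(attrs)
        input_type = (attr_map.get("type") or "").lower()
        if input_type == "password":
            self.password_inputs += 1
        if "hidden" in attr_map or input_type == "hidden":
            self.hidden_attrs += 1
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_chars += len(data)


def html_count_features(html: str) -> Dict[str, float]:
    stats = HtmlStats()
    stats.feed(html)
    stats.close()
    return {
        "iframe_count": float(stats.tag_counts["iframe"]),
        "script_count": float(stats.tag_counts["script"]),
        "div_count": float(stats.tag_counts["div"]),
        "password_count": float(stats.password_inputs),
        "hidden_count": float(stats.hidden_attrs),
        "redirect_count": float(len(re.findall(r"window\.(?:location|open)", html))),
        "zero_size_count": float(len(re.findall(r"\b(?:width|height)\s*=\s*[\"']?0\b", html))),
        "document_write_count": float(len(re.findall(r"document\.write\s*\(", html))),
        "html_length": float(len(html)),
        "tag_total": float(sum(stats.tag_counts.values())),
        "js_ratio": stats.script_chars / len(html) if html else 0.0,
    }
```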

It should be noted that, in the embodiment of the present application, different strategies may be used for extracting the HTML tag features and the web page text features. For example, for common web pages, the tags or text can be collected in a depth-first manner, but for web pages such as shopping pages, text collection using a different algorithm strategy adapted to the page layout can be attempted. For instance, a phishing website that imitates a shopping site is highly similar to the genuine site, and the difference may lie in where the user logs in and where payment is involved. Therefore, for a shopping website, extraction can focus on the parts related to user privacy and transactions, and the large amount of commodity information can be ignored, which is not specifically limited herein.

After the features are extracted respectively, the URL features and the HTML features are combined into a feature vector, namely the web page fusion features, and model training and online query can be performed based on the feature vector.

3. Model training.

The preprocessed samples are randomly divided into a training sample data set and a test sample data set at a ratio of 8:2, and the web page identification model is trained using eXtreme Gradient Boosting (XGBoost). In addition, since malicious web page samples make up a low proportion of the training samples, the weight of the malicious web page samples is increased to address the sample imbalance; that is, the weight corresponding to a negative sample is higher than the weight corresponding to a positive sample. In the application, a positive sample of the non-specified category is a normal web page sample, and a negative sample of the specified category is a malicious web page sample.
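The data preparation described above can be sketched as follows, using only the standard library. The 8:2 split follows the text; the specific weighting rule (weighting each malicious sample by the ratio of normal to malicious samples) is an assumption of this example, since the text only requires the malicious weight to be higher. The names `split_samples` and `sample_weights` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_samples(samples: Sequence, train_ratio: float = 0.8,
                  seed: int = 0) -> Tuple[list, list]:
    """Randomly divide samples into training and test sets (8:2 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def sample_weights(labels: Sequence[int], malicious_label: int = 1) -> List[float]:
    """Give malicious samples a higher weight (assumed inverse-frequency rule)."""
    malicious = sum(1 for label in labels if label == malicious_label)
    normal = len(labels) - malicious
    weight = normal / malicious if malicious else 1.0
    return [weight if label == malicious_label else 1.0 for label in labels]
```

Per-sample weights of this kind can be supplied to a gradient boosting trainer together with the web page fusion features; for example, the scikit-learn style XGBoost classifier accepts a `sample_weight` argument when fitting.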

The process of online querying is described in detail below.

The online query is divided into two steps, as shown in fig. 8. First, the URL and HTML to be queried are matched against the sample libraries, and if a result exists, the label matched by the query is returned directly; if no query result exists, classification prediction is performed based on the web page identification model. The query of the labeled URL sample library uses partial matching: if the prefix information of the target URL of the web page to be detected matches the prefix information of a candidate URL in the sample library, the target URL is considered a hit. The labeled HTML sample library uses similarity matching: if a candidate HTML file with sufficiently high similarity exists, it is considered a hit. For specific implementations, reference may be made to the above embodiments, and repeated parts are not described again.

It should be noted that the classification accuracy on the test set exceeds 98.5%. The scheme is deployed in the anti-spam background system of an enterprise mailbox, where hundreds of thousands of malicious URLs are identified in a single day. To balance rich information content against low complexity, the malicious identification method of the application uses the URL and the HTML code of the web page for malicious identification, and introduces a number of new features to improve identification accuracy. The features in the scheme fuse the keywords, empirical formulas and statistical distributions of the URL with the keywords, web page structure features, statistical distributions, text features and the like of the HTML, which effectively improves identification accuracy and the application effect in actual scenarios.

Fig. 9 is a schematic flowchart illustrating a specific flow of a malicious web page identification method according to an embodiment of the present application. The specific implementation flow of the method is as follows:

step S901: the server acquires a target URL text of a webpage to be detected and a corresponding target HTML file;

step S902: the server compares the prefix information of the target URL text with the prefix information of each candidate URL text in the URL sample library;

step S903: the server judges whether a candidate URL text with the same prefix information as the target URL text exists in the URL sample library, if so, the step S909 is executed, otherwise, the step S904 is executed;

step S904: the server compares the similarity of a target HTML file of the webpage to be detected with each candidate HTML file in an HTML sample library to obtain the file similarity corresponding to each candidate HTML file;

step S905: the server judges whether a candidate HTML file with the file similarity higher than a similarity threshold value with the target HTML file exists in the HTML sample library, if so, the step S910 is executed, otherwise, the step S906 is executed;

step S906: the server respectively extracts the URL feature of the target URL text and the HTML feature of the target HTML file;

step S907: the server performs feature fusion on the URL feature and the HTML feature to obtain a webpage fusion feature corresponding to the webpage to be detected;

step S908: the server carries out classification prediction on the web pages to be detected based on the web page fusion characteristics to obtain classification recognition results of the web pages to be detected;

step S909: the server takes the category label of the candidate URL matched with the target URL in the URL sample library as a classification recognition result of the webpage to be detected;

step S910: and the server takes the category label of the candidate HTML file matched with the target HTML file in the HTML sample library as a classification and identification result of the webpage to be detected.
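The control flow of steps S901 to S910 can be summarized by the following sketch. All collaborators are passed in as parameters because their concrete form is described in other embodiments; the function name `identify_page`, the parameter names, and the data layout of the two sample libraries (a dictionary keyed by URL prefix and a list of HTML/label pairs) are assumptions of this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def identify_page(
    url: str,
    html: str,
    url_library: Dict[str, str],             # candidate URL prefix -> category label
    html_library: Sequence[Tuple[str, str]],  # (candidate HTML, category label)
    prefix_of: Callable[[str], str],
    similarity: Callable[[str, str], float],
    extract_features: Callable[[str, str], List[float]],
    classify: Callable[[List[float]], str],
    threshold: float,
) -> str:
    # S902-S903: compare prefix information against the URL sample library.
    label = url_library.get(prefix_of(url))
    if label is not None:
        return label  # S909

    # S904-S905: compare the HTML file against each candidate HTML file.
    best_label, best_score = None, threshold
    for candidate_html, candidate_label in html_library:
        score = similarity(html, candidate_html)
        if score > best_score:
            best_label, best_score = candidate_label, score
    if best_label is not None:
        return best_label  # S910

    # S906-S908: extract and fuse features, then run the classifier.
    return classify(extract_features(url, html))
```

When several candidate HTML files exceed the threshold, this sketch returns the label of the most similar one; that tie-breaking choice is also an assumption.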

Based on the same inventive concept, the embodiment of the application also provides a webpage identification device. As shown in fig. 10, which is a schematic structural diagram of the web page recognition apparatus 1000, the web page recognition apparatus may include:

an obtaining unit 1001, configured to obtain a target URL of a to-be-detected web page and a corresponding target HTML file;

the matching unit 1002 is configured to perform feature matching on the to-be-detected web page and the specified category sample library based on the target URL and the target HTML file;

the feature fusion unit 1003 is configured to, if matching fails, extract a URL feature of the target URL and an HTML feature of the target HTML file, perform feature fusion on the URL feature and the HTML feature, and obtain a web fusion feature corresponding to the web page to be detected;

the identification unit 1004 is configured to perform classification prediction on the web pages to be detected based on the web page fusion features, so as to obtain a classification identification result of the web pages to be detected.

Optionally, the specified category sample library includes a URL sample library and an HTML sample library; the matching unit 1002 is specifically configured to:

Optionally, the matching unit 1002 is further configured to:

if a candidate URL with the same prefix information as the target URL exists in the URL sample library, determining that the matching is successful; or if the HTML sample library has candidate HTML files with the file similarity higher than the similarity threshold value with the target HTML file, determining that the matching is successful;

Optionally, the matching unit 1002 is specifically configured to:

respectively acquiring tag similarity between a tag corresponding to a target HTML file and tags corresponding to candidate HTML files, text similarity between a text corresponding to the target HTML file and a text corresponding to the candidate HTML files, and style similarity between a stacking style attribute corresponding to the target HTML file and the candidate HTML files;

Optionally, the feature fusing unit 1003 is provided for:

and extracting second key information features, character features, webpage structure features and second statistical distribution features of the target HTML file, and combining the second key information features, the character features, the webpage structure features and the second statistical distribution features as HTML features.

Optionally, the first statistical distribution characteristic at least includes information entropy; the feature fusion unit 1003 is specifically configured to determine the information entropy of the target URL by:

the information entropy of the target URL is determined based on the URL character frequency.

Optionally, the first statistical distribution characteristic comprises at least a relative entropy; the feature fusion unit 1003 is specifically configured to determine the relative entropy of the target URL by:

Optionally, the first statistical distribution characteristic at least includes a spatial distribution characteristic value, where the spatial distribution of the URLs in the designated category is different from that of the URLs in the non-designated category, and the corresponding spatial distribution characteristic value is also different;

the feature fusion unit 1003 is specifically configured to determine a spatial distribution feature value of the target URL by:

and determining the spatial distribution characteristic value of the target URL by taking the frequency of the standard English characters as a reference.
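One possible reading of the relative entropy and the spatial distribution feature value is sketched below: the character distribution of the URL is compared with a reference character distribution, using the KL divergence and a Kolmogorov-Smirnov style statistic (the largest gap between the two cumulative distributions). The function names, the smoothing constant `eps`, and the ordering of the alphabet are assumptions of this example, and the reference distribution must be supplied by the caller; no actual English character frequencies are included here.

```python
import math
from collections import Counter
from typing import List, Sequence


def char_distribution(text: str, alphabet: str) -> List[float]:
    """Relative frequency of each alphabet character in the text."""
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values())
    return [counts[ch] / total if total else 0.0 for ch in alphabet]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-9) -> float:
    """KL(p || q) in nats, with a small constant to avoid division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def ks_statistic(p: Sequence[float], q: Sequence[float]) -> float:
    """Largest absolute difference between the two cumulative distributions."""
    cdf_p = cdf_q = 0.0
    largest = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        largest = max(largest, abs(cdf_p - cdf_q))
    return largest
```

In use, `q` would be the frequency table of standard English characters over the same alphabet, and `p` the distribution computed from the target URL.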

Optionally, the first statistical distribution characteristic includes at least an identification probability; the feature fusion unit 1003 is specifically configured to determine the recognition probability of the target URL by:

performing secondary segmentation on at least two target URL texts based on a sample word list to obtain preprocessed URL texts, wherein the sample word list comprises a plurality of designated words, and each designated word comprises at least one character;

and based on the trained deep learning model, carrying out classification prediction on the preprocessed URL text to obtain recognition probability.

Optionally, the feature fusion unit 1003 is specifically configured to: when performing secondary segmentation on at least two target URL texts based on the sample word list to obtain preprocessed URL texts, executing the following operations for each target URL text:

taking the target URL text as a text to be segmented;

determining a longest appointed word which takes a character at an appointed position as a head in a text to be segmented;

segmenting the text to be segmented into: a first sub-text including the longest specified word, a second sub-text including remaining characters except for the longest specified word;

and taking the second sub-text as a text to be segmented, and returning to the step of determining the longest specified word beginning with the character at the specified position in the text to be segmented until the target URL text is completely segmented, so as to obtain the preprocessed URL text corresponding to the target URL text.
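The segmentation loop described above corresponds to forward maximum matching and can be sketched as follows. The function name `segment_url` is hypothetical, and falling back to a single character when no specified word starts at the current position is an assumption of this example.

```python
from typing import List, Set


def segment_url(text: str, vocabulary: Set[str]) -> List[str]:
    """Repeatedly split off the longest specified word at the head of the text."""
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        match = text[pos]  # fallback: a single character (assumption)
        # Try the longest candidate first, down to length 2.
        for length in range(min(max_len, len(text) - pos), 1, -1):
            candidate = text[pos:pos + length]
            if candidate in vocabulary:
                match = candidate
                break
        tokens.append(match)  # first sub-text: the longest specified word
        pos += len(match)     # the remainder becomes the text to be segmented
    return tokens
```

For example, with a hypothetical word list containing "secure", "login", "log", "pay", "paypal" and "com", the text "securelogin.paypal.com" is split into "secure", "login", ".", "paypal", ".", "com" rather than into individual characters.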

Optionally, the second statistically distributed feature at least includes an HTML tag feature; the feature fusion unit 1003 is specifically configured to determine the HTML tag feature of the target HTML file by the following means:

performing depth-first traversal on a DOM tree of the target HTML file to extract a tag vector;

and performing label classification prediction on the label vector based on the trained decision tree model, and taking the obtained prediction probability as the HTML label characteristic of the target HTML file.

Optionally, the second statistical distribution characteristic at least includes a webpage text characteristic; the feature fusion unit 1003 is specifically configured to determine the web page text feature of the target HTML file by the following means:

performing depth-first traversal on a DOM tree of a target HTML file, and taking a text in the target HTML file inquired through traversal as a target object;

and performing text classification on the target object based on the trained text classification model, and taking the obtained classification probability as the text feature of the webpage.

Optionally, the identifying unit 1004 is specifically configured to:

the webpage identification model is obtained by training with extreme gradient boosting based on a training sample data set, the training sample data set comprises a positive sample of a non-specified category and a negative sample of a specified category, and the weight corresponding to the negative sample is higher than that corresponding to the positive sample.

Optionally, the apparatus further comprises:

a determining unit 1005, configured to, if the matching is successful, use the category label of the specified category sample obtained by matching as a classification recognition result of the to-be-detected web page.

For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same one or more pieces of software or hardware when implementing the present application.

Having described the web page recognition method and apparatus according to an exemplary embodiment of the present application, a web page recognition apparatus according to another exemplary embodiment of the present application will be described next.

As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module" or "system."

In some possible embodiments, a web page identification apparatus according to the present application may include at least a processor and a memory. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the web page identification method according to various exemplary embodiments of the present application described in the specification. For example, the processor may perform the steps as shown in fig. 3.

Based on the same inventive concept as the method embodiment, the embodiment of the application also provides an electronic device. In one embodiment, the electronic device may be a server, such as server 120 shown in FIG. 1. In this embodiment, the electronic device may be configured as shown in fig. 11, and include a memory 1101, a communication module 1103, and one or more processors 1102.

A memory 1101 for storing computer programs executed by the processor 1102. The memory 1101 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.

The memory 1101 may be a volatile memory, such as a random-access memory (RAM); the memory 1101 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1101 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory 1101 may be a combination of the above memories.

The processor 1102 may include one or more Central Processing Units (CPUs), a digital processing unit, and the like. The processor 1102 is configured to implement the above-described web page identification method when calling the computer program stored in the memory 1101.

The communication module 1103 is used for communicating with the terminal device and other servers.

In the embodiment of the present application, a specific connection medium among the memory 1101, the communication module 1103, and the processor 1102 is not limited. In fig. 11, the memory 1101 and the processor 1102 are connected through a bus 1104, the bus 1104 is depicted by a thick line in fig. 11, and the connection manner between other components is only schematically illustrated and not limited. The bus 1104 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is depicted in FIG. 11, but this does not mean that there is only one bus or one type of bus.

The memory 1101 stores a computer storage medium, and the computer storage medium stores computer-executable instructions for implementing the web page identification method according to the embodiment of the present application. The processor 1102 is configured to perform the web page identification method described above, as shown in FIG. 3.

In another embodiment, the electronic device may also be other electronic devices, such as the terminal device 110 shown in fig. 1. In this embodiment, the structure of the electronic device may be as shown in fig. 12, including: communications assembly 1210, memory 1220, display unit 1230, camera 1240, sensors 1250, audio circuitry 1260, bluetooth module 1270, processor 1280, and the like.

The communication component 1210 is configured to communicate with a server. In some embodiments, it may include a Wireless Fidelity (WiFi) module. WiFi is a short-range wireless transmission technology, and the electronic device can help the user transmit and receive information through the WiFi module.

The memory 1220 may be used for storing software programs and data. Processor 1280 performs various functions of terminal device 110 and data processing by executing software programs or data stored in memory 1220. The memory 1220 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Memory 1220 stores an operating system that enables terminal device 110 to operate. The memory 1220 may store an operating system and various application programs, and may also store codes for executing the web page identification method according to the embodiment of the present application.

The display unit 1230 may also be used to display information input by the user or information provided to the user and a Graphical User Interface (GUI) of various menus of the terminal apparatus 110. Specifically, the display unit 1230 may include a display screen 1232 disposed on the front surface of the terminal device 110. The display 1232 may be configured in the form of a liquid crystal display, a light emitting diode, or the like. The display unit 1230 may be used to display an application operation interface and the like in the embodiment of the present application.

The display unit 1230 may be further configured to receive input numeric or character information and generate signal input related to user settings and function control of the terminal device 110, and specifically, the display unit 1230 may include a touch screen 1231 disposed on the front surface of the terminal device 110 and configured to collect touch operations of a user thereon or nearby, such as clicking a button, dragging a scroll box, and the like.

The touch screen 1231 may cover the display screen 1232, or the touch screen 1231 and the display screen 1232 may be integrated to implement the input and output functions of the terminal device 110, and after the integration, the touch screen may be referred to as a touch display screen for short. The display unit 1230 in this application can display the application programs and the corresponding operation steps.

The camera 1240 may be used to capture still images and a user may post comments on the images taken by the camera 1240 through an application. The number of the cameras 1240 may be one or plural. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing elements convert the light signals into electrical signals, which are then passed to a processor 1280 for conversion into digital image signals.

The terminal device may further comprise at least one sensor 1250, such as an acceleration sensor 1251, a distance sensor 1252, a fingerprint sensor 1253, a temperature sensor 1254. The terminal device may also be configured with other sensors such as a gyroscope, barometer, hygrometer, thermometer, infrared sensor, light sensor, motion sensor, and the like.

Audio circuit 1260, speaker 1261, microphone 1262 may provide an audio interface between a user and terminal device 110. The audio circuit 1260 may transmit the received electrical signal converted from the audio data to the speaker 1261, and the audio signal is converted into a sound signal by the speaker 1261 and output. Terminal device 110 may also be configured with a volume button for adjusting the volume of the sound signal. On the other hand, the microphone 1262 converts the collected sound signals into electrical signals, which are received by the audio circuit 1260 and converted into audio data, which are output to the communication component 1210 for transmission to, for example, another terminal device 110, or to the memory 1220 for further processing.

The bluetooth module 1270 is used for information interaction with other bluetooth devices having bluetooth modules through a bluetooth protocol. For example, the terminal device may establish a bluetooth connection with a wearable electronic device (e.g., a smart watch) that is also equipped with a bluetooth module through the bluetooth module 1270, so as to perform data interaction.

The processor 1280 is a control center of the terminal device, connects various parts of the entire terminal device using various interfaces and lines, performs various functions of the terminal device and processes data by running or executing software programs stored in the memory 1220 and calling data stored in the memory 1220. In some embodiments, processor 1280 may include one or more processing units; the processor 1280 may also integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a baseband processor, which primarily handles wireless communications. It is to be appreciated that the baseband processor described above may not be integrated into the processor 1280. In the present application, the processor 1280 may run an operating system, an application program, a user interface display and a touch response, and the web page identification method according to the embodiment of the present application. Additionally, processor 1280 is coupled with display unit 1230.

In some possible embodiments, various aspects of the web page identification method provided herein may also be implemented in the form of a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps in the web page identification method according to various exemplary embodiments of the present application described above in this specification, for example, the computer device may perform the steps as shown in fig. 3.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present application is not limited thereto, and in the context of the present application, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the application, the features and functions of two or more units described above may be embodied in one unit. Conversely, the features and functions of one unit described above may be further divided among, and embodied by, a plurality of units.

Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be broken down into multiple steps for execution.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method for identifying a web page, the method comprising:

acquiring a target Uniform Resource Locator (URL) of a webpage to be detected and a corresponding target hypertext markup language (HTML) file;

2. The method of claim 1, wherein the specified category sample library comprises a URL sample library and an HTML sample library; the performing feature matching on the webpage to be detected and a specified category sample library based on the target URL and the target HTML file comprises the following steps:

3. The method of claim 2, wherein the method further comprises:

and if the HTML sample library contains no candidate HTML file whose file similarity to the target HTML file is higher than the similarity threshold, determining that the matching fails.

4. The method according to claim 2, wherein the comparing the similarity between the target HTML file of the web page to be detected and each candidate HTML file in the HTML sample library to obtain the similarity corresponding to each candidate HTML file comprises:

and performing, for each candidate HTML file, a weighted summation of the corresponding tag similarity, text similarity and style similarity using the corresponding similarity weights, to determine the file similarity corresponding to each candidate HTML file.
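The weighted summation and the threshold test of claims 3 and 4 can be illustrated with the following minimal sketch. The weight values, the threshold, the dictionary keys and the function names `file_similarity` and `best_match` are assumptions made for illustration only; the claims do not specify them.

```python
from typing import Dict, Optional

# Hypothetical similarity weights; the source does not give concrete values.
DEFAULT_WEIGHTS = {"tag": 0.4, "text": 0.4, "style": 0.2}


def file_similarity(sims: Dict[str, float],
                    weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the tag, text and style similarities of one candidate."""
    return sum(weights[k] * sims[k] for k in ("tag", "text", "style"))


def best_match(candidates: Dict[str, Dict[str, float]],
               threshold: float) -> Optional[str]:
    """Return the id of the candidate with the highest file similarity,
    or None (matching fails) if no candidate is above the threshold."""
    best_id, best_score = None, threshold
    for cand_id, sims in candidates.items():
        score = file_similarity(sims)
        if score > best_score:
            best_id, best_score = cand_id, score
    return best_id


if __name__ == "__main__":
    library = {
        "sample_a": {"tag": 0.9, "text": 0.8, "style": 0.7},
        "sample_b": {"tag": 0.2, "text": 0.1, "style": 0.3},
    }
    print(best_match(library, threshold=0.75))
```

Returning `None` corresponds to the case in which the matching fails and the method falls back to feature extraction and classification.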

5. The method of claim 1, wherein said extracting URL features of the target URL and HTML features of the target HTML file comprises:

and extracting a second key information feature, a character feature, a webpage structure feature and a second statistical distribution feature of the target HTML file, and combining the second key information feature, the character feature, the webpage structure feature and the second statistical distribution feature as the HTML feature.

6. The method of claim 5, wherein the first statistical distribution characteristic comprises at least an information entropy; determining the information entropy of the target URL by:
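As an illustration only, one common way to compute an information entropy for a URL is the Shannon entropy of its character distribution. The sketch below assumes this character-level definition, which the claim text above does not confirm; the function name `url_entropy` is hypothetical.

```python
import math
from collections import Counter


def url_entropy(url: str) -> float:
    """Shannon entropy (in bits) of the character distribution of a URL.

    Assumption: entropy is taken over single characters.
    """
    if not url:
        return 0.0
    total = len(url)
    counts = Counter(url)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # A URL with many distinct characters has a higher entropy than a
    # repetitive one.
    print(url_entropy("aaaa.com"), url_entropy("x7f3k9q2.com"))
```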

7. The method of claim 5, wherein the first statistical distribution characteristic comprises at least a relative entropy; determining a relative entropy of the target URL by:
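By way of a hedged example, a relative entropy (Kullback-Leibler divergence) can be computed between the character distribution of the target URL and a reference distribution. The sketch below assumes that the reference distribution is estimated from a set of sample URLs and that unseen characters are floored at a small probability; both choices, and the names `char_distribution` and `relative_entropy`, are assumptions not taken from the claim.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def char_distribution(texts: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each character across the given texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(text)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()} if total else {}


def relative_entropy(url: str, reference: Dict[str, float],
                     floor: float = 1e-6) -> float:
    """KL divergence D(P || Q) in bits, where P is the character distribution
    of `url` and Q is `reference`. Characters missing from Q are assigned the
    probability `floor` (an assumed smoothing choice)."""
    p = char_distribution([url])
    return sum(pv * math.log2(pv / max(reference.get(ch, 0.0), floor))
               for ch, pv in p.items())


if __name__ == "__main__":
    ref = char_distribution(["example.com/login", "example.org/home"])
    print(relative_entropy("example.com/login", ref))
    print(relative_entropy("zq9x.zq", ref))
```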

8. The method of claim 5, wherein the first statistical distribution characteristic comprises at least a spatial distribution characteristic value, wherein URLs of a specified category differ in spatial distribution from URLs of a non-specified category by a corresponding spatial distribution characteristic value;

determining a spatial distribution characteristic value of the target URL by:

9. The method of claim 5, wherein the first statistical distribution characteristic comprises at least a recognition probability; determining the recognition probability of the target URL by:

10. The method of claim 9, wherein the at least two target URL texts are segmented based on the sample vocabulary to obtain pre-processed URL texts, and wherein for each target URL text, the following operations are performed:

taking the target URL text as a text to be segmented;

segmenting the text to be segmented into a first sub-text including the longest specified word and a second sub-text including the remaining characters other than the longest specified word;
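A single segmentation step of the kind described above might look like the following sketch. It assumes that the "specified words" are the entries of the sample vocabulary, that the longest vocabulary word occurring in the text is selected, and that the remaining characters on either side are concatenated into the second sub-text. How the second sub-text is processed afterwards is not shown. The name `split_on_longest_word` is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def split_on_longest_word(text: str,
                          vocabulary: Iterable[str]) -> Tuple[Optional[str], str]:
    """Split `text` into (longest vocabulary word found, remaining characters).

    Returns (None, text) when no vocabulary word occurs in the text.
    """
    matches = [w for w in vocabulary if w and w in text]
    if not matches:
        return None, text
    longest = max(matches, key=len)
    start = text.index(longest)
    # Assumption: characters before and after the word are joined together.
    remainder = text[:start] + text[start + len(longest):]
    return longest, remainder


if __name__ == "__main__":
    vocab = ["login", "log", "bank"]
    print(split_on_longest_word("freeloginbank", vocab))
```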

11. The method of claim 5, wherein the second statistical distribution feature includes at least a JS script entropy; the JS script of an HTML file of the specified category comprises a target action, and the JS script entropy of an HTML file of the specified category is different from the JS script entropy of an HTML file of the non-specified category.

12. The method of claim 5, wherein the second statistical distribution feature includes at least an HTML tag feature; determining the HTML tag feature of the target HTML file by:

performing depth-first traversal on a Document Object Model (DOM) tree of the target HTML file to extract a tag vector;
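The depth-first traversal can be sketched with the Python standard library as follows. The sketch assumes that the tag vector is a vector of tag counts over a fixed tag vocabulary, which the claim does not state; the tree builder is deliberately simplified and the names `Node`, `TreeBuilder`, `dfs_tag_sequence` and `tag_vector` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import List

# HTML void elements never have a closing tag, so they cannot have children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag: str) -> None:
        self.tag = tag
        self.children: List["Node"] = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree containing element nodes only."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def dfs_tag_sequence(html: str) -> List[str]:
    """Tags in pre-order depth-first traversal order."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    tags, stack = [], [builder.root]
    while stack:
        node = stack.pop()
        if node is not builder.root:
            tags.append(node.tag)
        stack.extend(reversed(node.children))  # visit children left to right
    return tags


def tag_vector(html: str, vocabulary: List[str]) -> List[int]:
    """Count of each vocabulary tag (assumed form of the tag vector)."""
    counts = Counter(dfs_tag_sequence(html))
    return [counts[t] for t in vocabulary]


if __name__ == "__main__":
    page = "<html><body><div><p>a</p><br></div><p>b</p></body></html>"
    print(dfs_tag_sequence(page))
    print(tag_vector(page, ["div", "p", "br", "span"]))
```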

13. The method of claim 5, wherein the second statistical distribution feature comprises at least a web page text feature; determining the web page text feature of the target HTML file by:

14. The method according to claim 1, wherein the classifying and predicting the web pages to be detected based on the web page fusion features to obtain the classification and identification result of the web pages to be detected comprises:

the webpage identification model is obtained by training with extreme gradient boosting (XGBoost) based on a training sample data set, the training sample data set comprises positive samples of the non-specified category and negative samples of the specified category, and the weight corresponding to the negative samples is higher than the weight corresponding to the positive samples.

15. The method of any one of claims 1 to 14, further comprising:

and if the matching is successful, taking the class label of the specified class sample obtained by matching as a classification identification result of the webpage to be detected.
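The overall control flow (match against the sample library first, and classify on the fused feature only when the matching fails) can be summarised in the following sketch. All callables passed in are hypothetical placeholders, and feature fusion is assumed here to be a simple concatenation, which the claims do not confirm.

```python
from typing import Callable, List, Optional, Sequence


def identify_page(
    url: str,
    html: str,
    match_library: Callable[[str, str], Optional[str]],
    url_features: Callable[[str], Sequence[float]],
    html_features: Callable[[str], Sequence[float]],
    classify: Callable[[List[float]], str],
) -> str:
    """Return the class label of the web page to be detected."""
    # Step 1: feature matching against the specified category sample library.
    label = match_library(url, html)
    if label is not None:
        return label  # matching succeeded: reuse the matched sample's label
    # Step 2: matching failed, so extract and fuse URL and HTML features.
    fused = list(url_features(url)) + list(html_features(html))
    # Step 3: classification prediction on the fused feature.
    return classify(fused)


if __name__ == "__main__":
    result = identify_page(
        "http://example.com", "<html></html>",
        match_library=lambda u, h: None,
        url_features=lambda u: [float(len(u))],
        html_features=lambda h: [float(len(h))],
        classify=lambda f: "specified" if sum(f) > 100 else "non-specified",
    )
    print(result)
```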

16. An apparatus for identifying a web page, comprising:

17. An electronic device, comprising a processor and a memory, wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 15.

18. A computer-readable storage medium, characterized in that it comprises program code for causing an electronic device to perform the steps of the method of any one of claims 1 to 15, when said storage medium is run on said electronic device.

19. A computer program product comprising a computer program, characterized in that the computer program implements the steps of the method according to any one of claims 1 to 15 when executed by a processor.