CN114928472B - Bad site gray list filtering method based on full circulation main domain name

Info

Publication number: CN114928472B
Application number: CN202210416876.9A
Authority: CN
Inventors: 张兆心; 陈俊仁; 柴婷婷; 赵东; 孟月阳
Original assignee: Harbin Institute of Technology Weihai
Current assignee: Harbin Institute of Technology Weihai
Priority date: 2022-04-20
Filing date: 2022-04-20
Publication date: 2023-07-18
Anticipated expiration: 2042-04-20
Also published as: CN114928472A

Abstract

The invention provides a bad site gray list filtering method based on the full set of circulating main domain names, comprising the following steps: step 1, constructing a discrimination model for bad site domain names based on character similarity, so as to coarsely filter suspected bad site domain names out of the full set of domain names; step 2, identifying whether each domain name can be resolved and is used for a Web service; step 3, performing coarse filtering based on IP similarity; step 4, classifying the domain names by geographical area using IP positioning technology; step 5, analyzing the accuracy of the bad site domain name gray list obtained by coarse filtering; and step 6, iteratively optimizing the coarse filtering of step 1 and step 3. By filtering on domain name character similarity and service IP similarity, the method greatly reduces the number of domain names to be examined, greatly reduces the time spent acquiring and analyzing webpage text and snapshots, and achieves efficient and accurate filtering of the full set of domain names.

Description

Bad site gray list filtering method based on full circulation main domain name

Technical Field

The invention relates to the technical field of constructing domain name gray lists of bad sites, and in particular to a bad site gray list filtering method based on the full set of circulating main domain names.

Background

With the rapid development of computer networks, the Internet has become an integral part of human life. The domain name system provides the mapping between IP addresses and domain names for applications and services in the network, so that people can access the Internet more conveniently through domain names. Today, however, the network is flooded with a large number of bad sites. They not only harm people's mental well-being but can also seriously endanger the security of their property. Identifying, monitoring and controlling bad sites is therefore important.

About 260 million main domain names are in circulation worldwide; about 300,000 domain names are newly added each day, and about 300,000 expire each day. At present, the main methods for identifying bad sites rely on web page text and web page snapshots, but acquiring and analyzing web page text and snapshots is very time-consuming. An efficient, systematic method for filtering the full set of circulating main domain names is therefore lacking, so a bad site domain name gray list covering the full set cannot be constructed effectively.

Disclosure of Invention

To address the long processing time and high cost of existing gray list filtering methods for the full set of domain names, which rely on webpage text and webpage snapshots, the invention provides a bad site gray list filtering method based on the full set of circulating main domain names.

To this end, the technical scheme of the invention is a bad site gray list filtering method based on the full set of circulating main domain names, comprising the following steps:

step 1, extracting features from the character strings of existing bad site domain names, establishing a bad site keyword phrase library, and constructing a discrimination model for bad site domain names based on character similarity, so as to coarsely filter suspected bad site domain names out of the full set of domain names;

step 2, constructing a fast IP and port scanning model, acquiring the service IP and port attribute information of the suspected bad site domain names, and identifying whether each domain name can be resolved and is used for a Web service;

step 3, establishing an IP mapping range model of bad site domain names from the existing group of bad site service IPs, and performing coarse filtering based on IP similarity;

step 4, classifying the domain names by geographical area using IP positioning technology;

step 5, analyzing the accuracy of the bad site domain name gray list obtained by coarse filtering, using existing bad site identification technology;

step 6, iteratively optimizing the coarse filtering of step 1 and step 3.

Bad site domain names are constructed in two main forms: in the first type, the domain name contains English words or Chinese pinyin; in the second type, the domain name consists of a random sequence of characters. In the method based on the character similarity model, a bad keyword phrase library is constructed for keyword matching against domain names of the first type, and an LSTM neural network model is trained to judge whether the character sequence of a domain name of the second type is randomly generated.

The bad keyword phrase library is constructed as follows: a dictionary of 370,000 English words and 405 Chinese pinyin syllables are combined into an English and Chinese pinyin dictionary; longest word matching is performed against a set of 390,000 bad domain names; and the bad English and pinyin phrases that occur with high frequency are extracted to form the bad keyword phrase library, which is used for subsequent keyword matching and filtering.

The LSTM neural network model is trained using 700,000 Alexa domain names and 780,000 random-character-sequence domain names as the training and test sets. The LSTM neural network is divided into 3 layers: 1. a preprocessing layer, which pads the domain name character sequence to a length of 75, then maps the characters to integer indexes, and finally converts the positive integer indexes into fixed-size dense vectors used as character embeddings; 2. a long short-term memory layer, in which the number of units is set to 128 and dropout is set to 0.5 to avoid overfitting; 3. an output layer, which produces a binary classification output.

The coarse filtering of suspected bad site domain names proceeds as follows: first, keywords are matched against the constructed bad keyword phrase library to judge whether the domain name contains a bad keyword phrase; if it does, the domain name is considered likely to be used for a bad site. If no sensitive keyword is present, the trained LSTM neural network model is used to judge whether the domain name consists of a random sequence of characters; if it does, the domain name is considered likely to be used for a bad site.

The coarse filtering based on IP similarity analyzes the IPs stored in step 2 against the existing IP mapping range model; if an IP falls within one of the mapping ranges of the model, the IP is considered to be used for serving bad content.

Further, the specific method for iteratively optimizing the coarse filtering of step 1 and step 3 is as follows:

step S1, dynamically updating the bad keyword phrase library: new bad English and pinyin phrases that occur with high frequency are added to the phrase library, and phrases in the library that have not been matched for a long time are deleted;

step S2, dynamically updating the IP mapping range model: newly appearing bad site service IPs are integrated into the model, and IP ranges in the model that have not been hit for a long time are reduced.

The advantage of the method is that, when filtering the bad site gray list out of the full set of circulating main domain names, filtering on domain name character similarity and service IP similarity reduces the set of 260 million domain names by 90%. This greatly reduces the time spent acquiring and analyzing webpage text and snapshots, so that the full set of domain names can be filtered at high speed and with high precision.

Drawings

FIG. 1 is a schematic diagram of the construction of keyword phrase libraries, LSTM neural network models and IP mapping range models in accordance with the present invention;

FIG. 2 is a flow chart of the present invention for performing the filtering of bad site gray lists.

Detailed Description

The invention is further described below with reference to examples.

As shown in FIG. 1, the first stage of the invention constructs the character similarity model and the IP mapping range model in two steps. The specific steps are as follows:

step (1): analysis of bad site domain names shows that they are constructed in two main forms. In the first type, the domain name contains English words or Chinese pinyin (depending on the language; the domain names of Chinese bad sites more often contain pinyin), for example: ponsvideo.com, tiyubocai.cn, etc. In the second type, the domain name consists of a random sequence of characters (possibly generated randomly by an algorithm), for example: vdqw-96.com, 12034.cn. Therefore, in the method based on the character similarity model, a bad keyword phrase library is constructed for keyword matching against domain names of the first type, and an LSTM neural network is trained to judge whether the character sequence of a domain name of the second type is randomly generated.

(1) Constructing the bad keyword phrase library: a dictionary of 370,000 English words and 405 Chinese pinyin syllables (without tone marks) are combined into an English and Chinese pinyin dictionary. Longest word matching is performed against the set of 390,000 bad domain names, and the bad English and pinyin phrases that occur with high frequency are extracted to form the bad keyword phrase library, which is used for subsequent keyword matching and filtering.
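The construction of the phrase library can be illustrated with a minimal sketch. It assumes forward longest matching over the first label of each domain name and a simple document-frequency threshold; the function names, the `min_count` parameter and the toy dictionary are illustrative assumptions, not details given in the text.

```python
from collections import Counter

def longest_match_segments(label, dictionary, max_len=None):
    """Forward longest-match segmentation of a domain label against a dictionary."""
    max_len = max_len or max(map(len, dictionary))
    i, found = 0, []
    while i < len(label):
        # Try the longest candidate first and shrink until an entry matches.
        for j in range(min(len(label), i + max_len), i, -1):
            if label[i:j] in dictionary:
                found.append(label[i:j])
                i = j
                break
        else:
            i += 1  # no dictionary entry starts here; skip one character
    return found

def build_phrase_library(bad_domains, dictionary, min_count=2):
    """Keep dictionary entries found in at least min_count bad domains (assumed threshold)."""
    counts = Counter()
    for domain in bad_domains:
        label = domain.lower().split(".")[0]
        counts.update(set(longest_match_segments(label, dictionary)))
    return {word for word, count in counts.items() if count >= min_count}

# Toy stand-in for the combined English and pinyin dictionary.
toy_dictionary = {"ti", "yu", "tiyu", "bo", "cai", "video"}
library = build_phrase_library(["tiyubocai.cn", "bocai88.com", "video.com"], toy_dictionary)
```

With the toy data above, only entries seen in two or more domain names survive the threshold; a real library would be built from the full dictionary and the 390,000 bad domain names.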

(2) Training the LSTM neural network model: the LSTM model is trained using 700,000 Alexa domain names and 780,000 random-character-sequence domain names (made up of the random-character-sequence domain names among the 390,000 bad domain names, together with DGA domain names) as the training and test sets. The neural network is divided into 3 layers: 1. Preprocessing layer: pads the domain name character sequence to a length of 75, then maps the characters to integer indexes, and finally converts the positive integer indexes into fixed-size dense vectors used as character embeddings. 2. Long short-term memory layer: the number of units is set to 128 and dropout is set to 0.5 to avoid overfitting. 3. Output layer: produces a binary classification output. The resulting accuracy is 94% on the training set and 96% on the test set.
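The first two operations of the preprocessing layer (padding to length 75 and mapping characters to integer indexes) can be sketched as below. The alphabet, the use of index 0 for padding, right-side padding and the extra index for unknown characters are assumptions made for illustration; the embedding, the 128-unit LSTM layer and the output layer would be built in a deep learning framework and are not shown.

```python
import string

MAX_LEN = 75  # sequence length stated in the text

# Assumed alphabet: lowercase letters, digits, hyphen and dot.
ALPHABET = string.ascii_lowercase + string.digits + "-."
# Index 0 is reserved for padding, so real characters start at 1.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNKNOWN = len(ALPHABET) + 1  # assumed index for characters outside the alphabet

def encode_domain(domain):
    """Map a domain name to a fixed-length list of integer indexes."""
    indexes = [CHAR_TO_INDEX.get(ch, UNKNOWN) for ch in domain.lower()[:MAX_LEN]]
    # Right-pad with zeros up to MAX_LEN (padding side is an assumption).
    return indexes + [0] * (MAX_LEN - len(indexes))

encoded = encode_domain("vdqw-96.com")
```

Each encoded sequence would then be passed to an embedding layer that turns every index into a dense vector.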

Step (2): DNS resolution is performed on the existing 390,000 bad domain names to obtain all of their service IP addresses. When IP addresses are applied for, a batch of consecutive IP addresses is usually requested as a reserve. The ranges covered by all bad IPs are therefore mapped, and a model of the bad IP mapping ranges is constructed for subsequent filtering.

As shown in FIG. 2, the bad site gray list filtering method based on the full set of circulating main domain names specifically comprises the following steps:
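One way to picture the IP mapping range model is sketched below. The text does not say how the ranges are derived, so the sketch assumes that each known bad IPv4 address marks its enclosing /24 block as suspect and that adjacent blocks are merged; `prefix_len` and the function names are illustrative, and the addresses are documentation-only examples.

```python
import ipaddress

def build_ip_range_model(bad_ips, prefix_len=24):
    """Map each bad IPv4 address to its enclosing block and merge adjacent blocks."""
    blocks = {
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False) for ip in bad_ips
    }
    return list(ipaddress.collapse_addresses(blocks))

def in_bad_range(ip, model):
    """Return True if the address falls inside any range of the model."""
    addr = ipaddress.ip_address(ip)
    return any(addr in block for block in model)

model = build_ip_range_model(["192.0.2.10", "192.0.2.200", "198.51.100.7"])
hit = in_bad_range("192.0.2.99", model)      # same block as a known bad IP
miss = in_bad_range("203.0.113.1", model)    # outside every modelled range
```

A linear scan is adequate for a sketch; at the scale described in the text, a sorted interval list or a prefix tree would be a more natural choice.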

step 1: extracting features from the existing bad site domain name character strings, establishing a bad site keyword phrase library, and constructing a bad site domain name discrimination model based on character similarity, so as to realize coarse filtering of suspected bad site domain names in the full domain names. And filtering the similarity of the domain name characters by taking 2.6 hundred million full-quantity main domain names as input data. Wherein the filtering is performed in two parts. Firstly, matching keywords through a constructed bad keyword phrase library, and judging whether a domain name contains bad keyword phrases or not. If present, this domain name is considered likely to be used for bad sites. If the sensitive keywords do not exist, judging the randomness of the character sequence by using the trained LSTM model, and judging whether the domain name consists of characters randomly. If so, then the domain name is considered likely to be for the bad site. Coarse filtering of suspected bad site domain names is performed through the two parts.

Step 2: a fast IP and port scanning model is constructed, the service IP and port attribute information of the suspected bad site domain names is acquired, and it is identified whether each domain name can be resolved and is used for a Web service. The service IPs and port attributes are obtained for the domain name set produced by the previous step. The A records of each domain name are obtained by DNS resolution, and all available IP addresses are stored. Port scanning is then performed to check whether ports such as 80, 443 and 8080 are open, thereby selecting the IPs that are used for Web services.
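A minimal, sequential sketch of the resolution and port check is given below using only the standard `socket` module. It relies on the system resolver rather than querying A records directly, and it is not the fast scanning model the text refers to; the function names and the timeout value are assumptions.

```python
import socket

WEB_PORTS = (80, 443, 8080)  # ports named in the text

def resolve_ipv4(domain):
    """Return the IPv4 addresses the system resolver gives for a domain name."""
    try:
        infos = socket.getaddrinfo(
            domain, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []  # the name cannot be resolved
    return sorted({info[4][0] for info in infos})

def open_ports(ip, ports=WEB_PORTS, timeout=1.0):
    """Return the subset of ports that accept a TCP connection."""
    found = []
    for port in ports:
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                found.append(port)
        except OSError:
            pass  # closed, filtered or timed out
    return found

def web_service_ips(domain):
    """Keep only the resolved addresses with at least one open Web port."""
    return [ip for ip in resolve_ipv4(domain) if open_ports(ip)]
```

Scanning on the order of tens of millions of domain names would call for an asynchronous or batch scanner; the sketch only shows the per-domain logic.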

Step 3: an IP mapping range model of bad site domain names is established from the existing group of bad site service IPs, and coarse filtering is performed based on IP similarity. The IPs stored in step 2 are analyzed for similarity against the existing IP mapping range model. If an IP falls within one of the mapping ranges of the model, the IP is considered to be used for serving bad content. Coarse filtering of bad sites is thus performed by IP similarity.

Step 4: the domain names are classified by geographical area using IP positioning technology. The service IP positioning technology is used to obtain the physical location attribute of each IP, which is subdivided into domestic and foreign. The IPs obtained by the filtering in the above steps are stored together with their corresponding domain names.
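The domestic and foreign split can be sketched as a lookup against a prefix table. The table below is purely hypothetical: it uses documentation-only address blocks as placeholders for a real IP geolocation database, which the text does not name.

```python
import ipaddress

# Hypothetical lookup table; documentation-only prefixes stand in for
# the entries of a real IP geolocation database.
REGION_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "domestic"),
    (ipaddress.ip_network("198.51.100.0/24"), "foreign"),
]

def classify_region(ip, table=REGION_TABLE):
    """Return the region label of the first prefix containing the address."""
    addr = ipaddress.ip_address(ip)
    for network, region in table:
        if addr in network:
            return region
    return "unknown"

def group_by_region(domain_to_ip):
    """Store each domain name under the region of its service IP."""
    groups = {}
    for domain, ip in domain_to_ip.items():
        groups.setdefault(classify_region(ip), []).append(domain)
    return groups

groups = group_by_region({"a.example": "192.0.2.5", "b.example": "198.51.100.9"})
```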

Step 5: the accuracy of the bad site domain name gray list obtained by coarse filtering is analyzed using existing bad site identification technology. The domain names obtained by the filtering in the above steps are judged precisely by an existing bad site discrimination model. That model is based on web page content and snapshots, so acquiring the text content and snapshots is time-consuming. However, the filtering in the above steps has already reduced the set of domain names by 90%, and the remaining domain names are all highly suspected of being used for bad sites. This step can therefore efficiently produce the bad site domain name gray list and evaluate the effect of the preceding filtering steps.
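The accuracy analysis can be sketched as the share of coarse-filtered domain names that the existing discrimination model confirms. Treating "accuracy" as this confirmed fraction is an assumption, and `is_bad_site` is a hypothetical callable standing in for the content-and-snapshot-based model.

```python
def evaluate_gray_list(gray_list, is_bad_site):
    """Return the confirmed bad domains and the fraction of the gray list confirmed."""
    confirmed = [domain for domain in gray_list if is_bad_site(domain)]
    fraction = len(confirmed) / len(gray_list) if gray_list else 0.0
    return confirmed, fraction

# Toy stand-in for the existing discrimination model.
candidates = ["a.example", "b.example", "c.example", "d.example"]
confirmed, fraction = evaluate_gray_list(candidates, lambda d: d != "d.example")
```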

Step 6: the coarse filtering method is iteratively optimized. The gray list of bad domain names for the full set is stored, and step 1 and step 3 are iteratively optimized. The optimization is carried out as follows. Step S1: the bad keyword phrase library is dynamically updated; new bad English and pinyin phrases that occur with high frequency are added to the phrase library, and phrases in the library that have not been matched for a long time are deleted. Step S2: the IP mapping range model is dynamically updated; newly appearing bad site service IPs are integrated into the model, and IP ranges in the model that have not been hit for a long time are reduced.
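Both updates follow the same add-and-expire pattern, which can be sketched with one helper. The sketch assumes each phrase or IP range carries the day number of its most recent hit and that stale entries are dropped outright after a fixed idle period; the 90-day value and the function name are illustrative, and dropping a range is only one possible reading of "reduced".

```python
def refresh_model(last_hit, new_items, today, max_idle_days=90):
    """Add newly observed items and drop those not hit within max_idle_days.

    last_hit maps an item (a phrase or an IP range) to the day number of its
    most recent hit; the idle threshold is an assumed value.
    """
    kept = {
        item: day for item, day in last_hit.items() if today - day <= max_idle_days
    }
    for item in new_items:
        kept[item] = today  # newly appearing items count as hit today
    return kept

phrases = refresh_model({"bocai": 100, "oldword": 1}, ["newword"], today=120)
```

The same call can be applied to a mapping from IP ranges to their last hit day for step S2.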

When the method is used to filter the bad site gray list out of the full set of circulating main domain names, filtering on domain name character similarity and service IP similarity reduces the set of 260 million domain names by 90%. This greatly reduces the time spent acquiring and analyzing webpage text and snapshots, so that the full set of domain names can be filtered at high speed and with high precision.

The foregoing description is only illustrative of the invention and is not intended to limit its scope; substitutions of equivalent elements and equivalent variations and modifications made within the scope of the invention are intended to fall within the scope of the claims.

Claims

1. A bad site gray list filtering method based on the full set of circulating main domain names, characterized by comprising the following steps:

bad site domain names are constructed in two main forms, wherein in the first type the domain name contains English words or Chinese pinyin, and in the second type the domain name consists of a random sequence of characters; in the method based on the character similarity model, a bad keyword phrase library is constructed for keyword matching against domain names of the first type, and an LSTM neural network model is trained to judge whether the character sequence of a domain name of the second type is randomly generated;

a dictionary of 370,000 English words and 405 Chinese pinyin syllables are combined into an English and Chinese pinyin dictionary, longest word matching is performed against a set of 390,000 bad domain names, and the bad English and pinyin phrases that occur with high frequency are extracted to form the bad keyword phrase library, which is used for subsequent keyword matching and filtering;

the LSTM neural network model is trained using 700,000 Alexa domain names and 780,000 random-character-sequence domain names as the training and test sets, and the LSTM neural network is divided into 3 layers: (1) a preprocessing layer, which pads the domain name character sequence to a length of 75, then maps the characters to integer indexes, and finally converts the positive integer indexes into fixed-size dense vectors used as character embeddings; (2) a long short-term memory layer, in which the number of units is set to 128 and dropout is set to 0.5 to avoid overfitting; (3) an output layer, which produces a binary classification output;

the coarse filtering of suspected bad site domain names comprises: first, matching keywords against the constructed bad keyword phrase library and judging whether the domain name contains a bad keyword phrase; if it does, considering the domain name likely to be used for a bad site; if no sensitive keyword is present, using the trained LSTM neural network model to judge whether the domain name consists of a random sequence of characters; and if it does, considering the domain name likely to be used for a bad site;

2. The bad site gray list filtering method based on the full set of circulating main domain names according to claim 1, characterized in that the specific method for iteratively optimizing the coarse filtering of step 1 and step 3 comprises the following steps: