CN114912443A - Domain name detection, classification and feature screening method, system, device and storage medium - Google Patents

Publication number
CN114912443A
Authority
CN
China
Prior art keywords
domain name
feature
matrix
eigenvectors
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210713174.7A
Other languages
Chinese (zh)
Inventor
臧小东
曹健波
王茂励
马旭
胡春美
张国威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qufu Normal University
Original Assignee
Qufu Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qufu Normal University
Priority to CN202210713174.7A
Publication of CN114912443A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Monitoring And Testing Of Transmission In General (AREA)

Abstract

The invention discloses a method, system, device and storage medium for domain name detection, classification and feature screening, belonging to the technical field of domain name detection. The method comprises: extracting domain names from measured data and, from each domain name, extracting a plurality of features that distinguish domain names generated by a domain generation algorithm from normal domain names, to form an original domain name data feature matrix; subtracting from each feature value in every row of the original domain name data feature matrix the mean of the corresponding feature; computing the covariance matrix of the original domain name data feature matrix together with its eigenvectors and eigenvalues; and sorting the eigenvectors of the covariance matrix by their corresponding eigenvalues and selecting the eigenvectors to be retained. The method is low-cost, responds quickly and is highly accurate, making it better suited to real-time network monitoring scenarios.

Description

Domain name detection, classification and feature screening method, system, device and storage medium
Technical Field
The present invention relates to the field of domain name detection, and in particular, to a method, system, device, and storage medium for domain name detection, classification, and feature screening.
Background
The DNS (Domain Name System) is an important piece of Internet infrastructure that carries the two-way mapping between domain names and IP addresses. Besides normal applications, various malicious network activities, such as botnets, spam and phishing websites, also use it to obtain the IP addresses of their call-back servers and evade detection. Effectively detecting and identifying malicious domain names is therefore important for discovering and preventing the spread of malicious behavior.
Because the coverage of the DNS traffic that can be observed differs in actual monitoring, DNS resolution traffic is divided into upper-layer traffic and lower-layer traffic: the former is the DNS resolution traffic that reaches a DNS server, and the latter is all actual DNS resolution traffic (including traffic served from a cache). Existing AGD (Algorithmically Generated Domain) detection schemes fall into two main categories according to the traffic they use. The first category works on lower-layer DNS traffic captured at a local caching name server; it assumes that bot hosts infected by the same malicious code have similar access patterns, so detection can be performed by analyzing client access behavior and failed domain name queries. Although lower-layer DNS traffic provides detailed information about DNS requests and responses, capturing it requires permission to monitor DNS server traffic, and access to a DNS server also raises privacy concerns, so the application scenarios of such methods are limited. The second category analyzes upper-layer DNS traffic and discovers malicious behavior by measuring characteristics such as the similarity and spatial distribution of the IP address set in the caching name server. Because the local caching name servers observed from a top-level domain name server are numerous and widely distributed geographically, the accuracy of such algorithms is reduced.
Existing CNN-based DGA (Domain Generation Algorithm) domain name detection methods generally convert domain names into word vectors and design various convolutional neural networks to extract word-vector features for classification and detection. These schemes have complex structures, high computational overhead and time complexity, and long model training periods.
Disclosure of Invention
1. Technical problem to be solved by the invention
To overcome the above technical problems, the invention provides a method, system, device and storage medium for domain name detection, classification and feature screening. The literal features of domain names are extracted and processed, and the processed features are used as the input of a convolutional neural network, which performs convolution over the extracted linguistic features to extract the implicit relationships among them. The detection precision is about 98.7% and the detection time is reduced by 15.2%, so the method is better suited to real-time network monitoring scenarios.
2. Technical scheme
In order to solve the problems, the technical scheme provided by the invention is as follows:
In a first aspect, the present invention provides a domain name feature screening method, including: extracting domain names from measured data and, from each domain name, extracting a plurality of features that distinguish domain names generated by a domain generation algorithm from normal domain names, to form an original domain name data feature matrix; subtracting from each feature value in every row of the original domain name data feature matrix the mean of the corresponding feature; computing the covariance matrix of the original domain name data feature matrix together with its eigenvectors and eigenvalues; and sorting the eigenvectors of the covariance matrix by their corresponding eigenvalues and selecting the eigenvectors to be retained.
Further, selecting the eigenvectors to be retained comprises either automatic selection or selection with a specified dimension-reduction number.
In a second aspect, the present invention provides a domain name classification method, including:
S201, selecting DGA domain names and normal domain names that have already been classified, together with their true results, and mixing them to form a set of domain name data used as sample domain name data;
S202, applying the above domain name feature screening method to the sample domain name data to select a plurality of eigenvectors and their eigenvalues as the sample domain name data feature dimension-reduction matrix;
S203, setting the optimizer and the loss function of the one-dimensional convolutional neural network model with a model compiling method;
S204, setting the learning rate of the optimizer, and setting the number and size of the filters;
S205, inputting the sample domain name data feature dimension-reduction matrix into the one-dimensional convolutional neural network model to obtain the predicted classification result of the sample domain name data;
S206, measuring the deviation value between the predicted classification result and the true result with the loss function;
if the deviation value does not meet the set requirement, executing steps S204-S205 again and training iteratively until the deviation value meets the set requirement;
the one-dimensional convolutional neural network model comprises a plurality of convolutional layers and a plurality of max pooling layers, followed in sequence by an intermediate layer and a fully-connected layer that uses an activation function.
Further, a Flatten layer is used as the intermediate layer to link the convolutional layers and the fully-connected layer, a Dense layer is used as the fully-connected layer, and a dropout (random deactivation) layer is arranged between the intermediate layer and the fully-connected layer. The dropout layer prevents overfitting and improves the generalization ability of the model.
Further, the convolutional layers use the relu function as the activation function, and the fully-connected layer uses the sigmoid function as the activation function.
Further, the loss function is a binary cross entropy.
In a third aspect, the present invention provides a domain name detection method, including: selecting, from measured data, the eigenvectors to be retained according to the above domain name feature screening method, and using them as a domain name data feature dimension-reduction matrix; and inputting the domain name data feature dimension-reduction matrix into the one-dimensional convolutional neural network model whose parameters were determined by iterative training according to any one of the above domain name classification methods, to distinguish DGA domain names from normal domain names.
In a fourth aspect, the present invention provides a system, comprising one of the following systems:
a domain name screening system, comprising: a preprocessing unit for extracting domain names from measured data and, from each domain name, extracting a plurality of features that distinguish domain names generated by a domain generation algorithm from normal domain names, to form an original domain name data feature matrix; and a domain name feature screening unit for subtracting from each feature value in every row of the original domain name data feature matrix the mean of the corresponding feature, computing the covariance matrix of the original domain name data feature matrix together with its eigenvectors and eigenvalues, sorting the eigenvectors of the covariance matrix by their corresponding eigenvalues, and selecting the eigenvectors to be retained as the domain name data feature dimension-reduction matrix;
or a domain name classification system for executing any one of the above domain name classification methods and determining parameters through iterative training to obtain a one-dimensional convolutional neural network model for domain name classification;
or a domain name detection system, comprising: a preprocessing unit for extracting domain names from measured data and, from each domain name, extracting a plurality of features that distinguish domain names generated by a domain generation algorithm from normal domain names, to form an original domain name data feature matrix; a domain name feature screening unit for subtracting from each feature value in every row of the original domain name data feature matrix the mean of the corresponding feature, computing the covariance matrix of the original domain name data feature matrix together with its eigenvectors and eigenvalues, sorting the eigenvectors of the covariance matrix by their corresponding eigenvalues, and selecting the eigenvectors to be retained as the domain name data feature dimension-reduction matrix; and a domain name detection unit for inputting the domain name data feature dimension-reduction matrix into the one-dimensional convolutional neural network model whose parameters were determined by iterative training according to any one of the above domain name classification methods, to distinguish DGA domain names from normal domain names.
Furthermore, the present invention provides an apparatus comprising: one or more processors; memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to perform a method as described above.
Accordingly, the present invention provides a storage medium storing a computer program which, when executed by a processor, implements any one of the methods described above.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
The literal features of domain names are extracted and processed, and the processed features are used as the input of a convolutional neural network, which performs convolution over the extracted linguistic features and extracts the implicit relationships among them. The method is therefore low-cost, responds quickly and is highly accurate; the detection precision is about 98.7%, the detection time is reduced by 15.2%, and the method is better suited to real-time network monitoring scenarios.
Drawings
Fig. 1 is a schematic diagram of an apparatus according to the present invention.
Fig. 2 is an architecture diagram of a domain name detection system according to an embodiment of the present invention.
Fig. 3 is a flowchart of a domain name feature screening method according to an embodiment of the present invention.
Fig. 4 is a one-dimensional convolutional neural network model in a domain name classification method according to an embodiment of the present invention.
Fig. 5 is a flowchart of a domain name classification method according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a domain name detection system according to an embodiment of the present invention.
Detailed Description
For a further understanding of the present invention, reference will now be made in detail to the embodiments illustrated in the drawings.
The present application is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here merely illustrate the invention and do not limit it. For convenience of description, only the parts related to the invention are shown in the drawings. Terms such as "first" and "second" are used only for convenience in describing the technical solution; they are generic terms with no specific limiting effect. The embodiments and the features of the embodiments in the present application may be combined with one another where no conflict arises.
Example 1
As shown in fig. 3, the present embodiment provides a domain name feature screening method, including:
extracting domain names from measured data and, from each domain name, extracting a plurality of features that distinguish domain names generated by a domain generation algorithm from normal domain names, to form an original domain name data feature matrix;
subtracting from each feature value in every row of the original domain name data feature matrix the mean of the corresponding feature;
computing the covariance matrix of the original domain name data feature matrix together with its eigenvectors and eigenvalues;
and sorting the eigenvectors of the covariance matrix by their corresponding eigenvalues and selecting the eigenvectors to be retained.
The measured data is processed to filter out invalid information, and the literal features of the domain names are extracted. These features, which distinguish domain names generated by a domain generation algorithm from normal domain names, are extracted manually. Because of their high randomness, most DGA domain names are irregular character strings, and manual extraction can pinpoint the differences between DGA domain names and normal domain names, yielding a small, precise set of classification features with a low probability of classification error, so that the two can be distinguished more accurately.
The features fall into two categories: basic features and derived features. Most DGA domain names are highly random, irregular character strings, and these features suit all such highly random strings; a small number of DGA domain names are word-based. In one embodiment, the basic features comprise 18 features such as domain name length, total number of consonants, total number of vowels and entropy. The derived features comprise 19 features computed from the basic feature values, such as the ratio of entropy to domain name length, the ratio of total consonants to total vowels, the ratio of total consonants to domain name length, and the ratio of total vowels to domain name length. Because DGA domain names are highly random, the number of extracted features is not fixed; it is determined by the actual DGA domain names and is not limited to those enumerated in this embodiment.
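As a minimal sketch of the literal features named in this embodiment, the following Python computes four of the basic features (length, vowel count, consonant count, entropy) and the four derived ratios listed above. The function names, the use of Shannon entropy over characters, and the choice to treat a single label string as "the domain name" are assumptions made for illustration; the patent does not specify them, and the remaining basic and derived features are not reproduced here.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def basic_features(label: str) -> dict:
    """Compute four of the basic literal features for one domain label."""
    s = label.lower()
    letters = [c for c in s if c.isalpha()]
    vowels = sum(1 for c in letters if c in VOWELS)
    consonants = len(letters) - vowels
    n = len(s)
    counts = Counter(s)
    # Shannon entropy in bits per character (assumed definition of "entropy").
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values()) if n else 0.0
    return {"length": n, "vowels": vowels, "consonants": consonants, "entropy": entropy}


def derived_features(b: dict) -> dict:
    """Compute the four derived ratios listed in the text from the basic features."""
    def ratio(num, den):
        # Guard against division by zero, e.g. a label without vowels.
        return num / den if den else 0.0

    return {
        "entropy_per_length": ratio(b["entropy"], b["length"]),
        "consonant_vowel_ratio": ratio(b["consonants"], b["vowels"]),
        "consonant_ratio": ratio(b["consonants"], b["length"]),
        "vowel_ratio": ratio(b["vowels"], b["length"]),
    }
```

Concatenating the values of both dictionaries for each domain name would give one row of the original domain name data feature matrix.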
The extracted features are then processed further: the PCA algorithm is used to screen out the features most effective for classifying and identifying DGA domain names and normal domain names, yielding the domain name data feature dimension-reduction matrix, i.e., a small, precise set of classification features (the automatically selected eigenvectors) with a low probability of classification error. DGA domain names can thus be detected more accurately while storage and computation overhead are saved.
Because the PCA algorithm reduces the computational overhead of the algorithm, removes noise and has no parameter restrictions at all, this embodiment uses the PCA algorithm shown in fig. 1 for feature selection. PCA can select features in two ways. One is to specify the dimension-reduction number, i.e., how many dimensions to reduce to; the more dimensions are removed, the more useful information is lost. The other is automatic dimension reduction, in which the PCA algorithm itself chooses a suitable dimension, i.e., automatically selects the eigenvectors to be retained; in the above embodiment this finally reduces the 37 feature dimensions to 33. The two modes were tested and compared, and classification with automatically selected features outperformed classification with a specified dimension.
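The two selection modes can be sketched as follows. The patent does not state which rule the automatic mode uses; the cumulative explained-variance threshold below (`min_variance`) is an assumed, commonly used criterion, and both function names are hypothetical.

```python
def select_k_fixed(eigenvalues, k):
    """Specified mode: indices of the k largest eigenvalues."""
    order = sorted(range(len(eigenvalues)), key=lambda i: eigenvalues[i], reverse=True)
    return order[:k]


def select_k_auto(eigenvalues, min_variance=0.99):
    """Automatic mode (assumed rule): keep the fewest components whose
    cumulative share of the total variance reaches min_variance."""
    order = sorted(range(len(eigenvalues)), key=lambda i: eigenvalues[i], reverse=True)
    total = sum(eigenvalues)
    if total <= 0:
        return order  # degenerate case: nothing to rank by
    kept, running = [], 0.0
    for i in order:
        kept.append(i)
        running += eigenvalues[i]
        if running / total >= min_variance:
            break
    return kept
```

Under this assumed rule, the number of retained eigenvectors follows from the data rather than from a preset count, which matches the behaviour the embodiment describes for the automatic mode.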
In a specific embodiment, a 5000000 × 37 original domain name data feature matrix A is first constructed and input, and the mean of each feature is subtracted from every row of A (each row holds the 37 feature values of one piece of domain name data). Second, the covariance matrix of the original domain name data feature matrix A is obtained as C = AA^T, and the eigenvalues and eigenvectors of C are computed. Finally, the eigenvectors are arranged into a matrix from top to bottom in descending order of their eigenvalues, the number of eigenvectors to retain (33 in total) is selected automatically (eigenvectors with large eigenvalues are chosen), and a new 5000000 × 33 matrix is obtained, namely the domain name data feature dimension-reduction matrix. The eigenvectors represent the literal differences between DGA domain names and normal domain names: the larger their values, the larger the literal difference, so they can be used to distinguish DGA domain names from normal domain names.
Selecting the eigenvectors to be retained comprises automatic selection and selection with a specified dimension-reduction number. Comparative analysis shows that when the specified dimension-reduction number is 20, many eigenvectors are discarded and a large amount of useful information is lost, whereas automatic selection retains 33 dimensions (i.e., 33 eigenvectors). Automatically selecting the eigenvectors to be retained therefore reduces the loss of useful features.
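The screening steps (mean-centering, covariance, eigen-decomposition, projection) can be illustrated with a small pure-Python sketch. It is not the patent's implementation: the function names are hypothetical, power iteration with deflation is swapped in as a simple eigen-solver suitable only for small symmetric matrices, and the covariance is computed per feature as A^T A / (n - 1) because each row here is one sample. The text's AA^T form corresponds to the convention in which each column is a sample.

```python
import math


def center(rows):
    """Subtract each feature's mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Feature-by-feature sample covariance of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpairs(C, k, iters=500):
    """Return the k largest (eigenvalue, eigenvector) pairs of symmetric C,
    found by power iteration with deflation (illustrative solver only)."""
    d = len(C)
    C = [row[:] for row in C]  # work on a copy
    pairs = []
    for _ in range(k):
        # Start vector; it must not be orthogonal to the target eigenvector.
        v = [1.0 / (i + 1) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(d)) for i in range(d))
        pairs.append((lam, v))
        # Deflate so the next round converges to the next-largest component.
        for i in range(d):
            for j in range(d):
                C[i][j] -= lam * v[i] * v[j]
    return pairs


def project(centered, vectors):
    """Project centered rows onto the retained eigenvectors (n x k result)."""
    return [[sum(r[j] * v[j] for j in range(len(v))) for v in vectors]
            for r in centered]
```

Calling `project(center(A), [v for _, v in top_eigenpairs(covariance(center(A)), k)])` yields the reduced matrix with one row per domain name and k columns. For a matrix of the size quoted in the embodiment, a numerical library would be the practical choice.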
Example 2
The present embodiment proposes a domain name classification method, as shown in fig. 5, including:
S201, selecting DGA domain names and normal domain names that have already been classified, together with their true results, and mixing them to form a set of domain name data used as sample domain name data;
S202, applying the domain name feature screening method described above to the sample domain name data to select a plurality of eigenvectors and their eigenvalues as the sample domain name data feature dimension-reduction matrix;
S203, setting the optimizer and the loss function of the one-dimensional convolutional neural network model with a model compiling method (for example, the model.compile() method of the Keras deep learning library);
S204, setting the learning rate of the optimizer, and setting the number and size of the filters;
S205, inputting the sample domain name data feature dimension-reduction matrix into the one-dimensional convolutional neural network model to obtain the predicted classification result of the sample domain name data;
S206, measuring the deviation value between the predicted classification result and the true result with the loss function;
if the deviation value does not meet the set requirement, executing steps S204-S205 again and training iteratively until the deviation value meets the set requirement;
the one-dimensional convolutional neural network model comprises a plurality of convolutional layers and a plurality of max pooling layers, followed in sequence by an intermediate layer and a fully-connected layer that uses an activation function.
The number and order of the convolutional layers and max pooling layers are chosen according to the required domain name classification and detection accuracy and the required deviation value of the loss function.
In one embodiment, the one-dimensional convolutional neural network model comprises, arranged in sequence, convolutional layer one, convolutional layer two, MaxPooling1D one (a max pooling layer), convolutional layer three, convolutional layer four, MaxPooling1D two, convolutional layer five, convolutional layer six, MaxPooling1D three, an intermediate layer, and a fully-connected layer using an activation function.
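To make the layer order concrete, the sketch below traces how the sequence length would change through this stack for a 33-feature input with 64 filters of size 3 (values mentioned elsewhere in the description). The padding mode, the pool size of 2 and all function names are assumptions; the patent does not state them.

```python
def conv1d_len(n, kernel, padding="same"):
    # Stride 1: "same" keeps the length, "valid" shrinks it by kernel - 1.
    return n if padding == "same" else n - kernel + 1


def pool1d_len(n, pool=2):
    # Non-overlapping max pooling with stride equal to the pool size.
    return n // pool


def trace_shapes(n_features=33, filters=64, kernel=3, pool=2, padding="same"):
    """Return (layer name, length, channels) for each layer in order."""
    shapes, n = [("input", n_features, 1)], n_features
    for block in range(1, 4):
        for c in (2 * block - 1, 2 * block):
            n = conv1d_len(n, kernel, padding)
            shapes.append((f"conv{c}", n, filters))
        n = pool1d_len(n, pool)
        shapes.append((f"pool{block}", n, filters))
    shapes.append(("flatten", n * filters, 1))   # intermediate layer
    shapes.append(("dense_sigmoid", 1, 1))       # fully-connected output
    return shapes
```

With these assumed sizes, "same" padding leaves a length of 4 after the third pooling layer, while unpadded ("valid") convolutions would shrink the sequence to nothing before the flatten step, so some form of padding or smaller kernels would presumably be needed.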
The sample domain name data and the true results of the classified DGA and normal domain names form a sample set, on which the one-dimensional convolutional neural network model is iteratively trained so that its parameters are continuously optimized. The deviation value is measured with the loss function to evaluate the trained model, and the model's performance is further judged comprehensively by accuracy, precision, recall and F1 score. That is, if the deviation value, accuracy, precision, recall or F1 score of the predicted classification results against the true results does not meet its set value, steps S204-S205 are executed again to train iteratively and optimize the model parameters until all of them meet the set values, thereby determining the parameters of the one-dimensional convolutional neural network model.
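The evaluation metrics can be computed from predicted and true labels as in the following sketch. It uses the standard definitions of accuracy, precision, recall and F1 for binary labels, with 1 standing for a DGA domain name and 0 for a normal one; the function name and label encoding are assumptions, and precision is included as the usual companion of recall.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = DGA)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

A training run could compare each of these values, together with the loss, against its set value to decide whether another tuning round is needed.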
Building the one-dimensional convolutional training model:
① A convolutional neural network model with 6 Conv1D layers and 3 MaxPooling1D layers is built, with one MaxPooling1D layer added after every two convolutional layers to retain the main features and reduce computation. Each convolutional layer uses the relu function as its activation function. The relu function sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence of parameters and alleviates overfitting. Through the relu function, the sparse model can therefore better mine relevant features and fit the training data.
② The Flatten layer (flattening layer) is used as the intermediate layer linking the convolutional layers and the fully-connected layer.
③ The Dropout layer (random deactivation layer) prevents overfitting and improves the generalization ability of the model.
④ The fully-connected layer uses sigmoid (the sigmoid function) as its activation function so that the output lies in the range (0, 1).
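What one convolutional layer followed by one max pooling layer does to a feature sequence can be shown with a minimal single-channel sketch. The kernel values, the unpadded ("valid") convolution and the pool size of 2 are illustrative assumptions, and the function names are hypothetical.

```python
def relu(x):
    """Rectified linear unit: negative inputs become 0."""
    return x if x > 0 else 0.0


def conv1d_relu(signal, kernel, bias=0.0):
    """Slide the kernel left to right over the signal (no padding, stride 1),
    then apply relu to each output."""
    k = len(kernel)
    return [relu(sum(signal[i + j] * kernel[j] for j in range(k)) + bias)
            for i in range(len(signal) - k + 1)]


def max_pool1d(signal, pool=2):
    """Keep the largest value in each non-overlapping window."""
    return [max(signal[i:i + pool])
            for i in range(0, len(signal) - pool + 1, pool)]
```

Here relu zeroes the negative responses, which is the sparsity effect described in item ①, and pooling halves the sequence while keeping the strongest response in each window.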
After the one-dimensional convolutional neural network model is built, i.e., once all its layers together form the model, the model.compile() method is called to specify the optimizer and the loss function the model needs. The optimizer used in this embodiment is Adam, for which a learning rate must be set; different learning rates give different learning effects and performance.
Training a neural network is a process of continuous optimization and parameter tuning. A suitable learning rate is chosen first, and then different convolution kernels (i.e., filters) are tried, so that better parameters are found step by step. Because the convolutional layers compute by sliding the convolution kernels from left to right, different kernel sizes, kernel counts and learning rates lead to different results; the weights of each layer are also initialized and adjusted automatically, which makes results differ from run to run. Besides the loss function, the model's performance can be assessed comprehensively by accuracy, precision, recall and F1 score.
Generally, at the start of model training the learning rate is given an initial value, typically 0.01. Step S205 is executed to train the model, and the loss and performance of the model are observed through step S206. If the performance is poor, the learning rate is divided by 2 or 5 to obtain a new learning rate, and the model is retrained and observed again, i.e., step S204 and then step S205 are executed again. The learning rate must be neither too large nor too small: too small a learning rate makes the network converge very slowly, and too large a learning rate prevents the network from converging. The learning rate is generally set empirically, at about 1 × 10^-6 to 1 × 10^-5. Once the learning rate of a model is determined, the number and size of the filters in each layer can be adjusted, for example starting from 64 filters of size 3, and the tuning process continues until better model parameters are found.
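The learning-rate search described here can be sketched as a simple loop. `train_and_score` is a hypothetical callback standing in for steps S205 and S206 (train the model with the given learning rate and return the observed loss); `target_loss` and `min_lr` are assumed stopping values, not figures from the patent.

```python
def tune_learning_rate(train_and_score, initial_lr=0.01, divisor=2.0,
                       target_loss=0.05, min_lr=1e-6):
    """Divide the learning rate by `divisor` until the loss meets the target.

    Returns the chosen learning rate and the (lr, loss) history.
    """
    lr, history = initial_lr, []
    while lr >= min_lr:
        loss = train_and_score(lr)      # stands in for S205 + S206
        history.append((lr, loss))
        if loss <= target_loss:
            return lr, history
        lr /= divisor                   # back to S204 with a smaller rate
    # No rate met the target: fall back to the best one observed.
    best_lr, _ = min(history, key=lambda h: h[1])
    return best_lr, history
```

A divisor of 2 or 5 matches the step sizes mentioned above; the same pattern could then be repeated over filter counts and sizes once the learning rate is fixed.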
Convolutional layers one, two, three, four, five and six all use the relu function as the activation function, and the fully connected layer uses sigmoid as its activation function.
Classifying a domain name as a DGA domain name or a normal domain name is a binary classification problem, so the Sigmoid function is used as the threshold function of the one-dimensional convolutional neural network model to map the variable into the interval (0, 1). Since the Sigmoid function is symmetric about the point (0, 0.5), the threshold is set to 0.5; that is, when the probability value output after the domain name is classified by this scheme is greater than 0.5, the domain name is considered a DGA domain name, and when the probability value is less than 0.5, it is considered a normal domain name.
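The thresholding described above can be illustrated as follows. This is a minimal sketch: the function names are hypothetical, and treating an output of exactly 0.5 as a normal domain name is an assumption, since the text only specifies the cases above and below the threshold.

```python
import math


def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # this form avoids overflow for large negative x
    return z / (1.0 + z)


def classify(score, threshold=0.5):
    """Label a raw model score as 'DGA' or 'normal' from its sigmoid output."""
    probability = sigmoid(score)
    # Assumption: a probability exactly equal to the threshold counts as normal.
    label = "DGA" if probability > threshold else "normal"
    return label, probability


examples = [classify(s) for s in (-2.0, 0.0, 3.0)]
```

The symmetry about the point (0, 0.5) means that sigmoid(x) + sigmoid(-x) = 1, which is why 0.5 is the natural decision boundary.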
The loss function is binary cross entropy. Cross entropy describes the distance between two probability distributions, and the one-dimensional convolutional neural network model constructed in this embodiment uses binary cross entropy (binary_cross_entropy) as the loss function; that is, binary cross entropy is passed to the model compile method to judge how close the output vector is to the expected vector. The smaller the cross-entropy value, the closer the two probability distributions are. The loss function measures the deviation between the predicted values produced by the whole model and the true values, and the parameter tuning process aims to make its value as small as possible.
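For illustration, binary cross entropy can be computed directly from labels and predicted probabilities. The sketch below assumes labels of 0 or 1 and averages the loss over the samples; the clipping constant `eps` and the sample values are assumptions, not taken from the embodiment.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
# Predictions close to the labels give a small loss ...
good = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
# ... and predictions on the wrong side of 0.5 give a larger one.
bad = binary_cross_entropy(labels, [0.4, 0.6, 0.3, 0.7])
```

Comparing `good` with `bad` shows the property relied on during tuning: the closer the predicted probabilities are to the true labels, the smaller the loss.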
Firstly, a one-dimensional convolutional neural network model is constructed. The feature values processed by the PCA algorithm are convolved using 6 layers of Conv1D to extract the implicit relations among the domain name features, so that DGA domain names can be detected more accurately. A MaxPooling1D layer is added after every two Conv1D layers to retain the main features and reduce the amount of calculation. Each convolutional layer uses the relu function as its activation function, which simplifies the calculation and reduces the overall computational cost of the neural network. The fully connected layer uses sigmoid as its activation function so that the output lies in the interval (0, 1); when used for a classification problem, this function yields not only the predicted class but also an approximate probability. The Flatten() layer is used as an intermediate layer to link the convolutional layers and the fully connected layer.
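The repeating unit of this architecture, two convolutional layers with relu followed by one max pooling layer, may be sketched in plain Python as below. This is a simplified illustration and not the trained model: it uses a single channel and a single fixed kernel per layer, no padding, and a pooling window of 2, all of which are assumptions; the kernel size of 3 follows the example given in the embodiment.

```python
def conv1d(signal, kernel):
    """Slide the kernel from left to right over the signal (no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, pool_size=2):
    """Keep the largest value in each non-overlapping window."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]


def conv_block(signal, kernel_a, kernel_b):
    """Two convolution + relu layers followed by one max pooling layer."""
    hidden = relu(conv1d(signal, kernel_a))
    hidden = relu(conv1d(hidden, kernel_b))
    return max_pool1d(hidden)


# Hypothetical reduced feature vector for one domain name.
features = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]
output = conv_block(features, [-1.0, 0.0, 1.0], [0.5, 0.5, 0.5])
```

In the described model three such blocks would be stacked, after which the result is flattened and passed to the fully connected layer with a sigmoid output.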
Next, the measured data used in this embodiment are correctly resolvable DNS interaction packets collected at the network boundary of the Jiangsu provincial network of the China Education and Research Network from 11/3/4/11/2021; measurement analysis is performed on them, and a set of about 5 million distinct domain names is collected. The training data set (sample set) comes from the top 10 thousand normal domain names provided by the Alexa website and the DGA domain names updated daily by 360netlab, and the constructed one-dimensional convolutional neural network model is trained iteratively on it to optimize and determine the parameters.
Example 3
The embodiment provides a domain name detection method, as shown in fig. 2 to 4, including:
selecting a plurality of eigenvectors needing to be reserved as a domain name data feature dimension reduction matrix according to the actually measured data by the domain name feature screening method in the technical scheme of the embodiment 1;
inputting the domain name data feature dimension reduction matrix into the domain name classification method according to any technical scheme in embodiment 2, iteratively training to determine parameters, and obtaining a one-dimensional convolution neural network model to distinguish a DGA domain name from a normal domain name.
Domain name features are extracted manually and then screened more precisely by the PCA algorithm, which extracts features from the features and thereby refines and supplements the manually extracted ones. After processing by the trained model, DGA domain names are distinguished from normal domain names. Using the one-dimensional convolutional neural network model provides high calculation speed, fast response and low cost, so the detection result can be obtained accurately, quickly and cheaply.
Firstly, the literal features of the domain name are extracted and calculated, and feature screening is performed using the PCA algorithm, as shown in FIG. 2, to obtain a small but refined set of classification features with a low probability of classification error. These features form the domain name data feature dimension reduction matrix, which is used as the input of the one-dimensional convolutional neural network model to distinguish normal domain names from DGA domain names.
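As a purely illustrative example of extracting literal features, the sketch below computes a few character-level statistics for a domain name. The particular features chosen here (length, character entropy, digit ratio and vowel ratio), the use of only the leftmost label, and the sample domain names are assumptions for the sake of the example; they are not asserted to be the features used in the embodiment.

```python
import math
from collections import Counter


def literal_features(domain):
    """Compute a few illustrative character-level features of a domain name."""
    label = domain.lower().split(".")[0]  # leftmost label only (an assumption)
    length = len(label)
    if length == 0:
        return [0, 0.0, 0.0, 0.0]
    counts = Counter(label)
    # Shannon entropy of the character distribution, in bits.
    entropy = -sum((c / length) * math.log2(c / length)
                   for c in counts.values())
    digit_ratio = sum(ch.isdigit() for ch in label) / length
    vowel_ratio = sum(ch in "aeiou" for ch in label) / length
    return [length, entropy, digit_ratio, vowel_ratio]


# One row per domain name, one column per feature.
feature_matrix = [literal_features(d) for d in ("google.com", "x7k2q9zv1b.net")]
```

Rows produced in this way would make up the original domain name data feature matrix that is subsequently screened by the PCA algorithm.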
Example 4
The present embodiment proposes a system comprising one of the following systems:
a domain name screening system, comprising: a preprocessing unit for extracting domain names from the measured data and, from each domain name, extracting a plurality of features for distinguishing domain names generated by a domain name generation algorithm from normal domain names, so as to form an original domain name data feature matrix; and a domain name feature screening unit for subtracting the average value of each feature from the feature values of each row of features in the original domain name data feature matrix, calculating the covariance matrix of the original domain name data feature matrix together with its eigenvectors and eigenvalues, sorting the eigenvectors of the covariance matrix by their corresponding eigenvalues, and selecting a plurality of eigenvectors to be retained as the domain name data feature dimension reduction matrix.
Or, the system includes a domain name classification system for executing the domain name classification method described in any technical solution of embodiment 2 and iteratively training to determine the parameters, so as to obtain a one-dimensional convolutional neural network model for domain name classification.
Or, the system includes a domain name detection system, as shown in fig. 6, comprising: a preprocessing unit for extracting domain names from the measured data and, from each domain name, extracting a plurality of features for distinguishing domain names generated by a domain name generation algorithm from normal domain names, so as to form an original domain name data feature matrix; a domain name feature screening unit for subtracting the average value of each feature from the feature values of each row of features in the original domain name data feature matrix, calculating the covariance matrix of the original domain name data feature matrix together with its eigenvectors and eigenvalues, sorting the eigenvectors of the covariance matrix by their corresponding eigenvalues, and selecting a plurality of eigenvectors to be retained as the domain name data feature dimension reduction matrix; and a domain name detection unit for inputting the domain name data feature dimension reduction matrix into the domain name classification method in the technical scheme of embodiment 2, determining the parameters through iterative training, and distinguishing DGA domain names from normal domain names with the obtained one-dimensional convolutional neural network model.
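The operations performed by the domain name feature screening unit (mean subtraction, covariance calculation, eigen-decomposition, sorting and selection) can be sketched as follows. This is a dependency-free illustration under stated assumptions: samples are stored as rows and features as columns, and the eigenvectors are found by power iteration with deflation, a technique substituted here for simplicity; a practical implementation would normally call the eigen-solver of a numerical library. All function names are hypothetical.

```python
import math
import random


def mean_center(rows):
    """Subtract each feature's mean; rows are samples, columns are features."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of mean-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def dominant_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration: largest eigenvalue and a unit eigenvector for it."""
    d = len(matrix)
    rng = random.Random(seed)
    vec = [rng.uniform(-1.0, 1.0) for _ in range(d)]  # generic start vector
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    value = sum(vec[a] * matrix[a][b] * vec[b]
                for a in range(d) for b in range(d))
    return value, vec


def pca_reduce(rows, k):
    """Project rows onto the k eigenvectors with the largest eigenvalues."""
    centered = mean_center(rows)
    cov = covariance(centered)
    d = len(cov)
    components, eigenvalues = [], []
    for _ in range(k):
        value, vec = dominant_eigenpair(cov)
        eigenvalues.append(value)
        components.append(vec)
        # Deflation: remove the found component so the next largest emerges.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(d)]
               for a in range(d)]
    reduced = [[sum(r[j] * comp[j] for j in range(d)) for comp in components]
               for r in centered]
    return reduced, eigenvalues


# Hypothetical data: the first feature varies more than the second.
samples = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
reduced, eigenvalues = pca_reduce(samples, k=1)
```

Because the eigenvalues come out in descending order, keeping the first k components retains the directions of greatest variance; the sign of each projected coordinate is arbitrary.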
Example 5
This embodiment provides an apparatus, the apparatus comprising: one or more processors; memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to perform a method as described above.
Furthermore, the present embodiment provides a storage medium storing a computer program that, when executed by a processor, implements the method as described in embodiment 1 above.
Fig. 1 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
As shown in fig. 1, as another aspect, the present application also provides an apparatus including one or more Central Processing Units (CPUs) 501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the apparatus are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is installed into the storage section 508 as necessary.
In particular, according to embodiments disclosed herein, the method described in any of the above embodiments may be implemented as a computer software program. For example, embodiments disclosed herein include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method described in any of the embodiments above. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511.
As yet another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus of the above-described embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described herein.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented in software or in hardware. The described units or modules may also be provided in a processor; for example, each unit may be a software program provided in a computer or a mobile intelligent device, or may be a separately configured hardware device. The names of these units or modules do not, in some cases, constitute a limitation on the units or modules themselves.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the present application. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A domain name feature screening method is characterized by comprising the following steps:
extracting a domain name from the measured data, and extracting a plurality of characteristics for distinguishing the domain name generated by a domain name generation algorithm from a normal domain name according to the domain name to form an original domain name data characteristic matrix;
subtracting the average value of each characteristic from the characteristic value of each row of characteristics in the original domain name data characteristic matrix;
calculating the covariance matrix of the original domain name data feature matrix together with its eigenvectors and eigenvalues;
and sorting the eigenvectors of the covariance matrix according to the eigenvalue sizes corresponding to the eigenvectors of the covariance matrix, and selecting a plurality of eigenvectors needing to be reserved.
2. The method of claim 1, wherein the manner of selecting the plurality of eigenvectors to be retained includes automatic selection and selection by specifying the reduced dimensionality.
3. A domain name classification method is characterized by comprising the following steps:
S201, selecting DGA domain names and normal domain names whose real classification results are known, and mixing them to form a group of domain name data as sample domain name data;
S202, for the sample domain name data, selecting a plurality of eigenvectors and their eigenvalues as a sample domain name data feature dimension reduction matrix by using the domain name feature screening method of claim 1;
S203, setting an optimizer and a loss function of the one-dimensional convolutional neural network model by using a model compiling method;
S204, setting the learning rate of the optimizer, and setting the number and size of filters;
S205, inputting the sample domain name data feature dimension reduction matrix into the one-dimensional convolutional neural network model to obtain a prediction classification result of the sample domain name data;
S206, measuring the deviation value between the prediction classification result and the real result by using the loss function;
if the deviation value does not meet the set deviation requirement, executing steps S204-S205 and performing iterative training until the deviation value meets the set deviation requirement;
the one-dimensional convolutional neural network model comprises a plurality of convolutional layers and a plurality of maximum pooling layers, followed in sequence by an intermediate layer and a fully connected layer, wherein activation functions are used.
4. The method of claim 3, wherein: a flattening (Flatten) layer is used as the intermediate layer to link the convolutional layers and the fully connected layer, a Dense layer is used as the fully connected layer, and a dropout (random inactivation) layer is arranged between the intermediate layer and the fully connected layer.
5. The method of claim 3, wherein: the convolutional layers adopt a relu function as the activation function, and the fully connected layer adopts a sigmoid function as the activation function.
6. The method of claim 3, wherein: the loss function is a binary cross entropy.
7. A domain name detection method is characterized by comprising the following steps:
selecting a plurality of eigenvectors needing to be reserved as a domain name data feature dimension reduction matrix according to the actually measured data by the domain name feature screening method of claim 1 or 2;
inputting the domain name data feature dimension reduction matrix into the domain name classification method according to any one of claims 3 to 6, iteratively training to determine parameters, and distinguishing DGA domain names from normal domain names with the obtained one-dimensional convolutional neural network model.
8. A system, characterized by comprising a domain name screening system,
the domain name screening system comprising:
the preprocessing unit is used for extracting a domain name from the measured data, and extracting a plurality of characteristics for distinguishing the domain name generated by a domain name generation algorithm from a normal domain name according to the domain name to form an original domain name data characteristic matrix;
the domain name feature screening unit is used for subtracting the average value of each feature from the feature values of each row of features in the original domain name data feature matrix; calculating the covariance matrix of the original domain name data feature matrix together with its eigenvectors and eigenvalues; sorting the eigenvectors of the covariance matrix by their corresponding eigenvalues, and selecting a plurality of eigenvectors to be retained as a domain name data feature dimension reduction matrix;
or, comprising a domain name classification system for executing a domain name classification method according to any one of claims 3 to 6, iteratively training the determined parameters to obtain a one-dimensional convolutional neural network model for domain name classification;
or, comprising a domain name detection system, which comprises:
the preprocessing unit is used for extracting a domain name from the measured data, and extracting a plurality of characteristics for distinguishing the domain name generated by a domain name generation algorithm from a normal domain name according to the domain name to form an original domain name data characteristic matrix;
the domain name feature screening unit is used for subtracting the average value of each feature from the feature values of each row of features in the original domain name data feature matrix; calculating the covariance matrix of the original domain name data feature matrix together with its eigenvectors and eigenvalues; sorting the eigenvectors of the covariance matrix by their corresponding eigenvalues, and selecting a plurality of eigenvectors to be retained as a domain name data feature dimension reduction matrix;
a domain name detection unit, which is used for inputting the domain name data feature dimension reduction matrix into the domain name classification method according to any one of claims 3 to 6, iteratively training to determine parameters, and distinguishing DGA domain names from normal domain names with the obtained one-dimensional convolutional neural network model.
9. An apparatus, characterized in that the apparatus comprises:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method recited in any of claims 1-7.
10. A storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-7.
CN202210713174.7A 2022-06-22 2022-06-22 Domain name detection, classification and feature screening method, system, device and storage medium Pending CN114912443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210713174.7A CN114912443A (en) 2022-06-22 2022-06-22 Domain name detection, classification and feature screening method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210713174.7A CN114912443A (en) 2022-06-22 2022-06-22 Domain name detection, classification and feature screening method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN114912443A true CN114912443A (en) 2022-08-16

Family

ID=82772728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210713174.7A Pending CN114912443A (en) 2022-06-22 2022-06-22 Domain name detection, classification and feature screening method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN114912443A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108600200A (en) * 2018-04-08 2018-09-28 腾讯科技(深圳)有限公司 Domain name detection method, device, computer equipment and storage medium
CN111628970A (en) * 2020-04-24 2020-09-04 中国科学院计算技术研究所 DGA type botnet detection method, medium and electronic equipment
CN112769974A (en) * 2020-12-30 2021-05-07 亚信科技(成都)有限公司 Domain name detection method, system and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
曹健波: "基于深度学习的恶意域名检测算法研究", 《万方数据》, 7 November 2023 (2023-11-07), pages 1 - 48 *
臧小东 等: "基于AGD的恶意域名检测", 《通信学报》, vol. 39, no. 07, 25 July 2018 (2018-07-25), pages 1 - 11 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination