CN111556050B - Domain name processing method, device, storage medium and processor - Google Patents

Publication number
CN111556050B
Authority
CN
China
Prior art keywords
domain name
features
training set
detected
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010339989.4A
Other languages
Chinese (zh)
Other versions
CN111556050A (en
Inventor
袁巍
张晔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hillstone Networks Co Ltd
Original Assignee
Hillstone Networks Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hillstone Networks Co Ltd filed Critical Hillstone Networks Co Ltd
Priority to CN202010339989.4A priority Critical patent/CN111556050B/en
Publication of CN111556050A publication Critical patent/CN111556050A/en
Application granted granted Critical
Publication of CN111556050B publication Critical patent/CN111556050B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a domain name processing method, a domain name processing device, a storage medium and a processor. The method comprises: determining the inherent features of a domain name to be detected; determining the word frequency features of the domain name to be detected; determining the vector features of the domain name to be detected; processing, according to at least one target model, the inherent features, word frequency features and vector features of the domain name to be detected together with the domain name features in a training set, to obtain at least one processing result; and determining whether the domain name to be detected is a malicious domain name according to the at least one processing result. The method and the device address the problem in the related art that the judgment of whether a domain name to be detected is a malicious domain name is inaccurate.

Description

Domain name processing method, device, storage medium and processor
Technical Field
The present application relates to the field of domain name recognition technologies, and in particular, to a domain name processing method, an apparatus, a storage medium, and a processor.
Background
In the field of network security, malicious software, or an attacker who has compromised a host, often uses a malicious domain name generation algorithm for network communication, so that the malware or the attacker can communicate with a control server and thereby centrally control a plurality of hosts. Malicious domain name detection is therefore a key technology for cutting off network intrusion and control channels.
To make a detection algorithm more accurate, the basic data that the algorithm needs must first be prepared; specifically, this data can be obtained from public sources or extracted from network data packets, and the training results are tested repeatedly to obtain an optimal model. However, such a malicious domain name detection model is only the best model for describing the knowledge available when it was trained. As new data is generated after training, a model that was optimal at one moment gradually degrades, and the accuracy of domain name judgment continually decreases.
No effective solution has yet been proposed for the problem in the related art that the judgment of whether a domain name to be detected is a malicious domain name is inaccurate.
Disclosure of Invention
The application provides a domain name processing method, a domain name processing device, a storage medium and a processor, which are used for solving the problem that the judgment result for judging whether a domain name to be detected is a malicious domain name is inaccurate in the related art.
According to one aspect of the present application, a domain name processing method is provided. The method comprises the following steps: determining the inherent characteristics of the domain name to be detected, wherein the inherent characteristics at least comprise one of the following characteristics: the information entropy of the domain name, the length of the domain name, the ratio of vowels in the domain name, the ratio of numeric characters in the domain name and whether a domain name suffix is a preset suffix or not; determining the word frequency characteristics of the domain name to be detected, wherein the word frequency characteristics are used for representing the similarity of the domain name and the overall positive domain name and the overall negative domain name in the training set on characters; determining the vector characteristics of the domain name to be detected, wherein the vector characteristics are used for representing the similarity between the relevance of each character of the domain name and the relevance of each character of the domain name in a training set; processing the inherent features, word frequency features, vector features and domain name features in a training set of the domain name to be detected according to at least one target model to obtain at least one processing result, wherein the target model is obtained by training the domain name features in the training set, the domain name features in the training set comprise the inherent features, the word frequency features, the vector features and the external features of a preset domain name of each domain name in the training set, and the external features are the judgment results of judging a domain name as a positive domain name or a negative domain name by an external system; and determining whether the domain name to be detected is a malicious domain name according to at least one processing result.
Optionally, before processing the inherent features, the word frequency features and the vector features of the domain name to be detected and the domain name features in the training set according to at least one target model to obtain at least one processing result, the method further includes: acquiring external features of the domain name to be detected. In that case, the processing step comprises: processing the inherent features, the word frequency features, the vector features and the external features of the domain name to be detected and the domain name features in the training set according to the at least one target model to obtain at least one processing result.
Optionally, determining whether the domain name to be detected is a malicious domain name according to at least one processing result includes: determining, among the at least one processing result, the number of processing results indicating that the domain name to be detected is a positive domain name, to obtain a first number; determining, among the at least one processing result, the number of processing results indicating that the domain name to be detected is a negative domain name, to obtain a second number; and comparing the first number with the second number, and determining whether the domain name to be detected is a malicious domain name according to the comparison result.
Optionally, before processing the inherent features, the word frequency features and the vector features of the domain name to be detected and the domain name features in the training set according to at least one target model to obtain at least one processing result, the method further includes: collecting a plurality of positive domain names and a plurality of negative domain names, and determining a training set from the plurality of positive domain names and the plurality of negative domain names; determining the inherent features, word frequency features and vector features of each domain name in the training set, and acquiring the external features of preset domain names in the training set; determining the inherent features, word frequency features and vector features of each domain name in the training set, together with the external features of the preset domain names, as the domain name features in the training set; and training on the domain name features in the training set with at least one machine learning model to obtain at least one target model.
Optionally, after determining whether the domain name to be detected is a malicious domain name according to the at least one processing result, the method further includes: adding the domain name to be detected to the training set to obtain an updated training set; taking the result of determining whether the domain name to be detected is a malicious domain name as the external feature of the domain name to be detected; adding the inherent features, word frequency features, vector features and external features of the domain name to be detected to the domain name features in the training set to obtain updated domain name features in the training set; and training on the updated domain name features in the training set with at least one machine learning model to obtain at least one updated target model.
Optionally, collecting the plurality of positive domain names comprises: obtaining a preset number of domain names from a positive domain name source in a database to obtain the plurality of positive domain names, wherein the domain names in the positive domain name source are stored in a preset order.
Optionally, collecting the plurality of negative domain names comprises: determining an acquisition interval according to the number of domain names contained in a negative domain name source in a database and the number of negative domain names to be acquired; and acquiring a preset number of domain names from the negative domain name sources in the database according to the acquisition interval to obtain a plurality of negative domain names, wherein the domain names in the negative domain name sources are stored according to a preset sequence.
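The two collection steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the in-memory lists standing in for the database sources, and the use of floor division to derive the acquisition interval are all assumptions, since the text does not give the exact formula.

```python
def collect_positive(source, count):
    """Take the first `count` names from an ordered positive source."""
    return list(source[:count])


def collect_negative(source, count):
    """Sample `count` names from an ordered negative source at a fixed interval."""
    if count <= 0 or not source:
        return []
    # Acquisition interval from the source size and the number wanted.
    # Floor division is an assumption; the text does not state the formula.
    interval = max(len(source) // count, 1)
    return list(source[::interval][:count])


# Hypothetical stand-ins for the ordered database sources.
ranked_benign = ["google.com", "youtube.com", "baidu.com", "qq.com"]
dga_feed = [f"dga{i}.example" for i in range(10)]

print(collect_positive(ranked_benign, 2))
print(collect_negative(dga_feed, 3))
```

Sampling at an interval spreads the selected negative names across the whole ordered source instead of taking them all from its head.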
Optionally, determining the word frequency feature of each domain name in the training set includes: segmenting each domain name in the training set with a preset sliding window to obtain a plurality of segmentation blocks, and creating a dictionary from the segmentation blocks; counting the number of times each segmentation block of all the positive domain names appears in the dictionary, and summing to obtain the number of segmentation blocks of the positive domain names; counting the number of times each segmentation block of all the negative domain names appears in the dictionary, and summing to obtain the number of segmentation blocks of the negative domain names; multiplying the occurrence frequency of the target domain name in the training set by the number of segmentation blocks of the positive domain names to obtain the positive domain name word frequency feature of the target domain name; multiplying the occurrence frequency of the target domain name in the training set by the number of segmentation blocks of the negative domain names to obtain the negative domain name word frequency feature of the target domain name; and calculating the difference between the positive and negative domain name word frequency features of the target domain name to obtain the word frequency feature of the target domain name.
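One possible reading of this word frequency computation is sketched below. It is an interpretation rather than the patent's definitive procedure: the window width of 3, the function names, and the choice to weight each block of the target domain by its per-class count are assumptions made for illustration.

```python
from collections import Counter


def ngrams(domain, n=3):
    """Slide a window of width n over the domain and return the blocks."""
    return [domain[i:i + n] for i in range(len(domain) - n + 1)]


def block_counts(domains, n=3):
    """Count how often each block occurs across a list of domains."""
    counts = Counter()
    for d in domains:
        counts.update(ngrams(d, n))
    return counts


def word_frequency_feature(domain, pos_counts, neg_counts, n=3):
    """Positive-side score minus negative-side score for one domain."""
    own = Counter(ngrams(domain, n))
    # Counter returns 0 for blocks that never occurred in a class.
    pos_score = sum(freq * pos_counts[b] for b, freq in own.items())
    neg_score = sum(freq * neg_counts[b] for b, freq in own.items())
    return pos_score - neg_score


# Tiny hypothetical training set.
positive = ["google", "googlemail", "facebook"]
negative = ["xkqzvw", "qzvwxk"]
pos_counts = block_counts(positive)
neg_counts = block_counts(negative)

print(word_frequency_feature("googledrive", pos_counts, neg_counts))
print(word_frequency_feature("kqzvwx", pos_counts, neg_counts))
```

Under this reading, a domain whose blocks resemble the positive set scores above zero, and one whose blocks resemble the negative set scores below zero.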
Optionally, determining the vector feature of each domain name in the training set comprises: respectively taking each character of each domain name in a training set as a training sample, and respectively taking k characters adjacent to each character as labels to form an extended training set, wherein each piece of training data in the extended training set is a combination formed by one sample and one label; inputting the extended training set into a preset fully-connected network for training, and acquiring hidden layer parameters of the trained preset fully-connected network; determining a target fully-connected network based on the trained hidden layer parameters of the preset fully-connected network; and inputting each character of the target domain name in the training set into the target full-connection network, training to obtain output, and determining the output as a feature vector of the target domain name.
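The construction of the extended training set described above can be sketched as follows. Only the pairing step is shown; the fully-connected network itself is omitted. Treating "k adjacent characters" as up to k characters on each side is an assumption, as are the function names.

```python
def char_context_pairs(domain, k=2):
    """Pair each character (sample) with each neighbour within k positions (label)."""
    pairs = []
    for i, ch in enumerate(domain):
        lo = max(0, i - k)
        hi = min(len(domain), i + k + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((ch, domain[j]))
    return pairs


def extended_training_set(domains, k=2):
    """Concatenate the (sample, label) pairs of every domain in the training set."""
    data = []
    for d in domains:
        data.extend(char_context_pairs(d, k))
    return data


print(char_context_pairs("abc", k=1))
print(len(extended_training_set(["abc", "de"], k=1)))
```

Each pair is one piece of training data; a network trained to predict the label from the sample learns hidden-layer parameters that can then serve as character embeddings.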
Optionally, determining the word frequency feature of the domain name to be detected includes: multiplying the occurrence frequency of the domain name to be detected by the number of segmentation blocks of the positive domain names to obtain the positive domain name word frequency feature of the domain name to be detected; multiplying the occurrence frequency of the domain name to be detected by the number of segmentation blocks of the negative domain names to obtain the negative domain name word frequency feature of the domain name to be detected; and calculating the difference between the positive and negative domain name word frequency features of the domain name to be detected to obtain the word frequency feature of the domain name to be detected.
Optionally, determining the vector feature of the domain name to be detected includes: and inputting each character of the domain name to be detected into a target full-connection network, and determining the output of the target full-connection network as a feature vector of the domain name to be detected.
According to another aspect of the present application, there is provided a domain name processing apparatus. The device includes: the first determining unit is used for determining the inherent characteristics of the domain name to be measured, wherein the inherent characteristics at least comprise one of the following characteristics: the information entropy of the domain name, the length of the domain name, the occupation ratio of vowels in the domain name, the occupation ratio of numeric characters in the domain name and whether a domain name suffix is a preset suffix or not; the second determining unit is used for determining the word frequency characteristics of the domain name to be detected, wherein the word frequency characteristics are used for representing the similarity of the domain name and the overall positive domain name and the overall negative domain name in the training set on characters; the third determining unit is used for determining the vector characteristics of the domain name to be detected, wherein the vector characteristics are used for representing the similarity between the correlation of each character of one domain name and the correlation of each character of the domain name in the training set; the processing unit is used for processing the inherent features, the word frequency features, the vector features and the domain name features in a training set of the domain name to be detected according to at least one target model to obtain at least one processing result, wherein the target model is obtained by training the domain name features in the training set, the domain name features in the training set comprise the inherent features, the word frequency features and the vector features of each domain name in the training set and the external features of a preset domain name, and the external features are the judgment results of judging one domain name as a positive domain name or a negative domain name by an external system; and 
the fourth determining unit is used for determining whether the domain name to be detected is a malicious domain name according to at least one processing result.
In order to achieve the above object, according to another aspect of the present application, there is provided a storage medium including a stored program, wherein the program executes any one of the above-described domain name processing methods.
In order to achieve the above object, according to another aspect of the present application, there is provided a processor configured to execute a program, wherein the program executes to perform any one of the above domain name processing methods.
Through the application, the following steps are adopted: determining the inherent features of the domain name to be detected, wherein the inherent features include at least one of the following: the information entropy of the domain name, the length of the domain name, the ratio of vowels in the domain name, the ratio of numeric characters in the domain name, and whether the domain name suffix is a preset suffix; determining the word frequency features of the domain name to be detected, wherein the word frequency features represent the character-level similarity of a domain name to the positive domain names as a whole and to the negative domain names as a whole in the training set; determining the vector features of the domain name to be detected, wherein the vector features represent the similarity between the correlations among the characters of a domain name and the correlations among the characters of the domain names in the training set; processing, according to at least one target model, the inherent features, word frequency features and vector features of the domain name to be detected together with the domain name features in the training set, to obtain at least one processing result, wherein the target model is obtained by training on the domain name features in the training set, the domain name features in the training set comprise the inherent features, word frequency features and vector features of each domain name in the training set and the external features of preset domain names, and an external feature is the result of an external system judging a domain name to be a positive domain name or a negative domain name; and determining whether the domain name to be detected is a malicious domain name according to the at least one processing result. This solves the problem in the related art that the judgment of whether a domain name to be detected is a malicious domain name is inaccurate. Because the features of the domain name to be detected and the domain name features in the training set are processed according to at least one target model, and whether the domain name to be detected is malicious is determined from the processing results, the accuracy of that judgment is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application, and the description of the exemplary embodiments of the application are intended to be illustrative of the application and are not intended to limit the application. In the drawings:
fig. 1 is a flowchart of a domain name processing method provided according to an embodiment of the present application; and
fig. 2 is a schematic diagram of a domain name processing apparatus provided according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances in order to facilitate the description of the embodiments of the application herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an embodiment of the present application, there is provided a domain name processing method.
Fig. 1 is a flowchart of a domain name processing method according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
Step S101, determining the inherent features of the domain name to be detected, wherein the inherent features include at least one of the following: the information entropy of the domain name, the length of the domain name, the ratio of vowels in the domain name, the ratio of numeric characters in the domain name, and whether the domain name suffix is a preset suffix.
It should be noted that the inherent features of the domain name to be detected relate only to the domain name itself and not to any other domain name. Each inherent feature is calculated as follows:
Calculation of the information entropy of the domain name: first read the domain name to be detected, then compute $H(X) = -\sum_{i=1}^{m} p(x_i)\log_2 p(x_i)$, where $H(X)$ denotes the information entropy of the domain name, $x_i$ denotes one character in the domain name, $X$ denotes the whole domain name, and $m$ denotes the number of characters contained in the domain name.
For the calculation of the domain name length: firstly, reading the information of the domain name to be detected, and then counting the length of the domain name according to the number of characters.
Calculation of the proportion of vowels in the domain name: reading in the information of the domain name to be measured, then counting the length m of each domain name and the number n of vowel letters 'a', 'e', 'i', 'o' and 'u' in the domain name, and calculating n/m, namely the proportion of vowel letters in the domain name.
Calculation of the ratio of numeric characters in the domain name: reading information of domain names to be detected, then counting the length m of each domain name and the number n of '0', '1', '2', '3', '4', '5', '6', '7', '8' and '9' in the domain name, and calculating n/m, namely the ratio of the number characters in the domain name.
Judging whether the domain name suffix is a preset suffix: read the domain name to be detected, extract its suffix, and compare it against cn, com, cc, net, org, gov and info. A match indicates that the domain name uses one of the preset top-level domain suffixes, and a binary value (0 or 1) is returned according to whether the suffix matches.
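The five inherent features above can be sketched in one function. This is a minimal illustration under stated assumptions: the function and key names are hypothetical, the whole domain string (dots and suffix included) is used for the length and ratios, the entropy is summed over distinct characters with p taken as relative frequency (the usual Shannon form), and the suffix flag is set to 1 on a match, which the text does not state unambiguously.

```python
import math
from collections import Counter

# Preset suffixes listed in the description.
PRESET_SUFFIXES = {"cn", "com", "cc", "net", "org", "gov", "info"}


def intrinsic_features(domain):
    """Compute the inherent features of a single domain name."""
    name = domain.lower()
    m = len(name)
    if m == 0:
        raise ValueError("empty domain name")
    counts = Counter(name)
    # Shannon entropy over the character distribution of the name.
    entropy = -sum((c / m) * math.log2(c / m) for c in counts.values())
    vowels = sum(ch in "aeiou" for ch in name)
    digits = sum(ch in "0123456789" for ch in name)
    suffix = name.rsplit(".", 1)[-1]
    return {
        "entropy": entropy,
        "length": m,
        "vowel_ratio": vowels / m,
        "digit_ratio": digits / m,
        # Assumption: 1 means the suffix is in the preset list.
        "preset_suffix": 1 if suffix in PRESET_SUFFIXES else 0,
    }


print(intrinsic_features("ab12.com"))
```

Because every feature depends only on the string itself, this function needs no access to the training set, in contrast with the word frequency and vector features.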
Step S102, determining word frequency characteristics of the domain name to be detected, wherein the word frequency characteristics are used for representing the similarity of the domain name and the overall positive domain name and the overall negative domain name in the training set on characters.
It should be noted that the inherent features of the domain name to be detected relate only to the domain name itself, so no other domain name needs to be considered when calculating them. The word frequency features, by contrast, represent the character-level similarity of a domain name to the positive domain names as a whole and to the negative domain names as a whole, where these may be the domain names in the training set. The word frequency features of the domain name to be detected are therefore calculated differently from the inherent features, and the domain names in the training set must be taken into account.
Step S103, determining the vector characteristics of the domain name to be detected, wherein the vector characteristics are used for representing the similarity between the relevance of each character of the domain name and the relevance of each character of the domain name in the training set.
It should be noted that, as with the word frequency features, the vector features of the domain name to be detected are calculated differently from the inherent features, and the domain names in the training set must also be taken into account.
Step S104, processing the inherent features, word frequency features, vector features and domain features in a training set of the domain name to be detected according to at least one target model to obtain at least one processing result, wherein the target model is obtained by training the domain features in the training set, the domain features in the training set comprise the inherent features, the word frequency features, the vector features and the external features of a preset domain name of each domain name in the training set, and the external features are the judgment results of judging a domain name as a positive domain name or a negative domain name by an external system.
It should be noted that, when determining whether the domain name to be detected is a malicious domain name, the features of the domain name to be detected and the domain name features in the training set are combined as the input of the target model, which processes them to produce the processing result. The target model is obtained by training on the domain name features in the training set. Because the external features of some domain names are missing during training, the domain name features in the training set comprise the inherent features, word frequency features and vector features of all the domain names in the training set, and the external features of only some of them.
In addition, in order to improve the accuracy of domain name determination, more than one target model may be used for determination, but the input data of different target models is the same, and the domain name features in the training set used when training different target models are also the same.
Optionally, in the domain name processing method provided in this embodiment of the present application, before processing the inherent features, the word frequency features and the vector features of the domain name to be detected and the domain name features in the training set according to at least one target model to obtain at least one processing result, the method further includes: acquiring external features of the domain name to be detected. In that case, the processing step comprises: processing the inherent features, the word frequency features, the vector features and the external features of the domain name to be detected and the domain name features in the training set according to the at least one target model to obtain at least one processing result.
It should be noted that the external system is treated as a black box. It may be any method for determining whether a domain name is malicious, including manual judgment based on experience. A domain name is input into the external system, and the system returns the result of a binary classification: either a value indicating that the domain name is a positive or a negative domain name, or the probability that it is a positive or a negative domain name.
Specifically, if the input domain name to be detected has external feature information, the external feature of the domain name to be detected is also input into the target model, which improves the accuracy of the processing result.
Step S105, determining whether the domain name to be detected is a malicious domain name according to at least one processing result.
Specifically, the processing result may be "0" or "1", where "0" represents that the domain name to be detected is a positive domain name, that is, a non-malicious domain name, and "1" represents that the domain name to be detected is a negative domain name, that is, a malicious domain name.
Optionally, in the domain name processing method provided in the embodiment of the present application, determining whether the domain name to be detected is a malicious domain name according to at least one processing result includes: determining, among the at least one processing result, the number of processing results indicating that the domain name to be detected is a positive domain name, to obtain a first number; determining, among the at least one processing result, the number of processing results indicating that the domain name to be detected is a negative domain name, to obtain a second number; and comparing the first number with the second number, and determining whether the domain name to be detected is a malicious domain name according to the comparison result.
For example, the features of the domain name to be detected and the domain name features in the training set can be processed by five target models: a support vector machine (SVM) model, a logistic regression (LR) model, a naive Bayes (NB) model, a random forest (RF) model, and a long short-term memory (LSTM) recurrent neural network model. If all models output 0, the domain name to be detected is considered positive; if exactly one model outputs 1, the domain name is considered possibly malicious; and if two or more models output 1, the domain name is judged to be malicious.
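The decision rule in the five-model example can be sketched as follows. This is a minimal illustration only: the function name `combine_votes` and the returned labels are assumptions made for this sketch, and the thresholds simply mirror the example above rather than a general comparison of the first and second numbers.

```python
from typing import Sequence


def combine_votes(results: Sequence[int]) -> str:
    """Combine per-model outputs, where 0 = positive and 1 = negative (malicious)."""
    if not results:
        raise ValueError("at least one processing result is required")
    if any(r not in (0, 1) for r in results):
        raise ValueError("each processing result must be 0 or 1")
    malicious_votes = sum(results)  # the number of models that output 1
    if malicious_votes == 0:
        return "positive"
    if malicious_votes == 1:
        return "possibly malicious"
    return "malicious"
```

For instance, `combine_votes([0, 1, 0, 0, 0])` falls into the middle case because exactly one model outputs 1.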
The domain name processing method provided by this embodiment of the present application determines the inherent features of the domain name to be detected, where the inherent features include at least one of the following: the information entropy of the domain name, the length of the domain name, the ratio of vowels in the domain name, the ratio of numeric characters in the domain name, and whether the domain name suffix is a preset suffix; determines the word frequency features of the domain name to be detected, where the word frequency features represent the character-level similarity of a domain name to the positive domain names as a whole and to the negative domain names as a whole in the training set; determines the vector features of the domain name to be detected, where the vector features represent the similarity between the relevance among the characters of a domain name and the relevance among the characters of the domain names in the training set; processes the inherent features, the word frequency features, and the vector features of the domain name to be detected, together with the domain name features in the training set, according to at least one target model to obtain at least one processing result, where the target model is obtained by training on the domain name features in the training set, the domain name features in the training set include the inherent features, word frequency features, and vector features of each domain name in the training set and the external features of preset domain names, and an external feature is the result of an external system judging a domain name to be a positive domain name or a negative domain name; and determines whether the domain name to be detected is a malicious domain name according to the at least one processing result. This solves the problem in the related art that the result of judging whether a domain name to be detected is a malicious domain name is inaccurate. By processing the features of the domain name to be detected and the domain name features in the training set according to at least one target model, and determining from the processing result whether the domain name to be detected is a malicious domain name, the accuracy of that judgment is improved.
Optionally, in the domain name processing method provided in the embodiment of the present application, before processing the inherent features, the word frequency features, and the vector features of the domain name to be detected, together with the domain name features in the training set, according to at least one target model to obtain at least one processing result, the method further includes: collecting a plurality of positive domain names and a plurality of negative domain names, and determining a training set according to the positive domain names and the negative domain names; determining the inherent features, word frequency features, and vector features of each domain name in the training set, and acquiring the external features of preset domain names in the training set; determining the inherent features, word frequency features, and vector features of each domain name in the training set and the external features of the preset domain names as the domain name features in the training set; and training at least one machine learning model on the domain name features in the training set to obtain at least one target model.
Specifically, positive domain names and negative domain names are collected from a database to form a training set. For each domain name, the inherent features are calculated: the information entropy, the domain name length, the ratio of vowels in the domain name, the ratio of numeric characters in the domain name, and whether the domain name suffix is a preset suffix. The word frequency features and vector features of each domain name are then calculated over all the domain names, and the calculated features are stored in a domain name feature file.
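The inherent features listed above can be computed with a few lines of standard-library code. The sketch below is illustrative only: the suffix list `PRESET_SUFFIXES` is hypothetical (the source does not enumerate the preset suffixes), and computing the ratios over the whole domain string, including dots and the suffix, is an assumption.

```python
import math
from collections import Counter

VOWELS = set("aeiou")
# Hypothetical preset suffix list; the source does not say which suffixes are preset.
PRESET_SUFFIXES = (".com", ".net", ".org")


def inherent_features(domain: str) -> dict:
    """Return the five inherent features of one domain name."""
    name = domain.lower()
    length = len(name)
    if length == 0:
        raise ValueError("domain must be non-empty")
    counts = Counter(name)
    # Shannon entropy (in bits) of the character distribution of the domain.
    entropy = -sum((c / length) * math.log2(c / length) for c in counts.values())
    return {
        "entropy": entropy,
        "length": length,
        "vowel_ratio": sum(ch in VOWELS for ch in name) / length,
        "digit_ratio": sum(ch.isdigit() for ch in name) / length,
        "preset_suffix": int(name.endswith(PRESET_SUFFIXES)),
    }
```

Algorithmically generated domain names often show higher entropy and digit ratios than hand-chosen names, which is a common reason such features are used.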
When model training is performed, the values of all features are read from the domain name feature file, and a plurality of machine learning models are selected, each of which is trained on all the features. For example, machine learning models such as a logistic regression model, a support vector machine model, a random forest model, and a naive Bayes model can be selected. The target models obtained after training are stored.
In this embodiment, features of a plurality of dimensions are used to capture the knowledge contained in the data, and each machine learning model is trained on the features of all these dimensions, so the resulting target models can improve the prediction accuracy.
In order to obtain effective training set domain names, optionally, in the domain name processing method provided in the embodiment of the present application, collecting a plurality of positive domain names includes: collecting a preset number of domain names from the positive domain name source in the database to obtain a plurality of positive domain names, where the domain names in the positive domain name source are stored in a preset order.
It should be noted that, in order to obtain more representative training set data, sampling needs to be performed according to the characteristics of the data source. For a positive data source, the data is generally uniform, and the earlier an entry appears, the more information it carries, so the positive data can be obtained by truncated sampling.
For example, assuming that the positive data source has m domain names (for example, m is 1000000) and that k domain names (for example, k is 10000) are to be sampled from it, the positive domain name file is opened, the first k domain names in the positive domain name source are read in, and each domain name is marked as 0, indicating that it is a positive domain name.
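Truncated sampling of the positive source amounts to reading the first k entries and labelling each one 0. A minimal sketch, in which the function name `truncated_sample` is an assumption for illustration:

```python
from itertools import islice
from typing import Iterable, List, Tuple


def truncated_sample(domains: Iterable[str], k: int) -> List[Tuple[str, int]]:
    """Take the first k domain names from an ordered source and label each 0 (positive)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # islice stops after k items, so a large source file need not be read in full.
    return [(domain, 0) for domain in islice(domains, k)]
```

Because `islice` works on any iterable, the same function can be applied to an open file object that yields one domain name per line (after stripping newlines).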
In order to obtain an effective training set domain name, optionally, in the domain name processing method provided in the embodiment of the present application, the acquiring a plurality of negative domain names includes: determining an acquisition interval according to the number of domain names contained in a negative domain name source in a database and the number of negative domain names to be acquired; and acquiring a preset number of domain names from the negative domain name sources in the database according to the acquisition interval to obtain a plurality of negative domain names, wherein the domain names in the negative domain name sources are stored according to a preset sequence.
It should be noted that, in order to obtain more representative training set data, sampling needs to be performed according to the characteristics of the data source. For a negative data source, the data is often uneven: some consecutive domain names in the negative domain name source share certain common characteristics, while other consecutive domain names share different ones, so the information in the data is locally clustered. The negative data can therefore be sampled uniformly.
For example, assuming that the negative data source has n domain names (for example, n is 2000000) and that k domain names (for example, k is 10000) are to be sampled from each of the positive data source and the negative data source, the negative domain name file is opened, all domain names are read in, and the acquisition interval t = n/k is calculated. Starting from the 1st domain name, one domain name is selected every t domain names and is marked as 1, indicating that it is a negative domain name.
In addition, for sampling negative domain names, a series of tests compared various sampling methods, including sampling according to a normal distribution, sampling with replacement, and interval sampling; uniform sampling gave the best results.
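Interval sampling of the negative source can be sketched as below. The function name `interval_sample` is an assumption, as is the use of integer division for t when n is not a multiple of k; the source only states t = n/k.

```python
from typing import List, Sequence, Tuple


def interval_sample(domains: Sequence[str], k: int) -> List[Tuple[str, int]]:
    """Pick one domain name every t = n // k entries, starting from the first, labelled 1."""
    n = len(domains)
    if k <= 0 or n == 0:
        return []
    t = max(n // k, 1)  # assumption: round the interval down, and never below 1
    # Stepping by t can yield slightly more than k items, so truncate to k.
    return [(domain, 1) for domain in domains[::t][:k]]
```

With n = 2000000 and k = 10000 as in the example, t is 200, so the 1st, 201st, 401st, and so on domain names are selected.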
Optionally, in the domain name processing method provided in the embodiment of the present application, determining the word frequency feature of each domain name in the training set includes: segmenting each domain name in the training set according to a preset sliding window to obtain a plurality of segmentation blocks, and creating a dictionary from the segmentation blocks; counting the number of times each segmentation block of all the positive domain names appears in the dictionary, and summing to obtain the number of segmentation blocks of the positive domain names; counting the number of times each segmentation block of all the negative domain names appears in the dictionary, and summing to obtain the number of segmentation blocks of the negative domain names; multiplying the number of occurrences of the target domain name in the training set by the number of segmentation blocks of the positive domain names to obtain the positive domain name word frequency feature of the target domain name; multiplying the number of occurrences of the target domain name in the training set by the number of segmentation blocks of the negative domain names to obtain the negative domain name word frequency feature of the target domain name; and calculating the difference between the positive domain name word frequency feature and the negative domain name word frequency feature of the target domain name to obtain the word frequency feature of the target domain name.
Specifically, the domain name is first segmented. The segmentation unit is set to a character, and a length range is set for the sliding window used for segmentation. For example, suppose a domain name N has length m and the segmentation length ranges from k to n (for example, k = 3 and n = 5). When the segmentation length is 3, N is segmented into $N_0N_1N_2, \ldots, N_{m-3}N_{m-2}N_{m-1}$; when the segmentation length is 4, into $N_0N_1N_2N_3, \ldots, N_{m-4}N_{m-3}N_{m-2}N_{m-1}$; and when the segmentation length is 5, into $N_0N_1N_2N_3N_4, \ldots, N_{m-5}N_{m-4}N_{m-3}N_{m-2}N_{m-1}$. All domain names in the training set are segmented in this way to form a dictionary.
Then, the times of occurrence of each segmentation of all the positive domain names and each segmentation of all the negative domain names in the dictionary are respectively counted, the times of occurrence of each segmentation of all the positive domain names in the dictionary are summed to obtain the number of segmentation blocks of the positive domain names, and the times of occurrence of each segmentation of all the negative domain names in the dictionary are summed to obtain the number of segmentation blocks of the negative domain names. And finally, calculating the difference between the word frequency characteristics of the positive domain name and the word frequency characteristics of the negative domain name to obtain the word frequency characteristics of the domain name.
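The segmentation and counting steps can be sketched as follows. The text leaves the exact multiplication open to interpretation, so this sketch adopts one possible reading, stated here as an assumption: for each segmentation block of the target domain name, its count in the target is multiplied by its count among the positive (or negative) domain names, the products are summed, and the two sums are subtracted. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def segment(domain: str, k: int = 3, n: int = 5) -> List[str]:
    """All character blocks of length k..n produced by a sliding window."""
    return [domain[i:i + size]
            for size in range(k, n + 1)
            for i in range(len(domain) - size + 1)]


def block_counts(domains: Iterable[str], k: int = 3, n: int = 5) -> Counter:
    """Count how often each segmentation block occurs across a set of domain names."""
    counts: Counter = Counter()
    for domain in domains:
        counts.update(segment(domain, k, n))
    return counts


def word_frequency_feature(domain: str, pos_counts: Counter, neg_counts: Counter,
                           k: int = 3, n: int = 5) -> int:
    """Positive-side score minus negative-side score (one reading of the text)."""
    blocks = Counter(segment(domain, k, n))
    # A Counter returns 0 for blocks it has never seen, so unknown blocks add nothing.
    positive = sum(c * pos_counts[b] for b, c in blocks.items())
    negative = sum(c * neg_counts[b] for b, c in blocks.items())
    return positive - negative
```

Under this reading, a domain name whose blocks mostly occur in positive training names gets a large positive value, and one resembling the negative names gets a negative value.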
Optionally, in the domain name processing method provided in this embodiment of the present application, determining the vector feature of each domain name in the training set includes: respectively taking each character of each domain name in a training set as a training sample, and respectively taking k characters adjacent to each character as labels to form an extended training set, wherein each piece of training data in the extended training set is a combination formed by one sample and one label; inputting the extended training set into a preset fully-connected network for training, and acquiring hidden layer parameters of the trained preset fully-connected network; determining a target fully-connected network based on the trained hidden layer parameters of the preset fully-connected network; and inputting each character of the target domain name in the training set into the target full-connection network, training to obtain output, and determining the output as a feature vector of the target domain name.
Specifically, each character of each domain name is read. Each character is taken as a training sample, and the k characters adjacent to it are taken as labels, forming an extended training set. For example, for the domain name "abcd", with the two adjacent characters before and after each character taken as labels: taking "a" as the training sample gives the labels "b" and "c" and the training data "ab" and "ac"; taking "b" as the training sample gives the labels "a", "c", and "d" and the training data "ba", "bc", and "bd"; taking "c" as the training sample gives the labels "a", "b", and "d" and the training data "ca", "cb", and "cd"; and taking "d" as the training sample gives the labels "b" and "c" and the training data "db" and "dc".
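The construction of the extended training set can be sketched as below, assuming that "adjacent" means up to k characters on each side of the sample character, which is the reading that reproduces the "abcd" example with k = 2. The function name `context_pairs` is hypothetical.

```python
from typing import List, Tuple


def context_pairs(domain: str, k: int = 2) -> List[Tuple[str, str]]:
    """Build (sample, label) pairs from each character and its neighbours within k positions."""
    pairs: List[Tuple[str, str]] = []
    for i, ch in enumerate(domain):
        lo, hi = max(0, i - k), min(len(domain), i + k + 1)
        # Every neighbour inside the window except the sample itself becomes a label.
        pairs.extend((ch, domain[j]) for j in range(lo, hi) if j != i)
    return pairs
```

Each pair corresponds to one piece of training data, so `("a", "b")` is the item written as "ab" in the text.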
And further, training the extended training set as the input of a three-layer fully-connected network with an input layer, a hidden layer and an output layer to finally obtain hidden layer parameters, and constructing a target fully-connected network according to the hidden layer parameters. And for each domain name, taking each character as input, obtaining output through a target full-connection network, taking the output as a feature vector of the domain name, and finally obtaining the vector features of all the domain names.
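A three-layer network of this kind resembles the skip-gram architecture used for word embeddings, applied here to characters. The pure-Python sketch below is an illustration under several assumptions not stated in the source: one-hot character inputs, a linear hidden layer without bias, a softmax output trained with cross-entropy loss and plain stochastic gradient descent, and zero vectors for characters never seen in training. All names and hyperparameters are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def train_char_embeddings(pairs: Sequence[Tuple[str, str]], dim: int = 8,
                          epochs: int = 50, lr: float = 0.05,
                          seed: int = 0) -> Dict[str, List[float]]:
    """Train input -> hidden -> output weights and return the hidden-layer row per character."""
    rng = random.Random(seed)
    vocab = sorted({c for pair in pairs for c in pair})
    index = {c: i for i, c in enumerate(vocab)}
    v = len(vocab)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(v)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(v)] for _ in range(dim)]
    for _ in range(epochs):
        for sample, label in pairs:
            s, t = index[sample], index[label]
            hidden = w_in[s][:]  # a one-hot input selects one row of w_in
            scores = [sum(hidden[d] * w_out[d][j] for d in range(dim)) for j in range(v)]
            peak = max(scores)  # subtract the maximum for a numerically stable softmax
            exps = [math.exp(x - peak) for x in scores]
            total = sum(exps)
            # Gradient of cross-entropy with respect to the scores: softmax minus one-hot label.
            err = [exps[j] / total - (1.0 if j == t else 0.0) for j in range(v)]
            for d in range(dim):
                grad_hidden = sum(err[j] * w_out[d][j] for j in range(v))
                for j in range(v):
                    w_out[d][j] -= lr * err[j] * hidden[d]
                w_in[s][d] -= lr * grad_hidden
    return {c: w_in[index[c]] for c in vocab}


def vector_feature(domain: str, embeddings: Dict[str, List[float]]) -> List[List[float]]:
    """Stack the per-character vectors of a domain name into a matrix (list of rows)."""
    dim = len(next(iter(embeddings.values())))
    return [embeddings.get(ch, [0.0] * dim) for ch in domain]
```

With a linear hidden layer, feeding a character through the trained hidden layer is the same as looking up its row of hidden-layer parameters, which is why only those parameters need to be kept after training.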
Optionally, in the domain name processing method provided in this embodiment of the present application, determining the word frequency feature of the domain name to be detected includes: multiplying the number of occurrences of the domain name to be detected by the number of segmentation blocks of the positive domain names to obtain the positive domain name word frequency feature of the domain name to be detected; multiplying the number of occurrences of the domain name to be detected by the number of segmentation blocks of the negative domain names to obtain the negative domain name word frequency feature of the domain name to be detected; and calculating the difference between the positive domain name word frequency feature and the negative domain name word frequency feature of the domain name to be detected to obtain the word frequency feature of the domain name to be detected.
Specifically, all domain names in the training set are segmented to form a dictionary, and the number of times each segmentation block of each domain name appears in the dictionary is counted. The counts are summed separately for the positive and negative domain names to obtain the number of segmentation blocks of the positive domain names and the number of segmentation blocks of the negative domain names. Each of these is multiplied by the number of occurrences of the domain name to be detected to obtain the positive domain name word frequency feature and the negative domain name word frequency feature of the domain name to be detected, and the difference between the two is then calculated to obtain the word frequency feature of the domain name to be detected.
Optionally, in the domain name processing method provided in this embodiment of the present application, determining the vector feature of the domain name to be detected includes: and inputting each character of the domain name to be detected into a target full-connection network, and determining the output of the target full-connection network as a feature vector of the domain name to be detected.
Specifically, the hidden layer parameters stored during vector feature training are read in, and the target fully-connected network is determined according to the hidden layer parameters. The domain name to be detected is divided into characters, each character is used as an input to the target fully-connected network to obtain an output vector, and the matrix formed by the output vectors of all the characters is the vector feature of the domain name.
In order to enable the target model to maintain high accuracy, optionally, in the domain name processing method provided in the embodiment of the present application, after determining whether the domain name to be detected is a malicious domain name according to at least one processing result, the method further includes: adding the domain name to be detected to the training set to obtain an updated training set; taking the result of determining whether the domain name to be detected is a malicious domain name as the external feature of the domain name to be detected; adding the inherent features, word frequency features, vector features, and external features of the domain name to be detected to the domain name features in the training set to obtain updated domain name features in the training set; and training at least one machine learning model on the domain name features in the updated training set to obtain at least one updated target model.
It should be noted that, in the embodiment of the present application, third-party data sources are dynamically associated: the detection result that a third-party data source returns for a specific domain name is set as a separate feature and is merged into each machine learning algorithm. As a result, the training data always contains the most suspicious malicious domain names, the knowledge of the third-party external system is continuously merged into the target model, the external feature information is dynamically expanded, and the target model always maintains high accuracy.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from the one presented here.
The embodiment of the present application further provides a domain name processing apparatus, and it should be noted that the domain name processing apparatus according to the embodiment of the present application may be used to execute the method for processing a domain name provided in the embodiment of the present application. The following describes a domain name processing apparatus provided in an embodiment of the present application.
Fig. 2 is a schematic diagram of a domain name processing apparatus according to an embodiment of the present application. As shown in fig. 2, the apparatus includes: a first determination unit 10, a second determination unit 20, a third determination unit 30, a processing unit 40 and a fourth determination unit 50.
Specifically, the first determining unit 10 is configured to determine an inherent characteristic of the domain name to be detected, where the inherent characteristic includes at least one of the following: the information entropy of the domain name, the length of the domain name, the occupation ratio of vowels in the domain name, the occupation ratio of numeric characters in the domain name and whether a domain name suffix is a preset suffix.
The second determining unit 20 is configured to determine a word frequency feature of the domain name to be detected, where the word frequency feature is used to represent similarity between the domain name and an overall positive domain name and an overall negative domain name in the training set.
A third determining unit 30, configured to determine a vector feature of the domain name to be detected, where the vector feature is used to characterize similarity between the relevance of each character of a domain name and the relevance of each character of the domain names in the training set.
The processing unit 40 is configured to process the intrinsic features, the word frequency features, the vector features, and the domain features in the training set of the domain name to be detected according to at least one target model to obtain at least one processing result, where the target model is obtained by training the domain features in the training set, the domain features in the training set include the intrinsic features, the word frequency features, the vector features, and the external features of a preset domain name of each domain name in the training set, and the external features are a determination result for determining a domain name as a positive domain name or a negative domain name by an external system.
A fourth determining unit 50, configured to determine whether the domain name to be detected is a malicious domain name according to at least one processing result.
In the domain name processing apparatus provided in the embodiment of the present application, the first determining unit 10 determines the inherent features of the domain name to be detected, where the inherent features include at least one of the following: the information entropy of the domain name, the length of the domain name, the ratio of vowels in the domain name, the ratio of numeric characters in the domain name, and whether the domain name suffix is a preset suffix; the second determining unit 20 determines the word frequency features of the domain name to be detected, where the word frequency features represent the character-level similarity of a domain name to the positive domain names as a whole and to the negative domain names as a whole in the training set; the third determining unit 30 determines the vector features of the domain name to be detected, where the vector features represent the similarity between the relevance among the characters of a domain name and the relevance among the characters of the domain names in the training set; the processing unit 40 processes the inherent features, the word frequency features, and the vector features of the domain name to be detected, together with the domain name features in the training set, according to at least one target model to obtain at least one processing result, where the target model is obtained by training on the domain name features in the training set, the domain name features in the training set include the inherent features, word frequency features, and vector features of each domain name in the training set and the external features of preset domain names, and an external feature is the result of an external system judging a domain name to be a positive domain name or a negative domain name; and the fourth determining unit 50 determines whether the domain name to be detected is a malicious domain name according to the at least one processing result. This solves the problem in the related art that the result of judging whether a domain name to be detected is a malicious domain name is inaccurate. By processing the features of the domain name to be detected and the domain name features in the training set according to at least one target model, and determining from the processing result whether the domain name to be detected is a malicious domain name, the accuracy of that judgment is improved.
Optionally, in the domain name processing apparatus provided in this embodiment of the present application, the apparatus further includes: an acquisition unit, configured to acquire the external features of the domain name to be detected before the inherent features, the word frequency features, and the vector features of the domain name to be detected, together with the domain name features in the training set, are processed according to at least one target model to obtain at least one processing result. The processing unit is configured to process the inherent features, the word frequency features, the vector features, and the external features of the domain name to be detected, together with the domain name features in the training set, according to the at least one target model to obtain at least one processing result.
Optionally, in the domain name processing apparatus provided in this embodiment of the present application, the fourth determining unit 50 includes: a first determining module, configured to determine, among the at least one processing result, the number of processing results indicating that the domain name to be detected is a positive domain name, to obtain a first number; a second determining module, configured to determine, among the at least one processing result, the number of processing results indicating that the domain name to be detected is a negative domain name, to obtain a second number; and a third determining module, configured to compare the first number with the second number and determine whether the domain name to be detected is a malicious domain name according to the comparison result.
Optionally, in the domain name processing apparatus provided in this embodiment of the present application, the apparatus further includes: an acquisition unit, configured to collect a plurality of positive domain names and a plurality of negative domain names, and determine the training set according to the positive domain names and the negative domain names, before the inherent features, word frequency features, and vector features of the domain name to be detected, together with the domain name features in the training set, are processed according to at least one target model to obtain at least one processing result; a fifth determining unit, configured to determine the inherent features, word frequency features, and vector features of each domain name in the training set, and acquire the external features of preset domain names in the training set; a sixth determining unit, configured to determine the inherent features, word frequency features, and vector features of each domain name in the training set and the external features of the preset domain names as the domain name features in the training set; and a first training unit, configured to train at least one machine learning model on the domain name features in the training set to obtain at least one target model.
Optionally, in the domain name processing apparatus provided in this embodiment of the present application, the apparatus further includes: the first adding unit is used for adding the domain name to be detected into the training set after determining whether the domain name to be detected is a malicious domain name according to at least one processing result, so as to obtain an updated training set; a seventh determining unit, configured to use a determination result of whether the domain name to be detected is a malicious domain name as an external feature of the domain name to be detected; the second adding unit is used for adding the inherent characteristics, the word frequency characteristics, the vector characteristics and the external characteristics of the domain name to be detected to the domain name characteristics in the training set to obtain updated domain name characteristics in the training set; and the second training unit is used for training the domain name characteristics in the updated training set by adopting at least one machine learning model to obtain at least one updated target model.
Optionally, in the domain name processing apparatus provided in the embodiment of the present application, the acquisition unit includes: a first acquisition module, configured to collect a preset number of domain names from the positive domain name source in the database to obtain a plurality of positive domain names, where the domain names in the positive domain name source are stored in a preset order.
Optionally, in the domain name processing apparatus provided in the embodiment of the present application, the acquisition unit further includes: the fourth determining module is used for determining the acquisition interval according to the number of domain names contained in the negative domain name source in the database and the number of negative domain names to be acquired; and the second acquisition module is used for acquiring a preset number of domain names from the negative domain name sources in the database according to the acquisition interval to obtain a plurality of negative domain names, wherein the domain names in the negative domain name sources are stored according to a preset sequence.
Optionally, in the domain name processing apparatus provided in this embodiment of the present application, the fifth determining unit includes: a segmentation module, configured to segment each domain name in the training set according to a preset sliding window to obtain a plurality of segmentation blocks, and create a dictionary from the segmentation blocks; a first statistical module, configured to count the number of times each segmentation block of all the positive domain names appears in the dictionary, and sum the counts to obtain the number of segmentation blocks of the positive domain names; a second statistical module, configured to count the number of times each segmentation block of all the negative domain names appears in the dictionary, and sum the counts to obtain the number of segmentation blocks of the negative domain names; a first calculation module, configured to multiply the number of occurrences of the target domain name in the training set by the number of segmentation blocks of the positive domain names to obtain the positive domain name word frequency feature of the target domain name; a second calculation module, configured to multiply the number of occurrences of the target domain name in the training set by the number of segmentation blocks of the negative domain names to obtain the negative domain name word frequency feature of the target domain name; and a third calculation module, configured to calculate the difference between the positive domain name word frequency feature and the negative domain name word frequency feature of the target domain name to obtain the word frequency feature of the target domain name.
Optionally, in the domain name processing apparatus provided in this embodiment of the present application, the fifth determining unit further includes: a building module, configured to form an extended training set by taking each character of each domain name in the training set as a training sample and taking each of the k characters adjacent to that character as a label, wherein each piece of training data in the extended training set is a combination of one sample and one label; a training module, configured to input the extended training set into a preset fully-connected network for training and to acquire the hidden layer parameters of the trained preset fully-connected network; a fifth determining module, configured to determine a target fully-connected network based on the hidden layer parameters of the trained preset fully-connected network; and a sixth determining module, configured to input each character of the target domain name in the training set into the target fully-connected network to obtain an output, and to determine the output as the feature vector of the target domain name.
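The construction of the extended training set can be sketched as follows. The text does not say whether the k adjacent characters lie on one side or on both sides of the sample character; the sketch assumes up to k characters on each side, which resembles how skip-gram style embeddings pair a center token with its context. The function name `expand_training_set` is hypothetical.

```python
def expand_training_set(domains, k=2):
    """Return (sample character, label character) pairs for every domain."""
    pairs = []
    for domain in domains:
        for i, ch in enumerate(domain):
            # Assumption: "k adjacent characters" means up to k on each side.
            lo, hi = max(0, i - k), min(len(domain), i + k + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((ch, domain[j]))
    return pairs


pairs = expand_training_set(["abc"], k=1)
```

Each pair would then serve as one piece of training data for the fully-connected network, whose hidden layer parameters are kept after training.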
Optionally, in the domain name processing apparatus provided in this embodiment of the present application, the second determining unit 20 includes: a fourth calculation module, configured to multiply the occurrence frequency of the domain name to be detected by the number of segmentation blocks of the positive domain names to obtain the positive domain name word frequency feature of the domain name to be detected; a fifth calculation module, configured to multiply the occurrence frequency of the domain name to be detected by the number of segmentation blocks of the negative domain names to obtain the negative domain name word frequency feature of the domain name to be detected; and a sixth calculation module, configured to calculate the difference between the positive domain name word frequency feature and the negative domain name word frequency feature of the domain name to be detected to obtain the word frequency feature of the domain name to be detected.
Optionally, in the domain name processing apparatus provided in this embodiment of the present application, the third determining unit 30 includes: a seventh determining module, configured to input each character of the domain name to be detected into the target fully-connected network and to determine the output of the target fully-connected network as the feature vector of the domain name to be detected.
The domain name processing device comprises a processor and a memory, wherein the first determining unit 10, the second determining unit 20, the third determining unit 30, the processing unit 40, the fourth determining unit 50 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor includes a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the kernel parameters are adjusted to address the problem in the related art that the determination of whether a domain name to be detected is a malicious domain name is inaccurate.
The memory may include computer-readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the domain name processing method when executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the domain name processing method is executed when the program runs.
An embodiment of the present invention provides a device, which comprises a processor, a memory, and a program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the program: determining the inherent features of the domain name to be detected, wherein the inherent features comprise at least one of the following: the information entropy of the domain name, the length of the domain name, the ratio of vowels in the domain name, the ratio of numeric characters in the domain name, and whether the domain name suffix is a preset suffix; determining the word frequency features of the domain name to be detected, wherein the word frequency features are used for representing the character-level similarity of a domain name to the positive domain names as a whole and to the negative domain names as a whole in the training set; determining the vector features of the domain name to be detected, wherein the vector features are used for representing the similarity between the correlation among the characters of a domain name and the correlation among the characters of the domain names in the training set; processing the inherent features, the word frequency features and the vector features of the domain name to be detected, together with the domain name features in the training set, according to at least one target model to obtain at least one processing result, wherein the target model is obtained by training on the domain name features in the training set, the domain name features in the training set comprise the inherent features, the word frequency features and the vector features of each domain name in the training set and the external features of preset domain names in the training set, and the external features are the judgment results of an external system as to whether a domain name is a positive domain name or a negative domain name; and determining whether the domain name to be detected is a malicious domain name according to the at least one processing result. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
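The inherent features listed in the first step can be computed directly from the domain name string. The sketch below is illustrative only: the suffix list is hypothetical (the text says only "a preset suffix"), and computing the ratios over the whole lower-cased name, including dots and the suffix, is an assumption.

```python
import math
from collections import Counter

VOWELS = set("aeiou")
# Hypothetical list; the source only refers to "a preset suffix".
PRESET_SUFFIXES = (".com", ".net", ".org")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(text).values())


def intrinsic_features(domain):
    """Return the five inherent features for one domain name."""
    name = domain.lower()
    length = len(name)
    return {
        "entropy": shannon_entropy(name),
        "length": length,
        "vowel_ratio": sum(c in VOWELS for c in name) / length if length else 0.0,
        "digit_ratio": sum(c.isdigit() for c in name) / length if length else 0.0,
        "preset_suffix": name.endswith(PRESET_SUFFIXES),
    }


features = intrinsic_features("ab12.com")
```

Algorithmically generated names tend to show higher entropy and a lower vowel ratio than human-chosen names, which is presumably why these quantities are included.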
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: determining the inherent features of the domain name to be detected, wherein the inherent features comprise at least one of the following: the information entropy of the domain name, the length of the domain name, the ratio of vowels in the domain name, the ratio of numeric characters in the domain name, and whether the domain name suffix is a preset suffix; determining the word frequency features of the domain name to be detected, wherein the word frequency features are used for representing the character-level similarity of a domain name to the positive domain names as a whole and to the negative domain names as a whole in the training set; determining the vector features of the domain name to be detected, wherein the vector features are used for representing the similarity between the correlation among the characters of a domain name and the correlation among the characters of the domain names in the training set; processing the inherent features, the word frequency features and the vector features of the domain name to be detected, together with the domain name features in the training set, according to at least one target model to obtain at least one processing result, wherein the target model is obtained by training on the domain name features in the training set, the domain name features in the training set comprise the inherent features, the word frequency features and the vector features of each domain name in the training set and the external features of preset domain names in the training set, and the external features are the judgment results of an external system as to whether a domain name is a positive domain name or a negative domain name; and determining whether the domain name to be detected is a malicious domain name according to the at least one processing result.
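The final step, deciding from the at least one processing result, is specified in claim 2 as a comparison between the number of results indicating a positive domain name and the number indicating a negative domain name. A minimal sketch of such a vote is given below. It assumes that a negative domain name corresponds to a malicious one and that a tie is treated as malicious; neither the tie rule nor the label strings come from the source.

```python
def is_malicious(results):
    """Compare counts of 'positive' and 'negative' results from the models."""
    results = list(results)
    first = sum(1 for r in results if r == "positive")   # first number
    second = sum(1 for r in results if r == "negative")  # second number
    # Assumptions: "negative" means malicious, and ties count as malicious.
    return second >= first


# One label per target model (hypothetical outputs).
verdict = is_malicious(["positive", "negative", "negative"])
```

Using several target models and comparing counts means a single model's error does not decide the outcome on its own.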
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include computer-readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (13)

1. A domain name processing method, comprising:
determining the inherent features of the domain name to be detected, wherein the inherent features comprise at least one of the following: the information entropy of the domain name, the length of the domain name, the ratio of vowels in the domain name, the ratio of numeric characters in the domain name, and whether the domain name suffix is a preset suffix;
determining the word frequency features of the domain name to be detected, wherein the word frequency features are used for representing the character-level similarity of a domain name to the positive domain names as a whole and to the negative domain names as a whole in a training set;
determining the vector features of the domain name to be detected, wherein the vector features are used for representing the similarity between the correlation among the characters of a domain name and the correlation among the characters of the domain names in the training set;
acquiring the external features of the domain name to be detected, wherein the external features are the judgment result of an external system as to whether a domain name is a positive domain name or a negative domain name, and the external system is any method for judging whether a domain name is a malicious domain name;
processing the inherent features, the word frequency features and the vector features of the domain name to be detected, together with the domain name features in the training set, according to at least one target model to obtain at least one processing result, which comprises: processing the inherent features, the word frequency features, the vector features and the external features of the domain name to be detected, together with the domain name features in the training set, according to the at least one target model to obtain the at least one processing result, wherein the target model is obtained by training on the domain name features in the training set, and the domain name features in the training set comprise the inherent features, the word frequency features and the vector features of each domain name in the training set and the external features of preset domain names in the training set;
and determining whether the domain name to be detected is a malicious domain name according to the at least one processing result.
2. The method of claim 1, wherein determining whether the domain name to be detected is a malicious domain name according to the at least one processing result comprises:
determining the number of processing results, among the at least one processing result, indicating that the domain name to be detected is a positive domain name, to obtain a first number;
determining the number of processing results, among the at least one processing result, indicating that the domain name to be detected is a negative domain name, to obtain a second number;
and comparing the first number with the second number, and determining whether the domain name to be detected is the malicious domain name according to the comparison result.
3. The method according to claim 1, wherein before processing the inherent features, the word frequency features and the vector features of the domain name to be detected, together with the domain name features in the training set, according to at least one target model to obtain at least one processing result, the method further comprises:
collecting a plurality of positive domain names and a plurality of negative domain names, and determining the training set according to the positive domain names and the negative domain names;
determining the inherent features, the word frequency features and the vector features of each domain name in the training set, and acquiring the external features of the preset domain names in the training set;
determining the inherent features, the word frequency features and the vector features of the domain names in the training set and the external features of the preset domain names as the domain name features in the training set;
and training on the domain name features in the training set by adopting at least one machine learning model to obtain the at least one target model.
4. The method of claim 3, wherein after determining whether the domain name to be detected is a malicious domain name according to the at least one processing result, the method further comprises:
adding the domain name to be detected to the training set to obtain an updated training set;
taking the determination result of whether the domain name to be detected is a malicious domain name as the external feature of the domain name to be detected;
adding the inherent features, the word frequency features, the vector features and the external features of the domain name to be detected to the domain name features in the training set to obtain the domain name features in the updated training set;
and training on the domain name features in the updated training set by adopting the at least one machine learning model to obtain the at least one updated target model.
5. The method of claim 3, wherein collecting the plurality of positive domain names comprises:
acquiring a preset number of domain names from a positive domain name source in a database to obtain the plurality of positive domain names, wherein the domain names in the positive domain name source are stored in a preset order.
6. The method of claim 3, wherein collecting the plurality of negative domain names comprises:
determining an acquisition interval according to the number of domain names contained in a negative domain name source in a database and the number of negative domain names to be acquired;
and acquiring a preset number of domain names from a negative domain name source in a database according to the acquisition interval to obtain the negative domain names, wherein the domain names in the negative domain name source are stored according to a preset sequence.
7. The method of claim 3, wherein determining the word frequency characteristic for each domain name in the training set comprises:
segmenting each domain name in the training set according to a preset sliding window to obtain a plurality of segmentation blocks, and creating a dictionary according to the segmentation blocks;
counting the number of times each segmentation block of all the positive domain names appears in the dictionary, and summing the counts to obtain the number of segmentation blocks of the positive domain names;
counting the number of times each segmentation block of all the negative domain names appears in the dictionary, and summing the counts to obtain the number of segmentation blocks of the negative domain names;
multiplying the occurrence frequency of the target domain name in the training set by the number of segmentation blocks of the positive domain names to obtain the positive domain name word frequency feature of the target domain name;
multiplying the occurrence frequency of the target domain name in the training set by the number of the segmentation blocks of the negative domain name to obtain the negative domain name word frequency feature of the target domain name;
and calculating the difference value of the positive domain name word frequency characteristic and the negative domain name word frequency characteristic of the target domain name to obtain the word frequency characteristic of the target domain name.
8. The method of claim 3, wherein determining the vector features for each domain name in the training set comprises:
respectively taking each character of each domain name in the training set as a training sample, and respectively taking k characters adjacent to each character as a label to form an extended training set, wherein each piece of training data in the extended training set is a combination formed by one sample and one label;
inputting the extended training set into a preset fully-connected network for training, and acquiring hidden layer parameters of the trained preset fully-connected network;
determining a target fully-connected network based on the hidden layer parameters of the trained preset fully-connected network;
and inputting each character of the target domain name in the training set into the target fully-connected network to obtain an output, and determining the output as the feature vector of the target domain name.
9. The method of claim 7, wherein determining the word frequency features of the domain name to be detected comprises:
multiplying the number of occurrences of the domain name to be detected by the number of segmentation blocks of the positive domain names to obtain the positive domain name word frequency feature of the domain name to be detected;
multiplying the number of occurrences of the domain name to be detected by the number of segmentation blocks of the negative domain names to obtain the negative domain name word frequency feature of the domain name to be detected;
and calculating the difference between the positive domain name word frequency feature and the negative domain name word frequency feature of the domain name to be detected to obtain the word frequency feature of the domain name to be detected.
10. The method of claim 8, wherein determining the vector features of the domain name to be detected comprises:
inputting each character of the domain name to be detected into the target fully-connected network, and determining the output of the target fully-connected network as the feature vector of the domain name to be detected.
11. A domain name processing apparatus, comprising:
a first determining unit, configured to determine the inherent features of the domain name to be detected, wherein the inherent features comprise at least one of the following: the information entropy of the domain name, the length of the domain name, the ratio of vowels in the domain name, the ratio of numeric characters in the domain name, and whether the domain name suffix is a preset suffix;
the second determining unit is used for determining the word frequency characteristics of the domain name to be detected, wherein the word frequency characteristics are used for representing the similarity of the domain name and the overall positive domain name and the overall negative domain name in the training set on characters;
a third determining unit, configured to determine a vector feature of the domain name to be detected, where the vector feature is used to characterize a similarity between a correlation of each character of a domain name and a correlation of each character of the domain names in the training set;
a processing unit, configured to process the inherent features, the word frequency features and the vector features of the domain name to be detected, together with the domain name features in the training set, according to at least one target model to obtain at least one processing result, wherein the target model is obtained by training on the domain name features in the training set, the domain name features in the training set comprise the inherent features, the word frequency features and the vector features of each domain name in the training set and the external features of preset domain names in the training set, the external features are the judgment result of an external system as to whether a domain name is a positive domain name or a negative domain name, and the external system is any method for judging whether a domain name is a malicious domain name;
an acquisition unit, configured to acquire the external features of the domain name to be detected before the inherent features, the word frequency features and the vector features of the domain name to be detected, together with the domain name features in the training set, are processed according to the at least one target model to obtain the at least one processing result; wherein the processing unit is further configured to process the inherent features, the word frequency features, the vector features and the external features of the domain name to be detected, together with the domain name features in the training set, according to the at least one target model to obtain the at least one processing result;
a fourth determining unit, configured to determine whether the domain name to be detected is a malicious domain name according to the at least one processing result.
12. A storage medium characterized by comprising a stored program, wherein the program executes the domain name processing method according to any one of claims 1 to 10.
13. A processor, characterized in that the processor is configured to execute a program, wherein the program executes the domain name processing method according to any one of claims 1 to 10.
CN202010339989.4A 2020-04-26 2020-04-26 Domain name processing method, device, storage medium and processor Active CN111556050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010339989.4A CN111556050B (en) 2020-04-26 2020-04-26 Domain name processing method, device, storage medium and processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010339989.4A CN111556050B (en) 2020-04-26 2020-04-26 Domain name processing method, device, storage medium and processor

Publications (2)

Publication Number Publication Date
CN111556050A CN111556050A (en) 2020-08-18
CN111556050B true CN111556050B (en) 2022-06-07

Family

ID=72003089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010339989.4A Active CN111556050B (en) 2020-04-26 2020-04-26 Domain name processing method, device, storage medium and processor

Country Status (1)

Country Link
CN (1) CN111556050B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105577660A (en) * 2015-12-22 2016-05-11 国家电网公司 DGA domain name detection method based on random forest
CN108600200A (en) * 2018-04-08 2018-09-28 腾讯科技(深圳)有限公司 Domain name detection method, device, computer equipment and storage medium
CN109462578A (en) * 2018-10-22 2019-03-12 南开大学 Threat intelligence use and propagation method based on statistical learning
CN109688110A (en) * 2018-11-22 2019-04-26 顺丰科技有限公司 DGA domain name detection model construction method, device, server and storage medium
CN110768929A (en) * 2018-07-26 2020-02-07 中国电信股份有限公司 Domain name detection method and device and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103957284B (en) * 2014-04-04 2015-09-09 北京奇虎科技有限公司 The processing method of DNS behavior, Apparatus and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant